"Rich Tannins."
"Peppery finish."
"Afternotes of loamy soil."
Who writes wine descriptions, anyway? Wine reviews are practically a genre of their own, with a specific vocabulary and set of phrases that I basically never see in any other context.
In this project we will make a very simple model that randomly generates new wine reviews. I will walk through each step in designing and implementing the model!
The model we will be using is a very simple Markov chain model. The model consists of a single rule that generates the next word given the preceding word as input. We simply look through a dataset of real wine reviews, find all occurrences of the preceding word, randomly pick one of them, and use whatever word followed it in that context.
Here's the algorithm for generating the n-th word $w_n$ given the preceding word $w_{n-1}$ and a dataset $D$:
Algorithm $g(w_n | w_{n-1}, D)$:
1. Find all occurrences of $w_{n-1}$ in $D$.
2. Pick one of these occurrences uniformly at random.
3. Return the word that follows it as $w_n$.
Because each word is generated using only the previous word, it is conditionally independent of all the earlier words in the description so far. In other words, $P(w_n | w_{n-1}) = P(w_n | w_{n-1}, w_{n-2}, \dots, w_{1})$. This means that our model is a Markov process.
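Put differently, this assumption means the probability of a whole description factorizes into a product of simple word-pair terms, $P(w_1, w_2, \dots, w_N) = P(w_1) \prod_{n=2}^{N} P(w_n | w_{n-1})$, which is exactly what our single generation rule models.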
Of course this is probably not going to be a great model, since it does not consider any context besides the immediately preceding word. But it still gives surprisingly good results!
Now let's take a look at implementing this model.
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import spacy
Luckily, someone has already gone through the effort of creating a dataset of more than 280,000 real wine descriptions! These were scraped from Wine Enthusiast and the dataset is hosted on Kaggle. The data have been downloaded and placed in the ./data folder. The data are split into two files.
# first load data
data1 = pd.read_csv('./data/winemag-data-130k-v2.csv')
data2 = pd.read_csv('./data/winemag-data_first150k.csv')
print(data1.shape)
print(data2.shape)
Let's take a quick look at the datasets:
data1.head(1)
data2.head(1)
For this model, we are only interested in the descriptions, so let's pull those out and combine all the descriptions from both files:
descriptions = list(data1["description"].values) + list(data2["description"].values)
# strip any leading or trailing whitespace
descriptions = [string.strip() for string in descriptions]
print("Total number of descriptions: ", len(descriptions))
Let's take a look at a few examples:
for item in np.random.choice(descriptions, size = 3): print(item, "\n")
Now we need to process the data to get ready for our model. But what is the best way to do this?
First we need to choose the data structure we will use. At its heart, our model relies on consecutive word pairs. So we could parse our dataset into a list of all word pairs, and then generate by filtering the list and randomly choosing.
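Just to make that alternative concrete, here is a minimal sketch of the list-of-pairs approach on a toy dataset (the pairs below are made up, not taken from the real data):

import random

# naive approach: store every consecutive word pair from the corpus...
toy_pairs = [("rich", "tannins"), ("rich", "fruit"), ("peppery", "finish"), ("rich", "tannins")]

# ...then generate by filtering on the preceding word and choosing uniformly at random
candidates = [second for first, second in toy_pairs if first == "rich"]
print(random.choice(candidates))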
However, we know that many word pairs will appear quite frequently! If we just parse into a list of all word pairs, we might have 100 identical entries for "rich tannins." Instead, we can count how many times each word pair occurs and keep track of those counts. When it comes time to sample the next word, we simply use probabilities proportional to the counts instead of sampling uniformly! This lets us generate words without having to scan the full set of word pairs in the dataset every time.
In Python, we will implement this as a dictionary where each key is a token; I'll call this our vocabulary. The corresponding values are themselves dictionaries containing the counts of all the tokens that followed.
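As a toy illustration of that structure (the words and counts here are made up, not taken from the dataset):

from collections import Counter, defaultdict

# outer dict: each key is a token in the vocabulary
# inner Counter: how many times each token followed it
toy_freq = defaultdict(Counter)
toy_freq["rich"].update({"tannins": 3, "fruit": 1})
toy_freq["peppery"].update({"finish": 2})

print(toy_freq["rich"])            # Counter({'tannins': 3, 'fruit': 1})
print(toy_freq["rich"]["tannins"]) # 3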
Each description in the dataset is a single string. We need to divide the descriptions into their individual words so we can count the word pairs. This process is called tokenization: dividing the input into a sequence of tokens.
Rather than doing this from scratch, we will use a pre-made tokenizer from spaCy. The advantage is that the pre-made tokenizer is smart enough to handle things like punctuation.
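For example (assuming the en_core_web_sm model is installed), spaCy splits punctuation into its own tokens rather than leaving it glued to words:

import spacy

nlp = spacy.load("en_core_web_sm")

# the comma and period become separate tokens
doc = nlp("Rich tannins, with a peppery finish.")
print([token.text for token in doc])
# ['Rich', 'tannins', ',', 'with', 'a', 'peppery', 'finish', '.']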
%%time
# use pre-made tokenizer from spacy
nlp = spacy.load("en_core_web_sm")
# a dictionary will be used to hold the vocabulary
# each item in the vocabulary will have a counter to track which words follow it
pair_freq = defaultdict(Counter)
# make a special end of sentence token
end_token = "END_TOKEN"
# process all the descriptions
# disabling unneeded components in the pipeline to speed it up
for description in nlp.pipe(descriptions, disable=["tagger", "parser", "ner"]):
    # for each token, update the counts of the following word
    for token in description:
        # get the following token; nbor() raises IndexError at the last token of the doc
        try:
            neighbor = token.nbor().text
        except IndexError:
            neighbor = end_token
        pair_freq[token.text][neighbor] += 1
vocab = list(pair_freq.keys())
print("Total number of words:", len(vocab))
Our vocabulary consists of more than 45,000 unique words!
Let's look at some random examples of word pairs:
for token1 in np.random.choice(vocab, size = 10):
    all_following = list(pair_freq[token1].keys())
    token2 = np.random.choice(all_following)
    print(token1, token2)
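We can also ask which words most often follow a particular token, using the Counter's most_common method (the exact counts will depend on the dataset, and this assumes "rich" actually appears in the vocabulary):

# five most frequent words observed after "rich"
print(pair_freq["rich"].most_common(5))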
First, we implement our function to generate the next word. Because we preprocessed the data in a smart way, this is actually very simple!
# functions to generate text
def gen_next_word(word):
    """Generate the next word given the preceding word"""
    # Get the counter for the following words
    all_following = pair_freq[word]
    # Get the words themselves, and corresponding counts
    following_words = list(all_following.keys())
    counts = np.array(list(all_following.values()))
    # Randomly sample the next word
    weights = counts / np.sum(counts)
    return np.random.choice(following_words, p = weights)
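A quick sanity check of the sampler: calling it repeatedly with the same preceding word should return different but plausible continuations, since the choice is random ("tannins" here is just an example word, and your outputs will vary):

# sample a few possible continuations of "tannins"
for _ in range(3):
    print(gen_next_word("tannins"))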
Now to generate a description from scratch, we just use a loop to repeatedly generate the next word! The loop stops when we either hit the special end-of-sentence token or reach a maximum description length.
def generate_description(prompt):
    """Generate a wine description given a prompt"""
    prompt_doc = nlp(prompt)
    # set up the while loop
    current_text = prompt
    last_word = prompt_doc[-1].text
    not_end_token = True
    max_desc_length = 100
    c = 0
    while not_end_token and c < max_desc_length:
        next_word = gen_next_word(last_word)
        if next_word == end_token:
            not_end_token = False
        else:
            current_text += " " + next_word
            last_word = next_word
        c += 1
    return current_text
Now we can generate our own wine reviews! Let's look at a few examples:
generate_description("A fruity merlot, with a smoky")
generate_description("A full bodied cabernet")
generate_description("Spicy")
generate_description("This wine is terrible")
There we have it! A (very rudimentary) text generation model!
The descriptions certainly aren't great - I don't think any human would be fooled! However, given how rudimentary our model is, I think that the results are surprisingly good. The sentences are mostly coherent, and they also do quite well at capturing the vocabulary and phrases distinctive of the wine description genre! This shows how even the simplest model can "learn" features distinctive of the dataset it was trained on.
Of course the field of natural language processing has methods that are much better than Markov chains!! Recurrent neural networks, transformers, etc... Maybe we'll look at those in a future notebook.
In the meantime, enjoy this Markovian Sommelier!