"Rich Tannins."
"Peppery finish."
"Afternotes of loamy soil."
Who writes wine descriptions, anyway? Wine reviews are practically a genre of their own, with a specific vocabulary and set of phrases that I basically never see in any other context.
In this project we will make a very simple model that randomly generates new wine reviews. I will walk through each step in designing and implementing the model!
The model we will be using is a very simple Markov chain model. The model consists of a single rule that generates the next word given the preceding word as input. We simply look through a dataset of real wine reviews, find all occurrences of the preceding word, randomly pick one of them, and use whatever word followed it in that context.
Here's the algorithm for generating the n-th word $w_n$ given the preceding word $w_{n-1}$ and a dataset $D$:
Algorithm $g(w_n | w_{n-1}, D)$:
1. Find all occurrences of $w_{n-1}$ in $D$.
2. Pick one of these occurrences uniformly at random.
3. Return the word that follows it as $w_n$.
Because each word is generated using only the previous word, it is conditionally independent of all the earlier words in the description so far. In other words, $P(w_n | w_{n-1}) = P(w_n | w_{n-1}, w_{n-2}, \dots, w_{1})$. This means that our model is a Markov process.
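Put differently, this assumption means the probability of a whole description factorizes into a product of simple word-pair terms, $P(w_1, w_2, \dots, w_N) = P(w_1) \prod_{n=2}^{N} P(w_n | w_{n-1})$, which is exactly what our single generation rule models.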
Of course this is probably not going to be a great model, since it does not consider any context besides the immediately preceding word. But it still gives surprisingly good results!
Now let's take a look at implementing this model.
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import spacy
Luckily, someone has already gone through the effort of creating a dataset of more than 280,000 real wine descriptions! These were scraped from Wine Enthusiast and the dataset is hosted on Kaggle. The data have been downloaded and placed in the ./data folder. The data are split into two files.
# first load data
data1 = pd.read_csv('./data/winemag-data-130k-v2.csv')
data2 = pd.read_csv('./data/winemag-data_first150k.csv')
print(data1.shape)
print(data2.shape)
Let's take a quick look at the datasets:
data1.head(1)
data2.head(1)
For this model, we are only interested in the descriptions, so let's pull those out and combine all the descriptions from both files:
descriptions = list(data1["description"].values) + list(data2["description"].values)
# strip any leading or trailing whitespace
descriptions = [string.strip() for string in descriptions]
print("Total number of descriptions: ", len(descriptions))
Let's take a look at a few examples:
for item in np.random.choice(descriptions, size = 3): print(item, "\n")
Now we need to process the data to get ready for our model. But what is the best way to do this?
First we need to choose the data structure we will use. At its heart, our model relies on consecutive word pairs. So we could parse our dataset into a list of all word pairs, and then generate by filtering the list and randomly choosing.
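Just to make that alternative concrete, here is a minimal sketch of the list-of-pairs approach on a toy dataset (the pairs below are made up, not taken from the real data):

import random

# naive approach: store every consecutive word pair from the corpus...
toy_pairs = [("rich", "tannins"), ("rich", "fruit"), ("peppery", "finish"), ("rich", "tannins")]

# ...then generate by filtering on the preceding word and choosing uniformly at random
candidates = [second for first, second in toy_pairs if first == "rich"]
print(random.choice(candidates))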
However, we know that many word pairs will appear quite frequently! If we just parse into a list of all word pairs, we might have 100 identical entries for "rich tannins." Instead, we can count how many times each word pair occurs and keep track of those counts. When it comes time to sample the next word, we simply use probabilities proportional to the counts instead of sampling uniformly! This lets us generate words without having to scan the full set of word pairs in the dataset every time.
In Python, we will implement this as a dictionary where each key is a token; I'll call this our vocabulary. The corresponding values are themselves dictionaries containing the counts of all the tokens that followed.
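As a toy illustration of that structure (the words and counts here are made up, not taken from the dataset):

from collections import Counter, defaultdict

# outer dict: each key is a token in the vocabulary
# inner Counter: how many times each token followed it
toy_freq = defaultdict(Counter)
toy_freq["rich"].update({"tannins": 3, "fruit": 1})
toy_freq["peppery"].update({"finish": 2})

print(toy_freq["rich"])            # Counter({'tannins': 3, 'fruit': 1})
print(toy_freq["rich"]["tannins"]) # 3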
Each description in the dataset is a single string. We need to divide the descriptions into their individual words so we can count the word pairs. This process is called tokenization: dividing the input into a sequence of tokens.
Rather than doing this from scratch, we will use a pre-made tokenizer from spaCy. The advantage is that the pre-made tokenizer is smart enough to handle things like punctuation.
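For example (assuming the en_core_web_sm model is installed), spaCy splits punctuation into its own tokens rather than leaving it glued to words:

import spacy

nlp = spacy.load("en_core_web_sm")

# the comma and period become separate tokens
doc = nlp("Rich tannins, with a peppery finish.")
print([token.text for token in doc])
# ['Rich', 'tannins', ',', 'with', 'a', 'peppery', 'finish', '.']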
%%time
# use pre-made tokenizer from spacy
nlp = spacy.load("en_core_web_sm")
# a dictionary will be used to hold the vocabulary
# each item in the vocabulary will have a counter to track which words follow it
pair_freq = defaultdict(Counter)
# make a special end of sentence token
end_token = "END_TOKEN"
# process all the descriptions
# disabling unneeded components in the pipeline to speed it up
for description in nlp.pipe(descriptions, disable=["tagger", "parser", "ner"]):
    # for each token, update the counts of the following word
    for token in description:
        # get the following token; nbor() raises IndexError at the last token of the doc
        try:
            neighbor = token.nbor().text
        except IndexError:
            neighbor = end_token
        pair_freq[token.text][neighbor] += 1
vocab = list(pair_freq.keys())
print("Total number of words:", len(vocab))
Our vocabulary consists of more than 45,000 unique words!
Let's look at some random examples of word pairs:
for token1 in np.random.choice(vocab, size = 10):
    all_following = list(pair_freq[token1].keys())
    token2 = np.random.choice(all_following)
    print(token1, token2)
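We can also ask which words most often follow a particular token, using the Counter's most_common method (the exact counts will depend on the dataset, and this assumes "rich" actually appears in the vocabulary):

# five most frequent words observed after "rich"
print(pair_freq["rich"].most_common(5))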
First, we implement our function to generate the next word. Because we preprocessed the data in a smart way, this is actually very simple!
# functions to generate text
def gen_next_word(word):
    """Generate the next word given the preceding word"""
    # Get the counter for the following words
    all_following = pair_freq[word]
    # Get the words themselves, and corresponding counts
    following_words = list(all_following.keys())
    counts = np.array(list(all_following.values()))
    # Randomly sample the next word
    weights = counts / np.sum(counts)
    return np.random.choice(following_words, p = weights)
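A quick sanity check of the sampler: calling it repeatedly with the same preceding word should return different but plausible continuations, since the choice is random ("tannins" here is just an example word, and your outputs will vary):

# sample a few possible continuations of "tannins"
for _ in range(3):
    print(gen_next_word("tannins"))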
Now to generate a description from scratch, we just use a loop to repeatedly generate the next word! The loop stops when we either hit the special end-of-sentence token or reach a maximum description length.
def generate_description(prompt):
    """Generate a wine description given a prompt"""
    prompt_doc = nlp(prompt)
    # set up the while loop
    current_text = prompt
    last_word = prompt_doc[-1].text
    not_end_token = True
    max_desc_length = 100
    c = 0
    while not_end_token and c < max_desc_length:
        next_word = gen_next_word(last_word)
        if next_word == end_token:
            not_end_token = False
        else:
            current_text += " " + next_word
            last_word = next_word
        c += 1
    return current_text
Now we can generate our own wine reviews! Let's look at a few examples:
generate_description("A fruity merlot, with a smoky")
generate_description("A full bodied cabernet")
generate_description("Spicy")
generate_description("This wine is terrible")
There we have it! A (very rudimentary) text generation model!
The descriptions certainly aren't great - I don't think any human would be fooled! However, given how rudimentary our model is, I think that the results are surprisingly good. The sentences are mostly coherent, and they also do quite well at capturing the vocabulary and phrases distinctive of the wine description genre! This shows how even the simplest model can "learn" features distinctive of the dataset it was trained on.
Of course the field of natural language processing has methods that are much better than Markov chains!! Recurrent neural networks, transformers, etc... Maybe we'll look at those in a future notebook.
In the meantime, enjoy this Markovian Sommelier!