The Markovian Sommelier

Modeling wine reviews with a Markov chain of bigrams
Author

Jacob Rosenthal

Published

October 15, 2020

Robot Sommelier

“Rich Tannins.”
“Peppery finish.”
“Afternotes of loamy soil.”

Who writes wine descriptions, anyway? Wine reviews are practically a genre of their own, with a specific vocabulary and a set of phrases that I basically never see in any other context.

In this project we will build a very simple model that randomly generates new wine reviews. I will walk through each step of designing the model and implementing it!

Defining the model

The model we will be using is a very simple Markov chain. First, we model each wine review as a sequence of word pairs (i.e. bigrams). Then, we create new reviews by chaining together word pairs using a single rule that generates the next word given the preceding word as input: we look through a dataset of real wine reviews, find all occurrences of the preceding word, randomly pick one of them, and use whatever word followed it in that context.

Here’s the algorithm for generating the n-th word \(w_n\) given the preceding word \(w_{n-1}\) and a dataset \(D\):

Algorithm \(g(w_n | w_{n-1}, D)\):

  • Find \(O = \{o_1, o_2, \dots, o_m\}\), the set of all \(m\) occurrences of \(w_{n-1}\) in \(D\)
  • Randomly choose an occurrence \(o_k \in O\)
  • Return the word immediately following \(o_k\) in its original context

Because the generation of each word depends only on the previous word, it is completely independent of all the other preceding words in the description so far. In other words, \(P(w_n | w_{n-1}) = P(w_n | w_{n-1}, w_{n-2}, \dots, w_{1})\). This means that our model is a Markovian process. The transition probabilities between bigrams are empirically determined from our corpus.
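
For instance (with made-up numbers, purely for illustration): if the word “rich” appears 1,000 times in the corpus and is followed by “tannins” in 120 of those occurrences, then the empirical transition probability is \(P(\text{tannins} \mid \text{rich}) = 120/1000 = 0.12\), and our generator will pick “tannins” after “rich” 12% of the time.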

Of course this is probably not going to be a great model, since it does not consider any of the context besides the immediately preceding word. But it can still give surprisingly good results, as it lets us capture many of the common two-word phrases which define the genre of wine reviews.

Now let’s take a look at implementing this model.

Loading the Data

Luckily, someone has already gone through the effort of creating a dataset of more than 280,000 real wine descriptions! These were scraped from Wine Enthusiast and the dataset is hosted on Kaggle. The data have been downloaded and placed in the ./data folder. The data are split into two files.

import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import spacy

# first load data
data1 = pd.read_csv('./data/winemag-data-130k-v2.csv')
data2 = pd.read_csv('./data/winemag-data_first150k.csv')

print(data1.shape)
print(data2.shape)
(129971, 14)
(150930, 11)

Let’s take a quick look at the datasets:

data1.head(1)
Unnamed: 0 country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
data2.head(1)
Unnamed: 0 country description designation points price province region_1 region_2 variety winery
0 0 US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Cabernet Sauvignon Heitz

For this model, we are only interested in the descriptions, so let’s pull those out and combine all the descriptions from both files:

descriptions = list(data1["description"].values) + list(data2["description"].values)

# strip any leading or trailing whitespace
descriptions = [string.strip() for string in descriptions]

print("Total number of descriptions: ", len(descriptions))
Total number of descriptions:  280901

Let’s take a look at a few examples:

for item in np.random.choice(descriptions, size = 3): print(item, "\n")
Sweet mocha and coffee notes overwhelm the bouquet of this Pinot, with red raspberry and cherry skin notes providing support. Lively acidity and a satiny texture fill the mouth, while white pepper spice lingers on the finish. 

Hints of nail polish and flavors of hard citrus candy, with grainy honey and sugar. This is not a shy Riesling; it's intense, rich with peach and apricot, and pushed just a bit too far for some tastes. 

Produced by the owners of Châteauneuf-du-Pape estate Château Mont-Redon, this is a full and fruity wine. It has a good balance between acidity and red berry fruits that give a rich character. Packed with flavor, it's ready to drink. 

Preprocessing the Data

Now we need to process the data to get it ready for our model. But what is the best way to do this?

Data Structure

First we need to choose the data structure we will use. At its heart, our model relies on consecutive word pairs. So we could parse our dataset into a list of all word pairs, and then generate by filtering the list and randomly choosing.

However, we know that many of the word pairs will appear quite frequently! If we just parse into a list of all word pairs, we might have 100 identical entries in our list for “rich tannins.” Instead, we can count how many times each word pair occurs, and keep track of those counts. When it comes time to sample the next word, we simply sample with probabilities proportional to the counts instead of sampling uniformly from a huge list! This lets us generate words without having to store every individual occurrence of every word pair in the dataset.

In Python, we will implement this as a dictionary, where each key is a token. I’ll call this our vocabulary. The corresponding values are themselves dictionaries containing counts of all the tokens that followed that key.
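
For example, here is a minimal sketch of the structure with made-up counts (the real counts are computed from the corpus below):

from collections import Counter, defaultdict

# each vocabulary word maps to a Counter of the words that followed it
toy_pair_freq = defaultdict(Counter)
toy_pair_freq["rich"].update({"tannins": 120, "fruit": 45})
toy_pair_freq["peppery"]["finish"] += 60

print(toy_pair_freq["rich"])  # Counter({'tannins': 120, 'fruit': 45})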

Tokenizing

Each description in the dataset is a single string. We need to divide the descriptions into their individual words so we can count the word pairs. This process is called tokenization: dividing the input into a sequence of tokens.

Rather than doing this from scratch, we will use a pre-made tokenizer from spaCy. The advantage of this is that the pre-made tokenizer is smart enough to handle things like punctuation.
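
As a quick sanity check, here is a minimal sketch of what the tokenizer does (it assumes the en_core_web_sm model is installed; we load it again in the processing cell below):

import spacy

# punctuation and contractions become their own tokens
nlp_demo = spacy.load("en_core_web_sm")
print([token.text for token in nlp_demo("Rich tannins, isn't it?")])
# ['Rich', 'tannins', ',', 'is', "n't", 'it', '?']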

%%time

# use pre-made tokenizer from spacy
nlp = spacy.load("en_core_web_sm")

# a dictionary will be used to hold the vocabulary
# each item in the vocabulary will have a counter to track which words follow it
pair_freq = defaultdict(Counter)

# make a special token to mark the end of each description
end_token = "END_TOKEN"

# process all the descriptions
# disabling unneeded components in the pipeline to speed it up
for description in nlp.pipe(descriptions, disable=["tagger", "parser", "ner"]):
    # for each token, update the counts of the following word
    for token in description:
        # get the following token
        try:
            neighbor = token.nbor().text
        except IndexError:
            neighbor = end_token
        
        pair_freq[token.text][neighbor] += 1

vocab = list(pair_freq.keys())
print("Total number of words:", len(vocab))
Total number of words: 45481
CPU times: user 1min 25s, sys: 404 ms, total: 1min 26s
Wall time: 1min 27s
Let’s also save the processed counts to disk, so we don’t have to repeat the tokenization later:

import json

# save the word-pair counts as JSON
with open('robosomm_data.json', 'w') as fp:
    json.dump(pair_freq, fp)

import pickle

# also save a pickled copy, which preserves the defaultdict/Counter types
with open('robosomm_data.pickle', 'wb') as handle:
    pickle.dump(pair_freq, handle)

Our vocabulary consists of more than 45,000 unique words!

Let’s look at some random examples of word pairs:

for token1 in np.random.choice(vocab, size = 10): 
    all_following = list(pair_freq[token1].keys())
    token2 = np.random.choice(all_following)
    print(token1, token2)
dottings of
ripper ,
colada ,
assemblng quite
blackberry clusters
gallo salsa
Barefoot sparkling
sections that
Carpoli has
sauvage wildness

Implementing the model

First, we implement our function to generate the next word. Because we preprocessed the data in a smart way, this is actually very simple!

# functions to generate text
def gen_next_word(word):
    """Generate the next word given the preceding word"""
    # Get the counter for the following words
    all_following = pair_freq[word]
    # Get the words themselves, and corresponding counts
    following_words = list(all_following.keys())
    counts = np.array(list(all_following.values()))
    # Randomly sample the next word 
    weights = counts / np.sum(counts)
    return np.random.choice(following_words, p = weights)
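
For example (a hypothetical run; the output varies, since the next word is sampled randomly):

gen_next_word("rich")
# might return e.g. 'tannins' or ','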

Now, to generate a description from scratch, we just use a loop that repeatedly generates the next word. The loop stops when we either hit the special end-of-description token or reach a maximum description length.

def generate_description(prompt):  
    """Generate a wine descriptions given a prompt"""
    prompt_doc = nlp(prompt)
    
    # set up the while loop
    current_text = prompt
    last_word = prompt_doc[-1].text
    not_end_token = True
    max_desc_length = 100
    c = 0
    
    while not_end_token and c < max_desc_length:
        next_word = gen_next_word(last_word)
        if next_word == end_token:
            not_end_token = False
        else:
            current_text += " "+next_word
            last_word = next_word
            c += 1
    
    return current_text

Trying it out!

Now we can generate our own wine reviews! Let’s look at a few examples:

generate_description("A fruity merlot, with a smoky")
"A fruity merlot, with a smoky oak . The black tea and toasty oak , apricot , allied to the next six years of lively , it 's an apéritif wine very tight and soft , it too extracted Malbec . Best now . Now–2014 ."
generate_description("A full bodied cabernet")
"A full bodied cabernet sauvignon . It has honey , it 's a delicious , and berry fruits and rich future ."
generate_description("Spicy")
'Spicy cinnamon , it would pair with hearty mouthful of Pinot they are tougher , currants , cherries lead to the finish .'
generate_description("This wine is terrible")
'This wine is terrible flaws here . In the black fruit . It feels tight tannins , luscious and fresh and sophisticated notes , this wine offers aromas emerge with ample cherry flavors . The finish is very impressive is a bit of cherry , which offers a shame to soften . In the ripe and Mourvèdre , with suggesting wet cement , juicy and bitter , this 100 % Syrah with just yearning to say that will put in French oak flavors are certified - dimensional in the perfumes , packed with mixed with mature fruit and minerality and a final indication of'

Conclusion

There we have it! A (very rudimentary) text generation model!

The descriptions certainly aren’t great - I don’t think any human would be fooled! However, given how rudimentary our model is, the results are surprisingly good. The sentences are mostly coherent, and they also do well at capturing the vocabulary and phrases distinctive of the wine description genre! This shows how even the simplest model can “learn” features distinctive of the dataset it was trained on.

Of course we could improve on this model by using 3-grams or 4-grams instead of bigrams, which would let us capture more context. Or, we could use NLP methods that are much better than Markov chains! Recurrent neural networks, transformers, etc… Maybe we’ll look at those in a future notebook.
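
As a rough sketch of the trigram idea (hypothetical code reusing the names from above, not run here), we would key the counts on the pair of preceding words instead of a single word:

# count (w1, w2) -> w3 transitions instead of w1 -> w2
trigram_freq = defaultdict(Counter)
for description in nlp.pipe(descriptions, disable=["tagger", "parser", "ner"]):
    tokens = [token.text for token in description] + [end_token]
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        trigram_freq[(w1, w2)][w3] += 1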

In the meantime, enjoy this Markovian Sommelier!