A Hangman Game Solver Based on Language Models

I've recently wrote a program for solving sentence-based hangman game automatically. Here's a post explaining my approach using the British National Corpus, srilm language model and hidden Markov model.

The Hangman Game

Last week, my friend Zero Cho was asked to write a program that solves the game of hangman automatically. The rules are simple: the questions are short sentences or phrases with their characters concealed, i.e. replaced with an underscore, while spaces and punctuations are revealed. The solving process is an interactive process in which the solver program will guess a letter, and the question system will reveal its locations. If you guessed 3 letters that are not in the sentence, you fail the question.

For example, the initial hint for the answer it's a sunny day is __'_ _ _____ ___, if a solver program guesses y, the system will respond with __'_ _ ____y __y.

My friend Zero used a statistical approach that first guesses vows with highest frequencies until the first correct guess. With one vow revealed, he then repeatedly choose the word with the most portion revealed to continue guessing. His approach to guessing a half-revealed word is also statistical, namely to find the most frequent English word that matches the pattern. His method is explained in detail in his recent blog post.

The shortcoming of this approach is that it lacks consideration for correlations between words in the sentences. For example, even the word sorry has a higher frequency than sunny, the latter is more likely to appear before the word day. In my attempt, I trained a Language Model to capture such correlations, and use hidden Markov model to guess every word in the sentence simultaneously.

Language Model

I use the state-of-the-art srilm package to train my language model on a portion of the British National Corpus (BNC). For those who are familiar with srilm, the parameters I've used include order-5, Kneser-Ney discount, and interpolation. I use SWIG to integrate srilm (in C) with my main program written in Python.

A language model is a statistical model that defines the probability of a sequence of words using probability distribution trained on a free text corpus.

For example, given a sequence of words $w_1, w_2, w_3, …, w_n$, the unigram (one-word) language model calculates the probability of such sequence as $$P(w_{1}w_{2}w_{3}…w_{n}) := P(w_1) * P(w_2) * P(w_3) * … * P(w_n)$$ this is sometimes called an order-1 language model. The word probability can be calculated by counting its appearances in the corpus: $$P(w) := \frac { count(w) } { |\ corpus\ | }$$ An order-2 language model, or bigram language model, defines the sequence probability using conditional probability: $$P(w_{1}w_{2}w_{3}…w_{n}) := P(w_1) * P(w_2|w_1) * P(w_3|w_2) * … * P(w_n|w_{n-1})$$ intuitively, conditional probability can be calculated as: $$P(w_a|w_b) := \frac { count(w_{b}w_{a}) } { count(w_b) }$$ Generally, higher order produces better results but also suffers more severely on the out-of-vocabulary problem and therefore sometimes needs to fall back to lower order conditional probabilities. Many theories have been proposed to improve the performance of language models, including absolute discount estimation, good turing, Kneser-Ney smoothing. Formal definition of language model can be found on Wikipedia.

Hidden Markov Model

Hidden Markov model is a statistical model that solves a sequence of hidden states given a sequence of observations, state transition probability distribution, and emission probability distribution. For example, in speech recognition, the recorded audio speech is the observation, the actual text are the hidden states. Another example is the optical character recognition (OCR) where the scanned images are the observations. Emission probability distribution indicates the probability from state to observation, i.e., from text to audio or image. Language model is commonly used for state transition probability distribution.

In the case solving the Hangman game, the hidden states are the English words in the answer sentences, e.g., it's, a, sunny, day, and the observations are some half-revealed sequences, e.g., __'s, _, s_nny, __y. The state transition probability is calculated by the trained language model, e.g., $P(day | sunny)$ in a order-1 language model. I define the emission probability as the normalize probability of all the English words that matches the half-revealed pattern, e.g., normalized $P(day), P(pay), P(bay), P(hey), P(may), …, etc$. for state __y, and zero otherwise. With the transition and emission probabilities defined, we can use the Viterbi dynamic programming algorithm to calculate the optimal state sequence.

Formal definition and more information on HMM can be found on Wikipedia pages here and here, or in a slide I made when I was TA for a NLP course here (in Chinese).

Algorithm

It is obvious that to run the hidden Markov model on a fully concealed sequence will yield bad results. Therefore, at stage one my solver also start by guessing high frequency letters. Instead of guessing vows only, I also use consonants, hoping that revealing some consonants will help the second stage. The sorted list begins with e, a, i, s, r, n, t, o, …, etc. This process stops until it had made one wrong guesses or have revealed more than four letters.

In the second stage, the trained hidden Markov model will tag the half-revealed words with English word sequence that yield the highest language model probability and word frequencies. The solver will then guess the letter that appears in most words. This process repeats until all characters are revealed, and the number of failed guesses are recorded.

Results

Here are three examples and results. The interactive guessing progress and some internal states are shown in the three tables below.

"describe what is needed"

stage errors hint HMM prediction guess
1 0 ________ ____ __ ______ N/A e
1 0 _e_____e ____ __ _ee_e_ N/A a
1 0 _e_____e __a_ __ _ee_e_ N/A i
1 0 _e___i_e __a_ i_ _ee_e_ N/A s
1 0 _es__i_e __a_ is _ee_e_ N/A r
2 0 _es_ri_e __a_ is _ee_e_ describe what is needed d
2 0 des_ri_e __a_ is _eeded describe what is needed n
2 0 des_ri_e __a_ is needed describe what is needed t
2 0 des_ri_e __at is needed describe what is needed t
2 0 descri_e __at is needed describe what is needed hbw (abrv.)
2 0 describe what is needed (solved)

"an honor roll of online options"

stage errors hint HMM prediction guesses
1 0 __ _____ ____ __ ______ _______ N/A eaisr
2 0 a_ ____r r___ __ ___i_e ___i__s at honor roll of police options o
2 0 a_ _o_or ro__ o_ o__i_e o__io_s at honor roll of office options n
2 0 an _onor ro__ o_ on_ine o__ions an honor roll of online options l
2 0 an _onor roll o_ online o__ions an honor roll of online options t
2 0 an _onor roll o_ online o_tions an honor roll of online options hpf
2 0 an honor roll of online options (solved)

This one shows HMM produces different results as more characters are revealed.

"members of the supreme court"

stage errors hint HMM prediction guesses
1 0 _______ __ ___ _______ _____ N/A e
1 0 _e__e__ __ __e ____e_e _____ N/A a
2 1 _e__e__ __ __e ____e_e _____ members of the supreme court rstoumchpbf
2 1 members of the supreme court

This last one really shows the power of language model. With only e revealed, the HMM process was able to guess the correct result.

Improvements

For stage one, instead of using just the highest frequency letters, conditional probability can also be a factor for choosing the next letter. For example, if s is revealed as the second to last character in some words, t and e may likely to be the last letter of such words.

For the two stages, we can decide to advance from stage 1 to 2 base on different factors instead of heuristically. These may include the confident of the HMM results, the percentage of letters revealed, the before mentioned conditional probabilities and more. Global optimization approaches can be used to find good parameters to determine the timing to start stage two. Alternatively, we can treat the two stages as different strategies and use genetic algorithm to apply them alternately.