Markov chains n-gram model implementation

A year or so ago, while reading the book “Programming in Lua”, I reached the chapter describing an implementation of a random text generation algorithm using Markov chains and n-grams. I had implemented a similar algorithm about 10 years ago. My first encounter with this algorithm was as a teenager, when I read a description of it in Martin Gardner’s column in the Russian translation of “Scientific American”. I productized my implementation into a package (NGramMarkovChains.m) that is now available in the MathematicaForPrediction project at GitHub.

Implementing this algorithm in Mathematica turned out to be surprisingly easy using multi-dimensional sparse arrays (SparseArray) and weighted random sampling (RandomSample). The code I wrote can be run with n-grams of any n; the implementation in “Programming in Lua”, 2nd ed., is only for 2-grams (see Chapter 10.2 of the book).
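The package itself is written in Mathematica, but the underlying idea is language-agnostic. Here is a minimal Python sketch, assuming a dictionary of transition counts in place of SparseArray and `random.choices` in place of RandomSample; the function names are illustrative, not the package's API:

```python
import random
from collections import defaultdict

def build_ngram_model(words, n=2):
    """Map each (n-1)-word state to the counts of the words that follow it."""
    model = defaultdict(lambda: defaultdict(int))
    for i in range(len(words) - n + 1):
        state = tuple(words[i:i + n - 1])
        model[state][words[i + n - 1]] += 1
    return model

def generate(model, start_state, num_words, seed=None):
    """Walk the chain, sampling each next word proportionally to its count."""
    rng = random.Random(seed)
    state = tuple(start_state)
    out = list(state)
    for _ in range(num_words - len(out)):
        choices = model.get(state)
        if not choices:  # dead end: this state never occurred mid-text
            break
        next_words, weights = zip(*choices.items())
        nxt = rng.choices(next_words, weights=weights)[0]
        out.append(nxt)
        state = state[1:] + (nxt,)  # slide the window by one word
    return " ".join(out)

words = "the cat sat on the mat and the cat ran off".lower().split()
model = build_ngram_model(words, n=3)
print(generate(model, words[:2], 8, seed=42))
```

Because the model keys on the last n-1 words, running the same code with a larger n is just a matter of changing the argument, which is exactly why larger n-grams produce more coherent text: each choice is conditioned on more context.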

Below are snippets of text generated by the algorithm using Darwin’s “The Origin of Species” (readily available within Mathematica). From them we can see that the 5-gram generated text makes more sense than the 2-gram one. All four randomly generated texts start from the same place in the book.

Different variations of the generated texts can be derived by using the word count argument and the option WordSeparators. Note that ToLowerCase was applied to the text.
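The preprocessing amounts to lowercasing the text and splitting it on a chosen set of separator characters. A Python sketch of that step, with an illustrative separator set standing in for the WordSeparators option:

```python
import re

def words_from_text(text, separators=" \t\n.,;:!?\"()-"):
    """Lowercase the text and split it on runs of separator characters."""
    pattern = "[" + re.escape(separators) + "]+"
    return [w for w in re.split(pattern, text.lower()) if w]

print(words_from_text("It is interesting to contemplate an entangled bank."))
```

Changing the separator set changes what counts as a “word” (e.g. whether punctuation marks become tokens of their own), which in turn changes the flavor of the generated text.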

[Image: Markov chain n-grams over “The Origin of Species”]

Here is one more example with Shakespeare’s play “Hamlet”:
[Image: Markov chain n-grams over “Hamlet”]

The follow-up to this post is my next post, “Classification of genome data with n-gram models”.


2 thoughts on “Markov chains n-gram model implementation”

    • Yes, we can use the N-gram models to find the most common phrases. We can just take the top k N-grams with the highest probability, or the N-grams with probabilities above a certain threshold. As for the most important phrases, it really depends on how we define phrase and word importance. If we take the definition “important words are in important sentences, and important sentences have important words”, then we can use matrix factorization methods to find the important words and then see which N-grams contain those words.
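The top-k idea from the reply above can be sketched directly from the n-gram counts; this is illustrative Python, not part of the package:

```python
from collections import Counter

def top_k_ngrams(words, n, k):
    """Count all n-grams and return the k most frequent with their probabilities."""
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    total = sum(counts.values())
    return [(ngram, c / total) for ngram, c in counts.most_common(k)]

words = "the cat sat on the mat and the cat ran".split()
print(top_k_ngrams(words, 2, 3))
```

Thresholding instead of taking the top k is the same computation with a filter on the probability column.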
