A year or so ago while reading the book “Programming in Lua” I reached the chapter where it is described an implementation of a random text generation algorithm using Markov chains and n-grams. I implemented a similar algorithm 10 years ago. My first encounter with this algorithm was when I was a teenager — I read a description of it in Martin Gardner‘s column in the Russian translation of “Scientific American”. I productized the implementation into a package (NGramMarkovChains.m) that is now available at the MathematicaForPrediction project at GitHub.

Implementing this algorithm in *Mathematica* turned out to be surprisingly easy using multi-dimensional sparse arrays (SparseArray) and weighted random sampling (RandomSample). The code I wrote can be run with N-grams of any N; the implementation in the book “Programming in Lua” 2nd ed. is for 2-grams — see Chapter 10.2 of the book.

Below are given snippets of texts generated by the algorithm using Darwin’s “The Origin of Species.” (Readily available within *Mathematica*.) From them we can see that the 5-gram generated text makes more sense than the 2-gram one. All 4 randomly generated texts start from the same place in the book.

Different variations of the generated texts can be derived by using the word count argument and the options `WordSeparators`

. Note that `ToLowerCase`

was applied to the text.

Here is one more example with Shakespeare’s play “Hamlet”:

The follow up of this post is my next post: “Classification of genome data with n-gram models”.

### Like this:

Like Loading...

*Related*

Is this useful for just pulling our a list of the most common and most important phrases from a body of text?

Yes, we can use the N-gram models to find out the most common phrases. We can just take the top

kN-grams with the highest probability, or the N-grams with probabilities above a certain threshold. As for most important phrases, it really depends how we define phrase and word importance. If we take the definition “important words are in important sentences, and important sentences have important words” then we can use matrix factorization methods to find the important words and then see which N-grams have those words.