# Classification of genome data with n-gram models

In this blog post we consider the following problem:

Gene Sequence Classification Problem (GSCP): Given two genes, G1 and G2, and a (relatively short) sub-sequence S from one of them tell from which gene the sub-sequence S is part of.

One way to derive a solution for this problem is to model each gene with an n-gram model and classify the sub-sequences according to the risk ratio statistic or odds ratio statistic based on the Markov chain state transition probabilities. We are going to make classification experiments for different n’s of the n-gram model and use a type of Receiver Operating Characteristic (ROC) plot to judge which n is the best.

This approach can be also applied to text classification. I consider genes for conciseness and clarity.

The computations for the plots shown below were done with the Mathematica package for n-gram Markov chain models NGramMarkovChains.m provided by the MathematicaForPrediction project at GitHub.

### Data and models

Mathematica has the function GenomeData that gives the DNA sequences for specified genes.

I made classification experiments for the following pairs of genes:
1. F2 and ETS1 The classification is done for sequences like this:
``` In:= StringTake[GenomeData["F2", "FullSequence"], {11200, 11200 + 100}] Out= "AGCAGGCATATCTAGGGGAGGAGCACCGCAGGCTGGGGGCATGGCAGGCACTAAGGCCCTGAGGTGGGA\ GCACTCTTGGCTTGTCTGGGGAGCAGTAGGGA" ```

### Approach

Let us provide more details for the approach outlined at the beginning of the post.
First we define the classification algorithm NGramClassifier(k): (I was too lazy to type the algorithm in here so I used a screenshot of the description I wrote in Mathematica.)

In order to find the best k for NGramClassifier(k) we make the following sets of experiments:

1. For each k, 1<=k<=10 calculate k-gram models for G1 and G2. For these models use only the first 80% of the gene sequences. (In other words, for each Gi the training set is only 80% of Gi’s length.)

2. Using the last 20% of G1 and G2 derive a set of test sequences T by randomly picking Gi and the sub-sequence length within the range {minLen, maxLen}.

3. For each k, 1<=k<=10 calculate the classifications of the elements of T using NGramClassifier(k) with the models calculated in step 1.

4. Calculate the True Positive Rate (TPR) and False Positive Rate (FPR) for each k in step 3.

5. Plot the points with coordinates {{TPR(k),TPR(k)}}, k in [1 10] and select the best k.

### ROC

With the classification results of the tuning experiments we make the following ROC point plots
1. F2 vs. ETS1 ### Conclusions and extensions

From the ROC plots we can see that the best classification results for GSCP with the genes “F2” and “ETS1” are obtained with Markov chain order 3 (or with 4-grams). Using order 2 (or 3-grams) is almost as good.

The experiments for the genes “F2” and “PAH” showed that using 3-grams gives best results for GSCP.

A natural extension of the experiments described is to repeat them for other pairs of genes and across different lengths of sub-sequences. In this way we can derive more general conclusions for the best n-gram length in the algorithm NGramClassifier.

# Markov chains n-gram model implementation

A year or so ago while reading the book “Programming in Lua” I reached the chapter where it is described an implementation of a random text generation algorithm using Markov chains and n-grams. I implemented a similar algorithm 10 years ago. My first encounter with this algorithm was when I was a teenager — I read a description of it in Martin Gardner‘s column in the Russian translation of “Scientific American”. I productized the implementation into a package (NGramMarkovChains.m) that is now available at the MathematicaForPrediction project at GitHub.

Implementing this algorithm in Mathematica turned out to be surprisingly easy using multi-dimensional sparse arrays (SparseArray) and weighted random sampling (RandomSample). The code I wrote can be run with N-grams of any N; the implementation in the book “Programming in Lua” 2nd ed. is for 2-grams — see Chapter 10.2 of the book.

Below are given snippets of texts generated by the algorithm using Darwin’s “The Origin of Species.” (Readily available within Mathematica.) From them we can see that the 5-gram generated text makes more sense than the 2-gram one. All 4 randomly generated texts start from the same place in the book.

Different variations of the generated texts can be derived by using the word count argument and the options `WordSeparators`. Note that `ToLowerCase` was applied to the text. Here is one more example with Shakespeare’s play “Hamlet”: The follow up of this post is my next post: “Classification of genome data with n-gram models”.