Classification of genome data with n-gram models

In this blog post we consider the following problem:

Gene Sequence Classification Problem (GSCP): Given two genes, G1 and G2, and a (relatively short) sub-sequence S taken from one of them, tell from which gene S was taken.

One way to derive a solution for this problem is to model each gene with an n-gram model and classify the sub-sequences according to the risk ratio statistic or odds ratio statistic based on the Markov chain state transition probabilities. We are going to make classification experiments for different n’s of the n-gram model and use a type of Receiver Operating Characteristic (ROC) plot to judge which n is the best.

This approach can also be applied to text classification. I consider genes for conciseness and clarity.

The full article with Mathematica code for the experiments described in the post can be downloaded from here.

The computations for the plots shown below were done with the Mathematica package for n-gram Markov chain models NGramMarkovChains.m provided by the MathematicaForPrediction project at GitHub.

Data and models

Mathematica has the function GenomeData that gives the DNA sequences for specified genes.

I made classification experiments for the following pairs of genes:
1. F2 and ETS1
   Genes F2 and ETS1

2. F2 and PAH
   Genes F2 and PAH

The classification is done for sequences like this:

In[812]:= StringTake[GenomeData["F2", "FullSequence"], {11200, 11200 + 100}]


Let us provide more details for the approach outlined at the beginning of the post.
First we define the classification algorithm NGramClassifier(k):

(I was too lazy to type the algorithm in here so I used a screenshot of the description I wrote in Mathematica.)

In order to find the best k for NGramClassifier(k) we make the following sets of experiments:

  1. For each k, 1<=k<=10 calculate k-gram models for G1 and G2. For these models use only the first 80% of the gene sequences. (In other words, for each Gi the training set is only 80% of Gi’s length.)

  2. Using the last 20% of G1 and G2 derive a set of test sequences T by randomly picking Gi and the sub-sequence length within the range {minLen, maxLen}.

  3. For each k, 1<=k<=10 calculate the classifications of the elements of T using NGramClassifier(k) with the models calculated in step 1.

  4. Calculate the True Positive Rate (TPR) and False Positive Rate (FPR) for each k in step 3.

  5. Plot the points with coordinates {FPR(k), TPR(k)}, 1<=k<=10, and select the best k.
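The steps above can be sketched in Python. Everything here is an illustrative assumption rather than the Mathematica package code: the toy sequences, the add-one (Laplace) smoothing, and the decision rule, which compares log-likelihoods under the two models (the sign of the log-likelihood ratio, a close relative of the odds-ratio statistic mentioned at the start of the post):

```python
import math
import random
from collections import Counter

def ngram_model(seq, k):
    """k-th order Markov model: counts of (k+1)-grams and of their k-gram contexts."""
    counts = Counter(seq[i:i + k + 1] for i in range(len(seq) - k))
    totals = Counter(seq[i:i + k] for i in range(len(seq) - k))
    return counts, totals

def log_likelihood(model, s, k, alphabet_size=4):
    """Log-probability of s under the model, with add-one (Laplace) smoothing."""
    counts, totals = model
    ll = 0.0
    for i in range(len(s) - k):
        gram, ctx = s[i:i + k + 1], s[i:i + k]
        ll += math.log((counts[gram] + 1) / (totals[ctx] + alphabet_size))
    return ll

def classify(m1, m2, s, k):
    """NGramClassifier(k) analogue: pick the gene whose model gives s higher likelihood."""
    return 1 if log_likelihood(m1, s, k) >= log_likelihood(m2, s, k) else 2

# Toy "genes" with different letter statistics; first 80% trains, last 20% tests
random.seed(7)
g1 = "".join(random.choice("ACGT") for _ in range(4000))
g2 = "".join(random.choice("AACCGT") for _ in range(4000))
k = 2
m1 = ngram_model(g1[:3200], k)
m2 = ngram_model(g2[:3200], k)
tests = [(g1[3200 + i:3300 + i], 1) for i in range(0, 700, 100)] + \
        [(g2[3200 + i:3300 + i], 2) for i in range(0, 700, 100)]
accuracy = sum(classify(m1, m2, s, k) == lab for s, lab in tests) / len(tests)
print(accuracy)
```

From the per-class classification counts of such a run one obtains the TPR and FPR values used in the ROC plots below.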


With the classification results of the tuning experiments we make the following ROC point plots
1. F2 vs. ETS1
   ROCs F2 vs ETS1

2. F2 vs. PAH
   ROCs F2 vs PAH

Conclusions and extensions

From the ROC plots we can see that the best classification results for GSCP with the genes “F2” and “ETS1” are obtained with Markov chain order 3 (or with 4-grams). Using order 2 (or 3-grams) is almost as good.

The experiments for the genes “F2” and “PAH” showed that using 3-grams gives the best results for GSCP.

A natural extension of the experiments described is to repeat them for other pairs of genes and across different lengths of sub-sequences. In this way we can derive more general conclusions for the best n-gram length in the algorithm NGramClassifier.

Markov chains n-gram model implementation

A year or so ago, while reading the book “Programming in Lua”, I reached the chapter that describes an implementation of a random text generation algorithm using Markov chains and n-grams. I implemented a similar algorithm 10 years ago. My first encounter with this algorithm was when I was a teenager: I read a description of it in Martin Gardner‘s column in the Russian translation of “Scientific American”. I productized the implementation into a package (NGramMarkovChains.m) that is now available at the MathematicaForPrediction project at GitHub.

Implementing this algorithm in Mathematica turned out to be surprisingly easy using multi-dimensional sparse arrays (SparseArray) and weighted random sampling (RandomSample). The code I wrote can be run with N-grams of any N; the implementation in the book “Programming in Lua” 2nd ed. is for 2-grams — see Chapter 10.2 of the book.
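The core of the algorithm fits in a few lines. Here is a Python sketch of the idea (the Mathematica package uses SparseArray counts and weighted RandomSample; this toy version encodes transition weights simply by repetition in the successor lists, and is an illustration, not the package implementation):

```python
import random
from collections import defaultdict

def build_chain(words, n):
    """Map each (n-1)-word state to the words observed to follow it.
    Repeats in a successor list play the role of transition weights."""
    chain = defaultdict(list)
    for i in range(len(words) - n + 1):
        state = tuple(words[i:i + n - 1])
        chain[state].append(words[i + n - 1])
    return chain

def generate(chain, start, count):
    """Random walk over the chain, starting from the given (n-1)-word state."""
    out = list(start)
    state = tuple(start)
    for _ in range(count):
        successors = chain.get(state)
        if not successors:
            break
        nxt = random.choice(successors)  # weighted by multiplicity
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out)

words = "the quick brown fox jumps over the lazy dog the quick brown cat".split()
chain = build_chain(words, 3)  # 3-grams: the state is a pair of words
print(generate(chain, words[:2], 8))
```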

Below are given snippets of texts generated by the algorithm using Darwin’s “The Origin of Species.” (Readily available within Mathematica.) From them we can see that the 5-gram generated text makes more sense than the 2-gram one. All 4 randomly generated texts start from the same place in the book.

Different variations of the generated texts can be derived by using the word count argument and the option WordSeparators. Note that ToLowerCase was applied to the text.

Markov chains n-grams over Origin of species

Here is one more example with Shakespeare’s play “Hamlet”:
Markov chains n-grams over Hamlet

The follow-up of this post is my next post: “Classification of genome data with n-gram models”.

Estimation of conditional density distributions

Assume we have temperature data for a given location and we want to predict today’s temperature at that location using yesterday’s temperature. More generally, the problem discussed in this blog post can be stated as “How to estimate the conditional density of the predicted variable given a value of the conditioning covariate?”

One way to answer this question is to provide a family of regression quantiles and from them to estimate the Cumulative Distribution Function (CDF) for a given value of the covariate. In other words, given data pairs {x_i, y_i}, i = 1, …, n, and a value x_0, we find CDF_{x_0}(y). From the estimated CDF we can estimate the Probability Density Function (PDF). We can go further and use Monte Carlo type simulations with the obtained PDFs (this will be discussed in another post).

The experiments in this blog post follow the example sub-sub-section “Daily Melbourne Temperatures” of the book “Quantile regression” by Roger Koenker.

Consider the temperature time series of Atlanta, GA, USA from 2006.01.01 to 2014.01.12:

location = {"Atlanta", "GA"};
tempData = WeatherData[location, "Temperature", {{2006, 1, 1}, {2014, 1, 12}, "Day"}];
tempData // Length

Atlanta temperature time series 2006-2014

We can see that this time series is heteroscedastic — the range of temperatures is wider in the winter months.

Using the time series data let us make pairs of yesterday and today data:

tempPData = Partition[tempData[[All, 2]], 2, 1];

(We pair up two consecutive temperatures.)
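In Python terms, Partition[xs, 2, 1] slides a length-2 window with offset 1; a one-function sketch:

```python
def pairs(xs):
    """Consecutive (yesterday, today) pairs: an analogue of Partition[xs, 2, 1]."""
    return list(zip(xs, xs[1:]))

print(pairs([5, 7, 6, 9]))  # [(5, 7), (7, 6), (6, 9)]
```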

Atlanta yesterday-vs-today temperatures

Let us calculate the regression quantiles for a number of quantiles. (The package QuantileRegression.m used is discussed in previous posts of this blog, and it can be downloaded from MathematicaForPrediction at GitHub.)

The considered set of quantiles is qs = {0.02, 0.05, 0.10, …, 0.95, 0.98}:

In[397]:= qs = N@Join[{0.02}, FindDivisions[{0, 1}, 20][[2 ;; -2]], {0.98}]
qs // Length

Out[397]= {0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.98}
Out[398]= 21

Next we calculate the regression quantiles using 3rd order B-spline basis functions over 5 knots.

In[399]:= AbsoluteTiming[qFuncs = QuantileRegression[tempPData, 5, qs];]

Atlanta regression quantiles for yesterday-vs-today temperatures

Given a temperature value, say t0=2°C, we can estimate CDF(t) using the quantiles and the values at t0 of the corresponding regression quantile functions.

t0 = 2;
xs = Through[qFuncs[t0]];
cdfPairs = Transpose[{xs, qs}];

We can get a first order (linear) approximation by simply connecting the consecutive points {x_i, q_i}, 1 <= i <= |qs|, as shown on the following plot.

Atlanta estimated CDF for 2C

On the plot the dashed vertical grid lines are for the quantiles {0.05, 0.25, 0.5, 0.75, 0.95}; the solid vertical gray grid line is for t0.

Let us define functions for the CDF and PDF approximation and their plots:

CDFEstimate[t0_] := CDFEstimate[qs, qFuncs, t0];
CDFEstimate[qs_, qFuncs_, t0_] :=
  Interpolation[Transpose[{Through[qFuncs[t0]], qs}], InterpolationOrder -> 1];

CDFPDFPlot[t0_?NumberQ, qCDFInt_InterpolatingFunction,
   qs : {_?NumericQ ..} : qs, opts : OptionsPattern[]] :=
  Block[{qsGL = {0.05, 0.25, 0.5, 0.75, 0.95}, xsGL},
   xsGL = Pick[Flatten@qCDFInt["Grid"], MemberQ[qsGL, #] & /@ qs];
   Plot[{qCDFInt[x], qCDFInt'[x]},
    {x, qCDFInt["Domain"][[1, 1]], qCDFInt["Domain"][[1, 2]]},
    GridLines -> {Prepend[
       MapThread[{#2, {Dashed, Blend[{Blue, Green, Pink}, #1]}} &, {qsGL, xsGL}],
       {t0, GrayLevel[0.6]}], None},
    Axes -> False, Frame -> True,
    PlotLabel -> "Estimated CDF and PDF for " <> ToString[t0] <> "\[Degree]C", opts]
  ]
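The same first-order CDF and PDF estimates can be sketched outside Mathematica as well. In the Python sketch below the values in xs are made-up stand-ins for Through[qFuncs[t0]], not the Atlanta fit:

```python
def cdf_estimate(xs, qs):
    """Piecewise-linear CDF through the points (x_i, q_i); clamped outside the range."""
    def cdf(y):
        if y <= xs[0]:
            return qs[0]
        if y >= xs[-1]:
            return qs[-1]
        for (x0, q0), (x1, q1) in zip(zip(xs, qs), zip(xs[1:], qs[1:])):
            if x0 <= y <= x1:
                return q0 + (q1 - q0) * (y - x0) / (x1 - x0)
    return cdf

def pdf_estimate(xs, qs):
    """The PDF of a piecewise-linear CDF is a step function: the slope of each segment."""
    def pdf(y):
        for (x0, q0), (x1, q1) in zip(zip(xs, qs), zip(xs[1:], qs[1:])):
            if x0 <= y <= x1:
                return (q1 - q0) / (x1 - x0)
        return 0.0
    return pdf

# Hypothetical regression-quantile values at some t0, paired with their quantiles
xs = [-1.0, 1.5, 3.0, 5.5, 8.0]   # stand-in for Through[qFuncs[t0]]
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
cdf = cdf_estimate(xs, qs)
print(cdf(3.0))  # 0.5 by construction
```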

Using these definitions let us plot the estimated CDF’s and PDF’s for a collection of temperatures. Note that the quantiles are given with vertical dashed grid lines; the median is colored with green. The values of the conditioning covariate are given on the plot labels and are marked with a solid, gray vertical line.
Atlanta estimated CDF's and PDF's

In the grid of CDF-and-PDF plots we can see that:
1. when the temperature is lower, there is a higher probability that the next day's temperature will be higher;
2. when yesterday's temperature t0 is within the range [10, 25]°C, the median almost coincides with t0;
3. for medium and high temperatures we have distributions similar to the skew normal distribution.

As I mentioned in the beginning of the post I followed the exposition for Melbourne’s temperatures in Koenker’s book. I did the calculations described above with temperature time series data for Melbourne. Here is the plot with the fitted regression quantiles:
Melbourne regression quantiles for yesterday-vs-today temperatures

Here is the grid of plots over a collection of temperatures. Note that they lead to different conclusions than the ones for Atlanta, GA, USA.
Melbourne estimated CDF's and PDF's

OPL code for quantile regression with B-splines

I found it to be an interesting exercise to write an optimization model using the Optimization Programming Language (OPL) provided by IBM ILOG CPLEX.

The quantile regression OPL code is provided in the sub-directory OPL of the MathematicaForPrediction project at GitHub. The code can use only first order B-splines. The number of knots and the quantile are parameters to be specified. The reason to use only first order splines is that OPL provides only one type of piecewise function: linear piecewise functions. OPL piecewise functions are somewhat awkward to interpret, but relatively easy to specify. (I will update this post with Mathematica code to parse and interpret them.) Since OPL's piecewise{...} can also be used to specify step functions, I could have programmed zero order splines, but that would be useful only for educational purposes.

The code is short:

tuple Point {
float x;
float y;
}

// Data points provided by an external file
{Point} data = ...;

// Number of knots parameter
int k = 7;

// Quantile parameter
float q = 0.75;

float dataMin = min(d in data) d.x;
float dataMax = max(d in data) d.x;
float cLen = ( dataMax - dataMin ) / (k-1);

dvar float+ u[data];
dvar float+ v[data];
dvar float+ bs[1..k];

minimize sum(d in data) q*u[d] + sum(d in data) (1-q)*v[d];

subject to {

forall ( d in data ) {
(sum ( i in 1..k ) bs[i] * (piecewise{ 0-> dataMin + (i-2)*cLen; 1/cLen -> dataMin + (i-1)*cLen; -1/cLen -> dataMin + i*cLen; 0}( dataMin + (i-2)*cLen, 0) d.x) ) + u[d] - v[d] == d.y;
}
}

The code mentioned earlier provides print-outs of the piecewise, B-spline basis functions and the weights found for them. (The weighted sum of the B-spline basis functions gives the regression quantile.)
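For reference, the piecewise expression above encodes a first order B-spline basis function: a "hat" that rises from knot i-2 to a peak of 1 at knot i-1 and falls back to 0 at knot i. Here is a Python sketch of that basis and of the weighted sum that forms the regression quantile (variable names mirror the OPL code; this is an illustration, not part of the OPL sources):

```python
def hat(i, x, data_min, c_len):
    """First-order B-spline basis function: 0 at knots i-2 and i, peak 1 at knot i-1.
    Knot j sits at data_min + j*c_len, matching the OPL piecewise expression."""
    left = data_min + (i - 2) * c_len
    mid = data_min + (i - 1) * c_len
    right = data_min + i * c_len
    if left <= x <= mid:
        return (x - left) / c_len
    if mid < x <= right:
        return (right - x) / c_len
    return 0.0

def spline(weights, x, data_min, c_len):
    """The regression quantile: a weighted sum of the basis functions bs[1..k]."""
    return sum(w * hat(i, x, data_min, c_len) for i, w in enumerate(weights, start=1))
```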

Quantile regression with B-splines

Last weekend I made two implementations of Quantile Regression (QR) calculation with B-spline bases. The first implementation is based on the Linear Programming (LP) formulation of the quantile minimization problem. The second implementation is a direct translation of the non-LP minimization formulation. (Mathematica’s functions LinearProgramming and Minimize are used respectively.)
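Both formulations minimize the same objective: the sum of "check" (pinball) losses of the residuals, where rho_q(r) = q*r for r >= 0 and (q - 1)*r for r < 0. Here is a small Python sketch of this loss (my illustration, not the package code), showing that for a constant model the minimizer over the data points is the empirical q-quantile; the LP route then just splits each residual into u_i - v_i with u_i, v_i >= 0:

```python
def pinball(q, r):
    """Check loss rho_q: asymmetric absolute deviation of a residual r."""
    return q * r if r >= 0 else (q - 1) * r

def total_loss(q, ys, c):
    """Total check loss of the constant model f(x) = c."""
    return sum(pinball(q, y - c) for y in ys)

ys = [1.0, 2.0, 3.0, 4.0, 10.0]
best = min(ys, key=lambda c: total_loss(0.75, ys, c))
print(best)  # 4.0: an empirical 0.75-quantile of ys
```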

The implementations are in the package QuantileRegression.m provided by the project MathematicaForPrediction at GitHub.

Because of the signature required by the QR implementation with B-splines and the returned results, I had to rename the previous QR implementation to QuantileRegressionFit and I named the new one (with the B-splines) QuantileRegression. I think the new names reflect better what the functions are doing.

Right now, to me the most interesting application of QR is detection of outliers in time series and reconstruction of conditional density functions. Below is a demonstration of using QR for outlier detection.

Consider this data:
Atlanta temperature 2006-2013 changed with y = y+Sqrt[x]

which was generated with the code:

tempData = WeatherData["Atlanta", "Temperature", {{2006, 1, 1}, {2013, 12, 31}, "Day"}];
tempPData = tempData;
tempPData[[All, 1]] = DateDifference[tempData[[1, 1]], #, "Day"] & /@ tempData[[All, 1]];
tempPData[[All, 1]] = tempPData[[All, 1, 1]];
tempPData[[All, 2]] = Sqrt[tempPData[[All, 1]]] + tempPData[[All, 2]];

(I took the temperature in Atlanta, GA, USA for the last 7 years and modified the data with the formula Yi = sqrt(Xi) + Yi .)

Here is a plot with six regression quantiles calculated using B-splines for the quantiles {0.02, 0.2, 0.4, 0.6, 0.8, 0.98}:
Regression quantiles for Atlanta temperature 2006-2013 changed with y = y+Sqrt[x]

Here is another plot with the 0.02 and 0.98 regression quantiles outlining the data:
Outline with regression quantiles for Atlanta temperature 2006-2013 changed with y = y+Sqrt[x]

The regression quantiles were computed with the command:

qFuncs = QuantileRegression[tempPData, 44, {0.02, 0.2, 0.4, 0.6, 0.8, 0.98}]

Note that in order to get satisfactory results, the only tuning value I had to choose was the second parameter, which in this case specifies the number of knots for the B-spline basis. The second parameter can also be a list of knots. The option InterpolationOrder specifies the spline order (by default InterpolationOrder -> 3).

Update, 2015.07.04

This blog post was written with an older version of Mathematica, version 9 (the current is 10.1). Mathematica 10 uses a different format for the WeatherData results: TimeSeries objects are returned, which wrap the numerical values (temperature, wind speed, etc.) in QuantityArray and/or Quantity.

Below is given code that runs with Mathematica 10.1. (The only change is the second line of the data loading part.)

1. Data load:

tempData = WeatherData["Atlanta", "Temperature", {{2006, 1, 1}, {2013, 12, 31}, "Day"}];
tempData = Transpose[{Normal[tempData["Dates"]], Normal[tempData["Values"]] /. Quantity[q_, _] :> q}];
tempPData = tempData;
tempPData[[All, 1]] = DateDifference[tempData[[1, 1]], #, "Day"] & /@ tempData[[All, 1]];
tempPData[[All, 1]] = tempPData[[All, 1, 1]];
tempPData[[All, 2]] = Sqrt[tempPData[[All, 1]]] + tempPData[[All, 2]];

2. Regression quantiles computation:

qFuncs = QuantileRegression[tempPData, 44, {0.02, 0.2, 0.4, 0.6, 0.8, 0.98}];

3. Plots:

gr1 = ListPlot[tempPData];
gr2 = Plot[
Evaluate[Through[qFuncs[x]]], {x, Min[tempPData[[All, 1]]],
Max[tempPData[[All, 1]]]}, PlotPoints -> 1200,
PerformanceGoal -> "Speed"];
jan1Pos = Position[DateList /@ tempData[[All, 1]], {_, 1, 1, ___}][[All, 1]];
Show[{gr1, gr2}, PlotRange -> All, Axes -> False, Frame -> True
, GridLines -> {jan1Pos, FindDivisions[{##}, 10] &},
GridLinesStyle -> Directive[GrayLevel[0.8], Dashed],
AspectRatio -> 1/3]