Find Fit for Non-linear data

Three weeks ago I replied to a question on the Wolfram Community site about finding a function fit to points of non-linear data.

Here is the data:

Non-linear data

I proposed a different way of doing the non-linear fitting, using quantile regression with B-splines. The advantages of the approach are that it does not require guessing the fit functions and it needs less experimentation.

Load the package QuantileRegression.m:

Needs["QuantileRegression`"]
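(If QuantileRegression.m is not installed locally, it can also be loaded directly from the MathematicaForPrediction repository at GitHub; the raw-file URL below is my assumption about the package location and should be verified against the repository.)

(* load the package straight from GitHub; URL assumed, verify against the repository *)
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/QuantileRegression.m"]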

After several experiments with the number of knots we can find a regression quantile using 3rd order B-splines over 18 knots that approximates the data well.

qfunc = Simplify[QuantileRegression[data, 18, {0.5}, InterpolationOrder -> 2][[1]]];

qfGr = Plot[qfunc[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}, PlotPoints -> 700];
Show[{dGr, qfGr}, Frame -> True, ImageSize -> 700]

NonLinearFitWithBSplineQuantileRegression-18RegularKnots

Here is the relative error estimate:

ListPlot[Map[{#[[1]], (qfunc[#1[[1]]] - #1[[2]])/#1[[2]]} &, data],
Filling -> Axis, Frame -> True, ImageSize -> 700]

RelativeErrorNonLinearFitWithBSplineQuantileRegression-18RegularKnots

I would like to point out that if we use a large number of knots we can get an aliasing effect:

qfunc = Simplify[
QuantileRegression[data, 40, {0.5}, InterpolationOrder -> 2][[1]]];

qfGr = Plot[qfunc[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}, PlotPoints -> 700];
Show[{dGr, qfGr}, Frame -> True, ImageSize -> 700]

NonLinearFitWithBSplineQuantileRegression-40RegularKnots

If we select a non-uniform grid of knots according to the gradients of the data, we can get a good approximation on the first attempt.


In[107]:= knots =
Join[
Rescale[Range[0, 1, 0.2], {0, 1}, {Min[data[[All, 1]]],
Max[data[[All, 1]]] 3/4}],
Rescale[Range[0, 1, 0.1], {0, 1}, {Max[data[[All, 1]]] 3/4,
Max[data[[All, 1]]]}]
]

Out[107]= {0., 3.75, 7.5, 11.25, 15., 18.75, 18.75, 19.375, 20., 20.625, 21.25, 21.875, 22.5, 23.125, 23.75, 24.375, 25.}

In[108]:= qfunc = QuantileRegression[data, knots, {0.5}][[1]];

The grid lines in the plot below are made with the knots.

qfGr = Plot[qfunc[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}];
Show[{dGr, qfGr}, GridLines -> {knots, None},
GridLinesStyle -> Directive[GrayLevel[0.8], Dashed], Frame -> True,
ImageSize -> 700]

NonLinearFitWithBSplineQuantileRegression-17NonRegularKnots

Here are the relative errors:

ListPlot[Map[{#[[1]], (qfunc[#1[[1]]] - #1[[2]])/#1[[2]]} &, data],
Filling -> Axis, Frame -> True, ImageSize -> 700]

RelativeErrorNonLinearFitWithBSplineQuantileRegression-17NonRegularKnots

Here is the approximation function after using PiecewiseExpand and Simplify:

qfunc = Evaluate[Simplify[PiecewiseExpand[qfunc[[1]]]]] &;
qfunc

NonLinearFitWithBSplineQuantileRegression-17NonRegularKnotsFunction

Follow-up

If the fitting has to be done with a certain, known model function, then the function QuantileRegressionFit can be used. QuantileRegressionFit is provided by the same package, QuantileRegression.m, used in this blog post.

This older blog post of mine discusses the use of QuantileRegressionFit and points to a PDF with further details (theory and Mathematica commands).

Classification and association rules for census income data

Introduction

In this blog post I am going to show (some) analysis of census income data — the so called “Adult” data set, [1] — using three types of algorithms: decision tree classification, naive Bayesian classification, and association rules learning. Mathematica packages for all three algorithms can be found at the project MathematicaForPrediction hosted at GitHub, [2,3,4].

(The census income data set is also used in the description of the R package “arules”, [7].)

In the census data every record represents a person with 14 attributes; the last element of a record is one of the labels “>50K” or “<=50K”. The relationships between the categorical variables in that data set were described in my previous blog post, “Mosaic plots for data visualization”.

For this data the questions I am most interested in are:
Question 1: Which of the variables (age, occupation, sex, etc.) are most decisive for determining the income of a person?
Question 2: Which values of which variables form conditions that imply high or low income? (I.e. “>50K” or “<=50K”.)
Question 3: What conclusions or confirmations can we get from answering the previous two questions?

One way to answer Question 1 is to use the following steps, [8].
1. Build a classifier with the training set.
2. Verify using the test set that good classification results are obtained.
3. If the number of variables (attributes) is k, then for each i, 1<=i<=k:
3.1. Shuffle the values of the i-th column of the test data and find the classification success rates.
4. Compare the obtained k classification success rates with each other and with the success rates obtained with the un-shuffled test data.
5. The variables for which the classification success rates are the worst are the most decisive.
(A minimal code sketch of this shuffling procedure is given below.)
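Here is a minimal sketch of steps 3 to 5 in Mathematica. The names BuildClassifier, classifySuccessRate, trainData, and testData are hypothetical placeholders for whichever classifier ([2] or [3]) and data split is actually used.

(* Hypothetical placeholders: BuildClassifier and classifySuccessRate stand for the
   classifier-building and evaluation functions of [2] or [3]. *)
model = BuildClassifier[trainData];
baseRate = classifySuccessRate[model, testData];

(* shuffle the values of the i-th column of the test data and re-evaluate *)
shuffleColumn[data_, i_] := Module[{d = data}, d[[All, i]] = RandomSample[d[[All, i]]]; d];

shuffledRates =
  Table[classifySuccessRate[model, shuffleColumn[testData, i]],
   {i, 1, Length[testData[[1]]] - 1}];  (* the last column holds the label *)

(* the columns with the largest drop from baseRate are the most decisive *)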

Following these steps with a decision tree classifier, [2], I found that “marital-status” and “education-num” (years of education) are most decisive to give good prediction for the “>50K” label. Using a naive Bayesian classifier, [3], the most significant variables are “marital-status” and “relationship”. (More details are given in the sections “Application of decision trees” and “Application of naive Bayesian classifier”.)

One way to answer Question 2 is to find which values of the variables (e.g. “Wife”, “Peru”, “HS-grad”, “Exec-managerial”) associate most frequently with “>50K” and “<=50K” respectively, and to apply different Bayesian probability statistics on them. This is what the application of association rules learning gives, [9]. Another way is to use mosaic plots, [5,9], and prefix trees (also known as “tries”), [6,11,12].

In order to apply association rules learning we need to make the numerical variables categorical: we need to partition them into non-overlapping intervals. (This derived, “all categorical” data is also suitable as input for mosaic plots and prefix trees.)

Data set

The data set can be found and taken from http://archive.ics.uci.edu/ml/datasets/Census+Income, [1].

(This data set was also used in my previous blog post “Mosaic plots for data visualization”.)

The description of the data set is given in the file “adult.names” of the data folder. The data folder provides two sets with the same type of data “adult.data” and “adult.test”; the former is used for training, the latter for testing.

The total number of records in the file “adult.data” is 32561; the total number of records in the file “adult.test” is 16281.

Here is what the data looks like:
Adult census income data sample

Since I did not understand the meaning of the column “fnlwgt” I dropped it from the data.

Here is a summary of the data:
Adult census income data summary

Note that only 24% of the labels are “>50K” and that 2/3 of the records are for males.

Scatter plots and mosaic plots

Often scatter plots and mosaic plots can give a good idea of the general patterns that hold in the data. This sub-section has a couple of examples, but presenting extensive plots is beyond the scope of this blog post. Let me point out that it might be very beneficial to use these kinds of plots with Mathematica's dynamic features (like Manipulate and Tooltip), or to make a grid of mosaic plots.

Mosaic plots of the categorical variables of the data can be seen in my previous blog post “Mosaic plots for data visualization”.

Here is a table of the histograms for “age”, “education-num”, and “hours-per-week”:
adult-data-scatter-plots-age-education-num-hours-per-week

Here is a table with scatter plots for all numerical variables of the data:
adult-data-scatter-plots-age-education-num-capital-gain-capital-loss-hours-per-week

Application of decision trees

Building decision trees and classifying with them is straightforward. Since the label “>50K” appears in only a quarter of the records, I consider the classification success rates for “>50K” to be more important.

adult-data-Decision-tree-classification-success-rates

I experimented with several sets of parameters for decision tree building. I did not get a classification success rate for “>50K” better than 0.644. Using pruning based on the Minimal Description Length (MDL) principle did not give better results. (I have to say I find MDL pruning to be an elegant idea, but I am not convinced that it works that well. I believe decision tree pruning based on test data would produce much better results. Only the MDL decision tree pruning is implemented in [2].)

The overall classification success rate is in line with the classification success ratios listed in the explanation of the data set; see the file “adult.names” in [1].

Here is a table with the results of the column shuffling experiments described in the introduction (in red is the name of the data column shuffled):
adult-data-Decision-tree-classification-shuffled-success-rates-table

Here is a plot of the “>50K” success rates from the table above:
adult-data-Decision-tree-classification-shuffled-success-rates-plot

We can see from the table and the plot that variables “marital-status”, “education-num”, “capital-gain”, “age”, and “occupation” are very decisive when it comes to determining high income. The variable “marital-status” is significantly more decisive than the others.

While considering the decisiveness of the variable “marital-status” we can pose the following questions:
1. Do people find higher paying jobs after they get married?
2. Are people with high paying jobs more likely to marry and stay married?

Both questions are probably answered with “Yes”, and that is probably why “marital-status” is so decisive. It is hard to give quantified answers to these questions just using decision trees on this data; we would need to know the salary and marital status history of the individuals (different data) or be able to infer it (a different algorithm).

We can see the decisiveness of “age”, “education-num”, “occupation”, and “hours-per-week” as natural. Of course, one is expected to receive higher pay if one has studied longer, has a high paying occupation, is older (more experienced), and works more hours per week. Note that this statement explicitly assumes the direction of the correlation: longer years of study bring higher pay. It is certainly a good idea to consider the alternative direction of the correlation, that people first get high paying jobs and that these high paying jobs allow them to grow older and study longer.

Application of naive Bayesian classifiers

The naive Bayesian classifier, [3], produced better classification results than the decision trees for the label “>50K”:
adult-data-NBC-classification-success-rates

Here is a table with the results of the column shuffling experiments described in the introduction (in red is the name of the data column shuffled):
adult-data-NBC-classification-shuffled-success-rates-table

Here is a plot of the “>50K” success rates from the table above:
adult-data-NBC-classification-shuffled-success-rates-plot

In comparison with the decision tree variable importance experiments we can notice that:
1. “marital-status” is very decisive and it is the second most decisive variable;
2. the most decisive variable is “relationship” but it correlates with “marital-status”;
3. “age”, “occupation”, “hours-per-week”, “capital-gain”, and “sex” are decisive.

Shuffled classification rates plots comparison

Here are the two shuffled classification rates plots stacked together for easier comparison:
adult-data-Decision-tree-and-NBC-classification-shuffled-success-rates-plots

Data modification

In order to apply the association rules finding algorithm Apriori, [4], the data set has to be modified. The modification is to change the numerical variables “age”, “education-num”, and “hours-per-week” into categorical ones. I just partitioned them into non-overlapping intervals, labeled the intervals, and assigned the labels according to the variable values. Here is the summary of the modified data for just these variables:
adault-data-numerical-to-categorical-columns-summary
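Here is a minimal sketch of that kind of partitioning for one column; the interval boundaries and labels are made up for illustration, and ageColumn stands for the “age” column of the data.

(* made-up interval boundaries and labels for the "age" column *)
ageBoundaries = {0, 25, 35, 45, 55, 65};
ageLabels = {"age:0-25", "age:25-35", "age:35-45", "age:45-55", "age:55-65", "age:65+"};

(* map a numerical age to the label of the last boundary it reaches *)
categorizeAge[x_?NumberQ] := ageLabels[[Max[1, LengthWhile[ageBoundaries, # <= x &]]]];

categorizedAgeColumn = categorizeAge /@ ageColumn;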

Finding association rules

Using the modified data I found a large number of association rules with the Apriori algorithm, [4]. I used the measure called “confidence” to extract the most significant rules. The confidence of an association rule A⇒C with antecedent A and consequent C is defined to be the ratio P(A∩C)/P(A), i.e. the fraction of the records containing A that also contain C. The higher the ratio, the more confidence we have in the rule. (If the ratio is 1 we have a logical rule, A⇒C.)
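As a quick illustration, the confidence of a single rule can be computed directly from the categorical records like this (the rule and the name categoricalData are made up; the Apriori implementation in [4] computes such measures for all frequent rules at once):

(* confidence of the rule antecedent => consequent over a list of categorical records *)
ruleConfidence[records_, antecedent_List, consequent_List] :=
  Module[{withA = Select[records, SubsetQ[#, antecedent] &]},
   N[Count[withA, r_ /; SubsetQ[r, consequent]]/Length[withA]]];

(* made-up example *)
ruleConfidence[categoricalData, {"Husband", "Bachelors"}, {">50K"}]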

Here is a table showing the rules with highest confidence for the consequent being “>50K”:
adult-data-association-rules-more-than-50K

From the table we can see, for example, that 2.1% of the data records (or 693 records) show that for a married man who has studied 14 years and is originally from the USA there is a 0.79 probability that he earns more than $50000.

Here is a table showing the rules with highest confidence for the consequent being “<=50K”:
adult-data-association-rules-less-than-50K

The association rules in these tables confirm the findings with the classifiers: marital status, age, and education are good predictors of the income labels “>50K” and “<=50K”.

Conclusion

The analysis confirmed (and quantified) what is considered common sense:

Age, education, occupation, and marital status (or relationship kind) are good for predicting income (above a certain threshold).

Using the association rules we see, for example, that:
(1) if a person earns more than $50000 he is very likely to be a married man with a large number of years of education;
(2) single parents, younger than 25 years, who studied less than 10 years, and who were never married make less than $50000.

References

[1] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Census Income Data Set, URL: http://archive.ics.uci.edu/ml/datasets/Census+Income .

[2] Antonov, A., Decision tree and random forest implementations in Mathematica, source code at https://github.com/antononcube/MathematicaForPrediction, package AVCDecisionTreeForest.m, (2013).

[3] Antonov, A., Implementation of naive Bayesian classifier generation in Mathematica, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package NaiveBayesianClassifier.m, (2013).

[4] Antonov, A., Implementation of the Apriori algorithm in Mathematica, source code at https://github.com/antononcube/MathematicaForPrediction, package AprioriAlgorithm.m, (2013).

[5] Antonov, A., Mosaic plot for data visualization implementation in Mathematica, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package MosaicPlot.m, (2014).

[6] Antonov, A., Tries with frequencies Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package TriesWithFrequencies.m, (2013).

[7] Hahsler, M. et al., Introduction to arules – A computational environment for mining association rules and frequent item sets, (2012).

[8] Breiman, L. et al., Classification and regression trees, Chapman & Hall, 1984.

[9] Wikipedia, Association rule learning, http://en.wikipedia.org/wiki/Association_rule_learning .

[10] Antonov, A., Mosaic plots for data visualization, (March, 2014), URL: https://github.com/antononcube/MathematicaForPrediction/blob/master/Documentation/Mosaic%20plots%20for%20data%20visualization.pdf .

[11] Wikipedia, Trie, http://en.wikipedia.org/wiki/Trie .

[12] Antonov, A., Tries, (December, 2013), URL: https://github.com/antononcube/MathematicaForPrediction/blob/master/Documentation/Tries.pdf .

Enhancements of MosaicPlot

I made the following enhancements to the function MosaicPlot, which I described (and announced the implementation of) in my previous blog post:

1. Tooltips with precise contingency statistics.
2. If the last data column is numerical then MosaicPlot can use it as pre-computed contingency statistics.
3. Coloring of the rectangles according to a list of index->color rules.

The document “Mosaic plots for data visualization”, hosted at MathematicaForPrediction at GitHub, combines the information of this blog post and the previous one. The document also has Mathematica code examples of usage and a description of MosaicPlot’s options.

Tooltips with precise contingency statistics

I already announced the tooltip functionality in my previous blog post: when hovering with the mouse over the rectangles, MosaicPlot uses Tooltip to give a table with the exact co-occurrence (contingency) values. Here is an example:
Adult census income data sex-education-income colored mosaic plot with tooltips

Visualizing categorical columns + a numerical column

If the last data column is numerical then MosaicPlot can use it as pre-computed contingency statistics. This functionality is specified with the option “ExpandLastColumn”->True.

In order to explain the functionality we are going to use the following interpretation. If the last column of the data is numerical, we can treat the data as a contracted version of a longer list of records made only of the categorical columns. For example, consider the following table with observations of people’s hair and eye color:
Hair and eyes color number of observations

The table above can be considered as a contracted version of this table:
Hair and eyes color observations

Setting the option “ExpandLastColumn” to True gives a mosaic plot corresponding to that latter, observations-expanded table:
Hair and eyes color mosaic plot
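Here is a minimal sketch of such a call; the hair and eye color counts below are made up for illustration and are not the observations from the tables above.

(* made-up records: {hair color, eye color, number of observations} *)
hairEyeData = {{"Black", "Brown", 30}, {"Black", "Blue", 9},
   {"Blond", "Brown", 5}, {"Blond", "Blue", 32},
   {"Red", "Brown", 10}, {"Red", "Blue", 7}};

(* the last (numerical) column is used as pre-computed contingency statistics *)
MosaicPlot[hairEyeData, "ExpandLastColumn" -> True]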

The last data column (which is numerical) does not need to be made of integers:
Hair and eyes color mosaic plot Mathematica code

Rectangle coloring

The rectangles can be colored using the option ColorRules which specifies how the colors of the rectangles are determined from the indices of the data columns.

More precisely, the value of the option ColorRules should be a list of rules, {i1->c1,i2->c2,…}, matching the form

{(_Integer->(_RGBColor|_GrayLevel))..} .

If coloring is specified for only one column index, the value of ColorRules can be of the form

{_Integer->{(_RGBColor|_GrayLevel)..}} .

The colors are used with Blend in order to color the rectangles according to the order of the unique values of the specified data columns.

The default value of ColorRules is Automatic. When Automatic is given to ColorRules, MosaicPlot finds the data column with the largest number of unique values and colors its rectangles according to the order of those values using ColorData[7,"ColorList"].

The grid of plots below shows mosaic plots of the same data with different values for the option ColorRules (given as plot labels).
Grid of mosaic plots for ColorRules values
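For example, a call that colors the rectangles according to the values of the second data column might look like this; the column indices are taken from the census data call used elsewhere in these posts and are an assumption about which columns are of interest.

(* color the rectangles of the second column's values by blending the given color list *)
MosaicPlot[censusData[[All, {9, 3}]],
 ColorRules -> {2 -> ColorData[7, "ColorList"]}, ImageSize -> 600]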

Mosaic plots for data visualization

Introduction

This blog post has a description and examples of using the function MosaicPlot of the Mathematica package MosaicPlot.m provided by the project MathematicaForPrediction at GitHub. (Also see the document “Mosaic plots for data visualization” hosted at MathematicaForPrediction at GitHub. The document also has Mathematica code examples of usage and a description of MosaicPlot’s options.)

The function MosaicPlot summarizes the conditional probabilities of co-occurrence of the categorical values in a list of records of the same length. The list of records is assumed to be a full array, and the columns are assumed to represent categorical values. (Note that if a column is numerical but has a small number of different values it can be seen as categorical.)
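As a minimal illustration of the expected input format, here is a call on a small made-up list of records (the values are invented; the actual plots in this post are made from the census data described below):

(* a full array of categorical records: {sex, relationship} *)
tinyData = {{"Male", "Husband"}, {"Male", "Not-in-family"}, {"Female", "Wife"},
   {"Female", "Not-in-family"}, {"Male", "Husband"}, {"Female", "Unmarried"}};

MosaicPlot[tinyData]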

I have read the descriptions of mosaic plots in the book “R in Action” by Robert Kabacoff and in one of the references provided in the book (“What is a mosaic plot?” by Steve Simon). I was impressed by how informative mosaic plots are, and I figured they can be relatively easily implemented using prefix trees (also known as “tries”). I implemented MosaicPlot while working on a document analyzing the census income data from 1998, [6]. This is the reason that data set is used in this blog post. A good alternative data set, provided by ExampleData, is {"Statistics", "USCars1993"}.

Data set

The data set can be found and taken from http://archive.ics.uci.edu/ml/datasets/Census+Income.

The description of the data set is given in the file “adult.names” of the data folder. The data folder provides two sets with the same type of data “adult.data” and “adult.test”; the former is used for training, the latter for testing.

The total number of records in the file “adult.data” is 32561; the total number of records in the file “adult.test” is 16281.

Here is what the data looks like:
Adult census income data sample

Since I did not understand the meaning of the column “fnlwgt” I dropped it from the data.

Here is the summary table of the data:
Adult census income data summary

In the summary table the numerical variables are described with min, max, and quartiles. The categorical variables are described with the tallies of their values. The tallies are ordered in decreasing order. The tallies of truncated values are summed under the value “(Other)”.

Note that:
– only 24% of the labels are “>50K”;
– 2/3 of the records are for males;
– “capital-gain” and “capital-loss” are very skewed.

Mosaic plot explanations

If we pick a categorical variable, say “sex”, we can visualize the frequencies of the appearance of the variable values with the following plot:
Adult census income data sex mosaic plot

The size of the rectangles depends on the frequencies of appearance of the values “Male” and “Female” in the data records. From the rectangle sizes we can see what we already knew from the data summary table: approximately 2/3 of the records are about males.

We can subdivide every rectangle r according to the frequencies of co-occurrence of r’s value with the values of a second categorical variable, say “relationship”:
Adult census income data sex-relationship mosaic plot

The labels corresponding to the values of “relationship” are rotated for legibility. The “relationship” labels are placed according to the co-occurrence with the value “Male” of the variable “sex”. The corresponding fractions of the pairs (“Female”,”Husband”), (“Female”,”Not-in-family”), etc., are deduced from the order of the “relationship” labels.

Using colored mosaic plots can help distinguish which rectangles correspond to which values. Here is the last plot with rectangles colored across the “relationship” data variable:
Adult census income data sex-relationship colored mosaic plot

From the visual representation of the “sex vs. relationship” mosaic plot we can see that a large fraction of the males are husbands and none (or a very small fraction) of them are wives. We can also see that none (or a very small fraction) of the females are husbands, that the largest fraction of them are “Not-in-family”, and that they are approximately three times more numerous than the females that are wives.

Let us make another mosaic plot for a different pair of variables, “sex vs. education”:
Adult census income data sex-education colored mosaic plot

By comparing the sizes of the rectangles corresponding to the values “Bachelors”, “Doctorate”, “Masters”, and “Some-college” on the “sex vs. education” mosaic plot we can see that the fraction of men that have finished college is larger than the fraction of women that have finished college.

We can further subdivide the rectangles according to the co-occurrence frequencies with a third categorical variable. We are going to choose that third variable to be “income”, the values of which can be seen as outcomes or consequents of the values of the first two variables of the mosaic plot.
Adult census income data sex-education-income colored mosaic plot

From the mosaic plot “sex vs. education vs. income” we can make the following observations.
1. Approximately 75% of the males with doctorate degrees or with a professional school degree earn more than $50000 per year.
2. Approximately 60% of the females with a doctorate degree earn more than $50000 per year.
3. Approximately 45% of the females with a professional school degree earn more than $50000.
4. Across all education types, females are (much) less likely to earn more than $50000 per year.

Although I mentioned earlier that the “outcome” variable should be the last variable in the mosaic plot, it is also useful to start with the outcome variable to get an attribute breakdown perspective (using a different color scheme):
Adult census income data income-relationship-sex colored mosaic plot

Signature of MosaicPlot

MosaicPlot takes various options for tweaking the labels placement and style. Here is the Mathematica command:

MosaicPlot[censusData[[All, {9, 3, 5, 14}]], "Gap" -> 0.014,
"ColumnNamesOffset" -> 0.07,
"ColumnNames" ->
Map[Style[#, Blue, FontSize -> 15] &, columnNames[[{9, 3, 5, 14}]]],
"LabelRotation" -> {{3, 1}, {1, 1}}, ImageSize -> 900]

with which the following mosaic plot was made:
Adult census income data sex-education-maritalStatus-income mosaic plot colored

The option “Gap” is used to regulate the gaps between the rectangles. The options “ColumnNames” and “ColumnNamesOffset” are for the specification of the variable names (in blue in the plot). The option “LabelRotation” specifies the rotation of the labels that correspond to the individual values of the variables. Also, MosaicPlot takes all the options of Graphics (since it is based on it).

Tooltip tables

The function MosaicPlot has an interactive feature using Tooltip that gives a table with the exact co-occurrence (contingency) values when hovering with the mouse over the rectangles. Here is an example:
Adult census income data sex-education-income colored mosaic plot with tooltips

Future plans

The current implementation of MosaicPlot uses coloring of the rectangles for easier plot reading. An alternative is to use coloring based on correlation statistics. I think, though, that the tooltip contingency tables together with the flexible coloring specification make correlation-based coloring less necessary.

Natural language processing with functional parsers

Natural Language Processing (NLP) can be done with a structural approach using grammar rules. (The other type of NLP uses statistical methods.) In this post I discuss the use of functional parsers for the parsing and interpretation of small sets of natural language sentences within specific contexts. Functional parsers are also known as monadic parsers or parser combinators.

Generally, I am using functional parsers to make Domain-Specific Languages (DSLs). I use DSLs to make command interfaces to search and recommendation engines and also to design and prototype conversational engines. I extensively use the so-called Backus-Naur Form (BNF) for the grammar specifications. Clearly a DSL can be very close to natural language and provide sufficient means for interpretation within a given (narrow) context. (Like function integration in calculus, smart phone directory browsing and selection, or search for something to eat nearby.)

I implemented and uploaded a package for construction of functional parsers: see FunctionalParsers.m hosted by the project MathematicaForPrediction at GitHub.

The package provides the ability to quickly program parsers using a core system of functional parsers, as described in the article “Functional parsers” by Jeroen Fokker.

The parsers (in both the package and the article) are categorized in the groups: basic, combinators, and transformers. Immediate interpretation can be done with transformer parsers, but the package also provides functions for evaluation of parser output within a context of data and functions.
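To give a flavor of the approach (a simplified illustration, not the actual API of FunctionalParsers.m): a parser can be modeled as a function that takes a list of tokens and returns a list of {remaining tokens, parsed value} pairs, and combinators build larger parsers from smaller ones.

(* a parser: tokens -> list of {remaining tokens, parsed value} pairs *)
parseSymbol[s_String][tokens_List] :=
  If[tokens =!= {} && First[tokens] === s, {{Rest[tokens], s}}, {}];

(* combinator: sequential composition, run p, then q on what is left *)
parseSeq[p_, q_][tokens_List] :=
  Flatten[Map[Function[{r1}, Map[{#[[1]], {r1[[2]], #[[2]]}} &, q[r1[[1]]]]], p[tokens]], 1];

(* combinator: alternative composition, results of either parser *)
parseAlt[p_, q_][tokens_List] := Join[p[tokens], q[tokens]];

(* transformer: apply a function to the parsed value *)
parseApply[f_, p_][tokens_List] := Map[{#[[1]], f[#[[2]]]} &, p[tokens]];

(* example: parse "integrate" followed by "sin" or "cos" *)
pInt = parseSeq[parseSymbol["integrate"], parseAlt[parseSymbol["sin"], parseSymbol["cos"]]];
pInt[{"integrate", "cos", "dx"}]
(* => {{{"dx"}, {"integrate", "cos"}}} *)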

Probably most importantly, the package provides functions for automatic generation of parsers from grammars in EBNF.

Here is an example of parsing the sentences of an integration requests language:

Interpretation of integration requests

Here is the grammar:
Integration requests EBNF grammar

The grammar can be visualized with this mind map:

Integration command

The mind map was hand made with MindNode Pro. Generally, the branches represent alternatives, but if two branches are connected the direction of the arrow connecting them shows a sequence combination.

With the FunctionalParsers.m package we can automatically generate a mind map by parsing the string of the grammar in EBNF to OPML:

IntegrationRequestsGenerated

(And here is a PDF of the automatically generated mind map: IntegrationRequestsGenerated . )

I also made a slide show that gives an introduction to how the package is used: “Functional parsers for an integration requests language grammar”.

A more complicated example is this conversational engine for manipulation of time series data. (Data loading, finding outliers and trends. More details in the next blog post.)

Classification of genome data with n-gram models

In this blog post we consider the following problem:

Gene Sequence Classification Problem (GSCP): Given two genes, G1 and G2, and a (relatively short) sub-sequence S taken from one of them, tell which gene the sub-sequence S comes from.

One way to derive a solution for this problem is to model each gene with an n-gram model and to classify the sub-sequences according to an odds ratio test based on the Markov chain state transition probabilities. We are going to make classification experiments for different n's of the n-gram model and use a type of Receiver Operating Characteristic (ROC) plot to judge which n is the best.

This approach can also be applied to text classification. I consider genes here for conciseness and clarity.

The full article with Mathematica code for the experiments described in the post can be downloaded from here.

The computations for the plots shown below were done with the Mathematica package for n-gram Markov chain models NGramMarkovChains.m provided by the MathematicaForPrediction project at GitHub.

Data and models

Mathematica has the function GenomeData that gives the DNA sequences for specified genes.

I made classification experiments for the following pairs of genes:
1. F2 and ETS1
Genes F2 and ETS1

2. F2 and PAH
Genes F2 and PAH

The classification is done for sequences like this:

In[812]:= StringTake[GenomeData["F2", "FullSequence"], {11200, 11200 + 100}]
Out[812]= "AGCAGGCATATCTAGGGGAGGAGCACCGCAGGCTGGGGGCATGGCAGGCACTAAGGCCCTGAGGTGGGA\
GCACTCTTGGCTTGTCTGGGGAGCAGTAGGGA"

Approach

Let us provide more details for the approach outlined at the beginning of the post.
First we define the classification algorithm NGramClassifier(k):
NGramClassifier(k)
(I was too lazy to type the algorithm in here so I used a screenshot of the description I wrote in Mathematica.)

In order to find the best k for NGramClassifier(k) we make the following sets of experiments (a minimal code sketch of the classification step follows the list):

1. For each k, 1<=k<=10 calculate k-gram models for G1 and G2. For these models use only the first 80% of the gene sequences. (In other words, for each Gi the training set is only 80% of Gi’s length.)

2. Using the last 20% of G1 and G2 derive a set of test sequences T by randomly picking Gi and the sub-sequence length within the range {minLen, maxLen}.

3. For each k, 1<=k<=10 calculate the classifications of the elements of T using NGramClassifier(k) with the models calculated in step 1.

4. Calculate the True Positive Rate (TPR) and False Positive Rate (FPR) for each k in step 3.

5. Plot the points with coordinates {FPR(k), TPR(k)}, 1<=k<=10, and select the best k.
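Here is a minimal sketch of the classification step under the simplifying assumption that each gene model is just a table of k-gram frequencies with add-one smoothing; the function names are illustrative, and this is not the implementation in NGramMarkovChains.m.

(* k-gram frequency model of a DNA string,
   e.g. modelF2 = kGramModel[StringTake[GenomeData["F2", "FullSequence"], {1, 10000}], 3] *)
kGramModel[seq_String, k_Integer] :=
  Association[Rule @@@ Tally[Partition[Characters[seq], k, 1]]];

(* log-likelihood of a test sub-sequence under a model, with add-one smoothing *)
logLikelihood[model_Association, seq_String, k_Integer] :=
  Total[Log[(Lookup[model, Key[#], 0] + 1.)/(Total[model] + Length[model])] & /@
    Partition[Characters[seq], k, 1]];

(* classify S as "G1" or "G2" by the log odds ratio of the two models *)
classify[modelG1_, modelG2_, s_String, k_Integer] :=
  If[logLikelihood[modelG1, s, k] > logLikelihood[modelG2, s, k], "G1", "G2"];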

ROC

With the classification results of the tuning experiments we make the following ROC point plots
1. F2 vs. ETS1
ROCs F2 vs ETS1

2. F2 vs. PAH
ROCs F2 vs PAH

Conclusions and extensions

From the ROC plots we can see that the best classification results for GSCP with the genes “F2” and “ETS1” are obtained with Markov chain order 3 (or with 4-grams). Using order 2 (or 3-grams) is almost as good.

The experiments for the genes “F2” and “PAH” showed that using 3-grams gives the best results for GSCP.

A natural extension of the experiments described is to repeat them for other pairs of genes and across different lengths of sub-sequences. In this way we can derive more general conclusions for the best n-gram length in the algorithm NGramClassifier.

Markov chains n-gram model implementation

A year or so ago, while reading the book “Programming in Lua”, I reached the chapter describing an implementation of a random text generation algorithm using Markov chains and n-grams. I had implemented a similar algorithm 10 years ago. My first encounter with this algorithm was when I was a teenager: I read a description of it in Martin Gardner’s column in the Russian translation of “Scientific American”. I productized the implementation into a package (NGramMarkovChains.m) that is now available at the MathematicaForPrediction project at GitHub.

Implementing this algorithm in Mathematica turned out to be surprisingly easy using multi-dimensional sparse arrays (SparseArray) and weighted random sampling (RandomSample). The code I wrote can be run with n-grams of any n; the implementation in the book “Programming in Lua”, 2nd ed., is for 2-grams; see Chapter 10.2 of the book.
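Here is a minimal sketch of the 2-gram case to show the idea (a simplified illustration, not the NGramMarkovChains.m implementation; sampling a successor uniformly from the multiset of observed successors has the same effect as weighted sampling over the distinct successors):

(* build 2-gram successor lists from the text of "The Origin of Species" *)
words = StringSplit[ToLowerCase[ExampleData[{"Text", "OriginOfSpecies"}]]];
successors = GroupBy[Partition[words, 2, 1], First -> Last];

(* generate text by repeatedly picking a random successor of the current word *)
generateText[start_String, n_Integer] :=
  StringRiffle[NestList[RandomChoice[Lookup[successors, #, {start}]] &, start, n]];

generateText["species", 40]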

Below are given snippets of texts generated by the algorithm using Darwin’s “The Origin of Species.” (Readily available within Mathematica.) From them we can see that the 5-gram generated text makes more sense than the 2-gram one. All 4 randomly generated texts start from the same place in the book.

Different variations of the generated texts can be derived by using the word count argument and the option WordSeparators. Note that ToLowerCase was applied to the text.

Markov chains n-grams over Origin of species

Here is one more example with Shakespeare’s play “Hamlet”:
Markov chains n-grams over Hamlet

The follow up of this post is my next post: “Classification of genome data with n-gram models”.