Removal of sub-trees in tries

I recently updated the package TriesWithFrequencies.m with two new functions TrieRemoval and TrieNodeFrequencies. The function TrieNodeFrequencies reverses the outcomes of TrieNodeProbabilities; it was made to facilitate the implementation and usage of TrieRemoval.

The removal of a sub-tree from a trie is useful when we want to alternate the learning stored in the trie.

The new functions are illustrated with examples below. (These examples reuse and extend the ones in the blog post “Tries with frequencies for data mining”.)
Consider the trie made with the list of words

{"war", "ward", "warden", "work", "car", "cars", "carbs"}



The function TrieNodeFrequencies can be used to reverse the effect of TrieNodeProbabilities. Its first argument is a trie, the second one is a scaling parameter (which is 1 by default). Here is an example:

TrieForm[TrieNodeFrequencies[TrieNodeProbabilities[mytrie], 14]]

The function TrieRemove can be used to remove sub-parts of a trie that correspond to a specified “word”. Here is an example using a trie with frequencies:

TrieForm[TrieRemove[mytrie, "car"]]


Here is an example using a trie with probabilities:

TrieForm[TrieRemove[TrieNodeProbabilities[mytrie], "car"]]


TrieRemove has two options “HowManyTimes” and “FrequencyValues” with default values Automatic. The option “HowManyTimes” takes a number or Automatic and it is still under development (its functionality is partially correct). The option “FrequencyValues” takes True | False | Automatic and it is used for the interpretation of the sub-tree removal. If the values are frequencies the removal process is more straightforward. If the values are probabilities the removal process has to take into account that the values on the same level of the sub-tree are conditional probabilities that sum to 1. TrieRemove has two different implementations one for tries with frequencies and one for tries with probabilities. The option setting "FrequencyValues"->True|False is used to redirect to the correct implementation. When "FrequencyValues"->Automatic is given a simple heuristic is used to determine is the trie argument a trie with frequencies or a trie with probabilities.

Find Fit for Non-linear data

Three weeks ago I replied to a question in the Wolfram Community site about finding a function fit to points of non-linear data.

Here is the data:

Non-linear data

I proposed a different way of doing the non-linear fitting using Quantile regression with B-splines. The advantages of the approach are that it does not need guessing of the fit functions and it requires less experimentation.

Load the package QuantileRegression.m:


After several experiments with the number of knots we can find a regression quantile using 3d order B-splines of 18 knots that approximates the data well.

qfunc = Simplify[QuantileRegression[data, 18, {0.5}, InterpolationOrder -> 2][[1]]];

qfGr = Plot[qfunc[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}, PlotPoints -> 700];
Show[{dGr, qfGr}, Frame -> True, ImageSize -> 700]


Here is the relative error estimate:

ListPlot[Map[{#[[1]], (qfunc[#1[[1]]] - #1[[2]])/#1[[2]]} &, data],
Filling -> Axis, Frame -> True, ImageSize -> 700]


I would like to point out that if we put large number of knots we can get aliasing errors effect:

qfunc = Simplify[
QuantileRegression[data, 40, {0.5}, InterpolationOrder -> 2][[1]]];

qfGr = Plot[qfunc[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}, PlotPoints -> 700];
Show[{dGr, qfGr}, Frame -> True, ImageSize -> 700]


If we select a non-uniform grid of knots according to gradients of the data we get a good approximation with one attempt.

In[107]:= knots =
Rescale[Range[0, 1, 0.2], {0, 1}, {Min[data[[All, 1]]],
Max[data[[All, 1]]] 3/4}],
Rescale[Range[0, 1, 0.1], {0, 1}, {Max[data[[All, 1]]] 3/4,
Max[data[[All, 1]]]}]

Out[107]= {0., 3.75, 7.5, 11.25, 15., 18.75, 18.75, 19.375, 20., 20.625, 21.25, 21.875, 22.5, 23.125, 23.75, 24.375, 25.}

In[108]:= qfunc = QuantileRegression[data, knots, {0.5}][[1]];

The grid lines in the plot below are made with the knots.

qfGr = Plot[qfunc[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}];
Show[{dGr, qfGr}, GridLines -> {knots, None},
GridLinesStyle -> Directive[GrayLevel[0.8], Dashed], Frame -> True,
ImageSize -> 700]


Here are the relative errors:

ListPlot[Map[{#[[1]], (qfunc[#1[[1]]] - #1[[2]])/#1[[2]]} &, data],
Filling -> Axis, Frame -> True, ImageSize -> 700]


Here is the approximation function after using PiecewiseExpand and Simplify:

qfunc = Evaluate[Simplify[PiecewiseExpand[qfunc[[1]]]]] &;



If it is desired to make the fitting with a certain, known model function then the function QuantileRegressionFit can be used. QuantileRegressionFit is provided by the same package, QuantileRegression.m, used in the blog post.

This older blog post of mine discusses the use of QuantileRegressionFit and points to a PDF with further details (theory and Mathematica commands).

Classification and association rules for census income data


In this blog post I am going to show (some) analysis of census income data — the so called “Adult” data set, [1] — using three types of algorithms: decision tree classification, naive Bayesian classification, and association rules learning. Mathematica packages for all three algorithms can be found at the project MathematicaForPrediction hosted at GitHub, [2,3,4].

(The census income data set is also used in the description of the R package “arules”, [7].)

In the census data every record represents a person with 14 attributes, the last element of a record is one of the labels {“>=50K”,”<50K"}. The relationships between the categorical variables in that data set was described in my previous blog post, "Mosaic plots for data visualization".

For this data the questions I am most interested in are:
Question 1: Which of the variables (age, occupation, sex, etc.) are most decisive for determining the income of a person?
Question 2: Which values for which variables form conditions that would imply high income or low income? (I.e. ">50K" or "<=50K".)
Question 3: What conclusions or confirmations we can get from answering the previous two questions?

One way to answer Question 1 is to use following steps, [8].
1. Build a classifier with the training set.
2. Verify using the test set that good classification results are obtained.
3. If the number of variables (attributes) is k for each i, 1<=i<=k :
3.1. Shuffle the values of the i-th column of the test data and find the classification success rates.
4. Compare the obtained k classification success rates between each other and with the success rates obtained by the un-shuffled test data.
5. The variables for which the classification success rates are the worst are the most decisive.

Following these steps with a decision tree classifier, [2], I found that “marital-status” and “education-num” (years of education) are most decisive to give good prediction for the “>50K” label. Using a naive Bayesian classifier, [3], the most significant variables are “marital-status” and “relationship”. (More details are given in the sections “Application of decision trees” and “Application of naive Bayesian classifier”.)

One way to answer Question 2 is to find which values of the variables (e.g. “Wife”, “Peru”, “HS-grad”, “Exec-managerial”) associate most frequently with “>50K” and “<=50K" respectively and apply different Bayesian probability statistics on them. This is what the application of Associative rules learning gives, [9]. Another way is to use mosaic plots, [5,9], and prefix trees (also known as "tries") [6,11,12].

In order to apply Association rule learning we need to make the numerical variables categorical — we need to partition them into non-overlapping intervals. (This derived, "all categorical" data is also amenable to be input data for mosaic plots and prefix trees.)

Data set

The data set can be found and taken from, [1].

(This data set was also used in my previous blog post “Mosaic plots for data visualization”.)

The description of the data set is given in the file “adult.names” of the data folder. The data folder provides two sets with the same type of data “” and “adult.test”; the former is used for training, the latter for testing.

The total number of records in the file “” is 32561; the total number of records in the file “adult.test” is 16281.

Here is how the data looks like:
Adult census income data sample

Since I did not understand the meaning of the column “fnlwgt” I dropped it from the data.

Here is a summary of the data:
Adult census income data summary

As it was mentioned in the introduction, only 24% of the labels are “>50K”. Also note that 2/3 of the records are for males.

Scatter plots and mosaic plots

Often scatter plots and mosaic plots can give a good idea of the general patterns that hold in the data. This sub-section has a couple of examples, but presenting extensive plots is beyond the scope of this blog post. Let me point out that it might be very beneficial to use these kind of plots with Mathematica‘s dynamic features (like Manipulate and Tooltip), or make a grid of mosaic plots.

Mosaic plots of the categorical variables of the data can be seen in my previous blog post “Mosaic plots for data visualization”.

Here is a table of the histograms for “age”, “education-num”, and “hours-per-week”:

Here is a table with scatter plots for all numerical variables of the data:

Application of decision trees

The building and classification with decision trees is straightforward. Since the label “>50K” is only a quarter of the records I consider the classification success rates for “>50K” to be more important.


I experimented with several sets of parameters for decision tree building. I did not get a classification success rate for “>50K” better than 0.644 . Using pruning based on the Minimal Description Length (MDL) principle did not give better results. (I have to say I find MDL pruning to be an elegant idea, but I am not convinced that it works that
well. I believe decision tree pruning based on test data would produce much better results. Only the MDL decision tree pruning is implemented in [2].)

The overall classification success rate is in line with the classification success ratios listed in explanation of the data set; see the file “adult.names” in [1].

Here is a table with the results of the column shuffling experiments described in the introduction (in red is the name of the data column shuffled):

Here is a plot of the “>50K” success rates from the table above:

We can see from the table and the plot that variables “marital-status”, “education-num”, “capital-gain”, “age”, and “occupation” are very decisive when it comes to determining high income. The variable “marital-status” is significantly more decisive than the others.

While considering the decisiveness of the variable “marital-status” we can bring the following questions:
1. Do people find higher paying jobs after they get married?
2. Are people with high paying jobs more likely to marry and stay married?

Both questions are probably answered with “Yes” and probably that is why “marital-status” is so decisive. It is hard to give quantified answers to these questions just using decision trees on this data — we would need to know the salary and marital status history of the individuals (different data) or to be able to imply it (different algorithm).

We can see the decisiveness of “age”, “education-num”, “occupation”, and “hours-per-week” as natural. Of course one is expected to receive a higher pay if he has studied longer, has a high paying occupation, is older (more experienced), and works more hours per week. Note that this statement explicitly states the direction of the correlation: we do assume that longer years of study bring higher pay. It is certainly a good idea to consider the alternative direction of the correlation, that people first get high paying jobs and that these high paying jobs allow them to get older and study longer.

Application of naive Bayesian classifiers

The naive Bayesian classifier, [3], produced better classification results than the decision trees for the label “>50K”:

Here is a table with the results of the column shuffling experiments described in the introduction (in red is the name of the data column shuffled):

Here is a plot of the “>50K” success rates from the table above:

In comparison with the decision tree importance of variables experiments we can notice that:
1. “marital-status” is very decisive and it is the second most decisive variable;
2. the most decisive variable is “relationship” but it correlates with “marital-status”;
3. “age”, “occupation”, “hours-per-week”, “capital-gain”, and “sex” are decisive.

Shuffled classification rates plots comparison

Here are the two shuffled classification rates plots stacked together for easier comparison:

Data modification

In order to apply the association rules finding algorithm Apriori, [4], the data set have to be modified. The modification is to change the numerical variables “age”, “education-num”, and “age” into categorical. I just partitioned them into non-overlapping intervals, labeled the intervals, and assigned the labels according the variable values. Here is the summary of the modified data for just these variables:

Finding association rules

Using the modified data I found a large number of association rules with the Apriori algorithm, [4]. I used the measure called “confidence” to extract the most significant rules. The confidence of an association rule AC with antecedent A and consequent C is defined to be the ratio P(AC)/P(C). The higher the ratio the more confidence we have in the rule. (If the ratio is 1 we have a logical rule, CA.)

Here is a table showing the rules with highest confidence for the consequent being “>50K”:

From the table we can see for example that 2.1% of the data records (or 693 records) show that for a married man who has studied 14 years and originally from USA there is a 0.79 probability that he earns more than $50000.

Here is a table showing the rules with highest confidence for the consequent being “<=50K":

The association rules in these tables confirm the findings with the classifiers: marital status, age, and education are good predictors of income labels “>50K” and “<=50K".


The analysis confirmed (and quantified) what is considered common sense:

Age, education, occupation, and marital status (or relationship kind) are good for predicting income (above a certain threshold).

Using the association rules we see for example that
(1) if a person earns more than $50000 he is very likely to be a married man with large number of years of education;
(2) single parents, younger than 25 years, who studied less than 10 years, and were never-married make less than $50000.


[1] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository []. Irvine, CA: University of California, School of Information and Computer Science. Census Income Data Set, URL: .

[2] Antonov, A., Decision tree and random forest implementations in Mathematica, source code at, package AVCDecisionTreeForest.m, (2013).

[3] Antonov, A., Implementation of naive Bayesian classifier generation in Mathematica, source code at GitHub,, package NaiveBayesianClassifier.m, (2013).

[4] Antonov, A., Implementation of the Apriori algorithm in Mathematica, source code at, package AprioriAlgorithm.m, (2013).

[5] Antonov, A., Mosaic plot for data visualization implementation in Mathematica, source code at GitHub,, package MosaicPlot.m, (2014).

[6] Antonov, A., Tries with frequencies Mathematica package, source code at GitHub,, package TriesWithFrequencies.m, (2013).

[7] Hahsler, M. et al., Introduction to arules \[Dash] A computational environment for mining association rules and frequent item sets, (2012).

[8] Breiman, L. et al., Classification and regression trees, Chapman & Hall, 1984.

[9] Wikipedia, Association rules learning, .

[10] Antonov, A., Mosaic plots for data visualization, (March, 2014), URL: .

[11] Wikipedia, Trie, .

[12] Antonov, A., Tries, (December, 2013), URL: .

Enhancements of MosaicPlot

I made the following enhancements of the function MosaicPlot which I described (and proclaimed the implementation of) in my previous blog post:

1. Tooltips with precise contingency statistics.
2. If the last data column is numerical then MosaicPlot can use it as pre-computed contingency statistics.
3. Coloring of the rectangles according to a list of index->color rules.

The document “Mosaic plots for data visualization” hosted at MathematicaForPrediction at GitHub, combines the information of this blog post and previous one. The document also has Mathematica code examples of usage and description of MosaicPlot‘s options.

Tooltips with precise contingency statistics

I already proclaimed in my previous blog post the tooltips functionality — when hovering with the mouse over the rectangles then MosaicPlot, using Tooltip, gives a table with the exact co-occurrence (contingency) values. Here is an example:
Adult census income data sex-education-income colored mosaic plot with tooltips

Visualizing categorical columns + a numerical column

If the last data column is numerical then MosaicPlot can use it as pre-computed contingency statistics. This functionality is specified with the option “ExpandLastColumn”->True.

In order to explain the functionality we are going to use following interpretation. If the last of column of the data is numerical then we can treat the data as a contracted version of a longer list of records made only of the categorical columns. For example, consider the following table with observations of people’s hair and eyes color:
Hair and eyes color number of observations

The table above can be considered as a contracted version of this table:
Hair and eyes color observations

Setting the option “ExpandLastColumn” to True gives a mosaic plot corresponding to that latter, observations-expanded table:
Hair and eyes color mosaic plot

The last data column (which is numerical) does not need to be made of integers:
Hair and eyes color mosaic plot Mathematica code

Rectangle coloring

The rectangles can be colored using the option ColorRules which specifies how the colors of the rectangles are determined from the indices of the data columns.

More precisely, the values of the option ColorRules should be a list of rules, {i1->c1,i2->c2,…}, matching the form

{(_Integer->(_RGBColor|_GrayLevel))..} .

If coloring for only one column index is specified the value of ColorRules can be of the form

{_Integer->{(_RGBColor|_GrayLevel)..}} .

The colors are used with Blend in order to color the rectangles according to the order of the unique values of the specified data columns.

The default value for ColorRules is Automatic. When Automatic is given to ColorRules, MosaicPlot finds the data column with the largest number of unique values and colors them according to their order using ColorData[7,"ColorList"].

The grid of plots below shows mosaic plots of the same data with different values for the option ColorRules (given as plot labels).
Grid of mosaic plots for ColorRules values

Mosaic plots for data visualization


This blog post has description and examples of using the function MosaicPlot of the Mathematica package MosaicPlot.m provided by the project MathematicaForPrediction at GitHub. (Also see the document “Mosaic plots for data visualization” hosted at MathematicaForPrediction at GitHub. The document also has Mathematica code examples of usage and description of MosaicPlot‘s options.)

The function MosaicPlot summarizes the conditional probabilities of co-occurrence of the categorical values in a list of records of the same length. The list of records is assumed to be a full array and the columns to represent categorical values. (Note, that if a column is numerical but has a small number of different values it can be seen as categorical.)

I have read the descriptions of mosaic plots in the book “R in Action” by Robert Kabakoff and one of the references provided in the book (“What is a mosaic plot?” by Steve Simon). I was impressed how informative mosaic plots are and I figured they can be relatively easily implemented using Prefix trees (also known as “Tries”). I implemented MosaicPlot while working on a document analyzing the census income data from 1998, [6]. This is the reason that data set is used in this blog post. A good alternative set provided by ExampleData is {“Statistics”,”USCars1993″}.

Data set

The data set can be found and taken from

The description of the data set is given in the file “adult.names” of the data folder. The data folder provides two sets with the same type of data “” and “adult.test”; the former is used for training, the latter for testing.

The total number of records in the file “” is 32561; the total number of records in the file “adult.test” is 16281.

Here is how the data looks like:
Adult census income data sample

Since I did not understand the meaning of the column “fnlwgt” I dropped it from the data.

Here is the summary table of the data:
Adult census income data summary

On the summary table the numerical variables are described with min, max, and quartiles. The category variables are described with the tallies of their values. The tallies of values are ordered in decreasing order. The tallies of truncated values are summed under the value “(Other)”.

Note that:
— only 24% of the labels are “>50K”;
— 2/3 of the records are for males;
— “capital-gain” and “capital-loss” are very skewed.

Mosaic plot explanations

If we pick a categorical variable, say “sex”, we can visualize the frequencies of the appearance of the variable values with the following plot:
Adult census income data sex mosaic plot

The size of the rectangles depends on the frequencies of appearance of the values “Male” and “Female” in the data records. From the rectangle sizes we can see what we already knew from the data summary table: approximately 2/3 of the records are about males.

We can subdivide every rectangle r according to the frequencies of co-occurrence of r’s value with the values of a second categorical variable, say “relationship”:
Adult census income data sex-relationship mosaic plot

The labels corresponding to the values of “relationship” are rotated for legibility. The “relationship” labels are placed according to the co-occurrence with the value “Male” of the variable “sex”. The correspondent fractions of the pairs (“Female”,”Husband”), (“Female”,”Not-in-family”), etc., are deduced from order of the “relationship” labels.

Using colored mosaic plots can help distinguishing which rectangles correspond to which values. Here is the last plot with rectangles colored across the “relationship” data variable:
Adult census income data sex-relationship colored mosaic plot

From the visual representations of the “sex vs. relationship” mosaic plot we can see that large fraction of the males are husbands, none (or a very small fraction) of them are wives. We can also see that none (or a very small fraction) of the females are husbands, the largest fraction of them are “Not-in-family”, and they are approximately three times more than the females that are wives.

Let us make another mosaic plot of a different kind of relationship “sex vs. education”:
Adult census income data sex-education colored mosaic plot

By comparing the sizes of the rectangles corresponding to the values “Bachelors”, “Doctorate”, “Masters”, and “Some-college” on the “sex vs. education” mosaic plot we can see that the fraction of men that have finished college is larger than the fraction of women that have finished college.

We can further subdivide the rectangles according to the co-occurrence frequencies with a third categorical variable. We are going to choose that third variable to be “income”, the values of which can be seen as outcomes or consequents of the values of the first two variables of the mosaic plot.
Adult census income data sex-education-income colored mosaic plot

From the mosaic plot “sex vs. education vs. income” we can make the following observations.
1. Approximately 75% of the males with doctorate degrees or with a professional school degree earn more than $50000 per year.
2. Approximately 60% of the females with a doctorate degree earn more than $50000 per year.
3. Approximately 45% of the females with a professional school degree earn more than $50000.
4. Across all education type females are (much) less likely to earn more than $50000 per year.

Although I mentioned earlier that the “outcome” variable should be the last variable in the mosaic plot, it is also useful to start with the outcome variable to get an attribute breakdown perspective (using a different color scheme):
Adult census income data income-relationship-sex colored mosaic plot

Signature of MosaicPlot

MosaicPlot takes various options for tweaking the labels placement and style. Here is the Mathematica command:

MosaicPlot[censusData[[All, {9, 3, 5, 14}]], "Gap" -> 0.014,
"ColumnNamesOffset" -> 0.07,
"ColumnNames" ->
Map[Style[#, Blue, FontSize -> 15] &, columnNames[[{9, 3, 5, 14}]]],
"LabelRotation" -> {{3, 1}, {1, 1}}, ImageSize -> 900]

with which the following mosaic plot was made:
Adult census income data sex-education-maritalStatus-income mosaic plot colored

The option “Gap” used to regulate the gaps between the rectangle. The options “ColumnNames” and “ColumnNamesOffset” are for the specification of the variable names (in blue in the plot). The option “LabelRotation” specifies the rotation of the labels that correspond to the individual values of the variables. Also, MosaicPlot takes all the options of Graphics (since it is based on it).

Tooltip tables

The function MosaicPlot has an interactive feature using Tooltip that gives a table with the exact co-occurrence (contingency) values when hovering with the mouse over the rectangles. Here is an example:
Adult census income data sex-education-income colored mosaic plot with tooltips

Future plans

The current implementation of MosaicPlot uses coloring of the rectangles for easier plot reading. An alternative is to use coloring based on correlations statistics. I think though that the tooltip contingency tables with flexible coloring specification make the correlation coloring less needed.

Natural language processing with functional parsers

Natural language Processing (NLP) can be done with a structural approach using grammar rules. (The other type of NLP is using statistical methods.) In this post I discuss the use of functional parsers for the parsing and interpretation of small sets of natural language sentences within specific contexts. Functional parsing is also known as monadic parsing and parsing combinators.

Generally, I am using functional parsers to make Domain-Specific Languages (DSL’s). I use DSL’s to make command interfaces to search and recommendation engines and also to design and prototype conversational engines. I use extensively the so called Backus-Naur Form (BNF) for the grammar specifications. Clearly a DSL can be very close to natural language and provide sufficient means for interpretation within a given (narrow) context. (Like function integration in Calculus, smart phone directory browsing and selection, or search for something to eat nearby.)

I implemented and uploaded a package for construction of functional parsers: see FunctionalParsers.m hosted by the project MathematicaForPrediction at GitHub.

The package provides ability to quickly program parsers using a core system of functional parsers as described in the article “Functional parsers” by Jeroen Fokker .

The parsers (in both the package and the article) are categorized in the groups: basic, combinators, and transformers. Immediate interpretation can be done with transformer parsers, but the package also provides functions for evaluation of parser output within a context of data and functions.

Probably most importantly, the package provides functions for automatic generation of parsers from grammars in EBNF.

Here is an example of parsing the sentences of an integration requests language:

Interpretation of integration requests

Here is the grammar:
Integration requests EBNF grammar

The grammar can be visualized with this mind map:

Integration command

The mind map was hand made with MindNode Pro. Generally, the branches represent alternatives, but if two branches are connected the direction of the arrow connecting them shows a sequence combination.

With the FunctionalParsers.m package we can automatically generate a
mind map parsing the string of the grammar in EBNF to OMPL:


(And here is a PDF of the automatically generated mind map: IntegrationRequestsGenerated . )

I also made a slide show that gives an introduction to how the package is used: “Functional parsers for an integration requests language grammar”.

A more complicated example is this conversational engine for manipulation of time series data. (Data loading, finding outliers and trends. More details in the next blog post.)

Classification of genome data with n-gram models

In this blog post we consider the following problem:

Gene Sequence Classification Problem (GSCP): Given two genes, G1 and G2, and a (relatively short) sub-sequence S from one of them tell from which gene the sub-sequence S is part of.

One way to derive a solution for this problem is to model each gene with an n-gram model and classify the sub-sequences according to the odds ratio test based on the Markov chain state transition probabilities. We are going to make classification experiments for different n’s of the n-gram model and use a type of Receiver Operating Characteristic (ROC) plot to judge which n is the best.

This approach can be also applied to text classification. I consider genes for conciseness and clarity.

The full article with Mathematica code for the experiments described in the post can be downloaded from here.

The computations for the plots shown below were done with the Mathematica package for n-gram Markov chain models NGramMarkovChains.m provided by the MathematicaForPrediction project at GitHub.

Data and models

Mathematica has the function GenomeData that gives the DNA sequences for specified genes.

I made classification experiments for the following pairs of genes:
1. F2 and ETS1
Genes F2 and ETS1

2. F2 and PAH
Genes F2 and PAH

The classification is done for sequences like this:

In[812]:= StringTake[GenomeData["F2", "FullSequence"], {11200, 11200 + 100}]


Let us provide more details for the approach outlined at the beginning of the post.
First we define the classification algorithm NGramClassifier(k):
(I was too lazy to type the algorithm in here so I used a screenshot of the description I wrote in Mathematica.)

In order to find the best k for NGramClassifier(k) we make the following sets of experiments:

1. For each k, 1<=k<=10 calculate k-gram models for G1 and G2. For these models use only the first 80% of the gene sequences. (In other words, for each Gi the training set is only 80% of Gi’s length.)

2. Using the last 20% of G1 and G2 derive a set of test sequences T by randomly picking Gi and the sub-sequence length within the range {minLen, maxLen}.

3. For each k, 1<=k<=10 calculate the classifications of the elements of T using NGramClassifier(k) with the models calculated in step 1.

4. Calculate the True Positive Rate (TPR) and False Positive Rate (FPR) for each k in step 3.

5. Plot the points with coordinates {{TPR(k),TPR(k)}}, k in [1 10] and select the best k.


With the classification results of the tuning experiments we make the following ROC point plots
1. F2 vs. ETS1
ROCs F2 vs ETS1

2. F2 vs. PAH
ROCs F2 vs PAH

Conclusions and extensions

From the ROC plots we can see that the best classification results for GSCP with the genes “F2″ and “ETS1″ are obtained with Markov chain order 3 (or with 4-grams). Using order 2 (or 3-grams) is almost as good.

The experiments for the genes “F2″ and “PAH” showed that using 3-grams gives best results for GSCP.

A natural extension of the experiments described is to repeat them for other pairs of genes and across different lengths of sub-sequences. In this way we can derive more general conclusions for the best n-gram length in the algorithm NGramClassifier.