# Introduction

In this MathematicaVsR project at GitHub we show how to do Progressive machine learning using two types of classifiers based on:

• Tries with Frequencies, [AAp2, AAp3, AA1],

• Sparse Matrix Recommender framework [AAp4, AA2].

Progressive learning is a type of Online machine learning. For more details see [Wk1]. The Progressive learning problem is defined as follows.

Problem definition:

• Assume that the data is sequentially available: at a given time only part of the data is available, and after a certain time interval new data can be obtained.

• In view of classification, it is assumed that at a given time not all class labels are present in the data obtained so far.

• Let us call this a data stream.

• Make a machine learning algorithm that updates its model continuously or sequentially in time over a given data stream.

• Let us call such an algorithm a Progressive Learning Algorithm (PLA).

In comparison, the typical (classical) machine learning algorithms assume that representative training data is available and after training that data is no longer needed to make predictions. Progressive machine learning has more general assumptions about the data and its problem formulation is closer to how humans learn to classify objects.
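The progressive setting described above can be sketched in code. Below is a minimal, hypothetical Python illustration of a PLA: a frequency-based classifier whose model is updated batch by batch over a data stream, with new class labels allowed to appear at any time. (This is only a sketch of the progressive setting, not the Trie or recommender classifiers used in this project; all names are made up.)

```python
# Sketch of a Progressive Learning Algorithm (PLA): the model is updated
# with each new batch, and previously unseen labels may arrive at any time.
from collections import defaultdict

class ProgressiveFrequencyClassifier:
    def __init__(self):
        # counts[label][feature] accumulates over all batches seen so far
        self.counts = defaultdict(lambda: defaultdict(int))
        self.label_totals = defaultdict(int)

    def partial_fit(self, batch):
        """Update the model with a new batch of (features, label) records."""
        for features, label in batch:
            self.label_totals[label] += 1
            for f in features:
                self.counts[label][f] += 1

    def predict(self, features):
        """Score each known label by how often its records shared these features."""
        def score(label):
            return sum(self.counts[label][f] for f in features) / self.label_totals[label]
        return max(self.counts, key=score)

clf = ProgressiveFrequencyClassifier()
clf.partial_fit([(["round", "red"], "apple"), (["long", "yellow"], "banana")])
print(clf.predict(["red"]))                        # apple
clf.partial_fit([(["round", "orange"], "orange")])  # a new label arrives later
print(clf.predict(["orange"]))                     # orange
```

The point of the sketch is the `partial_fit` interface: the model never needs the earlier batches again, only its accumulated counts.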

Below we show the application of two types of classifiers as PLA's. One is based on Tries with Frequencies (TF), [AAp2, AAp3, AA1], the other on an Item-item Recommender (IIR) framework [AAp4, AA2].

Remark: Note that both TF and IIR come from tackling Unsupervised machine learning tasks, but here they are applied in the context of Supervised machine learning.

# General workflow

The Mathematica and R notebooks follow the steps in the following flow chart. For detailed explanations see any of the notebooks.

# Example runs

(For details see Progressive-machine-learning-examples.md.)

### Using Tries with Frequencies

Here is an example run with Tries with Frequencies, [AAp2, AA1]. Here are the obtained ROC curves. We can see that the Progressive learning process does improve its success rates over time.

### Using an Item-item recommender system

Here is an example run with an Item-item recommender system, [AAp4, AA2]. Here are the obtained ROC curves.

## Packages

[AAp1] Anton Antonov, Obtain and transform Mathematica machine learning data-sets, GetMachineLearningDataset.m, (2018), MathematicaVsR at GitHub.

[AAp2] Anton Antonov, Java tries with frequencies Mathematica package, JavaTriesWithFrequencies.m, (2017), MathematicaForPrediction at GitHub.

[AAp3] Anton Antonov, Tries with frequencies R package, TriesWithFrequencies.R, (2014), MathematicaForPrediction at GitHub.

[AAp4] Anton Antonov, Sparse matrix recommender framework in Mathematica, SparseMatrixRecommenderFramework.m, (2014), MathematicaForPrediction at GitHub.

## Articles

[Wk1] Wikipedia entry, Online machine learning.

[AA1] Anton Antonov, "Tries with frequencies in Java", (2017), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, "A Fast and Agile Item-Item Recommender: Design and Implementation", (2011), Wolfram Technology Conference 2011.

# Tries with frequencies in Java

## Introduction

This blog post describes the installation and use in Mathematica of Tries with frequencies implemented in Java, through a corresponding Mathematica package.

A prefix tree, or trie, is a tree data structure that stores a set of "words" that consist of "characters"; each element can be seen as a key to itself. The article and the packages [2,3,4] extend that data structure to have additional data (frequencies) associated with each key.

The packages [2,3] work with lists of strings only. The pure Mathematica package can work with more general data, but it is much slower.

The main motivation to create the package was to bring the fast Java trie implementations into Mathematica in order to prototype, implement, and experiment with different text processing algorithms (like inductive grammar parser generation and named entity recognition). The speed of combining the Java and Mathematica implementations is evaluated in the performance examples below.

## Set-up

The following directory path has to contain the jar file "TriesWithFrequencies.jar".

```mathematica
$JavaTriesWithFrequenciesPath =
  "/Users/antonov/MathFiles/MathematicaForPrediction/Java/TriesWithFrequencies";
FileExistsQ[
 FileNameJoin[{$JavaTriesWithFrequenciesPath, "TriesWithFrequencies.jar"}]]

(* True *)
```

For more details see the explanations in the README file in the corresponding GitHub directory.

The following directory is expected to contain the Mathematica package JavaTriesWithFrequencies.m.

```mathematica
dirName = "/Users/antonov/MathFiles/MathematicaForPrediction";
FileExistsQ[FileNameJoin[{dirName, "JavaTriesWithFrequencies.m"}]]

(* True *)

AppendTo[$Path, dirName];
Needs["JavaTriesWithFrequencies`"]
```

The following command installs Java (via JLink`) and loads the necessary Java libraries.

```mathematica
JavaTrieInstall[$JavaTriesWithFrequenciesPath]
```

## Basic examples

For brevity the basic examples are not included in this blog post. Here is an album of images that shows the `"JavaTrie.*"` commands with their effects:

More detailed explanations can be found in the corresponding Markdown document.

Next, we are going to look into performance evaluation examples (also given in that Markdown document).

## Membership of words

Assume we want to find the words of "Hamlet" that are not in the book "Origin of Species". This section shows that the Java trie creation and query times for this task are quite small.

The following code reads the words in the texts. We get approximately 33,000 words from "Hamlet" and 151,000 words from "Origin of Species".

```mathematica
hWords =
  Block[{words},
   words =
    StringSplit[
     ExampleData[{"Text", "Hamlet"}], {Whitespace, PunctuationCharacter}];
   words = Select[ToLowerCase[words], StringLength[#] > 0 &]
   ];
Length[hWords]

(* 32832 *)

osWords =
  Block[{words},
   words =
    StringSplit[
     ExampleData[{"Text", "OriginOfSpecies"}], {Whitespace, PunctuationCharacter}];
   words = Select[ToLowerCase[words], StringLength[#] > 0 &]
   ];
Length[osWords]

(* 151205 *)
```

### Membership

First we create a trie with the "Origin of Species" words:

```mathematica
AbsoluteTiming[
 jOStr = JavaTrieCreateBySplit[osWords];
]

(* {0.682531, Null} *)
```

Sanity check — the "Origin of species" words are in the trie:

```mathematica
AbsoluteTiming[
 And @@ JavaObjectToExpression[
   JavaTrieContains[jOStr, Characters /@ osWords]]
]

(* {1.32224, True} *)
```

Membership of "Hamlet" words into "Origin of Species":

```mathematica
AbsoluteTiming[
 res = JavaObjectToExpression[
   JavaTrieContains[jOStr, Characters /@ hWords]];
]

(* {0.265307, Null} *)
```

Tallies of belonging:

```mathematica
Tally[res]

(* {{True, 24924}, {False, 7908}} *)
```

Sample of words from "Hamlet" that do not belong to "Origin of Species":

```mathematica
RandomSample[Pick[hWords, Not /@ res], 30]

(* {"rosencrantz", "your", "mar", "airy", "rub", "honesty", \
"ambassadors", "oph", "returns", "pale", "virtue", "laertes", \
"villain", "ham", "earnest", "trail", "unhand", "quit", "your", \
"your", "fishmonger", "groaning", "your", "wake", "thou", "liest", \
"polonius", "upshot", "drowned", "grosser"} *)
```

Common words sample:

```mathematica
RandomSample[Pick[hWords, res], 30]

(* {"as", "indeed", "it", "with", "wild", "will", "to", "good", "so", \
"dirt", "the", "come", "not", "or", "but", "the", "why", "my", "to", \
"he", "and", "you", "it", "to", "potent", "said", "the", "are", \
"question", "soft"} *)
```

### Statistics

The node counts statistics calculation is fast:

```mathematica
AbsoluteTiming[
 JavaTrieNodeCounts[jOStr]
]

(* {0.002344, <|"total" -> 20723, "internal" -> 15484, "leaves" -> 5239|>} *)
```

The node counts statistics computation after shrinking is comparably fast:

```mathematica
AbsoluteTiming[
 JavaTrieNodeCounts[JavaTrieShrink[jOStr]]
]

(* {0.00539, <|"total" -> 8918, "internal" -> 3679, "leaves" -> 5239|>} *)
```
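Shrinking merges chains of single-child nodes into one node with a concatenated key, which is why the internal node count drops while the leaf count stays the same. Here is a hypothetical Python sketch of the idea on plain nested dictionaries (without frequencies, and not the Java implementation):

```python
# Illustrative sketch of trie "shrinking" (path compression): chains of
# single-child nodes are merged into one node with the concatenated key.
def shrink(node):
    """node is a dict mapping a key string to a child dict."""
    out = {}
    for key, child in node.items():
        # follow the chain while the child has exactly one child of its own
        while len(child) == 1:
            (next_key, next_child), = child.items()
            key, child = key + next_key, next_child
        out[key] = shrink(child)
    return out

# Character trie for the words "war", "work", "car"
trie = {"w": {"a": {"r": {}}, "o": {"r": {"k": {}}}}, "c": {"a": {"r": {}}}}
print(shrink(trie))   # {'w': {'ar': {}, 'ork': {}}, 'car': {}}
```

Note how the internal "prefix" `"w"` survives because it has two children, while the single-child chains collapse.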

The conversion of a large trie to JSON and computing statistics over the obtained tree is reasonably fast:

```mathematica
AbsoluteTiming[
 res = JavaTrieToJSON[jOStr];
]

(* {0.557221, Null} *)

AbsoluteTiming[
 Quantile[
  Cases[res, ("value" -> v_) :> v, \[Infinity]],
  Range[0, 1, 0.1]]
]

(* {0.019644, {1., 1., 1., 1., 2., 3., 5., 9., 17., 42., 151205.}} *)
```

## Dictionary infixes

Get all words from a dictionary:

```mathematica
allWords = DictionaryLookup["*"];
allWords // Length

(* 92518 *)
```

Trie creation and shrinking:

```mathematica
AbsoluteTiming[
 jDTrie = JavaTrieCreateBySplit[allWords];
 jDShTrie = JavaTrieShrink[jDTrie];
]

(* {0.30508, Null} *)
```

JSON form extraction:

```mathematica
AbsoluteTiming[
 jsonRes = JavaTrieToJSON[jDShTrie];
]

(* {3.85955, Null} *)
```

Here are the node statistics of the original and shrunk tries.

Find the infixes that have more than three characters and appear more than 10 times:

```mathematica
Multicolumn[#, 4] &@
 Select[SortBy[
   Tally[Cases[jsonRes, ("key" -> v_) :> v, Infinity]], -#[[-1]] &],
  StringLength[#[[1]]] > 3 && #[[2]] > 10 &]
```

## Unit tests

Many of the examples shown in this document have corresponding tests in the file JavaTriesWithFrequencies-Unit-Tests.wlt hosted at GitHub.

```mathematica
tr = TestReport[
  dirName <> "/UnitTests/JavaTriesWithFrequencies-Unit-Tests.wlt"]
```

## References

 Anton Antonov, Tries with frequencies in Java, (2017), source code at MathematicaForPrediction at GitHub, project Java/TriesWithFrequencies.

 Anton Antonov, Java tries with frequencies Mathematica package, (2017), source code at MathematicaForPrediction at GitHub, package JavaTriesWithFrequencies.m .

 Anton Antonov, Tries with frequencies Mathematica package, (2013), source code at MathematicaForPrediction at GitHub, package TriesWithFrequencies.m .

 Anton Antonov, Java tries with frequencies Mathematica unit tests, (2017), source code at MathematicaForPrediction at GitHub, unit tests file JavaTriesWithFrequencies-Unit-Tests.wlt .

 Wikipedia, Trie, https://en.wikipedia.org/wiki/Trie .

 Anton Antonov, "Tries with frequencies in Java", (2017), MathematicaForPrediction at GitHub.

# Removal of sub-trees in tries

I recently updated the package TriesWithFrequencies.m with two new functions, `TrieRemove` and `TrieNodeFrequencies`. The function `TrieNodeFrequencies` reverses the outcomes of `TrieNodeProbabilities`; it was made to facilitate the implementation and use of `TrieRemove`.

The removal of a sub-tree from a trie is useful when we want to alter the learning stored in the trie.

The new functions are illustrated with examples below. (These examples reuse and extend the ones in the blog post "Tries with frequencies for data mining".)

Consider the trie made with the list of words:

`{"war", "ward", "warden", "work", "car", "cars", "carbs"}`

The function `TrieNodeFrequencies` can be used to reverse the effect of `TrieNodeProbabilities`. Its first argument is a trie, the second one is a scaling parameter (which is 1 by default). Here is an example:

`TrieForm[TrieNodeFrequencies[TrieNodeProbabilities[mytrie], 14]]`

The function `TrieRemove` can be used to remove sub-parts of a trie that correspond to a specified "word". Here is an example using a trie with frequencies:

`TrieForm[TrieRemove[mytrie, "car"]]`

Here is an example using a trie with probabilities:

`TrieForm[TrieRemove[TrieNodeProbabilities[mytrie], "car"]]`

`TrieRemove` has two options, "HowManyTimes" and "FrequencyValues", with default values `Automatic`. The option "HowManyTimes" takes a number or `Automatic`; it is still under development (its functionality is only partially correct). The option "FrequencyValues" takes `True | False | Automatic` and is used for the interpretation of the sub-tree removal. If the values are frequencies, the removal process is more straightforward. If the values are probabilities, the removal process has to take into account that the values on the same level of the sub-tree are conditional probabilities that sum to 1.

`TrieRemove` has two different implementations: one for tries with frequencies and one for tries with probabilities. The option setting `"FrequencyValues" -> True | False` is used to redirect to the correct implementation. When `"FrequencyValues" -> Automatic` is given, a simple heuristic is used to determine whether the trie argument is a trie with frequencies or a trie with probabilities.
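To make the frequency case concrete, here is a hedged Python sketch (not the package's actual implementation; the trie encoding and function name are hypothetical) of removing the sub-tree for a given "word" from a trie with frequencies. The removed sub-tree's total count is subtracted from every ancestor on the path, which is why the frequency case is the more straightforward one:

```python
def trie_remove(node, word):
    """node maps a character to [count, children]; removes the branch under word.
    Returns the removed count so ancestors can be decremented."""
    if not word or word[0] not in node:
        return 0
    head, rest = word[0], word[1:]
    count, children = node[head]
    if not rest:
        del node[head]            # drop the whole sub-tree at the word's end
        return count
    removed = trie_remove(children, rest)
    node[head][0] -= removed      # propagate the decrement upward
    return removed

# Frequency trie for the "c" branch of {"car", "cars", "carbs"}
trie = {"c": [3, {"a": [3, {"r": [3, {"s": [1, {}],
                                      "b": [1, {"s": [1, {}]}]}]}]}]}
trie_remove(trie, "car")
print(trie)   # {'c': [0, {'a': [0, {}]}]}
```

Zero-count ancestors are kept here for simplicity; a fuller implementation would prune them. The probability case additionally has to renormalize the sibling values on each level.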

# Mosaic plots for data visualization

### Introduction

This blog post has description and examples of using the function `MosaicPlot` of the Mathematica package MosaicPlot.m provided by the project MathematicaForPrediction at GitHub. (Also see the document “Mosaic plots for data visualization” hosted at MathematicaForPrediction at GitHub. The document also has Mathematica code examples of usage and description of `MosaicPlot`‘s options.)

The function `MosaicPlot` summarizes the conditional probabilities of co-occurrence of the categorical values in a list of records of the same length. The list of records is assumed to be a full array and the columns to represent categorical values. (Note, that if a column is numerical but has a small number of different values it can be seen as categorical.)
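The proportions a mosaic plot encodes can be sketched directly: the first variable splits the unit interval by relative frequencies, and each resulting rectangle is then split by the conditional frequencies of the second variable. Here is a small Python illustration with a made-up mini data set (not the census data used below):

```python
# Sketch of the (conditional) relative frequencies a two-variable
# mosaic plot encodes. Toy records with columns (sex, relationship).
from collections import Counter

records = [("Male", "Husband"), ("Male", "Husband"), ("Male", "Not-in-family"),
           ("Female", "Wife"), ("Female", "Not-in-family"), ("Female", "Not-in-family")]

# First split: frequencies of the first variable give the rectangle widths
n = len(records)
first = Counter(r[0] for r in records)
widths = {v: c / n for v, c in first.items()}

# Second split: conditional frequencies P(relationship | sex) give the heights
heights = {}
for v, c in first.items():
    sub = Counter(r[1] for r in records if r[0] == v)
    heights[v] = {w: k / c for w, k in sub.items()}

print(widths)   # {'Male': 0.5, 'Female': 0.5}
print(heights)  # e.g. heights["Male"]["Husband"] is 2/3
```

Within each column the conditional fractions sum to 1, which is exactly the property the rectangle subdivisions visualize.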

I have read the descriptions of mosaic plots in the book "R in Action" by Robert Kabacoff and one of the references provided in the book ("What is a mosaic plot?" by Steve Simon). I was impressed by how informative mosaic plots are, and I figured they can be relatively easily implemented using prefix trees (also known as "tries"). I implemented MosaicPlot while working on a document analyzing the census income data from 1998. This is the reason that data set is used in this blog post. A good alternative data set provided by ExampleData is {"Statistics", "USCars1993"}.

### Data set

The data set can be found and taken from http://archive.ics.uci.edu/ml/datasets/Census+Income.

The description of the data set is given in the file “adult.names” of the data folder. The data folder provides two sets with the same type of data “adult.data” and “adult.test”; the former is used for training, the latter for testing.

The total number of records in the file “adult.data” is 32561; the total number of records in the file “adult.test” is 16281.

Since I did not understand the meaning of the column “fnlwgt” I dropped it from the data.

Here is the summary table of the data.

On the summary table the numerical variables are described with min, max, and quartiles. The categorical variables are described with the tallies of their values. The tallies of values are ordered in decreasing order. The tallies of truncated values are summed under the value "(Other)".

Note that:
— only 24% of the labels are “>50K”;
— 2/3 of the records are for males;
— “capital-gain” and “capital-loss” are very skewed.

### Mosaic plot explanations

If we pick a categorical variable, say "sex", we can visualize the frequencies of the appearance of the variable values with the following plot.

The size of the rectangles depends on the frequencies of appearance of the values "Male" and "Female" in the data records. From the rectangle sizes we can see what we already knew from the data summary table: approximately 2/3 of the records are about males.

We can subdivide every rectangle r according to the frequencies of co-occurrence of r's value with the values of a second categorical variable, say "relationship".

The labels corresponding to the values of "relationship" are rotated for legibility. The "relationship" labels are placed according to the co-occurrence with the value "Male" of the variable "sex". The corresponding fractions of the pairs ("Female", "Husband"), ("Female", "Not-in-family"), etc., are deduced from the order of the "relationship" labels.

Using colored mosaic plots can help distinguish which rectangles correspond to which values. Here is the last plot with rectangles colored across the "relationship" data variable.

From the visual representations of the "sex vs. relationship" mosaic plot we can see that a large fraction of the males are husbands, and none (or a very small fraction) of them are wives. We can also see that none (or a very small fraction) of the females are husbands; the largest fraction of them are "Not-in-family", and they are approximately three times more numerous than the females that are wives.

Let us make another mosaic plot of a different kind of relationship, "sex vs. education".

By comparing the sizes of the rectangles corresponding to the values "Bachelors", "Doctorate", "Masters", and "Some-college" on the "sex vs. education" mosaic plot we can see that the fraction of men that have finished college is larger than the fraction of women that have finished college.

We can further subdivide the rectangles according to the co-occurrence frequencies with a third categorical variable. We are going to choose that third variable to be “income”, the values of which can be seen as outcomes or consequents of the values of the first two variables of the mosaic plot. From the mosaic plot “sex vs. education vs. income” we can make the following observations.
1. Approximately 75% of the males with doctorate degrees or with a professional school degree earn more than \$50000 per year.
2. Approximately 60% of the females with a doctorate degree earn more than \$50000 per year.
3. Approximately 45% of the females with a professional school degree earn more than \$50000.
4. Across all education types, females are (much) less likely to earn more than \$50000 per year.
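Observations of this kind are just conditional fractions computed over the records. Here is a hedged Python sketch with toy (sex, education, income) records, not the actual census data:

```python
# Conditional fractions P(income == ">50K" | sex, education) over records.
# The records below are made up for illustration only.
from collections import Counter

records = [
    ("Male", "Doctorate", ">50K"), ("Male", "Doctorate", ">50K"),
    ("Male", "Doctorate", ">50K"), ("Male", "Doctorate", "<=50K"),
    ("Female", "Doctorate", ">50K"), ("Female", "Doctorate", "<=50K"),
]

pair_totals = Counter((s, e) for s, e, _ in records)                  # denominator
high = Counter((s, e) for s, e, inc in records if inc == ">50K")      # numerator

for key in pair_totals:
    print(key, high[key] / pair_totals[key])
# ('Male', 'Doctorate') 0.75
# ('Female', 'Doctorate') 0.5
```

These are the same fractions the third-variable subdivision of the mosaic plot displays geometrically.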

Although I mentioned earlier that the "outcome" variable should be the last variable in the mosaic plot, it is also useful to start with the outcome variable to get an attribute breakdown perspective (using a different color scheme).

### Signature of MosaicPlot

`MosaicPlot` takes various options for tweaking the labels placement and style. Here is the Mathematica command:
```mathematica
MosaicPlot[censusData[[All, {9, 3, 5, 14}]], "Gap" -> 0.014,
 "ColumnNamesOffset" -> 0.07,
 "ColumnNames" ->
  Map[Style[#, Blue, FontSize -> 15] &, columnNames[[{9, 3, 5, 14}]]],
 "LabelRotation" -> {{3, 1}, {1, 1}}, ImageSize -> 900]
```
with which the following mosaic plot was made.

The option "Gap" is used to regulate the gaps between the rectangles. The options "ColumnNames" and "ColumnNamesOffset" are for the specification of the variable names (in blue in the plot). The option "LabelRotation" specifies the rotation of the labels that correspond to the individual values of the variables. Also, `MosaicPlot` takes all the options of `Graphics` (since it is based on it).

### Tooltip tables

The function `MosaicPlot` has an interactive feature using `Tooltip` that gives a table with the exact co-occurrence (contingency) values when hovering with the mouse over the rectangles. Here is an example.

### Future plans

The current implementation of `MosaicPlot` uses coloring of the rectangles for easier plot reading. An alternative is to use coloring based on correlations statistics. I think though that the tooltip contingency tables with flexible coloring specification make the correlation coloring less needed.

# Tries with frequencies for data mining

In computer science a trie (or a prefix tree) is a data structure that allows quick retrieval of associations between a set of keys and a set of values. It is assumed that each key is a sequence of elements. (Most often the keys are considered to be strings i.e. sequences of characters.) The trie nodes do not store the keys. Instead the position of a node in a trie corresponds to a key.

For example, consider the trie made with the list of words:

{"war", "ward", "warden", "work", "car", "cars", "carbs"}

We can further shrink the trie in order to find the internal "prefixes". Another transformation of a trie is to replace the numbers of appearances with probabilities: each value of a node is the probability of arriving at that node from its parent. Using this last trie we can take a Bayesian perspective and ask questions like: given that a word starts with "ca", what is the probability that the word ends with "s" or "r"? (Within the set of words we built the trie with.)
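Assuming a simple nested-dictionary encoding (a Python sketch of the idea, not the package's representation), a trie with frequencies over the word list above and a Bayesian question of this kind look as follows:

```python
# Build a trie with frequencies and query conditional next-character
# probabilities, P(next char | prefix).
from collections import defaultdict

def make_trie():
    return {"count": 0, "children": defaultdict(make_trie)}

words = ["war", "ward", "warden", "work", "car", "cars", "carbs"]
root = make_trie()
for w in words:
    node = root
    node["count"] += 1
    for ch in w:
        node = node["children"][ch]
        node["count"] += 1            # frequency of this prefix in the data

def prob_next(prefix, ch):
    """P(next character is ch | the word starts with prefix)."""
    node = root
    for c in prefix:
        node = node["children"][c]
    return node["children"][ch]["count"] / node["count"]

print(prob_next("ca", "r"))   # 1.0 -- every "ca" word continues with "r"
print(prob_next("w", "a"))    # 0.75 -- 3 of the 4 "w" words continue with "a"
```

Each node's value divided by its parent's value is exactly the conditional probability that the `TrieNodeProbabilities`-style transformation stores directly in the nodes.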

This document has an introduction to tries with frequencies and it is also a guide to using the package described below. The document has lots of examples and illustrative graph plots; it also discusses how to answer the Bayesian questions.

Several years ago I started experimenting with applying tries to data mining problems. During the last few months I productized my Mathematica trie implementations into a package: see TriesWithFrequencies.m, provided by the MathematicaForPrediction project at GitHub. The package is for the creation, manipulation, and visualization of tries in which the key values are frequencies of appearance in the data being mined. The package can also be used to build and use tries as associative arrays.

Using tries for data mining comes from the general idea of utilizing NLP techniques and algorithms in non-language problems. Such utilization is possible because, philosophically speaking, in the current Big Data + Facebook phase of our civilization every entity can be, and is, characterized with tags. If we have a collection of events and they are tagged properly with a good system of tags, we can data mine sequences of events with prefix trees.