Tries with frequencies in Java

Introduction

This blog post describes the installation and use in Mathematica of Tries with frequencies [1] implemented in Java [2] through a corresponding Mathematica package [3].

Prefix tree or Trie, [6], is a tree data structure that stores a set of "words" that consist of "characters" — each element can be seen as a key to itself. The article [1] and packages [2,3,4] extend that data structure to have additional data (frequencies) associated with each key.

The packages [2,3] work with lists of strings only. The package [4] can work with more general data but it is much slower.

The main motivation to create the package [3] was to bring the fast Trie functions implementations of [2] into Mathematica in order to prototype, implement, and experiment with different text processing algorithms. (Like, inductive grammar parsers generation and entity name recognition.) The speed of combining [2] and [3] is evaluated in the section "Performance tests" below.

Set-up

This following directory path has to have the jar file "TriesWithFrequencies.jar".

$JavaTriesWithFrequenciesPath = 
  "/Users/antonov/MathFiles/MathematicaForPrediction/Java/TriesWithFrequencies";
FileExistsQ[
 FileNameJoin[{$JavaTriesWithFrequenciesPath, "TriesWithFrequencies.jar"}]]

(* True *)

For more details see the explanations in the README file in the GitHub directory of [2].

The following directory is expected to have the Mathematica package [3].

dirName = "/Users/antonov/MathFiles/MathematicaForPrediction";
FileExistsQ[FileNameJoin[{dirName, "JavaTriesWithFrequencies.m"}]]

(* True *)

AppendTo[$Path, dirName];
Needs["JavaTriesWithFrequencies`"]

This commands installs Java (via JLink`) and loads the necessary Java libraries.

JavaTrieInstall[$JavaTriesWithFrequenciesPath]

Basic examples

For brevity the basic examples are not included in this blog post. Here is album of images that shows the "JavaTrie.*" commands with their effects:

"JavaTrieExample" .

More detailed explanations can be found in the Markdown document, [7]:

Next, we are going to look into performance evaluation examples (also given in [7].)

Membership of words

Assume we want find the words of "Hamlet" that are not in the book "Origin of Species". This section shows that the Java trie creation and query times for this task are quite small.

Read words

The following code reads the words in the texts. We get 33000 words from "Hamlet" and 151000 words from "Origin of Species".

hWords =
  Block[{words},
   words = 
    StringSplit[
     ExampleData[{"Text", "Hamlet"}], {Whitespace, 
      PunctuationCharacter}];
   words = Select[ToLowerCase[words], StringLength[#] > 0 &]
   ];
Length[hWords]

(* 32832 *)

osWords =
  Block[{words},
   words = 
    StringSplit[
     ExampleData[{"Text", "OriginOfSpecies"}], {Whitespace, 
      PunctuationCharacter}];
   words = Select[ToLowerCase[words], StringLength[#] > 0 &]
   ];
Length[osWords]

(* 151205 *)

Membership

First we create trie with "Origin of species" words:

AbsoluteTiming[
 jOStr = JavaTrieCreateBySplit[osWords];
]

(* {0.682531, Null} *)

Sanity check — the "Origin of species" words are in the trie:

AbsoluteTiming[
 And @@ JavaObjectToExpression[
   JavaTrieContains[jOStr, Characters /@ osWords]]
]

(* {1.32224, True} *)

Membership of "Hamlet" words into "Origin of Species":

AbsoluteTiming[
 res = JavaObjectToExpression[
    JavaTrieContains[jOStr, Characters /@ hWords]];
]

(* {0.265307, Null} *)

Tallies of belonging:

Tally[res]

(* {{True, 24924}, {False, 7908}} *)

Sample of words from "Hamlet" that do not belong to "Origin of Species":

RandomSample[Pick[hWords, Not /@ res], 30]

(* {"rosencrantz", "your", "mar", "airy", "rub", "honesty", \
"ambassadors", "oph", "returns", "pale", "virtue", "laertes", \
"villain", "ham", "earnest", "trail", "unhand", "quit", "your", \
"your", "fishmonger", "groaning", "your", "wake", "thou", "liest", \
"polonius", "upshot", "drowned", "grosser"} *)

Common words sample:

RandomSample[Pick[hWords, res], 30]

(* {"as", "indeed", "it", "with", "wild", "will", "to", "good", "so", \
"dirt", "the", "come", "not", "or", "but", "the", "why", "my", "to", \
"he", "and", "you", "it", "to", "potent", "said", "the", "are", \
"question", "soft"} *)

Statistics

The node counts statistics calculation is fast:

AbsoluteTiming[
 JavaTrieNodeCounts[jOStr]
]

(* {0.002344, <|"total" -> 20723, "internal" -> 15484, "leaves" -> 5239|>} *)

The node counts statistics computation after shrinking is comparably fast :

AbsoluteTiming[
 JavaTrieNodeCounts[JavaTrieShrink[jOStr]]
]

(* {0.00539, <|"total" -> 8918,  "internal" -> 3679, "leaves" -> 5239|>} *)

The conversion of a large trie to JSON and computing statistics over the obtained tree is reasonably fast:

AbsoluteTiming[
 res = JavaTrieToJSON[jOStr];
]

(* {0.557221, Null} *)

AbsoluteTiming[
 Quantile[
  Cases[res, ("value" -> v_) :> v, \[Infinity]], 
  Range[0, 1, 0.1]]
]

(* {0.019644, {1., 1., 1., 1., 2., 3., 5., 9., 17., 42., 151205.}} *)

Dictionary infixes

Get all words from a dictionary:

allWords =  DictionaryLookup["*"];
allWords // Length

(* 92518 *)

Trie creation and shrinking:

AbsoluteTiming[
 jDTrie = JavaTrieCreateBySplit[allWords];
 jDShTrie = JavaTrieShrink[jDTrie];
]

(* {0.30508, Null} *)

JSON form extraction:

AbsoluteTiming[
 jsonRes = JavaTrieToJSON[jDShTrie];
]

(* {3.85955, Null} *)

Here are the node statistics of the original and shrunk tries:

"Orginal-trie-vs-Shrunk-trie-Node-Counts"

Find the infixes that have more than three characters and appear more than 10 times:

Multicolumn[#, 4] &@
 Select[SortBy[
   Tally[Cases[
     jsonRes, ("key" -> v_) :> v, Infinity]], -#[[-1]] &], StringLength[#[[1]]] > 3 && #[[2]] > 10 &]
"Long-infixes-in-shrunk-dictionary-trie"

Unit tests

Many of example shown in this document have corresponding tests in the file JavaTriesWithFrequencies-Unit-Tests.wlt hosted at GitHub.

tr = TestReport[
  dirName <> "/UnitTests/JavaTriesWithFrequencies-Unit-Tests.wlt"]
"TestReport"

References

[1] Anton Antonov, "Tries with frequencies for data mining", (2013), MathematicaForPrediction at WordPress blog. URL: https://mathematicaforprediction.wordpress.com/2013/12/06/tries-with-frequencies-for-data-mining/ .

[2] Anton Antonov, Tries with frequencies in Java, (2017), source code at MathematicaForPrediction at GitHub, project Java/TriesWithFrequencies.

[3] Anton Antonov, Java tries with frequencies Mathematica package, (2017), source code at MathematicaForPrediction at GitHub, package JavaTriesWithFrequencies.m .

[4] Anton Antonov, Tries with frequencies Mathematica package, (2013), source code at MathematicaForPrediction at GitHub, package TriesWithFrequencies.m .

[5] Anton Antonov, Java tries with frequencies Mathematica unit tests, (2017), source code at MathematicaForPrediction at GitHub, unit tests file JavaTriesWithFrequencies-Unit-Tests.wlt .

[6] Wikipedia, Trie, https://en.wikipedia.org/wiki/Trie .

[7] Anton Antonov, "Tries with frequencies in Java", (2017), MathematicaForPrediction at GitHub.

Advertisements

Removal of sub-trees in tries

I recently updated the package TriesWithFrequencies.m with two new functions TrieRemoval and TrieNodeFrequencies. The function TrieNodeFrequencies reverses the outcomes of TrieNodeProbabilities; it was made to facilitate the implementation and usage of TrieRemoval.

The removal of a sub-tree from a trie is useful when we want to alternate the learning stored in the trie.

The new functions are illustrated with examples below. (These examples reuse and extend the ones in the blog post “Tries with frequencies for data mining”.)
Consider the trie made with the list of words

{"war", "ward", "warden", "work", "car", "cars", "carbs"}

Trie

 

The function TrieNodeFrequencies can be used to reverse the effect of TrieNodeProbabilities. Its first argument is a trie, the second one is a scaling parameter (which is 1 by default). Here is an example:

TrieForm[TrieNodeFrequencies[TrieNodeProbabilities[mytrie], 14]]
TrieNodeFrequenciesTrieNodeProbabilities14

The function TrieRemove can be used to remove sub-parts of a trie that correspond to a specified “word”. Here is an example using a trie with frequencies:

TrieForm[TrieRemove[mytrie, "car"]]

TrieRemoveOfcar

Here is an example using a trie with probabilities:

TrieForm[TrieRemove[TrieNodeProbabilities[mytrie], "car"]]

TrieRemoveOfTrieNodeProbabilitiesOfcar

TrieRemove has two options “HowManyTimes” and “FrequencyValues” with default values Automatic. The option “HowManyTimes” takes a number or Automatic and it is still under development (its functionality is partially correct). The option “FrequencyValues” takes True | False | Automatic and it is used for the interpretation of the sub-tree removal. If the values are frequencies the removal process is more straightforward. If the values are probabilities the removal process has to take into account that the values on the same level of the sub-tree are conditional probabilities that sum to 1. TrieRemove has two different implementations one for tries with frequencies and one for tries with probabilities. The option setting "FrequencyValues"->True|False is used to redirect to the correct implementation. When "FrequencyValues"->Automatic is given a simple heuristic is used to determine is the trie argument a trie with frequencies or a trie with probabilities.

Mosaic plots for data visualization

Introduction

This blog post has description and examples of using the function MosaicPlot of the Mathematica package MosaicPlot.m provided by the project MathematicaForPrediction at GitHub. (Also see the document “Mosaic plots for data visualization” hosted at MathematicaForPrediction at GitHub. The document also has Mathematica code examples of usage and description of MosaicPlot‘s options.)

The function MosaicPlot summarizes the conditional probabilities of co-occurrence of the categorical values in a list of records of the same length. The list of records is assumed to be a full array and the columns to represent categorical values. (Note, that if a column is numerical but has a small number of different values it can be seen as categorical.)

I have read the descriptions of mosaic plots in the book “R in Action” by Robert Kabakoff and one of the references provided in the book (“What is a mosaic plot?” by Steve Simon). I was impressed how informative mosaic plots are and I figured they can be relatively easily implemented using Prefix trees (also known as “Tries”). I implemented MosaicPlot while working on a document analyzing the census income data from 1998, [6]. This is the reason that data set is used in this blog post. A good alternative set provided by ExampleData is {“Statistics”,”USCars1993″}.

Data set

The data set can be found and taken from http://archive.ics.uci.edu/ml/datasets/Census+Income.

The description of the data set is given in the file “adult.names” of the data folder. The data folder provides two sets with the same type of data “adult.data” and “adult.test”; the former is used for training, the latter for testing.

The total number of records in the file “adult.data” is 32561; the total number of records in the file “adult.test” is 16281.

Here is how the data looks like:
Adult census income data sample

Since I did not understand the meaning of the column “fnlwgt” I dropped it from the data.

Here is the summary table of the data:
Adult census income data summary

On the summary table the numerical variables are described with min, max, and quartiles. The category variables are described with the tallies of their values. The tallies of values are ordered in decreasing order. The tallies of truncated values are summed under the value “(Other)”.

Note that:
— only 24% of the labels are “>50K”;
— 2/3 of the records are for males;
— “capital-gain” and “capital-loss” are very skewed.

Mosaic plot explanations

If we pick a categorical variable, say “sex”, we can visualize the frequencies of the appearance of the variable values with the following plot:
Adult census income data sex mosaic plot

The size of the rectangles depends on the frequencies of appearance of the values “Male” and “Female” in the data records. From the rectangle sizes we can see what we already knew from the data summary table: approximately 2/3 of the records are about males.

We can subdivide every rectangle r according to the frequencies of co-occurrence of r’s value with the values of a second categorical variable, say “relationship”:
Adult census income data sex-relationship mosaic plot

The labels corresponding to the values of “relationship” are rotated for legibility. The “relationship” labels are placed according to the co-occurrence with the value “Male” of the variable “sex”. The correspondent fractions of the pairs (“Female”,”Husband”), (“Female”,”Not-in-family”), etc., are deduced from order of the “relationship” labels.

Using colored mosaic plots can help distinguishing which rectangles correspond to which values. Here is the last plot with rectangles colored across the “relationship” data variable:
Adult census income data sex-relationship colored mosaic plot

From the visual representations of the “sex vs. relationship” mosaic plot we can see that large fraction of the males are husbands, none (or a very small fraction) of them are wives. We can also see that none (or a very small fraction) of the females are husbands, the largest fraction of them are “Not-in-family”, and they are approximately three times more than the females that are wives.

Let us make another mosaic plot of a different kind of relationship “sex vs. education”:
Adult census income data sex-education colored mosaic plot

By comparing the sizes of the rectangles corresponding to the values “Bachelors”, “Doctorate”, “Masters”, and “Some-college” on the “sex vs. education” mosaic plot we can see that the fraction of men that have finished college is larger than the fraction of women that have finished college.

We can further subdivide the rectangles according to the co-occurrence frequencies with a third categorical variable. We are going to choose that third variable to be “income”, the values of which can be seen as outcomes or consequents of the values of the first two variables of the mosaic plot.
Adult census income data sex-education-income colored mosaic plot

From the mosaic plot “sex vs. education vs. income” we can make the following observations.
1. Approximately 75% of the males with doctorate degrees or with a professional school degree earn more than $50000 per year.
2. Approximately 60% of the females with a doctorate degree earn more than $50000 per year.
3. Approximately 45% of the females with a professional school degree earn more than $50000.
4. Across all education type females are (much) less likely to earn more than $50000 per year.

Although I mentioned earlier that the “outcome” variable should be the last variable in the mosaic plot, it is also useful to start with the outcome variable to get an attribute breakdown perspective (using a different color scheme):
Adult census income data income-relationship-sex colored mosaic plot

Signature of MosaicPlot

MosaicPlot takes various options for tweaking the labels placement and style. Here is the Mathematica command:

MosaicPlot[censusData[[All, {9, 3, 5, 14}]], "Gap" -> 0.014,
"ColumnNamesOffset" -> 0.07,
"ColumnNames" ->
Map[Style[#, Blue, FontSize -> 15] &, columnNames[[{9, 3, 5, 14}]]],
"LabelRotation" -> {{3, 1}, {1, 1}}, ImageSize -> 900]

with which the following mosaic plot was made:
Adult census income data sex-education-maritalStatus-income mosaic plot colored

The option “Gap” used to regulate the gaps between the rectangle. The options “ColumnNames” and “ColumnNamesOffset” are for the specification of the variable names (in blue in the plot). The option “LabelRotation” specifies the rotation of the labels that correspond to the individual values of the variables. Also, MosaicPlot takes all the options of Graphics (since it is based on it).

Tooltip tables

The function MosaicPlot has an interactive feature using Tooltip that gives a table with the exact co-occurrence (contingency) values when hovering with the mouse over the rectangles. Here is an example:
Adult census income data sex-education-income colored mosaic plot with tooltips

Future plans

The current implementation of MosaicPlot uses coloring of the rectangles for easier plot reading. An alternative is to use coloring based on correlations statistics. I think though that the tooltip contingency tables with flexible coloring specification make the correlation coloring less needed.

Tries with frequencies for data mining

In computer science a trie (or a prefix tree) is a data structure that allows quick retrieval of associations between a set of keys and a set of values. It is assumed that each key is a sequence of elements. (Most often the keys are considered to be strings i.e. sequences of characters.) The trie nodes do not store the keys. Instead the position of a node in a trie corresponds to a key.

For example, if we have the list of words

{“war”, “ward”, “warden”, “work”, “car”, “cars”, “carbs”}

we can make the trie:
Trie

We can further shrink the trie in order to find the internal “prefixes”:
Shrunk trie

Another transformation of a trie is to replace the numbers of appearances with probabilities — each value of a node is the probability to arrive to that node from the parent:

Shrunk trie with probabilities

Using this last trie we can take a Bayesian perspective and ask questions like: Given that a word starts with “ca” what is the probability the word to end with “s” or “r”? (Within the set of words we built the trie with.)

This document has an introduction to tries with frequencies and it is also a guide to using the package described below. The document has lots of examples and illustrative graph plots; it also discusses how to answer the Bayesian questions.

Several years ago I started experimenting with applying tries to data mining problems. During the last few months I productized my Mathematica trie implementations into a package — see the package TriesWithFrequencies.m provided by the MathematicaForPrediction project at GitHub. The package is for creation, manipulation, and visualization of tries in which the key values are frequencies of appearance in the data that is data mined. The package can also be used to build and use tries as associative arrays.

Using tries for data mining comes from the general idea of utilizing NLP techniques and algorithms in non-language problems. Such an utilization is possible because, philosophically speaking, in the current Big Data + Facebook phase of our civilization every entity can be, and it is, characterzied with tags. If we have a collection of events and they are tagged properly with a good system of tags we can data mine sequences of events with prefix trees.