Tries with frequencies in Java

Introduction

This blog post describes the installation and use in Mathematica of Tries with frequencies [1] implemented in Java [2] through a corresponding Mathematica package [3].

Prefix tree or Trie, [6], is a tree data structure that stores a set of "words" that consist of "characters" — each element can be seen as a key to itself. The article [1] and packages [2,3,4] extend that data structure to have additional data (frequencies) associated with each key.

The packages [2,3] work with lists of strings only. The package [4] can work with more general data but it is much slower.

The main motivation to create the package [3] was to bring the fast Trie functions implementations of [2] into Mathematica in order to prototype, implement, and experiment with different text processing algorithms. (Like, inductive grammar parsers generation and entity name recognition.) The speed of combining [2] and [3] is evaluated in the section "Performance tests" below.

Set-up

This following directory path has to have the jar file "TriesWithFrequencies.jar".

$JavaTriesWithFrequenciesPath = 
  "/Users/antonov/MathFiles/MathematicaForPrediction/Java/TriesWithFrequencies";
FileExistsQ[
 FileNameJoin[{$JavaTriesWithFrequenciesPath, "TriesWithFrequencies.jar"}]]

(* True *)

For more details see the explanations in the README file in the GitHub directory of [2].

The following directory is expected to have the Mathematica package [3].

dirName = "/Users/antonov/MathFiles/MathematicaForPrediction";
FileExistsQ[FileNameJoin[{dirName, "JavaTriesWithFrequencies.m"}]]

(* True *)

AppendTo[$Path, dirName];
Needs["JavaTriesWithFrequencies`"]

This commands installs Java (via JLink`) and loads the necessary Java libraries.

JavaTrieInstall[$JavaTriesWithFrequenciesPath]

Basic examples

For brevity the basic examples are not included in this blog post. Here is album of images that shows the "JavaTrie.*" commands with their effects:

"JavaTrieExample" .

More detailed explanations can be found in the Markdown document, [7]:

Next, we are going to look into performance evaluation examples (also given in [7].)

Membership of words

Assume we want find the words of "Hamlet" that are not in the book "Origin of Species". This section shows that the Java trie creation and query times for this task are quite small.

Read words

The following code reads the words in the texts. We get 33000 words from "Hamlet" and 151000 words from "Origin of Species".

hWords =
  Block[{words},
   words = 
    StringSplit[
     ExampleData[{"Text", "Hamlet"}], {Whitespace, 
      PunctuationCharacter}];
   words = Select[ToLowerCase[words], StringLength[#] > 0 &]
   ];
Length[hWords]

(* 32832 *)

osWords =
  Block[{words},
   words = 
    StringSplit[
     ExampleData[{"Text", "OriginOfSpecies"}], {Whitespace, 
      PunctuationCharacter}];
   words = Select[ToLowerCase[words], StringLength[#] > 0 &]
   ];
Length[osWords]

(* 151205 *)

Membership

First we create trie with "Origin of species" words:

AbsoluteTiming[
 jOStr = JavaTrieCreateBySplit[osWords];
]

(* {0.682531, Null} *)

Sanity check — the "Origin of species" words are in the trie:

AbsoluteTiming[
 And @@ JavaObjectToExpression[
   JavaTrieContains[jOStr, Characters /@ osWords]]
]

(* {1.32224, True} *)

Membership of "Hamlet" words into "Origin of Species":

AbsoluteTiming[
 res = JavaObjectToExpression[
    JavaTrieContains[jOStr, Characters /@ hWords]];
]

(* {0.265307, Null} *)

Tallies of belonging:

Tally[res]

(* {{True, 24924}, {False, 7908}} *)

Sample of words from "Hamlet" that do not belong to "Origin of Species":

RandomSample[Pick[hWords, Not /@ res], 30]

(* {"rosencrantz", "your", "mar", "airy", "rub", "honesty", \
"ambassadors", "oph", "returns", "pale", "virtue", "laertes", \
"villain", "ham", "earnest", "trail", "unhand", "quit", "your", \
"your", "fishmonger", "groaning", "your", "wake", "thou", "liest", \
"polonius", "upshot", "drowned", "grosser"} *)

Common words sample:

RandomSample[Pick[hWords, res], 30]

(* {"as", "indeed", "it", "with", "wild", "will", "to", "good", "so", \
"dirt", "the", "come", "not", "or", "but", "the", "why", "my", "to", \
"he", "and", "you", "it", "to", "potent", "said", "the", "are", \
"question", "soft"} *)

Statistics

The node counts statistics calculation is fast:

AbsoluteTiming[
 JavaTrieNodeCounts[jOStr]
]

(* {0.002344, <|"total" -> 20723, "internal" -> 15484, "leaves" -> 5239|>} *)

The node counts statistics computation after shrinking is comparably fast :

AbsoluteTiming[
 JavaTrieNodeCounts[JavaTrieShrink[jOStr]]
]

(* {0.00539, <|"total" -> 8918,  "internal" -> 3679, "leaves" -> 5239|>} *)

The conversion of a large trie to JSON and computing statistics over the obtained tree is reasonably fast:

AbsoluteTiming[
 res = JavaTrieToJSON[jOStr];
]

(* {0.557221, Null} *)

AbsoluteTiming[
 Quantile[
  Cases[res, ("value" -> v_) :> v, \[Infinity]], 
  Range[0, 1, 0.1]]
]

(* {0.019644, {1., 1., 1., 1., 2., 3., 5., 9., 17., 42., 151205.}} *)

Dictionary infixes

Get all words from a dictionary:

allWords =  DictionaryLookup["*"];
allWords // Length

(* 92518 *)

Trie creation and shrinking:

AbsoluteTiming[
 jDTrie = JavaTrieCreateBySplit[allWords];
 jDShTrie = JavaTrieShrink[jDTrie];
]

(* {0.30508, Null} *)

JSON form extraction:

AbsoluteTiming[
 jsonRes = JavaTrieToJSON[jDShTrie];
]

(* {3.85955, Null} *)

Here are the node statistics of the original and shrunk tries:

"Orginal-trie-vs-Shrunk-trie-Node-Counts"

Find the infixes that have more than three characters and appear more than 10 times:

Multicolumn[#, 4] &@
 Select[SortBy[
   Tally[Cases[
     jsonRes, ("key" -> v_) :> v, Infinity]], -#[[-1]] &], StringLength[#[[1]]] > 3 && #[[2]] > 10 &]
"Long-infixes-in-shrunk-dictionary-trie"

Unit tests

Many of example shown in this document have corresponding tests in the file JavaTriesWithFrequencies-Unit-Tests.wlt hosted at GitHub.

tr = TestReport[
  dirName <> "/UnitTests/JavaTriesWithFrequencies-Unit-Tests.wlt"]
"TestReport"

References

[1] Anton Antonov, "Tries with frequencies for data mining", (2013), MathematicaForPrediction at WordPress blog. URL: https://mathematicaforprediction.wordpress.com/2013/12/06/tries-with-frequencies-for-data-mining/ .

[2] Anton Antonov, Tries with frequencies in Java, (2017), source code at MathematicaForPrediction at GitHub, project Java/TriesWithFrequencies.

[3] Anton Antonov, Java tries with frequencies Mathematica package, (2017), source code at MathematicaForPrediction at GitHub, package JavaTriesWithFrequencies.m .

[4] Anton Antonov, Tries with frequencies Mathematica package, (2013), source code at MathematicaForPrediction at GitHub, package TriesWithFrequencies.m .

[5] Anton Antonov, Java tries with frequencies Mathematica unit tests, (2017), source code at MathematicaForPrediction at GitHub, unit tests file JavaTriesWithFrequencies-Unit-Tests.wlt .

[6] Wikipedia, Trie, https://en.wikipedia.org/wiki/Trie .

[7] Anton Antonov, "Tries with frequencies in Java", (2017), MathematicaForPrediction at GitHub.