Tries with frequencies in Java


This blog post describes the installation and use in Mathematica of Tries with frequencies [1] implemented in Java [2] through a corresponding Mathematica package [3].

Prefix tree or Trie, [6], is a tree data structure that stores a set of "words" that consist of "characters" — each element can be seen as a key to itself. The article [1] and packages [2,3,4] extend that data structure to have additional data (frequencies) associated with each key.

The packages [2,3] work with lists of strings only. The package [4] can work with more general data but it is much slower.

The main motivation to create the package [3] was to bring the fast Trie functions implementations of [2] into Mathematica in order to prototype, implement, and experiment with different text processing algorithms. (Like, inductive grammar parsers generation and entity name recognition.) The speed of combining [2] and [3] is evaluated in the section "Performance tests" below.


This following directory path has to have the jar file "TriesWithFrequencies.jar".

$JavaTriesWithFrequenciesPath = 
 FileNameJoin[{$JavaTriesWithFrequenciesPath, "TriesWithFrequencies.jar"}]]

(* True *)

For more details see the explanations in the README file in the GitHub directory of [2].

The following directory is expected to have the Mathematica package [3].

dirName = "/Users/antonov/MathFiles/MathematicaForPrediction";
FileExistsQ[FileNameJoin[{dirName, "JavaTriesWithFrequencies.m"}]]

(* True *)

AppendTo[$Path, dirName];

This commands installs Java (via JLink`) and loads the necessary Java libraries.


Basic examples

For brevity the basic examples are not included in this blog post. Here is album of images that shows the "JavaTrie.*" commands with their effects:

"JavaTrieExample" .

More detailed explanations can be found in the Markdown document, [7]:

Next, we are going to look into performance evaluation examples (also given in [7].)

Membership of words

Assume we want find the words of "Hamlet" that are not in the book "Origin of Species". This section shows that the Java trie creation and query times for this task are quite small.

Read words

The following code reads the words in the texts. We get 33000 words from "Hamlet" and 151000 words from "Origin of Species".

hWords =
   words = 
     ExampleData[{"Text", "Hamlet"}], {Whitespace, 
   words = Select[ToLowerCase[words], StringLength[#] > 0 &]

(* 32832 *)

osWords =
   words = 
     ExampleData[{"Text", "OriginOfSpecies"}], {Whitespace, 
   words = Select[ToLowerCase[words], StringLength[#] > 0 &]

(* 151205 *)


First we create trie with "Origin of species" words:

 jOStr = JavaTrieCreateBySplit[osWords];

(* {0.682531, Null} *)

Sanity check — the "Origin of species" words are in the trie:

 And @@ JavaObjectToExpression[
   JavaTrieContains[jOStr, Characters /@ osWords]]

(* {1.32224, True} *)

Membership of "Hamlet" words into "Origin of Species":

 res = JavaObjectToExpression[
    JavaTrieContains[jOStr, Characters /@ hWords]];

(* {0.265307, Null} *)

Tallies of belonging:


(* {{True, 24924}, {False, 7908}} *)

Sample of words from "Hamlet" that do not belong to "Origin of Species":

RandomSample[Pick[hWords, Not /@ res], 30]

(* {"rosencrantz", "your", "mar", "airy", "rub", "honesty", \
"ambassadors", "oph", "returns", "pale", "virtue", "laertes", \
"villain", "ham", "earnest", "trail", "unhand", "quit", "your", \
"your", "fishmonger", "groaning", "your", "wake", "thou", "liest", \
"polonius", "upshot", "drowned", "grosser"} *)

Common words sample:

RandomSample[Pick[hWords, res], 30]

(* {"as", "indeed", "it", "with", "wild", "will", "to", "good", "so", \
"dirt", "the", "come", "not", "or", "but", "the", "why", "my", "to", \
"he", "and", "you", "it", "to", "potent", "said", "the", "are", \
"question", "soft"} *)


The node counts statistics calculation is fast:


(* {0.002344, <|"total" -> 20723, "internal" -> 15484, "leaves" -> 5239|>} *)

The node counts statistics computation after shrinking is comparably fast :


(* {0.00539, <|"total" -> 8918,  "internal" -> 3679, "leaves" -> 5239|>} *)

The conversion of a large trie to JSON and computing statistics over the obtained tree is reasonably fast:

 res = JavaTrieToJSON[jOStr];

(* {0.557221, Null} *)

  Cases[res, ("value" -> v_) :> v, \[Infinity]], 
  Range[0, 1, 0.1]]

(* {0.019644, {1., 1., 1., 1., 2., 3., 5., 9., 17., 42., 151205.}} *)

Dictionary infixes

Get all words from a dictionary:

allWords =  DictionaryLookup["*"];
allWords // Length

(* 92518 *)

Trie creation and shrinking:

 jDTrie = JavaTrieCreateBySplit[allWords];
 jDShTrie = JavaTrieShrink[jDTrie];

(* {0.30508, Null} *)

JSON form extraction:

 jsonRes = JavaTrieToJSON[jDShTrie];

(* {3.85955, Null} *)

Here are the node statistics of the original and shrunk tries:


Find the infixes that have more than three characters and appear more than 10 times:

Multicolumn[#, 4] &@
     jsonRes, ("key" -> v_) :> v, Infinity]], -#[[-1]] &], StringLength[#[[1]]] > 3 && #[[2]] > 10 &]

Unit tests

Many of example shown in this document have corresponding tests in the file JavaTriesWithFrequencies-Unit-Tests.wlt hosted at GitHub.

tr = TestReport[
  dirName <> "/UnitTests/JavaTriesWithFrequencies-Unit-Tests.wlt"]


[1] Anton Antonov, "Tries with frequencies for data mining", (2013), MathematicaForPrediction at WordPress blog. URL: .

[2] Anton Antonov, Tries with frequencies in Java, (2017), source code at MathematicaForPrediction at GitHub, project Java/TriesWithFrequencies.

[3] Anton Antonov, Java tries with frequencies Mathematica package, (2017), source code at MathematicaForPrediction at GitHub, package JavaTriesWithFrequencies.m .

[4] Anton Antonov, Tries with frequencies Mathematica package, (2013), source code at MathematicaForPrediction at GitHub, package TriesWithFrequencies.m .

[5] Anton Antonov, Java tries with frequencies Mathematica unit tests, (2017), source code at MathematicaForPrediction at GitHub, unit tests file JavaTriesWithFrequencies-Unit-Tests.wlt .

[6] Wikipedia, Trie, .

[7] Anton Antonov, "Tries with frequencies in Java", (2017), MathematicaForPrediction at GitHub.

Text analysis of Trump tweets


This post is to proclaim the MathematicaVsR at GitHub project “Text analysis of Trump tweets” in which we compare Mathematica and R over text analyses of Twitter messages made by Donald Trump (and his staff) before the USA president elections in 2016.

The project follows and extends the exposition and analysis of the R-based blog post "Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half" by David Robinson at; see [1].

The blog post [1] links to several sources that claim that during the election campaign Donald Trump tweeted from his Android phone and his campaign staff tweeted from an iPhone. The blog post [1] examines this hypothesis in a quantitative way (using various R packages.)

The hypothesis in question is well summarized with the tweet:

Every non-hyperbolic tweet is from iPhone (his staff).
Every hyperbolic tweet is from Android (from him).
— Todd Vaziri (@tvaziri) August 6, 2016

This conjecture is fairly well supported by the following mosaic plots, [2]:

TextAnalysisOfTrumpTweets-iPhone-MosaicPlot-Sentiment-Device TextAnalysisOfTrumpTweets-iPhone-MosaicPlot-Device-Weekday-Sentiment

We can see the that Twitter messages from iPhone are much more likely to be neutral, and the ones from Android are much more polarized. As Christian Rudder (one of the founders of OkCupid, a dating website) explains in the chapter "Death by a Thousand Mehs" of the book "Dataclysm", [3], having a polarizing image (online persona) is as a very good strategy to engage online audience:

[…] And the effect isn’t small-being highly polarizing will in fact get you about 70 percent more messages. That means variance allows you to effectively jump several "leagues" up in the dating pecking order – […]

(The mosaic plots above were made for the Mathematica-part of this project. Mosaic plots and weekday tags are not used in [1].)

Concrete steps

The Mathematica-part of this project does not follow closely the blog post [1]. After the ingestion of the data provided in [1], the Mathematica-part applies alternative algorithms to support and extend the analysis in [1].

The sections in the R-part notebook correspond to some — not all — of the sections in the Mathematica-part.

The following list of steps is for the Mathematica-part.

  1. Data ingestion
    • The blog post [1] shows how to do in R the ingestion of Twitter data of Donald Trump messages.

    • That can be done in Mathematica too using the built-in function ServiceConnect, but that is not necessary since [1] provides a link to the ingested data used [1]:

    • Which leads to the ingesting of an R data frame in the Mathematica-part using RLink.

  2. Adding tags

    • We have to extract device tags for the messages — each message is associated with one of the tags "Android", "iPad", or "iPhone".

    • Using the message time-stamps each message is associated with time tags corresponding to the creation time month, hour, weekday, etc.

    • Here is summary of the data at this stage:


  3. Time series and time related distributions

    • We can make several types of time series plots for general insight and to support the main conjecture.

    • Here is a Mathematica made plot for the same statistic computed in [1] that shows differences in tweet posting behavior:


    • Here are distributions plots of tweets per weekday:


  4. Classification into sentiments and Facebook topics

    • Using the built-in classifiers of Mathematica each tweet message is associated with a sentiment tag and a Facebook topic tag.

    • In [1] the results of this step are derived in several stages.

    • Here is a mosaic plot for conditional probabilities of devices, topics, and sentiments:


  5. Device-word association rules

    • Using Association rule learning device tags are associated with words in the tweets.

    • In the Mathematica-part these associations rules are not needed for the sentiment analysis (because of the built-in classifiers.)

    • The association rule mining is done mostly to support and extend the text analysis in [1] and, of course, for comparison purposes.

    • Here is an example of derived association rules together with their most important measures:


In [1] the sentiments are derived from computed device-word associations, so in [1] the order of steps is 1-2-3-5-4. In Mathematica we do not need the steps 3 and 5 in order to get the sentiments in the 4th step.


Using Mathematica for sentiment analysis is much more direct because of the built-in classifiers.

The R-based blog post [1] uses heavily the "pipeline" operator %>% which is kind of a recent addition to R (and it is both fashionable and convenient to use it.) In Mathematica the related operators are Postfix (//), Prefix (@), Infix (~~), Composition (@*), and RightComposition (/*).

Making the time series plots with the R package "ggplot2" requires making special data frames. I am inclined to think that the Mathematica plotting of time series is more direct, but for this task the data wrangling codes in Mathematica and R are fairly comparable.

Generally speaking, the R package "arules" — used in this project for Associations rule learning — is somewhat awkward to use:

  • it is data frame centric, does not work directly with lists of lists, and

  • requires the use of factors.

The Apriori implementation in “arules” is much faster than the one in “AprioriAlgorithm.m” — “arules” uses a more efficient algorithm implemented in C.


[1] David Robinson, "Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half", (2016),

[2] Anton Antonov, "Mosaic plots for data visualization", (2014), MathematicaForPrediction at WordPress.

[3] Christian Rudder, Dataclysm, Crown, 2014. ASIN: B00J1IQUX8 .

Pareto principle adherence examples

This post (document) is made to provide examples of the Pareto principle manifestation in different datasets.

The Pareto principle is an interesting law that manifests in many contexts. It is also known as "Pareto law", "the law of significant few", "the 80-20 rule".

For example:

  • "80% of the land is owned by 20% of the population",

  • "10% of all lakes contain 90% of all lake water."

For extensive discussion and studied examples see the Wikipedia entry "Pareto principle", [4].

It is a good idea to see for which parts of the analyzed data the Pareto principle manifests. Testing for the Pareto principle is usually simple. For example, assume that we have the GDP of all countries:

countries = CountryData["Countries"];
gdps = {CountryData[#, "Name"], CountryData[#, "GDP"]} & /@ countries;
gdps = DeleteCases[gdps, {_, _Missing}] /. Quantity[x_, _] :> x;

Grid[{RecordsSummary[gdps, {"country", "GDP"}]}, Alignment -> Top, Dividers -> All]


In order to test for the manifestation of the Pareto principle we have to (i) sort the GDP values in descending order, (ii) find the cumulative sums, (iii) normalize the obtained vector by the sum of all values, and (iv) plot the result. These steps are done with the following two commands:

t = Reverse@Sort@gdps[[All, 2]];
ListPlot[Accumulate[t]/Total[t], PlotRange -> All, GridLines -> {{0.2} Length[t], {0.8}}, Frame -> True]


In this document we are going to use the special function ParetoLawPlot defined in the next section and the package [1]. Most of the examples use data that is internally accessible within Mathematica. Several external data examples are considered.

See the package [1] for the function RecordsSummary. See the source file [2] for R functions that facilitate the plotting of Pareto principle graphs. See the package [3] for the outlier detection functions used below.


This simple function makes a list plot that would help assessing the manifestation of the Pareto principle. It takes the same options as ListPlot.

Options[ParetoLawPlot] = Options[ListPlot];
ParetoLawPlot[dataVec : {_?NumberQ ..}, opts : OptionsPattern[]] := ParetoLawPlot[{Tooltip[dataVec, 1]}, opts];
ParetoLawPlot[dataVecs : {{_?NumberQ ..} ..}, opts : OptionsPattern[]] := ParetoLawPlot[MapThread[Tooltip, {dataVecs, Range[Length[dataVecs]]}], opts];
ParetoLawPlot[dataVecs : {Tooltip[{_?NumberQ ..}, _] ..}, opts : OptionsPattern[]] :=
  Block[{t, mc = 0.5},
   t = Map[Tooltip[(Accumulate[#]/Total[#] &)[SortBy[#[[1]], -# &]], #[[2]]] &, dataVecs];
   ListPlot[t, opts, PlotRange -> All, GridLines -> {Length[t[[1, 1]]] Range[0.1, mc, 0.1], {0.8}}, Frame -> True, FrameTicks -> {{Automatic, Automatic}, {Automatic, Table[{Length[t[[1, 1]]] c, ToString[Round[100 c]] <> "%"}, {c, Range[0.1, mc, 0.1]}]}}]

This function is useful for coloring the outliers in the list plots.

ColorPlotOutliers[] := # /. {Point[ps_] :> {Point[ps], Red, Point[ps[[OutlierPosition[ps[[All, 2]]]]]]}} &;
ColorPlotOutliers[oid_] := # /. {Point[ps_] :> {Point[ps], Red, Point[ps[[OutlierPosition[ps[[All, 2]], oid]]]]}} &;

These definitions can be also obtained by loading the packages MathematicaForPredictionUtilities.m and OutlierIdentifiers.m; see [1,3].



Below we are going to use the metric system of units. (If preferred we can easily switch to the imperial system.)

$UnitSystem = "Metric";(*"Imperial"*)


We are going to consider a typical Pareto principle example — weatlh of income distribution.


This code find the Gross Domestic Product (GDP) of different countries:

gdps = {CountryData[#, "Name"], CountryData[#, "GDP"]} & /@CountryData["Countries"];
gdps = DeleteCases[gdps, {_, _Missing}] /. Quantity[x_, _] :> x;

The corresponding Pareto plot (note the default grid lines) shows that 10% of countries have 90% of the wealth:

ParetoLawPlot[gdps[[All, 2]], ImageSize -> 400]


Here is the log histogram of the GDP values.

Histogram[Log10@gdps[[All, 2]], 20, PlotRange -> All]


The following code shows the log plot of countries GDP values and the found outliers.

 DynamicModule[{data = Transpose[{Range[Length[gdps]], Sort[gdps[[All, 2]]]}], pos},
  pos = OutlierPosition[modFunc@data[[All, 2]], tb@*opar];
  If[Length[pos] > 0,
   ListLogPlot[{data, data[[pos]]}, PlotRange -> All, PlotTheme -> "Detailed", FrameLabel -> {"Index", "GDP"}, PlotLegends -> SwatchLegend[{"All", "Outliers"}]],
   ListLogPlot[{data}, PlotRange -> All, PlotTheme -> "Detailed", FrameLabel -> {"Index", "GDP"}, PlotLegends -> SwatchLegend[{"All", "Outliers"}]]
 {{opar, SPLUSQuartileIdentifierParameters, "outliers detector"}, {HampelIdentifierParameters, SPLUSQuartileIdentifierParameters}},
 {{tb, TopOutliers, "bottom|top"}, {BottomOutliers, TopOutliers}},
 {{modFunc, Identity, "data modifier function"}, {Identity, Log}}


This table gives the values for countries with highest GDP.

Block[{data = gdps[[OutlierPosition[gdps[[All, 2]], TopOutliers@*SPLUSQuartileIdentifierParameters]]]},
 Row[Riffle[#, " "]] &@Map[Grid[#, Dividers -> All, Alignment -> {Left, "."}] &, Partition[SortBy[data, -#[[-1]] &], Floor[Length[data]/3]]]



Similar data retrieval and plots can be made for countries populations.

pops = {CountryData[#, "Name"], CountryData[#, "Population"]} & /@CountryData["Countries"];
unit = QuantityUnit[pops[[All, 2]]][[1]];
pops = DeleteCases[pops, {_, _Missing}] /. Quantity[x_, _] :> x;

In the following Pareto plot we can see that 15% of countries have 80% of the total population:

ParetoLawPlot[pops[[All, 2]], PlotLabel -> Row[{"Population", ", ", unit}]]


Here are the countries with most people:

Block[{data = pops[[OutlierPosition[pops[[All, 2]], TopOutliers@*SPLUSQuartileIdentifierParameters]]]},
 Row[Riffle[#, " "]] &@Map[Grid[#, Dividers -> All, Alignment -> {Left, "."}] &, Partition[SortBy[data, -#[[-1]] &], Floor[Length[data]/3]]]



We can also see that the Pareto principle holds for the countries areas:

areas = {CountryData[#, "Name"], CountryData[#, "Area"]} & /@CountryData["Countries"];
areas = DeleteCases[areas, {_, _Missing}] /. Quantity[x_, _] :> x;
ParetoLawPlot[areas[[All, 2]]]


Block[{data = areas[[OutlierPosition[areas[[All, 2]], TopOutliers@*SPLUSQuartileIdentifierParameters]]]},
 Row[Riffle[#, " "]] &@Map[Grid[#, Dividers -> All, Alignment -> {Left, "."}] &, Partition[SortBy[data, -#[[-1]] &], Floor[Length[data]/3]]]


Time series-wise

An interesting diagram is to plot together the curves of GDP changes for different countries. We can see China and Poland have had rapid growth.

res = Table[
    (t = CountryData[countryName, {{"GDP"}, {1970, 2015}}];
     t = Reverse@Sort[t["Path"][[All, 2]] /. Quantity[x_, _] :> x];
     Tooltip[t, countryName])
    , {countryName, {"USA", "China", "Poland", "Germany", "France", "Denmark"}}];

ParetoLawPlot[res, PlotRange -> All, Joined -> True, PlotLegends -> res[[All, 2]]]



This dynamic interface can be used for a given country to compare (i) the GDP evolution in time and (ii) the corresponding Pareto plot.

 DynamicModule[{ts, t},
  ts = CountryData[countryName, {{"GDP"}, {1970, 2015}}];
  t = Reverse@Sort[ts["Path"][[All, 2]] /. Quantity[x_, _] :> x];
  Grid[{{"Date list plot of GDP values", "GDP Pareto plot"}, {DateListPlot[ts, ImageSize -> Medium],
     ParetoLawPlot[t, ImageSize -> Medium]}}]
 ], {countryName, {"USA", "China", "Poland", "Germany", "France", "Denmark"}}]


Country flag colors

The following code demonstrates that the colors of the pixels in country flags also adhere to the Pareto principle.

flags = CountryData[#, "Flag"] & /@ CountryData["Countries"];

flags[[1 ;; 12]]


ids = ImageData /@ flags;

pixels = Apply[Join, Flatten[ids, 1]];

ToBinFunc[x_] := Evaluate[Piecewise[MapIndexed[{#2[[1]], #1[[1]] < x <= #1[[2]]} &, Partition[Range[0, 1, 0.1], 2, 1]]]];

pixelsInt = Transpose@Table[Map[ToBinFunc, pixels[[All, i]]], {i, 1, 3}];

pixelsIntTally = SortBy[Tally[pixelsInt], -#[[-1]] &];

ParetoLawPlot[pixelsIntTally[[All, 2]]]



Loking at lengths in the tunnel data we can see the manifestation of an exaggerated Pareto principle.

tunnelLengths = TunnelData[All, {"Name", "Length"}];
tunnelLengths // Length

(* 1552 *)

t = Reverse[Sort[DeleteMissing[tunnelLengths[[All, -1]]] /. Quantity[x_, _] :> x]];



Here is the logarithmic histogram of the lengths:

Histogram[Log10@t, PlotRange -> All, PlotTheme -> "Detailed"]



The following code gathers the data and make the Pareto plots surface areas, volumes, and fish catch values for lakes. We can that the lakes volumes show exaggerated Pareto principle.

lakeAreas = LakeData[All, "SurfaceArea"];
lakeVolumes = LakeData[All, "Volume"];
lakeFishCatch = LakeData[All, "CommercialFishCatch"];

data = {lakeAreas, lakeVolumes, lakeFishCatch};
t = N@Map[DeleteMissing, data] /. Quantity[x_, _] :> x;

opts = {PlotRange -> All, ImageSize -> Medium}; MapThread[ParetoLawPlot[#1, PlotLabel -> Row[{#2, ", ", #3}], opts] &, {t, {"Lake area", "Lake volume", "Commercial fish catch"}, DeleteMissing[#][[1, 2]] & /@ data}]


City data

One of the examples given in [5] is that the city areas obey the Power Law. Since the Pareto principle is a kind of Power Law we can confirm that observation using Pareto principle plots.

The following grid of Pareto principle plots is for areas and population sizes of cities in selected states of USA.

res = Table[
    (cities = CityData[{All, stateName, "USA"}];
     t = Transpose@Outer[CityData, cities, {"Area", "Population"}];
     t = Map[DeleteMissing[#] /. Quantity[x_, _] :> x &, t, {1}];
     ParetoLawPlot[MapThread[Tooltip, {t, {"Area", "Population"}}], PlotLabel -> stateName, ImageSize -> 250])
    , {stateName, {"Alabama", "California", "Florida", "Georgia", "Illinois", "Iowa", "Kentucky", "Ohio", "Tennessee"}}];

Legended[Grid[ArrayReshape[res, {3, 3}]], SwatchLegend[Cases[res[[1]], _RGBColor, Infinity], {"Area", "Population"}]]


Movie ratings in MovieLens datasets

Looking into the MovieLens 20M dataset, [6], we can see that the Pareto princple holds for (1) most rated movies and (2) most active users. We can also see the manifestation of an exaggerated Pareto law — 90% of all ratings are for 10% of the movies.



The following plot taken from the blog post "PIN analysis", [7], shows that the four digit passwords people use adhere to the Pareto principle: the first 20% of (the unique) most frequently used passwords correspond to the 70% of all passwords use.




[1] Anton Antonov, "MathematicaForPrediction utilities", (2014), source code MathematicaForPrediction at GitHub,, package MathematicaForPredictionUtilities.m.

[2] Anton Antonov, Pareto principle functions in R, source code MathematicaForPrediction at GitHub,, source code file ParetoLawFunctions.R .

[3] Anton Antonov, Implementation of one dimensional outlier identifying algorithms in Mathematica, (2013), MathematicaForPrediction at GitHub, URL: .

[4] Wikipedia entry, "Pareto principle", URL: .

[5] Wikipedia entry, "Power law", URL: .

[6] GroupLens Research, MovieLens 20M Dataset, (2015).

[7] "PIN analysis", (2012), DataGenetics.

Contingency tables creation examples


In statistics contingency tables are matrices used to show the co-occurrence of variable values of multi-dimensional data. They are fundamental in many types of research. This document shows how to use several functions implemented in Mathematica for the construction of contingency tables.


In this document we are going to use the implementations in the package MathematicaForPredictionUtilities.m from MathematicaForPrediction at GitHub, [1].


The implementation of CrossTabulate in [1] is based on Tally, GatherBy, and SparseArray. The implementation of xtabsViaRLink in [1] is based on R‘s function xtabs called via RLink.

Other package used in this document are [2] and [4] imported with these commands:


For a different approach to implementing cross-tabulation than those taken in [1] see the Stack Overflow answer by Mr.Wizard.

Using Titanic data

Getting data

titanicData = 
  Flatten@*List @@@ ExampleData[{"MachineLearning", "Titanic"}, "Data"];
titanicData = DeleteCases[titanicData, {___, _Missing, ___}];

titanicColumnNames = 
  Flatten@*List @@ ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"];
aTitanicColumnNames = 
  AssociationThread[titanicColumnNames -> Range[Length[titanicColumnNames]]];

Note that we have removed the records with missing data (for simpler exposition).

Data summary

(* {1046, 4} *)

RecordsSummary[titanicData, titanicColumnNames]


Using CrossTabulate

Assume that we want to group the people according to their passenger class and survival and we want to find the average age for each group.

We can do that by

1. finding the counts contingency table C for the variables "passenger class" and "passenger survival",

2. finding the age contingency table A for the same variables, and

3. do the element-wise division \frac{A}{C}.

ctCounts = 
   titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctCounts


ctTotalAge = 
    aTitanicColumnNames /@ {"passenger class", "passenger survival", 
      "passenger age"}]]];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctTotalAge


 TableHeadings -> Values[Rest[ctTotalAge]]]


(We have to make the sparse array ctCounts a regular array because otherwise we get warnings for division by zero because of the sparse array’s default value.)

Let us repeat the steps above by dividing the passengers before-hand according to their sex.

  (mCount = 
     CrossTabulate[#[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]]; 
   mAge = CrossTabulate[#[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival", "passenger age"}]]];
   #[[1,  aTitanicColumnNames["passenger sex"]]] -> 
     MatrixForm[mAge[[1]]/Normal[mCount[[1]]], TableHeadings -> Values[Rest[mAge]]]) &, 
  GatherBy[titanicData, #[[aTitanicColumnNames["passenger sex"]]] &]]


The alternative of CrossTabulate is xtabsViaRLink that is uses R’s function xtabs via RLink.


(* {Paclet[RLinkRuntime,, <>]} *)

ctCounts = 
    titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]],
    {"", "passenger.survival"},
    " ~ + passenger.survival"];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctCounts


Relation to mosaic plots

The graphical visualization of a dataset with mosaic plots, [2,3], is similar in spirit to contingency tables. Compare the following mosaic plot with the contingency table in the last section.

 titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]] ]


Straightforward calling of MatrixForm

At this point we might want to be able to call MatrixForm more directly for the output of CrossTabulate and FromRXTabsForm. One way to do this is to define an up-value for Association .

MatrixForm[x_Association /; (KeyExistsQ[x, "XTABMatrix"] || KeyExistsQ[x, "XTABTensor"])] ^:= (MatrixForm[#1, TableHeadings -> Rest[{##}]] & @@ x);

Now we can do this:

MatrixForm @
 CrossTabulate[titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]]


Remark: Because of this up-value definition for Association with MatrixForm we have the associations returned by CrossTabulate and FromRXTabsForm to have the key "XTABMatrix" instead of "Matrix", the former assumed to be much more rarely to be used than the latter.

Using larger data

Let us consider an example with larger data that has larger number of unique values in its columns.

Getting online retail invoices data

The following dataset is taken from [6].

data = Import[ "/Volumes/WhiteSlimSeagate/Datasets/UCI Online Retail Data Set/Online Retail.csv"];
columnNames = First[data];
data = Rest[data];

aColumnNames = AssociationThread[columnNames -> Range[Length[columnNames]]];

Data summary

We have \approx 66000 rows and 8 columns:

(* {65499, 8} *)

Here is a summary of the columns:

Magnify[#, 0.75] &@RecordsSummary[data, columnNames]


Contingency tables

Country vs. StockCode

There is no one-to-one correspondence between the values of the column "Description" and the column "StockCode" which can be seen with this command:

  GatherBy[data[[All, aColumnNames /@ {"Description", "StockCode"}]], First]]
(* {1,144} *)

The way in which the column "StockCode" was ingested made it have multiple types for its values:

Tally[NumberQ /@ data[[All, aColumnNames["StockCode"]]]]
(* {{False,9009},{True,56490}} *)

So let us convert it to all strings:

data[[All, aColumnNames["StockCode"]]] = 
  ToString /@ data[[All, aColumnNames["StockCode"]]];

Here we find the contingency table for "Country" and "StockCode" over "Quantity" using CrossTabulate:

 ctRes = CrossTabulate[
    data[[All, aColumnNames /@ {"Country", "StockCode", "Quantity"}]]];
(* {0.256339,Null} *)

Here we find the contingency table for "Country" and "StockCode" over "Quantity" using xtabsViaRLink:

 rres = xtabsViaRLink[
   data[[All, aColumnNames /@ {"Country", "StockCode", "Quantity"}]],
   {"Country", "StockCode", "Quantity"},
   "Quantity ~ Country + StockCode"]; 
 ctRRes = FromRXTabsForm[rres];
(* {0.843621,Null} *)

Both functions produce the same result:

ctRRes["matrix"] == N@ctRRes[[1]]
(* True *)

Note that xtabsViaRLink is slower but still fairly quick.

Here we plot the contingency table using MatrixPlot :

MatrixPlot[ctRRes["matrix"], AspectRatio -> 1/1.5, 
  FrameTicks -> {{#, #} &@ Table[{i, ctRRes["rownames"][[i]]}, {i, Length[ctRRes["rownames"]]}], 
  {Automatic, Automatic}}, ImageSize -> 550]


Country vs. Quarter

Let us extend the data with columns that have months and quarters corresponding to the invoice dates.

The following commands computing date objects and extracting month and quarter values from them are too slow.

(*AbsoluteTiming[dobjs=DateObject[{#,{"Month","/","Day","/","Year"," \
(* {30.2595,Null} *)

(* {91.1732,Null} *)

We can use the following ad hoc computation instead.

dvals = StringSplit[#, {"/", " ", ":"}] & /@ 

This summary shows that the second value in the dates is day of month, and the first value is most likely month.

Magnify[#, 0.75] &@ RecordsSummary[dvals[[All, 1 ;; 3]], "MaxTallies" -> 16]


These commands extend the data and the corresponding column-name-to-index association.

ms = DateValue[Table[DateObject[{2016, i, 1}], {i, 12}], "MonthName"];
dvals = Map[{ms[[#]], "Q" <> ToString[Quotient[#, 4] + 1]} &, ToExpression @ dvals[[All, 1]]];
dataM = MapThread[Join[#1, #2] &, {data, dvals}];
aColumnNamesM = 
  Join[aColumnNames, <|"MonthName" -> (Length[aColumnNames] + 1), "QuarterNameShort" -> (Length[aColumnNames] + 2)|>];
(* {0.054877,Null} *)

Here is the contingency table for "Country" vs "QuarterNameShort" over "Quantity".

ctRes = CrossTabulate[ dataM[[All, aColumnNamesM /@ {"Country", "QuarterNameShort", "Quantity"}]]];
Magnify[#, 0.75] &@ MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctRes


Uniform tables

Often when making contingency tables over subsets of the data we obtain contingency tables with different rows and columns. For various reasons (programming, esthetics, comprehension) it is better to have the tables with the same rows and columns.

Here is an example of non-uniform contingency tables derived from the online retail data of the previous section. We split the data over the countries and find contingency tables of "MonthName" vs "QuarterNameShort" over "Quantity".

tbs = Association @
    (xtab = CrossTabulate[#[[All, aColumnNamesM /@ {"MonthName", "QuarterNameShort", "Quantity"}]]];
     #[[1, aColumnNamesM["Country"]]] -> xtab) &,
    GatherBy[ dataM, #[[aColumnNamesM[ "Country"]]] &]];

Magnify[#, 0.75] &@
 Map[MatrixForm[#["Matrix"], TableHeadings -> (# /@ {"RowNames", "ColumnNames"})] &, tbs](*[[{1,12,-1}]]*)


Using the object RSparseMatrix, see [4,5], we can impose row and column names on each table.

First we convert the contingency tables into RSparseMatrix objects:

tbs2 = Map[ ToRSparseMatrix[#["Matrix"], "RowNames" -> #["RowNames"], "ColumnNames" -> #["ColumnNames"]] &, tbs];

And then we impose the desired row and column names:

tbs2 = Map[ ImposeColumnNames[ ImposeRowNames[#, {"January", "December"}], {"Q1", "Q2", "Q3", "Q4"}] &, tbs2];
Magnify[#, 0.75] &@(MatrixForm /@ tbs2)


Generalization : CrossTensorate

A generalization of CrossTabulate is the function CrossTensorate implemented in [1] that takes a "formula" argument similar to R’s xtabs.

This finds number of people of different sub-groups of Titanic data:

ctRes = CrossTensorate[Count == "passenger survival" + "passenger sex" + "passenger class", titanicData, aTitanicColumnNames];


We can verify the result using Count:

Count[titanicData, {"1st", _, "female", "died"}]
(* 5 *)

Count[titanicData, {"2nd", _, "male", "survived"}]
(* 23 *)

To split the cross-tensor across its first variable we can use this command:

sctRes = Association@
  MapThread[Rule[#1, Join[<|"XTABTensor" -> #2|>, Rest@Rest@ctRes]] &, {ctRes[[2]], # & /@ ctRes["XTABTensor"]}];
MatrixForm /@ sctRes


Or we can call the more general function CrossTensorateSplit implemented in [1]:

Map[MatrixForm /@ CrossTensorateSplit[ctRes, #] &, Rest@Keys[ctRes]]


CrossTensorateSplit can also be called with one argument that is a variable name.This will produce a splitting function. For example, the above command can be re-written as :

Map[MatrixForm /@ CrossTensorateSplit[#] @ ctRes &, Rest@Keys[ctRes]]


[1] Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m.

[2] Anton Antonov, Mosaic plot for data visualization implementation in Mathematica, (2014), MathematicaForPrediction at GitHub, package MosaicPlot.m.

[3] Anton Antonov, "Mosaic plots for data visualization", (2014), MathematicaForPrediction at WordPress blog. URL: .

[4] Anton Antonov, RSparseMatrix Mathematica package, (2015) MathematicaForPrediction at GitHub. URL: .

[5] Anton Antonov, "RSparseMatrix for sparse matrices with named rows and columns", (2015), MathematicaForPrediction at WordPress blog. URL: .

[6] Daqing Chen, Online Retail Data Set, (2015), UCI Machine Learning Repository. URL: .