Mathematica vs. R at GitHub

In brief

This post is to announce the repository MathematicaVsR at GitHub that has example projects, code, and documents for comparing Mathematica with R.

My plan is to proclaim new completed Mathematica-vs-R projects here, in this blog post, and when appropriate make separate blog posts about them.

Mission statement

The development in the MathematicaVsR at GitHub repository aims to provide a collection of relatively simple but non-trivial example projects that illustrate the use of Mathematica and R in different statistical, machine learning, scientific, and software engineering programming activities.

Each of the projects has implementations and documents made with both Mathematica and R — hopefully that would allow comparison and knowledge transfer.

Where to begin

This presentation, "Mathematica vs. R" given at the Wolfram Technology Conference 2015 is probably a good start.

As a warm-up of how to do the comparison see this mind-map (which is made for Mathematica users):


Projects overview

  1. BrowsingDataWithChernoffFaces

  2. DataWrangling

  3. DistributionExtractionAFromGaussianNoisedMixture

  4. HandwrittenDigitsClassificationByMatrixFactorization

  5. ODEsWithSeasonalities

  6. ProgressiveJackpotModeling

  7. RegressionWithROC

  8. StatementsSaliencyInPodcasts

  9. TextAnalysisOfTrumpTweets

  10. TimeSeriesAnalysisWithQuantileRegression

Future projects

The future projects are listed in order of their completion time proximity — the highest in the list would be committed the soonest.

ROC for classifier ensembles, bootstrapping, damaging, and interpolation


The main goals of this document are:

i) to demonstrate how to create versions and combinations of classifiers utilizing different perspectives,

ii) to apply the Receiver Operating Characteristic (ROC) technique into evaluating the created classifiers (see [2,3]) and

iii) to illustrate the use of the Mathematica packages [5,6].

The concrete steps taken are the following:

  1. Obtain data: Mathematica built-in or external. Do some rudimentary analysis.

  2. Create an ensemble of classifiers and compare its performance to the individual classifiers in the ensemble.

  3. Produce classifier versions with from changed data in order to explore the effect of records outliers.

  4. Make a bootstrapping classifier ensemble and evaluate and compare its performance.

  5. Systematically diminish the training data and evaluate the results with ROC.

  6. Show how to do classifier interpolation utilizing ROC.

In the steps above we skip the necessary preliminary data analysis. For the datasets we use in this document that analysis has been done elsewhere. (See [,,,].) Nevertheless, since ROC is mostly used for binary classifiers we want to analyze the class labels distributions in the datasets in order to designate which class labels are "positive" and which are "negative."

ROC plots evaluation (in brief)

Assume we are given a binary classifier with the class labels P and N (for "positive" and "negative" respectively).

Consider the following measures True Positive Rate (TPR):

 TPR:= \frac {correctly \:  classified \:  positives}{total \:  positives}.

and False Positive Rate (FPR):

 FPR:= \frac {incorrectly \:  classified \:  negatives}{total \:  negatives}.

Assume that we can change the classifier results with a parameter \theta and produce a plot like this one:


For each parameter value \theta _{i} the point {TPR(\theta _{i}), FPR(\theta _{i})} is plotted; points corresponding to consecutive \theta _{i}‘s are connected with a line. We call the obtained curve the ROC curve for the classifier in consideration. The ROC curve resides in the ROC space as defined by the functions FPR and TPR corresponding respectively to the x-axis and the y-axis.

The ideal classifier would have its ROC curve comprised of a line connecting {0,0} to {0,1} and a line connecting {0,1} to {1,1}.

Given a classifier the ROC point closest to {0,1}, generally, would be considered to be the best point.

The wider perspective

This document started as being a part of a conference presentation about illustrating the cultural differences between Statistics and Machine learning (for Wolfram Technology Conference 2016). Its exposition become both deeper and wider than expected. Here are the alternative, original goals of the document:

i) to demonstrate how using ROC a researcher can explore classifiers performance without intimate knowledge of the classifiers` mechanisms, and

ii) to provide concrete examples of the typical investigation approaches employed by machine learning researchers.

To make those points clearer and more memorable we are going to assume that exposition is a result of the research actions of a certain protagonist with a suitably selected character.

A by-product of the exposition is that it illustrates the following lessons from machine learning practices. (See [1].)

  1. For a given classification task there often are multiple competing models.

  2. The outcomes of the good machine learning algorithms might be fairly complex. I.e. there are no simple interpretations when really good results are obtained.

  3. Having high dimensional data can be very useful.

In [1] these three points and discussed under the names "Rashomon", "Occam", and "Bellman". To quote:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality — curse or blessing."

The protagonist

Our protagonist is a "Simple Nuclear Physicist" (SNP) — someone who is accustomed to obtaining a lot of data that has to be analyzed and mined sometimes very deeply, rigorously, and from a lot of angles, for different hypotheses. SNP is fairly adept in programming and critical thinking, but he does not have or care about deep knowledge of statistics methods or machine learning algorithms. SNP is willing and capable to use software libraries that provide algorithms for statistical methods and machine learning.

SNP is capable of coming up with ROC if he is not aware of it already. ROC is very similar to the so called phase space diagrams physicists do.

Used packages

These commands load the used Mathematica packages [4,5,6]:


Data used

The Titanic dataset

These commands load the Titanic data (that is shipped with Mathematica).

data = ExampleData[{"MachineLearning", "Titanic"}, "TrainingData"];
columnNames = (Flatten@*List) @@ ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"];
data = ((Flatten@*List) @@@ data)[[All, {1, 2, 3, -1}]];
trainingData = DeleteCases[data, {___, _Missing, ___}];

(* {732, 4} *)

RecordsSummary[trainingData, columnNames]


data = ExampleData[{"MachineLearning", "Titanic"}, "TestData"];
data = ((Flatten@*List) @@@ data)[[All, {1, 2, 3, -1}]];
testData = DeleteCases[data, {___, _Missing, ___}];

(* {314, 4} *)

RecordsSummary[testData, columnNames]


nTrainingData = trainingData /. {"survived" -> 1, "died" -> 0, "1st" -> 0, "2nd" -> 1, "3rd" -> 2, "male" -> 0, "female" -> 1};

Classifier ensembles

This command makes a classifier ensemble of two built-in classifiers "NearestNeighbors" and "NeuralNetwork":

aCLs = EnsembleClassifier[{"NearestNeighbors", "NeuralNetwork"}, trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]]]


A classifier ensemble of the package [6] is simply an association mapping classifier IDs to classifier functions.

The first argument given to EnsembleClassifier can be Automatic:

aCLs = EnsembleClassifier[Automatic, trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]]];

With Automatic the following built-in classifiers are used:


(* {"NearestNeighbors", "NeuralNetwork", "LogisticRegression", "RandomForest", "SupportVectorMachine", "NaiveBayes"} *)

Classification with ensemble votes

Classification with the classifier ensemble can be done using the function EnsembleClassify. If the third argument of EnsembleClassify is "Votes" the result is the class label that appears the most in the ensemble results.

EnsembleClassify[aCLs, testData[[20, 1 ;; -2]], "Votes"]

(* "died" *)

The following commands clarify the voting done in the command above.

Map[#[testData[[20, 1 ;; -2]]] &, aCLs]

(* <|"NearestNeighbors" -> "died", "NeuralNetwork" -> "survived", "LogisticRegression" -> "survived", "RandomForest" -> "died", "SupportVectorMachine" -> "died", "NaiveBayes" -> "died"|> *)

(* {{"died", 4}, {"survived", 2}} *)

Classification with ensemble averaged probabilities

If the third argument of EnsembleClassify is "ProbabilitiesMean" the result is the class label that has the highest mean probability in the ensemble results.

EnsembleClassify[aCLs, testData[[20, 1 ;; -2]], "ProbabilitiesMean"]

(* "died" *)

The following commands clarify the probability averaging utilized in the command above.

Map[#[testData[[20, 1 ;; -2]], "Probabilities"] &, aCLs]

(* <|"NearestNeighbors" -> <|"died" -> 0.598464, "survived" -> 0.401536|>, "NeuralNetwork" -> <|"died" -> 0.469274, "survived" -> 0.530726|>, "LogisticRegression" -> <|"died" -> 0.445915, "survived" -> 0.554085|>, 
"RandomForest" -> <|"died" -> 0.652414, "survived" -> 0.347586|>, "SupportVectorMachine" -> <|"died" -> 0.929831, "survived" -> 0.0701691|>, "NaiveBayes" -> <|"died" -> 0.622061, "survived" -> 0.377939|>|> *)

(* <|"died" -> 0.61966, "survived" -> 0.38034|> *)

ROC for ensemble votes

The third argument of EnsembleClassifyByThreshold takes a rule of the form label->threshold; the fourth argument is eighter "Votes" or "ProbabiltiesMean".

The following code computes the ROC curve for a range of votes.

rocRange = Range[0, Length[aCLs] - 1, 1];
aROCs = Table[(
    cres = EnsembleClassifyByThreshold[aCLs, testData[[All, 1 ;; -2]], "survived" -> i, "Votes"]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
ROCPlot[rocRange, aROCs, "PlotJoined" -> Automatic, GridLines -> Automatic]


ROC for ensemble probabilities mean

If we want to compute ROC of a range of probability thresholds we EnsembleClassifyByThreshold with the fourth argument being "ProbabilitiesMean".

EnsembleClassifyByThreshold[aCLs, testData[[1 ;; 6, 1 ;; -2]], "survived" -> 0.2, "ProbabilitiesMean"]

(* {"survived", "survived", "survived", "survived", "survived", "survived"} *)

EnsembleClassifyByThreshold[aCLs, testData[[1 ;; 6, 1 ;; -2]], "survived" -> 0.6, "ProbabilitiesMean"]

(* {"survived", "died", "survived", "died", "died", "survived"} *)

The implementation of EnsembleClassifyByThreshold with "ProbabilitiesMean" relies on the ClassifierFunction signature:

ClassifierFunction[__][record_, "Probabilities"]

Here is the corresponding ROC plot:

rocRange = Range[0, 1, 0.025];
aROCs = Table[(
    cres = EnsembleClassifyByThreshold[aCLs, testData[[All, 1 ;; -2]], "survived" -> i, "ProbabilitiesMean"]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
rocEnGr = ROCPlot[rocRange, aROCs, "PlotJoined" -> Automatic, PlotLabel -> "Classifier ensemble", GridLines -> Automatic]


Comparison of the ensemble classifier with the standard classifiers

This plot compares the ROC curve of the ensemble classifier with the ROC curves of the classifiers that comprise the ensemble.

rocGRs = Table[
   aROCs1 = Table[(
      cres = ClassifyByThreshold[aCLs[[i]], testData[[All, 1 ;; -2]], "survived" -> th];
      ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {th, rocRange}]; 
   ROCPlot[rocRange, aROCs1, PlotLabel -> Keys[aCLs][[i]], PlotRange -> {{0, 1.05}, {0.6, 1.01}}, "PlotJoined" -> Automatic, GridLines -> Automatic],
   {i, 1, Length[aCLs]}];

GraphicsGrid[ArrayReshape[Append[Prepend[rocGRs, rocEnGr], rocEnGr], {2, 4}, ""], Dividers -> All, FrameStyle -> GrayLevel[0.8], ImageSize -> 1200]


Let us plot all ROC curves from the graphics grid above into one plot. For that the single classifier ROC curves are made gray, and their threshold callouts removed. We can see that the classifier ensemble brings very good results for \theta = 0.175 and none of the single classifiers has a better point.

Show[Append[rocGRs /. {RGBColor[___] -> GrayLevel[0.8]} /. {Text[p_, ___] :> Null} /. ((PlotLabel -> _) :> (PlotLabel -> Null)), rocEnGr]]


Classifier ensembles by bootstrapping

There are several ways to produce ensemble classifiers using bootstrapping or jackknife resampling procedures.

First, we are going to make a bootstrapping classifier ensemble using one of the Classify methods. Then we are going to make a more complicated bootstrapping classifier with six methods of Classify.

Bootstrapping ensemble with a single classification method

First we select a classification method and make a classifier with it.

clMethod = "NearestNeighbors";
sCL = Classify[trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]], Method -> clMethod];

The following code makes a classifier ensemble of 12 classifier functions using resampled, slightly smaller (10%) versions of the original training data (with RandomChoice).

aBootStrapCLs = Association@Table[(
     inds = RandomChoice[Range[Length[trainingData]], Floor[0.9*Length[trainingData]]];
     ToString[i] -> Classify[trainingData[[inds, 1 ;; -2]] -> trainingData[[inds, -1]], Method -> clMethod]), {i, 12}];

Let us compare the ROC curves of the single classifier with the bootstrapping derived ensemble.

rocRange = Range[0.1, 0.9, 0.025];
 aSingleROCs = Table[(
    cres = ClassifyByThreshold[sCL, testData[[All, 1 ;; -2]], "survived" -> i]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
 aBootStrapROCs = Table[(
    cres = EnsembleClassifyByThreshold[aBootStrapCLs, testData[[All, 1 ;; -2]], "survived" -> i]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];

(* {6.81521, Null} *)

   ROCPlot[rocRange, aSingleROCs, "ROCColor" -> Blue, "PlotJoined" -> Automatic, GridLines -> Automatic],
   ROCPlot[rocRange, aBootStrapROCs, "ROCColor" -> Red, "PlotJoined" -> Automatic]}],
 SwatchLegend @@ Transpose@{{Blue, Row[{"Single ", clMethod, " classifier"}]}, {Red, Row[{"Boostrapping ensemble of\n", Length[aBootStrapCLs], " ", clMethod, " classifiers"}]}}]


We can see that we get much better results with the bootstrapped ensemble.

Bootstrapping ensemble with multiple classifier methods

This code creates an classifier ensemble using the classifier methods corresponding to Automatic given as a first argument to EnsembleClassifier.

 aBootStrapLargeCLs = Association@Table[(
      inds = RandomChoice[Range[Length[trainingData]], Floor[0.9*Length[trainingData]]];
      ecls = EnsembleClassifier[Automatic, trainingData[[inds, 1 ;; -2]] -> trainingData[[inds, -1]]];
      AssociationThread[Map[# <> "-" <> ToString[i] &, Keys[ecls]] -> Values[ecls]]
     ), {i, 12}];

(* {27.7975, Null} *)

This code computes the ROC statistics with the obtained bootstrapping classifier ensemble:

 aBootStrapLargeROCs = Table[(
     cres = EnsembleClassifyByThreshold[aBootStrapLargeCLs, testData[[All, 1 ;; -2]], "survived" -> i]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];

(* {45.1995, Null} *)

Let us plot the ROC curve of the bootstrapping classifier ensemble (in blue) and the single classifier ROC curves (in gray):

aBootStrapLargeGr = ROCPlot[rocRange, aBootStrapLargeROCs, "PlotJoined" -> Automatic];
Show[Append[rocGRs /. {RGBColor[___] -> GrayLevel[0.8]} /. {Text[p_, ___] :> Null} /. ((PlotLabel -> _) :> (PlotLabel -> Null)), aBootStrapLargeGr]]


Again we can see that the bootstrapping ensemble produced better ROC points than the single classifiers.

Damaging data

This section tries to explain why the bootstrapping with resampling to smaller sizes produces good results.

In short, the training data has outliers; if we remove small fractions of the training data we might get better results.

The procedure described in this section can be used in conjunction with the procedures described in the guide for importance of variables investigation [7].

Ordering function

Let us replace the categorical values with numerical in the training data. There are several ways to do it, here is a fairly straightforward one:

nTrainingData = trainingData /. {"survived" -> 1, "died" -> 0, "1st" -> 0, "2nd" -> 1, "3rd" -> 2, "male" -> 0, "female" -> 1};

Decreasing proportions of females

First, let us find all indices corresponding to records about females.

femaleInds = Flatten@Position[trainingData[[All, 3]], "female"];

The following code standardizes the training data corresponding to females, finds the mean record, computes distances from the mean record, and finally orders the female records indices according to their distances from the mean record.

t = Transpose@Map[Rescale@*Standardize, N@Transpose@nTrainingData[[femaleInds, 1 ;; 2]]];
m = Mean[t];
ds = Map[EuclideanDistance[#, m] &, t];
femaleInds = femaleInds[[Reverse@Ordering[ds]]];

The following plot shows the distances calculated above.

ListPlot[Sort@ds, PlotRange -> All, PlotTheme -> "Detailed"]


The following code removes from the training data the records corresponding to females according to the order computed above. The female records farthest from the mean female record are removed first.

 femaleFrRes = Association@
    Table[cl ->
        inds = Complement[Range[Length[trainingData]], Take[femaleInds, Ceiling[fr*Length[femaleInds]]]];
        cf = Classify[trainingData[[inds, 1 ;; -2]] -> trainingData[[inds, -1]], Method -> cl]; cfPredictedLabels = cf /@ testData[[All, 1 ;; -2]];
        {fr, ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cfPredictedLabels]}),
       {fr, 0, 0.8, 0.05}],
     {cl, {"NearestNeighbors", "NeuralNetwork", "LogisticRegression", "RandomForest", "SupportVectorMachine", "NaiveBayes"}}];

(* {203.001, Null} *)

The following graphics grid shows how the classification results are affected by the removing fractions of the female records from the training data. The results for none or small fractions of records removed are more blue.

   femaleAROCs = femaleFrRes[cl][[All, 2]];
   frRange = femaleFrRes[cl][[All, 1]]; ROCPlot[frRange, femaleAROCs, PlotRange -> {{0.0, 0.25}, {0.2, 0.8}}, PlotLabel -> cl, "ROCPointColorFunction" -> (Blend[{Blue, Red}, #3/Length[frRange]] &), ImageSize -> 300],
   {cl, Keys[femaleFrRes]}],
  {2, 3}], Dividers -> All]


We can see that removing the female records outliers has dramatic effect on the results by the classifiers "NearestNeighbors" and "NeuralNetwork". Not so much on "LogisticRegression" and "NaiveBayes".

Decreasing proportions of males

The code in this sub-section repeats the experiment described in the previous one males (instead of females).

maleInds = Flatten@Position[trainingData[[All, 3]], "male"];

t = Transpose@Map[Rescale@*Standardize, N@Transpose@nTrainingData[[maleInds, 1 ;; 2]]];
m = Mean[t];
ds = Map[EuclideanDistance[#, m] &, t];
maleInds = maleInds[[Reverse@Ordering[ds]]];

ListPlot[Sort@ds, PlotRange -> All, PlotTheme -> "Detailed"]


 maleFrRes = Association@
    Table[cl ->
        inds = Complement[Range[Length[trainingData]], Take[maleInds, Ceiling[fr*Length[maleInds]]]];
        cf = Classify[trainingData[[inds, 1 ;; -2]] -> trainingData[[inds, -1]], Method -> cl]; cfPredictedLabels = cf /@ testData[[All, 1 ;; -2]];
        {fr, ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cfPredictedLabels]}),
       {fr, 0, 0.8, 0.05}],
     {cl, {"NearestNeighbors", "NeuralNetwork", "LogisticRegression", "RandomForest", "SupportVectorMachine", "NaiveBayes"}}];

(* {179.219, Null} *)

   maleAROCs = maleFrRes[cl][[All, 2]];
   frRange = maleFrRes[cl][[All, 1]]; ROCPlot[frRange, maleAROCs, PlotRange -> {{0.0, 0.35}, {0.55, 0.82}}, PlotLabel -> cl, "ROCPointColorFunction" -> (Blend[{Blue, Red}, #3/Length[frRange]] &), ImageSize -> 300],
   {cl, Keys[maleFrRes]}],
  {2, 3}], Dividers -> All]


Classifier interpolation

Assume that we want a classifier that for a given representative set of n items (records) assigns the positive label to an exactly n_p of them. (Or very close to that number.)

If we have two classifiers, one returning more positive items than n_p, the other less than n_p, then we can use geometric computations in the ROC space in order to obtain parameters for a classifier interpolation that will bring positive items close to n_p; see [3]. Below is given Mathematica code with explanations of how that classifier interpolation is done.

Assume that by prior observations we know that for a given dataset of n items the positive class consists of \approx 0.09 n items. Assume that for a given unknown dataset of n items we want 0.2 n of the items to be classified as positive. We can write the equation:

 {FPR} * ((1-0.09) * n) + {TPR} * (0.09 * n) = 0.2 * n ,

which can be simplified to

 {FPR} * (1-0.09) + {TPR} * 0.09 = 0.2 .

The two classifiers

Consider the following two classifiers.

cf1 = Classify[trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]], Method -> "RandomForest"];
cfROC1 = ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cf1[testData[[All, 1 ;; -2]]]]
(* <|"TruePositive" -> 82, "FalsePositive" -> 22, "TrueNegative" -> 170, "FalseNegative" -> 40|> *)

cf2 = Classify[trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]], Method -> "LogisticRegression"];
cfROC2 = ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cf2[testData[[All, 1 ;; -2]]]]
(* <|"TruePositive" -> 89, "FalsePositive" -> 37, "TrueNegative" -> 155, "FalseNegative" -> 33|> *)

Geometric computations in the ROC space

Here are the ROC space points corresponding to the two classifiers, cf1 and cf2:

p1 = Through[ROCFunctions[{"FPR", "TPR"}][cfROC1]];
p2 = Through[ROCFunctions[{"FPR", "TPR"}][cfROC2]];

Here is the breakdown of frequencies of the class labels:

Tally[trainingData[[All, -1]]]
%[[All, 2]]/Length[trainingData] // N

(* {{"survived", 305}, {"died", 427}}
   {0.416667, 0.583333}) *)

We want to our classifier to produce 38% people to survive. Here we find two points of the corresponding constraint line (on which we ROC points of the desired classifiers should reside):

sol1 = Solve[{{x, y} \[Element] ImplicitRegion[{x (1 - 0.42) + y 0.42 == 0.38}, {x, y}], x == 0.1}, {x, y}][[1]]
sol2 = Solve[{{x, y} \[Element] ImplicitRegion[{x (1 - 0.42) + y 0.42 == 0.38}, {x, y}], x == 0.25}, {x, y}][[1]]

(* {x -> 0.1, y -> 0.766667}
   {x -> 0.25, y -> 0.559524} *)

Here using the points q1 and q2 of the constraint line we find the intersection point with the line connecting the ROC points of the classifiers:

{q1, q2} = {{x, y} /. sol1, {x, y} /. sol2};
sol = Solve[ {{x, y} \[Element] InfiniteLine[{q1, q2}] \[And] {x, y} \[Element] InfiniteLine[{p1, p2}]}, {x, y}];
q = {x, y} /. sol[[1]]

(* {0.149753, 0.69796} *)

Let us plot all geometric objects:

Graphics[{PointSize[0.015], Blue, Tooltip[Point[p1], "cf1"], Black, 
  Text["cf1", p1, {-1.5, 1}], Red, Tooltip[Point[p2], "cf2"], Black, 
  Text["cf2", p2, {1.5, -1}], Black, Point[q], Dashed, 
  InfiniteLine[{q1, q2}], Thin, InfiniteLine[{p1, p2}]}, 
 PlotRange -> {{0., 0.3}, {0.6, 0.8}}, 
 GridLines -> Automatic, Frame -> True]


Classifier interpolation

Next we find the ratio of the distance from the intersection point q to the cf1 ROC point and the distance between the ROC points of cf1 and cf2.

k = Norm[p1 - q]/Norm[p1 - p2]
(* 0.450169 *)

The classifier interpolation is made by a weighted random selection based on that ratio (using RandomChoice):

cres = MapThread[If, {RandomChoice[{1 - k, k} -> {True, False}, Length[testData]], cf1@testData[[All, 1 ;; -2]], cf2@testData[[All, 1 ;; -2]]}];
cfROC3 = ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres];
p3 = Through[ROCFunctions[{"FPR", "TPR"}][cfROC3]];
Graphics[{PointSize[0.015], Blue, Point[p1], Red, Point[p2], Black, Dashed, InfiniteLine[{q1, q2}], Green, Point[p3]}, 
 PlotRange -> {{0., 0.3}, {0.6, 0.8}}, 
 GridLines -> Automatic, Frame -> True]


We can run the process multiple times in order to convince ourselves that the interpolated classifier ROC point is very close to the constraint line most of the time.

p3s =
    cres = 
     MapThread[If, {RandomChoice[{1 - k, k} -> {True, False}, Length[testData]], cf1@testData[[All, 1 ;; -2]], cf2@testData[[All, 1 ;; -2]]}]; 
    cfROC3 = ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres];
    Through[ROCFunctions[{"FPR", "TPR"}][cfROC3]]), {1000}];

Show[{SmoothDensityHistogram[p3s, ColorFunction -> (Blend[{White, Green}, #] &), Mesh -> 3], 
  Graphics[{PointSize[0.015], Blue, Tooltip[Point[p1], "cf1"], Black, Text["cf1", p1, {-1.5, 1}], 
     Red, Tooltip[Point[p2], "cf2"], Black, Text["cf2", p2, {1.5, -1}], 
     Black, Dashed, InfiniteLine[{q1, q2}]}, GridLines -> Automatic]}, 
 PlotRange -> {{0., 0.3}, {0.6, 0.8}}, 
 GridLines -> Automatic, Axes -> True, 
 AspectRatio -> Automatic]



[1] Leo Breiman, Statistical Modeling: The Two Cultures, (2001), Statistical Science, Vol. 16, No. 3, 199[Dash]231.

[2] Wikipedia entry, Receiver operating characteristic. URL: .

[3] Tom Fawcett, An introduction to ROC analysis, (2006), Pattern Recognition Letters, 27, 861[Dash]874. (Link to PDF.)

[4] Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m.

[5] Anton Antonov, Receiver operating characteristic functions Mathematica package, (2016), source code MathematicaForPrediction at GitHub, package ROCFunctions.m.

[6] Anton Antonov, Classifier ensembles functions Mathematica package, (2016), source code MathematicaForPrediction at GitHub, package ClassifierEnsembles.m.

[7] Anton Antonov, "Importance of variables investigation guide", (2016), MathematicaForPrediction at GitHub, folder Documentation.

Basic example of using ROC with Linear regression


This post is for using the package [2] that provides Mathematica implementations of Receiver Operating Characteristic (ROC) functions calculation and plotting. The ROC framework is used for analysis and tuning of binary classifiers, [3]. (The classifiers are assumed to classify into a positive/true label or a negative/false label. )

The function ROCFuntions gives access to the individual ROC functions through string arguments. Those ROC functions are applied to special objects, called ROC Association objects.

Each ROC Association object is an Association that has the following four keys: "TruePositive", "FalsePositive", "TrueNegative", and "FalseNegative" .

Given two lists of actual and predicted labels a ROC Association object can be made with the function ToROCAssociation .

For more definitions and example of ROC terminology and functions see [3].

Why Linear regression

I was asked in this discussion why Linear regression and not, say, Logistic regression.

Here is my answer:

1. I am trying to do a minimal and quick to execute example — the code of the post is included in the package ROCFunctions.m.

2. I am aware that there are better alternatives of LinearModelFit, but I plan to discuss those in the MathematicaVsR project: “Regression with ROC”. (As the name hints, it is not just about Linear regression.)

3. Another point is that although the Linear regression is not a good method for this classification, nevertheless using ROC it can be made to give better results than, say, the built-in “NeuralNetwork” method. See the last section of “Linear regression with”.

Minimal example

Note that here although we use both of the provided Titanic training and test data, the code is doing only training. The test data is used to find the best tuning parameter (threshold) through ROC analysis.

Get packages

These commands load the packages [1,2]:


Using Titanic data

Here is the summary of the Titanic data used below:

titanicData = (Flatten@*List) @@@ExampleData[{"MachineLearning", "Titanic"}, "Data"];
columnNames = (Flatten@*List) @@ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"];
RecordsSummary[titanicData, columnNames]


This variable dependence grid shows the relationships between the variables.

Magnify[#, 0.7] &@VariableDependenceGrid[titanicData, columnNames]


Get training and testing data

data = ExampleData[{"MachineLearning", "Titanic"}, "TrainingData"];
data = ((Flatten@*List) @@@ data)[[All, {1, 2, 3, -1}]];
trainingData = DeleteCases[data, {___, _Missing, ___}];

(* {732, 4} *)

data = ExampleData[{"MachineLearning", "Titanic"}, "TestData"];
data = ((Flatten@*List) @@@ data)[[All, {1, 2, 3, -1}]];
testData = DeleteCases[data, {___, _Missing, ___}];

(* {314, 4} *)

Replace categorical with numerical values

trainingData = trainingData /. {"survived" -> 1, "died" -> 0, "1st" -> 0, "2nd" -> 1, "3rd" -> 2, "male" -> 0, "female" -> 1};

testData = testData /. {"survived" -> 1, "died" -> 0, "1st" -> 1, "2nd" -> 2, "3rd" -> 3, "male" -> 0, "female" -> 1};

Do linear regression

lfm = LinearModelFit[{trainingData[[All, 1 ;; -2]], trainingData[[All, -1]]}]


Get the predicted values

modelValues = lfm @@@ testData[[All, 1 ;; -2]];

Histogram[modelValues, 20]




Obtain ROC associations over a set of parameter values

testLabels = testData[[All, -1]];

thRange = Range[0.1, 0.9, 0.025];
aROCs = Table[ToROCAssociation[{0, 1}, testLabels, Map[If[# > \[Theta], 1, 0] &, modelValues]], {\[Theta], thRange}];

Evaluate ROC functions for given ROC association

Through[ROCFunctions[{"PPV", "NPV", "TPR", "ACC", "SPC"}][aROCs[[3]]]]

(* {34/43, 19/37, 17/32, 197/314, 95/122} *)

Standard ROC plot

ROCPlot[thRange, aROCs, "PlotJoined" -> Automatic, "ROCPointCallouts" -> True, "ROCPointTooltips" -> True, GridLines -> Automatic]


Plot ROC functions wrt to parameter values

ListLinePlot[Map[Transpose[{thRange, #}] &, Transpose[Map[Through[ROCFunctions[{"PPV", "NPV", "TPR", "ACC", "SPC"}][#]] &, aROCs]]],
 Frame -> True, FrameLabel -> Map[Style[#, Larger] &, {"threshold, \[Theta]", "rate"}], PlotLegends -> Map[# <> ", " <> (ROCFunctions["FunctionInterpretations"][#]) &, {"PPV", "NPV", "TPR", "ACC", "SPC"}], GridLines -> Automatic]


Finding the intersection point of PPV and TPR

We want to find a point that provides balanced positive and negative labels success rates. One way to do this is to find the intersection point of the ROC functions PPV (positive predictive value) and TPR (true positive rate).

Examining the plot above we can come up with the initial condition for \(x\).

ppvFunc = Interpolation[Transpose@{thRange, ROCFunctions["PPV"] /@ aROCs}];
tprFunc = Interpolation[Transpose@{thRange, ROCFunctions["TPR"] /@ aROCs}];
FindRoot[ppvFunc[x] - tprFunc[x] == 0, {x, 0.2}]

(* {x -> 0.3} *)

Area under the ROC curve

The Area Under the ROC curve (AUROC) tells for a given range of the controlling parameter "what is the probability of the classifier to rank a randomly chosen positive instance higher than a randomly chosen negative instance, (assuming ‘positive’ ranks higher than ‘negative’)", [3,4]

Calculating AUROC is easy using the Trapezoidal quadrature formula:

 N@Total[Partition[Sort@Transpose[{ROCFunctions["FPR"] /@ aROCs, ROCFunctions["TPR"] /@ aROCs}], 2, 1] 
   /. {{x1_, y1_}, {x2_, y2_}} :> (x2 - x1) (y1 + (y2 - y1)/2)]

 (* 0.698685 *)

It is also implemented in [2]:


(* 0.698685 *)


[1] Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m.

[2] Anton Antonov, Receiver operating characteristic functions Mathematica package, (2016), source code MathematicaForPrediction at GitHub, package ROCFunctions.m .

[3] Wikipedia entry, Receiver operating characteristic. URL: .

[4] Tom Fawcett, An introduction to ROC analysis, (2006), Pattern Recognition Letters, 27, 861-874.

Contingency tables creation examples


In statistics contingency tables are matrices used to show the co-occurrence of variable values of multi-dimensional data. They are fundamental in many types of research. This document shows how to use several functions implemented in Mathematica for the construction of contingency tables.


In this document we are going to use the implementations in the package MathematicaForPredictionUtilities.m from MathematicaForPrediction at GitHub, [1].


The implementation of CrossTabulate in [1] is based on Tally, GatherBy, and SparseArray. The implementation of xtabsViaRLink in [1] is based on R‘s function xtabs called via RLink.

Other package used in this document are [2] and [4] imported with these commands:


For a different approach to implementing cross-tabulation than those taken in [1] see the Stack Overflow answer by Mr.Wizard.

Using Titanic data

Getting data

titanicData = 
  Flatten@*List @@@ ExampleData[{"MachineLearning", "Titanic"}, "Data"];
titanicData = DeleteCases[titanicData, {___, _Missing, ___}];

titanicColumnNames = 
  Flatten@*List @@ ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"];
aTitanicColumnNames = 
  AssociationThread[titanicColumnNames -> Range[Length[titanicColumnNames]]];

Note that we have removed the records with missing data (for simpler exposition).

Data summary

(* {1046, 4} *)

RecordsSummary[titanicData, titanicColumnNames]


Using CrossTabulate

Assume that we want to group the people according to their passenger class and survival and we want to find the average age for each group.

We can do that by

1. finding the counts contingency table C for the variables "passenger class" and "passenger survival",

2. finding the age contingency table A for the same variables, and

3. do the element-wise division \frac{A}{C}.

ctCounts = 
   titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctCounts


ctTotalAge = 
    aTitanicColumnNames /@ {"passenger class", "passenger survival", 
      "passenger age"}]]];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctTotalAge


 TableHeadings -> Values[Rest[ctTotalAge]]]


(We have to make the sparse array ctCounts a regular array because otherwise we get warnings for division by zero because of the sparse array’s default value.)

Let us repeat the steps above by dividing the passengers before-hand according to their sex.

  (mCount = 
     CrossTabulate[#[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]]; 
   mAge = CrossTabulate[#[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival", "passenger age"}]]];
   #[[1,  aTitanicColumnNames["passenger sex"]]] -> 
     MatrixForm[mAge[[1]]/Normal[mCount[[1]]], TableHeadings -> Values[Rest[mAge]]]) &, 
  GatherBy[titanicData, #[[aTitanicColumnNames["passenger sex"]]] &]]


The alternative of CrossTabulate is xtabsViaRLink that is uses R’s function xtabs via RLink.


(* {Paclet[RLinkRuntime,, <>]} *)

ctCounts = 
    titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]],
    {"", "passenger.survival"},
    " ~ + passenger.survival"];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctCounts


Relation to mosaic plots

The graphical visualization of a dataset with mosaic plots, [2,3], is similar in spirit to contingency tables. Compare the following mosaic plot with the contingency table in the last section.

 titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]] ]


Straightforward calling of MatrixForm

At this point we might want to be able to call MatrixForm more directly for the output of CrossTabulate and FromRXTabsForm. One way to do this is to define an up-value for Association .

MatrixForm[x_Association /; (KeyExistsQ[x, "XTABMatrix"] || KeyExistsQ[x, "XTABTensor"])] ^:= (MatrixForm[#1, TableHeadings -> Rest[{##}]] & @@ x);

Now we can do this:

MatrixForm @
 CrossTabulate[titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]]


Remark: Because of this up-value definition for Association with MatrixForm we have the associations returned by CrossTabulate and FromRXTabsForm to have the key "XTABMatrix" instead of "Matrix", the former assumed to be much more rarely to be used than the latter.

Using larger data

Let us consider an example with larger data that has larger number of unique values in its columns.

Getting online retail invoices data

The following dataset is taken from [6].

data = Import[ "/Volumes/WhiteSlimSeagate/Datasets/UCI Online Retail Data Set/Online Retail.csv"];
columnNames = First[data];
data = Rest[data];

aColumnNames = AssociationThread[columnNames -> Range[Length[columnNames]]];

Data summary

We have \approx 66000 rows and 8 columns:

(* {65499, 8} *)

Here is a summary of the columns:

Magnify[#, 0.75] &@RecordsSummary[data, columnNames]


Contingency tables

Country vs. StockCode

There is no one-to-one correspondence between the values of the column "Description" and the column "StockCode" which can be seen with this command:

  GatherBy[data[[All, aColumnNames /@ {"Description", "StockCode"}]], First]]
(* {1,144} *)

The way in which the column "StockCode" was ingested made it have multiple types for its values:

Tally[NumberQ /@ data[[All, aColumnNames["StockCode"]]]]
(* {{False,9009},{True,56490}} *)

So let us convert it to all strings:

data[[All, aColumnNames["StockCode"]]] = 
  ToString /@ data[[All, aColumnNames["StockCode"]]];

Here we find the contingency table for "Country" and "StockCode" over "Quantity" using CrossTabulate:

 ctRes = CrossTabulate[
    data[[All, aColumnNames /@ {"Country", "StockCode", "Quantity"}]]];
(* {0.256339,Null} *)

Here we find the contingency table for "Country" and "StockCode" over "Quantity" using xtabsViaRLink:

 rres = xtabsViaRLink[
   data[[All, aColumnNames /@ {"Country", "StockCode", "Quantity"}]],
   {"Country", "StockCode", "Quantity"},
   "Quantity ~ Country + StockCode"]; 
 ctRRes = FromRXTabsForm[rres];
(* {0.843621,Null} *)

Both functions produce the same result:

ctRRes["matrix"] == N@ctRRes[[1]]
(* True *)

Note that xtabsViaRLink is slower but still fairly quick.

Here we plot the contingency table using MatrixPlot :

MatrixPlot[ctRRes["matrix"], AspectRatio -> 1/1.5, 
  FrameTicks -> {{#, #} &@ Table[{i, ctRRes["rownames"][[i]]}, {i, Length[ctRRes["rownames"]]}], 
  {Automatic, Automatic}}, ImageSize -> 550]


Country vs. Quarter

Let us extend the data with columns that have months and quarters corresponding to the invoice dates.

The following commands computing date objects and extracting month and quarter values from them are too slow.

(*AbsoluteTiming[dobjs=DateObject[{#,{"Month","/","Day","/","Year"," \
(* {30.2595,Null} *)

(* {91.1732,Null} *)

We can use the following ad hoc computation instead.

dvals = StringSplit[#, {"/", " ", ":"}] & /@ 

This summary shows that the second value in the dates is day of month, and the first value is most likely month.

Magnify[#, 0.75] &@ RecordsSummary[dvals[[All, 1 ;; 3]], "MaxTallies" -> 16]


These commands extend the data and the corresponding column-name-to-index association.

ms = DateValue[Table[DateObject[{2016, i, 1}], {i, 12}], "MonthName"];
dvals = Map[{ms[[#]], "Q" <> ToString[Quotient[#, 4] + 1]} &, ToExpression @ dvals[[All, 1]]];
dataM = MapThread[Join[#1, #2] &, {data, dvals}];
aColumnNamesM = 
  Join[aColumnNames, <|"MonthName" -> (Length[aColumnNames] + 1), "QuarterNameShort" -> (Length[aColumnNames] + 2)|>];
(* {0.054877,Null} *)

Here is the contingency table for "Country" vs "QuarterNameShort" over "Quantity".

ctRes = CrossTabulate[ dataM[[All, aColumnNamesM /@ {"Country", "QuarterNameShort", "Quantity"}]]];
Magnify[#, 0.75] &@ MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctRes


Uniform tables

Often when making contingency tables over subsets of the data we obtain contingency tables with different rows and columns. For various reasons (programming, esthetics, comprehension) it is better to have the tables with the same rows and columns.

Here is an example of non-uniform contingency tables derived from the online retail data of the previous section. We split the data over the countries and find contingency tables of "MonthName" vs "QuarterNameShort" over "Quantity".

tbs = Association @
    (xtab = CrossTabulate[#[[All, aColumnNamesM /@ {"MonthName", "QuarterNameShort", "Quantity"}]]];
     #[[1, aColumnNamesM["Country"]]] -> xtab) &,
    GatherBy[ dataM, #[[aColumnNamesM[ "Country"]]] &]];

Magnify[#, 0.75] &@
 Map[MatrixForm[#["Matrix"], TableHeadings -> (# /@ {"RowNames", "ColumnNames"})] &, tbs](*[[{1,12,-1}]]*)


Using the object RSparseMatrix, see [4,5], we can impose row and column names on each table.

First we convert the contingency tables into RSparseMatrix objects:

tbs2 = Map[ ToRSparseMatrix[#["Matrix"], "RowNames" -> #["RowNames"], "ColumnNames" -> #["ColumnNames"]] &, tbs];

And then we impose the desired row and column names:

tbs2 = Map[ ImposeColumnNames[ ImposeRowNames[#, {"January", "December"}], {"Q1", "Q2", "Q3", "Q4"}] &, tbs2];
Magnify[#, 0.75] &@(MatrixForm /@ tbs2)


Generalization : CrossTensorate

A generalization of CrossTabulate is the function CrossTensorate implemented in [1] that takes a "formula" argument similar to R’s xtabs.

This finds number of people of different sub-groups of Titanic data:

ctRes = CrossTensorate[Count == "passenger survival" + "passenger sex" + "passenger class", titanicData, aTitanicColumnNames];


We can verify the result using Count:

Count[titanicData, {"1st", _, "female", "died"}]
(* 5 *)

Count[titanicData, {"2nd", _, "male", "survived"}]
(* 23 *)

To split the cross-tensor across its first variable we can use this command:

sctRes = Association@
  MapThread[Rule[#1, Join[<|"XTABTensor" -> #2|>, Rest@Rest@ctRes]] &, {ctRes[[2]], # & /@ ctRes["XTABTensor"]}];
MatrixForm /@ sctRes


Or we can call the more general function CrossTensorateSplit implemented in [1]:

Map[MatrixForm /@ CrossTensorateSplit[ctRes, #] &, Rest@Keys[ctRes]]


CrossTensorateSplit can also be called with one argument that is a variable name.This will produce a splitting function. For example, the above command can be re-written as :

Map[MatrixForm /@ CrossTensorateSplit[#] @ ctRes &, Rest@Keys[ctRes]]


[1] Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m.

[2] Anton Antonov, Mosaic plot for data visualization implementation in Mathematica, (2014), MathematicaForPrediction at GitHub, package MosaicPlot.m.

[3] Anton Antonov, "Mosaic plots for data visualization", (2014), MathematicaForPrediction at WordPress blog. URL: .

[4] Anton Antonov, RSparseMatrix Mathematica package, (2015) MathematicaForPrediction at GitHub. URL: .

[5] Anton Antonov, "RSparseMatrix for sparse matrices with named rows and columns", (2015), MathematicaForPrediction at WordPress blog. URL: .

[6] Daqing Chen, Online Retail Data Set, (2015), UCI Machine Learning Repository. URL: .

Adaptive numerical Lebesgue integration by set measure estimates


In this document are given outlines and examples of several related implementations of Lebesgue integration, [1], within the framework of NIntegrate, [7]. The focus is on the implementations of Lebesgue integration algorithms that have multiple options and can be easily extended (in order to do further research, optimization, etc.) In terms of NIntegrate‘s framework terminology it is shown how to implement an integration strategy or integration rule based on the theory of the Lebesgue integral. The full implementation of those strategy and rules — LebesgueIntegration, LebesgueIntegrationRule, and GridLebesgueIntegrationRule — are given in the Mathematica package [5].

The advantage of using NIntegrate‘s framework is that a host of supporting algorithms can be employed for preprocessing, execution, experimentation, and testing (correctness, comparison, and profiling.)

Here is a brief description of the integration strategy LebesgueIntegration in [5]:

  1. prepare a function that calculates measure estimates based on random points or low discrepancy sequences of points in the integration domain;

  2. use NIntegrate for the computation of one dimensional integrals for that measure estimate function over the range of the integrand function values.

The strategy is adaptive because of the second step — NIntegrate uses adaptive integration algorithms.

Instead of using an integration strategy we can "tuck in" the whole Lebesgue integration process into an integration rule, and then use that integration rule with the adaptive integration algorithms NIntegrate already has. This is done with the implementations of the integration rules LebesgueIntegrationRule and GridLebesgueIntegrationRule.

Brief theory

Lebesgue integration extends the definition of integral to a much larger class of functions than the class of Riemann integrable functions. The Riemann integral is constructed by partitioning the integrand’s domain (on the x axis). The Lebesgue integral is constructed by partitioning the integrand’s co-domain (on the y axis). For each value y in the co-domain, find the measure \mu(y) of the corresponding set of points \frac{y}{f} in the domain. Roughly speaking, the Lebesgue integral is then the sum of all the products \mu(y) y; see [1]. For our implementation purposes \mu is defined differently, and in the rest of this section we follow [3].

Consider the non-negative bound-able measurable function f:

 y=f(x), f(x) \geq 0, x \in \Omega .

We denote by \mu(y) the measure for the points in \Omega for which f(x)>=y, i.e.

  1.  \mu(y) := \left| \{ x: x\in \Omega \land f(x) \geq y\} \right| .

The Lebesgue integral of f(x) over \Omega can be be defined as:

 \int_{\Omega } f (x) dx = y_0 \mu (y_0) + \lim_{n \to \infty ,\max
         \left(y_i-y_{i-1}\right)\to 0} \sum_{i=1}^n \mu(y_i)(y_i-y_{i-1}) .

Further, we can write the last formula as

  1.  \int_{\Omega } f(x) dx = y_0 \mu(y_0) + \int_{y_0}^{y_n}\mu(y) dy.

The restriction f(x)>=0 can be handled by defining the following functions f_1 and f_2 :

 f_1(x) := \frac{1}{2} (\left| f(x) \right| + f(x) ),

 f_2(x) := \frac{1}{2} (\left| f(x) \right| - f(x) ),

and using the formula

 \int_{\Omega}f(x) dx= \int_{\Omega} f_1(x) dx - \int_{\Omega}f_2(x) dx.

Since finding analytical expressions of \mu is hard we are going to look into ways of approximating \mu.

For more details see [1,2,3,4].

(Note, that the theoretical outline the algorithms considered can be seen as algorithms that reduce multidimensional integration to one dimensional integration.)

Algorithm walk through

We can see that because of Equation (4) we mostly have to focus on estimating the measure function \mu. This section provides a walk through with visual examples of a couple of stochastic ways to do that.

Consider the integral

 \int_{0}^{1} \int_{0}^{1} \sqrt{x_1 + x_2 + 1} dx_1 dx_2 .

In order to estimate \mu in \Omega := [0,1] \times [0,1] for f(x_1, x_2) := \sqrt( 1 + x_1 + x_2 ) we are going to generate in \Omega a set P of low discrepancy sequence of points, [2]. Here this is done with 100 points of the so called "Sobol" sequence:

n = 100;
SeedRandom[0,Method -> {"MKL",Method -> {"Sobol", "Dimension" -> 2}}];
points = RandomReal[{0, 1}, {n, 2}];
ListPlot[points, AspectRatio -> Automatic,
 PlotTheme -> "Detailed",
 FrameLabel -> {"\!\(\*SubscriptBox[\(x\), \(1\)]\)","\!\(\*SubscriptBox[\(x\), \(2\)]\)"}]


To each point p_i let us assign a corresponding "volume" v_i that can be used to approximate \mu with Equation (2). We can of course easily assign such volumes to be \frac{1}{\left| P\right| }, but as it can be seen on the plot this would be a good approximation for a larger number of points. Here is an example of a different volume assignment using a Voronoi diagram, [10]:

vmesh = VoronoiMesh[points, {{0, 1}, {0, 1}}, Frame -> True];
Show[{vmesh, Graphics[{Red, Point[points]}]},
 FrameLabel -> {"\!\(\*SubscriptBox[\(x\), \(1\)]\)", "\!\(\*SubscriptBox[\(x\), \(2\)]\)"}]


(The Voronoi diagram for P finds for each point p_i the set of domain points closest to p_i than any other point of P.)

Here is a breakdown of the Voronoi diagram volumes corresponding to the generated points (compare with \frac{1}{\left| P\right| }) :

volumes =PropertyValue[{vmesh, Dimensions[points][[2]]}, MeshCellMeasure]; 
Histogram[volumes, PlotRange -> All, PlotTheme -> "Detailed",
 FrameLabel -> {"volume", "count"}]


Let us define a function that computes \mu according to Equation (1) with the generated points and assigned volumes:

EstimateMeasure[fval_?NumericQ, pointVals_, pointVolumes_] :=
   pinds = Clip[Sign[pointVals - fval], {0, 1}, {0, 1}];

Here is an example call of that function using the Voronoi diagram volumes:

EstimateMeasure[1.6, Sqrt[2 + Total[#]] & /@ points, volumes]
(* 0.845833 *)

And here is another call using uniform volumes:

EstimateMeasure[1.6, Sqrt[2 + Total[#]] & /@ points,
  Table[1/Length[points], {Length[points]}]] // N
(* 0.85 *)

The results can be verified using ImplicitRegion:

RegionMeasure[ImplicitRegion[ Sqrt[2 + x1 + x2] >= 1.6, {{x1, 0, 1}, {x2, 0, 1}}]]
(* 0.8432 *)

Or using Integrate:

Integrate[Piecewise[{{1, Sqrt[2 + x1 + x2] >= 1.6}}, 0], {x1, 0, 1}, {x2, 0, 1}]
(* 0.8432 *)

At this point we are ready to compute the integral estimate using Formula (4) :

fvals = Sqrt[2 + Total[#]] & /@ points;
Min[fvals]*EstimateMeasure[0, fvals, volumes] +
 NIntegrate[EstimateMeasure[y, fvals, volumes], {y, Min[fvals], Max[fvals]}]
(* 1.72724 *)

To simplify the code we use the symbol \text{fvals} to hold the values f P. Note that instead of the true min and max values of f we use estimates of them with \text{fvals}.

Here is the verification of the result:

Integrate[Sqrt[2 + x1 + x2], {x1, 0, 1}, {x2, 0, 1}]
% // N
(* 8/15 (16 + 2 Sqrt[2] - 9 Sqrt[3]) *)
(* 1.72798 *)

In order to implement the outlined algorithm so it will be more universal we have to consider volumes rescaling, function positivity, and Voronoi diagram implementation(s). For details how these considerations are resolved see the code of the strategy LebesgueIntegration in [5].

Further algorithm elaborations

Measure estimate with regular grid cells

The article 3 and book 4 suggest the measure estimation to be done through membership of regular grid of cells. For example, the 100 points generated in the previous section can be grouped by a 5 \times 5 grid:


The following steps describe in detail an algorithm based on the proposed in [3,4] measure estimation method. The algorithm is implemented in [5] for the symbol, GridLebesgueIntegrationRule.

1. Generate points filling the [0,1]^{d} where d is the dimension of \Omega.

2. Partition [0,1]^{d} with a regular grid according to specifications.

  • 2.1. Assume the cells are indexed with the integers I_c \subset \mathbb{N}, |I_c|=n.
  • 2.2. Assume that all cells have the same volume v_c below.

3. For each point find to which cell of the regular grid it belongs to.

4. For each cell have a list of indices corresponding to the points that belong to it.

5. For a given sub-region of integration r rescale to the points to lie within r; denote those points with P_r.

  • 5.1. Compute the rescaling factor for the integration rule; denote with \gamma.

6. For a given integrand function f evaluate f over all points P_r.

7. For each cell i \in I_c find the min and max values of f.

  • 7.1. Denote with f_i^{\min} and f_i^{\max} correspondingly.

8. For a given value f x_k, where k is some integer enumerating the 1D integration rule sampling points, find the coefficients \xi _i, i \in i_c using the following formula:

\xi_i= Piecewise[\{\{1, y_k \leq f_i^{\min}\},\{0, f_i^{\max }<y_k\}\},
 \{ \frac{Abs[f_i^{\max }-y_k] }{Abs[ f_i^{\max }-f_i^{\min} ]}, f_i^{\min }<y_k<f_i^{\max} \}]

9. Find the measure estimate of \mu((f(x)>f(x_k)) with

 \gamma \sum_{i=1}^{n} v_c \xi_i

Axis splitting in Lebesgue integration rules

The implementations of Lebesgue integration rules are required to provide a splitting axis for use of the adaptive algorithms. Of course we can assign a random splitting axis, but that might lead often to slower computations. One way to provide splitting axis is to choose the axis that minimizes the sum of the variances of sub-divided regions estimated by samples of the rule points. This is the same approach taken in NIntegrate`s rule "MonteCarloRule"; for theoretical details see the chapter "7.8 Adaptive and Recursive Monte Carlo Methods" of [11].

In [5] this splitting axis choice is implemented based on integrand function values. Another approach, more in the spirit of the Lebesgue integration, is to select the splitting axis based on variances of the measure function estimates.

Consider the function:

DensityPlot[Exp[-3 (x - 1)^2 - 4 (y - 1)^2], {x, 0, 3}, {y, 0, 3},
 PlotRange -> All, ColorFunction -> "TemperatureMap"]


Here is an example of sampling points traces using "minimum variance" axis splitting in LebesgueIntegrationRule:

res = Reap@
   NIntegrate[Exp[-3 (x - 1)^2 - 4 (y - 1)^2], {x, 0, 3}, {y, 0, 3},
    Method -> {"GlobalAdaptive",
      Method -> {LebesgueIntegrationRule, "Points" -> 600,
        "PointGenerator" -> "Sobol",
        "AxisSelector" -> "MinVariance"},
      "SingularityHandler" -> None},
    EvaluationMonitor :> Sow[{x, y}],
    PrecisionGoal -> 2.5, MaxRecursion -> 3];
ListPlot[res[[2, 1]], AspectRatio -> Automatic]
(* 0.890916 *)

MinVariance axis split

And here are the sampling points with random selection of a splitting axis:

res = Reap@
   NIntegrate[Exp[-3 (x - 1)^2 - 4 (y - 1)^2], {x, 0, 3}, {y, 0, 3},
    Method -> {"GlobalAdaptive",
      Method -> {LebesgueIntegrationRule, "Points" -> 600,
        "PointGenerator" -> "Sobol",
        "AxisSelector" -> Random},
      "SingularityHandler" -> None},
    EvaluationMonitor :> Sow[{x, y}],
    PrecisionGoal -> 2.5, MaxRecursion -> 3];
ListPlot[res[[2, 1]], AspectRatio -> Automatic]
(* 0.892499 *)

Random axis split

Here is a more precise estimate of that integral:

NIntegrate[Exp[-3 (x - 1)^2 - 4 (y - 1)^2], {x, 0, 3}, {y, 0, 3}]
(* 0.898306 *)

Implementation design within NIntegrate’s framework

Basic usages

The strategy and rule implementations in [5] can be used in the following ways.

NIntegrate[Sqrt[x], {x, 0, 2}, Method -> LebesgueIntegration]
(* 1.88589 *)

NIntegrate[Sqrt[x], {x, 0, 2},
 Method -> {LebesgueIntegration, Method -> "LocalAdaptive",
   "Points" -> 2000, "PointGenerator" -> "Sobol"}, PrecisionGoal -> 3]
(* 1.88597 *)

NIntegrate[Sqrt[x], {x, 0, 2},
 Method -> {LebesgueIntegrationRule, "Points" -> 2000,
   "PointGenerator" -> "Sobol",
   "PointwiseMeasure" -> "VoronoiMesh"}, PrecisionGoal -> 3]
(* 1.88597 *)

NIntegrate[Sqrt[x + y + x], {x, 0, 2}, {y, 0, 3}, {z, 0, 4},
 Method -> {GridLebesgueIntegrationRule,
   Method -> "GaussKronrodRule", "Points" -> 2000,
   "GridSizes" -> 5, "PointGenerator" -> "Sobol"}, PrecisionGoal -> 3]
(* 43.6364 *)


Here are the options for the implemented strategy and rules in [5]:

Grid[Transpose[{#, ColumnForm[Options[#]]} & /@
   Alignment -> Left, Dividers -> All]

Lebesgue methods options

Using variable ranges

Integration with variable ranges works "out of the box."

NIntegrate[Sqrt[x + y], {x, 0, 2}, {y, 0, x}]
(* 2.75817 *)

NIntegrate[Sqrt[x + y], {x, 0, 2}, {y, 0, x},
 Method -> LebesgueIntegration, PrecisionGoal -> 2]
(* 2.75709 *)

NIntegrate[Sqrt[x + y], {x, 0, 2}, {y, 0, x},
 Method -> LebesgueIntegrationRule, PrecisionGoal -> 2]
(* 2.75663 *)

Infinite ranges

In order to get correct results with infinite ranges the wrapper

Method->{"UnitCubeRescaling","FunctionalRangesOnly"->False, _}

has to be used. Here is an example:

NIntegrate[1/(x^2 + 12), {x, 0, Infinity}]
(* 0.45345 *)

NIntegrate[1/(x^2 + 12), {x, 0, Infinity},  
  Method -> {"UnitCubeRescaling", "FunctionalRangesOnly" -> False, 
    Method -> {LebesgueIntegrationRule, "Points" -> 2000}}, PrecisionGoal -> 3]
(* 0.453366 *)

For some integrands we have to specify inter-range points or larger MinRecursion.

NIntegrate[1/(x^2), {x, 1, Infinity}]
(* 1. *)

NIntegrate[1/(x^2), {x, 1, 12, Infinity}, 
  Method -> {"UnitCubeRescaling", "FunctionalRangesOnly" -> False, 
    Method -> {LebesgueIntegrationRule, "Points" -> 1000}}]
(* 0.999466 *)

NIntegrate[1/(x^2), {x, 1, Infinity}, 
  Method -> {"UnitCubeRescaling", "FunctionalRangesOnly" -> False, 
    Method -> {LebesgueIntegrationRule, "Points" -> 1000}}]
(* 0. *)

Evaluation monitoring

With the option "EvaluationMonitor" we can see the sampling points for the strategy and the rules.

This is straightforward for the rules:

res = Reap@
   NIntegrate[Exp[-3 (x - 1)^2], {x, 0, 3},
    Method -> LebesgueIntegrationRule,
    EvaluationMonitor :> Sow[x],
    PrecisionGoal -> 3];
ListPlot[res[[2, 1]]]

Lebesgue rule sampling points

The strategy LebesgueIntegration uses an internal variable for the calculation of the Lebesgue integral. In "EvaluationMonitor" either that variable has to be used, or a symbol name has to be passed though the option "LebesgueIntegralVariableSymbol". Here is an example:

res = Reap@
   NIntegrate[Sqrt[x + y + z], {x, -1, 2}, {y, 0, 1}, {z, 1, 12},
    Method -> {LebesgueIntegration, "Points" -> 600,
      "PointGenerator" -> "Sobol",
      "LebesgueIntegralVariableSymbol" -> fval},
    EvaluationMonitor :> {Sow[fval]},
    PrecisionGoal -> 3];
res = DeleteCases[res, fval, \[Infinity]];
ListPlot[res[[2, 1]]]

Lebesgue strategy sampling points


We can use NIntegrate‘s utility functions for visualization and profiling in order to do comparison of the implemented algorithms with related ones (like "AdaptiveMonteCarlo") which NIntegrate has (or are plugged-in).


Common function and domain:

fexpr = 1/(375/100 - Cos[x] - Cos[y]);
ranges = {{x, 0, \[Pi]}, {y, 0, \[Pi]}};

Common parameters:

pgen = "Random";
npoints = 1000;

"AdaptiveMonteCarlo" call:

 NIntegrate[fexpr, Evaluate[Sequence @@ ranges],
  Method -> {"AdaptiveMonteCarlo",
    Method -> {"MonteCarloRule", "PointGenerator" -> pgen,
      "Points" -> npoints,
      "AxisSelector" -> "MinVariance"}},
  PrecisionGoal -> 3]
 (* {"IntegralEstimate" -> InputForm[2.8527356472097902`], "Evaluations" -> 17000, "Timing" -> 0.0192079} *)

LebesgueIntegrationRule call:

 NIntegrate[fexpr, Evaluate[Sequence @@ ranges],
  Method -> {"GlobalAdaptive",
    Method -> {LebesgueIntegrationRule,
      "PointGenerator" -> pgen, "Points" -> npoints,
      "AxisSelector" -> "MinVariance",
      Method -> "ClenshawCurtisRule"},
    "SingularityHandler" -> None}, PrecisionGoal -> 3]
 (* {"IntegralEstimate" -> InputForm[2.836659588960318], "Evaluations" -> 13000, "Timing" -> 0.384246} *)

Design diagrams

The flow charts below show that the plug-in designs have common elements. In order to make the computations more effective the rule initialization prepares the data that is used in all rule invocations. For the strategy the initialization can be much lighter since the strategy algorithm is executed only once.

In the flow charts the double line boxes designate sub-routines. We can see that so called Hollywood principle "don’t call us, we’ll call you" in Object-oriented programming is manifested.

The following flow chart shows the steps of NIntegrate‘s execution when the integration strategy LebesgueIntegration is used.

Lebesgue integration rule flow strategy

The following flow chart shows the steps of NIntegrate‘s execution when the integration rule LebesgueIntegrationRule is used.

Lebesgue integration rule flow chart

Algorithms versions and options

There are multiple architectural, coding, and interface decisions to make while doing implementations like the ones in [5] and described in this document. The following mind map provides an overview of alternatives and interactions between components and parameters.

"Mind map"

Alternative implementations

In many ways using the Lebesgue integration rule with the adaptive algorithms is similar to using NIntegrate‘s "AdaptiveMonteCarlo" and its rule "MonteCarloRule". Although it is natural to consider plugging-in the Lebesgue integration rules into "AdaptiveMonteCarlo" at this point NIntegrate‘s framework does not allow "AdaptiveMonteCarlo" the use of a rule different than "MonteCarloRule".

We can consider using Monte Carlo algorithms for estimating the measures corresponding to a vector of values (that come from a 1D integration rule). This can be easily done, but it is not that effective because of the way NIntegrate handles vector integrands and because of stopping criteria issues when the measures are close to 0.

Future plans

One of the most interesting extensions of the described Lebesgue integration algorithms and implementation designs is their extension with more advanced features of Mathematica for geometric computation. (Like the functions VoronoiMesh, RegionMeasure, and ImplicitRegion used above.)

Another interesting direction is the derivation and use of symbolic expressions for the measure functions. (Hybrid symbolic and numerical algorithms can be obtained as NIntegrate‘s handling of piecewise functions or the strategy combining symbolic and numerical integration described in [9].)


[1] Wikipedia entry, Lebesgue integration, URL: .

[2] Wikipedia entry, Low-discrepancy sequence, URL: .

[3] B. L. Burrows, A new approach to numerical integration, 1. Inst. Math. Applics., 26(1980), 151-173.

[4] T. He, Dimensionality Reducing Expansion of Multivariate Integration, 2001, Birkhauser Boston. ISBN-13:978-1-4612-7414-8 .

[5] A. Antonov, Adaptive Numerical Lebesgue Integration Mathematica Package, 2016, MathematicaForPrediction project at GitHub.

[6] A. Antonov, Lebesgue integration, Wolfram Demonstrations Project, 2007. URL: .

[7] "Advanced Numerical Integration in the Wolfram Language", Wolfram Research Inc. URL: .

[8] A. Antonov, "How to implement custom integration rules for use by NIntegrate?", (2016) Mathematica StackExchange answer, URL: .

[9] A. Antonov, "How to implement custom NIntegrate integration strategies?", (2016) Mathematica StackExchange answer, URL: .

[10] Wikipedia entry, Voronoi diagram, URL: .

[11] Press, W.H. et al., Numerical Recipes in C (2nd Ed.): The Art of Scientific Computing, 1992, Cambridge University Press, New York, NY, USA.

Making Chernoff faces for data visualization


This blog post describes the use of face-like diagrams to visualize multidimensional data introduced by Herman Chernoff in 1973, see [1].

The idea to use human faces in order to understand, evaluate, or easily discern (the records of) multidimensional data is very creative and inspirational. As Chernoff says in [1], the object of the idea is to "represent multivariate data, subject to strong but possibly complex relationships, in such a way that an investigator can quickly comprehend relevant information and then apply appropriate statistical analysis." It is an interesting question how useful this approach is and it seems that there at least several articles discussing that; see for example [2].

I personally find the use of Chernoff faces useful in a small number of cases, but that is probably true for many “creative” data visualization methods.

Below are given both simple and more advanced examples of constructing Chernoff faces for data records using the Mathematica package [3]. The considered data is categorized as:

  1. a small number of records, each with small number of elements;
  2. a large number of records, each with less elements than Chernoff face parts;
  3. a list of long records, each record with much more elements than Chernoff face parts;
  4. a list of nearest neighbors or recommendations.

For several of the visualizing scenarios the records of two "real life" data sets are used: Fisher Iris flower dataset [7], and "Vinho Verde" wine quality dataset [8]. For the rest of the scenarios the data is generated.

A fundamental restriction of using Chernoff faces is the necessity to properly transform the data variables into the ranges of the Chernoff face diagram parameters. Therefore, proper data transformation (standadizing and rescaling) is an inherent part of the application of Chernoff faces, and this document describes such data transformation procedures (also using [3]).

Package load

The packages [3,4] are used to produce the diagrams in this post. The following two commands load them.



Making faces

Just a face

Here is a face produced by the function ChernoffFace of [3] with its simplest signature:


Here is a list of faces:

Table[ChernoffFace[ImageSize -> Tiny], {7}]

Proper face making

The "proper" way to call ChernoffFace is to use an association for the facial parts placement, size, rotation, and color. The options are passed to Graphics.

ChernoffFace[ AssociationThread[
    Keys[ChernoffFace["FacePartsProperties"]] -> 
   RandomReal[1, Length[ChernoffFace["FacePartsProperties"]]]],
   ImageSize -> Small , Background -> GrayLevel[0.85]]

The Chernoff face drawn with the function ChernoffFace can be parameterized to be asymmetric.

The parameters argument mixes (1) face parts placement, sizes, and rotation, with (2) face parts colors and (3) a parameter should it be attempted to make the face symmetric. All facial parts parameters have the range [0,1]

Here is the facial parameter list:


(* {"FaceLength", "ForheadShape", "EyesVerticalPosition", "EyeSize", \
    "EyeSlant", "LeftEyebrowSlant", "LeftIris", "NoseLength", \
    "MouthSmile", "LeftEyebrowTrim", "LeftEyebrowRaising", "MouthTwist", \
    "MouthWidth", "RightEyebrowTrim", "RightEyebrowRaising", \
    "RightEyebrowSlant", "RightIris"} *)

The order of the parameters is chosen to favor making symmetric faces when a list of random numbers is given as an argument, and to make it easier to discern the faces when multiple records are visualized. For experiments and discussion about which facial features bring better discern-ability see [2]. One of the conclusions of [2] is that eye size and eye brow slant are most decisive, followed by face size and shape.

Here are the rest of the parameters (colors and symmetricity):


(* {"EyeBallColor", "FaceColor", "IrisColor", "MakeSymmetric", \
    "MouthColor", "NoseColor"} *)

Face coloring

The following code make a row of faces by generating seven sequences of random numbers in \([0,1]\), each sequence with length the number of facial parameters. The face color is assigned randomly and the face color or a darker version of it is used as a nose color. If the nose color is the same as the face color the nose is going to be shown "in profile", otherwise as a filled polygon. The colors of the irises are random blend between light brown and light blue. The color of the mouth is randomly selected to be black or red.

Block[{pars = Keys[ChernoffFace["FacePartsProperties"]]},
 Grid[{#}] &@
     AssociationThread[pars -> RandomReal[1, Length[pars]]],
     <|"FaceColor" -> (rc = 
      "NoseColor" -> RandomChoice[{Identity, Darker}][rc],
      "IrisColor" -> Lighter[Blend[{Brown, Blue}, RandomReal[1]]],
      "MouthColor" -> RandomChoice[{Black, Red}]|>], 
    ImageSize -> 100], {7}]

Symmetric faces

The parameter "MakeSymmetric" is by default True. Setting "MakeSymmetric" to true turns an incomplete face specification into a complete specification with the missing paired parameters filled in. In other words, the symmetricity is not enforced on the specified paired parameters, only on the ones for which specifications are missing.

The following faces are made symmetric by removing the facial parts parameters that start with "R" (for "Right") and the parameter "MouthTwist".

Block[{pars = Keys[ChernoffFace["FacePartsProperties"]]},
 Grid[{#}] &@Table[(
    asc = 
       pars -> RandomReal[1, Length[pars]]],
      <|"FaceColor" -> (rc = 
       "NoseColor" -> RandomChoice[{Identity, Darker}][rc],
       "IrisColor" -> 
        Lighter[Blend[{Brown, Blue}, RandomReal[1]]],
       "MouthColor" -> RandomChoice[{Black, Red}]|>];
    asc = 
       x : (StartOfString ~~ Except["R"] ~~ __) /; 
        x != "MouthTwist"]];
    ChernoffFace[asc, ImageSize -> 100]), {7}]]

Note that for the irises we have two possibilities of synchronization:

pars = <|"LeftIris" -> 0.8, "IrisColor" -> Green|>;
{ChernoffFace[Join[pars, <|"RightIris" -> pars["LeftIris"]|>], 
  ImageSize -> 100], 
  Join[pars, <|"RightIris" -> 1 - pars["LeftIris"]|>], 
  ImageSize -> 100]}

Visualizing records (first round)

The conceptually straightforward application of Chernoff faces is to visualize ("give a face" to) each record in a dataset. Because the parameters of the faces have the same ranges for the different records, proper rescaling of the records have to be done first. Of course standardizing the data can be done before rescaling.

First let us generate some random data using different distributions:

{dists, data} = Transpose@Table[(
     rdist = 
         RandomReal[10]], PoissonDistribution[RandomReal[4]], 
        GammaDistribution[RandomReal[{2, 6}], 2]}];
     {rdist, RandomVariate[rdist, 12]}), {10}];
data = Transpose[data];

The data is generated in such a way that each column comes from a certain probability distribution. Hence, each record can be seen as an observation of the variables corresponding to the columns.

This is how the columns of the generated data look like using DistributionChart:

 ChartLabels -> 
    Grid[List /@ {Style[#2[[1]], Bold, Red, Larger]}] &, dists], Above], 
 ChartElementFunction -> "PointDensity", 
 ChartStyle -> "SandyTerrain", ChartLegends -> dists, 
 BarOrigin -> Bottom, GridLines -> Automatic, 
 ImageSize -> 900]

At this point we can make a face for each record of the rescaled data:

faces = Map[ChernoffFace, Transpose[Rescale /@ Transpose[data]]];

and visualize the obtained faces in a grid.

   Partition[#, 4] &@Map[Append[#, ImageSize -> 100] &, faces]],
  "   ", Magnify[#, 0.85] &@
    List /@ Take[Keys[ChernoffFace["FacePartsProperties"]], 
    TableHeadings -> {"Face part"}]

(The table on the right shows which facial parts are used for which data columns.)

Some questions to consider

Several questions and observations arise from the example in the previous section.

1. What should we do if the data records have more elements than facial parts parameters of the Chernoff face diagram?

This is another fundamental restriction of Chernoff faces — the number of data columns is limiter by the number of facial features.

One way to resolve this is to select important variables (columns) of the data; another is to represent the records with a vector of statistics. The latter is shown in the section "Chernoff faces for lists of long lists".

2. Are there Chernoff face parts that are easier to perceive or judge than others and provide better discern-ability for large collections of records?

Research of the pre-attentiveness and effectiveness with Chernoff faces, [2], shows that eye size and eyebrow slant are the features that provide best discern-ability. Below this is used to select some of the variable-to-face-part correspondences.

3. How should we deal with outliers?

Since we cannot just remove the outliers from a record — we have to have complete records — we can simply replace the outliers with the minimum or maximum values allowed for the corresponding Chernoff face feature. (All facial features of ChernoffFace have the range \([0,1]\).) See the next section for an example.

Data standardizing and rescaling

Given a full array of records, we most likely have to standardize and rescale the columns in order to use the function ChernoffFace. To help with that the package [3] provides the function VariablesRescale which has the options "StandardizingFunction" and "RescaleRangeFunction".

Consider the following example of VariableRescale invocation in which: 1. each column is centered around its median and then divided by the inter-quartile half-distance (quartile deviation), 2. followed by clipping of the outliers that are outside of the disk with radius 3 times the quartile deviation, and 3. rescaling to the unit interval.

rdata = VariablesRescale[N@data,
   "StandardizingFunction" -> (Standardize[#, Median, QuartileDeviation] &),
   "RescaleRangeFunction" -> ({-3, 3} QuartileDeviation[#] &)];
TableForm[rdata /. {0 -> Style[0, Bold, Red], 1 -> Style[1, Bold, Red]}]

Remark: The bottom outliers are replaced with 0 and the top outliers with 1 using Clip.

Chernoff faces for a small number of short records

In this section we are going use the Fisher Iris flower data set [7]. By "small number of records" we mean few hundred or less.

Getting the data

These commands get the Fisher Iris flower data set shipped with Mathematica:

irisDataSet = 
   List @@@ ExampleData[{"MachineLearning", "FisherIris"}, "Data"]];
irisColumnNames = 
    List @@ ExampleData[{"MachineLearning", "FisherIris"}, 

(* {150, 5} *)

Here is a summary of the data:

Grid[{RecordsSummary[irisDataSet, irisColumnNames]}, 
 Dividers -> All, Alignment -> Top]

Simple variable dependency analysis

Using the function VariableDependenceGrid of the package [4] we can plot a grid of variable cross-dependencies. We can see from the last row and column that "Petal length" and "Petal width" separate setosa from versicolor and virginica with a pretty large gap.

Magnify[#, 1] &@
 VariableDependenceGrid[irisDataSet, irisColumnNames, 
  "IgnoreCategoricalVariables" -> False]

Chernoff faces for Iris flower records

Since we want to evaluate the usefulness of Chernoff faces for discerning data records groups or clusters, we are going to do the following steps.

  1. Data transformation. This includes standardizing and rescaling and selection of colors.
  2. Make a Chernoff face for each record without the label class "Species of iris".
  3. Plot shuffled Chernoff faces and attempt to visually cluster them or find patterns.
  4. Make a Chernoff face for each record using the label class "Specie of iris" to color the faces. (Records of the same class get faces of the same color.)
  5. Compare the plots and conclusions of step 2 and 4.

1. Data transformation

First we standardize and rescale the data:

chernoffData = VariablesRescale[irisDataSet[[All, 1 ;; 4]]];

These are the colors used for the different species of iris:

faceColorRules = 
 Thread[Union[ irisDataSet[[All, -1]]] \
 -> Map[Lighter[#, 0.5] &, {Purple, Blue, Green}]]

(* {"setosa" -> RGBColor[0.75, 0.5, 0.75], 
    "versicolor" -> RGBColor[0.5, 0.5, 1.], 
    "virginica" -> RGBColor[0.5, 1., 0.5]} *)

Add the colors to the data for the faces:

chernoffData = MapThread[
   Append, {chernoffData, irisDataSet[[All, -1]] /. faceColorRules}];

Plot the distributions of the rescaled variables:

 Transpose@chernoffData[[All, 1 ;; 4]], 
 GridLines -> Automatic, 
 ChartElementFunction -> "PointDensity", 
 ChartStyle -> "SandyTerrain", 
 ChartLegends -> irisColumnNames, ImageSize -> Large]

2. Black-and-white Chernoff faces

Make a black-and-white Chernoff face for each record without using the species class:

chfacesBW = 
     AssociationThread[{"NoseLength", "LeftEyebrowTrim", "EyeSize", 
        "LeftEyebrowSlant"} -> Most[#]], 
     ImageSize -> 100] & /@ chernoffData;

Since "Petal length" and "Petal width" separate the classes well for those columns we have selected the parameters "EyeSize" and "LeftEyebrowSlant" based on [2].

3. Finding patterns in a collection of faces

Combine the faces into a image collage:

ImageCollage[RandomSample[chfacesBW], Background -> White]

We can see that faces with small eyes tend have middle-lowered eyebrows, and that faces with large eyes tend to have middle raised eyebrows and large noses.

4. Chernoff faces colored by the species

Make a Chernoff face for each record using the colors added to the rescaled data:

chfaces = 
     AssociationThread[{"NoseLength", "LeftEyebrowTrim", "EyeSize", 
        "LeftEyebrowSlant", "FaceColor"} -> #], 
     ImageSize -> 100] & /@ chernoffData;

Make an image collage with the obtained faces:

ImageCollage[chfaces, Background -> White]

5. Comparison

We can see that the collage with colored faces completely explains the patterns found in the black-and-white faces: setosa have smaller petals (both length and width), and virginica have larger petals.

Browsing a large number of records with Chernoff faces

If we have a large number of records each comprised of a relative small number of numerical values we can use Chernoff faces to browse the data by taking small subsets of records.

Here is an example using "Vinho Verde" wine quality dataset [8].

Chernoff faces for lists of long lists

In this section we consider data that is a list of lists. Each of the lists (or rows) is fairly long and represents values of the same variable or process. If the data is a full array, then we can say that in this section we deal with transposed versions of the data in the previous sections.

Since each row is a list of many elements visualizing the rows directly with Chernoff faces would mean using a small fraction of the data. A natural approach in those situations is to summarize each row with a set of descriptive statistics and use Chernoff faces for the row summaries.

The process is fairly straightforward; the rest of the section gives concrete code steps of executing it.

Data generation

Here we create 12 rows of 200 elements by selecting a probability distribution for each row.

{dists, data} = Transpose@Table[(
     rdist = 
         RandomReal[10]], PoissonDistribution[RandomReal[4]], 
        GammaDistribution[RandomReal[{2, 6}], 2]}];
     {rdist, RandomVariate[rdist, 200]}), {12}];

(* {12, 200} *)

We have the following 12 "records" each with 200 "fields":

 ChartLabels -> 
   Row[{Style[#2[[1]], Red, Larger],
        "  ", Style[#1, Larger]}] &, dists], 
 ChartElementFunction -> 
   "ColorScheme" -> "SouthwestColors"], BarOrigin -> Left, 
 GridLines -> Automatic, ImageSize -> 1000]

Here is the summary of the records:

Grid[ArrayReshape[RecordsSummary[Transpose@N@data], {3, 4}], 
 Dividers -> All, Alignment -> {Left}]

Data transformation

Here we "transform" each row into a vector of descriptive statistics:

statFuncs = {Mean, StandardDeviation, Kurtosis, Median, 
   QuartileDeviation, PearsonChiSquareTest};
sdata = Map[Through[statFuncs[#]] &, data];

(* {12, 6} *)

Next we rescale the descriptive statistics data:

sdata = VariablesRescale[sdata, 
   "StandardizingFunction" -> (Standardize[#, Median, 
       QuartileDeviation] &)];

For kurtosis we have to do special rescaling if we want to utilize the property that Gaussian processes have kurtosis 3:

sdata[[All, 3]] = 
  Rescale[#, {3, Max[#]}, {0.5, 1}] &@Map[Kurtosis, N[data]];

Here is the summary of the columns of the rescaled descriptive statistics array:

Grid[{RecordsSummary[sdata, ToString /@ statFuncs]}, 
 Dividers -> All]


First we define a function that computes and tabulates (descriptive) statistics over a record.

TipTable[vec_, statFuncs_, faceParts_] :=
     Transpose@{faceParts, statFuncs, 
       NumberForm[Chop[#], 2] & /@ Through[statFuncs[vec]]}, 
     TableHeadings -> {"FacePart", "Statistic", "Value"}]] /; 
   Length[statFuncs] == Length[faceParts];

To visualize the descriptive statistics of the records using Chernoff faces we have to select appropriate facial features.

faceParts = {"NoseLength", "EyeSize", "EyeSlant", 
   "EyesVerticalPosition", "FaceLength", "MouthSmile"};
TipTable[First@sdata, statFuncs, faceParts]

One possible visualization of all records is with the following commands. Note the addition of the parameter "FaceColor" to also represent how close a standardized row is to a sample from Normal Distribution.

{odFaceColor, ndFaceColor} = {White,  ColorData[7, "ColorList"][[8]]};
Grid[ArrayReshape[Flatten@#, {4, 3}, ""], Dividers -> All, 
   Alignment -> {Left, Top}] &@
  (asc = AssociationThread[faceParts -> #2];
    chFace = 
      Join[asc, <|
        "FaceColor" -> Blend[{odFaceColor, ndFaceColor}, #2[[-1]]], 
        "IrisColor" -> GrayLevel[0.8], 
        "NoseColor" -> ndFaceColor|>], ImageSize -> 120, 
      AspectRatio -> Automatic];
    tt = TipTable[N@#3, Join[statFuncs, {Last@statFuncs}], 
      Join[faceParts, {"FaceColor"}]];
    Column[{Style[#1, Red], 
      Grid[{{Magnify[#4, 0.8], 
         Tooltip[chFace, tt]}, {Magnify[tt, 0.7], SpanFromAbove}}, 
       Alignment -> {Left, Top}]}]) &
  , {Range[Length[sdata]], sdata, data, dists}]

Visualizing similarity with nearest neighbors or recommendations

General idea

Assume the following scenario: 1. we have a set of items (movies, flowers, etc.), 2. we have picked one item, 3. we have computed the Nearest Neighbors (NNs) of that item, and 4. we want to visualize how much of a good fit the NNs are to the picked item.

Conceptually we can translate the phrase "how good the found NNs (or recommendations) are" to:

  • "how similar the NNs are to the selected item", or

  • "how different the NNs are to the selected item."

If we consider the picked item as the prototype of the most normal or central item then we can use Chernoff faces to visualize item’s NNs deviations.

Remark: Note that Chernoff faces provide similarity visualization most linked to Euclidean distance that to other distances.

Concrete example

The code in this section demonstrates how to visualize nearest neighbors by Chernoff faces variations.

First we create a nearest neighbors finding function over the Fisher Iris data set (without the species class label):

irisNNFunc = 
 Nearest[irisDataSet[[All, 1 ;; -2]] -> Automatic, 
  DistanceFunction -> EuclideanDistance]

Here are nearest neighbors of some random row from the data.

itemInd = 67;
nnInds = irisNNFunc[irisDataSet[[itemInd, 1 ;; -2]], 20];

We can visualize the distances with of the obtained NNs with the prototype:

  EuclideanDistance[#, irisDataSet[[itemInd, 1 ;; -2]]] &, 
  irisDataSet[[nnInds, 1 ;; -2]]]]

Next we subtract the prototype row from the NNs data rows, we standardize, and we rescale the interval \([ 0, 3 \sigma ]\) to \([ 0.5, 1 ]\):

snns = Transpose@Map[
       Standardize[#, 0 &, StandardDeviation], {0, 3}, {0.5, 1}], {0, 
       1}] &,
     Map[# - irisDataSet[[itemInd, 1 ;; -2]] &, 
      irisDataSet[[nnInds, 1 ;; -2]]]];

Here is how the original NNs data row look like:

 Take[irisDataSet[[nnInds]], 12], TableHeadings -> irisColumnNames]

And here is how the rescaled NNs data rows look like:

GridTableForm[Take[snns, 12], 
 TableHeadings -> Most[irisColumnNames]]

Next we make Chernoff faces for the rescaled rows and present them in a easier to grasp way.

We use the face parts:

Take[Keys[ChernoffFace["FacePartsProperties"]], 4]

(* {"FaceLength", "ForheadShape", "EyesVerticalPosition", "EyeSize"} *)

To make the face comparison easier, the first face is the one of the prototype, each Chernoff face is drawn within the same rectangular frame, and the NNs indices are added on top of the faces.

chfaces = 
  ChernoffFace[#, Frame -> True, 
     PlotRange -> {{-1, 1}, {-2, 1.5}}, FrameTicks -> False, 
     ImageSize -> 100] & /@ snns;
chfaces = 
     1 -> 
       Text[Style[#2, Bold, Red], {0, 1.4}]]] &, {chfaces, nnInds}];
ImageCollage[chfaces, Background -> GrayLevel[0.95]]

We can see that the first few – i.e. closest — NNs have fairly normal looking faces.

Note that using a large number of NNs would change the rescaled values and in that way the first NNs would appear more similar.


[1] Herman Chernoff (1973). "The Use of Faces to Represent Points in K-Dimensional Space Graphically" (PDF). Journal of the American Statistical Association (American Statistical Association) 68 (342): 361-368. doi:10.2307/2284077. JSTOR 2284077. URL: .

[2] Christopher J. Morris; David S. Ebert; Penny L. Rheingans, "Experimental analysis of the effectiveness of features in Chernoff faces", Proc. SPIE 3905, 28th AIPR Workshop: 3D Visualization for Data Exploration and Decision Making, (5 May 2000); doi: 10.1117/12.384865. URL: .

[3] Anton Antonov, Chernoff Faces implementation in Mathematica, (2016), source code at MathematicaForPrediction at GitHub, package ChernofFacess.m .

[4] Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m.

[5] Anton Antonov, Variable importance determination by classifiers implementation in Mathematica,(2015), source code at MathematicaForPrediction at GitHub, package VariableImportanceByClassifiers.m.

[6] Anton Antonov, "Importance of variables investigation guide", (2016), MathematicaForPrediction at GitHub,, folder Documentation.

[7] Wikipedia entry, Iris flower data set, _set .

[8] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. URL .

Comparison of PCA, NNMF, and ICA over image de-noising


In a previous blog post, [1], I compared Principal Component Analysis (PCA) / Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NNMF) over a collection of noised images of digit handwriting from the MNIST data set, [3], which is available in Mathematica.

This blog post adds to that comparison the use of Independent Component Analysis (ICA) proclaimed in my previous blog post, [1].


The ICA related additional computations to those in [1] follow.

First we load the package IndependentComponentAnalysis.m :


From that package we can use the function IndependentComponentAnalysis to find components:

{S, A} = IndependentComponentAnalysis[Transpose[noisyTestImagesMat], 9, PrecisionGoal -> 4.5];
Norm[noisyTestImagesMat - Transpose[S.A]]/Norm[noisyTestImagesMat]
(* 0.592739 *)

Let us visualize the norms of the mixing matrix A :

norms = Norm /@ A;
ListPlot[norms, PlotRange -> All, PlotLabel -> "Norms of A rows", 
    PlotTheme -> "Detailed"] // 
pos = OutlierPosition[norms, TopOutliers@*HampelIdentifierParameters]

Norms of the mixing matrix

Next we can visualize the found "source" images:

ncol = 2;
 MapIndexed[{#2[[1]], Norm[#], 
   ImageAdjust@Image[Partition[#, dims[[1]]]]} &, 
 Transpose[S]] /. (# -> Style[#, Red] & /@ pos),
Table["", {ncol - QuotientRemainder[Dimensions[S][[2]], ncol][[2]]}]
], ncol], Dividers -> All]

ICA found source images of 6 and 7 images matrix

After selecting several of source images we zero the rest by modifying the matrix A:

pos = {6, 7, 9};
norms = Norm /@ A;
dA = DiagonalMatrix[
 ReplacePart[ConstantArray[0, Length[norms]], Map[List, pos] -> 1]];
newMatICA = 
 Transpose[Map[# + Mean[Transpose[noisyTestImagesMat]] &, S.dA.A]];
  denoisedImagesICA = Map[Image[Partition[#, dims[[2]]]] &, newMatICA];

Visual comparison of de-noised images

Next we visualize the ICA de-noised images together with the original images and the SVD and NNMF de-noised ones.

There are several ways to present that combination of images.

Grayscale 6 and 7 images, orginal, noised, PCA, NNMF, ICA de-noised

Binarized 6 and 7 images, orginal, noised, PCA, NNMF, ICA de-noised

Image collage of orginal, noised, PCA, NNMF, ICA de-noised 6 and 7 images

Comparison using classifiers

We can further compare the de-noising results by building digit classifiers and running them over the de-noised images.

icaCM = ClassifierMeasurements[digitCF, 
    Thread[(Binarize[#, 0.55] &@*ImageAdjust@*ColorNegate /@ 
        denoisedImagesICA) -> testImageLabels]]

We can see that with ICA we get better results than with PCA/SVD, probably not as good as NNMF, but very close.

Classifier comparison over PCA, NNMF, ICA de-noised images of 6 and 7

All images of this blog post

Computations results for ICA application of noised handwriting images of 6 an 7


[1] A. Antonov, "Comparison of PCA and NNMF over image de-noising", (2016), MathematicaForPrediction at WordPress.

[2] A. Antonov, "Independent Component Analysis for multidimensional signals", (2016), MathematicaForPrediction at WordPress.

[3] Wikipedia entry, MNIST database.