# Introduction

The main goals of this document are:

i) to demonstrate how to create versions and combinations of classifiers utilizing different perspectives,

ii) to apply the Receiver Operating Characteristic (ROC) technique into evaluating the created classifiers (see [2,3]) and

iii) to illustrate the use of the Mathematica packages [5,6].

The concrete steps taken are the following:

1. Obtain data: Mathematica built-in or external. Do some rudimentary analysis.

2. Create an ensemble of classifiers and compare its performance to the individual classifiers in the ensemble.

3. Produce classifier versions with from changed data in order to explore the effect of records outliers.

4. Make a bootstrapping classifier ensemble and evaluate and compare its performance.

5. Systematically diminish the training data and evaluate the results with ROC.

6. Show how to do classifier interpolation utilizing ROC.

In the steps above we skip the necessary preliminary data analysis. For the datasets we use in this document that analysis has been done elsewhere. (See [,,,].) Nevertheless, since ROC is mostly used for binary classifiers we want to analyze the class labels distributions in the datasets in order to designate which class labels are "positive" and which are "negative."

## ROC plots evaluation (in brief)

Assume we are given a binary classifier with the class labels P and N (for "positive" and "negative" respectively).

Consider the following measures True Positive Rate (TPR): and False Positive Rate (FPR): Assume that we can change the classifier results with a parameter and produce a plot like this one: For each parameter value the point is plotted; points corresponding to consecutive ‘s are connected with a line. We call the obtained curve the ROC curve for the classifier in consideration. The ROC curve resides in the ROC space as defined by the functions FPR and TPR corresponding respectively to the -axis and the -axis.

The ideal classifier would have its ROC curve comprised of a line connecting {0,0} to {0,1} and a line connecting {0,1} to {1,1}.

Given a classifier the ROC point closest to {0,1}, generally, would be considered to be the best point.

## The wider perspective

This document started as being a part of a conference presentation about illustrating the cultural differences between Statistics and Machine learning (for Wolfram Technology Conference 2016). Its exposition become both deeper and wider than expected. Here are the alternative, original goals of the document:

i) to demonstrate how using ROC a researcher can explore classifiers performance without intimate knowledge of the classifiers mechanisms, and

ii) to provide concrete examples of the typical investigation approaches employed by machine learning researchers.

To make those points clearer and more memorable we are going to assume that exposition is a result of the research actions of a certain protagonist with a suitably selected character.

A by-product of the exposition is that it illustrates the following lessons from machine learning practices. (See .)

1. For a given classification task there often are multiple competing models.

2. The outcomes of the good machine learning algorithms might be fairly complex. I.e. there are no simple interpretations when really good results are obtained.

3. Having high dimensional data can be very useful.

In  these three points and discussed under the names "Rashomon", "Occam", and "Bellman". To quote:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality — curse or blessing."

## The protagonist

Our protagonist is a "Simple Nuclear Physicist" (SNP) — someone who is accustomed to obtaining a lot of data that has to be analyzed and mined sometimes very deeply, rigorously, and from a lot of angles, for different hypotheses. SNP is fairly adept in programming and critical thinking, but he does not have or care about deep knowledge of statistics methods or machine learning algorithms. SNP is willing and capable to use software libraries that provide algorithms for statistical methods and machine learning.

SNP is capable of coming up with ROC if he is not aware of it already. ROC is very similar to the so called phase space diagrams physicists do.

# Used packages

These commands load the used Mathematica packages [4,5,6]:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MathematicaForPredictionUtilities.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/ROCFunctions.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/ClassifierEnsembles.m"]

# Data used

## The Titanic dataset

These commands load the Titanic data (that is shipped with Mathematica).

data = ExampleData[{"MachineLearning", "Titanic"}, "TrainingData"];
columnNames = (Flatten@*List) @@ ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"];
data = ((Flatten@*List) @@@ data)[[All, {1, 2, 3, -1}]];
trainingData = DeleteCases[data, {___, _Missing, ___}];
Dimensions[trainingData]

(* {732, 4} *)

RecordsSummary[trainingData, columnNames] data = ExampleData[{"MachineLearning", "Titanic"}, "TestData"];
data = ((Flatten@*List) @@@ data)[[All, {1, 2, 3, -1}]];
testData = DeleteCases[data, {___, _Missing, ___}];
Dimensions[testData]

(* {314, 4} *)

RecordsSummary[testData, columnNames] nTrainingData = trainingData /. {"survived" -> 1, "died" -> 0, "1st" -> 0, "2nd" -> 1, "3rd" -> 2, "male" -> 0, "female" -> 1};

# Classifier ensembles

This command makes a classifier ensemble of two built-in classifiers "NearestNeighbors" and "NeuralNetwork":

aCLs = EnsembleClassifier[{"NearestNeighbors", "NeuralNetwork"}, trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]]] A classifier ensemble of the package  is simply an association mapping classifier IDs to classifier functions.

The first argument given to EnsembleClassifier can be Automatic:

SeedRandom
aCLs = EnsembleClassifier[Automatic, trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]]];

With Automatic the following built-in classifiers are used:

Keys[aCLs]

(* {"NearestNeighbors", "NeuralNetwork", "LogisticRegression", "RandomForest", "SupportVectorMachine", "NaiveBayes"} *)

Classification with the classifier ensemble can be done using the function EnsembleClassify. If the third argument of EnsembleClassify is "Votes" the result is the class label that appears the most in the ensemble results.

EnsembleClassify[aCLs, testData[[20, 1 ;; -2]], "Votes"]

(* "died" *)

The following commands clarify the voting done in the command above.

Map[#[testData[[20, 1 ;; -2]]] &, aCLs]
Tally[Values[%]]

(* <|"NearestNeighbors" -> "died", "NeuralNetwork" -> "survived", "LogisticRegression" -> "survived", "RandomForest" -> "died", "SupportVectorMachine" -> "died", "NaiveBayes" -> "died"|> *)

(* {{"died", 4}, {"survived", 2}} *)

## Classification with ensemble averaged probabilities

If the third argument of EnsembleClassify is "ProbabilitiesMean" the result is the class label that has the highest mean probability in the ensemble results.

EnsembleClassify[aCLs, testData[[20, 1 ;; -2]], "ProbabilitiesMean"]

(* "died" *)

The following commands clarify the probability averaging utilized in the command above.

Map[#[testData[[20, 1 ;; -2]], "Probabilities"] &, aCLs]
Mean[Values[%]]

(* <|"NearestNeighbors" -> <|"died" -> 0.598464, "survived" -> 0.401536|>, "NeuralNetwork" -> <|"died" -> 0.469274, "survived" -> 0.530726|>, "LogisticRegression" -> <|"died" -> 0.445915, "survived" -> 0.554085|>,
"RandomForest" -> <|"died" -> 0.652414, "survived" -> 0.347586|>, "SupportVectorMachine" -> <|"died" -> 0.929831, "survived" -> 0.0701691|>, "NaiveBayes" -> <|"died" -> 0.622061, "survived" -> 0.377939|>|> *)

(* <|"died" -> 0.61966, "survived" -> 0.38034|> *)

The third argument of EnsembleClassifyByThreshold takes a rule of the form label->threshold; the fourth argument is eighter "Votes" or "ProbabiltiesMean".

The following code computes the ROC curve for a range of votes.

rocRange = Range[0, Length[aCLs] - 1, 1];
aROCs = Table[(
cres = EnsembleClassifyByThreshold[aCLs, testData[[All, 1 ;; -2]], "survived" -> i, "Votes"]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
ROCPlot[rocRange, aROCs, "PlotJoined" -> Automatic, GridLines -> Automatic] ## ROC for ensemble probabilities mean

If we want to compute ROC of a range of probability thresholds we EnsembleClassifyByThreshold with the fourth argument being "ProbabilitiesMean".

EnsembleClassifyByThreshold[aCLs, testData[[1 ;; 6, 1 ;; -2]], "survived" -> 0.2, "ProbabilitiesMean"]

(* {"survived", "survived", "survived", "survived", "survived", "survived"} *)

EnsembleClassifyByThreshold[aCLs, testData[[1 ;; 6, 1 ;; -2]], "survived" -> 0.6, "ProbabilitiesMean"]

(* {"survived", "died", "survived", "died", "died", "survived"} *)

The implementation of EnsembleClassifyByThreshold with "ProbabilitiesMean" relies on the ClassifierFunction signature:

ClassifierFunction[__][record_, "Probabilities"]

Here is the corresponding ROC plot:

rocRange = Range[0, 1, 0.025];
aROCs = Table[(
cres = EnsembleClassifyByThreshold[aCLs, testData[[All, 1 ;; -2]], "survived" -> i, "ProbabilitiesMean"]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
rocEnGr = ROCPlot[rocRange, aROCs, "PlotJoined" -> Automatic, PlotLabel -> "Classifier ensemble", GridLines -> Automatic] ## Comparison of the ensemble classifier with the standard classifiers

This plot compares the ROC curve of the ensemble classifier with the ROC curves of the classifiers that comprise the ensemble.

rocGRs = Table[
aROCs1 = Table[(
cres = ClassifyByThreshold[aCLs[[i]], testData[[All, 1 ;; -2]], "survived" -> th];
ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {th, rocRange}];
ROCPlot[rocRange, aROCs1, PlotLabel -> Keys[aCLs][[i]], PlotRange -> {{0, 1.05}, {0.6, 1.01}}, "PlotJoined" -> Automatic, GridLines -> Automatic],
{i, 1, Length[aCLs]}];

GraphicsGrid[ArrayReshape[Append[Prepend[rocGRs, rocEnGr], rocEnGr], {2, 4}, ""], Dividers -> All, FrameStyle -> GrayLevel[0.8], ImageSize -> 1200] Let us plot all ROC curves from the graphics grid above into one plot. For that the single classifier ROC curves are made gray, and their threshold callouts removed. We can see that the classifier ensemble brings very good results for and none of the single classifiers has a better point.

Show[Append[rocGRs /. {RGBColor[___] -> GrayLevel[0.8]} /. {Text[p_, ___] :> Null} /. ((PlotLabel -> _) :> (PlotLabel -> Null)), rocEnGr]] # Classifier ensembles by bootstrapping

There are several ways to produce ensemble classifiers using bootstrapping or jackknife resampling procedures.

First, we are going to make a bootstrapping classifier ensemble using one of the Classify methods. Then we are going to make a more complicated bootstrapping classifier with six methods of Classify.

## Bootstrapping ensemble with a single classification method

First we select a classification method and make a classifier with it.

clMethod = "NearestNeighbors";
sCL = Classify[trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]], Method -> clMethod];

The following code makes a classifier ensemble of 12 classifier functions using resampled, slightly smaller (10%) versions of the original training data (with RandomChoice).

SeedRandom;
aBootStrapCLs = Association@Table[(
inds = RandomChoice[Range[Length[trainingData]], Floor[0.9*Length[trainingData]]];
ToString[i] -> Classify[trainingData[[inds, 1 ;; -2]] -> trainingData[[inds, -1]], Method -> clMethod]), {i, 12}];

Let us compare the ROC curves of the single classifier with the bootstrapping derived ensemble.

rocRange = Range[0.1, 0.9, 0.025];
AbsoluteTiming[
aSingleROCs = Table[(
cres = ClassifyByThreshold[sCL, testData[[All, 1 ;; -2]], "survived" -> i]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
aBootStrapROCs = Table[(
cres = EnsembleClassifyByThreshold[aBootStrapCLs, testData[[All, 1 ;; -2]], "survived" -> i]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
]

(* {6.81521, Null} *)

Legended[
Show[{
ROCPlot[rocRange, aSingleROCs, "ROCColor" -> Blue, "PlotJoined" -> Automatic, GridLines -> Automatic],
ROCPlot[rocRange, aBootStrapROCs, "ROCColor" -> Red, "PlotJoined" -> Automatic]}],
SwatchLegend @@ Transpose@{{Blue, Row[{"Single ", clMethod, " classifier"}]}, {Red, Row[{"Boostrapping ensemble of\n", Length[aBootStrapCLs], " ", clMethod, " classifiers"}]}}] We can see that we get much better results with the bootstrapped ensemble.

## Bootstrapping ensemble with multiple classifier methods

This code creates an classifier ensemble using the classifier methods corresponding to Automatic given as a first argument to EnsembleClassifier.

SeedRandom
AbsoluteTiming[
aBootStrapLargeCLs = Association@Table[(
inds = RandomChoice[Range[Length[trainingData]], Floor[0.9*Length[trainingData]]];
ecls = EnsembleClassifier[Automatic, trainingData[[inds, 1 ;; -2]] -> trainingData[[inds, -1]]];
AssociationThread[Map[# <> "-" <> ToString[i] &, Keys[ecls]] -> Values[ecls]]
), {i, 12}];
]

(* {27.7975, Null} *)

This code computes the ROC statistics with the obtained bootstrapping classifier ensemble:

AbsoluteTiming[
aBootStrapLargeROCs = Table[(
cres = EnsembleClassifyByThreshold[aBootStrapLargeCLs, testData[[All, 1 ;; -2]], "survived" -> i]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
]

(* {45.1995, Null} *)

Let us plot the ROC curve of the bootstrapping classifier ensemble (in blue) and the single classifier ROC curves (in gray):

aBootStrapLargeGr = ROCPlot[rocRange, aBootStrapLargeROCs, "PlotJoined" -> Automatic];
Show[Append[rocGRs /. {RGBColor[___] -> GrayLevel[0.8]} /. {Text[p_, ___] :> Null} /. ((PlotLabel -> _) :> (PlotLabel -> Null)), aBootStrapLargeGr]] Again we can see that the bootstrapping ensemble produced better ROC points than the single classifiers.

# Damaging data

This section tries to explain why the bootstrapping with resampling to smaller sizes produces good results.

In short, the training data has outliers; if we remove small fractions of the training data we might get better results.

The procedure described in this section can be used in conjunction with the procedures described in the guide for importance of variables investigation .

## Ordering function

Let us replace the categorical values with numerical in the training data. There are several ways to do it, here is a fairly straightforward one:

nTrainingData = trainingData /. {"survived" -> 1, "died" -> 0, "1st" -> 0, "2nd" -> 1, "3rd" -> 2, "male" -> 0, "female" -> 1};

## Decreasing proportions of females

First, let us find all indices corresponding to records about females.

femaleInds = Flatten@Position[trainingData[[All, 3]], "female"];

The following code standardizes the training data corresponding to females, finds the mean record, computes distances from the mean record, and finally orders the female records indices according to their distances from the mean record.

t = Transpose@Map[Rescale@*Standardize, N@Transpose@nTrainingData[[femaleInds, 1 ;; 2]]];
m = Mean[t];
ds = Map[EuclideanDistance[#, m] &, t];
femaleInds = femaleInds[[Reverse@Ordering[ds]]];

The following plot shows the distances calculated above.

ListPlot[Sort@ds, PlotRange -> All, PlotTheme -> "Detailed"] The following code removes from the training data the records corresponding to females according to the order computed above. The female records farthest from the mean female record are removed first.

AbsoluteTiming[
femaleFrRes = Association@
Table[cl ->
Table[(
inds = Complement[Range[Length[trainingData]], Take[femaleInds, Ceiling[fr*Length[femaleInds]]]];
cf = Classify[trainingData[[inds, 1 ;; -2]] -> trainingData[[inds, -1]], Method -> cl]; cfPredictedLabels = cf /@ testData[[All, 1 ;; -2]];
{fr, ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cfPredictedLabels]}),
{fr, 0, 0.8, 0.05}],
{cl, {"NearestNeighbors", "NeuralNetwork", "LogisticRegression", "RandomForest", "SupportVectorMachine", "NaiveBayes"}}];
]

(* {203.001, Null} *)

The following graphics grid shows how the classification results are affected by the removing fractions of the female records from the training data. The results for none or small fractions of records removed are more blue.

GraphicsGrid[ArrayReshape[
Table[
femaleAROCs = femaleFrRes[cl][[All, 2]];
frRange = femaleFrRes[cl][[All, 1]]; ROCPlot[frRange, femaleAROCs, PlotRange -> {{0.0, 0.25}, {0.2, 0.8}}, PlotLabel -> cl, "ROCPointColorFunction" -> (Blend[{Blue, Red}, #3/Length[frRange]] &), ImageSize -> 300],
{cl, Keys[femaleFrRes]}],
{2, 3}], Dividers -> All] We can see that removing the female records outliers has dramatic effect on the results by the classifiers "NearestNeighbors" and "NeuralNetwork". Not so much on "LogisticRegression" and "NaiveBayes".

## Decreasing proportions of males

The code in this sub-section repeats the experiment described in the previous one males (instead of females).

maleInds = Flatten@Position[trainingData[[All, 3]], "male"];

t = Transpose@Map[Rescale@*Standardize, N@Transpose@nTrainingData[[maleInds, 1 ;; 2]]];
m = Mean[t];
ds = Map[EuclideanDistance[#, m] &, t];
maleInds = maleInds[[Reverse@Ordering[ds]]];

ListPlot[Sort@ds, PlotRange -> All, PlotTheme -> "Detailed"] AbsoluteTiming[
maleFrRes = Association@
Table[cl ->
Table[(
inds = Complement[Range[Length[trainingData]], Take[maleInds, Ceiling[fr*Length[maleInds]]]];
cf = Classify[trainingData[[inds, 1 ;; -2]] -> trainingData[[inds, -1]], Method -> cl]; cfPredictedLabels = cf /@ testData[[All, 1 ;; -2]];
{fr, ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cfPredictedLabels]}),
{fr, 0, 0.8, 0.05}],
{cl, {"NearestNeighbors", "NeuralNetwork", "LogisticRegression", "RandomForest", "SupportVectorMachine", "NaiveBayes"}}];
]

(* {179.219, Null} *)

GraphicsGrid[ArrayReshape[
Table[
maleAROCs = maleFrRes[cl][[All, 2]];
frRange = maleFrRes[cl][[All, 1]]; ROCPlot[frRange, maleAROCs, PlotRange -> {{0.0, 0.35}, {0.55, 0.82}}, PlotLabel -> cl, "ROCPointColorFunction" -> (Blend[{Blue, Red}, #3/Length[frRange]] &), ImageSize -> 300],
{cl, Keys[maleFrRes]}],
{2, 3}], Dividers -> All] # Classifier interpolation

Assume that we want a classifier that for a given representative set of items (records) assigns the positive label to an exactly of them. (Or very close to that number.)

If we have two classifiers, one returning more positive items than , the other less than , then we can use geometric computations in the ROC space in order to obtain parameters for a classifier interpolation that will bring positive items close to ; see . Below is given Mathematica code with explanations of how that classifier interpolation is done.

Assume that by prior observations we know that for a given dataset of items the positive class consists of items. Assume that for a given unknown dataset of items we want of the items to be classified as positive. We can write the equation: which can be simplified to ## The two classifiers

Consider the following two classifiers.

cf1 = Classify[trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]], Method -> "RandomForest"];
cfROC1 = ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cf1[testData[[All, 1 ;; -2]]]]
(* <|"TruePositive" -> 82, "FalsePositive" -> 22, "TrueNegative" -> 170, "FalseNegative" -> 40|> *)

cf2 = Classify[trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]], Method -> "LogisticRegression"];
cfROC2 = ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cf2[testData[[All, 1 ;; -2]]]]
(* <|"TruePositive" -> 89, "FalsePositive" -> 37, "TrueNegative" -> 155, "FalseNegative" -> 33|> *)

## Geometric computations in the ROC space

Here are the ROC space points corresponding to the two classifiers, cf1 and cf2:

p1 = Through[ROCFunctions[{"FPR", "TPR"}][cfROC1]];
p2 = Through[ROCFunctions[{"FPR", "TPR"}][cfROC2]];

Here is the breakdown of frequencies of the class labels:

Tally[trainingData[[All, -1]]]
%[[All, 2]]/Length[trainingData] // N

(* {{"survived", 305}, {"died", 427}}
{0.416667, 0.583333}) *)

We want to our classifier to produce % people to survive. Here we find two points of the corresponding constraint line (on which we ROC points of the desired classifiers should reside):

sol1 = Solve[{{x, y} \[Element] ImplicitRegion[{x (1 - 0.42) + y 0.42 == 0.38}, {x, y}], x == 0.1}, {x, y}][]
sol2 = Solve[{{x, y} \[Element] ImplicitRegion[{x (1 - 0.42) + y 0.42 == 0.38}, {x, y}], x == 0.25}, {x, y}][]

(* {x -> 0.1, y -> 0.766667}
{x -> 0.25, y -> 0.559524} *)

Here using the points q1 and q2 of the constraint line we find the intersection point with the line connecting the ROC points of the classifiers:

{q1, q2} = {{x, y} /. sol1, {x, y} /. sol2};
sol = Solve[ {{x, y} \[Element] InfiniteLine[{q1, q2}] \[And] {x, y} \[Element] InfiniteLine[{p1, p2}]}, {x, y}];
q = {x, y} /. sol[]

(* {0.149753, 0.69796} *)

Let us plot all geometric objects:

Graphics[{PointSize[0.015], Blue, Tooltip[Point[p1], "cf1"], Black,
Text["cf1", p1, {-1.5, 1}], Red, Tooltip[Point[p2], "cf2"], Black,
Text["cf2", p2, {1.5, -1}], Black, Point[q], Dashed,
InfiniteLine[{q1, q2}], Thin, InfiniteLine[{p1, p2}]},
PlotRange -> {{0., 0.3}, {0.6, 0.8}},
GridLines -> Automatic, Frame -> True] Classifier-Interpolation-geometric-objects

## Classifier interpolation

Next we find the ratio of the distance from the intersection point q to the cf1 ROC point and the distance between the ROC points of cf1 and cf2.

k = Norm[p1 - q]/Norm[p1 - p2]
(* 0.450169 *)

The classifier interpolation is made by a weighted random selection based on that ratio (using RandomChoice):

SeedRandom
cres = MapThread[If, {RandomChoice[{1 - k, k} -> {True, False}, Length[testData]], cf1@testData[[All, 1 ;; -2]], cf2@testData[[All, 1 ;; -2]]}];
cfROC3 = ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres];
p3 = Through[ROCFunctions[{"FPR", "TPR"}][cfROC3]];
Graphics[{PointSize[0.015], Blue, Point[p1], Red, Point[p2], Black, Dashed, InfiniteLine[{q1, q2}], Green, Point[p3]},
PlotRange -> {{0., 0.3}, {0.6, 0.8}},
GridLines -> Automatic, Frame -> True] Classifier-Interpolation-single-results

We can run the process multiple times in order to convince ourselves that the interpolated classifier ROC point is very close to the constraint line most of the time.

p3s =
Table[(
cres =
MapThread[If, {RandomChoice[{1 - k, k} -> {True, False}, Length[testData]], cf1@testData[[All, 1 ;; -2]], cf2@testData[[All, 1 ;; -2]]}];
cfROC3 = ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres];
Through[ROCFunctions[{"FPR", "TPR"}][cfROC3]]), {1000}];

Show[{SmoothDensityHistogram[p3s, ColorFunction -> (Blend[{White, Green}, #] &), Mesh -> 3],
Graphics[{PointSize[0.015], Blue, Tooltip[Point[p1], "cf1"], Black, Text["cf1", p1, {-1.5, 1}],
Red, Tooltip[Point[p2], "cf2"], Black, Text["cf2", p2, {1.5, -1}],
Black, Dashed, InfiniteLine[{q1, q2}]}, GridLines -> Automatic]},
PlotRange -> {{0., 0.3}, {0.6, 0.8}},
GridLines -> Automatic, Axes -> True,
AspectRatio -> Automatic] Classifier-Interpolation-1000-results

 Leo Breiman, Statistical Modeling: The Two Cultures, (2001), Statistical Science, Vol. 16, No. 3, 199[Dash]231.

 Tom Fawcett, An introduction to ROC analysis, (2006), Pattern Recognition Letters, 27, 861[Dash]874. (Link to PDF.)

 Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m.

 Anton Antonov, Receiver operating characteristic functions Mathematica package, (2016), source code MathematicaForPrediction at GitHub, package ROCFunctions.m.

 Anton Antonov, Classifier ensembles functions Mathematica package, (2016), source code MathematicaForPrediction at GitHub, package ClassifierEnsembles.m.

 Anton Antonov, "Importance of variables investigation guide", (2016), MathematicaForPrediction at GitHub, folder Documentation.

# Basic example of using ROC with Linear regression

## Introduction

The package  provides Mathematica implementations of Receiver Operating Characteristic (ROC) functions calculation and plotting. The ROC framework is used for analysis and tuning of binary classifiers, . (The classifiers are assumed to classify into a positive/true label or a negative/false label. )

The function ROCFuntions gives access to the individual ROC functions through string arguments. Those ROC functions are applied to special objects, called ROC Association objects.

Each ROC Association object is an Association that has the following four keys: “TruePositive”, “FalsePositive”, “TrueNegative”, and “FalseNegative” .

Given two lists of actual and predicted labels a ROC Association object can be made with the function ToROCAssociation .

For more definitions and example of ROC terminology and functions see .

## Minimal example

Note that here although we use both of the provided Titanic training and test data, the code is doing only training. The test data is used to find the best tuning parameter (threshold) through ROC analysis.

### Get packages

These commands load the packages [1,2]:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MathematicaForPredictionUtilities.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/ROCFunctions.m"]

### Using Titanic data

Here is the summary of the Titanic data used below:

titanicData = (Flatten@*List) @@@ExampleData[{"MachineLearning", "Titanic"}, "Data"];
columnNames = (Flatten@*List) @@ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"];
RecordsSummary[titanicData, columnNames] This variable dependence grid shows the relationships between the variables.

Magnify[#, 0.7] &@VariableDependenceGrid[titanicData, columnNames] ### Get training and testing data

data = ExampleData[{"MachineLearning", "Titanic"}, "TrainingData"];
data = ((Flatten@*List) @@@ data)[[All, {1, 2, 3, -1}]];
trainingData = DeleteCases[data, {___, _Missing, ___}];
Dimensions[trainingData]

(* {732, 4} *)

data = ExampleData[{"MachineLearning", "Titanic"}, "TestData"];
data = ((Flatten@*List) @@@ data)[[All, {1, 2, 3, -1}]];
testData = DeleteCases[data, {___, _Missing, ___}];
Dimensions[testData]

(* {314, 4} *)

### Replace categorical with numerical values

trainingData = trainingData /. {"survived" -> 1, "died" -> 0, "1st" -> 0, "2nd" -> 1, "3rd" -> 2, "male" -> 0, "female" -> 1};

testData = testData /. {"survived" -> 1, "died" -> 0, "1st" -> 1, "2nd" -> 2, "3rd" -> 3, "male" -> 0, "female" -> 1};

### Do linear regression

lfm = LinearModelFit[{trainingData[[All, 1 ;; -2]], trainingData[[All, -1]]}] ### Get the predicted values

modelValues = lfm @@@ testData[[All, 1 ;; -2]];

Histogram[modelValues, 20] RecordsSummary[modelValues] ### Obtain ROC associations over a set of parameter values

testLabels = testData[[All, -1]];

thRange = Range[0.1, 0.9, 0.025];
aROCs = Table[ToROCAssociation[{1, 0}, testLabels, Map[If[# > \[Theta], 1, 0] &, modelValues]], {\[Theta], thRange}];

### Evaluate ROC functions for given ROC association

N @ Through[ROCFunctions[{"PPV", "NPV", "TPR", "ACC", "SPC", "MCC"}][aROCs[]]]

(* {0.513514, 0.790698, 0.778689, 0.627389, 0.53125, 0.319886} *)

### Standard ROC plot

ROCPlot[thRange, aROCs, "PlotJoined" -> Automatic, "ROCPointCallouts" -> True, "ROCPointTooltips" -> True, GridLines -> Automatic] ### Plot ROC functions wrt to parameter values

rocFuncs = {"PPV", "NPV", "TPR", "ACC", "SPC", "MCC"};
rocFuncTips = Map[# <> ", " <> (ROCFunctions["FunctionInterpretations"][#]) &, rocFuncs];
ListLinePlot[
MapThread[Tooltip[Transpose[{thRange, #1}], #2] &, {Transpose[Map[Through[ROCFunctions[rocFuncs][#]] &, aROCs]], rocFuncTips}],
Frame -> True,
FrameLabel -> Map[Style[#, Larger] &, {"threshold, \[Theta]", "rate"}],
PlotLegends -> rocFuncTips, GridLines -> Automatic] ### Finding the intersection point of PPV and TPR

We want to find a point that provides balanced positive and negative labels success rates. One way to do this is to find the intersection point of the ROC functions PPV (positive predictive value) and TPR (true positive rate).

Examining the plot above we can come up with the initial condition for x.

ppvFunc = Interpolation[Transpose@{thRange, ROCFunctions["PPV"] /@ aROCs}];
tprFunc = Interpolation[Transpose@{thRange, ROCFunctions["TPR"] /@ aROCs}];
FindRoot[ppvFunc[x] - tprFunc[x] == 0, {x, 0.2}]

(* {x -> 0.3} *)

### Area under the ROC curve

The Area Under the ROC curve (AUROC) tells for a given range of the controlling parameter “what is the probability of the classifier to rank a randomly chosen positive instance higher than a randomly chosen negative instance, (assuming ‘positive’ ranks higher than ‘negative’)”, [3,4]

Calculating AUROC is easy using the Trapezoidal quadrature formula:

 N@Total[Partition[Sort@Transpose[{ROCFunctions["FPR"] /@ aROCs, ROCFunctions["TPR"] /@ aROCs}], 2, 1]
/. {{x1_, y1_}, {x2_, y2_}} :> (x2 - x1) (y1 + (y2 - y1)/2)]

(* 0.474513 *)

It is also implemented in :

N@ROCFunctions["AUROC"][aROCs]

(* 0.474513 *)

## References

 Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m.

 Anton Antonov, Receiver operating characteristic functions Mathematica package, (2016), source code MathematicaForPrediction at GitHub, package ROCFunctions.m .

 Tom Fawcett, An introduction to ROC analysis, (2006), Pattern Recognition Letters, 27, 861-874.

# Contingency tables creation examples

## Introduction

In statistics contingency tables are matrices used to show the co-occurrence of variable values of multi-dimensional data. They are fundamental in many types of research. This document shows how to use several functions implemented in Mathematica for the construction of contingency tables.

### Code

In this document we are going to use the implementations in the package MathematicaForPredictionUtilities.m from MathematicaForPrediction at GitHub, .

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MathematicaForPredictionUtilities.m"]

The implementation of CrossTabulate in  is based on Tally, GatherBy, and SparseArray. The implementation of xtabsViaRLink in  is based on R‘s function xtabs called via RLink.

Other package used in this document are  and  imported with these commands:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MosaicPlot.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/RSparseMatrix.m"]

For a different approach to implementing cross-tabulation than those taken in  see the Stack Overflow answer http://stackoverflow.com/a/8101951 by Mr.Wizard.

## Using Titanic data

### Getting data

titanicData =
Flatten@*List @@@ ExampleData[{"MachineLearning", "Titanic"}, "Data"];
titanicData = DeleteCases[titanicData, {___, _Missing, ___}];

titanicColumnNames =
Flatten@*List @@ ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"];
aTitanicColumnNames =
AssociationThread[titanicColumnNames -> Range[Length[titanicColumnNames]]];

Note that we have removed the records with missing data (for simpler exposition).

### Data summary

Dimensions[titanicData]
(* {1046, 4} *)

RecordsSummary[titanicData, titanicColumnNames] ### Using CrossTabulate

Assume that we want to group the people according to their passenger class and survival and we want to find the average age for each group.

We can do that by

1. finding the counts contingency table for the variables "passenger class" and "passenger survival",

2. finding the age contingency table for the same variables, and

3. do the element-wise division .

ctCounts =
CrossTabulate[
titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctCounts ctTotalAge =
CrossTabulate[
titanicData[[All,
aTitanicColumnNames /@ {"passenger class", "passenger survival",
"passenger age"}]]];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctTotalAge MatrixForm[
ctTotalAge[]/
Normal[ctCounts[]],
TableHeadings -> Values[Rest[ctTotalAge]]] (We have to make the sparse array ctCounts a regular array because otherwise we get warnings for division by zero because of the sparse array’s default value.)

Let us repeat the steps above by dividing the passengers before-hand according to their sex.

Association@
Map[
(mCount =
CrossTabulate[#[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]];
mAge = CrossTabulate[#[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival", "passenger age"}]]];
#[[1,  aTitanicColumnNames["passenger sex"]]] ->
GatherBy[titanicData, #[[aTitanicColumnNames["passenger sex"]]] &]] The alternative of CrossTabulate is xtabsViaRLink that is uses R’s function xtabs via RLink.

Needs["RLink"]
InstallR[]

ctCounts =
FromRXTabsForm@
titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]],
{"passenger.sex", "passenger.survival"},
" ~ passenger.sex + passenger.survival"];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctCounts ### Relation to mosaic plots

The graphical visualization of a dataset with mosaic plots, [2,3], is similar in spirit to contingency tables. Compare the following mosaic plot with the contingency table in the last section.

MosaicPlot[
titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]] ] ## Straightforward calling of MatrixForm

At this point we might want to be able to call MatrixForm more directly for the output of CrossTabulate and FromRXTabsForm. One way to do this is to define an up-value for Association .

Unprotect[Association];
MatrixForm[x_Association /; (KeyExistsQ[x, "XTABMatrix"] || KeyExistsQ[x, "XTABTensor"])] ^:= (MatrixForm[#1, TableHeadings -> Rest[{##}]] & @@ x);
Protect[Association];

Now we can do this:

MatrixForm @
CrossTabulate[titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]] Remark: Because of this up-value definition for Association with MatrixForm we have the associations returned by CrossTabulate and FromRXTabsForm to have the key "XTABMatrix" instead of "Matrix", the former assumed to be much more rarely to be used than the latter.

## Using larger data

Let us consider an example with larger data that has larger number of unique values in its columns.

### Getting online retail invoices data

The following dataset is taken from .

data = Import[ "/Volumes/WhiteSlimSeagate/Datasets/UCI Online Retail Data Set/Online Retail.csv"];
columnNames = First[data];
data = Rest[data];

aColumnNames = AssociationThread[columnNames -> Range[Length[columnNames]]];

### Data summary

We have rows and columns:

Dimensions[data]
(* {65499, 8} *)

Here is a summary of the columns:

Magnify[#, 0.75] &@RecordsSummary[data, columnNames] ### Contingency tables

#### Country vs. StockCode

There is no one-to-one correspondence between the values of the column "Description" and the column "StockCode" which can be seen with this command:

MinMax@Map[Length@*Union,
GatherBy[data[[All, aColumnNames /@ {"Description", "StockCode"}]], First]]
(* {1,144} *)

The way in which the column "StockCode" was ingested made it have multiple types for its values:

Tally[NumberQ /@ data[[All, aColumnNames["StockCode"]]]]
(* {{False,9009},{True,56490}} *)

So let us convert it to all strings:

data[[All, aColumnNames["StockCode"]]] =
ToString /@ data[[All, aColumnNames["StockCode"]]];

Here we find the contingency table for "Country" and "StockCode" over "Quantity" using CrossTabulate:

AbsoluteTiming[
ctRes = CrossTabulate[
data[[All, aColumnNames /@ {"Country", "StockCode", "Quantity"}]]];
]
(* {0.256339,Null} *)

Here we find the contingency table for "Country" and "StockCode" over "Quantity" using xtabsViaRLink:

AbsoluteTiming[
data[[All, aColumnNames /@ {"Country", "StockCode", "Quantity"}]],
{"Country", "StockCode", "Quantity"},
"Quantity ~ Country + StockCode"];
ctRRes = FromRXTabsForm[rres];
]
(* {0.843621,Null} *)

Both functions produce the same result:

ctRRes["matrix"] == N@ctRRes[]
(* True *)

Note that xtabsViaRLink is slower but still fairly quick.

Here we plot the contingency table using MatrixPlot :

MatrixPlot[ctRRes["matrix"], AspectRatio -> 1/1.5,
FrameTicks -> {{#, #} &@ Table[{i, ctRRes["rownames"][[i]]}, {i, Length[ctRRes["rownames"]]}],
{Automatic, Automatic}}, ImageSize -> 550] #### Country vs. Quarter

Let us extend the data with columns that have months and quarters corresponding to the invoice dates.

The following commands computing date objects and extracting month and quarter values from them are too slow.

(*AbsoluteTiming[dobjs=DateObject[{#,{"Month","/","Day","/","Year"," \
","Hour",":","Minute"}}]&/@data[[All,aColumnNames["InvoiceDate"]]];
]*)
(* {30.2595,Null} *)

(*AbsoluteTiming[
dvals=DateValue[dobjs,{"MonthName","QuarterNameShort"}];
]*)
(* {91.1732,Null} *)

dvals = StringSplit[#, {"/", " ", ":"}] & /@
data[[All,
aColumnNames["InvoiceDate"]]];

This summary shows that the second value in the dates is day of month, and the first value is most likely month.

Magnify[#, 0.75] &@ RecordsSummary[dvals[[All, 1 ;; 3]], "MaxTallies" -> 16] These commands extend the data and the corresponding column-name-to-index association.

ms = DateValue[Table[DateObject[{2016, i, 1}], {i, 12}], "MonthName"];
dvals = Map[{ms[[#]], "Q" <> ToString[Quotient[#, 4] + 1]} &, ToExpression @ dvals[[All, 1]]];
dataM = MapThread[Join[#1, #2] &, {data, dvals}];
aColumnNamesM =
Join[aColumnNames, <|"MonthName" -> (Length[aColumnNames] + 1), "QuarterNameShort" -> (Length[aColumnNames] + 2)|>];
(* {0.054877,Null} *)

Here is the contingency table for "Country" vs "QuarterNameShort" over "Quantity".

ctRes = CrossTabulate[ dataM[[All, aColumnNamesM /@ {"Country", "QuarterNameShort", "Quantity"}]]];
Magnify[#, 0.75] &@ MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctRes ## Uniform tables

Often when making contingency tables over subsets of the data we obtain contingency tables with different rows and columns. For various reasons (programming, esthetics, comprehension) it is better to have the tables with the same rows and columns.

Here is an example of non-uniform contingency tables derived from the online retail data of the previous section. We split the data over the countries and find contingency tables of "MonthName" vs "QuarterNameShort" over "Quantity".

tbs = Association @
Map[
(xtab = CrossTabulate[#[[All, aColumnNamesM /@ {"MonthName", "QuarterNameShort", "Quantity"}]]];
#[[1, aColumnNamesM["Country"]]] -> xtab) &,
GatherBy[ dataM, #[[aColumnNamesM[ "Country"]]] &]];

Magnify[#, 0.75] &@
Map[MatrixForm[#["Matrix"], TableHeadings -> (# /@ {"RowNames", "ColumnNames"})] &, tbs](*[[{1,12,-1}]]*) Using the object RSparseMatrix, see [4,5], we can impose row and column names on each table.

First we convert the contingency tables into RSparseMatrix objects:

tbs2 = Map[ ToRSparseMatrix[#["Matrix"], "RowNames" -> #["RowNames"], "ColumnNames" -> #["ColumnNames"]] &, tbs];

And then we impose the desired row and column names:

tbs2 = Map[ ImposeColumnNames[ ImposeRowNames[#, {"January", "December"}], {"Q1", "Q2", "Q3", "Q4"}] &, tbs2];
Magnify[#, 0.75] &@(MatrixForm /@ tbs2) ## Generalization : CrossTensorate

A generalization of CrossTabulate is the function CrossTensorate implemented in  that takes a "formula" argument similar to R’s xtabs.

This finds number of people of different sub-groups of Titanic data:

ctRes = CrossTensorate[Count == "passenger survival" + "passenger sex" + "passenger class", titanicData, aTitanicColumnNames];
MatrixForm[ctRes] We can verify the result using Count:

Count[titanicData, {"1st", _, "female", "died"}]
(* 5 *)

Count[titanicData, {"2nd", _, "male", "survived"}]
(* 23 *)

To split the cross-tensor across its first variable we can use this command:

sctRes = Association@
MapThread[Rule[#1, Join[<|"XTABTensor" -> #2|>, Rest@Rest@ctRes]] &, {ctRes[], # & /@ ctRes["XTABTensor"]}];
MatrixForm /@ sctRes Or we can call the more general function CrossTensorateSplit implemented in :

Map[MatrixForm /@ CrossTensorateSplit[ctRes, #] &, Rest@Keys[ctRes]] CrossTensorateSplit can also be called with one argument that is a variable name.This will produce a splitting function. For example, the above command can be re-written as :

Map[MatrixForm /@ CrossTensorateSplit[#] @ ctRes &, Rest@Keys[ctRes]]

## References

 Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m.

 Anton Antonov, Mosaic plot for data visualization implementation in Mathematica, (2014), MathematicaForPrediction at GitHub, package MosaicPlot.m.

 Daqing Chen, Online Retail Data Set, (2015), UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Online+Retail .