# A monad for classification workflows

## Introduction

In this document we describe the design and implementation of a (software programming) monad for classification workflows specification and execution. The design and implementation are done with Mathematica / Wolfram Language (WL).

The goal of the monad design is to make the specification of classification workflows (relatively) easy, straightforward, by following a certain main scenario and specifying variations over that scenario.

The monad is named ClCon and it is based on the State monad package "StateMonadCodeGenerator.m", [AAp1, AA1], the classifier ensembles package "ClassifierEnsembles.m", [AAp4, AA2], and the package for Receiver Operating Characteristic (ROC) functions calculation and plotting "ROCFunctions.m", [AAp5, AA2, Wk2].

The data for this document is read from WL’s repository using the package "GetMachineLearningDataset.m", [AAp10].

The monadic programming design is used as a Software Design Pattern. The ClCon monad can be also seen as a Domain Specific Language (DSL) for the specification and programming of machine learning classification workflows.

Here is an example of using the ClCon monad over the Titanic data:

The table above is produced with the package "MonadicTracing.m", [AAp2, AA1], and some of the explanations below also utilize that package.

As it was mentioned above the monad ClCon can be seen as a DSL. Because of this the monad pipelines made with ClCon are sometimes called "specifications".

### Contents description

The document has the following structure.

• The sections "Package load" and "Data load" obtain the needed code and data.
(Needed and put upfront from the "Reproducible research" point of view.)

• The sections "Design consideration" and "Monad design" provide motivation and design decisions rationale.

• The sections "ClCon overview" and "Monad elements" provide technical description of the ClCon monad needed to utilize it.
(Using a fair amount of examples.)

• The section "Example use cases" gives several more elaborated examples of ClCon that have "real life" flavor.
(But still didactic and concise enough.)

• The section "Unit test" describes the tests used in the development of the ClCon monad.
(The random pipelines unit tests are especially interesting.)

• The section "Future plans" outlines future directions of development.
(The most interesting and important one is the "conversational agent" direction.)

• The section "Implementation notes" has (i) a diagram outlining the ClCon development process, and (ii) a list of observations and morals.
(Some fairly obvious, but deemed fairly significant and hence stated explicitly.)

Remark: One can read only the sections "Introduction", "Design consideration", "Monad design", and "ClCon overview". That set of sections provide a fairly good, programming language agnostic exposition of the substance and novel ideas of this document.

The following commands load the packages [AAp1–AAp10, AAp12]:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicContextualClassification.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaVsR/master/Projects/ProgressiveMachineLearning/Mathematica/GetMachineLearningDataset.m"]

(*
Importing from GitHub: MathematicaForPredictionUtilities.m
Importing from GitHub: MosaicPlot.m
Importing from GitHub: CrossTabulate.m
Importing from GitHub: ClassifierEnsembles.m
Importing from GitHub: ROCFunctions.m
Importing from GitHub: VariableImportanceByClassifiers.m
Importing from GitHub: SSparseMatrix.m
Importing from GitHub: OutlierIdentifiers.m
*)

In this section we load data that is used in the rest of the document. The "quick" data is created in order to specify quick, illustrative computations.

Remark: In all datasets the classification labels are in the last column.

The summarization of the data is done through ClCon, which in turn uses the function RecordsSummary from the package "MathematicaForPredictionUtilities.m", [AAp7].

### WL resources data

The following commands produce datasets using the package [AAp10] (that utilizes ExampleData):

dsTitanic = GetMachineLearningDataset["Titanic"];
dsMushroom = GetMachineLearningDataset["Mushroom"];
dsWineQuality = GetMachineLearningDataset["WineQuality"];

Here is are the dimensions of the datasets:

Dataset[Dataset[Map[Prepend[Dimensions[ToExpression[#]], #] &, {"dsTitanic", "dsMushroom", "dsWineQuality"}]][All, AssociationThread[{"name", "rows", "columns"}, #] &]]

Here is the summary of dsTitanic:

ClConUnit[dsTitanic]⟹ClConSummarizeData["MaxTallies" -> 12];

Here is the summary of dsMushroom in long form:

ClConUnit[dsMushroom]⟹ClConSummarizeDataLongForm["MaxTallies" -> 12];

Here is the summary of dsWineQuality in long form:

ClConUnit[dsWineQuality]⟹ClConSummarizeDataLongForm["MaxTallies" -> 12];

### "Quick" data

In this subsection we make up some data that is used for illustrative purposes.

SeedRandom[212]
dsData = RandomInteger[{0, 1000}, {100}];
dsData = Dataset[
Transpose[{dsData, Mod[dsData, 3], Last@*IntegerDigits /@ dsData, ToString[Mod[#, 3]] & /@ dsData}]];
dsData = Dataset[dsData[All, AssociationThread[{"number", "feature1", "feature2", "label"}, #] &]];
Dimensions[dsData]

(* {100, 4} *)

Here is a sample of the data:

RandomSample[dsData, 6]

Here is a summary of the data:

ClConUnit[dsData]⟹ClConSummarizeData;

Here we convert the data into a list of record-label rules (and show the summary):

mlrData = ClConToNormalClassifierData[dsData];
ClConUnit[mlrData]⟹ClConSummarizeData;

Finally, we make the array version of the dataset:

arrData = Normal[dsData[All, Values]];

## Design considerations

The steps of the main classification workflow addressed in this document follow.

1. Retrieving data from a data repository.

2. Optionally, transform the data.

3. Split data into training and test parts.

• Optionally, split training data into training and validation parts.
4. Make a classifier with the training data.

5. Test the classifier over the test data.

• Computation of different measures including ROC.

The following diagram shows the steps.

Very often the workflow above is too simple in real situations. Often when making "real world" classifiers we have to experiment with different transformations, different classifier algorithms, and parameters for both transformations and classifiers. Examine the following mind-map that outlines the activities in making competition classifiers.

In view of the mind-map above we can come up with the following flow-chart that is an elaboration on the main, simple workflow flow-chart.

• the introduction of new elements in classification workflows,

• workflows elements variability, and

• workflows iterative changes and refining,

it is beneficial to have a DSL for classification workflows. We choose to make such a DSL through a functional programming monad, [Wk1, AA1].

Here is a quote from [Wk1] that fairly well describes why we choose to make a classification workflow monad and hints on the desired properties of such a monad.

[…] The monad represents computations with a sequential structure: a monad defines what it means to chain operations together. This enables the programmer to build pipelines that process data in a series of steps (i.e. a series of actions applied to the data), in which each action is decorated with the additional processing rules provided by the monad. […]

Monads allow a programming style where programs are written by putting together highly composable parts, combining in flexible ways the possible actions that can work on a particular type of data. […]

Remark: Note that quote from [Wk1] refers to chained monadic operations as "pipelines". We use the terms "monad pipeline" and "pipeline" below.

The monad we consider is designed to speed-up the programming of classification workflows outlined in the previous section. The monad is named ClCon for "Classification with Context".

We want to be able to construct monad pipelines of the general form:

ClCon is based on the State monad, [Wk1, AA1], so the monad pipeline form (1) has the following more specific form:

This means that some monad operations will not just change the pipeline value but they will also change the pipeline context.

In the monad pipelines of ClCon we store different objects in the contexts for at least one of the following two reasons.

1. The object will be needed later on in the pipeline.

2. The object is hard to compute.

Such objects are training data, ROC data, and classifiers.

Let us list the desired properties of the monad.

• Rapid specification of non-trivial classification workflows.

• The monad works with different data types: Dataset, lists of machine learning rules, full arrays.

• The pipeline values can be of different types. Most monad functions modify the pipeline value; some modify the context; some just echo results.

• The monad works with single classifier objects and with classifier ensembles.

• This means support of different classifier measures and ROC plots for both single classifiers and classifier ensembles.
• The monad allows of cursory examination and summarization of the data.
• For insight and in order to verify assumptions.
• The monad has operations to compute importance of variables.

• We can easily obtain the pipeline value, context, and different context objects for manipulation outside of the monad.

• We can calculate classification measures using a specified ROC parameter and a class label.

• We can easily plot different combinations of ROC functions.

The ClCon components and their interaction are given in the following diagram. (The components correspond to the main workflow given in the previous section.)

In the diagram above the operations are given in rectangles. Data objects are given in round corner rectangles and classifier objects are given in round corner squares.

The main ClCon operations implicitly put in the context or utilize from the context the following objects:

• training data,

• test data,

• validation data,

• classifier (a classifier function or an association of classifier functions),

• ROC data,

• variable names list.

Note the that the monadic set of types of ClCon pipeline values is fairly heterogenous and certain awareness of "the current pipeline value" is assumed when composing ClCon pipelines.

Obviously, we can put in the context any object through the generic operations of the State monad of the package "StateMonadGenerator.m", [AAp1].

## ClCon overview

When using a monad we lift certain data into the "monad space", using monad’s operations we navigate computations in that space, and at some point we take results from it.

With the approach taken in this document the "lifting" into the ClCon monad is done with the function ClConUnit. Results from the monad can be obtained with the functions ClConTakeValue, ClConContext, or with the other ClCon functions with the prefix "ClConTake" (see below.)

Here is a corresponding diagram of a generic computation with the ClCon monad:

Remark: It is a good idea to compare the diagram with formulas (1) and (2).

Let us examine a concrete ClCon pipeline that corresponds to the diagram above. In the following table each pipeline operation is combined together with a short explanation and the context keys after its execution.

Here is the output of the pipeline:

In the specified pipeline computation the last column of the dataset is assumed to be the one with the class labels.

The ClCon functions are separated into four groups:

• operations,

• setters,

• takers,

An overview of the those functions is given in the tables in next two sub-sections. The next section, "Monad elements", gives details and examples for the usage of the ClCon operations.

### Monad functions interaction with the pipeline value and context

The following table gives an overview the interaction of the ClCon monad functions with the pipeline value and context.

Several functions that use ROC data have two rows in the table because they calculate the needed ROC data if it is not available in the monad context.

Here are the ClCon State Monad functions (generated using the prefix "ClCon", [AAp1, AA1]):

In this section we show that ClCon has all of the properties listed in the previous section.

The monad head is ClCon. Anything wrapped in ClCon can serve as monad’s pipeline value. It is better though to use the constructor ClConUnit. (Which adheres to the definition in [Wk1].)

ClCon[{{1, "a"}, {2, "b"}}, <||>]⟹ClConSummarizeData;

### Lifting data to the monad

The function lifting the data into the monad ClCon is ClConUnit.

The lifting to the monad marks the beginning of the monadic pipeline. It can be done with data or without data. Examples follow.

ClConUnit[dsData]⟹ClConSummarizeData;
ClConUnit[]⟹ClConSetTrainingData[dsData]⟹ClConSummarizeData;

(See the sub-section "Setters and takers" for more details of setting and taking values in ClCon contexts.)

Currently the monad can deal with data in the following forms:

• datasets,

• matrices,

• lists of example->label rules.

The ClCon monad also has the non-monadic function ClConToNormalClassifierData which can be used to convert datasets and matrices to lists of example->label rules. Here is an example:

Short[ClConToNormalClassifierData[dsData], 3]

(*
{{639, 0, 9} -> "0", {121, 1, 1} -> "1", {309, 0, 9} ->  "0", {648, 0, 8} -> "0", {995, 2, 5} -> "2", {127, 1, 7} -> "1", {908, 2, 8} -> "2", {564, 0, 4} -> "0", {380, 2, 0} -> "2", {860, 2, 0} -> "2",
<<80>>,
{464, 2, 4} -> "2", {449, 2, 9} -> "2", {522, 0, 2} -> "0", {288, 0, 8} -> "0", {51, 0, 1} -> "0", {108, 0, 8} -> "0", {76, 1, 6} -> "1", {706, 1, 6} -> "1", {765, 0, 5} -> "0", {195, 0, 5} -> "0"}
*)

When the data lifted to the monad is a dataset or a matrix it is assumed that the last column has the class labels. WL makes it easy to rearrange columns in such a way the any column of dataset or a matrix to be the last.

### Data splitting

The splitting is made with ClConSplitData, which takes up to two arguments and options. The first argument specifies the fraction of training data. The second argument — if given — specifies the fraction of the validation part of the training data. If the value of option Method is "LabelsProportional", then the splitting is done in correspondence of the class labels tallies. ("LabelsProportional" is the default value.) Data splitting demonstration examples follow.

Here are the dimensions of the dataset dsData:

Dimensions[dsData]

(* {100, 4} *)

Here we split the data into 70% for training and 30% for testing and then we verify that the corresponding number of rows add to the number of rows of dsData:

val = ClConUnit[dsData]⟹ClConSplitData[0.7]⟹ClConTakeValue;
Map[Dimensions, val]
Total[First /@ %]

(*
<|"trainingData" -> {69, 4}, "testData" -> {31, 4}|>
100
*)

Note that if Method is not "LabelsProportional" we get slightly different results.

val = ClConUnit[dsData]⟹ClConSplitData[0.7, Method -> "Random"]⟹ClConTakeValue;
Map[Dimensions, val]
Total[First /@ %]

(*
<|"trainingData" -> {70, 4}, "testData" -> {30, 4}|>
100
*)

In the following code we split the data into 70% for training and 30% for testing, then the training data is further split into 90% for training and 10% for classifier training validation; then we verify that the number of rows add up.

val = ClConUnit[dsData]⟹ClConSplitData[0.7, 0.1]⟹ClConTakeValue;
Map[Dimensions, val]
Total[First /@ %]

(*
<|"trainingData" -> {61, 4}, "testData" -> {31, 4}, "validationData" -> {8, 4}|>
100
*)

### Classifier training

The monad ClCon supports both single classifiers obtained with Classify and classifier ensembles obtained with Classify and managed with the package "ClassifierEnsembles.m", [AAp4].

#### Single classifier training

With the following pipeline we take the Titanic data, split it into 75/25 % parts, train a Logistic Regression classifier, and finally take that classifier from the monad.

cf =
ClConUnit[dsTitanic]⟹
ClConSplitData[0.75]⟹
ClConMakeClassifier["LogisticRegression"]⟹
ClConTakeClassifier;

Here is information about the obtained classifier:

ClassifierInformation[cf, "TrainingTime"]

(* Quantity[3.84008, "Seconds"] *)

If we want to pass parameters to the classifier training we can use the Method option. Here we train a Random Forest classifier with 400 trees:

cf =
ClConUnit[dsTitanic]⟹
ClConSplitData[0.75]⟹
ClConMakeClassifier[Method -> {"RandomForest", "TreeNumber" -> 400}]⟹
ClConTakeClassifier;

ClassifierInformation[cf, "TreeNumber"]

(* 400 *)

#### Classifier ensemble training

With the following pipeline we take the Titanic data, split it into 75/25 % parts, train a classifier ensemble of three Logistic Regression classifiers and two Nearest Neighbors classifiers using random sampling of 90% of the training data, and finally take that classifier ensemble from the monad.

ensemble =
ClConUnit[dsTitanic]⟹
ClConSplitData[0.75]⟹
ClConMakeClassifier[{{"LogisticRegression", 0.9, 3}, {"NearestNeighbors", 0.9, 2}}]⟹
ClConTakeClassifier;

The classifier ensemble is simply an association with keys that are automatically assigned names and corresponding values that are classifiers.

ensemble

Here are the training times of the classifiers in the obtained ensemble:

ClassifierInformation[#, "TrainingTime"] & /@ ensemble

(*
<|"LogisticRegression[1,0.9]" -> Quantity[3.47836, "Seconds"],
"LogisticRegression[2,0.9]" -> Quantity[3.47681, "Seconds"],
"LogisticRegression[3,0.9]" -> Quantity[3.4808, "Seconds"],
"NearestNeighbors[1,0.9]" -> Quantity[1.82454, "Seconds"],
"NearestNeighbors[2,0.9]" -> Quantity[1.83804, "Seconds"]|>
*)

A more precise specification can be given using associations. The specification

<|"method" -> "LogisticRegression", "sampleFraction" -> 0.9, "numberOfClassifiers" -> 3, "samplingFunction" -> RandomChoice|>

says "make three Logistic Regression classifiers, for each taking 90% of the training data using the function RandomChoice."

Here is a pipeline specification equivalent to the pipeline specification above:

ensemble2 =
ClConUnit[dsTitanic]⟹
ClConSplitData[0.75]⟹
ClConMakeClassifier[{
<|"method" -> "LogisticRegression",
"sampleFraction" -> 0.9,
"numberOfClassifiers" -> 3,
"samplingFunction" -> RandomSample|>,
<|"method" -> "NearestNeighbors",
"sampleFraction" -> 0.9,
"numberOfClassifiers" -> 2,
"samplingFunction" -> RandomSample|>}]⟹
ClConTakeClassifier;

ensemble2

### Classifier testing

Classifier testing is done with the testing data in the context.

Here is a pipeline that takes the Titanic data, splits it, and trains a classifier:

p =
ClConUnit[dsTitanic]⟹
ClConSplitData[0.75]⟹
ClConMakeClassifier["DecisionTree"];

Here is how we compute selected classifier measures:

p⟹
ClConClassifierMeasurements[{"Accuracy", "Precision", "Recall", "FalsePositiveRate"}]⟹
ClConTakeValue

(*
<|"Accuracy" -> 0.792683,
"Precision" -> <|"died" -> 0.802691, "survived" -> 0.771429|>,
"Recall" -> <|"died" -> 0.881773, "survived" -> 0.648|>,
"FalsePositiveRate" -> <|"died" -> 0.352, "survived" -> 0.118227|>|>
*)

(The measures are listed in the function page of ClassifierMeasurements.)

Here we show the confusion matrix plot:

p⟹ClConClassifierMeasurements["ConfusionMatrixPlot"]⟹ClConEchoValue;

Here is how we plot ROC curves by specifying the ROC parameter range and the image size:

p⟹ClConROCPlot["FPR", "TPR", "ROCRange" -> Range[0, 1, 0.1], ImageSize -> 200];

Remark: ClCon uses the package ROCFunctions.m, [AAp5], which implements all functions defined in [Wk2].

Here we plot ROC functions values (y-axis) over the ROC parameter (x-axis):

p⟹ClConROCListLinePlot[{"ACC", "TPR", "FPR", "SPC"}];

Note of the "ClConROC*Plot" functions automatically echo the plots. The plots are also made to be the pipeline value. Using the option specification "Echo"->False the automatic echoing of plots can be suppressed. With the option "ClassLabels" we can focus on specific class labels.

p⟹
ClConROCListLinePlot[{"ACC", "TPR", "FPR", "SPC"}, "Echo" -> False, "ClassLabels" -> "survived", ImageSize -> Medium]⟹
ClConEchoValue;

### Variable importance finding

Using the pipeline constructed above let us find the most decisive variables using systematic random shuffling (as explained in [AA3]):

p⟹
ClConAccuracyByVariableShuffling⟹
ClConTakeValue

(*
<|None -> 0.792683, "id" -> 0.664634, "passengerClass" -> 0.75, "passengerAge" -> 0.777439, "passengerSex" -> 0.612805|>
*)

We deduce that "passengerSex" is the most decisive variable because its corresponding classification success rate is the smallest. (See [AA3] for more details.)

Using the option "ClassLabels" we can focus on specific class labels:

p⟹ClConAccuracyByVariableShuffling["ClassLabels" -> "survived"]⟹ClConTakeValue

(*
<|None -> {0.771429}, "id" -> {0.595506}, "passengerClass" -> {0.731959}, "passengerAge" -> {0.71028}, "passengerSex" -> {0.414414}|>
*)

### Setters and takers

The values from the monad context can be set or obtained with the corresponding "setters" and "takers" functions as summarized in previous section.

For example:

p⟹ClConTakeClassifier

(* ClassifierFunction[__] *)

Short[Normal[p⟹ClConTakeTrainingData]]

(*
{<|"id" -> 858, "passengerClass" -> "3rd", "passengerAge" -> 30, "passengerSex" -> "male", "passengerSurvival" -> "survived"|>, <<979>> }
*)

Short[Normal[p⟹ClConTakeTestData]]

(* {<|"id" -> 285, "passengerClass" -> "1st", "passengerAge" -> 60, "passengerSex" -> "female", "passengerSurvival" -> "survived"|> , <<327>> }
*)

p⟹ClConTakeVariableNames

(* {"id", "passengerClass", "passengerAge", "passengerSex", "passengerSurvival"} *)

If other values are put in the context they can be obtained through the (generic) function ClConTakeContext, [AAp1]:

p = ClConUnit[RandomReal[1, {2, 2}]]⟹ClConAddToContext["data"];

(p⟹ClConTakeContext)["data"]

(* {{0.815836, 0.191562}, {0.396868, 0.284587}} *)

Another generic function from [AAp1] is ClConTakeValue (used many times above.)

## Example use cases

### Classification with MNIST data

Here we show an example of using ClCon with the reasonably large dataset of images MNIST, [YL1].

mnistData = ExampleData[{"MachineLearning", "MNIST"}, "Data"];

SeedRandom[3423]
p =
ClConUnit[RandomSample[mnistData, 20000]]⟹
ClConSplitData[0.7]⟹
ClConSummarizeData⟹
ClConMakeClassifier["NearestNeighbors"]⟹
ClConClassifierMeasurements[{"Accuracy", "ConfusionMatrixPlot"}]⟹
ClConEchoValue;

Here we plot the ROC curve for a specified digit:

p⟹ClConROCPlot["ClassLabels" -> 5];

### Conditional continuation

In this sub-section we show how the computations in a ClCon pipeline can be stopped or continued based on a certain condition.

The pipeline below makes a simple classifier ("LogisticRegression") for the WineQuality data, and if the recall for the important label ("high") is not large enough makes a more complicated classifier ("RandomForest"). The pipeline marks intermediate steps by echoing outcomes and messages.

SeedRandom[267]
res =
ClConUnit[dsWineQuality[All, Join[#, <|"wineQuality" -> If[#wineQuality >= 7, "high", "low"]|>] &]]⟹
ClConSplitData[0.75, 0.2]⟹
ClConSummarizeData(* summarize the data *)⟹
ClConMakeClassifier[Method -> "LogisticRegression"](* training a simple classifier *)⟹
ClConROCPlot["FPR", "TPR", "ROCPointCallouts" -> False]⟹
ClConClassifierMeasurements[{"Accuracy", "Precision", "Recall", "FalsePositiveRate"}]⟹
ClConEchoValue⟹
ClConIfElse[#["Recall", "high"] > 0.70 & (* criteria based on the recall for "high" *),
ClConEcho["Good recall for \"high\"!", "Success:"],
ClConUnit[##]⟹
ClConEcho[Style["Recall for \"high\" not good enough... making a large random forest.", Darker[Red]], "Info:"]⟹
ClConMakeClassifier[Method -> {"RandomForest", "TreeNumber" -> 400}](* training a complicated classifier *)⟹
ClConROCPlot["FPR", "TPR", "ROCPointCallouts" -> False]⟹
ClConClassifierMeasurements[{"Accuracy", "Precision", "Recall", "FalsePositiveRate"}]⟹
ClConEchoValue &];

We can see that the recall with the more complicated is classifier is higher. Also the ROC plots of the second classifier are visibly closer to the ideal one. Still, the recall is not good enough, we have to find a threshold that is better that the default one. (See the next sub-section.)

### Classification with custom thresholds

(In this sub-section we use the monad from the previous sub-section.)

Here we compute classification measures using the threshold 0.3 for the important class label ("high"):

res⟹
ClConClassifierMeasurementsByThreshold[{"Accuracy", "Precision", "Recall", "FalsePositiveRate"}, "high" -> 0.3]⟹
ClConTakeValue

(* <|"Accuracy" -> 0.782857,  "Precision" -> <|"high" -> 0.498871, "low" -> 0.943734|>,
"Recall" -> <|"high" -> 0.833962, "low" -> 0.76875|>,
"FalsePositiveRate" -> <|"high" -> 0.23125, "low" -> 0.166038|>|> *)

We can see that the recall for "high" is fairly large and the rest of the measures have satisfactory values. (The accuracy did not drop that much, and the false positive rate is not that large.)

Here we compute suggestions for the best thresholds:

res (* start with a previous monad *)⟹
ClConROCPlot[ImageSize -> 300] (* make ROC plots *)⟹
ClConSuggestROCThresholds[3] (* find the best 3 thresholds per class label *)⟹
ClConEchoValue (* echo the result *);

The suggestions are the ROC points that closest to the point {0, 1} (which corresponds to the ideal classifier.)

Here is a way to use threshold suggestions within the monad pipeline:

res⟹
ClConSuggestROCThresholds⟹
ClConEchoValue⟹
(ClConUnit[##]⟹
ClConClassifierMeasurementsByThreshold[{"Accuracy", "Precision", "Recall"}, "high" -> First[#1["high"]]] &)⟹
ClConEchoValue;

(*
value: <|high->{0.35},low->{0.65}|>
value: <|Accuracy->0.825306,Precision-><|high->0.571831,low->0.928736|>,Recall-><|high->0.766038,low->0.841667|>|>
*)

## Unit tests

The development of ClCon was done with two types of unit tests: (1) directly specified tests, [AAp11], and (2) tests based on randomly generated pipelines, [AAp12].

Both unit test packages should be further extended in order to provide better coverage of the functionalities and illustrate — and postulate — pipeline behavior.

### Directly specified tests

Here we run the unit tests file "MonadicContextualClassification-Unit-Tests.wlt", [AAp11]:

AbsoluteTiming[
]

The natural language derived test ID’s should give a fairly good idea of the functionalities covered in [AAp11].

Values[Map[#["TestID"] &, testObject["TestResults"]]]

"DataToContext-no-[]", "DataToContext-with-[]", \
"ClassifierMaking-with-Dataset-1", "ClassifierMaking-with-MLRules-1", \
"AccuracyByVariableShuffling-1", "ROCData-1", \
"ClassifierEnsemble-different-methods-1", \
"ClassifierEnsemble-different-methods-2-cont", \
"ClassifierEnsemble-different-methods-3-cont", \
"ClassifierEnsemble-one-method-1", "ClassifierEnsemble-one-method-2", \
"ClassifierEnsemble-one-method-3-cont", \
"ClassifierEnsemble-one-method-4-cont", "AssignVariableNames-1", \
"AssignVariableNames-2", "AssignVariableNames-3", "SplitData-1", \
"Set-and-take-training-data", "Set-and-take-test-data", \
"Set-and-take-validation-data", "Partial-data-summaries-1", \
"Assign-variable-names-1", "Split-data-100-pct", \
"MakeClassifier-with-empty-unit-1", \
"No-rocData-after-second-MakeClassifier-1"} *)

### Random pipelines tests

Since the monad ClCon is a DSL it is natural to test it with a large number of randomly generated "sentences" of that DSL. For the ClCon DSL the sentences are ClCon pipelines. The package "MonadicContextualClassificationRandomPipelinesUnitTests.m", [AAp12], has functions for generation of ClCon random pipelines and running them as verification tests. A short example follows.

Generate pipelines:

SeedRandom[234]
pipelines = MakeClConRandomPipelines[300];
Length[pipelines]

(* 300 *)

Here is a sample of the generated pipelines:

Block[{DoubleLongRightArrow, pipelines = RandomSample[pipelines, 6]},
Clear[DoubleLongRightArrow];
pipelines = pipelines /. {_Dataset -> "ds", _?DataRulesForClassifyQ -> "mlrData"};
GridTableForm[
Map[List@ToString[DoubleLongRightArrow @@ #, FormatType -> StandardForm] &, pipelines],
]
AutoCollapse[]

Here we run the pipelines as unit tests:

AbsoluteTiming[
res = TestRunClConPipelines[pipelines, "Echo" -> True];
]

(* {350.083, Null} *)

From the test report results we see that a dozen tests failed with messages, all of the rest passed.

rpTRObj = TestReport[res]

(The message failures, of course, have to be examined — some bugs were found in that way. Currently the actual test messages are expected.)

## Future plans

### Workflow operations

#### Outliers

Better outliers finding and manipulation incorporation in ClCon. Currently only outlier finding is surfaced in [AAp3]. (The package internally has other related functions.)

ClConUnit[dsTitanic[Select[#passengerSex == "female" &]]]⟹
ClConOutlierPosition⟹
ClConTakeValue

(* {4, 17, 21, 22, 25, 29, 38, 39, 41, 59} *)

#### Dimension reduction

Support of dimension reduction application — quick construction of pipelines that allow the applying different dimension reduction methods.

Currently with ClCon dimension reduction is applied only to data the non-label parts of which can be easily converted into numerical matrices.

ClConUnit[dsWineQuality]⟹
ClConSplitData[0.7]⟹
ClConReduceDimension[2, "Echo" -> True]⟹
ClConRetrieveFromContext["svdRes"]⟹
ClConEchoFunctionValue["SVD dimensions:", Dimensions /@ # &]⟹
ClConSummarizeData;

### Conversational agent

Using the packages [AAp13, AAp15] we can generate ClCon pipelines with natural commands. The plan is to develop and document those functionalities further.

## Implementation notes

The ClCon package, MonadicContextualClassification.m, [AAp3], is based on the packages [AAp1, AAp4-AAp9]. It was developed using Mathematica and the Mathematica plug-in for IntelliJ IDEA, by Patrick Scheibe , [PS1]. The following diagram shows the development workflow.

Some observations and morals follow.

• Making the unit tests [AAp11] made the final implementation stage much more comfortable.
• Of course, in retrospect that is obvious.
• Initially "MonadicContextualClassification.m" was not real a package, just a collection of global context functions with the prefix "ClCon". This made some programming design decisions harder, slower, and more cumbersome. By making a proper package the development became much easier because of the "peace of mind" brought by the context feature encapsulation.
• The making of random pipeline tests, [AAp12], helped catch a fair amount of inconvenient "features" and bugs.
• (Both tests sets [AAp11, AAp12] can be made to be more comprehensive.)
• The design of a conversational agent for producing ClCon pipelines with natural language commands brought a very fruitful viewpoint on the overall functionalities and the determination and limits of the ClCon development goals. See [AAp13, AAp14, AAp15].

• "Eat your own dog food", or in this case: "use ClCon functionalities to implement ClCon functionalities."

• Since we are developing a DSL it is natural to use that DSL for its own advancement.

• Again, in retrospect that is obvious. Also probably should be seen as a consequence of practicing a certain code refactoring discipline.

• The reason to list that moral is that often it is somewhat "easier" to implement functionalities thinking locally, ad-hoc, forgetting or not reviewing other, already implemented functions.

• In order come be better design and find inconsistencies: write many pipelines and discuss with co-workers.

• This is obvious. I would like to mention that a somewhat good alternative to discussions is (i) writing this document and related ones and (ii) making, running, and examining of the random pipelines tests.

## References

### Packages

[AAp9] Anton Antonov, SSparseMatrix Mathematica package, (2018), MathematicaForPrediction at GitHub.

[AAp10] Anton Antonov, Obtain and transform Mathematica machine learning data-sets, (2018), MathematicaVsR at GitHub.

### ConverationalAgents Packages

[AAp13] Anton Antonov, Classifier workflows grammar in EBNF, (2018), ConversationalAgents at GitHub, https://github.com/antononcube/ConversationalAgents.

[AAp14] Anton Antonov, Classifier workflows grammar Mathematica unit tests, (2018), ConversationalAgents at GitHub, https://github.com/antononcube/ConversationalAgents.

[AAp15] Anton Antonov, ClCon translator Mathematica package, (2018), ConversationalAgents at GitHub, https://github.com/antononcube/ConversationalAgents.

### MathematicaForPrediction articles

[AA1] Anton Antonov, Monad code generation and extension, (2017), MathematicaForPrediction at GitHub, https://github.com/antononcube/MathematicaForPrediction.

### Other

[YL1] Yann LeCun et al., MNIST database site. URL: http://yann.lecun.com/exdb/mnist/ .

[PS1] Patrick Scheibe, Mathematica (Wolfram Language) support for IntelliJ IDEA, (2013-2018), Mathematica-IntelliJ-Plugin at GitHub. URL: https://github.com/halirutan/Mathematica-IntelliJ-Plugin .

# Introduction

The main goals of this document are:

i) to demonstrate how to create versions and combinations of classifiers utilizing different perspectives,

ii) to apply the Receiver Operating Characteristic (ROC) technique into evaluating the created classifiers (see [2,3]) and

iii) to illustrate the use of the Mathematica packages [5,6].

The concrete steps taken are the following:

1. Obtain data: Mathematica built-in or external. Do some rudimentary analysis.

2. Create an ensemble of classifiers and compare its performance to the individual classifiers in the ensemble.

3. Produce classifier versions with from changed data in order to explore the effect of records outliers.

4. Make a bootstrapping classifier ensemble and evaluate and compare its performance.

5. Systematically diminish the training data and evaluate the results with ROC.

6. Show how to do classifier interpolation utilizing ROC.

In the steps above we skip the necessary preliminary data analysis. For the datasets we use in this document that analysis has been done elsewhere. (See [,,,].) Nevertheless, since ROC is mostly used for binary classifiers we want to analyze the class labels distributions in the datasets in order to designate which class labels are "positive" and which are "negative."

## ROC plots evaluation (in brief)

Assume we are given a binary classifier with the class labels P and N (for "positive" and "negative" respectively).

Consider the following measures True Positive Rate (TPR):

and False Positive Rate (FPR):

Assume that we can change the classifier results with a parameter and produce a plot like this one:

For each parameter value the point is plotted; points corresponding to consecutive ‘s are connected with a line. We call the obtained curve the ROC curve for the classifier in consideration. The ROC curve resides in the ROC space as defined by the functions FPR and TPR corresponding respectively to the -axis and the -axis.

The ideal classifier would have its ROC curve comprised of a line connecting {0,0} to {0,1} and a line connecting {0,1} to {1,1}.

Given a classifier the ROC point closest to {0,1}, generally, would be considered to be the best point.

## The wider perspective

This document started as being a part of a conference presentation about illustrating the cultural differences between Statistics and Machine learning (for Wolfram Technology Conference 2016). Its exposition become both deeper and wider than expected. Here are the alternative, original goals of the document:

i) to demonstrate how using ROC a researcher can explore classifiers performance without intimate knowledge of the classifiers mechanisms, and

ii) to provide concrete examples of the typical investigation approaches employed by machine learning researchers.

To make those points clearer and more memorable we are going to assume that exposition is a result of the research actions of a certain protagonist with a suitably selected character.

A by-product of the exposition is that it illustrates the following lessons from machine learning practices. (See [1].)

1. For a given classification task there often are multiple competing models.

2. The outcomes of the good machine learning algorithms might be fairly complex. I.e. there are no simple interpretations when really good results are obtained.

3. Having high dimensional data can be very useful.

In [1] these three points and discussed under the names "Rashomon", "Occam", and "Bellman". To quote:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality — curse or blessing."

## The protagonist

Our protagonist is a "Simple Nuclear Physicist" (SNP) — someone who is accustomed to obtaining a lot of data that has to be analyzed and mined sometimes very deeply, rigorously, and from a lot of angles, for different hypotheses. SNP is fairly adept in programming and critical thinking, but he does not have or care about deep knowledge of statistics methods or machine learning algorithms. SNP is willing and capable to use software libraries that provide algorithms for statistical methods and machine learning.

SNP is capable of coming up with ROC if he is not aware of it already. ROC is very similar to the so called phase space diagrams physicists do.

# Used packages

These commands load the used Mathematica packages [4,5,6]:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MathematicaForPredictionUtilities.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/ROCFunctions.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/ClassifierEnsembles.m"]

# Data used

## The Titanic dataset

These commands load the Titanic data (that is shipped with Mathematica).

data = ExampleData[{"MachineLearning", "Titanic"}, "TrainingData"];
columnNames = (Flatten@*List) @@ ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"];
data = ((Flatten@*List) @@@ data)[[All, {1, 2, 3, -1}]];
trainingData = DeleteCases[data, {___, _Missing, ___}];
Dimensions[trainingData]

(* {732, 4} *)

RecordsSummary[trainingData, columnNames]

data = ExampleData[{"MachineLearning", "Titanic"}, "TestData"];
data = ((Flatten@*List) @@@ data)[[All, {1, 2, 3, -1}]];
testData = DeleteCases[data, {___, _Missing, ___}];
Dimensions[testData]

(* {314, 4} *)

RecordsSummary[testData, columnNames]

nTrainingData = trainingData /. {"survived" -> 1, "died" -> 0, "1st" -> 0, "2nd" -> 1, "3rd" -> 2, "male" -> 0, "female" -> 1};

# Classifier ensembles

This command makes a classifier ensemble of two built-in classifiers "NearestNeighbors" and "NeuralNetwork":

aCLs = EnsembleClassifier[{"NearestNeighbors", "NeuralNetwork"}, trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]]]

A classifier ensemble of the package [6] is simply an association mapping classifier IDs to classifier functions.

The first argument given to EnsembleClassifier can be Automatic:

SeedRandom[8989]
aCLs = EnsembleClassifier[Automatic, trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]]];

With Automatic the following built-in classifiers are used:

Keys[aCLs]

(* {"NearestNeighbors", "NeuralNetwork", "LogisticRegression", "RandomForest", "SupportVectorMachine", "NaiveBayes"} *)

Classification with the classifier ensemble can be done using the function EnsembleClassify. If the third argument of EnsembleClassify is "Votes" the result is the class label that appears the most in the ensemble results.

EnsembleClassify[aCLs, testData[[20, 1 ;; -2]], "Votes"]

(* "died" *)

The following commands clarify the voting done in the command above.

Map[#[testData[[20, 1 ;; -2]]] &, aCLs]
Tally[Values[%]]

(* <|"NearestNeighbors" -> "died", "NeuralNetwork" -> "survived", "LogisticRegression" -> "survived", "RandomForest" -> "died", "SupportVectorMachine" -> "died", "NaiveBayes" -> "died"|> *)

(* {{"died", 4}, {"survived", 2}} *)

## Classification with ensemble averaged probabilities

If the third argument of EnsembleClassify is "ProbabilitiesMean" the result is the class label that has the highest mean probability in the ensemble results.

EnsembleClassify[aCLs, testData[[20, 1 ;; -2]], "ProbabilitiesMean"]

(* "died" *)

The following commands clarify the probability averaging utilized in the command above.

Map[#[testData[[20, 1 ;; -2]], "Probabilities"] &, aCLs]
Mean[Values[%]]

(* <|"NearestNeighbors" -> <|"died" -> 0.598464, "survived" -> 0.401536|>, "NeuralNetwork" -> <|"died" -> 0.469274, "survived" -> 0.530726|>, "LogisticRegression" -> <|"died" -> 0.445915, "survived" -> 0.554085|>,
"RandomForest" -> <|"died" -> 0.652414, "survived" -> 0.347586|>, "SupportVectorMachine" -> <|"died" -> 0.929831, "survived" -> 0.0701691|>, "NaiveBayes" -> <|"died" -> 0.622061, "survived" -> 0.377939|>|> *)

(* <|"died" -> 0.61966, "survived" -> 0.38034|> *)

The third argument of EnsembleClassifyByThreshold takes a rule of the form label->threshold; the fourth argument is eighter "Votes" or "ProbabiltiesMean".

The following code computes the ROC curve for a range of votes.

rocRange = Range[0, Length[aCLs] - 1, 1];
aROCs = Table[(
cres = EnsembleClassifyByThreshold[aCLs, testData[[All, 1 ;; -2]], "survived" -> i, "Votes"]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
ROCPlot[rocRange, aROCs, "PlotJoined" -> Automatic, GridLines -> Automatic]

## ROC for ensemble probabilities mean

If we want to compute ROC of a range of probability thresholds we EnsembleClassifyByThreshold with the fourth argument being "ProbabilitiesMean".

EnsembleClassifyByThreshold[aCLs, testData[[1 ;; 6, 1 ;; -2]], "survived" -> 0.2, "ProbabilitiesMean"]

(* {"survived", "survived", "survived", "survived", "survived", "survived"} *)

EnsembleClassifyByThreshold[aCLs, testData[[1 ;; 6, 1 ;; -2]], "survived" -> 0.6, "ProbabilitiesMean"]

(* {"survived", "died", "survived", "died", "died", "survived"} *)

The implementation of EnsembleClassifyByThreshold with "ProbabilitiesMean" relies on the ClassifierFunction signature:

ClassifierFunction[__][record_, "Probabilities"]

Here is the corresponding ROC plot:

rocRange = Range[0, 1, 0.025];
aROCs = Table[(
cres = EnsembleClassifyByThreshold[aCLs, testData[[All, 1 ;; -2]], "survived" -> i, "ProbabilitiesMean"]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
rocEnGr = ROCPlot[rocRange, aROCs, "PlotJoined" -> Automatic, PlotLabel -> "Classifier ensemble", GridLines -> Automatic]

## Comparison of the ensemble classifier with the standard classifiers

This plot compares the ROC curve of the ensemble classifier with the ROC curves of the classifiers that comprise the ensemble.

rocGRs = Table[
aROCs1 = Table[(
cres = ClassifyByThreshold[aCLs[[i]], testData[[All, 1 ;; -2]], "survived" -> th];
ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {th, rocRange}];
ROCPlot[rocRange, aROCs1, PlotLabel -> Keys[aCLs][[i]], PlotRange -> {{0, 1.05}, {0.6, 1.01}}, "PlotJoined" -> Automatic, GridLines -> Automatic],
{i, 1, Length[aCLs]}];

GraphicsGrid[ArrayReshape[Append[Prepend[rocGRs, rocEnGr], rocEnGr], {2, 4}, ""], Dividers -> All, FrameStyle -> GrayLevel[0.8], ImageSize -> 1200]

Let us plot all ROC curves from the graphics grid above into one plot. For that the single classifier ROC curves are made gray, and their threshold callouts removed. We can see that the classifier ensemble brings very good results for and none of the single classifiers has a better point.

Show[Append[rocGRs /. {RGBColor[___] -> GrayLevel[0.8]} /. {Text[p_, ___] :> Null} /. ((PlotLabel -> _) :> (PlotLabel -> Null)), rocEnGr]]

# Classifier ensembles by bootstrapping

There are several ways to produce ensemble classifiers using bootstrapping or jackknife resampling procedures.

First, we are going to make a bootstrapping classifier ensemble using one of the Classify methods. Then we are going to make a more complicated bootstrapping classifier with six methods of Classify.

## Bootstrapping ensemble with a single classification method

First we select a classification method and make a classifier with it.

clMethod = "NearestNeighbors";
sCL = Classify[trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]], Method -> clMethod];

The following code makes a classifier ensemble of 12 classifier functions using resampled, slightly smaller (10%) versions of the original training data (with RandomChoice).

SeedRandom[1262];
aBootStrapCLs = Association@Table[(
inds = RandomChoice[Range[Length[trainingData]], Floor[0.9*Length[trainingData]]];
ToString[i] -> Classify[trainingData[[inds, 1 ;; -2]] -> trainingData[[inds, -1]], Method -> clMethod]), {i, 12}];

Let us compare the ROC curves of the single classifier with the bootstrapping derived ensemble.

rocRange = Range[0.1, 0.9, 0.025];
AbsoluteTiming[
aSingleROCs = Table[(
cres = ClassifyByThreshold[sCL, testData[[All, 1 ;; -2]], "survived" -> i]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
aBootStrapROCs = Table[(
cres = EnsembleClassifyByThreshold[aBootStrapCLs, testData[[All, 1 ;; -2]], "survived" -> i]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
]

(* {6.81521, Null} *)

Legended[
Show[{
ROCPlot[rocRange, aSingleROCs, "ROCColor" -> Blue, "PlotJoined" -> Automatic, GridLines -> Automatic],
ROCPlot[rocRange, aBootStrapROCs, "ROCColor" -> Red, "PlotJoined" -> Automatic]}],
SwatchLegend @@ Transpose@{{Blue, Row[{"Single ", clMethod, " classifier"}]}, {Red, Row[{"Boostrapping ensemble of\n", Length[aBootStrapCLs], " ", clMethod, " classifiers"}]}}]

We can see that we get much better results with the bootstrapped ensemble.

## Bootstrapping ensemble with multiple classifier methods

This code creates an classifier ensemble using the classifier methods corresponding to Automatic given as a first argument to EnsembleClassifier.

SeedRandom[2324]
AbsoluteTiming[
aBootStrapLargeCLs = Association@Table[(
inds = RandomChoice[Range[Length[trainingData]], Floor[0.9*Length[trainingData]]];
ecls = EnsembleClassifier[Automatic, trainingData[[inds, 1 ;; -2]] -> trainingData[[inds, -1]]];
AssociationThread[Map[# <> "-" <> ToString[i] &, Keys[ecls]] -> Values[ecls]]
), {i, 12}];
]

(* {27.7975, Null} *)

This code computes the ROC statistics with the obtained bootstrapping classifier ensemble:

AbsoluteTiming[
aBootStrapLargeROCs = Table[(
cres = EnsembleClassifyByThreshold[aBootStrapLargeCLs, testData[[All, 1 ;; -2]], "survived" -> i]; ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres]), {i, rocRange}];
]

(* {45.1995, Null} *)

Let us plot the ROC curve of the bootstrapping classifier ensemble (in blue) and the single classifier ROC curves (in gray):

aBootStrapLargeGr = ROCPlot[rocRange, aBootStrapLargeROCs, "PlotJoined" -> Automatic];
Show[Append[rocGRs /. {RGBColor[___] -> GrayLevel[0.8]} /. {Text[p_, ___] :> Null} /. ((PlotLabel -> _) :> (PlotLabel -> Null)), aBootStrapLargeGr]]

Again we can see that the bootstrapping ensemble produced better ROC points than the single classifiers.

# Damaging data

This section tries to explain why the bootstrapping with resampling to smaller sizes produces good results.

In short, the training data has outliers; if we remove small fractions of the training data we might get better results.

The procedure described in this section can be used in conjunction with the procedures described in the guide for importance of variables investigation [7].

## Ordering function

Let us replace the categorical values with numerical in the training data. There are several ways to do it, here is a fairly straightforward one:

nTrainingData = trainingData /. {"survived" -> 1, "died" -> 0, "1st" -> 0, "2nd" -> 1, "3rd" -> 2, "male" -> 0, "female" -> 1};

## Decreasing proportions of females

First, let us find all indices corresponding to records about females.

femaleInds = Flatten@Position[trainingData[[All, 3]], "female"];

The following code standardizes the training data corresponding to females, finds the mean record, computes distances from the mean record, and finally orders the female records indices according to their distances from the mean record.

t = Transpose@Map[Rescale@*Standardize, N@Transpose@nTrainingData[[femaleInds, 1 ;; 2]]];
m = Mean[t];
ds = Map[EuclideanDistance[#, m] &, t];
femaleInds = femaleInds[[Reverse@Ordering[ds]]];

The following plot shows the distances calculated above.

ListPlot[Sort@ds, PlotRange -> All, PlotTheme -> "Detailed"]

The following code removes from the training data the records corresponding to females according to the order computed above. The female records farthest from the mean female record are removed first.

AbsoluteTiming[
femaleFrRes = Association@
Table[cl ->
Table[(
inds = Complement[Range[Length[trainingData]], Take[femaleInds, Ceiling[fr*Length[femaleInds]]]];
cf = Classify[trainingData[[inds, 1 ;; -2]] -> trainingData[[inds, -1]], Method -> cl]; cfPredictedLabels = cf /@ testData[[All, 1 ;; -2]];
{fr, ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cfPredictedLabels]}),
{fr, 0, 0.8, 0.05}],
{cl, {"NearestNeighbors", "NeuralNetwork", "LogisticRegression", "RandomForest", "SupportVectorMachine", "NaiveBayes"}}];
]

(* {203.001, Null} *)

The following graphics grid shows how the classification results are affected by the removing fractions of the female records from the training data. The results for none or small fractions of records removed are more blue.

GraphicsGrid[ArrayReshape[
Table[
femaleAROCs = femaleFrRes[cl][[All, 2]];
frRange = femaleFrRes[cl][[All, 1]]; ROCPlot[frRange, femaleAROCs, PlotRange -> {{0.0, 0.25}, {0.2, 0.8}}, PlotLabel -> cl, "ROCPointColorFunction" -> (Blend[{Blue, Red}, #3/Length[frRange]] &), ImageSize -> 300],
{cl, Keys[femaleFrRes]}],
{2, 3}], Dividers -> All]

We can see that removing the female records outliers has dramatic effect on the results by the classifiers "NearestNeighbors" and "NeuralNetwork". Not so much on "LogisticRegression" and "NaiveBayes".

## Decreasing proportions of males

The code in this sub-section repeats the experiment described in the previous one males (instead of females).

maleInds = Flatten@Position[trainingData[[All, 3]], "male"];

t = Transpose@Map[Rescale@*Standardize, N@Transpose@nTrainingData[[maleInds, 1 ;; 2]]];
m = Mean[t];
ds = Map[EuclideanDistance[#, m] &, t];
maleInds = maleInds[[Reverse@Ordering[ds]]];

ListPlot[Sort@ds, PlotRange -> All, PlotTheme -> "Detailed"]

AbsoluteTiming[
maleFrRes = Association@
Table[cl ->
Table[(
inds = Complement[Range[Length[trainingData]], Take[maleInds, Ceiling[fr*Length[maleInds]]]];
cf = Classify[trainingData[[inds, 1 ;; -2]] -> trainingData[[inds, -1]], Method -> cl]; cfPredictedLabels = cf /@ testData[[All, 1 ;; -2]];
{fr, ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cfPredictedLabels]}),
{fr, 0, 0.8, 0.05}],
{cl, {"NearestNeighbors", "NeuralNetwork", "LogisticRegression", "RandomForest", "SupportVectorMachine", "NaiveBayes"}}];
]

(* {179.219, Null} *)

GraphicsGrid[ArrayReshape[
Table[
maleAROCs = maleFrRes[cl][[All, 2]];
frRange = maleFrRes[cl][[All, 1]]; ROCPlot[frRange, maleAROCs, PlotRange -> {{0.0, 0.35}, {0.55, 0.82}}, PlotLabel -> cl, "ROCPointColorFunction" -> (Blend[{Blue, Red}, #3/Length[frRange]] &), ImageSize -> 300],
{cl, Keys[maleFrRes]}],
{2, 3}], Dividers -> All]

# Classifier interpolation

Assume that we want a classifier that for a given representative set of items (records) assigns the positive label to an exactly of them. (Or very close to that number.)

If we have two classifiers, one returning more positive items than , the other less than , then we can use geometric computations in the ROC space in order to obtain parameters for a classifier interpolation that will bring positive items close to ; see [3]. Below is given Mathematica code with explanations of how that classifier interpolation is done.

Assume that by prior observations we know that for a given dataset of items the positive class consists of items. Assume that for a given unknown dataset of items we want of the items to be classified as positive. We can write the equation:

which can be simplified to

## The two classifiers

Consider the following two classifiers.

cf1 = Classify[trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]], Method -> "RandomForest"];
cfROC1 = ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cf1[testData[[All, 1 ;; -2]]]]
(* <|"TruePositive" -> 82, "FalsePositive" -> 22, "TrueNegative" -> 170, "FalseNegative" -> 40|> *)

cf2 = Classify[trainingData[[All, 1 ;; -2]] -> trainingData[[All, -1]], Method -> "LogisticRegression"];
cfROC2 = ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cf2[testData[[All, 1 ;; -2]]]]
(* <|"TruePositive" -> 89, "FalsePositive" -> 37, "TrueNegative" -> 155, "FalseNegative" -> 33|> *)

## Geometric computations in the ROC space

Here are the ROC space points corresponding to the two classifiers, cf1 and cf2:

p1 = Through[ROCFunctions[{"FPR", "TPR"}][cfROC1]];
p2 = Through[ROCFunctions[{"FPR", "TPR"}][cfROC2]];

Here is the breakdown of frequencies of the class labels:

Tally[trainingData[[All, -1]]]
%[[All, 2]]/Length[trainingData] // N

(* {{"survived", 305}, {"died", 427}}
{0.416667, 0.583333}) *)

We want to our classifier to produce % people to survive. Here we find two points of the corresponding constraint line (on which we ROC points of the desired classifiers should reside):

sol1 = Solve[{{x, y} \[Element] ImplicitRegion[{x (1 - 0.42) + y 0.42 == 0.38}, {x, y}], x == 0.1}, {x, y}][[1]]
sol2 = Solve[{{x, y} \[Element] ImplicitRegion[{x (1 - 0.42) + y 0.42 == 0.38}, {x, y}], x == 0.25}, {x, y}][[1]]

(* {x -> 0.1, y -> 0.766667}
{x -> 0.25, y -> 0.559524} *)

Here using the points q1 and q2 of the constraint line we find the intersection point with the line connecting the ROC points of the classifiers:

{q1, q2} = {{x, y} /. sol1, {x, y} /. sol2};
sol = Solve[ {{x, y} \[Element] InfiniteLine[{q1, q2}] \[And] {x, y} \[Element] InfiniteLine[{p1, p2}]}, {x, y}];
q = {x, y} /. sol[[1]]

(* {0.149753, 0.69796} *)

Let us plot all geometric objects:

Graphics[{PointSize[0.015], Blue, Tooltip[Point[p1], "cf1"], Black,
Text["cf1", p1, {-1.5, 1}], Red, Tooltip[Point[p2], "cf2"], Black,
Text["cf2", p2, {1.5, -1}], Black, Point[q], Dashed,
InfiniteLine[{q1, q2}], Thin, InfiniteLine[{p1, p2}]},
PlotRange -> {{0., 0.3}, {0.6, 0.8}},
GridLines -> Automatic, Frame -> True]

## Classifier interpolation

Next we find the ratio of the distance from the intersection point q to the cf1 ROC point and the distance between the ROC points of cf1 and cf2.

k = Norm[p1 - q]/Norm[p1 - p2]
(* 0.450169 *)

The classifier interpolation is made by a weighted random selection based on that ratio (using RandomChoice):

SeedRandom[8989]
cres = MapThread[If, {RandomChoice[{1 - k, k} -> {True, False}, Length[testData]], cf1@testData[[All, 1 ;; -2]], cf2@testData[[All, 1 ;; -2]]}];
cfROC3 = ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres];
p3 = Through[ROCFunctions[{"FPR", "TPR"}][cfROC3]];
Graphics[{PointSize[0.015], Blue, Point[p1], Red, Point[p2], Black, Dashed, InfiniteLine[{q1, q2}], Green, Point[p3]},
PlotRange -> {{0., 0.3}, {0.6, 0.8}},
GridLines -> Automatic, Frame -> True]

We can run the process multiple times in order to convince ourselves that the interpolated classifier ROC point is very close to the constraint line most of the time.

p3s =
Table[(
cres =
MapThread[If, {RandomChoice[{1 - k, k} -> {True, False}, Length[testData]], cf1@testData[[All, 1 ;; -2]], cf2@testData[[All, 1 ;; -2]]}];
cfROC3 = ToROCAssociation[{"survived", "died"}, testData[[All, -1]], cres];
Through[ROCFunctions[{"FPR", "TPR"}][cfROC3]]), {1000}];

Show[{SmoothDensityHistogram[p3s, ColorFunction -> (Blend[{White, Green}, #] &), Mesh -> 3],
Graphics[{PointSize[0.015], Blue, Tooltip[Point[p1], "cf1"], Black, Text["cf1", p1, {-1.5, 1}],
Red, Tooltip[Point[p2], "cf2"], Black, Text["cf2", p2, {1.5, -1}],
Black, Dashed, InfiniteLine[{q1, q2}]}, GridLines -> Automatic]},
PlotRange -> {{0., 0.3}, {0.6, 0.8}},
GridLines -> Automatic, Axes -> True,
AspectRatio -> Automatic]

# References

[1] Leo Breiman, Statistical Modeling: The Two Cultures, (2001), Statistical Science, Vol. 16, No. 3, 199[Dash]231.

[3] Tom Fawcett, An introduction to ROC analysis, (2006), Pattern Recognition Letters, 27, 861[Dash]874. (Link to PDF.)

[4] Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m.

[5] Anton Antonov, Receiver operating characteristic functions Mathematica package, (2016), source code MathematicaForPrediction at GitHub, package ROCFunctions.m.

[6] Anton Antonov, Classifier ensembles functions Mathematica package, (2016), source code MathematicaForPrediction at GitHub, package ClassifierEnsembles.m.

[7] Anton Antonov, "Importance of variables investigation guide", (2016), MathematicaForPrediction at GitHub, folder Documentation.

# Making Chernoff faces for data visualization

## Introduction

This blog post describes the use of face-like diagrams to visualize multidimensional data introduced by Herman Chernoff in 1973, see [1].

The idea to use human faces in order to understand, evaluate, or easily discern (the records of) multidimensional data is very creative and inspirational. As Chernoff says in [1], the object of the idea is to "represent multivariate data, subject to strong but possibly complex relationships, in such a way that an investigator can quickly comprehend relevant information and then apply appropriate statistical analysis." It is an interesting question how useful this approach is and it seems that there at least several articles discussing that; see for example [2].

I personally find the use of Chernoff faces useful in a small number of cases, but that is probably true for many “creative” data visualization methods.

Below are given both simple and more advanced examples of constructing Chernoff faces for data records using the Mathematica package [3]. The considered data is categorized as:

1. a small number of records, each with small number of elements;
2. a large number of records, each with less elements than Chernoff face parts;
3. a list of long records, each record with much more elements than Chernoff face parts;
4. a list of nearest neighbors or recommendations.

For several of the visualizing scenarios the records of two "real life" data sets are used: Fisher Iris flower dataset [7], and "Vinho Verde" wine quality dataset [8]. For the rest of the scenarios the data is generated.

A fundamental restriction of using Chernoff faces is the necessity to properly transform the data variables into the ranges of the Chernoff face diagram parameters. Therefore, proper data transformation (standadizing and rescaling) is an inherent part of the application of Chernoff faces, and this document describes such data transformation procedures (also using [3]).

The packages [3,4] are used to produce the diagrams in this post. The following two commands load them.

Import["https://raw.githubusercontent.com/antononcube/\
MathematicaForPrediction/master/ChernoffFaces.m"]

Import["https://raw.githubusercontent.com/antononcube/\
MathematicaForPrediction/master/MathematicaForPredictionUtilities.m"]

## Making faces

### Just a face

Here is a face produced by the function ChernoffFace of [3] with its simplest signature:

SeedRandom[152092]
ChernoffFace[]

Here is a list of faces:

SeedRandom[152092]
Table[ChernoffFace[ImageSize -> Tiny], {7}]

### Proper face making

The "proper" way to call ChernoffFace is to use an association for the facial parts placement, size, rotation, and color. The options are passed to Graphics.

SeedRandom[2331];
Keys[ChernoffFace["FacePartsProperties"]] ->
RandomReal[1, Length[ChernoffFace["FacePartsProperties"]]]],
ImageSize -> Small , Background -> GrayLevel[0.85]]

The Chernoff face drawn with the function ChernoffFace can be parameterized to be asymmetric.

The parameters argument mixes (1) face parts placement, sizes, and rotation, with (2) face parts colors and (3) a parameter should it be attempted to make the face symmetric. All facial parts parameters have the range [0,1]

Here is the facial parameter list:

Keys[ChernoffFace["FacePartsProperties"]]

(* {"FaceLength", "ForheadShape", "EyesVerticalPosition", "EyeSize", \
"EyeSlant", "LeftEyebrowSlant", "LeftIris", "NoseLength", \
"MouthSmile", "LeftEyebrowTrim", "LeftEyebrowRaising", "MouthTwist", \
"MouthWidth", "RightEyebrowTrim", "RightEyebrowRaising", \
"RightEyebrowSlant", "RightIris"} *)

The order of the parameters is chosen to favor making symmetric faces when a list of random numbers is given as an argument, and to make it easier to discern the faces when multiple records are visualized. For experiments and discussion about which facial features bring better discern-ability see [2]. One of the conclusions of [2] is that eye size and eye brow slant are most decisive, followed by face size and shape.

Here are the rest of the parameters (colors and symmetricity):

Complement[Keys[ChernoffFace["Properties"]],
Keys[ChernoffFace["FacePartsProperties"]]]

(* {"EyeBallColor", "FaceColor", "IrisColor", "MakeSymmetric", \
"MouthColor", "NoseColor"} *)

### Face coloring

The following code make a row of faces by generating seven sequences of random numbers in $$[0,1]$$, each sequence with length the number of facial parameters. The face color is assigned randomly and the face color or a darker version of it is used as a nose color. If the nose color is the same as the face color the nose is going to be shown "in profile", otherwise as a filled polygon. The colors of the irises are random blend between light brown and light blue. The color of the mouth is randomly selected to be black or red.

SeedRandom[201894];
Block[{pars = Keys[ChernoffFace["FacePartsProperties"]]},
Grid[{#}] &@
Table[ChernoffFace[Join[
<|"FaceColor" -> (rc =
ColorData["BeachColors"][RandomReal[1]]),
"NoseColor" -> RandomChoice[{Identity, Darker}][rc],
"IrisColor" -> Lighter[Blend[{Brown, Blue}, RandomReal[1]]],
"MouthColor" -> RandomChoice[{Black, Red}]|>],
ImageSize -> 100], {7}]
]

### Symmetric faces

The parameter "MakeSymmetric" is by default True. Setting "MakeSymmetric" to true turns an incomplete face specification into a complete specification with the missing paired parameters filled in. In other words, the symmetricity is not enforced on the specified paired parameters, only on the ones for which specifications are missing.

The following faces are made symmetric by removing the facial parts parameters that start with "R" (for "Right") and the parameter "MouthTwist".

SeedRandom[201894];
Block[{pars = Keys[ChernoffFace["FacePartsProperties"]]},
Grid[{#}] &@Table[(
asc =
pars -> RandomReal[1, Length[pars]]],
<|"FaceColor" -> (rc =
ColorData["BeachColors"][RandomReal[1]]),
"NoseColor" -> RandomChoice[{Identity, Darker}][rc],
"IrisColor" ->
Lighter[Blend[{Brown, Blue}, RandomReal[1]]],
"MouthColor" -> RandomChoice[{Black, Red}]|>];
asc =
Pick[asc,
StringMatchQ[Keys[asc],
x : (StartOfString ~~ Except["R"] ~~ __) /;
x != "MouthTwist"]];
ChernoffFace[asc, ImageSize -> 100]), {7}]]

Note that for the irises we have two possibilities of synchronization:

pars = <|"LeftIris" -> 0.8, "IrisColor" -> Green|>;
{ChernoffFace[Join[pars, <|"RightIris" -> pars["LeftIris"]|>],
ImageSize -> 100],
ChernoffFace[
Join[pars, <|"RightIris" -> 1 - pars["LeftIris"]|>],
ImageSize -> 100]}

## Visualizing records (first round)

The conceptually straightforward application of Chernoff faces is to visualize ("give a face" to) each record in a dataset. Because the parameters of the faces have the same ranges for the different records, proper rescaling of the records have to be done first. Of course standardizing the data can be done before rescaling.

First let us generate some random data using different distributions:

SeedRandom[3424]
{dists, data} = Transpose@Table[(
rdist =
RandomChoice[{NormalDistribution[RandomReal[10],
RandomReal[10]], PoissonDistribution[RandomReal[4]],
{rdist, RandomVariate[rdist, 12]}), {10}];
data = Transpose[data];

The data is generated in such a way that each column comes from a certain probability distribution. Hence, each record can be seen as an observation of the variables corresponding to the columns.

This is how the columns of the generated data look like using DistributionChart:

DistributionChart[Transpose[data],
ChartLabels ->
Placed[MapIndexed[
Grid[List /@ {Style[#2[[1]], Bold, Red, Larger]}] &, dists], Above],
ChartElementFunction -> "PointDensity",
ChartStyle -> "SandyTerrain", ChartLegends -> dists,
BarOrigin -> Bottom, GridLines -> Automatic,
ImageSize -> 900]

At this point we can make a face for each record of the rescaled data:

faces = Map[ChernoffFace, Transpose[Rescale /@ Transpose[data]]];

and visualize the obtained faces in a grid.

Row[{Grid[
Partition[#, 4] &@Map[Append[#, ImageSize -> 100] &, faces]],
"   ", Magnify[#, 0.85] &@
GridTableForm[
List /@ Take[Keys[ChernoffFace["FacePartsProperties"]],
Dimensions[data][[2]]],
}]

(The table on the right shows which facial parts are used for which data columns.)

## Some questions to consider

Several questions and observations arise from the example in the previous section.

#### 1. What should we do if the data records have more elements than facial parts parameters of the Chernoff face diagram?

This is another fundamental restriction of Chernoff faces — the number of data columns is limiter by the number of facial features.

One way to resolve this is to select important variables (columns) of the data; another is to represent the records with a vector of statistics. The latter is shown in the section "Chernoff faces for lists of long lists".

#### 2. Are there Chernoff face parts that are easier to perceive or judge than others and provide better discern-ability for large collections of records?

Research of the pre-attentiveness and effectiveness with Chernoff faces, [2], shows that eye size and eyebrow slant are the features that provide best discern-ability. Below this is used to select some of the variable-to-face-part correspondences.

#### 3. How should we deal with outliers?

Since we cannot just remove the outliers from a record — we have to have complete records — we can simply replace the outliers with the minimum or maximum values allowed for the corresponding Chernoff face feature. (All facial features of ChernoffFace have the range $$[0,1]$$.) See the next section for an example.

## Data standardizing and rescaling

Given a full array of records, we most likely have to standardize and rescale the columns in order to use the function ChernoffFace. To help with that the package [3] provides the function VariablesRescale which has the options "StandardizingFunction" and "RescaleRangeFunction".

Consider the following example of VariableRescale invocation in which: 1. each column is centered around its median and then divided by the inter-quartile half-distance (quartile deviation), 2. followed by clipping of the outliers that are outside of the disk with radius 3 times the quartile deviation, and 3. rescaling to the unit interval.

rdata = VariablesRescale[N@data,
"StandardizingFunction" -> (Standardize[#, Median, QuartileDeviation] &),
"RescaleRangeFunction" -> ({-3, 3} QuartileDeviation[#] &)];
TableForm[rdata /. {0 -> Style[0, Bold, Red], 1 -> Style[1, Bold, Red]}]

Remark: The bottom outliers are replaced with 0 and the top outliers with 1 using Clip.

## Chernoff faces for a small number of short records

In this section we are going use the Fisher Iris flower data set [7]. By "small number of records" we mean few hundred or less.

### Getting the data

These commands get the Fisher Iris flower data set shipped with Mathematica:

irisDataSet =
Map[Flatten,
List @@@ ExampleData[{"MachineLearning", "FisherIris"}, "Data"]];
irisColumnNames =
Most@Flatten[
List @@ ExampleData[{"MachineLearning", "FisherIris"},
"VariableDescriptions"]];
Dimensions[irisDataSet]

(* {150, 5} *)

Here is a summary of the data:

Grid[{RecordsSummary[irisDataSet, irisColumnNames]},
Dividers -> All, Alignment -> Top]

### Simple variable dependency analysis

Using the function VariableDependenceGrid of the package [4] we can plot a grid of variable cross-dependencies. We can see from the last row and column that "Petal length" and "Petal width" separate setosa from versicolor and virginica with a pretty large gap.

Magnify[#, 1] &@
VariableDependenceGrid[irisDataSet, irisColumnNames,
"IgnoreCategoricalVariables" -> False]

### Chernoff faces for Iris flower records

Since we want to evaluate the usefulness of Chernoff faces for discerning data records groups or clusters, we are going to do the following steps.

1. Data transformation. This includes standardizing and rescaling and selection of colors.
2. Make a Chernoff face for each record without the label class "Species of iris".
3. Plot shuffled Chernoff faces and attempt to visually cluster them or find patterns.
4. Make a Chernoff face for each record using the label class "Specie of iris" to color the faces. (Records of the same class get faces of the same color.)
5. Compare the plots and conclusions of step 2 and 4.

#### 1. Data transformation

First we standardize and rescale the data:

chernoffData = VariablesRescale[irisDataSet[[All, 1 ;; 4]]];

These are the colors used for the different species of iris:

faceColorRules =
-> Map[Lighter[#, 0.5] &, {Purple, Blue, Green}]]

(* {"setosa" -> RGBColor[0.75, 0.5, 0.75],
"versicolor" -> RGBColor[0.5, 0.5, 1.],
"virginica" -> RGBColor[0.5, 1., 0.5]} *)

Add the colors to the data for the faces:

chernoffData = MapThread[
Append, {chernoffData, irisDataSet[[All, -1]] /. faceColorRules}];

Plot the distributions of the rescaled variables:

DistributionChart[
Transpose@chernoffData[[All, 1 ;; 4]],
GridLines -> Automatic,
ChartElementFunction -> "PointDensity",
ChartStyle -> "SandyTerrain",
ChartLegends -> irisColumnNames, ImageSize -> Large]

#### 2. Black-and-white Chernoff faces

Make a black-and-white Chernoff face for each record without using the species class:

chfacesBW =
ChernoffFace[
"LeftEyebrowSlant"} -> Most[#]],
ImageSize -> 100] & /@ chernoffData;

Since "Petal length" and "Petal width" separate the classes well for those columns we have selected the parameters "EyeSize" and "LeftEyebrowSlant" based on [2].

#### 3. Finding patterns in a collection of faces

Combine the faces into a image collage:

ImageCollage[RandomSample[chfacesBW], Background -> White]

We can see that faces with small eyes tend have middle-lowered eyebrows, and that faces with large eyes tend to have middle raised eyebrows and large noses.

#### 4. Chernoff faces colored by the species

Make a Chernoff face for each record using the colors added to the rescaled data:

chfaces =
ChernoffFace[
"LeftEyebrowSlant", "FaceColor"} -> #],
ImageSize -> 100] & /@ chernoffData;

Make an image collage with the obtained faces:

ImageCollage[chfaces, Background -> White]

#### 5. Comparison

We can see that the collage with colored faces completely explains the patterns found in the black-and-white faces: setosa have smaller petals (both length and width), and virginica have larger petals.

## Browsing a large number of records with Chernoff faces

If we have a large number of records each comprised of a relative small number of numerical values we can use Chernoff faces to browse the data by taking small subsets of records.

Here is an example using "Vinho Verde" wine quality dataset [8].

## Chernoff faces for lists of long lists

In this section we consider data that is a list of lists. Each of the lists (or rows) is fairly long and represents values of the same variable or process. If the data is a full array, then we can say that in this section we deal with transposed versions of the data in the previous sections.

Since each row is a list of many elements visualizing the rows directly with Chernoff faces would mean using a small fraction of the data. A natural approach in those situations is to summarize each row with a set of descriptive statistics and use Chernoff faces for the row summaries.

The process is fairly straightforward; the rest of the section gives concrete code steps of executing it.

### Data generation

Here we create 12 rows of 200 elements by selecting a probability distribution for each row.

SeedRandom[1425]
{dists, data} = Transpose@Table[(
rdist =
RandomChoice[{NormalDistribution[RandomReal[10],
RandomReal[10]], PoissonDistribution[RandomReal[4]],
{rdist, RandomVariate[rdist, 200]}), {12}];
Dimensions[data]

(* {12, 200} *)

We have the following 12 "records" each with 200 "fields":

DistributionChart[data,
ChartLabels ->
MapIndexed[
Row[{Style[#2[[1]], Red, Larger],
"  ", Style[#1, Larger]}] &, dists],
ChartElementFunction ->
ChartElementData["PointDensity",
"ColorScheme" -> "SouthwestColors"], BarOrigin -> Left,
GridLines -> Automatic, ImageSize -> 1000]

Here is the summary of the records:

Grid[ArrayReshape[RecordsSummary[Transpose@N@data], {3, 4}],
Dividers -> All, Alignment -> {Left}]

### Data transformation

Here we "transform" each row into a vector of descriptive statistics:

statFuncs = {Mean, StandardDeviation, Kurtosis, Median,
QuartileDeviation, PearsonChiSquareTest};
sdata = Map[Through[statFuncs[#]] &, data];
Dimensions[sdata]

(* {12, 6} *)

Next we rescale the descriptive statistics data:

sdata = VariablesRescale[sdata,
"StandardizingFunction" -> (Standardize[#, Median,
QuartileDeviation] &)];

For kurtosis we have to do special rescaling if we want to utilize the property that Gaussian processes have kurtosis 3:

sdata[[All, 3]] =
Rescale[#, {3, Max[#]}, {0.5, 1}] &@Map[Kurtosis, N[data]];

Here is the summary of the columns of the rescaled descriptive statistics array:

Grid[{RecordsSummary[sdata, ToString /@ statFuncs]},
Dividers -> All]

### Visualization

First we define a function that computes and tabulates (descriptive) statistics over a record.

Clear[TipTable]
TipTable[vec_, statFuncs_, faceParts_] :=
Block[{},
GridTableForm[
Transpose@{faceParts, statFuncs,
NumberForm[Chop[#], 2] & /@ Through[statFuncs[vec]]},
TableHeadings -> {"FacePart", "Statistic", "Value"}]] /;
Length[statFuncs] == Length[faceParts];

To visualize the descriptive statistics of the records using Chernoff faces we have to select appropriate facial features.

faceParts = {"NoseLength", "EyeSize", "EyeSlant",
"EyesVerticalPosition", "FaceLength", "MouthSmile"};
TipTable[First@sdata, statFuncs, faceParts]

One possible visualization of all records is with the following commands. Note the addition of the parameter "FaceColor" to also represent how close a standardized row is to a sample from Normal Distribution.

{odFaceColor, ndFaceColor} = {White,  ColorData[7, "ColorList"][[8]]};
Grid[ArrayReshape[Flatten@#, {4, 3}, ""], Dividers -> All,
Alignment -> {Left, Top}] &@
chFace =
ChernoffFace[
Join[asc, <|
"FaceColor" -> Blend[{odFaceColor, ndFaceColor}, #2[[-1]]],
"IrisColor" -> GrayLevel[0.8],
"NoseColor" -> ndFaceColor|>], ImageSize -> 120,
AspectRatio -> Automatic];
tt = TipTable[N@#3, Join[statFuncs, {Last@statFuncs}],
Join[faceParts, {"FaceColor"}]];
Column[{Style[#1, Red],
Grid[{{Magnify[#4, 0.8],
Tooltip[chFace, tt]}, {Magnify[tt, 0.7], SpanFromAbove}},
Alignment -> {Left, Top}]}]) &
, {Range[Length[sdata]], sdata, data, dists}]

## Visualizing similarity with nearest neighbors or recommendations

### General idea

Assume the following scenario: 1. we have a set of items (movies, flowers, etc.), 2. we have picked one item, 3. we have computed the Nearest Neighbors (NNs) of that item, and 4. we want to visualize how much of a good fit the NNs are to the picked item.

Conceptually we can translate the phrase "how good the found NNs (or recommendations) are" to:

• "how similar the NNs are to the selected item", or

• "how different the NNs are to the selected item."

If we consider the picked item as the prototype of the most normal or central item then we can use Chernoff faces to visualize item’s NNs deviations.

Remark: Note that Chernoff faces provide similarity visualization most linked to Euclidean distance that to other distances.

### Concrete example

The code in this section demonstrates how to visualize nearest neighbors by Chernoff faces variations.

First we create a nearest neighbors finding function over the Fisher Iris data set (without the species class label):

irisNNFunc =
Nearest[irisDataSet[[All, 1 ;; -2]] -> Automatic,
DistanceFunction -> EuclideanDistance]

Here are nearest neighbors of some random row from the data.

itemInd = 67;
nnInds = irisNNFunc[irisDataSet[[itemInd, 1 ;; -2]], 20];

We can visualize the distances with of the obtained NNs with the prototype:

ListPlot[Map[
EuclideanDistance[#, irisDataSet[[itemInd, 1 ;; -2]]] &,
irisDataSet[[nnInds, 1 ;; -2]]]]

Next we subtract the prototype row from the NNs data rows, we standardize, and we rescale the interval $$[ 0, 3 \sigma ]$$ to $$[ 0.5, 1 ]$$:

snns = Transpose@Map[
Clip[Rescale[
Standardize[#, 0 &, StandardDeviation], {0, 3}, {0.5, 1}], {0,
1}] &,
Transpose@
Map[# - irisDataSet[[itemInd, 1 ;; -2]] &,
irisDataSet[[nnInds, 1 ;; -2]]]];

Here is how the original NNs data row look like:

GridTableForm[
Take[irisDataSet[[nnInds]], 12], TableHeadings -> irisColumnNames]

And here is how the rescaled NNs data rows look like:

GridTableForm[Take[snns, 12],
TableHeadings -> Most[irisColumnNames]]

Next we make Chernoff faces for the rescaled rows and present them in a easier to grasp way.

We use the face parts:

Take[Keys[ChernoffFace["FacePartsProperties"]], 4]

(* {"FaceLength", "ForheadShape", "EyesVerticalPosition", "EyeSize"} *)

To make the face comparison easier, the first face is the one of the prototype, each Chernoff face is drawn within the same rectangular frame, and the NNs indices are added on top of the faces.

chfaces =
ChernoffFace[#, Frame -> True,
PlotRange -> {{-1, 1}, {-2, 1.5}}, FrameTicks -> False,
ImageSize -> 100] & /@ snns;
chfaces =
ReplacePart[#1,
1 ->
Append[#1[[1]],
Text[Style[#2, Bold, Red], {0, 1.4}]]] &, {chfaces, nnInds}];
ImageCollage[chfaces, Background -> GrayLevel[0.95]]

We can see that the first few – i.e. closest — NNs have fairly normal looking faces.

Note that using a large number of NNs would change the rescaled values and in that way the first NNs would appear more similar.

## References

[1] Herman Chernoff (1973). "The Use of Faces to Represent Points in K-Dimensional Space Graphically" (PDF). Journal of the American Statistical Association (American Statistical Association) 68 (342): 361-368. doi:10.2307/2284077. JSTOR 2284077. URL: http://lya.fciencias.unam.mx/rfuentes/faces-chernoff.pdf .

[2] Christopher J. Morris; David S. Ebert; Penny L. Rheingans, "Experimental analysis of the effectiveness of features in Chernoff faces", Proc. SPIE 3905, 28th AIPR Workshop: 3D Visualization for Data Exploration and Decision Making, (5 May 2000); doi: 10.1117/12.384865. URL: http://www.research.ibm.com/people/c/cjmorris/publications/Chernoff_990402.pdf .

[3] Anton Antonov, Chernoff Faces implementation in Mathematica, (2016), source code at MathematicaForPrediction at GitHub, package ChernofFacess.m .

[4] Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m.

[5] Anton Antonov, Variable importance determination by classifiers implementation in Mathematica,(2015), source code at MathematicaForPrediction at GitHub, package VariableImportanceByClassifiers.m.

[6] Anton Antonov, "Importance of variables investigation guide", (2016), MathematicaForPrediction at GitHub, https://github.com/antononcube/MathematicaForPrediction, folder Documentation.

[7] Wikipedia entry, Iris flower data set, https://en.wikipedia.org/wiki/Iris_flower_data _set .

[8] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. URL https://archive.ics.uci.edu/ml/datasets/Wine+Quality .

# Importance of variables investigation

## Introduction

This blog post demonstrates a procedure for variable importance investigation of mixed categorical and numerical data.

The procedure was used in a previous blog “Classification and association rules for census income data”, [1]. It is implemented in the package VariableImportanceByClassifiers.m, [2], and described below; I took it from [5].

The document “Importance of variables investigation guide”, [3], has much more extensive descriptions, explanations, and code for importance of variables investigation using classifiers, Mosaic plots, Decision trees, Association rules, and Dimension reduction.

At community.wolfram.com I published the discussion opening of “Classifier agnostic procedure for finding the importance of variables” that has an exposition that parallels this blog post, but uses a different data set (“Mushroom”). (The discussion was/is also featured in “Staff picks”; it is easier to follow the Mathematica code in it.)

## Procedure outline

Here we describe the procedure used (that is also done in [3]).

1. Split the data into training and testing datasets.

2. Build a classifier with the training set.

3. Verify using the test set that good classification results are obtained. Find the baseline accuracy.

4. If the number of variables (attributes) is k for each i, 1ik:

4.1. Shuffle the values of the i-th column of the test data and find the classification success rates.

5. Compare the obtained k classification success rates between each other and with the success rates obtained by the un-shuffled test data.

The variables for which the classification success rates are the worst are the most decisive.

Note that instead of using the overall baseline accuracy we can make the comparison over the accuracies for selected, more important class labels. (See the examples below.)

The procedure is classifier agnostic. With certain classifiers, Naive Bayesian classifiers and Decision trees, the importance of variables can be directly concluded from their structure obtained after training.

The procedure can be enhanced by using dimension reduction before building the classifiers. (See [3] for an outline.)

## Implementation description

The implementation of the procedure is straightforward in Mathematica — see the package VariableImportanceByClassifiers.m, [2].

The package can be imported with the command:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/VariableImportanceByClassifiers.m"]

At this point the package has only one function, AccuracyByVariableShuffling, that takes as arguments a ClassifierFunction object, a dataset, optional variable names, and the option “FScoreLabels” that allows the use of accuracies over a custom list of class labels instead of overall baseline accuracy.

Here is the function signature:

AccuracyByVariableShuffling[ clFunc_ClassifierFunction, testData_, variableNames_:Automatic, opts:OptionsPattern[] ]

The returned result is an Association structure that contains the baseline accuracy and the accuracies corresponding to the shuffled versions of the dataset. I.e. steps 3 and 4 of the procedure are performed by AccuracyByVariableShuffling. Returning the result in the form Association[___] means we can treat the result as a list with named elements similar to the list structures in Lua and R.

For the examples in the next section we also going to use the package MosaicPlot.m, [4], that can be imported with the following command:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MosaicPlot.m"]

## Concrete application over the “Titanic” dataset

testSetName = "Titanic"; trainingSet = ExampleData[{"MachineLearning", testSetName}, "TrainingData"]; testSet = ExampleData[{"MachineLearning", testSetName}, "TestData"];

2. Variable names and unique class labels.

varNames = Flatten[List @@ ExampleData[{"MachineLearning", testSetName}, "VariableDescriptions"]]

(*Out[1055]= {"passenger class", "passenger age", "passenger sex", "passenger survival"} *)

classLabels = Union[ExampleData[{"MachineLearning", testSetName}, "Data"][[All, -1]]]

(*Out[1056]= {"died", "survived"} *)

3. Here is a data summary.

Grid[List@RecordsSummary[(Flatten /@ (List @@@ Join[trainingSet, testSet])) /. _Missing -> 0, varNames], Dividers -> All, Alignment -> {Left, Top}]

4. Make the classifier.

clFunc = Classify[trainingSet, Method -> "RandomForest"]

(*Out[1010]= ClassifierFunction[\[Ellipsis]] *)

5. Obtain accuracies after shuffling.

accs = AccuracyByVariableShuffling[clFunc, testSet, varNames]

(*Out[1011]= <|None -> 0.78117, "passenger class" -> 0.704835, "passenger age" -> 0.768448, "passenger sex" -> 0.610687|>*)

6. Tabulate the results.

Grid[ Prepend[ List @@@ Normal[accs/First[accs]], Style[#, Bold, Blue, FontFamily -> "Times"] & /@ {"shuffled variable", "accuracy ratio"}], Alignment -> Left, Dividers -> All]

7. Further confirmation of the found variable importance can be done using the mosaic plots.
We can see that female passengers are much more likely to survive and especially female passengers from first and second class.

t = (Flatten /@ (List @@@ trainingSet)); MosaicPlot[t[[All, {1, 3, 4}]], ColorRules -> {3 -> ColorData[7, "ColorList"]} ]

5a. In order to use F-scores instead of overall accuracy the desired class labels are specified with the option “FScoreLabels”.

accs = AccuracyByVariableShuffling[clFunc, testSet, varNames, "FScoreLabels" -> classLabels]

(*Out[1101]= <| None -> {0.838346, 0.661417}, "passenger class" -> {0.79927, 0.537815}, "passenger age" -> {0.828996, 0.629032}, "passenger sex" -> {0.703499, 0.337449}|>*)

5b. Here is another example that uses the class label with the smallest F-score.
(Probably the most important since it is most mis-classified).

accs = AccuracyByVariableShuffling[clFunc, testSet, varNames, "FScoreLabels" -> Position[#, Min[#]][[1, 1, 1]] &@ ClassifierMeasurements[clFunc, testSet, "FScore"]]

(*Out[1102]= {0.661417}, "passenger class" -> {0.520325}, "passenger age" -> {0.623482}, "passenger sex" -> {0.367521}|>*)

5c. It is good idea to verify that we get the same results using different classifiers. Below is given code that computes the shuffled accuracies and returns the relative damage scores for a set of methods of Classify.

mres = Association@Map[ Function[{clMethod}, cf = Classify[titanicTrainingData, Method -> clMethod]; accRes = AccuracyByVariableShuffling[cf, testSet, varNames, "FScoreLabels" -> "survived"]; clMethod -> (accRes[None] - Rest[accRes])/accRes[None] ], {"LogisticRegression", "NearestNeighbors", "NeuralNetwork", "RandomForest", "SupportVectorMachine"}] ; Dataset[mres]`

## References

[1] Anton Antonov, “Classification and association rules for census income data”, (2014), MathematicaForPrediction at WordPress.com.

[2] Anton Antonov, Variable importance determination by classifiers implementation in Mathematica, (2015), source code at MathematicaForPrediction at GitHub, package VariableImportanceByClassifiers.m.

[3] Anton Antonov, “Importance of variables investigation guide”, (2016), MathematicaForPrediction at GitHub, folder Documentation.

[4] Anton Antonov, Mosaic plot for data visualization implementation in Mathematica, (2014), MathematicaForPrediction at GitHub, package MosaicPlot.m.

[5] Leo Breiman et al., Classification and regression trees, Chapman & Hall, 1984, ISBN-13: 978-0412048418.