Directional quantile envelopes in 3D

Introduction

This blog post was mostly made as a response to a comment of my previous blog post “Directional quantile envelopes”, [1]:

This looks extremely useful and elegant – any plans on generalizing to 3 (or more) dimensions?
–Jan Liphardt

Since I did say in the previous blog post that the algorithm can be easily extended to 3D and higher dimensions, I decided to illustrate that with an example. The 3D implementation is really easy using Mathematica’s geometric computation functions (i) for derivation of rotation matrices and (ii) for regions manipulation. (The latter are new in version 10.)

The algorithm shown below is a slightly modified version of the algorithm in “Directional quantile envelopes”. Instead of the step that does the intersection of directional quantile lines with Reduce, we can use ImplicitRegion to specify the region enclosed by the directional quantile planes. We can use other built-in functions for region predicates and manipulation to derive, visualize, or utilize the directional quantile regions found by the algorithm. (All these are exemplified below.)

Data generation

Let us first generate some data.

npoints = 10000;
data = {RandomReal[{0, 2 Pi}, npoints],
RandomReal[{-Pi, Pi}, npoints],
RandomVariate[PoissonDistribution[4.8], npoints]};
data = MapThread[#3*{Cos[#1] Cos[#2], Sin[#1] Cos[#2], Sin[#2]} &, data];

Let us plot the data

Block[{qs = 12},
qs = Map[Quantile[#, Range[0, 1, 1/(qs - 1)]] &, Transpose[data]];
ListPointPlot3D[data, PlotRange -> All, PlotTheme -> "Detailed",
FaceGrids -> {{{0, 0, -1}, Most[qs]}, {{0, 1, 0},
qs[[{1, 3}]]}, {{-1, 0, 0}, Rest[qs]}}, ImageSize -> 700]
]

CloudOfRandomPointsByPoissonDistribution3D

On the plot the grid lines are derived from the quantiles of x, y and z coordinates of the data set.

Algorithm application step-by-step

1. Standardize the data
1.1. This step is generally not needed and in this implementation would complicate things.

2. Create a set of uniform angles. (Here we generate vectors that define 3D directions.)

nqs = 20;
dirs = N@Flatten[
Table[{Cos[\[Theta]] Cos[\[Phi]], Sin[\[Theta]] Cos[\[Phi]],
Sin[\[Phi]]}, {\[Theta], 2 \[Pi]/(10 nqs), 2 \[Pi],
2 \[Pi]/nqs}, {\[Phi], -\[Pi], \[Pi], 2 \[Pi]/nqs}], 1];

Graphics3D[Map[Arrow[{{0, 0, 0}, #}] &, dirs]]

QuantileEnvelopeDirections3D

3. Rotate the data for each direction vector, a ∈ A. (Here we just generate the rotation matrices.)

In[672]:= rmats = RotationMatrix[{{1, 0, 0}, #}] & /@ dirs;
Length[rmats]

Out[673]= 420

4. Find the q-quantile of the z-coordinates of the data rotated at direction a. Denote that point with Z[a,q].

In[674]:= qs = {0.95};
qDirPoints =
Flatten[Map[Function[{m}, Quantile[(m.Transpose[data])[[3]], qs]], rmats]];

5. Find the points z[a,q] corresponding to Z[a,q] in the original coordinate system.
5.1. In this implementation this is step is redundant, see the next step.

6. Find the quantile planes intersections
6.1. Instead of doing this step we simply use ImplicitRegion.

In[676]:= qRegion = ImplicitRegion[
MapThread[(#1.{x, y, z})[[3]] <= #2 &, {rmats, qDirPoints}], {x, y, z}];

In[677]:= Shallow[qRegion]

Out[677]//Shallow= ImplicitRegion[
LessEqual[<>] && LessEqual[<>] && LessEqual[<>] &&
LessEqual[<>] && LessEqual[<>] && LessEqual[<>] &&
LessEqual[<>] && LessEqual[<>] && LessEqual[<>] &&
LessEqual[<>] && <>, {x, y, z}]

Wrapping it in a function

Here is a definition of a function that wraps all of the algorithms steps in the previous section.


Clear[QuantileEnvelopeRegion]
QuantileEnvelopeRegion[points_?MatrixQ, quantile_?NumberQ, numberOfDirections_Integer] :=
Block[{nd = numberOfDirections, dirs, rmats, qDirPoints, qRegion},
dirs =
N@Flatten[
Table[{Cos[\[Theta]] Cos[\[Phi]], Sin[\[Theta]] Cos[\[Phi]],
Sin[\[Phi]]}, {\[Theta], 2 \[Pi]/(10 nd), 2 \[Pi],
2 \[Pi]/nd}, {\[Phi], -\[Pi], \[Pi], 2 \[Pi]/nd}], 1];
rmats = RotationMatrix[{{1, 0, 0}, #}] & /@ dirs;
qDirPoints =
Flatten[Map[
Function[{m}, Quantile[(m.Transpose[points])[[3]], quantile]],
rmats]];
qRegion =
ImplicitRegion[
MapThread[(#1.{x, y, z})[[3]] <= #2 &, {rmats, qDirPoints}], {x, y,
z}];
qRegion
] /; Dimensions[points][[2]] == 3 && 0 < quantile <= 1;

Visualizing

From the implicit region specification we can generate an approximation of the boundary surface using the built-in function BoundaryDiscretizeRegion.

qRegion = QuantileEnvelopeRegion[data, 0.95, 20];
qRegion2 = QuantileEnvelopeRegion[data, 0.8, 20];

qSurface = BoundaryDiscretizeRegion[qRegion];
qSurface2 = BoundaryDiscretizeRegion[qRegion2];

Grid[{{qSurface, qSurface2}}]
QuantileRegionSurfacesAt95and80

Now we can visualize the quantile surface together with the data points:

Block[{c = 3, opts, pntsgr},
opts = {PlotRange -> {{-c, c}, {0, c}, {-c, c}}, Boxed -> True, Axes -> True,
ImageSize -> {Automatic, 500}};
pntsgr = Graphics3D[Point[data]];
Grid[{{Show[{qSurface, pntsgr}, opts], Show[{qSurface2, pntsgr}, opts]}}]
]

DataPointsAndHalfQuantileRegionSurfacesAt95and80

Check

Now we can answer the question how many data points are inside the quantile directions regions. Again, this is easy with the built-in region functionalities of Mathematica version 10.

In[656]:= inPoints = Select[data, # \[Element] qRegion &];
Length[inPoints]/Length[data] // N

Out[657]= 0.5974

In[700]:= inPoints = Select[data, # \[Element] qRegion2 &];
Length[inPoints]/Length[data] // N

Out[701]= 0.1705

(In these code lines I kept “[\Element]” instead of replacing it with “∈” in order the lines to be copy and pasted. )
As we can see the algorithm would produce a region Subscript[R, q] which contains much a smaller number of points than the requested quantile, q. This property of the algorithm is discussed in [2]. (In the two dimensional space this property is less pronounced.)

Obviously, though, we would expect the bias of the algorithm to be monotonic with respect to the quantiles requested. Given a set of points P and quantiles u and v, 0<u<v<=1, for the respectively produced by the algorithm regions Ru and Rv we have |P ∩ Ru| < |P ∩ Rv| .

References

[1] Anton Antonov, Directional quantile envelopes, “Mathematica for prediction algorithms” blog at WordPress.com, 11/3/2014 .

[2] Linglong Kong, Ivan Mizera, “Quantile tomography: using quantiles with multivariate data”, 2013, arXiv:0805.0056v2 [stat.ME] URL: http://arxiv.org/abs/0805.0056 .

Directional quantile envelopes

Introduction

In this blog post I am going to discuss the computation of the so called directional quantile envelopes; see [5] for definitions, theorems, and concrete examples. The Mathematica package QuantileRegression.m, [1] has an experimental implementation, the function QuantileEnvelope.

This type of implementation investigation is a natural extension of the quantile regression implementations and utilization I have done before. (See the package [1] and the related guide [2]; blog posts [3,4].) I was looking for a 2D quantile regression technique that can be used as a tool for finding 2D outliers. From the exposition below it should be obvious that the technique can be generalized in higher dimensions.

The idea of directional quantile envelopes is conceptually simple (and straightforward to come up with). The calculation is also relatively simple: over a set of uniformly distributed directions we find the lines that separate the data according a quantile parameter q, 0<q<1, and with those lines we approximate the enveloping curve for data that corresponds to q.

The algorithm

Here is the full algorithm. (The figure below can be used as a visual aid).

Parameters: data as a matrix of two columns, a quantile parameter q, 0<q<1.
1. Standardize the data.
1.1. Choose a center of the data (e.g. means of the coordinates) and a scale (e.g. inter-quartile distances).
1.2. Transpose the data to the center and scale the obtained coordinates.
2. Create a set if uniform angles, A:={i*2*Pi/k}, 0<=i<=k-1 .
2.1. Denote with a[i]:=i*2*Pi/k, i∈[0,k-1].
3. Rotate the data for each angle, a[i]∈A.
4. Find the q-quantile of the y-coordinates of the data rotated at angle a. Denote that point with r[a,q].
5. Find the points y[a,q] corresponding to r[a,q] in the original coordinate system.
6. Find the quantile lines intersections
6.1. For each angle a∈A we denote with n[a] the vector corresponding to a.
6.2. For each a∈A we have a line l[a,q] with normal n[a] and passing through y[a,q].
6.3. Find the intersection points P[q] of the lines l[a,q].
6.3.1. Each point p[j]∈P[q], j∈[1,k] is an intersection of lines at consecutive angles.
6.3.2. In other words p[j]=l[a[j-1],q] ∩ l[a[j],q], j∈[1,k].
7. Apply the reverse of the transformation done at step 3 to P[q].
8. Connecting the consecutive points of P[q] gives the directional quantile envelope approximation.

Click this thumbnail for an image with better math notation:
TheDirectionalQuantileEnvelopeAlgorithm

Algorithm visual aid

Here is an example of an application of the described algorithm. The points in blue are the data points, the points in red are the points y[a,q], the red lines are the lines l[a,q], the points in black are the points P.
DirectionalQuantileEnvelopeAlgorithmAid

Remarks

Remark 1: The function QuantileEnvelope of [1] follows the described algorithm. The rotations are done using matrix multiplication; the function Reduce is used to find the intersections between the lines.

Remark 2: A possible alternative algorithm is to use “the usual” quantile regression with polar coordinates. (The article [5] refers to and briefly discusses such an approach.)

Example data

In order to demonstrate the function QuantileEnvelope let us define a couple of data sets.

The first data set is generated with the function SkewNormalDistribution, so the data set name is sndData.

npoints = 12000;
data = {RandomReal[{0, 2 \[Pi]}, npoints],
RandomVariate[SkewNormalDistribution[0.6, 0.3, 4], npoints]};
data = MapThread[#2*{Cos[#1], Sin[#1]} &, data];
rmat = RotationMatrix[-\[Pi]/2.5].DiagonalMatrix[{2, 1}];
data = Transpose[rmat.Transpose[data]];
data = TranslationTransform[{-Norm[Abs[#]/3], 0}][#] & /@ data;
sndData = Standardize[data];

Here is a plot of sndData with grid lines derived from the quantiles of x and y coordinates of the data set.

ListPlot[sndData, PlotRange -> All, AspectRatio -> 1, PlotTheme -> "Detailed",
GridLines -> Map[Quantile[#, Range[0, 1, 1/19]] &, Transpose[sndData]],
ImageSize -> 700]

SkewNormalDistributionRingData

The second uses the function PoissonDistribution, so the data set name is pdData.

npoints = 12000;
data = RandomReal[NormalDistribution[12, 5.6], npoints];
data = Transpose[{data,
data + RandomVariate[PoissonDistribution[12], npoints] +
RandomReal[{-0.2, 0.2}, npoints]}];
data = Select[data, #[[2]] > 0 &];
data[[All, 2]] = Log[data[[All, 2]]];
pdData = data;

Here is a plot of pdData with grid lines derived from the quantiles of x and y coordinates of the data set.

ListPlot[pdData, PlotRange -> All, AspectRatio -> 1, PlotTheme -> "Detailed",
GridLines -> Map[Quantile[#, Range[0, 1, 1/19]] &, Transpose[pdData]],
ImageSize -> 700]

NormalVsLogPoissonDistributionData

Examples of application (with QuantileEnvelope)

Let us demonstrate the function QuantileEnvelope. The function arguments are a data matrix, a quantile specification, and number of directions. The quantile specification can a number or a list of numbers. The function returns the intersection points of the quantile lines at different directions.

Here is an application to the dataset sndData over 8 different quantiles over 60 different uniformly spread directions.

In[630]:= qs = {0.7, 0.75, 0.8, 0.85, 0.90, 0.95, 0.98, .99} ;
AbsoluteTiming[qsPoints = QuantileEnvelope[sndData, qs, 60];]
Out[631]= {0.150104, Null}

The result is a list of 8 lists of 60 points.

In[632]:= Dimensions[qsPoints]
Out[632]= {8, 60, 2}

Let us plot the data and the direction quantile envelopes.

Block[{data = sndData},
Show[{ListPlot[data, AspectRatio -> 1, PlotTheme -> {"Detailed"},
GridLines ->
Map[Quantile[#, Range[0, 1, 1/(20 - 1)]] &, Transpose[data]],
PlotLegends ->
SwatchLegend[
Blend[{Red, Orange},
Rescale[#1, {Min[qs], Max[qs]}, {0, 1}]] & /@ qs, qs]],
Graphics[{PointSize[0.005], Thickness[0.0025],
MapThread[{Blend[{Red, Orange},
Rescale[#1, {Min[qs], Max[qs]}, {0, 1}]],
Tooltip[Line[Append[#2, #2[[1]]]], #1], Point[#2]} &, {qs,
qsPoints}]}]
}, ImageSize -> 700]]

SkewNormalDistributionRingDataWithDirectionalQuantileEnvelopesLegended

Let us similar calculations for the dataset pdData.

In[647]:= qs = {0.7, 0.75, 0.8, 0.85, 0.90, 0.95, 0.98, 0.99} ;
AbsoluteTiming[qsPoints = QuantileEnvelope[pdData, qs, 60];]
Out[648]= {0.154227, Null}

In[649]:= Dimensions[qsPoints]
Out[649]= {8, 60, 2}

Here is the plot of the dataset pdData and directional quantile envelopes.
NormalVsLogPoissonDistributionDataWithDirectionalQuantileEnvelopesLegended

Some more quantile curves

Here are couple of more plots of the quantile lines at different directions with a larger number of directions.

Lines for quantile 0.95 in 40 directions:
40DirectionalQuantileLinesFor95

Lines for quantiles 0.95 and 0.8 in 40 directions:
40DirectionalQuantileLinesFor95And80

References

[1] Anton Antonov, Quantile regression Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package QuantileRegression.m, (2013).

[2] Anton Antonov, Quantile regression through linear programming, usage guide at GitHub, https://github.com/antononcube/MathematicaForPrediction, in the directory “Documentation”, (2013).

[3] Anton Antonov, Quantile regression through linear programming, “Mathematica for prediction algorithms” blog at WordPress.com, 12/16/2013.

[4] Anton Antonov, Quantile regression with B-splines, “Mathematica for prediction algorithms” blog at WordPress.com, 1/1/2014.

[5] Linglong Kong, Ivan Mizera, “Quantile tomography: using quantiles with multivariate data”, 2013, arXiv:0805.0056v2 [stat.ME] URL: http://arxiv.org/abs/0805.0056 .

Removal of sub-trees in tries

I recently updated the package TriesWithFrequencies.m with two new functions TrieRemoval and TrieNodeFrequencies. The function TrieNodeFrequencies reverses the outcomes of TrieNodeProbabilities; it was made to facilitate the implementation and usage of TrieRemoval.

The removal of a sub-tree from a trie is useful when we want to alternate the learning stored in the trie.

The new functions are illustrated with examples below. (These examples reuse and extend the ones in the blog post “Tries with frequencies for data mining”.)
Consider the trie made with the list of words

{"war", "ward", "warden", "work", "car", "cars", "carbs"}

Trie

 

The function TrieNodeFrequencies can be used to reverse the effect of TrieNodeProbabilities. Its first argument is a trie, the second one is a scaling parameter (which is 1 by default). Here is an example:

TrieForm[TrieNodeFrequencies[TrieNodeProbabilities[mytrie], 14]]
TrieNodeFrequenciesTrieNodeProbabilities14

The function TrieRemove can be used to remove sub-parts of a trie that correspond to a specified “word”. Here is an example using a trie with frequencies:

TrieForm[TrieRemove[mytrie, "car"]]

TrieRemoveOfcar

Here is an example using a trie with probabilities:

TrieForm[TrieRemove[TrieNodeProbabilities[mytrie], "car"]]

TrieRemoveOfTrieNodeProbabilitiesOfcar

TrieRemove has two options “HowManyTimes” and “FrequencyValues” with default values Automatic. The option “HowManyTimes” takes a number or Automatic and it is still under development (its functionality is partially correct). The option “FrequencyValues” takes True | False | Automatic and it is used for the interpretation of the sub-tree removal. If the values are frequencies the removal process is more straightforward. If the values are probabilities the removal process has to take into account that the values on the same level of the sub-tree are conditional probabilities that sum to 1. TrieRemove has two different implementations one for tries with frequencies and one for tries with probabilities. The option setting "FrequencyValues"->True|False is used to redirect to the correct implementation. When "FrequencyValues"->Automatic is given a simple heuristic is used to determine is the trie argument a trie with frequencies or a trie with probabilities.

Find Fit for Non-linear data

Three weeks ago I replied to a question in the Wolfram Community site about finding a function fit to points of non-linear data.

Here is the data:

Non-linear data

I proposed a different way of doing the non-linear fitting using Quantile regression with B-splines. The advantages of the approach are that it does not need guessing of the fit functions and it requires less experimentation.

Load the package QuantileRegression.m:

Needs["QuantileRegression`"]

After several experiments with the number of knots we can find a regression quantile using 3d order B-splines of 18 knots that approximates the data well.

qfunc = Simplify[QuantileRegression[data, 18, {0.5}, InterpolationOrder -> 2][[1]]];

qfGr = Plot[qfunc[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}, PlotPoints -> 700];
Show[{dGr, qfGr}, Frame -> True, ImageSize -> 700]

NonLinearFitWithBSplineQuantileRegression-18RegularKnots

Here is the relative error estimate:

ListPlot[Map[{#[[1]], (qfunc[#1[[1]]] - #1[[2]])/#1[[2]]} &, data],
Filling -> Axis, Frame -> True, ImageSize -> 700]

RelativeErrorNonLinearFitWithBSplineQuantileRegression-18RegularKnots

I would like to point out that if we put large number of knots we can get aliasing errors effect:

qfunc = Simplify[
QuantileRegression[data, 40, {0.5}, InterpolationOrder -> 2][[1]]];

qfGr = Plot[qfunc[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}, PlotPoints -> 700];
Show[{dGr, qfGr}, Frame -> True, ImageSize -> 700]

NonLinearFitWithBSplineQuantileRegression-40RegularKnots

If we select a non-uniform grid of knots according to gradients of the data we get a good approximation with one attempt.


In[107]:= knots =
Join[
Rescale[Range[0, 1, 0.2], {0, 1}, {Min[data[[All, 1]]],
Max[data[[All, 1]]] 3/4}],
Rescale[Range[0, 1, 0.1], {0, 1}, {Max[data[[All, 1]]] 3/4,
Max[data[[All, 1]]]}]
]

Out[107]= {0., 3.75, 7.5, 11.25, 15., 18.75, 18.75, 19.375, 20., 20.625, 21.25, 21.875, 22.5, 23.125, 23.75, 24.375, 25.}

In[108]:= qfunc = QuantileRegression[data, knots, {0.5}][[1]];

The grid lines in the plot below are made with the knots.

qfGr = Plot[qfunc[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}];
Show[{dGr, qfGr}, GridLines -> {knots, None},
GridLinesStyle -> Directive[GrayLevel[0.8], Dashed], Frame -> True,
ImageSize -> 700]

NonLinearFitWithBSplineQuantileRegression-17NonRegularKnots

Here are the relative errors:

ListPlot[Map[{#[[1]], (qfunc[#1[[1]]] - #1[[2]])/#1[[2]]} &, data],
Filling -> Axis, Frame -> True, ImageSize -> 700]

RelativeErrorNonLinearFitWithBSplineQuantileRegression-17NonRegularKnots

Here is the approximation function after using PiecewiseExpand and Simplify:

qfunc = Evaluate[Simplify[PiecewiseExpand[qfunc[[1]]]]] &;
qfunc

NonLinearFitWithBSplineQuantileRegression-17NonRegularKnotsFunction

Follow-up

If it is desired to make the fitting with a certain, known model function then the function QuantileRegressionFit can be used. QuantileRegressionFit is provided by the same package, QuantileRegression.m, used in the blog post.

This older blog post of mine discusses the use of QuantileRegressionFit and points to a PDF with further details (theory and Mathematica commands).

Classification and association rules for census income data

Introduction

In this blog post I am going to show (some) analysis of census income data — the so called “Adult” data set, [1] — using three types of algorithms: decision tree classification, naive Bayesian classification, and association rules learning. Mathematica packages for all three algorithms can be found at the project MathematicaForPrediction hosted at GitHub, [2,3,4].

(The census income data set is also used in the description of the R package “arules”, [7].)

In the census data every record represents a person with 14 attributes, the last element of a record is one of the labels {“>=50K”,”<50K"}. The relationships between the categorical variables in that data set was described in my previous blog post, "Mosaic plots for data visualization".

For this data the questions I am most interested in are:
Question 1: Which of the variables (age, occupation, sex, etc.) are most decisive for determining the income of a person?
Question 2: Which values for which variables form conditions that would imply high income or low income? (I.e. ">50K" or "<=50K".)
Question 3: What conclusions or confirmations we can get from answering the previous two questions?

One way to answer Question 1 is to use following steps, [8].
1. Build a classifier with the training set.
2. Verify using the test set that good classification results are obtained.
3. If the number of variables (attributes) is k for each i, 1<=i<=k :
3.1. Shuffle the values of the i-th column of the test data and find the classification success rates.
4. Compare the obtained k classification success rates between each other and with the success rates obtained by the un-shuffled test data.
5. The variables for which the classification success rates are the worst are the most decisive.

Following these steps with a decision tree classifier, [2], I found that “marital-status” and “education-num” (years of education) are most decisive to give good prediction for the “>50K” label. Using a naive Bayesian classifier, [3], the most significant variables are “marital-status” and “relationship”. (More details are given in the sections “Application of decision trees” and “Application of naive Bayesian classifier”.)

One way to answer Question 2 is to find which values of the variables (e.g. “Wife”, “Peru”, “HS-grad”, “Exec-managerial”) associate most frequently with “>50K” and “<=50K" respectively and apply different Bayesian probability statistics on them. This is what the application of Associative rules learning gives, [9]. Another way is to use mosaic plots, [5,9], and prefix trees (also known as "tries") [6,11,12].

In order to apply Association rule learning we need to make the numerical variables categorical — we need to partition them into non-overlapping intervals. (This derived, "all categorical" data is also amenable to be input data for mosaic plots and prefix trees.)

Data set

The data set can be found and taken from http://archive.ics.uci.edu/ml/datasets/Census+Income, [1].

(This data set was also used in my previous blog post “Mosaic plots for data visualization”.)

The description of the data set is given in the file “adult.names” of the data folder. The data folder provides two sets with the same type of data “adult.data” and “adult.test”; the former is used for training, the latter for testing.

The total number of records in the file “adult.data” is 32561; the total number of records in the file “adult.test” is 16281.

Here is how the data looks like:
Adult census income data sample

Since I did not understand the meaning of the column “fnlwgt” I dropped it from the data.

Here is a summary of the data:
Adult census income data summary

As it was mentioned in the introduction, only 24% of the labels are “>50K”. Also note that 2/3 of the records are for males.

Scatter plots and mosaic plots

Often scatter plots and mosaic plots can give a good idea of the general patterns that hold in the data. This sub-section has a couple of examples, but presenting extensive plots is beyond the scope of this blog post. Let me point out that it might be very beneficial to use these kind of plots with Mathematica‘s dynamic features (like Manipulate and Tooltip), or make a grid of mosaic plots.

Mosaic plots of the categorical variables of the data can be seen in my previous blog post “Mosaic plots for data visualization”.

Here is a table of the histograms for “age”, “education-num”, and “hours-per-week”:
adult-data-scatter-plots-age-education-num-hours-per-week

Here is a table with scatter plots for all numerical variables of the data:
adult-data-scatter-plots-age-education-num-capital-gain-capital-loss-hours-per-week

Application of decision trees

The building and classification with decision trees is straightforward. Since the label “>50K” is only a quarter of the records I consider the classification success rates for “>50K” to be more important.

adult-data-Decision-tree-classification-success-rates

I experimented with several sets of parameters for decision tree building. I did not get a classification success rate for “>50K” better than 0.644 . Using pruning based on the Minimal Description Length (MDL) principle did not give better results. (I have to say I find MDL pruning to be an elegant idea, but I am not convinced that it works that
well. I believe decision tree pruning based on test data would produce much better results. Only the MDL decision tree pruning is implemented in [2].)

The overall classification success rate is in line with the classification success ratios listed in explanation of the data set; see the file “adult.names” in [1].

Here is a table with the results of the column shuffling experiments described in the introduction (in red is the name of the data column shuffled):
adult-data-Decision-tree-classification-shuffled-success-rates-table

Here is a plot of the “>50K” success rates from the table above:
adult-data-Decision-tree-classification-shuffled-success-rates-plot

We can see from the table and the plot that variables “marital-status”, “education-num”, “capital-gain”, “age”, and “occupation” are very decisive when it comes to determining high income. The variable “marital-status” is significantly more decisive than the others.

While considering the decisiveness of the variable “marital-status” we can bring the following questions:
1. Do people find higher paying jobs after they get married?
2. Are people with high paying jobs more likely to marry and stay married?

Both questions are probably answered with “Yes” and probably that is why “marital-status” is so decisive. It is hard to give quantified answers to these questions just using decision trees on this data — we would need to know the salary and marital status history of the individuals (different data) or to be able to imply it (different algorithm).

We can see the decisiveness of “age”, “education-num”, “occupation”, and “hours-per-week” as natural. Of course one is expected to receive a higher pay if he has studied longer, has a high paying occupation, is older (more experienced), and works more hours per week. Note that this statement explicitly states the direction of the correlation: we do assume that longer years of study bring higher pay. It is certainly a good idea to consider the alternative direction of the correlation, that people first get high paying jobs and that these high paying jobs allow them to get older and study longer.

Application of naive Bayesian classifiers

The naive Bayesian classifier, [3], produced better classification results than the decision trees for the label “>50K”:
adult-data-NBC-classification-success-rates

Here is a table with the results of the column shuffling experiments described in the introduction (in red is the name of the data column shuffled):
adult-data-NBC-classification-shuffled-success-rates-table

Here is a plot of the “>50K” success rates from the table above:
adult-data-NBC-classification-shuffled-success-rates-plot

In comparison with the decision tree importance of variables experiments we can notice that:
1. “marital-status” is very decisive and it is the second most decisive variable;
2. the most decisive variable is “relationship” but it correlates with “marital-status”;
3. “age”, “occupation”, “hours-per-week”, “capital-gain”, and “sex” are decisive.

Shuffled classification rates plots comparison

Here are the two shuffled classification rates plots stacked together for easier comparison:
adult-data-Decision-tree-and-NBC-classification-shuffled-success-rates-plots

Data modification

In order to apply the association rules finding algorithm Apriori, [4], the data set have to be modified. The modification is to change the numerical variables “age”, “education-num”, and “age” into categorical. I just partitioned them into non-overlapping intervals, labeled the intervals, and assigned the labels according the variable values. Here is the summary of the modified data for just these variables:
adault-data-numerical-to-categorical-columns-summary

Finding association rules

Using the modified data I found a large number of association rules with the Apriori algorithm, [4]. I used the measure called “confidence” to extract the most significant rules. The confidence of an association rule AC with antecedent A and consequent C is defined to be the ratio P(AC)/P(C). The higher the ratio the more confidence we have in the rule. (If the ratio is 1 we have a logical rule, CA.)

Here is a table showing the rules with highest confidence for the consequent being “>50K”:
adult-data-association-rules-more-than-50K

From the table we can see for example that 2.1% of the data records (or 693 records) show that for a married man who has studied 14 years and originally from USA there is a 0.79 probability that he earns more than $50000.

Here is a table showing the rules with highest confidence for the consequent being “<=50K":
adult-data-association-rules-less-than-50K

The association rules in these tables confirm the findings with the classifiers: marital status, age, and education are good predictors of income labels “>50K” and “<=50K".

Conclusion

The analysis confirmed (and quantified) what is considered common sense:

Age, education, occupation, and marital status (or relationship kind) are good for predicting income (above a certain threshold).

Using the association rules we see for example that
(1) if a person earns more than $50000 he is very likely to be a married man with large number of years of education;
(2) single parents, younger than 25 years, who studied less than 10 years, and were never-married make less than $50000.

References

[1] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Census Income Data Set, URL: http://archive.ics.uci.edu/ml/datasets/Census+Income .

[2] Antonov, A., Decision tree and random forest implementations in Mathematica, source code at https://github.com/antononcube/MathematicaForPrediction, package AVCDecisionTreeForest.m, (2013).

[3] Antonov, A., Implementation of naive Bayesian classifier generation in Mathematica, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package NaiveBayesianClassifier.m, (2013).

[4] Antonov, A., Implementation of the Apriori algorithm in Mathematica, source code at https://github.com/antononcube/MathematicaForPrediction, package AprioriAlgorithm.m, (2013).

[5] Antonov, A., Mosaic plot for data visualization implementation in Mathematica, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package MosaicPlot.m, (2014).

[6] Antonov, A., Tries with frequencies Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package TriesWithFrequencies.m, (2013).

[7] Hahsler, M. et al., Introduction to arules \[Dash] A computational environment for mining association rules and frequent item sets, (2012).

[8] Breiman, L. et al., Classification and regression trees, Chapman & Hall, 1984.

[9] Wikipedia, Association rules learning, http://en.wikipedia.org/wiki/Association_rule_learning .

[10] Antonov, A., Mosaic plots for data visualization, (March, 2014), URL: https://github.com/antononcube/MathematicaForPrediction/blob/master/Documentation/Mosaic%20plots%20for%20data%20visualization.pdf .

[11] Wikipedia, Trie, http://en.wikipedia.org/wiki/Trie .

[12] Antonov, A., Tries, (December, 2013), URL: https://github.com/antononcube/MathematicaForPrediction/blob/master/Documentation/Tries.pdf .

Enhancements of MosaicPlot

I made the following enhancements of the function MosaicPlot which I described (and proclaimed the implementation of) in my previous blog post:

1. Tooltips with precise contingency statistics.
2. If the last data column is numerical then MosaicPlot can use it as pre-computed contingency statistics.
3. Coloring of the rectangles according to a list of index->color rules.

The document “Mosaic plots for data visualization” hosted at MathematicaForPrediction at GitHub, combines the information of this blog post and previous one. The document also has Mathematica code examples of usage and description of MosaicPlot‘s options.

Tooltips with precise contingency statistics

I already proclaimed in my previous blog post the tooltips functionality — when hovering with the mouse over the rectangles then MosaicPlot, using Tooltip, gives a table with the exact co-occurrence (contingency) values. Here is an example:
Adult census income data sex-education-income colored mosaic plot with tooltips

Visualizing categorical columns + a numerical column

If the last data column is numerical then MosaicPlot can use it as pre-computed contingency statistics. This functionality is specified with the option “ExpandLastColumn”->True.

In order to explain the functionality we are going to use following interpretation. If the last of column of the data is numerical then we can treat the data as a contracted version of a longer list of records made only of the categorical columns. For example, consider the following table with observations of people’s hair and eyes color:
Hair and eyes color number of observations

The table above can be considered as a contracted version of this table:
Hair and eyes color observations

Setting the option “ExpandLastColumn” to True gives a mosaic plot corresponding to that latter, observations-expanded table:
Hair and eyes color mosaic plot

The last data column (which is numerical) does not need to be made of integers:
Hair and eyes color mosaic plot Mathematica code

Rectangle coloring

The rectangles can be colored using the option ColorRules which specifies how the colors of the rectangles are determined from the indices of the data columns.

More precisely, the values of the option ColorRules should be a list of rules, {i1->c1,i2->c2,…}, matching the form

{(_Integer->(_RGBColor|_GrayLevel))..} .

If coloring for only one column index is specified the value of ColorRules can be of the form

{_Integer->{(_RGBColor|_GrayLevel)..}} .

The colors are used with Blend in order to color the rectangles according to the order of the unique values of the specified data columns.

The default value for ColorRules is Automatic. When Automatic is given to ColorRules, MosaicPlot finds the data column with the largest number of unique values and colors them according to their order using ColorData[7,"ColorList"].

The grid of plots below shows mosaic plots of the same data with different values for the option ColorRules (given as plot labels).
Grid of mosaic plots for ColorRules values

Mosaic plots for data visualization

Introduction

This blog post has description and examples of using the function MosaicPlot of the Mathematica package MosaicPlot.m provided by the project MathematicaForPrediction at GitHub. (Also see the document “Mosaic plots for data visualization” hosted at MathematicaForPrediction at GitHub. The document also has Mathematica code examples of usage and description of MosaicPlot‘s options.)

The function MosaicPlot summarizes the conditional probabilities of co-occurrence of the categorical values in a list of records of the same length. The list of records is assumed to be a full array and the columns to represent categorical values. (Note, that if a column is numerical but has a small number of different values it can be seen as categorical.)

I have read the descriptions of mosaic plots in the book “R in Action” by Robert Kabakoff and one of the references provided in the book (“What is a mosaic plot?” by Steve Simon). I was impressed how informative mosaic plots are and I figured they can be relatively easily implemented using Prefix trees (also known as “Tries”). I implemented MosaicPlot while working on a document analyzing the census income data from 1998, [6]. This is the reason that data set is used in this blog post. A good alternative set provided by ExampleData is {“Statistics”,”USCars1993″}.

Data set

The data set can be found and taken from http://archive.ics.uci.edu/ml/datasets/Census+Income.

The description of the data set is given in the file “adult.names” of the data folder. The data folder provides two sets with the same type of data “adult.data” and “adult.test”; the former is used for training, the latter for testing.

The total number of records in the file “adult.data” is 32561; the total number of records in the file “adult.test” is 16281.

Here is how the data looks like:
Adult census income data sample

Since I did not understand the meaning of the column “fnlwgt” I dropped it from the data.

Here is the summary table of the data:
Adult census income data summary

On the summary table the numerical variables are described with min, max, and quartiles. The category variables are described with the tallies of their values. The tallies of values are ordered in decreasing order. The tallies of truncated values are summed under the value “(Other)”.

Note that:
— only 24% of the labels are “>50K”;
— 2/3 of the records are for males;
— “capital-gain” and “capital-loss” are very skewed.

Mosaic plot explanations

If we pick a categorical variable, say “sex”, we can visualize the frequencies of the appearance of the variable values with the following plot:
Adult census income data sex mosaic plot

The size of the rectangles depends on the frequencies of appearance of the values “Male” and “Female” in the data records. From the rectangle sizes we can see what we already knew from the data summary table: approximately 2/3 of the records are about males.

We can subdivide every rectangle r according to the frequencies of co-occurrence of r’s value with the values of a second categorical variable, say “relationship”:
Adult census income data sex-relationship mosaic plot

The labels corresponding to the values of “relationship” are rotated for legibility. The “relationship” labels are placed according to the co-occurrence with the value “Male” of the variable “sex”. The correspondent fractions of the pairs (“Female”,”Husband”), (“Female”,”Not-in-family”), etc., are deduced from order of the “relationship” labels.

Using colored mosaic plots can help distinguishing which rectangles correspond to which values. Here is the last plot with rectangles colored across the “relationship” data variable:
Adult census income data sex-relationship colored mosaic plot

From the visual representations of the “sex vs. relationship” mosaic plot we can see that large fraction of the males are husbands, none (or a very small fraction) of them are wives. We can also see that none (or a very small fraction) of the females are husbands, the largest fraction of them are “Not-in-family”, and they are approximately three times more than the females that are wives.

Let us make another mosaic plot of a different kind of relationship “sex vs. education”:
Adult census income data sex-education colored mosaic plot

By comparing the sizes of the rectangles corresponding to the values “Bachelors”, “Doctorate”, “Masters”, and “Some-college” on the “sex vs. education” mosaic plot we can see that the fraction of men that have finished college is larger than the fraction of women that have finished college.

We can further subdivide the rectangles according to the co-occurrence frequencies with a third categorical variable. We are going to choose that third variable to be “income”, the values of which can be seen as outcomes or consequents of the values of the first two variables of the mosaic plot.
Adult census income data sex-education-income colored mosaic plot

From the mosaic plot “sex vs. education vs. income” we can make the following observations.
1. Approximately 75% of the males with doctorate degrees or with a professional school degree earn more than $50000 per year.
2. Approximately 60% of the females with a doctorate degree earn more than $50000 per year.
3. Approximately 45% of the females with a professional school degree earn more than $50000.
4. Across all education type females are (much) less likely to earn more than $50000 per year.

Although I mentioned earlier that the “outcome” variable should be the last variable in the mosaic plot, it is also useful to start with the outcome variable to get an attribute breakdown perspective (using a different color scheme):
Adult census income data income-relationship-sex colored mosaic plot

Signature of MosaicPlot

MosaicPlot takes various options for tweaking the labels placement and style. Here is the Mathematica command:

MosaicPlot[censusData[[All, {9, 3, 5, 14}]], "Gap" -> 0.014,
"ColumnNamesOffset" -> 0.07,
"ColumnNames" ->
Map[Style[#, Blue, FontSize -> 15] &, columnNames[[{9, 3, 5, 14}]]],
"LabelRotation" -> {{3, 1}, {1, 1}}, ImageSize -> 900]

with which the following mosaic plot was made:
Adult census income data sex-education-maritalStatus-income mosaic plot colored

The option “Gap” used to regulate the gaps between the rectangle. The options “ColumnNames” and “ColumnNamesOffset” are for the specification of the variable names (in blue in the plot). The option “LabelRotation” specifies the rotation of the labels that correspond to the individual values of the variables. Also, MosaicPlot takes all the options of Graphics (since it is based on it).

Tooltip tables

The function MosaicPlot has an interactive feature using Tooltip that gives a table with the exact co-occurrence (contingency) values when hovering with the mouse over the rectangles. Here is an example:
Adult census income data sex-education-income colored mosaic plot with tooltips

Future plans

The current implementation of MosaicPlot uses coloring of the rectangles for easier plot reading. An alternative is to use coloring based on correlations statistics. I think though that the tooltip contingency tables with flexible coloring specification make the correlation coloring less needed.