Contingency tables creation examples

Introduction

In statistics contingency tables are matrices used to show the co-occurrence of variable values of multi-dimensional data. They are fundamental in many types of research. This document shows how to use several functions implemented in Mathematica for the construction of contingency tables.

Code

In this document we are going to use the implementations in the package MathematicaForPredictionUtilities.m from MathematicaForPrediction at GitHub, [1].

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MathematicaForPredictionUtilities.m"]

The implementation of CrossTabulate in [1] is based on Tally, GatherBy, and SparseArray. The implementation of xtabsViaRLink in [1] is based on R‘s function xtabs called via RLink.

Other package used in this document are [2] and [4] imported with these commands:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MosaicPlot.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/RSparseMatrix.m"]

For a different approach to implementing cross-tabulation than those taken in [1] see the Stack Overflow answer http://stackoverflow.com/a/8101951 by Mr.Wizard.

Using Titanic data

Getting data

titanicData = 
  Flatten@*List @@@ ExampleData[{"MachineLearning", "Titanic"}, "Data"];
titanicData = DeleteCases[titanicData, {___, _Missing, ___}];

titanicColumnNames = 
  Flatten@*List @@ ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"];
aTitanicColumnNames = 
  AssociationThread[titanicColumnNames -> Range[Length[titanicColumnNames]]];

Note that we have removed the records with missing data (for simpler exposition).

Data summary

Dimensions[titanicData]
(* {1046, 4} *)

RecordsSummary[titanicData, titanicColumnNames]

"titanic-summary"

Using CrossTabulate

Assume that we want to group the people according to their passenger class and survival and we want to find the average age for each group.

We can do that by

1. finding the counts contingency table C for the variables "passenger class" and "passenger survival",

2. finding the age contingency table A for the same variables, and

3. do the element-wise division \frac{A}{C}.

ctCounts = 
  CrossTabulate[
   titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctCounts

"ctCounts-matrix-form"

ctTotalAge = 
  CrossTabulate[
   titanicData[[All, 
    aTitanicColumnNames /@ {"passenger class", "passenger survival", 
      "passenger age"}]]];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctTotalAge

"ctTotalAge-matrix-form"

MatrixForm[
 ctTotalAge[[1]]/
  Normal[ctCounts[[1]]], 
 TableHeadings -> Values[Rest[ctTotalAge]]]

"mean-matrix-form"

(We have to make the sparse array ctCounts a regular array because otherwise we get warnings for division by zero because of the sparse array’s default value.)

Let us repeat the steps above by dividing the passengers before-hand according to their sex.

Association@
 Map[
  (mCount = 
     CrossTabulate[#[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]]; 
   mAge = CrossTabulate[#[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival", "passenger age"}]]];
   #[[1,  aTitanicColumnNames["passenger sex"]]] -> 
     MatrixForm[mAge[[1]]/Normal[mCount[[1]]], TableHeadings -> Values[Rest[mAge]]]) &, 
  GatherBy[titanicData, #[[aTitanicColumnNames["passenger sex"]]] &]]

"sex-cross-tables"

The alternative of CrossTabulate is xtabsViaRLink that is uses R’s function xtabs via RLink.

Needs["RLink`"]
RLinkResourcesInstall[]
InstallR[]

(* {Paclet[RLinkRuntime,9.0.0.0, <>]} *)

ctCounts = 
  FromRXTabsForm@
   xtabsViaRLink[
    titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]],
    {"passenger.sex", "passenger.survival"},
    " ~ passenger.sex + passenger.survival"];
MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctCounts

"xtabs-output"

Relation to mosaic plots

The graphical visualization of a dataset with mosaic plots, [2,3], is similar in spirit to contingency tables. Compare the following mosaic plot with the contingency table in the last section.

MosaicPlot[
 titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]] ]

"titanic-class-survival-mosaic-plot"

Straightforward calling of MatrixForm

At this point we might want to be able to call MatrixForm more directly for the output of CrossTabulate and FromRXTabsForm. One way to do this is to define an up-value for Association .

Unprotect[Association];
MatrixForm[x_Association /; (KeyExistsQ[x, "XTABMatrix"] || KeyExistsQ[x, "XTABTensor"])] ^:= (MatrixForm[#1, TableHeadings -> Rest[{##}]] & @@ x);
Protect[Association];

Now we can do this:

MatrixForm @
 CrossTabulate[titanicData[[All, aTitanicColumnNames /@ {"passenger class", "passenger survival"}]]]

"ctCounts-matrix-form"

Remark: Because of this up-value definition for Association with MatrixForm we have the associations returned by CrossTabulate and FromRXTabsForm to have the key "XTABMatrix" instead of "Matrix", the former assumed to be much more rarely to be used than the latter.

Using larger data

Let us consider an example with larger data that has larger number of unique values in its columns.

Getting online retail invoices data

The following dataset is taken from [6].

data = Import[ "/Volumes/WhiteSlimSeagate/Datasets/UCI Online Retail Data Set/Online Retail.csv"];
columnNames = First[data];
data = Rest[data];

aColumnNames = AssociationThread[columnNames -> Range[Length[columnNames]]];

Data summary

We have \approx 66000 rows and 8 columns:

Dimensions[data]
(* {65499, 8} *)

Here is a summary of the columns:

Magnify[#, 0.75] &@RecordsSummary[data, columnNames]

"online-retail-summary"

Contingency tables

Country vs. StockCode

There is no one-to-one correspondence between the values of the column "Description" and the column "StockCode" which can be seen with this command:

MinMax@Map[Length@*Union, 
  GatherBy[data[[All, aColumnNames /@ {"Description", "StockCode"}]], First]]
(* {1,144} *)

The way in which the column "StockCode" was ingested made it have multiple types for its values:

Tally[NumberQ /@ data[[All, aColumnNames["StockCode"]]]]
(* {{False,9009},{True,56490}} *)

So let us convert it to all strings:

data[[All, aColumnNames["StockCode"]]] = 
  ToString /@ data[[All, aColumnNames["StockCode"]]];

Here we find the contingency table for "Country" and "StockCode" over "Quantity" using CrossTabulate:

AbsoluteTiming[
 ctRes = CrossTabulate[
    data[[All, aColumnNames /@ {"Country", "StockCode", "Quantity"}]]];
]
(* {0.256339,Null} *)

Here we find the contingency table for "Country" and "StockCode" over "Quantity" using xtabsViaRLink:

AbsoluteTiming[
 rres = xtabsViaRLink[
   data[[All, aColumnNames /@ {"Country", "StockCode", "Quantity"}]],
   {"Country", "StockCode", "Quantity"},
   "Quantity ~ Country + StockCode"]; 
 ctRRes = FromRXTabsForm[rres];
]
(* {0.843621,Null} *)

Both functions produce the same result:

ctRRes["matrix"] == N@ctRRes[[1]]
(* True *)

Note that xtabsViaRLink is slower but still fairly quick.

Here we plot the contingency table using MatrixPlot :

MatrixPlot[ctRRes["matrix"], AspectRatio -> 1/1.5, 
  FrameTicks -> {{#, #} &@ Table[{i, ctRRes["rownames"][[i]]}, {i, Length[ctRRes["rownames"]]}], 
  {Automatic, Automatic}}, ImageSize -> 550]

"online-retail-country-vs-stockcode-table"

Country vs. Quarter

Let us extend the data with columns that have months and quarters corresponding to the invoice dates.

The following commands computing date objects and extracting month and quarter values from them are too slow.

(*AbsoluteTiming[dobjs=DateObject[{#,{"Month","/","Day","/","Year"," \
","Hour",":","Minute"}}]&/@data[[All,aColumnNames["InvoiceDate"]]];
]*)
(* {30.2595,Null} *)

(*AbsoluteTiming[
dvals=DateValue[dobjs,{"MonthName","QuarterNameShort"}];
]*)
(* {91.1732,Null} *)

We can use the following ad hoc computation instead.

dvals = StringSplit[#, {"/", " ", ":"}] & /@ 
   data[[All, 
    aColumnNames["InvoiceDate"]]];

This summary shows that the second value in the dates is day of month, and the first value is most likely month.

Magnify[#, 0.75] &@ RecordsSummary[dvals[[All, 1 ;; 3]], "MaxTallies" -> 16]

"dvals-conjecture-summary"

These commands extend the data and the corresponding column-name-to-index association.

ms = DateValue[Table[DateObject[{2016, i, 1}], {i, 12}], "MonthName"];
dvals = Map[{ms[[#]], "Q" <> ToString[Quotient[#, 4] + 1]} &, ToExpression @ dvals[[All, 1]]];
dataM = MapThread[Join[#1, #2] &, {data, dvals}];
aColumnNamesM = 
  Join[aColumnNames, <|"MonthName" -> (Length[aColumnNames] + 1), "QuarterNameShort" -> (Length[aColumnNames] + 2)|>];
(* {0.054877,Null} *)

Here is the contingency table for "Country" vs "QuarterNameShort" over "Quantity".

ctRes = CrossTabulate[ dataM[[All, aColumnNamesM /@ {"Country", "QuarterNameShort", "Quantity"}]]];
Magnify[#, 0.75] &@ MatrixForm[#1, TableHeadings -> {#2, #3}] & @@ ctRes

"online-retail-country-vs-quarter-table"

Uniform tables

Often when making contingency tables over subsets of the data we obtain contingency tables with different rows and columns. For various reasons (programming, esthetics, comprehension) it is better to have the tables with the same rows and columns.

Here is an example of non-uniform contingency tables derived from the online retail data of the previous section. We split the data over the countries and find contingency tables of "MonthName" vs "QuarterNameShort" over "Quantity".

tbs = Association @
 Map[
    (xtab = CrossTabulate[#[[All, aColumnNamesM /@ {"MonthName", "QuarterNameShort", "Quantity"}]]];
     #[[1, aColumnNamesM["Country"]]] -> xtab) &,
    GatherBy[ dataM, #[[aColumnNamesM[ "Country"]]] &]];

Magnify[#, 0.75] &@
 Map[MatrixForm[#["Matrix"], TableHeadings -> (# /@ {"RowNames", "ColumnNames"})] &, tbs](*[[{1,12,-1}]]*)

"non-uniform-tables"

Using the object RSparseMatrix, see [4,5], we can impose row and column names on each table.

First we convert the contingency tables into RSparseMatrix objects:

tbs2 = Map[ ToRSparseMatrix[#["Matrix"], "RowNames" -> #["RowNames"], "ColumnNames" -> #["ColumnNames"]] &, tbs];

And then we impose the desired row and column names:

tbs2 = Map[ ImposeColumnNames[ ImposeRowNames[#, {"January", "December"}], {"Q1", "Q2", "Q3", "Q4"}] &, tbs2];
Magnify[#, 0.75] &@(MatrixForm /@ tbs2)

"uniform-tables"

Generalization : CrossTensorate

A generalization of CrossTabulate is the function CrossTensorate implemented in [1] that takes a "formula" argument similar to R’s xtabs.

This finds number of people of different sub-groups of Titanic data:

ctRes = CrossTensorate[Count == "passenger survival" + "passenger sex" + "passenger class", titanicData, aTitanicColumnNames];
MatrixForm[ctRes]

"cross-tensorate-result"

We can verify the result using Count:

Count[titanicData, {"1st", _, "female", "died"}]
(* 5 *)

Count[titanicData, {"2nd", _, "male", "survived"}]
(* 23 *)

To split the cross-tensor across its first variable we can use this command:

sctRes = Association@
  MapThread[Rule[#1, Join[<|"XTABTensor" -> #2|>, Rest@Rest@ctRes]] &, {ctRes[[2]], # & /@ ctRes["XTABTensor"]}];
MatrixForm /@ sctRes

"cross-tensorate-split"

Or we can call the more general function CrossTensorateSplit implemented in [1]:

Map[MatrixForm /@ CrossTensorateSplit[ctRes, #] &, Rest@Keys[ctRes]]

"crosstensoratesplit-example"

CrossTensorateSplit can also be called with one argument that is a variable name.This will produce a splitting function. For example, the above command can be re-written as :

Map[MatrixForm /@ CrossTensorateSplit[#] @ ctRes &, Rest@Keys[ctRes]]

References

[1] Anton Antonov, MathematicaForPrediction utilities, (2014), source code MathematicaForPrediction at GitHub, package MathematicaForPredictionUtilities.m.

[2] Anton Antonov, Mosaic plot for data visualization implementation in Mathematica, (2014), MathematicaForPrediction at GitHub, package MosaicPlot.m.

[3] Anton Antonov, "Mosaic plots for data visualization", (2014), MathematicaForPrediction at WordPress blog. URL: https://mathematicaforprediction.wordpress.com/2014/03/17/mosaic-plots-for-data-visualization/ .

[4] Anton Antonov, RSparseMatrix Mathematica package, (2015) MathematicaForPrediction at GitHub. URL: https://github.com/antononcube/MathematicaForPrediction/blob/master/Misc/RSparseMatrix.m .

[5] Anton Antonov, "RSparseMatrix for sparse matrices with named rows and columns", (2015), MathematicaForPrediction at WordPress blog. URL: https://mathematicaforprediction.wordpress.com/2015/10/08/rsparsematrix-for-sparse-matrices-with-named-rows-and-columns/ .

[6] Daqing Chen, Online Retail Data Set, (2015), UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/ml/datasets/Online+Retail .

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s