Classification and association rules for census income data

Introduction

In this blog post I am going to show (some) analysis of census income data — the so-called "Adult" data set, [1] — using three types of algorithms: decision tree classification, naive Bayesian classification, and association rules learning. Mathematica packages for all three algorithms can be found in the project MathematicaForPrediction hosted at GitHub, [2,3,4].

(The census income data set is also used in the description of the R package “arules”, [7].)

In the census data every record represents a person with 14 attributes; the last element of a record is one of the labels {">50K", "<=50K"}. The relationships between the categorical variables in that data set were described in my previous blog post, "Mosaic plots for data visualization".

For this data the questions I am most interested in are:
Question 1: Which of the variables (age, occupation, sex, etc.) are most decisive for determining the income of a person?
Question 2: Which values for which variables form conditions that would imply high income or low income? (I.e. “>50K” or “<=50K”.)
Question 3: What conclusions or confirmations can we get from answering the previous two questions?

One way to answer Question 1 is to use the following steps, [8]; a code sketch of the shuffling experiment is given after the list.
1. Build a classifier with the training set.
2. Verify using the test set that good classification results are obtained.
3. If the number of variables (attributes) is k, then for each i, 1<=i<=k:
3.1. Shuffle the values of the i-th column of the test data and find the classification success rates.
4. Compare the obtained k classification success rates with each other and with the success rates obtained with the un-shuffled test data.
5. The variables for which the classification success rates are the worst are the most decisive.
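Here is a minimal sketch of that shuffling experiment, using the built-in function Classify for illustration (the experiments in this post use the packages [2] and [3] instead); trainingData and testData are assumed to be lists of records in which the class label is the last element:

  (* Train a classifier over rules of the form {attribute values} -> label. *)
  classifier = Classify[Map[Most[#] -> Last[#] &, trainingData]];

  (* Overall fraction of correctly classified records. *)
  successRate[cf_, data_] :=
    N[Count[data, r_ /; cf[Most[r]] === Last[r]]/Length[data]];

  (* Success rate after shuffling the values of the i-th column of the test data. *)
  shuffledSuccessRate[cf_, data_, i_] :=
    Module[{shuffled = data},
      shuffled[[All, i]] = RandomSample[data[[All, i]]];
      successRate[cf, shuffled]];

  baseRate = successRate[classifier, testData];
  shuffledRates =
    Table[shuffledSuccessRate[classifier, testData, i], {i, Length[First[testData]] - 1}];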

Following these steps with a decision tree classifier, [2], I found that "marital-status" and "education-num" (years of education) are the most decisive for good prediction of the ">50K" label. Using a naive Bayesian classifier, [3], the most significant variables are "marital-status" and "relationship". (More details are given in the sections "Application of decision trees" and "Application of naive Bayesian classifiers".)

One way to answer Question 2 is to find which values of the variables (e.g. "Wife", "Peru", "HS-grad", "Exec-managerial") associate most frequently with ">50K" and "<=50K" respectively, and apply different Bayesian probability statistics to them. This is what the application of association rules learning gives, [9]. Another way is to use mosaic plots, [5,10], and prefix trees (also known as "tries"), [6,11,12].

In order to apply association rules learning we need to make the numerical variables categorical — we need to partition them into non-overlapping intervals. (This derived, "all categorical" data can also be used as input for mosaic plots and prefix trees.)

Insights about the data set using Mosaic Plots can be found in my previous blog post “Mosaic plots for data visualization”, [13]. The use of Mosaic Plots in [13] is very similar to the Naive Bayesian Classifiers application discussed below.

Data set

The data set can be obtained from http://archive.ics.uci.edu/ml/datasets/Census+Income, [1].

The description of the data set is given in the file "adult.names" of the data folder. The data folder provides two sets with the same type of data, "adult.data" and "adult.test"; the former is used for training, the latter for testing.

The total number of records in the file “adult.data” is 32561; the total number of records in the file “adult.test” is 16281.

Here is how the data looks:
Adult census income data sample

Since I did not understand the meaning of the column “fnlwgt” I dropped it from the data.
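For reference, here is a minimal sketch of importing the training data with built-in functions and dropping that column (the column names are taken from "adult.names"; the file is assumed to be in the current directory):

  columnNames = {"age", "workclass", "fnlwgt", "education", "education-num",
     "marital-status", "occupation", "relationship", "race", "sex",
     "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"};

  (* Import the comma separated records and drop incomplete rows (e.g. a trailing empty line). *)
  adultData = Select[Import["adult.data", "CSV"], Length[#] == Length[columnNames] &];

  (* Drop the "fnlwgt" column from the data and from the column names. *)
  pos = Position[columnNames, "fnlwgt"][[1, 1]];
  adultData = Map[Drop[#, {pos}] &, adultData];
  columnNames = Drop[columnNames, {pos}];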

Here is a summary of the data:
Adult census income data summary

Note that only 24% of the labels are ">50K", and that 2/3 of the records are for males.

Scatter plots and mosaic plots

Often scatter plots and mosaic plots can give a good idea of the general patterns that hold in the data. This sub-section has a couple of examples, but presenting extensive plots is beyond the scope of this blog post. Let me point out that it might be very beneficial to use these kinds of plots with Mathematica's dynamic features (like Manipulate and Tooltip), or to make a grid of mosaic plots.

Mosaic plots of the categorical variables of the data can be seen in my previous blog post “Mosaic plots for data visualization”.

Here is a table of the histograms for “age”, “education-num”, and “hours-per-week”:
adult-data-scatter-plots-age-education-num-hours-per-week

Here is a table with scatter plots for all numerical variables of the data:
adult-data-scatter-plots-age-education-num-capital-gain-capital-loss-hours-per-week

Application of decision trees

Building decision trees and classifying with them is straightforward. Since the label ">50K" appears in only a quarter of the records, I consider the classification success rates for ">50K" to be more important.

adult-data-Decision-tree-classification-success-rates

I experimented with several sets of parameters for decision tree building. I did not get a classification success rate for ">50K" better than 0.644. Using pruning based on the Minimal Description Length (MDL) principle did not give better results. (I have to say I find MDL pruning to be an elegant idea, but I am not convinced that it works that well. I believe decision tree pruning based on test data would produce much better results. Only the MDL decision tree pruning is implemented in [2].)

The overall classification success rate is in line with the classification success ratios listed in the description of the data set; see the file "adult.names" in [1].

Here is a table with the results of the column shuffling experiments described in the introduction (in red is the name of the data column shuffled):
adult-data-Decision-tree-classification-shuffled-success-rates-table

Here is a plot of the “>50K” success rates from the table above:
adult-data-Decision-tree-classification-shuffled-success-rates-plot

We can see from the table and the plot that variables “marital-status”, “education-num”, “capital-gain”, “age”, and “occupation” are very decisive when it comes to determining high income. The variable “marital-status” is significantly more decisive than the others.

While considering the decisiveness of the variable "marital-status" we can pose the following questions:
1. Do people find higher paying jobs after they get married?
2. Are people with high paying jobs more likely to marry and stay married?

Both questions are probably answered with "Yes", and that is likely why "marital-status" is so decisive. It is hard to give quantified answers to these questions just using decision trees on this data — we would need to know the salary and marital status history of the individuals (different data) or to be able to infer it (different algorithm).

We can see the decisiveness of “age”, “education-num”, “occupation”, and “hours-per-week” as natural. Of course one is expected to receive a higher pay if he has studied longer, has a high paying occupation, is older (more experienced), and works more hours per week. Note that this statement explicitly states the direction of the correlation: we do assume that longer years of study bring higher pay. It is certainly a good idea to consider the alternative direction of the correlation, that people first get high paying jobs and that these high paying jobs allow them to get older and study longer.

Application of naive Bayesian classifiers

The naive Bayesian classifier, [3], produced better classification results than the decision trees for the label “>50K”:
adult-data-NBC-classification-success-rates

Here is a table with the results of the column shuffling experiments described in the introduction (in red is the name of the data column shuffled):
adult-data-NBC-classification-shuffled-success-rates-table

Here is a plot of the “>50K” success rates from the table above:
adult-data-NBC-classification-shuffled-success-rates-plot

In comparison with the decision tree variable importance experiments we can notice that:
1. “marital-status” is very decisive and it is the second most decisive variable;
2. the most decisive variable is “relationship” but it correlates with “marital-status”;
3. “age”, “occupation”, “hours-per-week”, “capital-gain”, and “sex” are decisive.

Shuffled classification rates plots comparison

Here are the two shuffled classification rates plots stacked together for easier comparison:
adult-data-Decision-tree-and-NBC-classification-shuffled-success-rates-plots

Data modification

In order to apply the association rules finding algorithm Apriori, [4], the data set has to be modified. The modification is to change the numerical variables "age", "education-num", and "hours-per-week" into categorical ones. I just partitioned them into non-overlapping intervals, labeled the intervals, and assigned the labels according to the variable values. Here is the summary of the modified data for just these variables:
adault-data-numerical-to-categorical-columns-summary
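Here is a minimal sketch of the kind of conversion used; the cut points below are illustrative, not necessarily the ones behind the summary above:

  (* Assign an interval label to a numerical value, given a variable name and sorted cut points. *)
  intervalLabel[varName_String, cutPoints_List, x_?NumberQ] :=
    Module[{i = LengthWhile[cutPoints, # <= x &]},
      varName <> ":" <>
       ToString[If[i == 0, -Infinity, cutPoints[[i]]]] <> ".." <>
       ToString[If[i == Length[cutPoints], Infinity, cutPoints[[i + 1]]]]];

  (* Example: "age" partitioned with illustrative cut points. *)
  intervalLabel["age", {25, 35, 45, 55, 65}, 39]
  (* ==> "age:35..45" *)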

Finding association rules

Using the modified data I found a large number of association rules with the Apriori algorithm, [4]. I used the measure called "confidence" to extract the most significant rules. The confidence of an association rule A => C with antecedent A and consequent C is defined to be the ratio P(A ∩ C)/P(A), i.e. the conditional probability of the consequent given the antecedent. The higher the ratio, the more confidence we have in the rule. (If the ratio is 1 we have a logical rule: A implies C.)
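As a small illustration, here is how support and confidence can be computed directly from the categorical records (a sketch with hypothetical helper names, not the interface of the package [4]):

  (* Each record is a list of category labels, e.g. {"sex:Male", "marital-status:Married-civ-spouse", ...}. *)
  support[records_, itemset_List] :=
    N[Count[records, r_ /; SubsetQ[r, itemset]]/Length[records]];

  confidence[records_, antecedent_List, consequent_List] :=
    support[records, Union[antecedent, consequent]]/support[records, antecedent];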

Here is a table showing the rules with highest confidence for the consequent being “>50K”:
adult-data-association-rules-more-than-50K

From the table we can see, for example, that 2.1% of the data records (or 693 records) show that for a married man who has studied 14 years and is originally from the USA there is a 0.79 probability that he earns more than $50000.

Here is a table showing the rules with highest confidence for the consequent being “<=50K”:
adult-data-association-rules-less-than-50K

The association rules in these tables confirm the findings with the classifiers: marital status, age, and education are good predictors of income labels “>50K” and “<=50K”.

Conclusion

The analysis confirmed (and quantified) what is considered common sense:

Age, education, occupation, and marital status (or relationship kind) are good for predicting income (above a certain threshold).

Using the association rules we see for example that
(1) if a person earns more than $50000 he is very likely to be a married man with a large number of years of education;
(2) single parents, younger than 25 years, who studied less than 10 years and were never married, make less than $50000.

References

[1] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Census Income Data Set, URL: http://archive.ics.uci.edu/ml/datasets/Census+Income .

[2] Antonov, A., Decision tree and random forest implementations in Mathematica, source code at https://github.com/antononcube/MathematicaForPrediction, package AVCDecisionTreeForest.m, (2013).

[3] Antonov, A., Implementation of naive Bayesian classifier generation in Mathematica, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package NaiveBayesianClassifier.m, (2013).

[4] Antonov, A., Implementation of the Apriori algorithm in Mathematica, source code at https://github.com/antononcube/MathematicaForPrediction, package AprioriAlgorithm.m, (2013).

[5] Antonov, A., Mosaic plot for data visualization implementation in Mathematica, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package MosaicPlot.m, (2014).

[6] Antonov, A., Tries with frequencies Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package TriesWithFrequencies.m, (2013).

[7] Hahsler, M. et al., Introduction to arules – A computational environment for mining association rules and frequent item sets, (2012).

[8] Breiman, L. et al., Classification and regression trees, Chapman & Hall, 1984.

[9] Wikipedia, Association rules learning, http://en.wikipedia.org/wiki/Association_rule_learning .

[10] Antonov, A., Mosaic plots for data visualization, (March, 2014), MathematicaForPrediction at GitHub, URL: https://github.com/antononcube/MathematicaForPrediction/blob/master/Documentation/Mosaic%20plots%20for%20data%20visualization.pdf .

[11] Wikipedia, Trie, http://en.wikipedia.org/wiki/Trie .

[12] Antonov, A., Tries, (December, 2013), URL: https://github.com/antononcube/MathematicaForPrediction/blob/master/Documentation/Tries.pdf .

[13] Antonov, A., Mosaic plots for data visualization, (March, 2014) MathematicaForPrediction at WordPress.


Classification of handwritten digits

In this blog post I show some experiments with algorithmic recognition of images of handwritten digits.

I followed the algorithm described in Chapter 10 of the book “Matrix Methods in Data Mining and Pattern Recognition” by Lars Elden.

The algorithm described uses the so-called thin Singular Value Decomposition (SVD); a code sketch of both phases is given after the outline below.

  1. Training phase
    1.1. Rasterize each training image into an array of 16 x 16 pixels.
    1.2. Each raster image is linearized — its rows are concatenated into a one-dimensional array. In other words, each raster image is mapped into the vector space R^256. We will call these one-dimensional arrays raster vectors.
    1.3. From each set of images corresponding to a digit, make a matrix with 256 columns, the rows of which are the corresponding raster vectors.
    1.4. Using the matrices from step 1.3, apply thin SVD to derive orthogonal bases that describe the image data for each digit.

  2. Recognition phase
    2.1. Given an image of an unknown digit derive its raster vector, R.
    2.2. Find the residuals of the approximations of R with each of the bases found in 1.4.
    2.3. The digit with the minimal residual is the recognition result.
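Here is a minimal sketch of both phases with built-in functions; digitImages is assumed to be a list of ten lists of training images (the first list holding drawings of 0, the last holding drawings of 9), and the number of basis vectors k is a tuning parameter:

  (* 1.1-1.2: rasterize an image to 16 x 16 pixels and linearize it into a vector in R^256. *)
  rasterVector[img_Image] :=
    Flatten[ImageData[ColorConvert[ImageResize[img, {16, 16}], "Grayscale"]]];

  (* 1.3-1.4: per digit, put the raster vectors into the columns of a matrix and
     take the first k left singular vectors as an orthogonal basis. *)
  k = 8;
  bases = Table[
     First[SingularValueDecomposition[
       Transpose[Map[rasterVector, digitImages[[d]]]], k]], {d, 1, 10}];

  (* 2.1-2.3: residual of approximating a raster vector r with a basis U; recognition
     returns the digit (list index minus 1) with the minimal residual. *)
  residual[U_, r_] := Norm[r - U.(Transpose[U].r)];
  recognizeDigit[img_Image] :=
    First[Ordering[Table[residual[bases[[d]], rasterVector[img]], {d, 1, 10}], 1]] - 1;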

The algorithm is programmed very easily with Mathematica. I did some experiments using training and test digit drawings made with the iPad app Zen Brush. I applied both the SVD recognition algorithm described above and decision trees, the latter in the same way as described in the previous blog post.

Here is a table of the training images:

DigitImagesWithZenBrush-TrainingSet

And here is a table of the test images:

DigitImagesWithZenBrush-TestSet

Note that the third row is with images drawn with a thinner brush, and the fourth row is with images drawn with a thicker brush.

Here are raster images of the top row of the test drawings:

RasteredDigitDrawings

Here are several plots showing raster vectors:

Coordinates of 0 in R^256   Coordinates of 3 in R^256   Coordinates of 7 in R^256

As I mentioned earlier, raster vectors are very similar to the wave samples described in the previous blog post, so we can apply decision trees to them.

The SVD algorithm misclassified only 3 images out of 36 test digit images, success ratio 92%. Here is a table with digit drawings and list plots of the residuals:

DigitsRecognitionWithSVDBases

It is interesting to look at the residuals obtained for different recognition instances. For example, the plot on the first row and first column, for the recognition of a drawing of "2", shows that the residual corresponding to 2 is the smallest and the residual for 8 is the next smallest one. The residual for 2 is the clear outlier. On the second row and third column we can see that a drawing of "4" has been classified correctly as 4, but the residual for 9 is very close to the residual for 4; we almost had a misclassification. We can see that for the other three test images with "4" the residuals for 4 are clearly separated from the rest, which can be explained by "4" being drawn more slanted, and its angle being more pronounced. Examining the misclassifications in a similar way explains why they occurred.

Here are the misclassified images:

DigitsRecognitionWithDecisionTree-MisclassifiedDrawings

Note that the misclassified image of 7 is quite different from the training images for 7.

The decision tree misclassified 42% of the images; here is a table of them:

DigitsRecognitionWithSVDBases-MisclassifiedDrawings

Note that the decision trees would probably perform better if more training data were used, not just nine drawings per digit. I also experimented with building the classifiers over the "negative" images and with aligning the columns of the raster images instead of aligning the rows. The classification results were not better.

Some details about the image preprocessing follow.

As I said, I drew the images using the Zen Brush app. For each digit I drew nine instances on Zen Brush's canvas and exported them to an image — here is an example:

Anton 7 with ZenBrush

Then I used Mathematica's function ImagePartition to partition the image into 9 single-digit drawings, and then applied ImageCrop to all of them. Of course the same procedure is done for the testing images.
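Here is a minimal sketch of that preprocessing, assuming each exported image holds a 3 x 3 grid of drawings (the file name in the example is hypothetical):

  (* Split an exported Zen Brush image into a 3 x 3 grid of single-digit drawings
     and crop the uniform background around each of them. *)
  splitDigits[img_Image] :=
    Map[ImageCrop, Flatten[ImagePartition[img, Floor[ImageDimensions[img]/3]]]];

  (* Example usage (hypothetical file name): *)
  sevens = splitDigits[Import["Anton-7-with-ZenBrush.png"]];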

Further developments

Further developments with the MNIST data set are described and discussed in the blog post “Handwritten digits recognition by matrix factorization” and the forum discussion “[Mathematica-vs-R] Handwritten digits recognition by matrix factorization”.

Waveform recognition with decision trees

A few weeks ago I programmed in Mathematica the set-up of the waveform recognition problem as described in Chapter 2, Section 2.6.2 of the book "Classification And Regression Trees" by Breiman et al. Here is the document that describes the problem formulation and the classification experiments in detail: Waveform recognition with decision trees. The rest of this post is a sort of introduction to the problem.

We have three waveforms h1, h2, h3 that are piecewise linear functions shown on this plot:

Three base waveforms

We have a data array D with n rows and 21 columns. The rows of D are linear combinations of the form

x(j) = u h_l(j) + (1 - u) h_m(j) + ξ(j), j = 1, ..., 21,

in which h_l and h_m are two of the three base waveforms, the mixing coefficient u is uniformly distributed in [0, 1], and the last term ξ(j) is noise generated with the normal distribution centered around 0 and standard deviation 1.
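Here is a minimal sketch that generates such wave samples; the triangular base waveforms below follow the formulation in the book by Breiman et al. and are intended to match the plot above:

  (* The three triangular base waveforms over j = 1, ..., 21 (centers at 11, 7, and 15). *)
  h1[j_] := Max[6 - Abs[j - 11], 0];
  h2[j_] := Max[6 - Abs[j - 7], 0];
  h3[j_] := Max[6 - Abs[j - 15], 0];

  (* One wave sample of the class that mixes the waveforms hA and hB. *)
  waveSample[hA_, hB_] :=
    With[{u = RandomReal[]},
     Table[u hA[j] + (1 - u) hB[j] + RandomVariate[NormalDistribution[0, 1]], {j, 1, 21}]];

  (* Example: 100 rows of D for the class combining h1 and h2. *)
  samples12 = Table[waveSample[h1, h2], {100}];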

This figure shows how the rows of D look and how they can be interpreted:

Sample of wave samples

The blue points represent the vectors generated with the formula above. The dashed red lines show the corresponding “clean” waves, with the noise vector ξ removed. The plot labels tell the corresponding waveform combination class labels.

The problem is:
Given D and a vector v generated with the equation above we want to determine which base waveforms have been used to generate v.

This is a classification problem, and to solve it we can construct classifiers using decision trees. Here is a short decision tree made over 300 rows of D:

A short decision tree build over 300 wave samples

As it was mentioned above, a more detailed exposition is given here: Waveform recognition with decision trees. In that document the classification problem is solved using both (i) decision trees with different tuning parameters, and (ii) random forests specific to the problem. With the random forests it is possible to attain 78-85% successful recognition. The experiments in the document also illustrate how to use the functions of the decision trees and forests package of the project MathematicaForPrediction at GitHub.

GDP per capita analysis

Let us see whether, using demographic and economic data of all countries, we can find correlations with high GDP per capita. In this blog post, by "high GDP per capita" I mean "GDP per capita larger than $30,000."

I used decision trees for these correlation experiments, and more specifically the implementation at the “MathematicaForPrediction” project at GitHub (https://github.com/antononcube/MathematicaForPrediction).

I built several decision trees and forests. Here is a sample of the training data:

Training data sample

And here it can be seen how it was labeled:

Training data sample labeling

We have 176 countries labeled "low" (meaning with low GDP per capita) and 40 countries labeled "high". (I used Mathematica's CountryData function.)
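Here is a minimal sketch of how such labeled training data can be assembled with CountryData; the set of properties below is illustrative and may differ from the ones used for the trees in this post:

  (* Numerical value of a (possibly Quantity-valued) CountryData result. *)
  numeric[x_] := If[Head[x] === Quantity, QuantityMagnitude[x], x];

  (* Label a country "high" if its GDP per capita is above $30,000, otherwise "low". *)
  gdpLabel[c_] :=
    If[numeric[CountryData[c, "GDP"]]/numeric[CountryData[c, "Population"]] > 30000,
     "high", "low"];

  properties = {"LifeExpectancy", "MedianAge", "LiteracyFraction", "BirthRateFraction"};
  trainingRow[c_] := Append[Map[numeric[CountryData[c, #]] &, properties], gdpLabel[c]];

  (* Keep only the countries with complete data. *)
  trainingData = Select[Map[trainingRow, CountryData[]], FreeQ[#, _Missing] &];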

A great feature of decision trees is that they are easy to interpret — here is a decision tree over the data discussed above:

Decision tree for GDP per capita

Each non-leaf node of the tree has the format {impurity, splitting value, splitting variable, variable type, number of rows}. The leaf nodes are numbered; each leaf shows a label and how many countries adhere to the predicate formed from the root to the leaf.

Following the edges from the root of the tree to Leaf 18 we can see that countries that have life expectancy higher than 79, birth rate fraction less than 0.014, median age higher than 33, and literacy fraction higher than 0.94 have high GDP per capita. This rule holds for more than 50% of the countries with high GDP per capita.

Following the edges from the root of the tree to Leaf 0 we can see that countries that have life expectancy less than 74 and median age less than 33 have low GDP per capita. This rule holds for more than 65% of the countries with low GDP per capita.

I made decision trees with more comprehensive sets of variables. Here is a sample of the training data:

Sample of high GDP per capita countries with more data variables

And here is the resulting decision tree:

Decision tree for GDP per capita more variables