Mathematica vs. R at GitHub

In brief

This post announces the MathematicaVsR repository at GitHub, which has example projects, code, and documents for comparing Mathematica with R.

My plan is to announce newly completed Mathematica-vs-R projects here, in this blog post, and, when appropriate, to make separate blog posts about them.

Mission statement

The development in the MathematicaVsR repository at GitHub aims to provide a collection of relatively simple but non-trivial example projects that illustrate the use of Mathematica and R in different statistical, machine learning, scientific, and software engineering programming activities.

Each of the projects has implementations and documents made with both Mathematica and R, which hopefully allows comparison and knowledge transfer.

Where to begin

The presentation "Mathematica vs. R", given at the Wolfram Technology Conference 2015, is probably a good place to start.

As a warm-up for the comparisons, see this mind map (made for Mathematica users):

"Mathematica-vs-R-mind-map-for-Mathematica-users"

Projects overview

  1. BrowsingDataWithChernoffFaces

  2. DataWrangling

  3. DistributionExtractionAFromGaussianNoisedMixture

  4. HandwrittenDigitsClassificationByMatrixFactorization

  5. ODEsWithSeasonalities

  6. ProgressiveJackpotModeling

  7. RegressionWithROC

  8. StatementsSaliencyInPodcasts

  9. TextAnalysisOfTrumpTweets

  10. TimeSeriesAnalysisWithQuantileRegression

Future projects

The future projects are listed in order of expected completion: the higher a project is in the list, the sooner it will be committed.


GDP per capita analysis

Let us assume that, using demographic and economic data for all countries, we can find variables that correlate with high GDP per capita. In this blog post, by “high GDP per capita” I mean “GDP per capita larger than $30,000.”

I used decision trees for these correlation experiments, more specifically the implementation in the “MathematicaForPrediction” project at GitHub (https://github.com/antononcube/MathematicaForPrediction).
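
For readers who want to reproduce this, here is a minimal sketch of loading that decision tree implementation directly from GitHub. The package file name AVCDecisionTreeForest.m and its location in the repository root are my assumptions; adjust the URL if the repository layout differs.

    (* Load the decision tree package from MathematicaForPrediction;
       the file name and path are assumptions, adjust if needed *)
    Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/AVCDecisionTreeForest.m"]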

I built several decision trees and forests. Here is a sample of the training data:

Training data sample

And here is how the data was labeled:

Training data sample labeling

We have 176 countries labeled “low” (meaning with low GDP per capita) and 40 countries labeled “high”. (I used Mathematica's CountryData function.)
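
Here is a minimal sketch of how such labeled training data can be assembled with CountryData. The predictor variables below (life expectancy, birth rate fraction, median age, literacy fraction) are taken from the tree discussion that follows; the exact variable set used for the post, and hence the resulting label counts, may differ.

    (* Predictor variables; an assumed subset of the ones used in the post *)
    props = {"LifeExpectancy", "BirthRateFraction", "MedianAge", "LiteracyFraction"};

    (* Keep only countries for which all required data is available *)
    dataOKQ[c_] :=
      FreeQ[Table[CountryData[c, p], {p, Join[props, {"GDP", "Population"}]}], _Missing];
    countries = Select[CountryData[], dataOKQ];

    (* GDP per capita in US dollars *)
    gdpPerCapita[c_] :=
      QuantityMagnitude[CountryData[c, "GDP"]]/QuantityMagnitude[CountryData[c, "Population"]];

    (* One row per country: predictor values followed by the class label *)
    trainingData = Table[
       Append[
         Table[QuantityMagnitude[CountryData[c, p]], {p, props}],
         If[gdpPerCapita[c] > 30000, "high", "low"]],
       {c, countries}];

    (* Label distribution *)
    Tally[trainingData[[All, -1]]]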

A great feature of decision trees is that they are easy to interpret. Here is a decision tree over the data discussed above:

Decision tree for GDP per capita
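
Here is a sketch of how such a tree can be built over the training data constructed above. I am assuming the package exports BuildDecisionTree, BuildDecisionForest, and DecisionTreeClassify, with the class label in the last column of the data; check the package code for the exact signatures.

    (* Build a decision tree over the labeled data (label in the last column) *)
    dtree = BuildDecisionTree[trainingData];

    (* Classify one record (all columns except the label) *)
    DecisionTreeClassify[dtree, trainingData[[1, 1 ;; -2]]]

    (* A forest can presumably be built in a similar way; the second
       argument (number of trees) is an assumption about the signature *)
    (* dforest = BuildDecisionForest[trainingData, 20]; *)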

Each non-leaf node of the tree has the format {impurity, splitting value, splitting variable, variable type, number of rows}. The leaf nodes are numbered; each leaf shows a label and how many countries adhere to the predicate formed along the path from the root to that leaf.

Following the edges from the root of the tree to Leaf 18, we can see that countries with life expectancy higher than 79, birth rate fraction less than 0.014, median age higher than 33, and literacy fraction higher than 0.94 have high GDP per capita. This rule holds for more than 50% of the countries with high GDP per capita.

Following the edges from the root of the tree to Leaf 0, we can see that countries with life expectancy less than 74 and median age less than 33 have low GDP per capita. This rule holds for more than 65% of the countries with low GDP per capita.
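
Both rules can be sanity-checked directly against the labeled data, without the tree. A sketch, using the column order of trainingData defined above and the thresholds read off the tree:

    (* Leaf 18 predicate: life expectancy > 79, birth rate fraction < 0.014,
       median age > 33, literacy fraction > 0.94 *)
    leaf18 = Select[trainingData,
       #[[1]] > 79 && #[[2]] < 0.014 && #[[3]] > 33 && #[[4]] > 0.94 &];
    Tally[leaf18[[All, -1]]]

    (* Leaf 0 predicate: life expectancy < 74 and median age < 33 *)
    leaf0 = Select[trainingData, #[[1]] < 74 && #[[3]] < 33 &];
    Tally[leaf0[[All, -1]]]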

I also made decision trees with more comprehensive sets of variables. Here is a sample of the training data:

Sample of high GDP per capita countries with more data variables

And here is the resulting decision tree:

Decision tree for GDP per capita more variables