RSparseMatrix for sparse matrices with named rows and columns

Introduction

Over the last few years I have made heavy use of R’s Matrix library, which implements sparse matrix objects and efficient computations with them. Row names and column names can be assigned to and retrieved from the sparse matrices of R’s Matrix library with the functions rownames and colnames. Sometimes I miss this in Mathematica, so I started a Mathematica package that implements similar functionalities. The package is named RSparseMatrix.m and is implemented purely in the Mathematica language (i.e. it does not use RLink). It can be loaded/downloaded from MathematicaForPrediction at GitHub:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/RSparseMatrix.m"]

The package provides functions to create and operate on RSparseMatrix objects, which are basically SparseArray objects with row and column names. A major design decision is to restrict these functionalities to two-dimensional sparse arrays and to lists of strings as row and column names. (Note that the package is not finished, and in some functions the row and column names are ignored.)

The package attempts to cover as many as possible of the functionalities for sparse matrix objects that are provided by R’s Matrix library: sub-matrix extraction by row and column names, row and column names propagation for dot products, row and column binding of sparse matrices, row and column sums, etc. This document has examples and tests for RSparseMatrix.m.

My participation in WTC 2015 with a talk about Mathematica and R comparison was one of the main motivators to write this blog post. Another is this Mathematica StackExchange discussion. (And a third one is seeing tonight the impressive movie “The Martian”: such a display of the triumph of humans over space and nature, using technology and science in a creative way, made me wanna discuss how to make some programming objects more convenient.)

Basic examples

Creation

rmat = MakeRSparseMatrix[
 {{1, 1} -> 1, {2, 2} -> 2, {4, 3} -> 3, {1, 4} -> 4, {3, 5} -> 2},
 "ColumnNames" -> {"a", "b", "c", "d", "e"},
 "RowNames" -> {"A", "B", "C", "D"},
 "DimensionNames" -> {"U", "V"}]

The function MatrixForm shows the RSparseMatrix objects with their row and column names:

rmat // MatrixForm
rmat-MatrixForm

The RSparseMatrix objects can be created from SparseArray objects:

Query functions

These functions can be used to retrieve the names of rows, columns, and dimensions. They correspond to R’s functions rownames, colnames, dimnames.

In[154]:= RowNames[rmat]
Out[154]= {"A", "B", "C", "D"}
In[155]:= ColumnNames[rmat]
Out[155]= {"a", "b", "c", "d", "e"}
In[156]:= DimensionNames[rmat]
Out[156]= {"U", "V"}

Functions that work on SparseArray

Of course, since RSparseMatrix is based on SparseArray, we would expect the functions that work on SparseArray objects to work on RSparseMatrix objects too, e.g. Dimensions, ArrayRules, Transpose, Total, and others.

In[157]:= Dimensions[rmat]
Out[157]= {4, 5}
In[158]:= ArrayRules[rmat]
Out[158]= {{1, 1} -> 1, {1, 4} -> 4, {2, 2} -> 2, {3, 5} -> 2, {4, 3} -> 3, {_, _} -> 0}
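
Row and column sums, which the introduction lists among the covered functionalities, can be sketched with Total on a plain SparseArray. This is a minimal illustration that does not use the package: the variables sa, rowNames, and colNames are assumptions that mirror rmat above, and the names are attached by hand with AssociationThread.

```mathematica
(* Assumed stand-ins for rmat: a plain SparseArray plus two lists of names. *)
sa = SparseArray[{{1, 1} -> 1, {2, 2} -> 2, {4, 3} -> 3, {1, 4} -> 4, {3, 5} -> 2}, {4, 5}];
rowNames = {"A", "B", "C", "D"};
colNames = {"a", "b", "c", "d", "e"};

(* Total at level 1 adds the rows together and gives one value per column. *)
colSums = AssociationThread[colNames, Normal[Total[sa]]]
(* <|"a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4, "e" -> 2|> *)

(* Total at level {2} adds within each row and gives one value per row. *)
rowSums = AssociationThread[rowNames, Normal[Total[sa, {2}]]]
(* <|"A" -> 5, "B" -> 2, "C" -> 2, "D" -> 3|> *)
```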

Dot product

Row names and column names are respected for dot products if that leads to meaningful assignments. The examples below demonstrate a general principle:

When a matrix operation can be performed on the underlying sparse arrays but the row names or column names do not coincide, the names are dropped.

In the tables with examples below the last rows show the heads of the results.

Matrix by vector

RSparseMartix-Matrix-by-vector-examples-grid

Matrix by matrix

RSparseMartix-Matrix-by-matrix-SA-examples-grid

RSparseMartix-Matrix-by-matrix-examples-grid
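
The name-dropping principle for dot products can be sketched without the package. The code below is only an illustration under stated assumptions: namedMatrix and namedDot are hypothetical helpers, and storing the names in an Association is my simplification here, not the internal representation used by RSparseMatrix.m.

```mathematica
(* Hypothetical container: a SparseArray together with its row and column names. *)
namedMatrix[sa_SparseArray, rn_List, cn_List] :=
  <|"Matrix" -> sa, "RowNames" -> rn, "ColumnNames" -> cn|>;

(* Keep the names only when the column names of the first factor
   coincide with the row names of the second; otherwise drop them. *)
namedDot[a_Association, b_Association] :=
  With[{prod = a["Matrix"] . b["Matrix"]},
   If[a["ColumnNames"] === b["RowNames"],
    namedMatrix[prod, a["RowNames"], b["ColumnNames"]],
    prod (* names dropped: a plain SparseArray is returned *)]];

m1 = namedMatrix[SparseArray[{{1, 2} -> 1, {2, 1} -> 3}, {2, 2}], {"A", "B"}, {"a", "b"}];
m2 = namedMatrix[SparseArray[{{1, 1} -> 2, {2, 3} -> 5}, {2, 3}], {"a", "b"}, {"x", "y", "z"}];
m3 = namedMatrix[SparseArray[{{1, 1} -> 2, {2, 3} -> 5}, {2, 3}], {"p", "q"}, {"x", "y", "z"}];

Head[namedDot[m1, m2]]  (* Association: the inner names coincide, so names propagate *)
Head[namedDot[m1, m3]]  (* SparseArray: the inner names differ, so names are dropped *)
```

As in the example tables, the head of the result tells whether the names survived the operation.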

Part

A major useful feature is to have Part work with row and column names. The implementation of that additional functionality for Part is demonstrated below.

In the cases when the dimension drops, sparse arrays or numbers are returned. In R the operation “[” has the parameter “drop”: the expression “smat[1,,drop=F]” is going to be a sparse matrix, and the expression “smat[1,,drop=T]” is going to be a dense vector. The corresponding implementation would be to have the option “Drop->True|False” for Part, but that does not seem to be a good idea. Besides, we can easily emulate the “drop” option of R using “{_?AtomQ}” inside Part, i.e. by wrapping a single index in a list.

RSparseMartix-Part-scenarios-examples-grid
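
The idea behind Part with names, and the emulation of R’s “drop”, can be sketched on a plain SparseArray. The helpers toIndex and namedPart below are hypothetical names introduced for illustration; the package itself extends Part directly, and its implementation may differ.

```mathematica
(* Assumed stand-ins for rmat: the same entries, with name-to-position lookups. *)
sa = SparseArray[{{1, 1} -> 1, {2, 2} -> 2, {4, 3} -> 3, {1, 4} -> 4, {3, 5} -> 2}, {4, 5}];
rowIndex = AssociationThread[{"A", "B", "C", "D"}, Range[4]];
colIndex = AssociationThread[{"a", "b", "c", "d", "e"}, Range[5]];

(* Translate names to positions; integers, All, and spans pass through unchanged. *)
toIndex[index_Association, spec_String] := index[spec];
toIndex[index_Association, spec : {__String}] := Lookup[index, spec];
toIndex[_, spec_] := spec;

namedPart[s_SparseArray, r_, c_] := s[[toIndex[rowIndex, r], toIndex[colIndex, c]]];

namedPart[sa, "A", "d"]                (* 4: both dimensions drop, a number is returned *)
namedPart[sa, "A", All]                (* sparse vector: the row dimension drops *)
namedPart[sa, {"A"}, All]              (* 1 x 5 sparse matrix: the list wrapper keeps the dimension *)
namedPart[sa, {"A", "C"}, {"d", "e"}]  (* 2 x 2 sparse sub-matrix *)
```

Wrapping the single row name in a list plays the role of “drop=F” in R, which is why a separate option for Part is not needed.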

Neat example

Consider this incidence matrix that represents a bi-partite graph of relationships of actors starring in movies:

Bi-partite-matrix-for-Movies-Actors-graph

We can make an RSparseMatrix object of it with named rows and columns (rBiMat).

Here is the corresponding graph:

Movies-Actors-graph

If we want to see which actors have participated in movies together with Orlando Bloom we can do the following:

Actors-starring-with-Orlando-Bloom
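
One way such a co-starring query can be computed is sketched below on a plain SparseArray. Everything in it is an assumption made for illustration: the toy incidence matrix biMat, the placeholder movie and actor names, and the helper coStars are not the data or the code behind the result above.

```mathematica
(* Toy movies-by-actors incidence matrix with placeholder names (illustrative data only). *)
movies = {"Movie1", "Movie2", "Movie3"};
actors = {"ActorA", "ActorB", "ActorC", "ActorD"};
biMat = SparseArray[
   {{1, 1} -> 1, {1, 2} -> 1, {2, 2} -> 1, {2, 3} -> 1, {3, 3} -> 1, {3, 4} -> 1}, {3, 4}];

coStars[actor_String] :=
  Module[{col, counts},
   col = First@FirstPosition[actors, actor];
   (* The actor's column marks his or her movies; multiplying it into the
      incidence matrix counts, for every actor, the movies shared with that actor. *)
   counts = Normal[biMat[[All, col]] . biMat];
   DeleteCases[Pick[actors, Positive[counts]], actor]];

coStars["ActorB"]  (* {"ActorA", "ActorC"} *)
```

With named rows and columns the position lookup by FirstPosition becomes unnecessary, since the column can be taken by name.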

Movie genre associations

In this post we are going to look at genre associations deduced by extracting association rules from a catalog of movies. For example, we might want to confirm that most romance movies are also dramas, and we want to find similar rules. For more details see this user guide https://github.com/antononcube/MathematicaForPrediction/blob/master/Documentation/MovieLens%20genre%20associations.pdf at Mathematica for Prediction at GitHub.

The movie data was taken from the page MovieLens Data Sets (http://www.grouplens.org/taxonomy/term/14) of the site GroupLens Research (http://www.grouplens.org). More precisely, the data set named “MovieLens 10M Data Set” was taken.

We are interested in the movie-genre relations only, and if we look only at those relations in the “MovieLens 10M Data Set”, the movies are poorly interconnected: approximately 40% of the movies have only one genre. We use MovieLens since it is publicly available, easy to ingest, and the presented results can be reproduced.

Here is a sample of the movie-genre data:
MovieLens 10k movie-genre data sample

Let us first look into some descriptive statistics of the data set.

We have 10681 movies, and 18 genres. Here is a breakdown of the movies across the genres:
MovieLens 10k movie-genre data genre breakdown

Here are a table of descriptive statistics and a histogram of the distribution of the number of genres:
MovieLens 10k Descriptive statistics for the number of genres

Here are a table of descriptive statistics and a histogram with the distribution of the movies across the release years:
MovieLens 10k Release years descriptive statistics

I applied the Apriori algorithm, an association rule learning algorithm, to the movie-genre data. The Mathematica implementation is available at this link: https://github.com/antononcube/MathematicaForPrediction/blob/master/AprioriAlgorithm.m

With the Apriori algorithm we can find frequent genre association sets. In order to apply Apriori, only the genres are taken from each data row. In this way we can see each movie as a “basket” of genres or as a “transaction” of genres, and the total movie catalog as a set of transactions.
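
The “basket” view and the notion of a frequent set can be sketched as follows. This is not the code of AprioriAlgorithm.m: the five toy baskets, the helper support, and the threshold 2/5 are assumptions made for illustration, and the frequent sets are found by brute-force enumeration rather than by Apriori’s candidate pruning.

```mathematica
(* Toy transactions: each movie is a basket of genres (illustrative data, not from MovieLens). *)
baskets = {{"Drama", "Romance"}, {"Drama", "War"}, {"Comedy"},
   {"Drama", "Romance", "War"}, {"Comedy", "Romance"}};

(* Support: the fraction of baskets that contain every item of the given set. *)
support[items_List] := Count[baskets, b_ /; SubsetQ[b, items]]/Length[baskets];

support[{"Drama"}]             (* 3/5 *)
support[{"Drama", "Romance"}]  (* 2/5 *)

(* Frequent sets with one or two genres, found by enumerating all candidates.
   Apriori reaches the same sets while skipping supersets of infrequent sets. *)
genres = Union @@ baskets;
frequentSets = Select[Subsets[genres, {1, 2}], support[#] >= 2/5 &]
```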

In order to extract association rules from each frequent set we apply different measures. The GitHub package provides five measures: Support, Confidence, Lift, Leverage, and Conviction. The measure Support follows the standard mathematical definition (fraction of the total number of transactions) and it is used to find the association sets. Conviction is considered to be the best for uncovering interesting rules. The definition and interpretation of the measures are given in these tables:
Tables of definitions and properties of association rules measures
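
For readers without access to the tables, the measures can be sketched in code. The definitions below are the commonly used textbook ones and are an assumption on my part that they agree with the tables; the toy baskets and the function names are likewise illustrative and are not taken from the GitHub package.

```mathematica
(* Toy transactions (illustrative data, not from MovieLens). *)
baskets = {{"Drama", "Romance"}, {"Drama", "War"}, {"Comedy"},
   {"Drama", "Romance", "War"}, {"Comedy", "Romance"}};

(* Fraction of baskets that contain every item of the given set. *)
support[items_List] := Count[baskets, b_ /; SubsetQ[b, items]]/Length[baskets];

(* Measures for a rule a -> b; the antecedent a is assumed to have nonzero support. *)
confidence[a_List, b_List] := support[Union[a, b]]/support[a];
lift[a_List, b_List] := support[Union[a, b]]/(support[a] support[b]);
leverage[a_List, b_List] := support[Union[a, b]] - support[a] support[b];
conviction[a_List, b_List] :=
  With[{c = confidence[a, b]},
   (* A rule that always holds has confidence 1 and unbounded conviction. *)
   If[c == 1, Infinity, (1 - support[b])/(1 - c)]];

Through[{confidence, lift, leverage, conviction}[{"Romance"}, {"Drama"}]]
(* {2/3, 10/9, 1/25, 6/5} *)
```

A lift above 1, a positive leverage, and a conviction above 1 all indicate that the two sides of the rule occur together more often than they would if they were independent.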

I implemented a dynamic interface to browse the association sets that have support higher than 0.25%:
Association sets dynamic interface

This 2×2 table of interface snapshots shows the association sets that have the largest support:
Association sets interface snapshots

We can see that, as expected, “Romance” and “Drama” are highly associated. Other expected associations are {“Comedy”, “Drama”, “Romance”}, {“Crime”, “Drama”, “Thriller”}, and {“Action”, “Crime”, “Thriller”}.

I also implemented a dynamic interface for browsing the association rules extracted from the frequent sets. Here is a list of snapshots of that interface:
1. Association rules of 2 items for all genres ordered by Conviction:
2 item rules for All ordered by Conviction
2. Association rules of 3 items for all genres ordered by Conviction:
3 item rules for All ordered by Conviction
3. Association rules of 2 items with “Drama” ordered by Conviction:
2 item rules for Drama ordered by Conviction
4. Association rules of 3 items with “Drama” ordered by Conviction:
3 item rules for Drama ordered by Conviction

Again, the results we see are expected. For example, looking at the measure Confidence we can see that for the MovieLens 10k catalog 82% of the romance-war movies are also dramas, and 73% of the war movies are dramas. In a certain sense, “War” and {“Romance”, “War”} function like sub-genres of “Drama”.

Digit recognition interface with an RPN calculator

This post is a follow-up of my previous blog post “Classification of handwritten digits” about recognition of digit drawings.

A friend of mine pointed out that “the recognition of digits was one of the assignments in the Coursera Machine Learning course, using Matlab or Octave.” (Octave is a free, open-source alternative to Matlab.)

So I decided to do a digit recognition and interpretation interactive interface in/with Mathematica, which is not something that can be done with Matlab or Octave. Here is a link to a video demonstrating the interface: http://youtu.be/iPF5Apa6OjY .

Here is a screenshot:
Digit recognition with an RPN calculator screenshot with 7