Movie genre associations

In this post we are going to look at genre associations deduced by extracting association rules from a catalog of movies. For example, we might want to confirm that most romance movies are also dramas, and we want to find similar rules. For more details see this user guide https://github.com/antononcube/MathematicaForPrediction/blob/master/Documentation/MovieLens%20genre%20associations.pdf at Mathematica for Prediction at GitHub .

The movie data was taken from the page MovieLens Data Sets (http://www.grouplens.org/taxonomy/term/14) of the site GroupLens Research (http://www.grouplens.org). More precisely, the data set named “MovieLens 10M Data Set” was taken.

We are interested in the movie-genre relations only and if we look only at the movie-genre relations of “MovieLens 10M Data Set” the movies are poorly interconnected. Approximately 40% of the movies have only one genre. We use MovieLens since it is publicly available, easy to ingest, and the presented results can be reproduced.

Here is a sample of the movie-genre data:
MovieLens 10k movie-genre data sample

Let us first look into some descriptive statistics of the data set.

We have 10681 movies, and 18 genres. Here is a breakdown of the movies across the genres:
MovieLens 10k movie-genre data genre breakdown

Here are a table of descriptive statistics and a histogram of the distribution of the number of genres:
MovieLens 10k Descriptive statistics for the number of genres

Here are a table of descriptive statistics and a histogram with the distribution of the movies across the release years:
MovieLens 10k Release years descriptive statistics

I applied to the movie-genre data the algorithm Apriori which is an associative rules learning algorithm. The Mathematica implementation is available at this link: https://github.com/antononcube/MathematicaForPrediction/blob/master/AprioriAlgorithm.m

With the Apriori algorithm we can find frequent association genre sets. In order to apply Apriori from each data row only the genres are taken. In this way we can see each movie as a “basket” of genres or as a “transaction” of genres, and the total movie catalog as a set of transactions.

In order to extract association rules from each frequent set we apply different measures. The GitHub package provides five measures: Support, Confidence, Lift, Leverage, and Conviction. The measure Support follows the standard mathematical definition (fraction of the total number of transactions) and it is used to find the association sets. Conviction is considered to be the best for uncovering interesting rules. The definition and interpretation of the measures are given in these tables:
Tables of definitions and properties of association rules measures

I implemented a dynamic interface to browse the association sets that have support higher than 0.25% :
Association sets dynamic interface

This 2×2 table of interface snapshots shows the association sets that have the largest support:
Association sets interface snapshots

We can see that — as expected — “Romance” and “Drama” are highly associated. Other expected associations are {“Comedy”, “Drama”, “Romance”} , {“Crime”, “Drama”, “Thriller”}, and {“Action”, “Crime”, “Thriller”}.

I also implemented a dynamic interface for browsing the association rules extracted from the frequent sets. Here is a list of snapshots of that interface:
1. Association rules of 2 items for all genres ordered by Conviction:
2 item rules for All ordered by Conviction
2. Association rules of 3 items for all genres ordered by Conviction:
3 item rules for All ordered by Conviction
3. Association rules of 2 items with “Drama” ordered by Conviction:
2 item rules for Drama ordered by Conviction
4. Association rules of 3 items with “Drama” ordered by Conviction:
3 item rules for Drama ordered by Conviction

Again, the results we see are expected. For example, looking at the measure Confidence we can see that for the MovieLens 10k catalog 82% of the romance-war movies are also dramas, and 73% of the war movies are dramas. In a certain sense, “War” and {“Romance”, “War”} function like sub-genres of “Drama”.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s