# Movie genre associations

In this post we look at genre associations deduced by extracting association rules from a catalog of movies. For example, we might want to confirm that most romance movies are also dramas, and to find other rules of that kind. For more details see the user guide https://github.com/antononcube/MathematicaForPrediction/blob/master/Documentation/MovieLens%20genre%20associations.pdf at Mathematica for Prediction at GitHub.

The movie data was taken from the page MovieLens Data Sets (http://www.grouplens.org/taxonomy/term/14) of the site GroupLens Research (http://www.grouplens.org). More precisely, the data set named “MovieLens 10M Data Set” was taken.

We are interested only in the movie-genre relations, and if we look only at those relations in the "MovieLens 10M Data Set" the movies are poorly interconnected: approximately 40% of the movies have only one genre. We use MovieLens because it is publicly available and easy to ingest, and because the presented results can be reproduced.

Here is a sample of the movie-genre data:

Let us first look into some descriptive statistics of the data set.

We have 10681 movies and 18 genres. Here is a breakdown of the movies across the genres:
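The post's actual ingestion and statistics are done in Mathematica, but as a minimal sketch, here is how such a genre breakdown can be computed in Python, assuming the MovieLens `movies.dat` format (`MovieID::Title::Genre1|Genre2|...`); the two sample lines below are illustrative, not the full catalog:

```python
from collections import Counter

def parse_movies(lines):
    """Parse MovieLens-style lines 'MovieID::Title::Genre1|Genre2|...'
    into a list of genre sets, one set per movie."""
    baskets = []
    for line in lines:
        parts = line.strip().split("::")
        if len(parts) != 3:
            continue
        genres = parts[2].split("|")
        baskets.append({g for g in genres if g})
    return baskets

# Two illustrative records in the MovieLens movies.dat format:
sample = [
    "1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy",
    "31::Dangerous Minds (1995)::Drama",
]
baskets = parse_movies(sample)
# Count how many movies carry each genre:
genre_counts = Counter(g for basket in baskets for g in basket)
```

The same counts, aggregated over the full catalog, give the breakdown of movies across genres shown above.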

Here is a table of descriptive statistics and a histogram of the distribution of the number of genres per movie:

Here is a table of descriptive statistics and a histogram of the distribution of the movies across the release years:

I applied the Apriori algorithm, an association rule learning algorithm, to the movie-genre data. The Mathematica implementation is available at this link: https://github.com/antononcube/MathematicaForPrediction/blob/master/AprioriAlgorithm.m

With the Apriori algorithm we can find frequent genre association sets. To apply Apriori, only the genres are taken from each data row. In this way each movie can be seen as a "basket" (or "transaction") of genres, and the whole movie catalog as a set of transactions.
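The post uses the Mathematica package linked above; as a hedged sketch of the underlying idea, here is a plain level-wise Apriori in Python that finds all genre sets whose support meets a threshold. The four-movie basket list is made up for illustration:

```python
def apriori(baskets, min_support):
    """Find all itemsets whose support (fraction of baskets that
    contain the set) is at least min_support, level by level."""
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent single items.
    items = {g for b in baskets for g in b}
    current = [frozenset([g]) for g in items
               if support(frozenset([g])) >= min_support]
    frequent = {s: support(s) for s in current}

    # Level k: join frequent (k-1)-sets into k-set candidates, keep
    # the frequent ones; stop when no candidates survive.
    k = 2
    while current:
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.update({s: support(s) for s in current})
        k += 1
    return frequent

# Illustrative "baskets" of genres, one per movie:
baskets = [
    {"Romance", "Drama"},
    {"Drama"},
    {"Romance", "Drama", "Comedy"},
    {"Action"},
]
frequent = apriori(baskets, min_support=0.5)
```

Here `{"Romance", "Drama"}` appears in 2 of 4 baskets, so it survives the 0.5 support threshold, while the singleton `{"Action"}` (support 0.25) does not.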

To extract association rules from each frequent set we apply different measures. The GitHub package provides five: Support, Confidence, Lift, Leverage, and Conviction. Support follows the standard mathematical definition (the fraction of all transactions that contain the set) and is used to find the frequent association sets. Conviction is considered to be the best for uncovering interesting rules. The definitions and interpretations of the measures are given in these tables:
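As a minimal sketch (in Python rather than the post's Mathematica, using the standard textbook definitions of these measures), here is how all five can be computed for a single rule X → Y; the basket list is again illustrative:

```python
def rule_measures(baskets, antecedent, consequent):
    """Standard association-rule measures for the rule X -> Y:
    Support    = P(X and Y)
    Confidence = P(X and Y) / P(X)
    Lift       = Confidence / P(Y)
    Leverage   = P(X and Y) - P(X) * P(Y)
    Conviction = (1 - P(Y)) / (1 - Confidence)"""
    n = len(baskets)
    X, Y = frozenset(antecedent), frozenset(consequent)
    pX = sum(1 for b in baskets if X <= b) / n
    pY = sum(1 for b in baskets if Y <= b) / n
    pXY = sum(1 for b in baskets if (X | Y) <= b) / n
    confidence = pXY / pX
    return {
        "Support": pXY,
        "Confidence": confidence,
        "Lift": confidence / pY,
        "Leverage": pXY - pX * pY,
        "Conviction": (float("inf") if confidence == 1
                       else (1 - pY) / (1 - confidence)),
    }

# Illustrative baskets: Romance appears in 3 of 4, Drama in 3 of 4,
# both together in 2 of 4.
baskets = [
    {"Romance", "Drama"},
    {"Romance", "Drama"},
    {"Romance"},
    {"Drama"},
]
m = rule_measures(baskets, {"Romance"}, {"Drama"})
```

For this toy data, Confidence({"Romance"} → {"Drama"}) = (2/4) / (3/4) = 2/3.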

I implemented a dynamic interface to browse the association sets that have support higher than 0.25%:

This 2×2 table of interface snapshots shows the association sets that have the largest support:

We can see that — as expected — “Romance” and “Drama” are highly associated. Other expected associations are {“Comedy”, “Drama”, “Romance”} , {“Crime”, “Drama”, “Thriller”}, and {“Action”, “Crime”, “Thriller”}.

I also implemented a dynamic interface for browsing the association rules extracted from the frequent sets. Here is a list of snapshots of that interface:
1. Association rules of 2 items for all genres ordered by Conviction:

2. Association rules of 3 items for all genres ordered by Conviction:

3. Association rules of 2 items with “Drama” ordered by Conviction:

4. Association rules of 3 items with “Drama” ordered by Conviction:

Again, the results we see are expected. For example, looking at the measure Confidence we can see that for the MovieLens 10M catalog 82% of the romance-war movies are also dramas, and 73% of the war movies are dramas. In a certain sense, "War" and {"Romance", "War"} function like sub-genres of "Drama".
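To make the Confidence reading concrete: a figure like "82% of the romance-war movies are also dramas" is exactly Confidence({"Romance", "War"} → {"Drama"}) = support({"Romance", "War", "Drama"}) / support({"Romance", "War"}). The counts below are made up for illustration and are not the actual MovieLens numbers:

```python
# Hypothetical counts for illustration only (not the real catalog figures):
n_romance_war = 100         # movies tagged both Romance and War
n_romance_war_drama = 82    # of those, movies also tagged Drama

# Confidence of the rule {Romance, War} -> {Drama}:
confidence = n_romance_war_drama / n_romance_war  # 0.82, i.e. 82%
```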
