Statistical thesaurus from NPR podcasts

Five months ago I worked with transcripts of National Public Radio (NPR) podcasts. The transcripts are available at http://www.npr.org — see for example “From child actor to artist…“.

Using nearly 5000 transcripts I experimented with topic extraction and statistical thesaurus derivation. The topics are too bulky to show here, but I am going to show some of the statistical thesaurus entries.

I used dimension reduction with Non-Negative Matrix Factorization (NNMF). For more detailed explanations, code for computations, and experimental results see this paper “Topic and thesaurus extraction from a document collection” provided by the MathematicaForPrediction project at GitHub. (The code for NNMF is also provided by the MathematicaForPrediction project at GitHub.)

First let me describe the data. The collection has 5123 transcripts.

Here is a sample of the transcripts (only the first 400 characters of each are taken):
NPR podcast sample 400 characters per podcast

Here is the distribution of the string lengths of the transcripts:
5123 NPR podcasts string length

I removed custom selected stop words from the transcripts. I also stemmed the words using the stemmer called snowball, see http://snowball.tartarus.org. The stemmed words are called “terms” below.

Here are descriptive statistics and the distribution of the number of transcripts per term:
Transcripts per term

Here are descriptive statistics and the distribution of the number of terms per transcript:
Terms per transcript

I did not compute the whole statistical thesaurus. Instead I made a function that computes the thesaurus entry of a given word using the right NNMF factor with proper normalization.

Here are sample results of the thesaurus entry “retrieval” (note that the right column contains word stems):
Statistical thesaurus entries

Advertisements

One thought on “Statistical thesaurus from NPR podcasts

  1. Pingback: Comparison of PCA and NNMF over image de-noising | Mathematica for prediction algorithms

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s