Five months ago I worked with transcripts of National Public Radio (NPR) podcasts. The transcripts are available at http://www.npr.org; see, for example, “From child actor to artist…”.
Using nearly 5000 transcripts I experimented with topic extraction and statistical thesaurus derivation. The topics are too bulky to show here, but I am going to show some of the statistical thesaurus entries.
I used dimension reduction with Non-Negative Matrix Factorization (NNMF). For more detailed explanations, the code for the computations, and experimental results, see the paper “Topic and thesaurus extraction from a document collection” provided by the MathematicaForPrediction project at GitHub. (The NNMF code itself is also part of that project.)
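To give a flavor of the computation, here is a minimal NNMF sketch in Mathematica using the classic multiplicative update rules. It is an illustration only, not the NNMF implementation from MathematicaForPrediction; the variable docTermMatrix below is a hypothetical documents-by-terms matrix of term weights.

    (* Minimal NNMF sketch: factor a nonnegative matrix v into w.h     *)
    (* using Lee-Seung multiplicative updates. Illustration only.      *)
    NNMFSketch[v_?MatrixQ, k_Integer, steps_: 100] :=
      Module[{w, h, eps = 10.^-9},
        w = RandomReal[{0, 1}, {Length[v], k}];
        h = RandomReal[{0, 1}, {k, Length[First[v]]}];
        Do[
          h = h*(Transpose[w].v)/(Transpose[w].w.h + eps);
          w = w*(v.Transpose[h])/(w.h.Transpose[h] + eps),
          {steps}];
        {w, h}];

    (* Hypothetical usage: 20 topics from a documents-by-terms matrix. *)
    {W, H} = NNMFSketch[docTermMatrix, 20];

The sketch only shows the shape of the computation: W maps documents to topics and H maps topics to terms, and it is the right factor H that the statistical thesaurus is derived from.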
First, let me describe the data. The collection has 5123 transcripts.
Here is a sample of the transcripts (only the first 400 characters of each are taken):
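Such a sample can be taken with a snippet along the following lines, assuming transcripts is the list of transcript strings:

    (* Take a random sample and clip each transcript to its first 400 characters. *)
    StringTake[#, Min[400, StringLength[#]]] & /@ RandomSample[transcripts, 5]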
Here is the distribution of the string lengths of the transcripts:
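The length distribution can be plotted with something like the following:

    (* Distribution of transcript lengths, in characters. *)
    Histogram[StringLength /@ transcripts,
      AxesLabel -> {"characters", "number of transcripts"}]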
I removed a custom-selected set of stop words from the transcripts. I also stemmed the words with the Snowball stemmer (see http://snowball.tartarus.org). The stemmed words are called “terms” below.
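The pre-processing step can be sketched as follows. The names are assumptions: stopWords stands for the custom stop-word list, and the built-in WordStem is used here in place of the Snowball stemmer from the original run.

    (* Split into lowercase words, drop stop words, and stem the rest. *)
    toTerms[text_String, stopWords_List] :=
      WordStem /@ Select[
        ToLowerCase /@ StringSplit[text, Except[WordCharacter] ..],
        StringLength[#] > 0 && ! MemberQ[stopWords, #] &];

Mapping toTerms over the transcripts gives the term lists from which the documents-by-terms matrix is built.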
Here are descriptive statistics and the distribution of the number of transcripts per term:
Here are descriptive statistics and the distribution of the number of terms per transcript:
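Both statistics can be read off the documents-by-terms contingency matrix; a sketch is given below. The name docTermMatrix is again an assumption, standing for the matrix of term counts with one row per transcript and one column per term.

    (* Number of transcripts containing each term (column-wise presence counts)   *)
    transcriptsPerTerm = Total[Unitize[docTermMatrix]];
    (* Number of distinct terms in each transcript (row-wise presence counts)     *)
    termsPerTranscript = Total[Unitize[docTermMatrix], {2}];

    (* Descriptive statistics and histograms for both quantities. *)
    {Min[#], Mean[N[#]], Median[#], Max[#]} & /@ {transcriptsPerTerm, termsPerTranscript}
    Histogram /@ {transcriptsPerTerm, termsPerTranscript}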
I did not compute the whole statistical thesaurus. Instead, I made a function that computes the thesaurus entry of a given word from the right NNMF factor, with proper normalization.
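A hedged sketch of such a function is given below. It assumes H is the right NNMF factor (topics by terms) and terms is the list of terms aligned with the columns of H; the normalization and ranking details of the actual function described in the linked paper may differ, and the names ThesaurusEntry, H, and terms are illustrative.

    (* Thesaurus entry sketch: normalize each term's column of the right *)
    (* factor to unit length and return the n terms whose topic-space    *)
    (* vectors are closest (by cosine similarity) to the query term's.   *)
    ThesaurusEntry[h_?MatrixQ, terms_List, word_String, n_Integer: 12] :=
      Module[{cols, pos, v, sims},
        cols = Map[If[Norm[#] > 0, #/Norm[#], #] &, Transpose[h]];
        pos = Position[terms, WordStem[word]];
        If[pos === {}, Return[{}]];
        v = cols[[pos[[1, 1]]]];
        sims = cols.v;                        (* cosine similarities *)
        Reverse[terms[[Ordering[sims, -n]]]]];

With such a function, a call like ThesaurusEntry[H, terms, "retrieval"] would return the stemmed terms closest to the stem of “retrieval” in topic space.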
Here are sample results for the thesaurus entry of “retrieval” (note that the right column contains word stems):