Using nearly 5000 transcripts I experimented with topic extraction and statistical thesaurus derivation. The topics are too bulky to show here, but I am going to show some of the statistical thesaurus entries.
I used dimension reduction with Non-Negative Matrix Factorization (NNMF). For more detailed explanations, code for computations, and experimental results see this paper “Topic and thesaurus extraction from a document collection” provided by the MathematicaForPrediction project at GitHub. (The code for NNMF is also provided by the MathematicaForPrediction project at GitHub.)
First let me describe the data. The collection has 5123 transcripts.
I removed custom selected stop words from the transcripts. I also stemmed the words using the stemmer called snowball, see http://snowball.tartarus.org. The stemmed words are called “terms” below.
I did not compute the whole statistical thesaurus. Instead I made a function that computes the thesaurus entry of a given word using the right NNMF factor with proper normalization.