The Great conversation in USA presidential speeches

Introduction

This document shows a way to chart in Mathematica / WL the evolution of topics in collections of texts. The making of this document (and related code) is primarily motivated by the fascinating concept of the Great Conversation, [Wk1, MA1]. In brief, all western civilization books are based on 103 great ideas; if we find the great ideas each significant book is based on we can construct a time-line (spanning centuries) of the great conversation between the authors; see [MA1, MA2, MA3].

Instead of finding the great ideas in a text collection we extract topics statistically, using dimension reduction with Non-Negative Matrix Factorization (NNMF), [AAp3, AA1, AA2].

The presented computational results are based on the text collections of State of the Union speeches of USA presidents [D2]. The code in this document can be easily configured to use the much smaller text collection [D1] available online and in Mathematica/WL. (The collection [D1] is fairly small, 51 documents; the collection [D2] is much larger, 2453 documents.)

The procedures (and code) described in this document, of course, work on other types of text collections. For example: movie reviews, podcasts, editorial articles of a magazine, etc.

A secondary objective of this document is to illustrate the use of the monadic programming pipeline as a Software design pattern, [AA3]. In order to make the code concise in this document I wrote the package MonadicLatentSemanticAnalysis.m, [AAp5]. Compare with the code given in [AA1].

The very first version of this document was written for the 2017 summer course “Data Science for the Humanities” at the University of Oxford, UK.

Outline of the procedure applied

The procedure described in this document has the following steps.

  1. Get a collection of documents with known dates of publishing.
    • Or other types of tags associated with the documents.
  2. Do preliminary analysis of the document collection.
    • Number of documents; number of unique words.

    • Number of words per document; number of documents per word.

    • (Some of the statistics of this step are done easier after the Linear vector space representation step.)

  3. Optionally perform Natural Language Processing (NLP) tasks.

    1. Obtain or derive stop words.

    2. Remove stop words from the texts.

    3. Apply stemming to the words in the texts.

  4. Linear vector space representation.

    • This means that we represent the collection with a document-word matrix.

    • Each unique word is a basis vector in that space.

    • For each document the corresponding point in that space is derived from the number of appearances of document’s words.

  5. Extract topics.

    • In this document NNMF is used.

    • In order to obtain better results with NNMF some experimentation and refinements of the topics search have to be done.

  6. Map the documents over the extracted topics.

    • The original matrix of the vector space representation is replaced with a matrix with columns representing topics (instead of words.)
  7. Order the topics according to their presence across the years (or other related tags).
    • This can be done with hierarchical clustering.

    • Alternatively,

      1. for a given topic find the weighted mean of the years of the documents that have that topic, and

      2. order the topics according to those mean values.

  8. Visualize the evolution of the documents according to their topics.

    1. This can be done by simply finding the contingency matrix year vs topic.

    2. For the president speeches we can use the president names for time-line temporal axis instead of years.

      • Because the corresponding time intervals of president office occupation do not overlap.

Remark: Some of the functions used in this document combine several steps into one function call (with corresponding parameters.)

Packages

This loads the packages [AAp1-AAp8]:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicLatentSemanticAnalysis.m"];
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicTracing.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/HeatmapPlot.m"];
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/RSparseMatrix.m"];

(Note that some of the packages that are imported automatically by [AAp5].)

The functions of the central package in this document, [AAp5], have the prefix “LSAMon”. Here is a sample of those names:

Short@Names["LSAMon*"]

(* {"LSAMon", "LSAMonAddToContext", "LSAMonApplyTermWeightFunctions", <>, "LSAMonUnit", "LSAMonUnitQ", "LSAMonWhen"} *)

Data load

In this section we load a text collection from a specified source.

The text collection from “Presidential Nomination Acceptance Speeches”, [D1], is small and can be used for multiple code verifications and re-runnings. The “State of Union addresses of USA presidents” text collection from [D2] was converted to a Mathematica/WL object by Christopher Wolfram (and sent to me in a private communication.) The text collection [D2] provides far more interesting results (and they are shown below.)

If[True,
  speeches = ResourceData[ResourceObject["Presidential Nomination Acceptance Speeches"]];
  names = StringSplit[Normal[speeches[[All, "Person"]]][[All, 2]], "::"][[All, 1]],

  (*ELSE*)
  (*State of the union addresses provided by Christopher Wolfram. *)      
  Get["~/MathFiles/Digital humanities/Presidential speeches/speeches.mx"];
  names = Normal[speeches[[All, "Name"]]];
];

dates = Normal[speeches[[All, "Date"]]];
texts = Normal[speeches[[All, "Text"]]];

Dimensions[speeches]

(* {2453, 4} *)

Basic statistics for the texts

Using different contingency matrices we can derive basic statistical information about the document collection. (The document-word matrix is a contingency matrix.)

First we convert the text data in long-form:

docWordRecords = 
  Join @@ MapThread[
    Thread[{##}] &, {Range@Length@texts, names, 
     DateString[#, {"Year"}] & /@ dates, 
     DeleteStopwords@*TextWords /@ ToLowerCase[texts]}, 1];

Here is a sample of the rows of the long-form:

GridTableForm[RandomSample[docWordRecords, 6], 
 TableHeadings -> {"document index", "name", "year", "word"}]

Here is a summary:

Multicolumn[
 RecordsSummary[docWordRecords, {"document index", "name", "year", "word"}, "MaxTallies" -> 8], 4, Dividers -> All, Alignment -> Top]

Using the long form we can compute the document-word matrix:

ctMat = CrossTabulate[docWordRecords[[All, {1, -1}]]];
MatrixPlot[Transpose@Sort@Map[# &, Transpose[ctMat@"XTABMatrix"]], 
 MaxPlotPoints -> 300, ImageSize -> 800, 
 AspectRatio -> 1/3]

Here is the president-word matrix:

ctMat = CrossTabulate[docWordRecords[[All, {2, -1}]]];
MatrixPlot[Transpose@Sort@Map[# &, Transpose[ctMat@"XTABMatrix"]], MaxPlotPoints -> 300, ImageSize -> 800, AspectRatio -> 1/3]

Here is an alternative way to compute text collection statistics through the document-word matrix computed within the monad LSAMon:

LSAMonUnit[texts]⟹LSAMonEchoTextCollectionStatistics[];

Procedure application

Stop words

Here is one way to obtain stop words:

stopWords = Complement[DictionaryLookup["*"], DeleteStopwords[DictionaryLookup["*"]]];
Length[stopWords]
RandomSample[stopWords, 12]

(* 304 *)

(* {"has", "almost", "next", "WHO", "seeming", "together", "rather", "runners-up", "there's", "across", "cannot", "me"} *)

We can complete this list with additional stop words derived from the collection itself. (Not done here.)

Linear vector space representation and dimension reduction

Remark: In the rest of the document we use “term” to mean “word” or “stemmed word”.

The following code makes a document-term matrix from the document collection, exaggerates the representations of the terms using “TF-IDF”, and then does topic extraction through dimension reduction. The dimension reduction is done with NNMF; see [AAp3, AA1, AA2].

SeedRandom[312]

mObj =
  LSAMonUnit[texts]⟹
   LSAMonMakeDocumentTermMatrix[{}, stopWords]⟹
   LSAMonApplyTermWeightFunctions[]⟹
   LSAMonTopicExtraction[Max[5, Ceiling[Length[texts]/100]], 60, 12, "MaxSteps" -> 6, "PrintProfilingInfo" -> True];

This table shows the pipeline commands above with comments:

Detailed description

The monad object mObj has a context of named values that is an Association with the following keys:

Keys[mObj⟹LSAMonTakeContext]

(* {"texts", "docTermMat", "terms", "wDocTermMat", "W", "H", "topicColumnPositions", "automaticTopicNames"} *)

Let us clarify the values by briefly describing the computational steps.

  1. From texts we derive the document-term matrix \text{docTermMat}\in \mathbb{R}^{m \times n}, where n is the number of documents and m is the number of terms.
    • The terms are words or stemmed words.

    • This is done with LSAMonMakeDocumentTermMatrix.

  2. From docTermMat is derived the (weighted) matrix wDocTermMat using “TF-IDF”.

    • This is done with LSAMonApplyTermWeightFunctions.
  3. Using docTermMat we find the terms that are present in sufficiently large number of documents and their column indices are assigned to topicColumnPositions.

  4. Matrix factorization.

    1. Assign to \text{wDocTermMat}[[\text{All},\text{topicsColumnPositions}]], \text{wDocTermMat}[[\text{All},\text{topicsColumnPositions}]]\in \mathbb{R}^{m_1 \times n}, where m_1 = |topicsColumnPositions|.

    2. Compute using NNMF the factorization \text{wDocTermMat}[[\text{All},\text{topicsColumnPositions}]]\approx H W, where W\in \mathbb{R}^{k \times n}, H\in \mathbb{R}^{k \times m_1}, and k is the number of topics.

    3. The values for the keys “W, “H”, and “topicColumnPositions” are computed and assigned by LSAMonTopicExtraction.

  5. From the top terms of each topic are derived automatic topic names and assigned to the key automaticTopicNames in the monad context.

    • Also done by LSAMonTopicExtraction.

Statistical thesaurus

At this point in the object mObj we have the factors of NNMF. Using those factors we can find a statistical thesaurus for a given set of words. The following code calculates such a thesaurus, and echoes it in a tabulated form.

queryWords = {"arms", "banking", "economy", "education", "freedom", 
   "tariff", "welfare", "disarmament", "health", "police"};

mObj⟹
  LSAMonStatisticalThesaurus[queryWords, 12]⟹
  LSAMonEchoStatisticalThesaurus[];

By observing the thesaurus entries we can see that the words in each entry are semantically related.

Note, that the word “welfare” strongly associates with “[applause]”. The rest of the query words do not, which can be seen by examining larger thesaurus entries:

thRes =
  mObj⟹
   LSAMonStatisticalThesaurus[queryWords, 100]⟹
   LSAMonTakeValue;
Cases[thRes, "[applause]", Infinity]

(* {"[applause]", "[applause]"} *)

The second “[applause]” associated word is “education”.

Detailed description

The statistical thesaurus is computed by using the NNMF’s right factor H.

For a given term, its corresponding column in H is found and the nearest neighbors of that column are found in the space \mathbb{R}^{m_1} using Euclidean norm.

Extracted topics

The topics are the rows of the right factor H of the factorization obtained with NNMF .

Let us tabulate the topics found above with LSAMonTopicExtraction :

mObj⟹ LSAMonEchoTopicsTable["NumberOfTerms" -> 6, "MagnificationFactor" -> 0.8, Appearance -> "Horizontal"];

Map documents over the topics

The function LSAMonTopicsRepresentation finds the top outliers for each row of NNMF’s left factor W. (The outliers are found using the package [AAp4].) The obtained list of indices gives the topic representation of the collection of texts.

Short@(mObj⟹LSAMonTopicsRepresentation[]⟹LSAMonTakeContext)["docTopicIndices"]

{{53}, {47, 53}, {25}, {46}, {44}, {15, 42}, {18}, <>, {30}, {33}, {7, 60}, {22, 25}, {12, 13, 25, 30, 49, 59}, {48, 57}, {14, 41}}

Further we can see that if the documents have tags associated with them — like author names or dates — we can make a contingency matrix of tags vs topics. (See [AAp8, AA4].) This is also done by the function LSAMonTopicsRepresentation that takes tags as an argument. If the tags argument is Automatic, then the tags are simply the document indices.

Here is a an example:

rsmat = mObj⟹LSAMonTopicsRepresentation[Automatic]⟹LSAMonTakeValue;
MatrixPlot[rsmat]

Here is an example of calling the function LSAMonTopicsRepresentation with arbitrary tags.

rsmat = mObj⟹LSAMonTopicsRepresentation[DateString[#, "MonthName"] & /@ dates]⟹LSAMonTakeValue;
MatrixPlot[rsmat]

Note that the matrix plots above are very close to the charting of the Great conversation that we are looking for. This can be made more obvious by observing the row names and columns names in the tabulation of the transposed matrix rsmat:

Magnify[#, 0.6] &@MatrixForm[Transpose[rsmat]]

Charting the great conversation

In this section we show several ways to chart the Great Conversation in the collection of speeches.

There are several possible ways to make the chart: using a time-line plot, using heat-map plot, and using appropriate tabulation (with MatrixForm or Grid).

In order to make the code in this section more concise the package RSparseMatrix.m, [AAp7, AA5], is used.

Topic name to topic words

This command makes an Association between the topic names and the top topic words.

aTopicNameToTopicTable = 
  AssociationThread[(mObj⟹LSAMonTakeContext)["automaticTopicNames"], 
   mObj⟹LSAMonTopicsTable["NumberOfTerms" -> 12]⟹LSAMonTakeValue];

Here is a sample:

Magnify[#, 0.7] &@ aTopicNameToTopicTable[[1 ;; 3]]

Time-line plot

This command makes a contingency matrix between the documents and the topics (as described above):

rsmat = ToRSparseMatrix[mObj⟹LSAMonTopicsRepresentation[Automatic]⟹LSAMonTakeValue]

This time-plot shows great conversation in the USA presidents state of union speeches:

TimelinePlot[
 Association@
  MapThread[
   Tooltip[#2, aTopicNameToTopicTable[#2]] -> dates[[ToExpression@#1]] &, 
   Transpose[RSparseMatrixToTriplets[rsmat]]], 
 PlotTheme -> "Detailed", ImageSize -> 1000, AspectRatio -> 1/2, PlotLayout -> "Stacked"]

The plot is too cluttered, so it is a good idea to investigate other visualizations.

Topic vs president heatmap

We can use the USA president names instead of years in the Great Conversation chart because the USA presidents terms do not overlap.

This makes a contingency matrix presidents vs topics:

rsmat2 = ToRSparseMatrix[
   mObj⟹LSAMonTopicsRepresentation[
     names]⟹LSAMonTakeValue];

Here we compute the chronological order of the presidents based on the dates of their speeches:

nameToMeanYearRules = 
  Map[#[[1, 1]] -> Mean[N@#[[All, 2]]] &, 
   GatherBy[MapThread[List, {names, ToExpression[DateString[#, "Year"]] & /@ dates}], First]];
ordRowInds = Ordering[RowNames[rsmat2] /. nameToMeanYearRules];

This heat-map plot uses the (experimental) package HeatmapPlot.m, [AAp6]:

Block[{m = rsmat2[[ordRowInds, All]]},
 HeatmapPlot[SparseArray[m], RowNames[m], 
  Thread[Tooltip[ColumnNames[m], aTopicNameToTopicTable /@ ColumnNames[m]]],
  DistanceFunction -> {None, Sort}, ImageSize -> 1000, 
  AspectRatio -> 1/2]
 ]

Note the value of the option DistanceFunction: there is not re-ordering of the rows and columns are reordered by sorting. Also, the topics on the horizontal names have tool-tips.

References

Text data

[D1] Wolfram Data Repository, "Presidential Nomination Acceptance Speeches".

[D2] US Presidents, State of the Union Addresses, Trajectory, 2016. ‪ISBN‬1681240009, 9781681240008‬.

[D3] Gerhard Peters, "Presidential Nomination Acceptance Speeches and Letters, 1880-2016", The American Presidency Project.

[D4] Gerhard Peters, "State of the Union Addresses and Messages", The American Presidency Project.

Packages

[AAp1] Anton Antonov, MathematicaForPrediction utilities, (2014), MathematicaForPrediction at GitHub.

[AAp2] Anton Antonov, Implementation of document-term matrix construction and re-weighting functions in Mathematica(2013), MathematicaForPrediction at GitHub.

[AAp3] Anton Antonov, Implementation of the Non-Negative Matrix Factorization algorithm in Mathematica, (2013), MathematicaForPrediction at GitHub.

[AAp4] Anton Antonov, Implementation of one dimensional outlier identifying algorithms in Mathematica, (2013), MathematicaForPrediction at GitHub.

[AAp5] Anton Antonov, Monadic latent semantic analysis Mathematica package, (2017), MathematicaForPrediction at GitHub.

[AAp6] Anton Antonov, Heatmap plot Mathematica package, (2017), MathematicaForPrediction at GitHub.

[AAp7] Anton Antonov, RSparseMatrix Mathematica package, (2015), MathematicaForPrediction at GitHub.

[AAp8] Anton Antonov, Cross tabulation implementation in Mathematica, (2017), MathematicaForPrediction at GitHub.

Books and articles

[AA1] Anton Antonov, "Topic and thesaurus extraction from a document collection", (2013), MathematicaForPrediction at GitHub.

[AA2] Anton Antonov, "Statistical thesaurus from NPR podcasts", (2013), MathematicaForPrediction at WordPress blog.

[AA3] Anton Antonov, "Monad code generation and extension", (2017), MathematicaForPrediction at GitHub.

[AA4] Anton Antonov, "Contingency tables creation examples", (2016), MathematicaForPrediction at WordPress blog.

[AA5] Anton Antonov, "RSparseMatrix for sparse matrices with named rows and columns", (2015), MathematicaForPrediction at WordPress blog.

[Wk1] Wikipedia entry, Great Conversation.

[MA1] Mortimer Adler, "The Great Conversation Revisited," in The Great Conversation: A Peoples Guide to Great Books of the Western World, Encyclopædia Britannica, Inc., Chicago,1990, p. 28.

[MA2] Mortimer Adler, "Great Ideas".

[MA3] Mortimer Adler, "How to Think About the Great Ideas: From the Great Books of Western Civilization", 2000, Open Court.

Advertisements

Natural language processing with functional parsers

Natural language Processing (NLP) can be done with a structural approach using grammar rules. (The other type of NLP is using statistical methods.) In this post I discuss the use of functional parsers for the parsing and interpretation of small sets of natural language sentences within specific contexts. Functional parsing is also known as monadic parsing and parsing combinators.

Generally, I am using functional parsers to make Domain-Specific Languages (DSL’s). I use DSL’s to make command interfaces to search and recommendation engines and also to design and prototype conversational engines. I use extensively the so called Backus-Naur Form (BNF) for the grammar specifications. Clearly a DSL can be very close to natural language and provide sufficient means for interpretation within a given (narrow) context. (Like function integration in Calculus, smart phone directory browsing and selection, or search for something to eat nearby.)

I implemented and uploaded a package for construction of functional parsers: see FunctionalParsers.m hosted by the project MathematicaForPrediction at GitHub.

The package provides ability to quickly program parsers using a core system of functional parsers as described in the article “Functional parsers” by Jeroen Fokker .

The parsers (in both the package and the article) are categorized in the groups: basic, combinators, and transformers. Immediate interpretation can be done with transformer parsers, but the package also provides functions for evaluation of parser output within a context of data and functions.

Probably most importantly, the package provides functions for automatic generation of parsers from grammars in EBNF.

Here is an example of parsing the sentences of an integration requests language:

Interpretation of integration requests

Here is the grammar:
Integration requests EBNF grammar

The grammar can be visualized with this mind map:

Integration command

The mind map was hand made with MindNode Pro. Generally, the branches represent alternatives, but if two branches are connected the direction of the arrow connecting them shows a sequence combination.

With the FunctionalParsers.m package we can automatically generate a
mind map parsing the string of the grammar in EBNF to OMPL:

IntegrationRequestsGenerated

(And here is a PDF of the automatically generated mind map: IntegrationRequestsGenerated . )

I also made a slide show that gives an introduction to how the package is used: “Functional parsers for an integration requests language grammar”.

A more complicated example is this conversational engine for manipulation of time series data. (Data loading, finding outliers and trends. More details in the next blog post.)