Text analysis of Trump tweets

Introduction

This post is to announce the MathematicaVsR at GitHub project "Text analysis of Trump tweets", in which we compare Mathematica and R over text analyses of Twitter messages posted by Donald Trump (and his staff) before the 2016 US presidential election.

The project follows and extends the exposition and analysis of the R-based blog post "Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half" by David Robinson at VarianceExplained.org; see [1].

The blog post [1] links to several sources that claim that during the election campaign Donald Trump tweeted from his Android phone and his campaign staff tweeted from an iPhone. The blog post [1] examines this hypothesis in a quantitative way (using various R packages).

The hypothesis in question is well summarized with the tweet:

Every non-hyperbolic tweet is from iPhone (his staff).
Every hyperbolic tweet is from Android (from him). pic.twitter.com/GWr6D8h5ed
— Todd Vaziri (@tvaziri) August 6, 2016

This conjecture is fairly well supported by the following mosaic plots, [2]:

TextAnalysisOfTrumpTweets-iPhone-MosaicPlot-Sentiment-Device TextAnalysisOfTrumpTweets-iPhone-MosaicPlot-Device-Weekday-Sentiment

We can see that the Twitter messages from the iPhone are much more likely to be neutral, while the ones from the Android are much more polarized. As Christian Rudder (one of the founders of OkCupid, a dating website) explains in the chapter "Death by a Thousand Mehs" of the book "Dataclysm", [3], having a polarizing image (online persona) is a very good strategy for engaging an online audience:

[…] And the effect isn’t small-being highly polarizing will in fact get you about 70 percent more messages. That means variance allows you to effectively jump several "leagues" up in the dating pecking order – […]

(The mosaic plots above were made for the Mathematica-part of this project. Mosaic plots and weekday tags are not used in [1].)

Concrete steps

The Mathematica-part of this project does not closely follow the blog post [1]. After ingesting the data provided in [1], the Mathematica-part applies alternative algorithms to support and extend the analysis in [1].

The sections in the R-part notebook correspond to some — not all — of the sections in the Mathematica-part.

The following list of steps is for the Mathematica-part.

  1. Data ingestion
    • The blog post [1] shows how to ingest Donald Trump's Twitter messages in R.

    • That can be done in Mathematica too using the built-in function ServiceConnect, but that is not necessary since [1] provides a link to the ingested data used in [1]:
      load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))

    • This leads to ingesting an R data frame in the Mathematica-part using RLink (see the sketch after this list).

  2. Adding tags

    • We have to extract device tags for the messages — each message is associated with one of the tags "Android", "iPad", or "iPhone".

    • Using the message time stamps, each message is associated with time tags for its creation month, hour, weekday, etc.

    • Here is a summary of the data at this stage:

    "trumpTweetsTbl-Summary"

  3. Time series and time related distributions

    • We can make several types of time series plots for general insight and to support the main conjecture.

    • Here is a Mathematica-made plot of the same statistic computed in [1]; it shows differences in tweet posting behavior:

    "TimeSeries"

    • Here are distribution plots of tweets per weekday:

    "ViolinPlots"

  4. Classification into sentiments and Facebook topics

    • Using the built-in classifiers of Mathematica each tweet message is associated with a sentiment tag and a Facebook topic tag.

    • In [1] the results of this step are derived in several stages.

    • Here is a mosaic plot for conditional probabilities of devices, topics, and sentiments:

    "Device-Topic-Sentiment-MosaicPlot"

  5. Device-word association rules

    • Using association rule learning, device tags are associated with words in the tweets.

    • In the Mathematica-part these association rules are not needed for the sentiment analysis (because of the built-in classifiers).

    • The association rule mining is done mostly to support and extend the text analysis in [1] and, of course, for comparison purposes.

    • Here is an example of derived association rules together with their most important measures:

    "iPhone-Association-Rules"

In [1] the sentiments are derived from computed device-word associations, so in [1] the steps are done in the order 1-2-3-5-4. In Mathematica we do not need steps 3 and 5 in order to get the sentiments in step 4.

Comparison

Using Mathematica for sentiment analysis is much more direct because of the built-in classifiers.

The R-based blog post [1] makes heavy use of the "pipeline" operator %>%, which comes from the magrittr package and is a relatively recent addition to the R ecosystem (and it is both fashionable and convenient to use). In Mathematica the related operators are Postfix (//), Prefix (@), Infix (x ~f~ y), Composition (@*), and RightComposition (/*).
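
For illustration, here is the same small chain written with Postfix application and with Composition (the data here is just Range[10]):

    (* Postfix chaining with operator forms of Select and Map. *)
    Range[10] // Select[EvenQ] // Map[#^2 &] // Total
    (* 220 *)

    (* The same computation via Composition. *)
    (Total @* Map[#^2 &] @* Select[EvenQ])[Range[10]]
    (* 220 *)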

Making the time series plots with the R package "ggplot2" requires preparing special data frames. I am inclined to think that the Mathematica plotting of time series is more direct, but for this task the data wrangling code in Mathematica and R is fairly comparable.
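
For example, a tweets-per-hour comparison can be plotted directly from tagged records. This sketch reuses the illustrative tweetRecords from the sketch above and is not the project's actual plotting code:

    (* Tally the tweets of a device by hour of day. *)
    hourTally[device_String] :=
      SortBy[Tally[#Hour & /@ Select[tweetRecords, #Device === device &]], First];

    ListLinePlot[{hourTally["Android"], hourTally["iPhone"]},
      PlotLegends -> {"Android", "iPhone"},
      AxesLabel -> {"hour of day", "number of tweets"}]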

Generally speaking, the R package "arules" — used in this project for association rule learning — is somewhat awkward to use:

  • it is data frame centric, does not work directly with lists of lists, and

  • requires the use of factors.

The Apriori implementation in “arules” is much faster than the one in “AprioriAlgorithm.m” — “arules” uses a more efficient algorithm implemented in C.

References

[1] David Robinson, "Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half", (2016), VarianceExplained.org.

[2] Anton Antonov, "Mosaic plots for data visualization", (2014), MathematicaForPrediction at WordPress.

[3] Christian Rudder, Dataclysm, Crown, 2014. ASIN: B00J1IQUX8.

Creating and programming domain specific languages

Introduction

In this blog post I will provide links to documents, packages, blog posts, and discussions for creating and utilizing Domain Specific Languages (DSLs). I have discussed a few DSLs in previous blog posts (linked below). This blog post provides a more general, higher-level view of the application and creation of DSLs. The concrete examples are with Mathematica, but the steps are general and can be carried out with other programming languages and tools.

When to apply DSLs

Here are some situations for applying DSLs.

  1. When designing conversational engines.
  2. When there are too many usage scenarios and tuning options for the developed algorithms.
    • For example, we have a bunch of search, recommendation, and interaction algorithms for a dating site. A separate User Experience Department (UED) designs interactive user interfaces for these algorithms. We make a natural language DSL that invokes the different algorithms according to specified outcomes. With the DSL the different designs produced by UED are much more easily prototyped, implemented, or fleshed out. The DSL also gives UED an easier-to-understand view of the functionality provided by the algorithms.
  3. When designing an API for a collection of algorithms.
    • Just designing a DSL can bring clarity of what signatures should be in the API.
    • NIntegrate’s Method option was designed and implemented using a DSL. See this video between 25:00 and 27:30.

Designing DSLs

  1. Decide what kind of sentences the DSL is going to have.
    • Are natural language sentences going to be used?
    • Are the language words known beforehand or not?
  2. Prepare, create, or accumulate a list of representative sentences.
    • In some cases using Morphological Analysis can greatly help in coming up with use cases and the corresponding sentences.
  3. Create a context free grammar that describes the sentences from the previous step. (Or a large subset of them.)
    • At this stage I use exclusively Extended Backus-Naur Form (EBNF).
    • In some cases the grammar terminals are not known at the design stage and have to be retrieved in some way (from a database or through natural language processing).
    • Some conversational engine systems allow or require the grammar specification to be done in XML. I would still write the grammar in BNF first and then move to XML.
      •  It is not that hard to write a parser-and-interpreter that translates BNF into XML. See the end of this blog post for that kind of translation of BNF into OPML.
  4. Program parser(s) for the grammar.
    • Most of the time I use functional parsers.
    • The package FunctionalParsers.m provides a Mathematica implementation of this kind of parsing.
    • The package can automatically generate parsers from a grammar given in EBNF. (See the coding example below.)
    • I have programmed versions of this package in R and Lua.
  5. Program an interpreter for the parsed sentences.
    • At this stage the parsed sentences are hooked to the algorithms of the problem domain.
    • The package FunctionalParsers.m allows this to be done fairly easily.
  6. Test the parsing and interpretation.

See the code example below illustrating steps 3-6.
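
As a warm-up for that example, here is a minimal, self-contained sketch of the parser-combinator idea behind steps 3-6, with a made-up toy grammar. It is not the FunctionalParsers.m API; a parser here is a function that maps a list of tokens to a list of {remaining tokens, parsed value} pairs.

    (* Basic parser: match one literal token. *)
    ParseSymbol[s_String][tokens_List] :=
      If[tokens =!= {} && First[tokens] === s, {{Rest[tokens], s}}, {}];

    (* Combinators: alternatives and sequences. *)
    ParseAlternative[p_, q_][tokens_List] := Join[p[tokens], q[tokens]];
    ParseSequence[p_, q_][tokens_List] :=
      Flatten[
        Function[{restP, valP},
          Function[{restQ, valQ}, {restQ, {valP, valQ}}] @@@ q[restP]] @@@ p[tokens], 1];

    (* Transformer: apply a function to the parsed value (used for interpretation). *)
    ParseApply[f_, p_][tokens_List] := {#[[1]], f[#[[2]]]} & /@ p[tokens];

    (* Step 3, toy EBNF: <command> = ( "show" | "plot" ) , "data" ; *)
    pVerb = ParseAlternative[ParseSymbol["show"], ParseSymbol["plot"]];

    (* Steps 4-5: the parser plus a tiny interpreter mapping the verb to an action name. *)
    pCommand = ParseApply[
       <|"show" -> "DisplayTable", "plot" -> "MakePlot"|>[First[#]] &,
       ParseSequence[pVerb, ParseSymbol["data"]]];

    (* Step 6: test. *)
    pCommand[{"plot", "data"}]
    (* {{{}, "MakePlot"}} : fully parsed and interpreted *)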

Introduction to using DSLs in Mathematica

  1. The blog post “Natural language processing with functional parsers” gives an introduction to applying DSLs in Mathematica.
  2. This detailed slide-show presentation “Functional parsers for an integration requests language grammar” shows how to use the package FunctionalParsers.m over a small grammar.
  3. The answer of the MSE question “How to parse a clojure expression?” gives a good introduction with a simple grammar and shows both direct parser programming and automatic generation from EBNF.

Advanced example

The blog post “Simple time series conversational engine” discusses the creation (design and programming) of a simple conversational engine for time series analysis (data loading, finding outliers and trends).

Here is a movie demonstrating that conversational engine: http://youtu.be/wlZ5ANglVI4.

Other discussions

  1. A small part, from 17:30 to 21:00, of the WTC 2012 “Spatial Access Methods and Route Finding” presentation shows a DSL for points of interest queries.
  2. The answer of the MSE question “CSS Selectors for Symbolic XML” uses FunctionalParsers.m .
  3. This Quantile Regression presentation is aided by the  “Simple time series conversational engine” mentioned above.

Coding example

This coding example demonstrates steps 3-6 discussed above.

EBNF-and-parsers-for-LoveFood

Interpreters-and-parsing-for-LoveFood

Natural language processing with functional parsers

Natural Language Processing (NLP) can be done with a structural approach using grammar rules. (The other approach to NLP uses statistical methods.) In this post I discuss the use of functional parsers for the parsing and interpretation of small sets of natural language sentences within specific contexts. Functional parsing is also known as monadic parsing or parsing with combinators.

Generally, I use functional parsers to make Domain-Specific Languages (DSLs). I use DSLs to make command interfaces to search and recommendation engines and also to design and prototype conversational engines. I make extensive use of the so-called Backus-Naur Form (BNF) for the grammar specifications. Clearly a DSL can be very close to natural language and provide sufficient means for interpretation within a given (narrow) context (like function integration in calculus, smartphone directory browsing and selection, or searching for something to eat nearby).

I implemented and uploaded a package for construction of functional parsers: see FunctionalParsers.m hosted by the project MathematicaForPrediction at GitHub.

The package provides the ability to quickly program parsers using a core system of functional parsers, as described in the article “Functional parsers” by Jeroen Fokker.

The parsers (in both the package and the article) are categorized into three groups: basic, combinators, and transformers. Immediate interpretation can be done with transformer parsers, but the package also provides functions for evaluating parser output within a context of data and functions.

Probably most importantly, the package provides functions for automatic generation of parsers from grammars in EBNF.

Here is an example of parsing the sentences of an integration requests language:

Interpretation of integration requests

Here is the grammar:
Integration requests EBNF grammar

The grammar can be visualized with this mind map:

Integration command

The mind map was hand-made with MindNode Pro. Generally, the branches represent alternatives, but if two branches are connected, the direction of the arrow connecting them shows a sequence combination.

With the FunctionalParsers.m package we can automatically generate a mind map by parsing the string of the grammar in EBNF into OPML:

IntegrationRequestsGenerated

(And here is a PDF of the automatically generated mind map: IntegrationRequestsGenerated.)

I also made a slide show that gives an introduction to how the package is used: “Functional parsers for an integration requests language grammar”.

A more complicated example is this conversational engine for manipulation of time series data. (Data loading, finding outliers and trends. More details in the next blog post.)

Statistical thesaurus from NPR podcasts

Five months ago I worked with transcripts of National Public Radio (NPR) podcasts. The transcripts are available at http://www.npr.org — see for example “From child actor to artist…“.

Using nearly 5000 transcripts I experimented with topic extraction and statistical thesaurus derivation. The topics are too bulky to show here, but I am going to show some of the statistical thesaurus entries.

I used dimension reduction with Non-Negative Matrix Factorization (NNMF). For more detailed explanations, code for computations, and experimental results see this paper “Topic and thesaurus extraction from a document collection” provided by the MathematicaForPrediction project at GitHub. (The code for NNMF is also provided by the MathematicaForPrediction project at GitHub.)

First let me describe the data. The collection has 5123 transcripts.

Here is a sample of the transcripts (only the first 400 characters of each are taken):
NPR podcast sample 400 characters per podcast

Here is the distribution of the string lengths of the transcripts:
5123 NPR podcasts string length

I removed custom-selected stop words from the transcripts. I also stemmed the words using the Snowball stemmer; see http://snowball.tartarus.org. The stemmed words are called “terms” below.
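
Here is a sketch of that preprocessing using Mathematica built-ins as stand-ins: DeleteStopwords with its default stop-word list instead of the custom one, and WordStem (a Porter-type stemmer) instead of Snowball.

    (* Turn a transcript into a list of stemmed terms; built-ins used as stand-ins. *)
    termsOf[transcript_String] :=
      WordStem /@ DeleteStopwords[TextWords[ToLowerCase[transcript]]];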

Here are descriptive statistics and the distribution of the number of transcripts per term:
Transcripts per term

Here are descriptive statistics and the distribution of the number of terms per transcript:
Terms per transcript

I did not compute the whole statistical thesaurus. Instead I made a function that computes the thesaurus entry of a given word using the right NNMF factor with proper normalization.
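
Here is a sketch of such a lookup function. The names are assumptions for illustration: terms is the list of term strings and H is the right NNMF factor, with one column of H per term (its topic profile). This is not the paper's actual code.

    (* Nearest terms by cosine distance between normalized columns of the right factor H. *)
    thesaurusEntry[term_String, terms_List, H_?MatrixQ, n_Integer: 12] :=
      Module[{cols, nf},
        cols = Normalize /@ Transpose[H];   (* unit-normalize each term's topic profile *)
        nf = Nearest[cols -> terms, DistanceFunction -> CosineDistance];
        nf[cols[[First @ FirstPosition[terms, term]]], n]
      ];

    (* Hypothetical usage: thesaurusEntry["retriev", terms, H] *)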

Here are sample results of the thesaurus entry “retrieval” (note that the right column contains word stems):
Statistical thesaurus entries