Conference abstracts similarities

Introduction

In this MathematicaVsR project we discuss and exemplify finding and analyzing similarities between texts using Latent Semantic Analysis (LSA). Both Mathematica and R codes are provided.

The LSA workflows are constructed and executed with the software monads LSAMon-WL, [AA1, AAp1], and LSAMon-R, [AAp2].

The illustrating examples are based on conference abstracts from rstudio::conf and Wolfram Technology Conference (WTC), [AAd1, AAd2]. Since the number of rstudio::conf abstracts is small and since rstudio::conf 2020 is about to start at the time of preparing this project we focus on words and texts from RStudio’s ecosystem of packages and presentations.

Statistical thesaurus for words from RStudio’s ecosystem

Consider the focus words:

{"cloud","rstudio","package","tidyverse","dplyr","analyze","python","ggplot2","markdown","sql"}

Here is a statistical thesaurus for those words:

0az70qt8noeqf
0az70qt8noeqf

Remark: Note that the computed thesaurus entries seem fairly “R-flavored.”

Similarity analysis diagrams

As expected the abstracts from rstudio::conf tend to cluster closely – note the square formed top-left in the plot of a similarity matrix based on extracted topics:

1d5a83m8cghew
1d5a83m8cghew

Here is a similarity graph based on the matrix above:

09y26s6kr3bv9
09y26s6kr3bv9

Here is a clustering (by “graph communities”) of the sub-graph highlighted in the plot above:

0rba3xgoknkwi
0rba3xgoknkwi

Notebooks

Comparison observations

LSA pipelines specifications

The packages LSAMon-WL, [AAp1], and LSAMon-R, [AAp2], make the comparison easy – the codes of the specified workflows are nearly identical.

Here is the Mathematica code:

lsaObj =
  LSAMonUnit[aDesriptions]⟹
   LSAMonMakeDocumentTermMatrix[{}, Automatic]⟹
   LSAMonEchoDocumentTermMatrixStatistics⟹
   LSAMonApplyTermWeightFunctions["IDF", "TermFrequency", "Cosine"]⟹
   LSAMonExtractTopics["NumberOfTopics" -> 36, "MinNumberOfDocumentsPerTerm" -> 2, Method -> "ICA", MaxSteps -> 200]⟹
   LSAMonEchoTopicsTable["NumberOfTableColumns" -> 6];

Here is the R code:

lsaObj <- 
  LSAMonUnit(lsDescriptions) %>% 
  LSAMonMakeDocumentTermMatrix( stemWordsQ = FALSE, stopWords = stopwords::stopwords() ) %>% 
  LSAMonApplyTermWeightFunctions( "IDF", "TermFrequency", "Cosine" ) 
  LSAMonExtractTopics( numberOfTopics = 36, minNumberOfDocumentsPerTerm = 5, method = "NNMF", maxSteps = 20, profilingQ = FALSE ) %>% 
  LSAMonEchoTopicsTable( numberOfTableColumns = 6, wideFormQ = TRUE ) 

Graphs and graphics

Mathematica’s built-in graph functions make the exploration of the similarities much easier. (Than using R.)

Mathematica’s matrix plots provide more control and are more readily informative.

Sparse matrix objects with named rows and columns

R’s built-in sparse matrices with named rows and columns are great. LSAMon-WL utilizes a similar, specially implemented sparse matrix object, see [AA1, AAp3].

References

Articles

[AA1] Anton Antonov, A monad for Latent Semantic Analysis workflows, (2019), MathematicaForPrediction at GitHub.

[AA2] Anton Antonov, Text similarities through bags of words, (2020), SimplifiedMachineLearningWorkflows-book at GitHub.

Data

[AAd1] Anton Antonov, RStudio::conf-2019-abstracts.csv, (2020), SimplifiedMachineLearningWorkflows-book at GitHub.

[AAd2] Anton Antonov, Wolfram-Technology-Conference-2016-to-2019-abstracts.csv, (2020), SimplifiedMachineLearningWorkflows-book at GitHub.

Packages

[AAp1] Anton Antonov, Monadic Latent Semantic Analysis Mathematica package, (2017), MathematicaForPrediction at GitHub.

[AAp2] Anton Antonov, Latent Semantic Analysis Monad R package, (2019), R-packages at GitHub.

[AAp3] Anton Antonov, SSparseMatrix Mathematica package, (2018), MathematicaForPrediction at GitHub.