Robust LLM pipelines

Posted on September 22, 2024 by Anton Antonov Antonov

… or “Making Robust LLM Computational Pipelines from Software Engineering Perspective”

Abstract

Large Language Models (LLMs) are powerful tools with diverse capabilities, but from Software Engineering (SE) Point Of View (POV) they are unpredictable and slow. In this presentation we consider five ways to make more robust SE pipelines that include LLMs. We also consider a general methodological workflow for utilizing LLMs in “every day practice.”

Here are the five approaches we consider:

DSL for configuration-execution-conversion
- Infrastructural, language-design level solution
Detailed, well crafted prompts
- AKA “Prompt engineering”
Few-shot training with examples
Via a Question Answering System (QAS) and code templates
Grammar-LLM chain of responsibility
Testings with data types and shapes over multiple LLM results

Compared to constructing SE pipelines, Literate Programming (LP) offers a dual or alternative way to use LLMs. For that it needs support and facilitation of:

Convenient LLM interaction (or chatting)
Document execution (weaving and tangling)

The discussed LLM workflows methodology is supported in Python, Raku, Wolfram Language (WL). The support in R is done via Python (with “reticulate”, [TKp1].)

The presentation includes multiple examples and showcases.

Modeling of the LLM utilization process is hinted but not discussed.

Here is a mind-map of the presentation:

Here are the notebook used in the presentation:

Presentation’s notebook
Robust-LLM-Pipelines-SouthFL-DSSG-Python.ipynb
Robust-LLM-Pipelines-SouthFL-DSSG-Raku.ipynb

General structure of LLM-based workflows

All systematic approaches of unfolding and refining workflows based on LLM functions, will include several decision points and iterations to ensure satisfactory results.

This flowchart outlines such a systematic approach:

References

Articles, blog posts

[AA1] Anton Antonov, “Workflows with LLM functions”, (2023), RakuForPrediction at WordPress.

Notebooks

[AAn1] Anton Antonov, “Workflows with LLM functions (in Raku)”, (2023), Wolfram Community.

[AAn2] Anton Antonov, “Workflows with LLM functions (in Python)”, (2023), Wolfram Community.

[AAn3] Anton Antonov, “Workflows with LLM functions (in WL)”, (2023), Wolfram Community.

Packages

Raku

[AAp1] Anton Antonov, LLM::Functions Raku package, (2023-2024), GitHub/antononcube. (raku.land)

[AAp2] Anton Antonov, LLM::Prompts Raku package, (2023-2024), GitHub/antononcube. (raku.land)

[AAp3] Anton Antonov, Jupyter::Chatbook Raku package, (2023-2024), GitHub/antononcube. (raku.land)

Python

[AAp4] Anton Antonov, LLMFunctionObjects Python package, (2023-2024), PyPI.org/antononcube.

[AAp5] Anton Antonov, LLMPrompts Python package, (2023-2024), GitHub/antononcube.

[AAp6] Anton Antonov, JupyterChatbook Python package, (2023-2024), GitHub/antononcube.

[MWp1] Marc Wouts, jupytext Python package, (2021-2024), GitHub/mwouts.

R

[TKp1] Tomasz Kalinowski, Kevin Ushey, JJ Allaire, RStudio, Yuan Tang, reticulate R package, (2016-2024)

Videos

[AAv1] Anton Antonov, “Robust LLM pipelines (Mathematica, Python, Raku)”, (2024), YouTube/@AAA4Predictions.

[AAv2] Anton Antonov, “Integrating Large Language Models with Raku”, (2023), The Raku Conference 2023 at YouTube.

Age at creation for programming languages stats

Posted on May 22, 2024 by Anton Antonov Antonov

Introduction

In this blog post (notebook) we ingest programming languages creation data from “Programming Language DataBase” and visualize several statistics of it.

We do not examine the data source and we do not want to reason too much about the data using the stats. We started this notebook by just wanting to make the bubble charts (both 2D and 3D.) Nevertheless, we are tempted to say and justify statements like:

Pareto holds, as usual.
Language creators tend to do it more than once.
Beware the Second system effect.

References

Here are reference links with explanations and links to dataset files:

The Ages of Programming Language Creators (pldb.io)
- Short note about data and related statistics; provides a link to a TSV file with the data.
The Ages of Programming Language Creators (datawrapper.dwcdn.net)
- “Just a plot”; provides a link to a CSV file with the data.
The Ages of Programming Language Creators (Reddit)
- Link(s) and discussion.

Data ingestion

Here we get the TSC file with Wolfram Function Repository (WFR) function ImportCSVToDataset:

url = "https://pldb.io/posts/age.tsv";
dsData = ResourceFunction["ImportCSVToDataset"][url, "Dataset", "FieldSeparators" -> "\t"];
dsData[[1 ;; 4]]

Here we summarize the data using the WFR function RecordsSummary:

ResourceFunction["RecordsSummary"][dsData, "MaxTallies" -> 12]

Here is a list of languages we use to “get orientated” in the plots below:

lsFocusLangs = {"C++", "Fortran", "Java", "Mathematica", "Perl 6", "Raku", "SQL", "Wolfram Language"};

Here we find the most important tags (used in the plots below):

lsTopTags = ReverseSortBy[Tally[Normal@dsData[All, "tags"]], Last][[1 ;; 7, 1]]

(*{"pl", "textMarkup", "dataNotation", "grammarLanguage", "queryLanguage", "stylesheetLanguage", "protocol"}*)

Here we add the column “group” based on the focus languages and most important tags:

dsData = dsData[All, Append[#, "group" -> Which[MemberQ[lsFocusLangs, #id], "focus", MemberQ[lsTopTags, #tags], #tags, True, "other"]] &];

Distributions

Here are the distributions of the variables/columns:

age at creation
- i.e. “How old was the creator?”
appeared”
- i.e. “In what year the programming language was proclaimed?”

Association @ Map[# -> Histogram[Normal@dsData[All, #], 20, "Probability", Sequence[ImageSize -> Medium, PlotTheme -> "Detailed"]] &, {"ageAtCreation", "appeared"}]

Here are corresponding Box-Whisker plots together with tables of their statistics:

aBWCs = Association@
   Map[# -> BoxWhiskerChart[Normal@dsData[All, #], "Outliers", Sequence[BarOrigin -> Left, ImageSize -> Medium, AspectRatio -> 1/2, PlotRange -> Full]] &, {"ageAtCreation", "appeared"}];

Pareto principle manifestation

Number of creations

Here is the Pareto principle plot of for the number of created (or renamed) programming languages per creator (using the WFR function ParetoPrinciplePlot):

ResourceFunction["ParetoPrinciplePlot"][Association[Rule @@@ Tally[Normal@dsData[All, "creators"]]], ImageSize -> Large]

We can see that ≈25% of the creators correspond to ≈50% of the languages.

Popularity

Obviously, programmers can and do use more than one programming language. Nevertheless, it is interesting to see the Pareto principle plot for the languages “mind share” based on the number of users estimates.

ResourceFunction["ParetoPrinciplePlot"][Normal@dsData[Association, #id -> #numberOfUsersEstimate &], ImageSize -> Large]

Remark: Again, the plot above is “wrong” — programmers use more than one programming language.

Correlations

In order to see meaningful correlation, pairwise plots we take logarithms of the large value columns:

dsDataVar = dsData[All, {"appeared", "ageAtCreation", "numberOfUsersEstimate", "numberOfJobsEstimate", "rank", "measurements", "pldbScore"}];
dsDataVar = dsDataVar[All, Append[#, <|"numberOfUsersEstimate" -> Log10[#numberOfUsersEstimate + 1], "numberOfJobsEstimate" -> Log10[#numberOfJobsEstimate + 1]|>] &];

Remark: Note that we “cheat” by adding 1 before taking the logarithms.

We obtain the tables of correlations plots using the newly introduced, experimental PairwiseListPlot. If we remove the rows with zeroes some of the correlations become more obvious. Here is the corresponding tab view of the two correlation tables:

TabView[{
   "data" -> PairwiseListPlot[dsDataVar, PlotTheme -> "Business", ImageSize -> 800], 
   "zero-free data" -> PairwiseListPlot[dsDataVar[Select[FreeQ[Values[#], 0] &]], PlotTheme -> "Business", ImageSize -> 800]}]

Remark: Given the names of the data columns and the corresponding obvious interpretations we can say that the stronger correlations make sense.

Bubble chart 2D

In this section we make an informative 2D bubble chart with (tooltips).

First, note that not all triplets of “appeared”,”ageAtCreation”, and “numberOfUsersEstimate” are unique:

ReverseSortBy[Tally[Normal[dsData[All, {"appeared", "ageAtCreation", "numberOfUsersEstimate"}]]], Last][[1 ;; 3]]

(*{{<|"appeared" -> 2017, "ageAtCreation" -> 33, "numberOfUsersEstimate" -> 420|>, 2}, {<|"appeared" -> 2023, "ageAtCreation" -> 39, "numberOfUsersEstimate" -> 11|>, 1}, {<|"appeared" -> 2022, "ageAtCreation" -> 55, "numberOfUsersEstimate" -> 6265|>, 1}}*)

Hence we make two datasets: (1) one for the core bubble chart, (2) the other for the labeling function:

aData = GroupBy[Normal@dsData, #group &, KeyTake[#, {"appeared", "ageAtCreation", "numberOfUsersEstimate"}] &];
aData2 = GroupBy[Normal@dsData, #group &, KeyTake[#, {"appeared", "ageAtCreation", "numberOfUsersEstimate", "id", "creators"}] &];

Here is the labeling function (see the section “Applications” of the function page of BubbleChart):

Clear[LangLabeler];
LangLabeler[v_, {r_, c_}, ___] := Placed[Grid[{
      {Style[aData2[[r, c]]["id"], Bold, 12], SpanFromLeft}, 
      {"Creator(s):", aData2[[r, c]]["creators"]}, 
      {"Appeared:", aData2[[r, c]]["appeared"]}, 
      {"Age at creation:", aData2[[r, c]]["ageAtCreation"]}, 
      {"Number of users:", aData2[[r, c]]["numberOfUsersEstimate"]} 
     }, Alignment -> Left], Tooltip];

Here is the bubble chart:

BubbleChart[
  aData, 
  FrameLabel -> {"Age at Creation", "Appeared"}, 
  PlotLabel -> "Number of users estimate", 
  BubbleSizes -> {0.05, 0.14}, 
  LabelingFunction -> LangLabeler, 
  AspectRatio -> 1/2.5, 
  ChartStyle -> 7, 
  PlotTheme -> "Detailed", 
  ChartLegends -> {Keys[aData], None}, 
  ImageSize -> 1000 
 ]

Remark: The programming language J is a clear outlier because of creators’ ages.

Bubble chart 3D

In this section we a 3D bubble chart.

As in the previous section we define two datasets: for the core plot and for the tooltips:

aData3D = GroupBy[Normal@dsData, #group &, KeyTake[#, {"appeared", "ageAtCreation", "measurements", "numberOfUsersEstimate"}] &];
aData3D2 = GroupBy[Normal@dsData, #group &, KeyTake[#, {"appeared", "ageAtCreation", "measurements", "numberOfUsersEstimate", "id", "creators"}] &];

Here is the corresponding labeling function:

Clear[LangLabeler3D];
LangLabeler3D[v_, {r_, c_}, ___] := Placed[Grid[{
      {Style[aData3D2[[r, c]]["id"], Bold, 12], SpanFromLeft}, 
      {"Creator(s):", aData3D2[[r, c]]["creators"]}, 
      {"Appeared:", aData3D2[[r, c]]["appeared"]}, 
      {"Age at creation:", aData3D2[[r, c]]["ageAtCreation"]}, 
      {"Number of users:", aData3D2[[r, c]]["numberOfUsersEstimate"]} 
     }, Alignment -> Left], Tooltip];

Here is the 3D chart:

BubbleChart3D[
  aData3D, 
  AxesLabel -> {"appeared", "ageAtCreation", "measuremnts"}, 
  PlotLabel -> "Number of users estimate", 
  BubbleSizes -> {0.02, 0.07}, 
  LabelingFunction -> LangLabeler3D, 
  BoxRatios -> {1, 1, 1}, 
  ChartStyle -> 7, 
  PlotTheme -> "Detailed", 
  ChartLegends -> {Keys[aData], None}, 
  ImageSize -> 1000 
 ]

Remark: In the 3D bubble chart plot “Mathematica” and “Wolfram Language” are easier to discern.

Second system effect traces

In this section we try — and fail — to demonstrate that the more programming languages a team of creators makes the less successful those languages are. (Maybe, because they are more cumbersome and suffer the Second system effect?)

Remark: This section is mostly made “for fun.” It is not true that each sets of languages per creators team is made of comparable languages. For example, complementary languages can be in the same set. (See, HTTP, HTML, URL.) Some sets are just made of the same language but with different names. (See, Perl 6 and Raku, and Mathematica and Wolfram Language.) Also, older languages would have the First mover advantage.

Make creators to index association:

aCreators = KeySort@Association[Rule @@@ Select[Tally[Normal@dsData[All, "creators"]], #[[2]] > 1 &]];
aNameToIndex = AssociationThread[Keys[aCreators], Range[Length[aCreators]]];

Make a bubble chart with relative popularity per creators team:

aNUsers = Normal@GroupBy[dsData, #creators &, (m = Max[1, Max[Sqrt@KeyTake[#, "numberOfUsersEstimate"]]]; Map[Tooltip[{#appeared, #creators /. aNameToIndex, Sqrt[#numberOfUsersEstimate]/m}, Grid[{{Style[#id, Black, Bold], SpanFromLeft}, {"Creator(s):", #creators}, {"Users:", #numberOfUsersEstimate}}, Alignment -> Left]] &, #]) &];
aNUsers = KeySort@Select[aNUsers, Length[#] > 1 &];
BubbleChart[aNUsers, AspectRatio -> 2, BubbleSizes -> {0.02, 0.05}, ChartLegends -> Keys[aNUsers], ImageSize -> Large, GridLines -> {None, Values[aNameToIndex]}, FrameTicks -> {{Reverse /@ (List @@@ Normal[aNameToIndex]), None}, {Automatic, Automatic}}]

From the plot above we cannot decisively say that:

The most recent creation of a team of programming language creators is not team's most popular creation.

That statement, though, does hold for a fair amount of cases.

Instead of conclusions

Consider:

Making an interactive interface for the variables, types of plots, etc.
Placing callouts for the focus languages in bubble charts.

Re-exploring the structure of Chinese character images

Posted on May 3, 2022 by Anton Antonov Antonov

Introduction

In this notebook we show information retrieval and clustering
techniques over images of Unicode collection of Chinese characters. Here
is the outline of notebook’s exposition:

Get Chinese character images.
Cluster “image vectors” and demonstrate that the obtained
clusters have certain explainability elements.
Apply Latent Semantic Analysis (LSA) workflow to the character
set.
Show visual thesaurus through a recommender system. (That uses
Cosine similarity.)
Discuss graph and hierarchical clustering using LSA matrix
factors.
Demonstrate approximation of “unseen” character images with an
image basis obtained through LSA over a small set of (simple)
images.
Redo character approximation with more “interpretable” image
basis.

Remark: This notebook started as an (extended)
comment for the Community discussion “Exploring
structure of Chinese characters through image processing”, [SH1].
(Hence the title.)

Get Chinese character images

This code is a copy of the code in the original
Community post by Silvia Hao, [SH1]:

Module[{fsize = 50, width = 64, height = 64}, 
  lsCharIDs = Map[FromCharacterCode[#, "Unicode"] &, 16^^4E00 - 1 + Range[width height]]; 
 ]

charPage = Module[{fsize = 50, width = 64, height = 64}, 
    16^^4E00 - 1 + Range[width height] // pipe[
      FromCharacterCode[#, "Unicode"] & 
      , Characters, Partition[#, width] & 
      , Grid[#, Background -> Black, Spacings -> {0, 0}, ItemSize -> {1.5, 1.2}, Alignment -> {Center, Center}, Frame -> All, FrameStyle -> Directive[Red, AbsoluteThickness[3 \[Lambda]]]] & 
      , Style[#, White, fsize, FontFamily -> "Source Han Sans CN", FontWeight -> "ExtraLight"] & 
      , Rasterize[#, Background -> Black] & 
     ] 
   ];
chargrid = charPage // ColorDistance[#, Red] & // Image[#, "Byte"] & // Sign //Erosion[#, 5] &;
lmat = chargrid // MorphologicalComponents[#, Method -> "BoundingBox", CornerNeighbors -> False] &;
chars = ComponentMeasurements[{charPage // ColorConvert[#, "Grayscale"] &, lmat}, "MaskedImage", #Width > 10 &] // Values // Map@RemoveAlphaChannel;
chars = Module[{size = chars // Map@ImageDimensions // Max}, ImageCrop[#, {size, size}] & /@ chars];

Here is a sample of the obtained images:

SeedRandom[33];
RandomSample[chars, 5]

Vector representation of
images

Define a function that represents an image into a linear vector space
(of pixels):

Clear[ImageToVector];
ImageToVector[img_Image] := Flatten[ImageData[ColorConvert[img, "Grayscale"]]];
ImageToVector[img_Image, imgSize_] := Flatten[ImageData[ColorConvert[ImageResize[img, imgSize], "Grayscale"]]];
ImageToVector[___] := $Failed;

Show how vector represented images look like:

Table[BlockRandom[
   img = RandomChoice[chars]; 
   ListPlot[ImageToVector[img], Filling -> Axis, PlotRange -> All, PlotLabel -> img, ImageSize -> Medium, AspectRatio -> 1/6], 
   RandomSeeding -> rs], {rs, {33, 998}}]

\[AliasDelimiter]

Data preparation

In this section we represent the images into a linear vector space.
(In which each pixel is a basis vector.)

Make an association with images:

aCImages = AssociationThread[lsCharIDs -> chars];
Length[aCImages]

(*4096*)

Make flat vectors with the images:

AbsoluteTiming[
  aCImageVecs = ParallelMap[ImageToVector, aCImages]; 
 ]

(*{0.998162, Null}*)

Do matrix plots a random sample of the image vectors:

SeedRandom[32];
MatrixPlot[Partition[#, ImageDimensions[aCImages[[1]]][[2]]]] & /@ RandomSample[aCImageVecs, 6]

Clustering over the image
vectors

In this section we cluster “image vectors” and demonstrate that the
obtained clusters have certain explainability elements. Expected Chinese
character radicals are observed using image multiplication.

Cluster the image vectors and show summary of the clusters
lengths:

SparseArray[Values@aCImageVecs]

SeedRandom[334];
AbsoluteTiming[
  lsClusters = FindClusters[SparseArray[Values@aCImageVecs] -> Keys[aCImageVecs], 35, Method -> {"KMeans"}]; 
 ]
Length@lsClusters
ResourceFunction["RecordsSummary"][Length /@ lsClusters]

(*{24.6383, Null}*)

(*35*)

For each cluster:

Take 30 different small samples of 7 images
Multiply the images in each small sample
Show three “most black” the multiplication results

SeedRandom[33];
Table[i -> TakeLargestBy[Table[ImageMultiply @@ RandomSample[KeyTake[aCImages, lsClusters[[i]]], UpTo[7]], 30], Total@ImageToVector[#] &, 3], {i, Length[lsClusters]}]

Remark: We can see that the clustering above
produced “semantic” clusters – most of the multiplied images show
meaningful Chinese characters radicals and their “expected
positions.”

Here is one of the clusters with the radical “mouth”:

KeyTake[aCImages, lsClusters[[26]]]

LSAMon application

In this section we apply the “standard” LSA workflow, [AA1, AA4].

Make a matrix with named rows and columns from the image vectors:

mat = ToSSparseMatrix[SparseArray[Values@aCImageVecs], "RowNames" -> Keys[aCImageVecs], "ColumnNames" -> Automatic]

The following Latent Semantic Analysis (LSA) monadic pipeline is used
in [AA2, AA2]:

SeedRandom[77];
AbsoluteTiming[
  lsaAllObj = 
    LSAMonUnit[]\[DoubleLongRightArrow]
     LSAMonSetDocumentTermMatrix[mat]\[DoubleLongRightArrow]
     LSAMonApplyTermWeightFunctions["None", "None", "Cosine"]\[DoubleLongRightArrow]
     LSAMonExtractTopics["NumberOfTopics" -> 60, Method -> "SVD", "MaxSteps" -> 15, "MinNumberOfDocumentsPerTerm" -> 0]\[DoubleLongRightArrow]
     LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]
     LSAMonEcho[Style["Obtained basis:", Bold, Purple]]\[DoubleLongRightArrow]
     LSAMonEchoFunctionContext[ImageAdjust[Image[Partition[#, ImageDimensions[aCImages[[1]]][[1]]]]] & /@SparseArray[#H] &]; 
 ]

(*{7.60828, Null}*)

Remark: LSAMon’s corresponding theory and design are
discussed in [AA1, AA4]:

Get the representation matrix:

W2 = lsaAllObj\[DoubleLongRightArrow]LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]LSAMonTakeW

Get the topics matrix:

H = lsaAllObj\[DoubleLongRightArrow]LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]LSAMonTakeH

Cluster the reduced dimension
representations and show summary of the clusters
lengths:

AbsoluteTiming[
  lsClusters = FindClusters[Normal[SparseArray[W2]] -> RowNames[W2], 40, Method -> {"KMeans"}]; 
 ]
Length@lsClusters
ResourceFunction["RecordsSummary"][Length /@ lsClusters]

(*{2.33331, Null}*)

(*40*)

Show cluster interpretations:

AbsoluteTiming[aAutoRadicals = Association@Table[i -> TakeLargestBy[Table[ImageMultiply @@ RandomSample[KeyTake[aCImages, lsClusters[[i]]], UpTo[8]], 30], Total@ImageToVector[#] &, 3], {i, Length[lsClusters]}]; 
 ]
aAutoRadicals

(*{0.878406, Null}*)

Using FeatureExtraction

I experimented with clustering and approximation using WL’s function
FeatureExtraction.
Result are fairly similar as the above; timings a different (a few times
slower.)

Visual thesaurus

In this section we use Cosine similarity to find visual nearest
neighbors of Chinese character images.

matPixels = WeightTermsOfSSparseMatrix[lsaAllObj\[DoubleLongRightArrow]LSAMonTakeWeightedDocumentTermMatrix, "IDF", "None", "Cosine"];
matTopics = WeightTermsOfSSparseMatrix[lsaAllObj\[DoubleLongRightArrow]LSAMonNormalizeMatrixProduct[Normalized -> Left]\[DoubleLongRightArrow]LSAMonTakeW, "None", "None", "Cosine"];

smrObj = SMRMonUnit[]\[DoubleLongRightArrow]SMRMonCreate[<|"Topic" -> matTopics, "Pixel" -> matPixels|>];

Consider the character “團”:

aCImages["團"]

Here are the nearest neighbors for that character found by using both
image topics and image pixels:

(*focusItem=RandomChoice[Keys@aCImages];*)
  focusItem = {"團", "仼", "呔"}[[1]]; 
   smrObj\[DoubleLongRightArrow]
     SMRMonEcho[Style["Nearest neighbors by pixel topics:", Bold, Purple]]\[DoubleLongRightArrow]
     SMRMonSetTagTypeWeights[<|"Topic" -> 1, "Pixel" -> 0|>]\[DoubleLongRightArrow]
     SMRMonRecommend[focusItem, 8, "RemoveHistory" -> False]\[DoubleLongRightArrow]
     SMRMonEchoValue\[DoubleLongRightArrow]
     SMRMonEchoFunctionValue[AssociationThread[Values@KeyTake[aCImages, Keys[#]], Values[#]] &]\[DoubleLongRightArrow]
     SMRMonEcho[Style["Nearest neighbors by pixels:", Bold, Purple]]\[DoubleLongRightArrow]
     SMRMonSetTagTypeWeights[<|"Topic" -> 0, "Pixel" -> 1|>]\[DoubleLongRightArrow]
     SMRMonRecommend[focusItem, 8, "RemoveHistory" -> False]\[DoubleLongRightArrow]
     SMRMonEchoFunctionValue[AssociationThread[Values@KeyTake[aCImages, Keys[#]], Values[#]] &];

Remark: Of course, in the recommender pipeline above
we can use both pixels and pixels topics. (With their contributions
being weighted.)

Graph clustering

In this section we demonstrate the use of graph communities to find
similar groups of Chinese characters.

Here we take a sub-matrix of the reduced dimension matrix computed
above:

W = lsaAllObj\[DoubleLongRightArrow]LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]LSAMonTakeW;

Here we find the similarity matrix between the characters and remove
entries corresponding to “small” similarities:

matSym = Clip[W . Transpose[W], {0.78, 1}, {0, 1}];

Here we plot the obtained (clipped) similarity matrix:

MatrixPlot[matSym]

Here we:

Take array rules of the sparse similarity matrix
Drop the rules corresponding to the diagonal elements
Convert the keys of rules into uni-directed graph edges
Make the corresponding graph
Find graph’s connected components
Show the number of connected components
Show a tally of the number of nodes in the components

gr = Graph[UndirectedEdge @@@ DeleteCases[Union[Sort /@ Keys[SSparseMatrixAssociation[matSym]]], {x_, x_}]];
lsComps = ConnectedComponents[gr];
Length[lsComps]
ReverseSortBy[Tally[Length /@ lsComps], First]

(*138*)

(*{{1839, 1}, {31, 1}, {27, 1}, {16, 1}, {11, 2}, {9, 2}, {8, 1}, {7, 1}, {6, 5}, {5, 3}, {4, 8}, {3, 14}, {2, 98}}*)

Here we demonstrate the clusters of Chinese characters make
sense:

aPrettyRules = Dispatch[Map[# -> Style[#, FontSize -> 36] &, Keys[aCImages]]]; CommunityGraphPlot[Subgraph[gr, TakeLargestBy[lsComps, Length, 10][[2]]], Method -> "SpringElectrical", VertexLabels -> Placed["Name", Above],AspectRatio -> 1, ImageSize -> 1000] /. aPrettyRules

Remark: By careful observation of the clusters and
graph connections we can convince ourselves that the similarities are
based on pictorial sub-elements (i.e. radicals) of the characters.

Hierarchical clustering

In this section we apply hierarchical clustering to the reduced
dimension representation of the Chinese character images.

Here we pick a cluster:

lsFocusIDs = lsClusters[[12]];
Magnify[ImageCollage[Values[KeyTake[aCImages, lsFocusIDs]]], 0.4]

Here is how we can make a dendrogram plot (not that useful here):

(*smat=W2\[LeftDoubleBracket]lsClusters\[LeftDoubleBracket]13\[RightDoubleBracket],All\[RightDoubleBracket];
Dendrogram[Thread[Normal[SparseArray[smat]]->Map[Style[#,FontSize->16]&,RowNames[smat]]],Top,DistanceFunction->EuclideanDistance]*)

Here is a heat-map plot with hierarchical clustering dendrogram (with
tool-tips):

gr = HeatmapPlot[W2[[lsFocusIDs, All]], DistanceFunction -> {CosineDistance, None}, Dendrogram -> {True, False}];
gr /. Map[# -> Tooltip[Style[#, FontSize -> 16], Style[#, Bold, FontSize -> 36]] &, lsFocusIDs]

Remark: The plot above has tooltips with larger
character images.

Representing
all characters with smaller set of basic ones

In this section we demonstrate that a relatively small set of simpler
Chinese character images can be used to represent (or approxumate) the
rest of the images.

Remark: We use the following heuristic: the simpler
Chinese characters have the smallest amount of white pixels.

Obtain a training set of images – that are the darkest – and show a
sample of that set :

{trainingInds, testingInds} = TakeDrop[Keys[SortBy[aCImages, Total[ImageToVector[#]] &]], 800];
SeedRandom[3];
RandomSample[KeyTake[aCImages, trainingInds], 12]

Show all training characters with an image collage:

Magnify[ImageCollage[Values[KeyTake[aCImages, trainingInds]], Background -> Gray, ImagePadding -> 1], 0.4]

Apply LSA monadic pipeline with the training characters only:

SeedRandom[77];
AbsoluteTiming[
  lsaPartialObj = 
    LSAMonUnit[]\[DoubleLongRightArrow]
     LSAMonSetDocumentTermMatrix[SparseArray[Values@KeyTake[aCImageVecs, trainingInds]]]\[DoubleLongRightArrow]
     LSAMonApplyTermWeightFunctions["None", "None", "Cosine"]\[DoubleLongRightArrow]
     LSAMonExtractTopics["NumberOfTopics" -> 80, Method -> "SVD", "MaxSteps" -> 120, "MinNumberOfDocumentsPerTerm" -> 0]\[DoubleLongRightArrow]
     LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]
     LSAMonEcho[Style["Obtained basis:", Bold, Purple]]\[DoubleLongRightArrow]
     LSAMonEchoFunctionContext[ImageAdjust[Image[Partition[#, ImageDimensions[aCImages[[1]]][[1]]]]] & /@SparseArray[#H] &]; 
 ]

(*{0.826489, Null}*)

Get the matrix and basis interpretation of the extracted image
topics:

H = 
   lsaPartialObj\[DoubleLongRightArrow]
    LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]
    LSAMonTakeH;
lsBasis = ImageAdjust[Image[Partition[#, ImageDimensions[aCImages[[1]]][[1]]]]] & /@ SparseArray[H];

Approximation of “unseen”
characters

Pick a Chinese character image as a target image and pre-process
it:

ind = RandomChoice[testingInds];
imgTest = aCImages[ind];
matImageTest = ToSSparseMatrix[SparseArray@List@ImageToVector[imgTest, ImageDimensions[aCImages[[1]]]], "RowNames" -> Automatic, "ColumnNames" -> Automatic];
imgTest

Find its representation with the chosen feature extractor (LSAMon
object here):

matReprsentation = lsaPartialObj\[DoubleLongRightArrow]LSAMonRepresentByTopics[matImageTest]\[DoubleLongRightArrow]LSAMonTakeValue;
lsCoeff = Normal@SparseArray[matReprsentation[[1, All]]];
ListPlot[MapIndexed[Tooltip[#1, lsBasis[[#2[[1]]]]] &, lsCoeff], Filling -> Axis, PlotRange -> All]

Show representation coefficients outliers:

lsBasis[[OutlierPosition[Abs[lsCoeff], TopOutliers@*HampelIdentifierParameters]]]

Show the interpretation of the found representation:

vecReprsentation = lsCoeff . SparseArray[H];
reprImg = Image[Unitize@Clip[#, {0.45, 1}, {0, 1}] &@Rescale[Partition[vecReprsentation, ImageDimensions[aCImages[[1]]][[1]]]]];
{reprImg, imgTest}

See the closest characters using image distances:

KeyMap[# /. aCImages &, TakeSmallest[ImageDistance[reprImg, #] & /@ aCImages, 4]]

Remark: By applying the approximation procedure to
all characters in testing set we can convince ourselves that small,
training set provides good retrieval. (Not done here.)

Finding more interpretable
bases

In this section we show how to use LSA workflow with Non-Negative
Matrix Factorization (NNMF) over an image set extended with already
extracted “topic” images.

Cleaner automatic radicals

aAutoRadicals2 = Map[Dilation[Binarize[DeleteSmallComponents[#]], 0.5] &, First /@ aAutoRadicals]

Here we take an image union in order to remove the “duplicated”
radicals:

aAutoRadicals3 = AssociationThread[Range[Length[#]], #] &@Union[Values[aAutoRadicals2], SameTest -> (ImageDistance[#1, #2] < 14.5 &)]

LSAMon pipeline with NNMF

Make a matrix with named rows and columns from the image vectors:

mat1 = ToSSparseMatrix[SparseArray[Values@aCImageVecs], "RowNames" -> Keys[aCImageVecs], "ColumnNames" -> Automatic]

Enhance the matrix with radicals instances:

mat2 = ToSSparseMatrix[SparseArray[Join @@ Map[Table[ImageToVector[#], 100] &, Values[aAutoRadicals3]]], "RowNames" -> Automatic, "ColumnNames" -> Automatic];
mat3 = RowBind[mat1, mat2];

Apply the LSAMon workflow pipeline with NNMF for topic
extraction:

SeedRandom[77];
AbsoluteTiming[
  lsaAllExtendedObj = 
    LSAMonUnit[]\[DoubleLongRightArrow]
     LSAMonSetDocumentTermMatrix[mat3]\[DoubleLongRightArrow]
     LSAMonApplyTermWeightFunctions["None", "None", "Cosine"]\[DoubleLongRightArrow]
     LSAMonExtractTopics["NumberOfTopics" -> 60, Method -> "NNMF", "MaxSteps" -> 15, "MinNumberOfDocumentsPerTerm" -> 0]\[DoubleLongRightArrow]
     LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]
     LSAMonEcho[Style["Obtained basis:", Bold, Purple]]\[DoubleLongRightArrow]
     LSAMonEchoFunctionContext[ImageAdjust[Image[Partition[#, ImageDimensions[aCImages[[1]]][[1]]]]] & /@SparseArray[#H] &]; 
 ]

(*{155.289, Null}*)

Remark: Note that NNMF “found” the interpretable
radical images we enhanced the original image set with.

Get the matrix and basis interpretation of the extracted image
topics:

H = 
   lsaAllExtendedObj\[DoubleLongRightArrow]
    LSAMonNormalizeMatrixProduct[Normalized -> Right]\[DoubleLongRightArrow]
    LSAMonTakeH;
lsBasis = ImageAdjust[Image[Partition[#, ImageDimensions[aCImages[[1]]][[1]]]]] & /@ SparseArray[H];

Approximation

Pick a Chinese character image as a target image and pre-process
it:

SeedRandom[43];
ind = RandomChoice[testingInds];
imgTest = aCImages[ind];
matImageTest = ToSSparseMatrix[SparseArray@List@ImageToVector[imgTest, ImageDimensions[aCImages[[1]]]], "RowNames" -> Automatic, "ColumnNames" -> Automatic];
imgTest

Find its representation with the chosen feature extractor (LSAMon
object here):

matReprsentation = lsaAllExtendedObj\[DoubleLongRightArrow]LSAMonRepresentByTopics[matImageTest]\[DoubleLongRightArrow]LSAMonTakeValue;
lsCoeff = Normal@SparseArray[matReprsentation[[1, All]]];
ListPlot[MapIndexed[Tooltip[#1, lsBasis[[#2[[1]]]]] &, lsCoeff], Filling -> Axis, PlotRange -> All]

Show representation coefficients outliers:

lsBasis[[OutlierPosition[Abs[lsCoeff], TopOutliers@*QuartileIdentifierParameters]]]

Remark: Note that expected
radical images are in the outliers.

Show the interpretation of the found representation:

vecReprsentation = lsCoeff . SparseArray[H];
reprImg = Image[Unitize@Clip[#, {0.45, 1}, {0, 1}] &@Rescale[Partition[vecReprsentation, ImageDimensions[aCImages[[1]]][[1]]]]];
{reprImg, imgTest}

See the closest characters using image distances:

KeyMap[# /. aCImages &, TakeSmallest[ImageDistance[reprImg, #] & /@ aCImages, 4]]

Setup

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicLatentSemanticAnalysis.m"];
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicSparseMatrixRecommender.m"];
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/HeatmapPlot.m"]

References

[SH1] Silvia Hao, “Exploring
structure of Chinese characters through image processing”, (2022),
Wolfram Community.

[AA1] Anton Antonov, “A monad for
Latent Semantic Analysis workflows”, (2019), Wolfram Community.

[AA2] Anton Antonov, “LSA methods
comparison over random mandalas deconstruction – WL”, (2022), Wolfram Community.

[AA3] Anton Antonov, “Bethlehem
stars: classifying randomly generated mandalas”, (2020), Wolfram Community.

[AA4] Anton Antonov, “Random mandalas deconstruction in R, Python, and Mathematica”, (2022), MathematicaForPrediction at WordPress.

[AAp1] Anton Antonov, LSAMon
for Image Collections Mathematica package, (2022), MathematicaForPrediction
at GitHub.

Random mandalas deconstruction in R, Python, and Mathematica

Posted on March 1, 2022 by Anton Antonov Antonov

Today (2022-02-28) I gave a presentation Greater Boston useR Meetup titled “Random mandalas deconstruction with R, Python, and Mathematica”. (Link to the video recording.)

Here is the abstract:

In this presentation we discuss the application of different dimension reduction algorithms over collections of random mandalas. We discuss and compare the derived image bases and show how those bases explain the underlying collection structure. The presented techniques and insights (1) are applicable to any collection of images, and (2) can be included in larger, more complicated machine learning workflows. The former is demonstrated with a handwritten digits recognition
application; the latter with the generation of random Bethlehem stars. The (parallel) walk-through of the core demonstration is in all three programming languages: Mathematica, Python, and R.

Here is the related RStudio project: “RandomMandalasDeconstruction”.

Here is a link to the R-computations notebook converted to HTML: “LSA methods comparison in R”.

The Mathematica notebooks are placed in project’s folder “notebooks-WL”.

See the work plan status in the org-mode file “Random-mandalas-deconstruction-presentation-work-plan.org”.

Here is the mind-map for the presentation:

The comparison workflow implemented in the notebooks of this project is summarized in the following flow chart:

References

Cryptocurrencies data explorations

Posted on June 22, 2021 by Anton Antonov Antonov

Introduction

The main goal of this notebook is to provide some basic views and insights into the landscape of cryptocurrencies. The “landscape” we consider consists of price action and trading volume time series for cryptocurrencies found in Yahoo Finance.

Here is the work plan followed in this notebook:

Get cryptocurrency data
Do basic data analysis over suitable date ranges
Gather important cryptocurrency events
Plot together cryptocurrency prices and trading volume time series together with the events
Make observations and conjectures over the plots
Find “global” correlations between the different cryptocurrencies
Find clusters of cryptocurrencies based on time series correlations

Here are some details for the steps above:

The procedure of obtaining the cryptocurrencies data, point 1, is explained in detail in [AA1].
- There is a dedicated resource object CrypocurrencyData that provides cryptocurrency data and related documentation.
The cryptocurrency events data, point 3, is taken from different news sites.
- Links are provided in the corresponding dataset.
The points 6 and 7 follow similar explorations (and code) described in [AA2, AA3].
- Those two articles deal with COVID-19 time series.

Remark: Note that in this notebook we do not discuss philosophical, macro-economic, and environmental issues with cryptocurrencies. We only discuss financial time series data.

Cryptocurrencies data

The cryptocurrencies data used in this notebook is obtained from found in Yahoo Finance . The procedure of obtaining the cryptocurrencies data is explained in detail in [AA1]. There is a dedicated resource object CrypocurrencyData that provides the cryptocurrency data and related documentation.

Here are all cryptocurrencies we have data for:

ResourceFunction["CryptocurrencyData"]["CryptocurrencyNames"]

(*<|"BTC" -> "Bitcoin", "ETH" -> "Ethereum", "USDT" -> "Tether", "BNB" -> "BinanceCoin", "ADA" -> "Cardano", "XRP" -> "XRP", "USDC" -> "Coin", "DOGE" -> "Dogecoin", "DOT1" -> "Polkadot", "HEX" -> "HEX", "UNI3" -> "Uniswap", "BCH" -> "BitcoinCash", "LTC" -> "Litecoin", "LINK" -> "Chainlink", "SOL1" -> "Solana", "MATIC" -> "MaticNetwork", "THETA" -> "THETA", "XLM" -> "Stellar", "VET" -> "VeChain", "ICP1" -> "InternetComputer", "ETC" -> "EthereumClassic", "TRX" -> "TRON", "FIL" -> "FilecoinFutures", "XMR" -> "Monero", "EOS" -> "EOS"|>*)

Remark: FinancialData is “aware” of 10 cryptocurrencies, but that is not documented (as far as I can tell) and only prices are provided. (For more details see the discussion in CrypocurrencyData.) Here are examples:

Row[DateListPlot[FinancialData[#, "Jan 1 2021"], ImageSize -> Medium, AspectRatio -> 1/4, PlotLabel -> #] & /@ {"BTC", "ETH"}]

Significant cryptocurrencies

In this section we analyze the summaries of cryptocurrencies data in order to derive a list of the most significant ones.

We choose the phrase “significant cryptocurrency” to mean “a cryptocurrency with high market capitalization, price, or trading volume.”

Together with the summaries we look into the Pareto principle adherence of the corresponding values.

Remark: The Pareto principle adherence should be interpreted carefully here – the cryptocurrencies are not mutually exclusive when in comes to money invested and trading volumes. Nevertheless, we can interpret the corresponding value ratios as indicators of “mind share” or “significance.”

By summaries

Here is a summary of the cryptocurrencies we consider (from Yahoo Finance) ordered by “Market Cap” (largest first):

dsCCSummary = ResourceFunction["CryptocurrencyData"][All, "Summary"]

Here is the summary of summary dataset above:

ResourceFunction["RecordsSummary"][dsCCSummary]

Here is a Pareto principle adherence plot for the cryptocurrency market caps:

aMCaps = Normal[dsCCSummary[Association, StringSplit[#Symbol, "-"][[1]] -> #["Market Cap"] &]]; ResourceFunction["ParetoPrinciplePlot"][aMCaps, PlotRange -> All, PlotLabel -> "Pareto principle for cryptocurrency market caps"]

Here is the Pareto statistic for the top 12 cryptocurrencies:

Take[AssociationThread[Keys@aMCaps, Accumulate[Values@aMCaps]]/Total[aMCaps], 12]

(*<|"BTC" -> 0.521221, "ETH" -> 0.71188, "USDT" -> 0.765931, "BNB" -> 0.800902, "ADA" -> 0.833777, "XRP" -> 0.856467, "USDC" -> 0.878274, "DOGE" -> 0.899587, "DOT1" -> 0.9121, "HEX" -> 0.924055, "UNI3" -> 0.932218, "BCH" -> 0.939346|>*)

By price

Get the mean daily closing prices data for the last two weeks and show the corresponding data summary:

startDate = DatePlus[Now, -Quantity[2, "Weeks"]]; aMeans = ReverseSort[Association[# -> Mean[ResourceFunction["CryptocurrencyData"][#, "Close", startDate]["Values"]] & /@ ResourceFunction["CryptocurrencyData"]["Cryptocurrencies"]]];
ResourceFunction["RecordsSummary"][aMeans, Thread -> True]

Pareto principle adherence plot:

ResourceFunction["ParetoPrinciplePlot"][aMeans, PlotRange -> All, PlotLabel -> "Pareto principle for cryptocurrency closing prices"]

Here are the Pareto statistic values for the top 12 cryptocurrencies:

aCCTop = Take[AssociationThread[Keys@aMeans, Accumulate[Values@aMeans]]/Total[aMeans], 12]

(*<|"BTC" -> 0.902595, "ETH" -> 0.959915, "BCH" -> 0.974031, "BNB" -> 0.982414, "XMR" -> 0.988689, "LTC" -> 0.992604, "FIL" -> 0.99426, "ICP1" -> 0.995683, "ETC" -> 0.997004, "SOL1" -> 0.997906, "LINK" -> 0.998449, "UNI3" -> 0.998987|>*)

Plot the daily closing prices of top cryptocurrencies since January 2018:

DateListPlot[Log10 /@ Association[# -> ResourceFunction["CryptocurrencyData"][#, "Close", "Jan 1, 2018"] & /@ Keys[aCCTop]], PlotLabel -> "lg of crytocurrencies daily closing prices, USD", PlotTheme -> "Detailed", PlotRange -> All]

By trading volume

Get the mean daily trading volumes data for the last two weeks and show the corresponding data summary:

startDate = DatePlus[Now, -Quantity[2, "Weeks"]]; aMeans = ReverseSort[Association[# -> Mean[ResourceFunction["CryptocurrencyData"][#, "Volume", startDate]["Values"]] & /@ ResourceFunction["CryptocurrencyData"]["Cryptocurrencies"]]];
ResourceFunction["RecordsSummary"][aMeans, Thread -> True]

Pareto principle adherence plot:

ResourceFunction["ParetoPrinciplePlot"][aMeans, PlotRange -> {0, 1.1},PlotRange -> All, PlotLabel -> "Pareto principle for cryptocurrency trading volumes"]

Here are the Pareto statistic values for the top 12 cryptocurrencies:

aCCTop = N@Take[AssociationThread[Keys@aMeans, Accumulate[Values@aMeans]]/Total[aMeans], 12]

(*<|"USDT" -> 0.405697, "BTC" -> 0.657918, "ETH" -> 0.817959, "XRP" -> 0.836729, "ADA" -> 0.853317, "ETC" -> 0.868084, "LTC" -> 0.882358, "DOGE" -> 0.896621, "BNB" -> 0.910013, "USDC" -> 0.923379, "BCH" -> 0.933938, "DOT1" -> 0.944249|>*)

Plot the daily closing prices of top cryptocurrencies since January 2018:

DateListPlot[Log10 /@ Association[# -> ResourceFunction["CryptocurrencyData"][#, "Volume", "Jan 1, 2018"] & /@ Keys[aCCTop]], PlotLabel -> "lg of cryptocurrencies trading volumes", PlotTheme -> "Detailed", PlotRange -> {5, Automatic}]

In this section we make a dataset that has the dates of certain cryptocurrency related events and links to their news announcements.

The events were taken by observing cryptocurrency board stories in the news aggregation site slashdot.org.

lsEventData = {
    {"Jun 18, 2021", "China to shut down over 90% of its Bitcoin mining capacity after local bans", "https://www.globaltimes.cn/page/202106/1226598.shtml"}, 
    {"Jun 10, 2021", "Global banking regulators call for toughest rules for cryptocurrencies", "https://www.theguardian.com/technology/2021/jun/10/global-banking-regulators-cryptocurrencies-bitcoin"}, 
    {"June 10, 2021", "IMF sees legal, economic issues with El Salvador's bitcoin move","https://www.reuters.com/business/finance/imf-sees-legal-economic-issues-with-el-salvador-bitcoin-move-2021-06-10/"}, 
    {"June 8, 2021", "El Salvador Becomes First Country To Adopt Bitcoin as Legal Tender After Passing Law", "https://www.cnbc.com/2021/06/09/el-salvador-proposes-law-to-make-bitcoin-legal-tender.html"}, 
    {"June 8, 2021", "US recovers millions in cryptocurrency paid to Colonial Pipeline ransomware hackers", "https://edition.cnn.com/2021/06/07/politics/colonial-pipeline-ransomware-recovered/"}, 
    {"June 4, 2021", "Start of Bitcoin 2021: World\[CloseCurlyQuote]s Largest Cryptocurrency Conference Coming To Wynwood", "https://miami.cbslocal.com/2021/06/04/bitcoin-2021-worlds-largest-cryptocurrency-conference-coming-to-wynwood/"}, 
    {"June 6, 2021", "End of Bitcoin 2021: World\[CloseCurlyQuote]s Largest Cryptocurrency Conference Coming To Wynwood", "https://miami.cbslocal.com/2021/06/04/bitcoin-2021-worlds-largest-cryptocurrency-conference-coming-to-wynwood/"}, 
    {"May 28, 2021", "Iran Bans Crypto Mining After Months of Blackouts", "https://gizmodo.com/iran-bans-crypto-mining-after-months-of-blackouts-1846991039"}, 
    {"May 19, 2021", "Bitcoin, Ethereum prices in free fall as China plans crackdown on mining and trading", "https://www.cnet.com/news/bitcoin-ethereum-prices-in-freefall-as-china-plans-crackdown-on-mining-and-trading/#ftag=CAD590a51e"} 
   };
dsEventData = Dataset[lsEventData][All, AssociationThread[{"Date", "Event", "URL"}, #] &];
dsEventData = dsEventData[All, Join[Prepend[#, "DateObject" -> DateObject[#Date]], <|"URL" -> URL[#URL]|>] &];
dsEventData = dsEventData[SortBy[#DateObject &]]

Cryptocurrency time series with events

In this section we discuss possible correlation and causation effects of reported cryptocurrency events.

Remark: The discussion is based on time series and events only, without considering other operational properties of the cryptocurrencies.

Here is a date range:

dateRange = {"May 15 2021", "Jun 21 2021"};

Here get time series for the daily opening and closing prices for the selected date range:

aBTCPrices = ResourceFunction["CryptocurrencyData"]["BTC", {"Open", "Close"}, dateRange];
aETHPrices = ResourceFunction["CryptocurrencyData"]["ETH", {"Open", "Close"}, dateRange];
aCCVolume = ResourceFunction["CryptocurrencyData"][{"BTC", "ETH"}, "Volume", dateRange];

Here are the summaries for prices:

ResourceFunction["GridTableForm"][Map[ResourceFunction["RecordsSummary"][#["Values"], "USD"] &, #] & /@ <|"BTC" -> aBTCPrices, "ETH" -> aETHPrices|>]

Here are the summaries for trading volumes:

ResourceFunction["RecordsSummary"][#["Values"], "USD"] & /@ aCCVolume

Here we plot the cryptocurrency events with together with the Bitcoin (BTC) price time series:

CryptocurrencyPlot[{aBTCPrices, dsEventData}, PlotLabel -> "BTC daily prices", ImageSize -> 1200]

Here we plot the cryptocurrency events with together with the Ether (ETH) price time series:

CryptocurrencyPlot[{aETHPrices, dsEventData}, PlotRange -> {0.95, 1.05} MinMax[aETHPrices[[1]]["Values"]], PlotLabel -> "BTC daily prices", ImageSize -> 1200]

Here we plot the cryptocurrency events with together with the BTC trading volume time series:

CryptocurrencyPlot[{aCCVolume, dsEventData}, PlotLabel -> "BTC and ETH trading volumes", ImageSize -> 1200]

Observations

Going down

We can see that opening prices and volume going down correlate with:

The news announcement that China plans to crackdown on mining and trading
The news announcement Iran bans crypto mining
The Sichuan Provincial Development and Reform Commission and the Sichuan Energy Bureau issue of a joint notice, ordering local electricity companies to “screen, clean up and terminate” mining operations
The start of the “Bitcoin 2021” conference

Related conjectures:

We can easily conjecture that 1 and 2 made cryptocurrencies (Bitcoin) less attractive to miners or traders in China and Iran, hence the price and the volume went down.
The most active Bitcoin traders were attending the “Bitcoin 2021” conference, hence the price and volume went down.

Going up

We can see the prices and volume going up correlate with:

The news announcement of El Salvador adopting BTC as legal tender currency
The news announcement that US Justice Department recovered most of the ransom paid to the Colonial Pipeline hackers
The end of the “Bitcoin 2021” conference

Related conjectures:

Of course, a country deciding to use BTC as legal tender would make (some) traders willing to invest in BTC.
The announcement that USA Justice Department, have made (some) traders to more confidently invest in BTC.
- Although, the opposite could also happen – for some people if BTC can be recovered by law enforcement, then BTC is less attractive for financial transactions.
After the end of “Bitcoin 2021” conference the attending traders resumed their usual activity.
- That conjecture and the “start of Bitcoin 2021” conjecture above support each other.
- The same pattern is observed for both BTC and ETH trading volumes.

Time series correlations

In this section we compute and visualize correlations between the time series of a set of cryptocurrencies.

Getting time series data

Here are the cryptocurrencies we consider:

lsCCFocus = ResourceFunction["CryptocurrencyData"]["Cryptocurrencies"]

(*{"ADA", "BCH", "BNB", "BTC", "DOGE", "DOT1", "EOS", "ETC", "ETH", "FIL", "HEX", "ICP1", "LINK", "LTC", "MATIC", "SOL1", "THETA", "TRX", "UNI3", "USDC", "USDT", "VET", "XLM", "XMR", "XRP"}*)

The start date we use is the one that was 90 days ago:

startDate = DatePlus[Date[], -Quantity[90, "Days"]]

(*{2021, 3, 24, 13, 24, 42.303}*)

aTSOpen = ResourceFunction["CryptocurrencyData"][lsCCFocus, "Open", startDate];
aTSVolume = ResourceFunction["CryptocurrencyData"][lsCCFocus, "Volume", startDate];

dateRange = {startDate, Date[]};
aTSOpen2 = Quiet@TimeSeriesResample[#, Append[dateRange, "Day"]] & /@ aTSOpen;
aTSVolume2 = Quiet@TimeSeriesResample[#, Append[dateRange, "Day"]] & /@ aTSVolume;

Opening price time series

Show heat-map plot corresponding to the max-normalized time series with clustering:

matVals = Association["SparseMatrix" -> SparseArray[Values@Map[#["Values"]/Max[#["Values"]] &, aTSOpen2]],"RowNames" -> Keys[aTSOpen2], "ColumnNames" -> Range[Length[aTSOpen2[[1]]["Times"]]]];
HeatmapPlot[Map[# /. x_Association :> Keys[x] &, matVals], Dendrogram -> {True, False}, DistanceFunction -> {CosineDistance, None}, ImageSize -> 1200]

Derive correlation triplets using SpearmanRho :

lsCorTriplets = Flatten[Outer[{#1, #2, SpearmanRho[aTSOpen2[#1]["Values"], aTSOpen2[#2]["Values"]]} &, Keys@aTSOpen2, Keys@aTSOpen2], 1];
dsCorTriplets = Dataset[lsCorTriplets][All, AssociationThread[{"TS1", "TS2", "Correlation"}, #] &];
dsCorTriplets = dsCorTriplets[Select[#TS1 != #TS2 &]];

Show summary of the correlation triplets:

ResourceFunction["RecordsSummary"][dsCorTriplets]

Show correlations that too high or too low:

Dataset[Union[Normal@dsCorTriplets[Select[Abs[#Correlation] > 0.85 &]], "SameTest" -> (Sort[Values@#1] == Sort[Values@#2] &)]][ReverseSortBy[#Correlation &]]

Cross tabulate the correlation triplets and show the corresponding dataset:

dsMatCor = ResourceFunction["CrossTabulate"][dsCorTriplets]

Cross tabulate the correlation triplets and plot the corresponding matrix with heat-map plot:

matCor1 = ResourceFunction["CrossTabulate"][dsCorTriplets, "Sparse" -> True];
gr1 = HeatmapPlot[matCor1, Dendrogram -> {True, True}, DistanceFunction -> {CosineDistance, CosineDistance}, ImageSize -> Medium, PlotLabel -> "Opening price"]

Trading volume time series

Show heat-map plot corresponding to the max-normalized time series with clustering:

matVals = Association["SparseMatrix" -> SparseArray[Values@Map[#["Values"]/Max[#["Values"]] &, aTSVolume2]], "RowNames" -> Keys[aTSOpen2], "ColumnNames" -> Range[Length[aTSVolume2[[1]]["Times"]]]];
HeatmapPlot[Map[# /. x_Association :> Keys[x] &, matVals], Dendrogram -> {True, False}, DistanceFunction -> {CosineDistance, None}, ImageSize -> 1200]

Derive correlation triplets using SpearmanRho :

lsCorTriplets = Flatten[Outer[{#1, #2, SpearmanRho[aTSVolume2[#1]["Values"], aTSVolume2[#2]["Values"]]} &, Keys@aTSVolume2, Keys@aTSVolume2], 1];
dsCorTriplets = Dataset[lsCorTriplets][All, AssociationThread[{"TS1", "TS2", "Correlation"}, #] &];
dsCorTriplets = dsCorTriplets[Select[#TS1 != #TS2 &]];

Show summary of the correlation triplets:

ResourceFunction["RecordsSummary"][dsCorTriplets]

Show correlations that too high or too low:

Dataset[Union[Normal@dsCorTriplets[Select[Abs[#Correlation] > 0.85 &]], "SameTest" -> (Sort[Values@#1] == Sort[Values@#2] &)]][ReverseSortBy[#Correlation &]]

Cross tabulate the correlation triplets and show the corresponding dataset:

dsMatCor = ResourceFunction["CrossTabulate"][dsCorTriplets]

Cross tabulate the correlation triplets and plot the corresponding matrix with heat-map plot:

matCor2 = ResourceFunction["CrossTabulate"][dsCorTriplets, "Sparse" -> True];
gr2 = HeatmapPlot[matCor2, Dendrogram -> {True, True}, DistanceFunction -> {CosineDistance, CosineDistance}, ImageSize -> Medium, PlotLabel -> "Trading volume"]

Observations

Here are the correlation matrix plots above placed next to each other:

Row[{gr1, gr2}]

Generally speaking, the two clustering patterns are different. This is one of the reasons to do the nearest neighbor graph clusterings below.

Nearest neighbors graphs

In this section we create nearest neighbor graphs of the correlation matrices computed above and plot clusterings of the nodes.

Graphs overview

Here we create the nearest neighbor graphs:

aNNGraphsVertexRules = Association@MapThread[#2 -> Association[Thread[Rule[Normal[Transpose[#SparseMatrix]], #ColumnNames]]] &, {{matCor1, matCor2}, {"Open", "Volume"}}];

aNNGraphs = Association@MapThread[(gr = NearestNeighborGraph[Normal[Transpose[#SparseMatrix]], 4, GraphLayout -> "SpringEmbedding", VertexLabels -> Normal[aNNGraphsVertexRules[#2]]]; #2 -> Graph[EdgeList[gr], VertexLabels -> Normal[aNNGraphsVertexRules[#2]], ImageSize -> Large]) &, {{matCor1, matCor2}, {"Open", "Volume"}}];

Here we plot the graphs with clusters:

ResourceFunction["GridTableForm"][List @@@ Normal[CommunityGraphPlot[#, ImageSize -> 800] & /@ aNNGraphs], TableHeadings -> {"Property", "Communities of nearest neighbors graph"}, Background -> White, Dividers -> All]

Here are the corresponding time series plots for each cluster:

aClusterPlots = 
   Association@Map[
     Function[{prop}, 
      prop -> Map[
        DateListPlot[Log10 /@ ResourceFunction["CryptocurrencyData"][#, prop, dateRange]] &, 
        FindGraphCommunities[aNNGraphs[prop]] /. aNNGraphsVertexRules[prop]] 
     ], 
     Keys[aNNGraphs] 
    ];

ResourceFunction["GridTableForm"][List @@@ Normal[aClusterPlots], TableHeadings -> {"Property", "Cluster plots"}, Background -> White, Dividers -> All]

Other types of analysis

I investigated the data with several other methods:

Clustering with different methods and distance functions
Clustering after the application of Independent Component Analysis (ICA), [AAw5]
Time series analysis with Quantile Regression (QR), [AAw6]

None of the outcomes provided some “immediate”, notable insight. The analyses with ICA and QR, though, seem to provide some interesting and fruitful future explorations.

Load packages

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/SSparseMatrix.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/HeatmapPlot.m"]

Definitions

Clear[CryptocurrencyPlot];
CryptocurrencyPlot[{aCryptoCurrenciesData_Association, dsEventData_Dataset}, opts : OptionsPattern[]] := 
   Block[{aEventDateObject, aEventURL, aEventRank, grGrid, lsVals}, 
    
    aEventDateObject = Normal@dsEventData[Association, {#Event -> AbsoluteTime[#DateObject]} &]; 
    aEventURL = Normal@dsEventData[Association, {#Event -> Button[Mouseover[Style[#Event, Gray, FontSize -> 10], Style[#Event, Pink, FontSize -> 10]], NotebookLocate[{#URL, None}], Appearance -> None]} &]; aEventRank = Block[{k = 1}, Normal@dsEventData[Association, {#Event -> (k++)/Length[dsEventData]} &]]; 
    
    lsVals = Flatten@Map[#["Values"] &, Values@aCryptoCurrenciesData];
    grGrid = 
     DateListPlot[
      KeyValueMap[Callout[{#2, Rescale[aEventRank[#1], {0, 1}, MinMax[lsVals]]}, aEventURL[#1], Right] &, Sort@aEventDateObject], 
      PlotStyle -> {Gray, Opacity[0.3], PointSize[0.0035]}, 
      Joined -> False, 
      GridLines -> {Sort@Values[aEventDateObject], None} 
     ]; 
    Show[
     DateListPlot[
      aCryptoCurrenciesData, 
      opts, 
      GridLines -> {Sort@Values[aEventDateObject], None}, 
      PlotRange -> All, 
      AspectRatio -> 1/4, 
      ImageSize -> Large 
     ], 
     grGrid 
    ] 
   ];
CryptocurrencyPlot[___] := $Failed;

References

Articles

[AA1] Anton Antonov, “Crypto-currencies data acquisition with visualization”, (2021), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “NY Times COVID-19 data visualization”, (2020), SystemModeling at GitHub.

[AA3] Anton Antonov, “Apple mobility trends data visualization”, (2020), SystemModeling at GitHub.

Packages

[AAp1] Anton Antonov, Data reshaping Mathematica package, (2018), MathematicaForPrediciton at GitHub.

[AAp2] Anton Antonov, Heatmap plot Mathematica package, (2018), MathematicaForPrediciton at GitHub.

Resource functions

[AAw1] Anton Antonov, CryptocurrencyData, (2021).

[AAw2] Anton Antonov, RecordsSummary, (2019).

[AAw3] Anton Antonov, ParetoPrinciplePlot, (2019).

[AAw4] Anton Antonov, CrossTabulate, (2019).

[AAw5] Anton Antonov, IndependentComponentAnalysis, (2019).

[AAw6] Anton Antonov, QuantileRegression, (2019).

Crypto-currencies data acquisition with visualization

Posted on June 19, 2021 by Anton Antonov Antonov

Introduction

In this notebook we show how to obtain crypto-currencies data from several data sources and make some basic time series visualizations. We assume the described data acquisition workflow is useful for doing more detailed (exploratory) analysis.

There are multiple crypto-currencies data sources, but a small proportion of them give a convenient way of extracting crypto-currencies data automatically. I found the easiest to work with to be https://finance.yahoo.com/cryptocurrencies, [YF1]. Another easy to work with Bitcoin-only data source is https://data.bitcoinity.org , [DBO1].

(I also looked into using https://www.coindesk.com/coindesk20. )

Remark: The code below is made with certain ad-hoc inductive reasoning that brought meaningful results. This means the code has to be changed if the underlying data organization in [YF1, DBO1] is changed.

Yahoo! Finance

Getting cryptocurrencies symbols and summaries

In this section we get all crypto-currencies symbols and related metadata.

Get the data of all crypto-currencies in [YF1]:

AbsoluteTiming[
  lsData = Import["https://finance.yahoo.com/cryptocurrencies", "Data"]; 
 ]

(*{6.18067, Null}*)

Locate the data:

pos = First@Position[lsData, {"Symbol", "Name", "Price (Intraday)", "Change", "% Change", ___}];
dsCryptoCurrenciesColumnNames = lsData[[Sequence @@ pos]]
Length[dsCryptoCurrenciesColumnNames]

(*{"Symbol", "Name", "Price (Intraday)", "Change", "% Change", "Market Cap", "Volume in Currency (Since 0:00 UTC)", "Volume in Currency (24Hr)", "Total Volume All Currencies (24Hr)", "Circulating Supply", "52 Week Range", "1 Day Chart"}*)

(*12*)

Get the data:

dsCryptoCurrencies = lsData[[Sequence @@ Append[Most[pos], 2]]];
Dimensions[dsCryptoCurrencies]

(*{25, 10}*)

Make a dataset:

dsCryptoCurrencies = Dataset[dsCryptoCurrencies][All, AssociationThread[dsCryptoCurrenciesColumnNames[[1 ;; -3]], #] &]

Get all time series

In this section we get all the crypto-currencies time series from [YF1].

AbsoluteTiming[
  ccNow = Round@AbsoluteTime[Date[]] - AbsoluteTime[{1970, 1, 1, 0, 0, 0}]; 
  aCryptoCurrenciesDataRaw = 
   Association@
    Map[
     # -> ResourceFunction["ImportCSVToDataset"]["https://query1.finance.yahoo.com/v7/finance/download/" <> # <>"?period1=1410825600&period2=" <> ToString[ccNow] <> "&interval=1d&events=history&includeAdjustedClose=true"] &, Normal[dsCryptoCurrencies[All, "Symbol"]] 
    ]; 
 ]

(*{5.98745, Null}*)

Remark: Note that in the code above we specified the upper limit of the time span to be the current date. (And shifted it with respect to the epoch start 1970-01-01 used by [YF1].)

Check we good the data with dimensions retrieval:

Dimensions /@ aCryptoCurrenciesDataRaw

(*<|"BTC-USD" -> {2468, 7}, "ETH-USD" -> {2144, 7}, "USDT-USD" -> {2307, 7}, "BNB-USD" -> {1426, 7}, "ADA-USD" -> {1358, 7}, "DOGE-USD" -> {2468, 7}, "XRP-USD" -> {2468, 7}, "USDC-USD" -> {986, 7}, "DOT1-USD" -> {304, 7}, "HEX-USD" -> {551, 7}, "UNI3-USD" -> {81, 7},"BCH-USD" -> {1428, 7}, "LTC-USD" -> {2468, 7}, "SOL1-USD" -> {436, 7}, "LINK-USD" -> {1369, 7}, "THETA-USD" -> {1250, 7}, "MATIC-USD" -> {784, 7}, "XLM-USD" -> {2468, 7}, "ICP1-USD" -> {32, 7}, "VET-USD" -> {1052, 7}, "ETC-USD" -> {1792, 7}, "FIL-USD" -> {1285, 7}, "TRX-USD" -> {1376, 7}, "XMR-USD" -> {2468, 7}, "EOS-USD" -> {1450, 7}|>*)

Check we good the data with random sample:

RandomSample[#, 6] & /@ KeyTake[aCryptoCurrenciesDataRaw, RandomChoice[Keys@aCryptoCurrenciesDataRaw]]

Here we add the crypto-currencies symbols and convert date strings into date objects.

AbsoluteTiming[
  aCryptoCurrenciesData = Association@KeyValueMap[Function[{k, v}, k -> v[All, Join[<|"Symbol" -> k, "DateObject" -> DateObject[#Date]|>, #] &]], aCryptoCurrenciesDataRaw]; 
 ]

(*{8.27865, Null}*)

Summary

In this section we compute the summary over all datasets:

ResourceFunction["RecordsSummary"][Join @@ Values[aCryptoCurrenciesData], "MaxTallies" -> 30]

Plots

Here we plot the “Low” and “High” price time series for each crypto-currency for the last 120 days:

nDays = 120;
Map[
  Block[{dsTemp = #[Select[AbsoluteTime[#DateObject] > AbsoluteTime[DatePlus[Now, -Quantity[nDays, "Days"]]] &]]}, 
    DateListPlot[{
      Normal[dsTemp[All, {"DateObject", "Low"}][Values]], 
      Normal[dsTemp[All, {"DateObject", "High"}][Values]]}, 
     PlotLegends -> {"Low", "High"}, 
     AspectRatio -> 1/4, 
     PlotRange -> All] 
   ] &, 
  aCryptoCurrenciesData 
 ]

Here we plot the volume time series for each crypto-currency for the last 120 days:

nDays = 120;
Map[
  Block[{dsTemp = #[Select[AbsoluteTime[#DateObject] > AbsoluteTime[DatePlus[Now, -Quantity[nDays, "Days"]]] &]]}, 
    DateListPlot[{
      Normal[dsTemp[All, {"DateObject", "Volume"}][Values]]}, 
     PlotLabel -> "Volume", 
     AspectRatio -> 1/4, 
     PlotRange -> All] 
   ] &, 
  aCryptoCurrenciesData 
 ]

data.bitcoinity.org

In this section we ingest crypto-currency data from data.bitcoinity.org, [DBO1].

Metadata

In this sub-section we assign different metadata elements used in data.bitcoinity.org.

The currencies and exchanges we obtained by examining the output of:

Import["https://data.bitcoinity.org/markets/price/30d/USD?t=l", "Plaintext"]

Assignments

lsCurrencies = {"all", "AED", "ARS", "AUD", "BRL", "CAD", "CHF", "CLP", "CNY", "COP", "CZK", "DKK", "EUR", "GBP", "HKD", "HRK", "HUF", "IDR", "ILS", "INR", "IRR", "JPY", "KES", "KRW", "MXN", "MYR", "NOK", "NZD", "PHP", "PKR", "PLN", "RON", "RUB", "RUR", "SAR", "SEK", "SGD", "THB", "TRY", "UAH", "USD", "VEF", "XAU", "ZAR"};

lsExchanges = {"all", "bit-x", "bit2c", "bitbay", "bitcoin.co.id", "bitcoincentral", "bitcoinde", "bitcoinsnorway", "bitcurex", "bitfinex", "bitflyer", "bithumb", "bitmarketpl", "bitmex", "bitquick", "bitso", "bitstamp", "btcchina", "btce", "btcmarkets", "campbx", "cex.io", "clevercoin", "coinbase", "coinfloor", "exmo", "gemini", "hitbtc", "huobi", "itbit", "korbit", "kraken", "lakebtc", "localbitcoins", "mercadobitcoin", "okcoin", "paymium", "quadrigacx", "therocktrading", "vaultoro", "wallofcoins"};

lsTimeSpans = {"10m", "1h", "6h", "24h", "3d", "30d", "6m", "2y", "5y", "all"};

lsTimeUnit = {"second", "minute", "hour", "day", "week", "month"};

aDataTypeDescriptions = Association@{"price" -> "Prince", "volume" -> "Trading Volume", "rank" -> "Rank", "bidask_sum" -> "Bid/Ask Sum", "spread" -> "Bid/Ask Spread", "tradespm" -> "Trades Per Minute"};
lsDataTypes = Keys[aDataTypeDescriptions];

Getting BTC data

Here we make a template string that for CSV data retrieval from data.bitcoinity.org:

stDBOURL = StringTemplate["https://data.bitcoinity.org/export_data.csv?currency=`currency`&data_type=`dataType`&exchange=`exchange`&r=`timeUnit`&t=l&timespan=`timeSpan`"]

(*TemplateObject[{"https://data.bitcoinity.org/export_data.csv?currency=", TemplateSlot["currency"], "&data_type=", TemplateSlot["dataType"], "&exchange=", TemplateSlot["exchange"], "&r=", TemplateSlot["timeUnit"], "&t=l&timespan=", TemplateSlot["timeSpan"]}, CombinerFunction -> StringJoin, InsertionFunction -> TextString]*)

Here is an association with default values for the string template above:

aDBODefaultParameters = <|"currency" -> "USD", "dataType" -> "price", "exchange" -> "all", "timeUnit" -> "day", "timeSpan" -> "all"|>;

Remark: The metadata assigned above is used to form valid queries for the query string template.

Remark: Not all combinations of parameters are “fully respected” by data.bitcoinity.org. For example, if a data request is with time granularity that is too fine over a large time span, then the returned data is with coarser granularity.

Price for a particular currency and exchange pair

Here we retrieve data by overwriting the parameters for currency, time unit, time span, and exchange:

dsBTCPriceData = 
  ResourceFunction["ImportCSVToDataset"][stDBOURL[Join[aDBODefaultParameters, <|"currency" -> "EUR", "timeUnit" -> "hour", "timeSpan" -> "7d", "exchange" -> "coinbase"|>]]]

Here is a summary:

ResourceFunction["RecordsSummary"][dsBTCPriceData]

Volume data

Here we retrieve data by overwriting the parameters for data type, time unit, time span, and exchange:

dsBTCVolumeData = 
  ResourceFunction["ImportCSVToDataset"][stDBOURL[Join[aDBODefaultParameters, <|"dataType" -> Volume, "timeUnit" -> "day", "timeSpan" -> "30d", "exchange" -> "all"|>]]]

Here is a summary:

ResourceFunction["RecordsSummary"][dsBTCVolumeData]

Plots

Price data

Here we extract the non-time columns in the tabular price data obtained above and plot the corresponding time series:

DateListPlot[Association[# -> Normal[dsBTCPriceData[All, {"Time", #}][Values]] & /@Rest[Normal[Keys[dsBTCPriceData[[1]]]]]], AspectRatio -> 1/4, ImageSize -> Large]

Volume data

Here we extract the non-time columns (corresponding to exchanges) in the tabular volume data obtained above and plot the corresponding time series:

DateListPlot[Association[# -> Normal[dsBTCVolumeData[All, {"Time", #}][Values]] & /@ Rest[Normal[Keys[dsBTCVolumeData[[1]]]]]], PlotRange -> All, AspectRatio -> 1/4, ImageSize -> Large]

References

[DBO1] https://data.bitcoinity.org.

[WK1] Wikipedia entry, Cryptocurrency.

[YF1] Yahoo! Finance, Cryptocurrencies.

Time series search engines over COVID-19 data

Posted on January 27, 2021 by Anton Antonov Antonov

Introduction

In this article we proclaim the preparation and availability of interactive interfaces to two Time Series Search Engines (TSSEs) over COVID-19 data. One TSSE is based on Apple Mobility Trends data, [APPL1]; the other on The New York Times COVID-19 data, [NYT1].

Here are links to interactive interfaces of the TSSEs hosted (and publicly available) at shinyapps.io by RStudio:

Motivation: The primary motivation for making the TSSEs and their interactive interfaces is to use them as exploratory tools. Combined with relevant data analysis (e.g. [AA1, AA2]) the TSSEs should help to form better intuition and feel of the spread of COVID-19 and related data aggregation, public reactions, and government polices.

The rest of the article is structured as follows:

Brief descriptions the overall process, the data
Brief descriptions the search engines structure and implementation
Discussions of a few search examples and their (possible) interpretations

The overall process

For both search engines the overall process has the same steps:

Ingest the data
Do basic (and advanced) data analysis
Make (and publish) reports detailing the data ingestion and transformation steps
Enhance the data with transformed versions of it or with additional related data
Make a Time Series Sparse Matrix Recommender (TSSMR)
Make a Time Series Search Engine Interactive Interface (TSSEII)
Make the interactive interface easily accessible over the World Wide Web

Here is a flow chart that corresponds to the steps listed above:

Data

The Apple data

The Apple Mobility Trends data is taken from Apple’s site, see [APPL1]. The data ingestion, basic data analysis, time series seasonality demonstration, (graph) clusterings are given in [AA1]. (Here is a link to the corresponding R-notebook .)

The weather data was taken using the Mathematica function WeatherData, [WRI1].

(It was too much work to get the weather data using some of the well known weather data R packages.)

The New York Times data

The New York Times COVID-19 data is taken from GitHub, see [NYT1]. The data ingestion, basic data analysis, and visualizations are given in [AA2]. (Here is a link to the corresponding R-notebook .)

The search engines

The following sub-sections have screenshots of the TSSE interactive interfaces.

I did experiment with combining the data of the two engines, but did not turn out to be particularly useful. It seems that is more interesting and useful to enhance the Apple data engine with temperature data, and to enhance The New Your Times engine with the (consecutive) differences of the time series.

Structure

The interactive interfaces have three panels:

Nearest Neighbors
- Gives the time series nearest neighbors for the time series of selected entity.
- Has interactive controls for entity selection and filtering.
Trend Finding
- Gives the time series that adhere to a specified named trend.
- Has interactive controls for trend curves selection and entity filtering.
Notes
- Gives references and data objects summary.

Implementation

Both TSSEs are implemented using the R packages “SparseMatrixRecommender”, [AAp1], and “SparseMatrixRecommenderInterfaces”, [AAp2].

The package “SparseMatrixRecommender” provides functions to create and use Sparse Matrix Recommender (SMR) objects. Both TSSEs use underlying SMR objects.

The package “SparseMatrixRecommenderInterfaces” provides functions to generate the server and client functions for the Shiny framework by RStudio.

As it was mentioned above, both TSSEs are published at shinyapps.io. The corresponding source codes can be found in [AAr1].

Apple Mobility Trends Reports Search Engine

The Apple data TSSE has four types of time series (“entities”). The first three are normalized volumes of Apple maps requests while driving, transit transport use, and walking. (See [AA1] for more details.) The fourth is daily mean temperature at different geo-locations.

Here are screenshots of the panels “Nearest Neighbors” and “Trend Finding” (at interface launch):

The New York Times COVID-19 Data Search Engine

The New York Times TSSE has four types of time series (aggregated) cases and deaths, and their corresponding time series differences.

Here are screenshots of the panels “Nearest Neighbors” and “Trend Finding” (at interface launch):

Examples

In this section we discuss in some detail several examples of using each of the TSSEs.

Apple data search engine examples

Here are a few observations from [AA1]:

The COVID-19 lockdowns are clearly reflected in the time series.
The time series from the Apple Mobility Trends data shows strong weekly seasonality. Roughly speaking, people go to places they are not familiar with on Fridays and Saturdays. Other work week days people are more familiar with their trips. Since much lesser number of requests are made on Sundays, we can conjecture that many people stay at home or visit very familiar locations.

Here are a few assumptions:

Where people frequently go (work, school, groceries shopping, etc.) they do not need directions that much.
People request directions when they have more free time and will for “leisure trips.”
During vacations people are more likely to be in places they are less familiar with.
People are more likely to take leisure trips when the weather is good. (Warm, not raining, etc.)

Nice, France vs Florida, USA

Consider the results of the Nearest Neighbors panel for Nice, France.

Since French tend to go on vacation in July and August ([SS1, INSEE1]) we can see that driving, transit, and walking in Nice have pronounced peaks during that time:

Of course, we also observe the lockdown period in that geographical area.

Compare those time series with the time series from driving in Florida, USA:

We can see that people in Florida, USA have driving patterns unrelated to the typical weather seasons and vacation periods.

(Further TSSE queries show that there is a negative correlation with the temperature in south Florida and the volumes of Apple Maps directions requests.)

Italy and Balkan countries driving

We can see that according to the data people who have access to both iPhones and cars in Italy and the Balkan countries Bulgaria, Greece, and Romania have similar directions requests patterns:

(The similarities can be explained with at least a few “obvious” facts, but we are going to restrain ourselves.)

The New York Times data search engine examples

In Broward county, Florida, USA and Cook county, Illinois, USA we can see two waves of infections in the difference time series:

References

Data

[APPL1] Apple Inc., Mobility Trends Reports, (2020), apple.com.

[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.

[WRI1] Wolfram Research (2008), WeatherData, Wolfram Language function.

Articles

[AA1] Anton Antonov, “Apple mobility trends data visualization (for COVID-19)”, (2020), SystemModeling at GitHub/antononcube.

[AA2] Anton Antonov, “NY Times COVID-19 data visualization”, (2020), SystemModeling at GitHub/antononcube.

[INSEE1] Institut national de la statistique et des études économiques, “En 2010, les salariés ont pris en moyenne six semaines de congé”, (2012).

[SS1] Sam Schechner and Lee Harris, “What Happens When All of France Takes Vacation? 438 Miles of Traffic”, (2019), The Wall Street Journal

Packages, repositories

[AAp1] Anton Antonov, Sparse Matrix Recommender framework functions, (2019), R-packages at GitHub/antononcube.

[AAp2] Anton Antonov, Sparse Matrix Recommender framework interface functions, (2019), R-packages at GitHub/antononcube.

[AAr1] Anton Antonov, Coronavirus propagation dynamics, (2020), SystemModeling at GitHub/antononcube.

NY Times COVID-19 data visualization (Update)

Posted on January 15, 2021 by Anton Antonov Antonov

Introduction

This post is both an update and a full-blown version of an older post — “NY Times COVID-19 data visualization” — using NY Times COVID-19 data up to 2021-01-13.

The purpose of this document/notebook is to give data locations, data ingestion code, and code for rudimentary analysis and visualization of COVID-19 data provided by New York Times, [NYT1].

The following steps are taken:

Ingest data
- Take COVID-19 data from The New York Times, based on reports from state and local health agencies, [NYT1].
- Take USA counties records data (FIPS codes, geo-coordinates, populations), [WRI1].
Merge the data.
Make data summaries and related plots.
Make corresponding geo-plots.
Do “out of the box” time series forecast.
Analyze fluctuations around time series trends.

Note that other, older repositories with COVID-19 data exist, like, [JH1, VK1].

Remark: The time series section is done for illustration purposes only. The forecasts there should not be taken seriously.

Import data

NYTimes USA states data

dsNYDataStates = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"];
dsNYDataStates = dsNYDataStates[All, AssociationThread[Capitalize /@ Keys[#], Values[#]] &];
dsNYDataStates[[1 ;; 6]]

ResourceFunction["RecordsSummary"][dsNYDataStates]

NYTimes USA counties data

dsNYDataCounties = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"];
dsNYDataCounties = dsNYDataCounties[All, AssociationThread[Capitalize /@ Keys[#], Values[#]] &];
dsNYDataCounties[[1 ;; 6]]

ResourceFunction["RecordsSummary"][dsNYDataCounties]

US county records

dsUSACountyData = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/antononcube/SystemModeling/master/Data/dfUSACountyRecords.csv"];
dsUSACountyData = dsUSACountyData[All, Join[#, <|"FIPS" -> ToExpression[#FIPS]|>] &];
dsUSACountyData[[1 ;; 6]]

ResourceFunction["RecordsSummary"][dsUSACountyData]

Merge data

Verify that the two datasets have common FIPS codes:

Length[Intersection[Normal[dsUSACountyData[All, "FIPS"]], Normal[dsNYDataCounties[All, "Fips"]]]]

(*3133*)

Merge the datasets:

dsNYDataCountiesExtended = Dataset[JoinAcross[Normal[dsNYDataCounties], Normal[dsUSACountyData[All, {"FIPS", "Lat", "Lon", "Population"}]], Key["Fips"] -> Key["FIPS"]]];

Add a “DateObject” column and (reverse) sort by date:

dsNYDataCountiesExtended = dsNYDataCountiesExtended[All, Join[<|"DateObject" -> DateObject[#Date]|>, #] &];
dsNYDataCountiesExtended = dsNYDataCountiesExtended[ReverseSortBy[#DateObject &]];
dsNYDataCountiesExtended[[1 ;; 6]]

Basic data analysis

We consider cases and deaths for the last date only. (The queries can be easily adjusted for other dates.)

dfQuery = dsNYDataCountiesExtended[Select[#Date == dsNYDataCountiesExtended[1, "Date"] &], {"FIPS", "Cases", "Deaths"}];
dfQuery = dfQuery[All, Prepend[#, "FIPS" -> ToString[#FIPS]] &];

Total[dfQuery[All, {"Cases", "Deaths"}]]

(*<|"Cases" -> 22387340, "Deaths" -> 355736|>*)

Here is the summary of the values of cases and deaths across the different USA counties:

ResourceFunction["RecordsSummary"][dfQuery]

The following table of plots shows the distributions of cases and deaths and the corresponding Pareto principle adherence plots:

opts = {PlotRange -> All, ImageSize -> Medium};
Rasterize[Grid[
   Function[{columnName}, 
     {Histogram[Log10[#], PlotLabel -> Row[{Log10, Spacer[3], columnName}], opts], ResourceFunction["ParetoPrinciplePlot"][#, PlotLabel -> columnName, opts]} &@Normal[dfQuery[All, columnName]] 
    ] /@ {"Cases", "Deaths"}, 
   Dividers -> All, FrameStyle -> GrayLevel[0.7]]]

A couple of observations:

The logarithms of the cases and deaths have nearly Normal or Logistic distributions.
Typical manifestation of the Pareto principle: 80% of the cases and deaths are registered in 20% of the counties.

Remark: The top 20% counties of the cases are not necessarily the same as the top 20% counties of the deaths.

Distributions

Here we find the distributions that correspond to the cases and deaths (using FindDistribution ):

ResourceFunction["GridTableForm"][List @@@ Map[Function[{columnName}, 
     columnName -> FindDistribution[N@Log10[Select[#, # > 0 &]]] &@Normal[dfQuery[All, columnName]] 
    ], {"Cases", "Deaths"}], TableHeadings -> {"Data", "Distribution"}]

Pareto principle locations

The following query finds the intersection between that for the top 600 Pareto principle locations for the cases and deaths:

Length[Intersection @@ Map[Function[{columnName}, Keys[TakeLargest[Normal@dfQuery[Association, #FIPS -> #[columnName] &], 600]]], {"Cases", "Deaths"}]]

(*516*)

Geo-histogram

lsAllDates = Union[Normal[dsNYDataCountiesExtended[All, "Date"]]];
lsAllDates // Length

(*359*)

Manipulate[
  DynamicModule[{ds = dsNYDataCountiesExtended[Select[#Date == datePick &]]}, 
   GeoHistogram[
    Normal[ds[All, {"Lat", "Lon"}][All, Values]] -> N[Normal[ds[All, columnName]]], 
    Quantity[150, "Miles"], PlotLabel -> columnName, PlotLegends -> Automatic, ImageSize -> Large, GeoProjection -> "Equirectangular"] 
  ], 
  {{columnName, "Cases", "Data type:"}, {"Cases", "Deaths"}}, 
  {{datePick, Last[lsAllDates], "Date:"}, lsAllDates}]

Heat-map plots

An alternative of the geo-visualization is to use a heat-map plot. Here we use the package “HeatmapPlot.m”, [AAp1].

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/HeatmapPlot.m"]

Cases

Cross-tabulate states with dates over cases:

matSDC = ResourceFunction["CrossTabulate"][dsNYDataStates[All, {"State", "Date", "Cases"}], "Sparse" -> True];

Make a heat-map plot by sorting the columns of the cross-tabulation matrix (that correspond to states):

HeatmapPlot[matSDC, DistanceFunction -> {EuclideanDistance, None}, AspectRatio -> 1/2, ImageSize -> 1000]

Deaths

Cross-tabulate states with dates over deaths and plot:

matSDD = ResourceFunction["CrossTabulate"][dsNYDataStates[All, {"State", "Date", "Deaths"}], "Sparse" -> True];
HeatmapPlot[matSDD, DistanceFunction -> {EuclideanDistance, None}, AspectRatio -> 1/2, ImageSize -> 1000]

Time series analysis

Cases

Time series

For each date sum all cases over the states, make a time series, and plot it:

tsCases = TimeSeries@(List @@@ Normal[GroupBy[Normal[dsNYDataCountiesExtended], #DateObject &, Total[#Cases & /@ #] &]]);
opts = {PlotTheme -> "Detailed", PlotRange -> All, AspectRatio -> 1/4,ImageSize -> Large};
DateListPlot[tsCases, PlotLabel -> "Cases", opts]

ResourceFunction["RecordsSummary"][tsCases["Path"]]

Logarithmic plot:

DateListPlot[Log10[tsCases], PlotLabel -> Row[{Log10, Spacer[3], "Cases"}], opts]

“Forecast”

Fit a time series model to log 10 of the time series:

tsm = TimeSeriesModelFit[Log10[tsCases]]

Plot log 10 data and forecast:

DateListPlot[{tsm["TemporalData"], TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]

Plot data and forecast:

DateListPlot[{tsCases, 10^TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]

Deaths

Time series

For each date sum all cases over the states, make a time series, and plot it:

tsDeaths = TimeSeries@(List @@@ Normal[GroupBy[Normal[dsNYDataCountiesExtended], #DateObject &, Total[#Deaths & /@ #] &]]);
opts = {PlotTheme -> "Detailed", PlotRange -> All, AspectRatio -> 1/4,ImageSize -> Large};
DateListPlot[tsDeaths, PlotLabel -> "Deaths", opts]

ResourceFunction["RecordsSummary"][tsDeaths["Path"]]

“Forecast”

Fit a time series model:

tsm = TimeSeriesModelFit[tsDeaths, "ARMA"]

Plot data and forecast:

DateListPlot[{tsm["TemporalData"], TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]

Fluctuations

We want to see does the time series data have fluctuations around its trends and estimate the distributions of those fluctuations. (Knowing those distributions some further studies can be done.)

This can be efficiently using the software monad QRMon, [AAp2, AA1]. Here we load the QRMon package:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicQuantileRegression.m"]

Fluctuations presence

Here we plot the consecutive differences of the cases:

DateListPlot[Differences[tsCases], ImageSize -> Large, AspectRatio -> 1/4, PlotRange -> All]

Here we plot the consecutive differences of the deaths:

DateListPlot[Differences[tsDeaths], ImageSize -> Large, AspectRatio -> 1/4, PlotRange -> All]

From the plots we see that time series are not monotonically increasing, and there are non-trivial fluctuations in the data.

Absolute and relative errors distributions

Here we take interesting part of the cases data:

tsData = TimeSeriesWindow[tsCases, {{2020, 5, 1}, {2020, 12, 31}}];

Here we specify QRMon workflow that rescales the data, fits a B-spline curve to get the trend, and finds the absolute and relative errors (residuals, fluctuations) around that trend:

qrObj = 
   QRMonUnit[tsData]⟹
    QRMonEchoDataSummary⟹
    QRMonRescale[Axes -> {False, True}]⟹
    QRMonEchoDataSummary⟹
    QRMonQuantileRegression[16, 0.5]⟹
    QRMonSetRegressionFunctionsPlotOptions[{PlotStyle -> Red}]⟹
    QRMonDateListPlot[AspectRatio -> 1/4, ImageSize -> Large]⟹
    QRMonErrorPlots["RelativeErrors" -> False, AspectRatio -> 1/4, ImageSize -> Large, DateListPlot -> True]⟹
    QRMonErrorPlots["RelativeErrors" -> True, AspectRatio -> 1/4, ImageSize -> Large, DateListPlot -> True];

Here we find the distribution of the absolute errors (fluctuations) using FindDistribution:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> False]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[All, 2]]]

(*CauchyDistribution[6.0799*10^-6, 0.000331709]*)

Absolute errors distributions for the last 90 days:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> False]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[-90 ;; -1, 2]]]

(*ExtremeValueDistribution[-0.000996315, 0.00207593]*)

Here we find the distribution of the of the relative errors:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> True]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[All, 2]]]

(*StudentTDistribution[0.0000511326, 0.00244023, 1.59364]*)

Relative errors distributions for the last 90 days:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> True]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[-90 ;; -1, 2]]]

(*NormalDistribution[9.66949*10^-6, 0.00394395]*)

References

[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.

[WRI1] Wolfram Research Inc., USA county records, (2020), System Modeling at GitHub.

[JH1] CSSE at Johns Hopkins University, COVID-19, (2020), GitHub.

[VK1] Vitaliy Kaurov, Resources For Novel Coronavirus COVID-19, (2020), community.wolfram.com.

[AA1] Anton Antonov, “A monad for Quantile Regression workflows”, (2018), at MathematicaForPrediction WordPress.

[AAp1] Anton Antonov, Heatmap plot Mathematica package, (2018), MathematicaForPrediciton at GitHub.

[AAp2] Anton Antonov, Monadic Quantile Regression Mathematica package, (2018), MathematicaForPrediciton at GitHub.

Generation of Random Bethlehem Stars

Posted on December 21, 2020 by Anton Antonov Antonov

Introduction

This document/notebook is inspired by the Mathematica Stack Exchange (MSE) question “Plotting the Star of Bethlehem”, [MSE1]. That MSE question requests efficient and fast plotting of a certain mathematical function that (maybe) looks like the Star of Bethlehem, [Wk1]. Instead of doing what the author of the questions suggests, I decided to use a generative art program and workflows from three of most important Machine Learning (ML) sub-cultures: Latent Semantic Analysis, Recommendations, and Classification.

Although we discuss making of Bethlehem Star-like images, the ML workflows and corresponding code presented in this document/notebook have general applicability – in many situations we have to make classifiers based on data that has to be “feature engineered” through pipeline of several types of ML transformative workflows and that feature engineering requires multiple iterations of re-examinations and tuning in order to achieve the set goals.

The document/notebook is structured as follows:

Target Bethlehem Star images
Simplistic approach
Elaborated approach outline
Sections that follow through elaborated approach outline:
1. Data generation
2. Feature extraction
3. Recommender creation
4. Classifier creation and utilization experiments

(This document/notebook is a “raw” chapter for the book “Simplified Machine Learning Workflows”, [AAr3].)

Target images

Here are the images taken from [MSE1] that we consider to be “Bethlehem Stars” in this document/notebook:

imgStar1 = Import["https://i.stack.imgur.com/qmmOw.png"];
imgStar2 = Import["https://i.stack.imgur.com/5gtsS.png"];
Row[{imgStar1, Spacer[5], imgStar2}]

We notice that similar images can be obtained using the Wolfram Function Repository (WFR) function RandomMandala, [AAr1]. Here are a dozen examples:

SeedRandom[5];
Multicolumn[Table[MandalaToWhiterImage@ResourceFunction["RandomMandala"]["RotationalSymmetryOrder" -> 2, "NumberOfSeedElements" -> RandomInteger[{2, 8}], "ConnectingFunction" -> FilledCurve@*BezierCurve], 12], 6, Background -> Black]

Simplistic approach

We can just generate a large enough set of mandalas and pick the ones we like.

More precisely we have the following steps:

We generate, say, 200 random mandalas using BlockRandom and keeping track of the random seeds
1. The mandalas are generated with rotational symmetry order 2 and filled Bezier curve connections.
We pick mandalas that look, more or less, like Bethlehem Stars
Add picked mandalas to the results list
If too few mandalas are in the results list go to 1.

Here are some mandalas generated with those steps:

lsStarReferenceSeeds = DeleteDuplicates@{697734, 227488491, 296515155601, 328716690761, 25979673846, 48784395076, 61082107304, 63772596796, 128581744446, 194807926867, 254647184786, 271909611066, 296515155601, 575775702222, 595562118302, 663386458123, 664847685618, 680328164429, 859482663706};
Multicolumn[
  Table[BlockRandom[ResourceFunction["RandomMandala"]["RotationalSymmetryOrder" -> 2, "NumberOfSeedElements" -> Automatic, "ConnectingFunction" -> FilledCurve@*BezierCurve, ColorFunction -> (White &), Background -> Black], RandomSeeding -> rs], {rs, lsStarReferenceSeeds}] /. GrayLevel[0.25`] -> White, 6, Appearance -> "Horizontal", Background -> Black]

Remark: The plot above looks prettier in notebook converted with the resource function DarkMode.

Elaborated approach

Assume that we want to automate the simplistic approach described in the previous section.

One way to automate is to create a Machine Learning (ML) classifier that is capable of discerning which RandomMandala objects look like Bethlehem Star target images and which do not. With such a classifier we can write a function BethlehemMandala that applies the classifier on multiple results from RandomMandala and returns those mandalas that the classifier says are good.

Here are the steps of building the proposed classifier:

Generate a large enough Random Mandala Images Set (RMIS)
Create a feature extractor from a subset of RMIS
Assign features to all of RMIS
Make a recommender with the RMIS features and other image data (like pixel values)
Apply the RMIS recommender over the target Bethlehem Star images and determine and examine image sets that are:
- the best recommendations
- the worse recommendations
With the best and worse recommendations sets compose training data for classifier making
Train a classifier
Examine classifier application to (filtering of) random mandala images (both in RMIS and not in RMIS)
If the results are not satisfactory redo some or all of the steps above

Remark: If the results are not satisfactory we should consider using the obtained classifier at the data generation phase. (This is not done in this document/notebook.)

Remark: The elaborated approach outline and flow chart have general applicability, not just for generation of random images of a certain type.

Flow chart

Here is a flow chart that corresponds to the outline above:

A few observations for the flow chart follow:

The flow chart has a feature extraction block that shows that the feature extraction can be done in several ways.
- The application of LSA is a type of feature extraction which this document/notebook uses.
If the results are not good enough the flow chart shows that the classifier can be used at the data generation phase.
If the results are not good enough there are several alternatives to redo or tune the ML algorithms.
- Changing or tuning the recommender implies training a new classifier.
- Changing or tuning the feature extraction implies making a new recommender and a new classifier.

Data generation and preparation

In this section we generate random mandala graphics, transform them into images and corresponding vectors. Those image-vectors can be used to apply dimension reduction algorithms. (Other feature extraction algorithms can be applied over the images.)

Generated data

Generate large number of mandalas:

k = 20000;
knownSeedsQ = False;
SeedRandom[343];
lsRSeeds = Union@RandomInteger[{1, 10^9}, k];
AbsoluteTiming[
  aMandalas = 
    If[TrueQ@knownSeedsQ, 
     Association@Table[rs -> BlockRandom[ResourceFunction["RandomMandala"]["RotationalSymmetryOrder" -> 2, "NumberOfSeedElements" -> Automatic, "ConnectingFunction" -> FilledCurve@*BezierCurve], RandomSeeding -> rs], {rs, lsRSeeds}], 
    (*ELSE*) 
     Association@Table[i -> ResourceFunction["RandomMandala"]["RotationalSymmetryOrder" -> 2, "NumberOfSeedElements" -> Automatic, "ConnectingFunction" -> FilledCurve@*BezierCurve], {i, 1, k}] 
    ]; 
 ]

(*{18.7549, Null}*)

Check the number of mandalas generated:

Length[aMandalas]

(*20000*)

Show a sample of the generated mandalas:

Magnify[Multicolumn[MandalaToWhiterImage /@ RandomSample[Values@aMandalas, 40], 10, Background -> Black], 0.7]

Data preparation

Convert the mandala graphics into images using appropriately large (or appropriately small) image sizes:

AbsoluteTiming[
  aMImages = ParallelMap[ImageResize[#, {120, 120}] &, aMandalas]; 
 ]

(*{248.202, Null}*)

Flatten each of the images into vectors:

AbsoluteTiming[
  aMImageVecs = ParallelMap[Flatten[ImageData[Binarize@ColorNegate@ColorConvert[#, "Grayscale"]]] &, aMImages]; 
 ]

(*{16.0125, Null}*)

Remark: Below those vectors are called image-vectors.

Feature extraction

In this section we use the software monad LSAMon, [AA1, AAp1], to do dimension reduction over a subset of random mandala images.

Remark: Other feature extraction methods can be used through the built-in functions FeatureExtraction and FeatureExtract.

Dimension reduction

Create an LSAMon object and extract image topics using Singular Value Decomposition (SVD) or Independent Component Analysis (ICA), [AAr2]:

SeedRandom[893];
AbsoluteTiming[
  lsaObj = 
    LSAMonUnit[]⟹
     LSAMonSetDocumentTermMatrix[SparseArray[Values@RandomSample[aMImageVecs, UpTo[2000]]]]⟹
     LSAMonApplyTermWeightFunctions["None", "None", "Cosine"]⟹
     LSAMonExtractTopics["NumberOfTopics" -> 40, Method -> "ICA", "MaxSteps" -> 240, "MinNumberOfDocumentsPerTerm" -> 0]⟹
     LSAMonNormalizeMatrixProduct[Normalized -> Left]; 
 ]

(*{16.1871, Null}*)

Show the importance coefficients of the topics (if SVD was used the plot would show the singular values):

ListPlot[Norm /@ SparseArray[lsaObj⟹LSAMonTakeH], Filling -> Axis, PlotRange -> All, PlotTheme -> "Scientific"]

Show the interpretation of the extracted image topics:

lsaObj⟹
   LSAMonNormalizeMatrixProduct[Normalized -> Right]⟹
   LSAMonEchoFunctionContext[ImageAdjust[Image[Partition[#, ImageDimensions[aMImages[[1]]][[1]]]]] & /@ SparseArray[#H] &];

Approximation

Pick a test image that is a mandala image or a target image and pre-process it:

If[True, 
   ind = RandomChoice[Range[Length[Values[aMImages]]]]; 
   imgTest = MandalaToWhiterImage@aMandalas[[ind]]; 
   matImageTest = ToSSparseMatrix[SparseArray@List@ImageToVector[imgTest, ImageDimensions[aMImages[[1]]]], "RowNames" -> Automatic, "ColumnNames" -> Automatic], 
  (*ELSE*) 
   imgTest = Binarize[imgStar2, 0.5]; 
   matImageTest = ToSSparseMatrix[SparseArray@List@ImageToVector[imgTest, ImageDimensions[aMImages[[1]]]], "RowNames" -> Automatic, "ColumnNames" -> Automatic] 
  ];
imgTest

Find the representation of the test image with the chosen feature extractor (LSAMon object here):

matReprsentation = lsaObj⟹LSAMonRepresentByTopics[matImageTest]⟹LSAMonTakeValue;
lsCoeff = Normal@SparseArray[matReprsentation[[1, All]]];
ListPlot[lsCoeff, Filling -> Axis, PlotRange -> All]

Show the interpretation of the found representation:

H = SparseArray[lsaObj⟹LSAMonNormalizeMatrixProduct[Normalized -> Right]⟹LSAMonTakeH];
vecReprsentation = lsCoeff . H;
ImageAdjust@Image[Rescale[Partition[vecReprsentation, ImageDimensions[aMImages[[1]]][[1]]]]]

Recommendations

In this section we utilize the software monad SMRMon, [AAp3], to create a recommender for the random mandala images.

Remark: Instead of the Sparse Matrix Recommender (SMR) object the built-in function Nearest can be used.

Create SSparseMatrix object for all image-vectors:

matImages = ToSSparseMatrix[SparseArray[Values@aMImageVecs], "RowNames" -> Automatic, "ColumnNames" -> Automatic]

Normalize the rows of the image-vectors matrix:

AbsoluteTiming[
  matPixel = WeightTermsOfSSparseMatrix[matImages, "None", "None", "Cosine"] 
 ]

Get the LSA topics matrix:

matH = (lsaObj⟹LSAMonNormalizeMatrixProduct[Normalized -> Right]⟹LSAMonTakeH)

Find the image topics representation for each image-vector (assuming matH was computed with SVD or ICA):

AbsoluteTiming[
  matTopic = matPixel . Transpose[matH] 
 ]

Here we create a recommender based on the images data (pixels) and extracted image topics (or other image features):

smrObj = 
   SMRMonUnit[]⟹
    SMRMonCreate[<|"Pixel" -> matPixel, "Topic" -> matTopic|>]⟹
    SMRMonApplyNormalizationFunction["Cosine"]⟹
    SMRMonSetTagTypeWeights[<|"Pixel" -> 0.2, "Topic" -> 1|>];

Remark: Note the weights assigned to the pixels and the topics in the recommender object above. Those weights were derived by examining the recommendations results shown below.

Here is the image we want to find most similar mandala images to – the target image:

imgTarget = Binarize[imgStar2, 0.5]

Here is the profile of the target image:

aProf = MakeSMRProfile[lsaObj, imgTarget, ImageDimensions[aMImages[[1]]]];
TakeLargest[aProf, 6]

(*<|"10032-10009-4392" -> 0.298371, "3906-10506-10495" -> 0.240086, "10027-10014-4387" -> 0.156797, "8342-8339-6062" -> 0.133822, "3182-3179-11222" -> 0.131565, "8470-8451-5829" -> 0.128844|>*)

Using the target image profile here we compute the recommendation scores for all mandala images of the recommender:

aRecs = 
   smrObj⟹
    SMRMonRecommendByProfile[aProf, All]⟹
    SMRMonTakeValue;

Here is a plot of the similarity scores:

Row[{ResourceFunction["RecordsSummary"][Values[aRecs]], ListPlot[Values[aRecs], ImageSize -> Medium, PlotRange -> All, PlotTheme -> "Detailed", PlotLabel -> "Similarity scores"]}]

Here are the closest (nearest neighbor) mandala images:

Multicolumn[Values[ImageAdjust@*ColorNegate /@ aMImages[[ToExpression /@ Take[Keys[aRecs], 48]]]], 12, Background -> Black]

Here are the most distant mandala images:

Multicolumn[Values[ImageAdjust@*ColorNegate /@ aMImages[[ToExpression /@ Take[Keys[aRecs], -48]]]], 12, Background -> Black]

Classifier creation and utilization

In this section we:

Prepare classifier data
Build and examine a classifier using the software monad ClCon, [AA2, AAp2], using appropriate training, testing, and validation data ratios
Build a classifier utilizing all training data
Generate Bethlehem Star mandalas by filtering mandala candidates with the classifier

As it was mentioned above we prepare the data to build classifiers with by:

Selecting top, highest scores recommendations and labeling them with True
Selecting bad, low score recommendations and labeling them with False

AbsoluteTiming[
  Block[{
    lsBest = Values@aMandalas[[ToExpression /@ Take[Keys[aRecs], 120]]], 
    lsWorse = Values@aMandalas[[ToExpression /@ Join[Take[Keys[aRecs], -200], RandomSample[Take[Keys[aRecs], {3000, -200}], 200]]]]}, 
   lsTrainingData = 
     Join[
      Map[MandalaToWhiterImage[#, ImageDimensions@aMImages[[1]]] -> True &, lsBest], 
      Map[MandalaToWhiterImage[#, ImageDimensions@aMImages[[1]]] -> False &, lsWorse] 
     ]; 
  ] 
 ]

(*{27.9127, Null}*)

Using ClCon train a classifier and show its performance measures:

clObj = 
   ClConUnit[lsTrainingData]⟹
    ClConSplitData[0.75, 0.2]⟹
    ClConMakeClassifier["NearestNeighbors"]⟹
    ClConClassifierMeasurements⟹
    ClConEchoValue⟹
    ClConClassifierMeasurements["ConfusionMatrixPlot"]⟹
    ClConEchoValue;

Remark: We can re-run the ClCon workflow above several times until we obtain a classifier we want to use.

Train a classifier with all prepared data:

clObj2 = 
   ClConUnit[lsTrainingData]⟹
    ClConSplitData[1, 0.2]⟹
    ClConMakeClassifier["NearestNeighbors"];

Get the classifier function from ClCon object:

cfBStar = clObj2⟹ClConTakeClassifier

Here we generate Bethlehem Star mandalas using the classifier trained above:

SeedRandom[2020];
Multicolumn[MandalaToWhiterImage /@ BethlehemMandala[12, cfBStar, 0.87], 6, Background -> Black]

Generate Bethlehem Star mandala images utilizing the classifier (with a specified classifier probabilities threshold):

SeedRandom[32];
KeyMap[MandalaToWhiterImage, BethlehemMandala[12, cfBStar, 0.87, "Probabilities" -> True]]

Show unfiltered Bethlehem Star mandala candidates:

SeedRandom[32];
KeyMap[MandalaToWhiterImage, BethlehemMandala[12, cfBStar, 0, "Probabilities" -> True]]

Remark: Examine the probabilities in the image-probability associations above – they show that the classifier is “working.“

Here is another set generated Bethlehem Star mandalas using rotational symmetry order 4:

SeedRandom[777];
KeyMap[MandalaToWhiterImage, BethlehemMandala[12, cfBStar, 0.8, "RotationalSymmetryOrder" -> 4, "Probabilities" -> True]]

Remark: Note that although a higher rotational symmetry order is used the highly scored results still seem relevant – they have the features of the target Bethlehem Star images.

References

[AA1] Anton Antonov, “A monad for Latent Semantic Analysis workflows”, (2019), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “A monad for classification workflows”, (2018)), MathematicaForPrediction at WordPress.

[MSE1] “Plotting the Star of Bethlehem”, (2020),Mathematica Stack Exchange, question 236499,

[Wk1] Wikipedia entry, Star of Bethlehem.

Packages

[AAr1] Anton Antonov, RandomMandala, (2019), Wolfram Function Repository.

[AAr2] Anton Antonov, IdependentComponentAnalysis, (2019), Wolfram Function Repository.

[AAr3] Anton Antonov, “Simplified Machine Learning Workflows” book, (2019), GitHub/antononcube.

[AAp1] Anton Antonov, Monadic Latent Semantic Analysis Mathematica package, (2017), MathematicaForPrediction at GitHub/antononcube.

[AAp2] Anton Antonov, Monadic contextual classification Mathematica package, (2017), MathematicaForPrediction at GitHub/antononcube.

[AAp3] Anton Antonov, Monadic Sparse Matrix Recommender Mathematica package, (2018), MathematicaForPrediction at GitHub/antononcube.

Code definitions

urlPart = "https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/";
Get[urlPart <> "MonadicLatentSemanticAnalysis.m"];
Get[urlPart <> "MonadicSparseMatrixRecommender.m"];
Get[urlPart <> "/MonadicContextualClassification.m"];

Clear[MandalaToImage, MandalaToWhiterImage];
MandalaToImage[gr_Graphics, imgSize_ : {120, 120}] := ColorNegate@ImageResize[gr, imgSize];
MandalaToWhiterImage[gr_Graphics, imgSize_ : {120, 120}] := ColorNegate@ImageResize[gr /. GrayLevel[0.25`] -> Black, imgSize];

Clear[ImageToVector];
ImageToVector[img_Image] := Flatten[ImageData[ColorConvert[img, "Grayscale"]]];
ImageToVector[img_Image, imgSize_] := Flatten[ImageData[ColorConvert[ImageResize[img, imgSize], "Grayscale"]]];
ImageToVector[___] := $Failed;

Clear[MakeSMRProfile];
MakeSMRProfile[lsaObj_LSAMon, gr_Graphics, imgSize_] := MakeSMRProfile[lsaObj, {gr}, imgSize];
MakeSMRProfile[lsaObj_LSAMon, lsGrs : {_Graphics}, imgSize_] := MakeSMRProfile[lsaObj, MandalaToWhiterImage[#, imgSize] & /@ lsGrs, imgSize]
MakeSMRProfile[lsaObj_LSAMon, img_Image, imgSize_] := MakeSMRProfile[lsaObj, {img}, imgSize];
MakeSMRProfile[lsaObj_LSAMon, lsImgs : {_Image ..}, imgSize_] := 
   Block[{lsImgVecs, matTest, aProfPixel, aProfTopic}, 
    lsImgVecs = ImageToVector[#, imgSize] & /@ lsImgs; 
    matTest = ToSSparseMatrix[SparseArray[lsImgVecs], "RowNames" -> Automatic, "ColumnNames" -> Automatic]; 
    aProfPixel = ColumnSumsAssociation[lsaObj⟹LSAMonRepresentByTerms[matTest]⟹LSAMonTakeValue]; 
    aProfTopic = ColumnSumsAssociation[lsaObj⟹LSAMonRepresentByTopics[matTest]⟹LSAMonTakeValue]; 
    aProfPixel = Select[aProfPixel, # > 0 &]; 
    aProfTopic = Select[aProfTopic, # > 0 &]; 
    Join[aProfPixel, aProfTopic] 
   ];
MakeSMRProfile[___] := $Failed;

Clear[BethlehemMandalaCandiate];
BethlehemMandalaCandiate[opts : OptionsPattern[]] := ResourceFunction["RandomMandala"][opts, "RotationalSymmetryOrder" -> 2, "NumberOfSeedElements" -> Automatic, "ConnectingFunction" -> FilledCurve@*BezierCurve];

Clear[BethlehemMandala];
Options[BethlehemMandala] = Join[{ImageSize -> {120, 120}, "Probabilities" -> False}, Options[ResourceFunction["RandomMandala"]]];
BethlehemMandala[n_Integer, cf_ClassifierFunction, opts : OptionsPattern[]] := BethlehemMandala[n, cf, 0.87, opts];
BethlehemMandala[n_Integer, cf_ClassifierFunction, threshold_?NumericQ, opts : OptionsPattern[]] := 
   Block[{imgSize, probsQ, res, resNew, aResScores = <||>, aResScoresNew = <||>}, 
     
     imgSize = OptionValue[BethlehemMandala, ImageSize]; 
     probsQ = TrueQ[OptionValue[BethlehemMandala, "Probabilities"]]; 
     
     res = {}; 
     While[Length[res] < n, 
      resNew = Table[BethlehemMandalaCandiate[FilterRules[{opts}, Options[ResourceFunction["RandomMandala"]]]], 2*(n - Length[res])]; 
      aResScoresNew = Association[# -> cf[MandalaToImage[#, imgSize], "Probabilities"][True] & /@ resNew]; 
      aResScoresNew = Select[aResScoresNew, # >= threshold &]; 
      aResScores = Join[aResScores, aResScoresNew]; 
      res = Keys[aResScores] 
     ]; 
     
     aResScores = TakeLargest[ReverseSort[aResScores], UpTo[n]]; 
     If[probsQ, aResScores, Keys[aResScores]] 
    ] /; n > 0;
BethlehemMandala[___] := $Failed

Investigating COVID-19 with R: data analysis and simulations

Posted on May 29, 2020 by Anton Antonov Antonov

Methodological presentation
R-Ladies Miami Meetup, May 28th 2020

The extended abstract of the presentation was loosely followed. Here is the presentation mind-map:

(Note that mind-map’s PDF has hyperlinks. Also, see the folder Presentation-aids. )

The organizers and I did a poll for what people want to hear. After discussing the results of the 15 votes from that poll we decided the presentation to be a methodological one instead of a know-how one.

Approximately 30% of the presentation was based on the R-project “COVID-19-modeling-in-R”, [AA1].

Approximately 30% of the presentation was based on an R-programmed software monad for epidemiology compartmental models, ECMMon-R, [AAr2].

For the rest were used frameworks, simulations, and graphics made with Mathematica, [AAr1].

The presentation was given online (because of COVID-19) using Zoom. 90 people registered. Nearly 40 showed up (and maybe 20 stayed throughout.)

Here is a link to the video recording.

Screenshots

Here are screenshots of statistics used in the introduction:

References

Coronavirus

[Wk1] Wikipedia entry, Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

[Wk2] Wikipedia entry, Coronavirus disease 2019.

Modeling

[Wk3] Wikipedia entry, Compartmental models in epidemiology.

[Wk4] Wikipedia entry, System dynamics.

R code/software

[KS1] Karline Soetaert, Thomas Petzoldt, R. Woodrow Setzer, “deSolve: Solvers for Initial Value Problems of Differential Equations (‘ODE’, ‘DAE’, ‘DDE’)”, CRAN.

[AA1] Anton Antonov, “COVID-19-modeling-in-R”, 2020, SystemModeling at GitHub.

[AAr1] Anton Antonov, Coronavirus-propagation-dynamics, 2020, SystemModeling at GitHub.

[AAr2] Anton Antonov, Epidemiology Compartmental Modeling Monad in R, 2020, ECMMon-R at GitHub.

	数字为2026 年不卓越但很快乐 - 偏… on Numerically 2026 is unremarkab…
	Day 22 – Numer… on Primitive roots generation…
	Numerically 2026 is… on Primitive roots generation…
	Day 9 – Monadi… on Monad code generation and…
	Monad laws in Raku \|… on Monad code generation and…

… or “Making Robust LLM Computational Pipelines from Software Engineering Perspective”

Abstract

General structure of LLM-based workflows

References

Articles, blog posts

Notebooks

Packages

Raku

Python

R

Videos

Introduction

References

Data ingestion

Distributions

Pareto principle manifestation

Number of creations

Popularity

Correlations

Bubble chart 2D

Bubble chart 3D

Second system effect traces

Instead of conclusions

Introduction

Get Chinese character images

Vector representation of images

Data preparation

Clustering over the image vectors

LSAMon application

Using FeatureExtraction

Visual thesaurus

Graph clustering

Hierarchical clustering

Representing all characters with smaller set of basic ones

Approximation of “unseen” characters

Finding more interpretable bases

Cleaner automatic radicals

LSAMon pipeline with NNMF

Approximation

Setup

References

References

Articles

Mathematica packages and repository functions

Python packages

R packages

Introduction

Cryptocurrencies data

Significant cryptocurrencies

By summaries

By price

By trading volume

Related events

Cryptocurrency time series with events

Observations

Going down

Going up

Time series correlations

Getting time series data

Opening price time series

Trading volume time series

Observations

Nearest neighbors graphs

Graphs overview

Other types of analysis

Load packages

Definitions

References

Articles

Packages

Resource functions

Introduction

Yahoo! Finance

Getting cryptocurrencies symbols and summaries

Get all time series

Summary

Plots

data.bitcoinity.org

Metadata

Assignments

Vector representation of
images

Clustering over the image
vectors

Representing
all characters with smaller set of basic ones

Approximation of “unseen”
characters

Finding more interpretable
bases