# Halloween Rorschach animations

Last weekend I made and uploaded to YouTube a presentation discussing the making of Rorschach mask animations in both 2D and 3D:

Here are Mathematica notebooks discussing the process in detail:

Here is the link to the Imgur gallery of animations: “Attempts to recreate Rorschach’s mask”.

Here is the Halloween animation I made today:

Here is the black-&-white version:

Here is a collage of the “guiding images” for the animations above:

# Trie based classifiers evaluation

## Introduction

In this notebook we show how to evaluate Machine Learning (ML) classifiers based on Tries with frequencies, [AA1, AA2, AAp1], created over a well known ML dataset. The computations are done with packages and functions from the Wolfram Language (WL) ecosystem.

The classifiers based on Tries with frequencies can be seen as generalized Naive Bayesian Classifiers (NBCs).

We use the workflow summarized in this flowchart:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MarkdownDocuments/Diagrams/A-monad-for-classification-workflows/Classification-workflow-horizontal-layout.jpg"]

For more details on classification workflows see the article “A monad for classification workflows”, [AA3].

Remark: This notebook is the Mathematica counterpart of the Raku computable Markdown document with the same name [AA7, AA6].

Remark: Mathematica and WL are used as synonyms in this notebook.

## Data

In this section we obtain a dataset to make classification experiments with.

Through the Wolfram Function Repository (WFR) function ExampleDataset we can get data for the Titanic ship wreck:

dsTitanic0 = ResourceFunction["ExampleDataset"][{"MachineLearning", "Titanic"}];
dsTitanic0[[1 ;; 4]]

Remark: ExampleDataset uses ExampleData. Datasets from the ExampleData’s “MachineLearning” and “Statistics” collections are processed in order to obtain Dataset objects.

Instead of using the built-in Titanic data we use a version of it (which is used in Mathematica-vs-R comparisons, [AAr1]):

dsTitanic[[1 ;; 4]]

Here is a summary of the data:

ResourceFunction["RecordsSummary"][dsTitanic]

## Make tries

In this section for demonstration purposes let us create a shorter trie and display it in tree form.

Here we drop the identifier column and take the record fields in a particular order, in which “passengerSurvival” is the last field:

lsRecords = Normal@dsTitanic[All, Prepend[KeyDrop[#, "id"], "passengerAge" -> ToString[#passengerAge]] &][All, {"passengerClass", "passengerSex", "passengerAge", "passengerSurvival"}][Values];

Here is a sample:

RandomSample[lsRecords, 3]

(*{{"3rd", "female", "-1", "died"}, {"1st", "female", "50", "survived"}, {"3rd", "male", "30", "died"}}*)

Here make a trie without the field “passengerAge”:

trTitanic = TrieCreate[lsRecords[[All, {1, 2, 4}]]];
TrieForm[trTitanic, AspectRatio -> 1/4, ImageSize -> 900]

Here is a corresponding mosaic plot, [AA4, AAp3]:

MosaicPlot[lsRecords[[All, {1, 2, 4}]]]

Remark: The function MosaicPlot uses tries with frequencies in its implementation. Observing and reasoning with mosaic plots should make it clear that tries with frequencies are (some sort of) generalized Naive Bayesian classifiers.

## Trie classifier

In this section we create a Trie-based classifier.

In order to make certain reproducibility statements for the kind of experiments shown here, we use random seeding (with SeedRandom) before any computations that use pseudo-random numbers. Meaning, one would expect WL code that starts with a SeedRandom statement (e.g. SeedRandom[89]) to produce the same pseudo random numbers if it is executed multiple times (without changing it.)

SeedRandom[12];

Here we split the data into training and testing data in a stratified manner (a split for each label):

aSplit1 = GroupBy[lsRecords, #[[-1]] &, AssociationThread[{"training", "testing"}, TakeDrop[RandomSample[#], Round[0.75*Length[#]]]] &];
Map[Length, aSplit1, {2}]

(*<|"survived" -> <|"training" -> 375, "testing" -> 125|>, "died" -> <|"training" -> 607, "testing" -> 202|>|>*)

Here we aggregate training and testing data (and show the corresponding sizes):

aSplit2 = <|
"training" -> Join @@ Map[#training &, aSplit1],
"testing" -> Join @@ Map[#testing &, aSplit1]|>;
Length /@ aSplit2

(*<|"training" -> 982, "testing" -> 327|>*)

Here we make a trie with the training data (and show the node counts):

trTitanic = TrieNodeProbabilities[TrieCreate[aSplit2["training"]]];
TrieNodeCounts[trTitanic]

(*<|"total" -> 142, "internal" -> 60, "leaves" -> 82|>*)

Here is the trie in tree form:

Here is an example decision-classification:

TrieClassify[trTitanic, {"1st", "female"}]

(*"survived"*)

Here is an example probabilities-classification:

TrieClassify[trTitanic, {"1st", "female"}, "Probabilities"]

(*<|"survived" -> 0.962264, "died" -> 0.0377358|>*)

We want to classify across all testing data, but not all testing data-records might be present in the trie. Let us check that such testing records are few (or none):

Tally[Map[TrieKeyExistsQ[trTitanic, #] &, aSplit2["testing"]]]

(*{{True, 321}, {False, 6}}*)

Let us remove the records that cannot be classified:

lsTesting = Pick[aSplit2["testing"], Map[TrieKeyExistsQ[trTitanic, #] &, aSplit2["testing"]]];
Length[lsTesting]

(*321*)

Here we classify all testing records and show a sample of the obtained actual-predicted pairs:

lsClassRes = {Last[#], TrieClassify[trTitanic, Most[#]]} & /@ lsTesting;
RandomSample[lsClassRes, 6]

(*{{"survived", "survived"}, {"died", "survived"}, {"died", "died"}, {"survived", "survived"}, {"survived", "died"}, {"survived", "survived"}}*)

Here we cross tabulate the actual vs predicted labels using WFR’s CrossTabulate:

ResourceFunction["CrossTabulate"][lsClassRes]

The cross-tabulation results look bad because the default decision threshold is used. We get better results by selecting a decision threshold via Receiver Operating Characteristic (ROC) plots.

## Trie classification with ROC plots

In this section we systematically evaluate the Trie-based classifier using the Receiver Operating Characteristic (ROC) framework.

Here we classify all testing data records. For each record:

• Get probabilities association
• Add to that association the actual label
• Make sure the association has both survival labels
lsClassRes = Map[Join[<|"survived" -> 0, "died" -> 0, "actual" -> #[[-1]]|>, TrieClassify[trTitanic, #[[1 ;; -2]], "Probabilities"]] &, lsTesting];

Here we make a ROC record, [AA5, AAp4]:

ToROCAssociation[{"survived", "died"}, #actual & /@ lsClassRes, Map[If[#survived >= 0.5, "survived", "died"] &, lsClassRes]]

(*<|"TruePositive" -> 71, "FalsePositive" -> 14, "TrueNegative" -> 184, "FalseNegative" -> 52|>*)

Here we obtain the range of the label “survived”:

lsVals = Map[#survived &, lsClassRes];
MinMax[lsVals]

(*{0, 1.}*)

Here we make list of decision thresholds:

In the following code cell for each threshold:

• For each classification association decide on “survived” if the corresponding value is greater or equal to the threshold
• Make threshold’s ROC-association
lsROCs = Table[
ToROCAssociation[{"survived", "died"}, #actual & /@ lsClassRes, Map[If[#survived >= th, "survived", "died"] &, lsClassRes]],
{th, lsThresholds}];

Here is the obtained ROCs dataset:

Dataset[MapThread[Prepend[#1, "Threshold" -> #2] &, {lsROCs, lsThresholds}]]

Here is the corresponding ROC plot:

ROCPlot["FPR", "TPR", lsThresholds, lsROCs, GridLines -> Automatic, ImageSize -> Large]

We can see the Trie-based classifier has reasonable prediction abilities – we get ≈ 80% True Positive Rate (TPR) for a relatively small False Positive Rate (FPR), ≈ 20%.

## Confusion matrices

Using ClassifierMeasurements we can produce the corresponding confusion matrix plots (using “made on the spot” Manipulate interface):

DynamicModule[{lsThresholds = lsThresholds, lsClassRes = lsClassRes, lsClassRes2},
Manipulate[
lsClassRes2 = Map[{#actual, If[#survived >= lsThresholds[[i]], "survived", "died"]} &, lsClassRes];
Append[DeleteCases[ClassifierMeasurements[lsClassRes2[[All, 1]], lsClassRes2[[All, 2]], "ConfusionMatrixPlot"], ImageSize -> _], ImageSize -> Medium],
{{i, Flatten[Position[lsThresholds, Nearest[lsThresholds, 0.3][[1]]]][[1]], "index:"}, 1, Length[lsThresholds], 1, Appearance -> "Open"},
{{normalizeQ, False, "normalize?:"}, {False, True}}
]
]

## References

### Articles

[AA1] Anton Antonov, “Tries with frequencies for data mining”, (2013), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “Tries with frequencies in Java”, (2017), MathematicaForPrediction at WordPress.

[AA3] Anton Antonov, “A monad for classification workflows”, (2018), MathematicaForPrediction at WordPress.

[AA4] Anton Antonov, “Mosaic plots for data visualization”, (2014), MathematicaForPrediction at WordPress.

[AA5] Anton Antonov, “Basic example of using ROC with Linear regression”, (2016), MathematicaForPrediction at WordPress.

[AA6] Anton Antonov, “Trie based classifiers evaluation”, (2022), RakuForPrediction-book at GitHub/antononcube.

[AA7] Anton Antonov, “Trie based classifiers evaluation”, (2022), RakuForPrediction at WordPress.

### Packages

[AAp1] Anton Antonov, TriesWithFrequencies Mathematica package, (2014-2022), MathematicaForPrediction at GitHub/antononcube.

[AAp2] Anton Antonov, ROCFunctions Mathematica package, (2016), MathematicaForPrediction at GitHub/antononcube.

[AAp3] Anton Antonov, MosaicPlot Mathematica package, (2014), MathematicaForPrediction at GitHub/antononcube.

### Repositories

[AAr1] Anton Antonov, Mathematica vs. R project, (2018-2022), GitHub/antononcube.

## Setup

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/TriesWithFrequencies.m"];
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/ROCFunctions.m"];
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MosaicPlot.m"];
dsTitanic = Import["https://raw.githubusercontent.com/antononcube/MathematicaVsR/master/Data/MathematicaVsR-Data-Titanic.csv", "Dataset", HeaderLines -> 1];

# Nightcore “Schweine” video making

## Introduction

This notebook/document shows how to make Nightcore modifications to a song video. We use Glukoza’s song “Schweine”. The song “Schweine” became popular via the radio station Vladivostok FM of the game “Grand Theft Auto IV”.

Remark: We use Schweine since its licencing allows copies of it to be (publicly) played via YouTube.

The Nightcore transformation of the song was fairly straightforward with Mathematica / WL. The video transformation and combination are also fairly straightforward or easy.

Remark: Here is the final result uploaded to YouTube (https://www.youtube.com/watch?v=8UsR9L3KPIU):

## Get movies

Here is a link to the official video: https://www.youtube.com/watch?v=Ue5ZBe-GzSM .

Here is a link to a version with English subtitles: https://www.youtube.com/watch?v=9Es1nPWzJ-0 .

Download at least one of the videos. (Use a Firefox or Google Chrome plugin; use VLC; or utilize the paclet described in the post “Playing with YouTube from Mathematica”, [BMI1].)

vdSubSchweine = Import["~/Downloads/Glukoza Nostra - Schweine -subtitled-.mp4"]

## Make Nightcore audio

The process for making a song Nightcore is described in Wikipedia, [Wk1]. Basically, we just make the tempo 20-30% faster and raise the pitch with, ≈5.5 semitones.

Remark: An alternative of the process shown in this section is to use audio transformation programs like Audacity and AmadeusPro.

Here we get the audio from the video:

auSchweine = Audio[vdSubSchweine]

Here we change the tempo to be 20% faster:

AbsoluteTiming[
auSchweineFaster = AudioTimeStretch[auSchweine, 1/1.2]
]

Here we raise the pitch with 5.5 semitones:

AbsoluteTiming[
auSchweineNightcore = AudioPitchShift[auSchweineFaster, Quantity[5.5, "Semitones"]]
]

## Get lyrics

Although, we have a video with English subtitles, it would be interesting to experiment with adding subtitles to the video or “discovering” the subtitles in the video frames.

Instead of just copy-&-pasting the text I took screenshot of lyrics here:
https://lyrics-on.net/en/1023698-schweine-shvajjne-lyrics.html .

Here the image above is split into two halves and they displayed in a grid:

imgLyrics1 = ImageTake[imgLyrics, All, {1, ImageDimensions[imgLyrics][[2]]/2}];
imgLyrics2 = ImageTake[imgLyrics, All, {ImageDimensions[imgLyrics][[2]]/2, -1}];
GraphicsGrid[{{imgLyrics1, imgLyrics2}}, Dividers -> All, ImageSize -> 700]

Here we recognize the lyrics within each half:

Grid[{{TextRecognize[imgLyrics1, Language -> "Russian"], TextRecognize[imgLyrics2, Language -> "English"]}}, Dividers -> All]

Remark: Because we found a video with subtitles, we do not use further the extracted lyrics in this notebook.

## Direct video styling

If we only wanted to change how the video looks we can directly manipulate the video frames with VideoFrameMap, [WRI6] :

AbsoluteTiming[
k = 0;
vdSchweine4 = VideoFrameMap[Switch[Mod[k++, 500] < 250, True, EdgeDetect[#], False, ImageEffect[#, "EdgeStylization"]] &, vdSubSchweine];
]
vdSchweine4

Remark: Since we want to make both the audio and video shorter we have to use video frames.

## Make Nightcore video

Get the frames of the video:

AbsoluteTiming[
lsFrames = VideoExtractFrames[vdSubSchweine, All];
]

(*{11.5501, Null}*)

Show the number of frames:

Length[lsFrames]

(*9041*)

Change all the frames to have the “Sepia” image effect:

AbsoluteTiming[
lsFramesSepia = ParallelMap[ImageEffect[#, "Sepia"] &, lsFrames];
]

(*{124.898, Null}*)

Generate (audio-less) video from the list of frames that have the same length as the generated audio:

AbsoluteTiming[
vdSubSchweineNew = VideoGenerator[lsFramesSepia, Duration[auSchweineNightcore]];
]

(*{115.34, Null}*)

Combine the video and audio (into a new video):

AbsoluteTiming[
vdSubSchweineNightcore = VideoCombine[{vdSubSchweineNew, auSchweineNightcore}];
]

(*{0.576532, Null}*)
vdSubSchweineNightcore

Remark: Here we do not export the video, since Mathematica saves it in a “standard” location of the host operating system.

## References

[BMA1] b3m2ma1, “Playing with
, (2018), Wolfram Community. (GitHub
.)

[Wk1] Wikipedia entry, “Nightcore”.

[WRI1] Wolfram Research (2010), TextRecognize, Wolfram Language
function, https://reference.wolfram.com/language/ref/TextRecognize.html
(updated 2020).

[WRI2] Wolfram Research (2016), Audio, Wolfram Language function,
https://reference.wolfram.com/language/ref/Audio.html (updated
2020).

[WRI3] Wolfram Research (2016), AudioTimeStretch, Wolfram Language
function,
https://reference.wolfram.com/language/ref/AudioTimeStretch.html
(updated 2020).

[WRI4] Wolfram Research (2016), AudioPitchShift, Wolfram Language
function,
https://reference.wolfram.com/language/ref/AudioPitchShift.html (updated
2020).

[WRI5] Wolfram Research (2020), VideoExtractFrames, Wolfram Language
function,
https://reference.wolfram.com/language/ref/VideoExtractFrames.html.

[WRI6] Wolfram Research (2020), VideoFrameMap, Wolfram Language
function, https://reference.wolfram.com/language/ref/VideoFrameMap.html
(updated 2021).

[WRI7] Wolfram Research (2008), ImageEffect, Wolfram Language
function, https://reference.wolfram.com/language/ref/ImageEffect.html
(updated 13).

[WRI8] Wolfram Research (2020), VideoGenerator, Wolfram Language
function, https://reference.wolfram.com/language/ref/VideoGenerator.html
(updated 2021).

[WRI9] Wolfram Research (2020), VideoCombine, Wolfram Language
function,
https://reference.wolfram.com/language/ref/VideoCombine.html.

# Re-exploring the structure of Chinese character images

## Introduction

In this notebook we show information retrieval and clustering
techniques over images of Unicode collection of Chinese characters. Here
is the outline of notebook’s exposition:

1. Get Chinese character images.
2. Cluster “image vectors” and demonstrate that the obtained
clusters have certain explainability elements.
3. Apply Latent Semantic Analysis (LSA) workflow to the character
set.
4. Show visual thesaurus through a recommender system. (That uses
Cosine similarity.)
5. Discuss graph and hierarchical clustering using LSA matrix
factors.
6. Demonstrate approximation of “unseen” character images with an
image basis obtained through LSA over a small set of (simple)
images.
7. Redo character approximation with more “interpretable” image
basis.

Remark: This notebook started as an (extended)
comment for the Community discussion “Exploring
structure of Chinese characters through image processing”
, [SH1].
(Hence the title.)

## Get Chinese character images

This code is a copy of the code in the original
Community post by Silvia Hao
, [SH1]:

Module[{fsize = 50, width = 64, height = 64},
lsCharIDs = Map[FromCharacterCode[#, "Unicode"] &, 16^^4E00 - 1 + Range[width height]];
]
charPage = Module[{fsize = 50, width = 64, height = 64},
16^^4E00 - 1 + Range[width height] // pipe[
FromCharacterCode[#, "Unicode"] &
, Characters, Partition[#, width] &
, Grid[#, Background -> Black, Spacings -> {0, 0}, ItemSize -> {1.5, 1.2}, Alignment -> {Center, Center}, Frame -> All, FrameStyle -> Directive[Red, AbsoluteThickness[3 \[Lambda]]]] &
, Style[#, White, fsize, FontFamily -> "Source Han Sans CN", FontWeight -> "ExtraLight"] &
, Rasterize[#, Background -> Black] &
]
];
chargrid = charPage // ColorDistance[#, Red] & // Image[#, "Byte"] & // Sign //Erosion[#, 5] &;
lmat = chargrid // MorphologicalComponents[#, Method -> "BoundingBox", CornerNeighbors -> False] &;
chars = ComponentMeasurements[{charPage // ColorConvert[#, "Grayscale"] &, lmat}, "MaskedImage", #Width > 10 &] // Values // Map@RemoveAlphaChannel;
chars = Module[{size = chars // Map@ImageDimensions // Max}, ImageCrop[#, {size, size}] & /@ chars];

Here is a sample of the obtained images:

SeedRandom[33];
RandomSample[chars, 5]

## Vector representation of images

Define a function that represents an image into a linear vector space
(of pixels):

Clear[ImageToVector];
ImageToVector[img_Image] := Flatten[ImageData[ColorConvert[img, "Grayscale"]]];
ImageToVector[img_Image, imgSize_] := Flatten[ImageData[ColorConvert[ImageResize[img, imgSize], "Grayscale"]]];

## References

### Articles

[AA1] Anton Antonov, “Crypto-currencies data acquisition with visualization”, (2021), MathematicaForPrediction at WordPress.

[AA2] Anton Antonov, “NY Times COVID-19 data visualization”, (2020), SystemModeling at GitHub.

[AA3] Anton Antonov, “Apple mobility trends data visualization”, (2020), SystemModeling at GitHub.

### Packages

[AAp1] Anton Antonov, Data reshaping Mathematica package, (2018), MathematicaForPrediciton at GitHub.

[AAp2] Anton Antonov, Heatmap plot Mathematica package, (2018), MathematicaForPrediciton at GitHub.

### Resource functions

[AAw1] Anton Antonov, CryptocurrencyData, (2021).

[AAw2] Anton Antonov, RecordsSummary, (2019).

[AAw3] Anton Antonov, ParetoPrinciplePlot, (2019).

[AAw4] Anton Antonov, CrossTabulate, (2019).

[AAw5] Anton Antonov, IndependentComponentAnalysis, (2019).

[AAw6] Anton Antonov, QuantileRegression, (2019).

# Crypto-currencies data acquisition with visualization

## Introduction

In this notebook we show how to obtain crypto-currencies data from several data sources and make some basic time series visualizations. We assume the described data acquisition workflow is useful for doing more detailed (exploratory) analysis.

There are multiple crypto-currencies data sources, but a small proportion of them give a convenient way of extracting crypto-currencies data automatically. I found the easiest to work with to be https://finance.yahoo.com/cryptocurrencies, [YF1]. Another easy to work with Bitcoin-only data source is https://data.bitcoinity.org , [DBO1].

(I also looked into using https://www.coindesk.com/coindesk20. )

Remark: The code below is made with certain ad-hoc inductive reasoning that brought meaningful results. This means the code has to be changed if the underlying data organization in [YF1, DBO1] is changed.

## Yahoo! Finance

### Getting cryptocurrencies symbols and summaries

In this section we get all crypto-currencies symbols and related metadata.

Get the data of all crypto-currencies in [YF1]:

AbsoluteTiming[
lsData = Import["https://finance.yahoo.com/cryptocurrencies", "Data"];
]

(*{6.18067, Null}*)

Locate the data:

pos = First@Position[lsData, {"Symbol", "Name", "Price (Intraday)", "Change", "% Change", ___}];
dsCryptoCurrenciesColumnNames = lsData[[Sequence @@ pos]]
Length[dsCryptoCurrenciesColumnNames]

(*{"Symbol", "Name", "Price (Intraday)", "Change", "% Change", "Market Cap", "Volume in Currency (Since 0:00 UTC)", "Volume in Currency (24Hr)", "Total Volume All Currencies (24Hr)", "Circulating Supply", "52 Week Range", "1 Day Chart"}*)

(*12*)

Get the data:

dsCryptoCurrencies = lsData[[Sequence @@ Append[Most[pos], 2]]];
Dimensions[dsCryptoCurrencies]

(*{25, 10}*)

Make a dataset:

dsCryptoCurrencies = Dataset[dsCryptoCurrencies][All, AssociationThread[dsCryptoCurrenciesColumnNames[[1 ;; -3]], #] &]

### Get all time series

In this section we get all the crypto-currencies time series from [YF1].

AbsoluteTiming[
ccNow = Round@AbsoluteTime[Date[]] - AbsoluteTime[{1970, 1, 1, 0, 0, 0}];
aCryptoCurrenciesDataRaw =
Association@
Map[
# -> ResourceFunction["ImportCSVToDataset"]["https://query1.finance.yahoo.com/v7/finance/download/" <> # <>"?period1=1410825600&period2=" <> ToString[ccNow] <> "&interval=1d&events=history&includeAdjustedClose=true"] &, Normal[dsCryptoCurrencies[All, "Symbol"]]
];
]

(*{5.98745, Null}*)

Remark: Note that in the code above we specified the upper limit of the time span to be the current date. (And shifted it with respect to the epoch start 1970-01-01 used by [YF1].)

Check we good the data with dimensions retrieval:

Dimensions /@ aCryptoCurrenciesDataRaw

(*<|"BTC-USD" -> {2468, 7}, "ETH-USD" -> {2144, 7}, "USDT-USD" -> {2307, 7}, "BNB-USD" -> {1426, 7}, "ADA-USD" -> {1358, 7}, "DOGE-USD" -> {2468, 7}, "XRP-USD" -> {2468, 7}, "USDC-USD" -> {986, 7}, "DOT1-USD" -> {304, 7}, "HEX-USD" -> {551, 7}, "UNI3-USD" -> {81, 7},"BCH-USD" -> {1428, 7}, "LTC-USD" -> {2468, 7}, "SOL1-USD" -> {436, 7}, "LINK-USD" -> {1369, 7}, "THETA-USD" -> {1250, 7}, "MATIC-USD" -> {784, 7}, "XLM-USD" -> {2468, 7}, "ICP1-USD" -> {32, 7}, "VET-USD" -> {1052, 7}, "ETC-USD" -> {1792, 7}, "FIL-USD" -> {1285, 7}, "TRX-USD" -> {1376, 7}, "XMR-USD" -> {2468, 7}, "EOS-USD" -> {1450, 7}|>*)

Check we good the data with random sample:

RandomSample[#, 6] & /@ KeyTake[aCryptoCurrenciesDataRaw, RandomChoice[Keys@aCryptoCurrenciesDataRaw]]

Here we add the crypto-currencies symbols and convert date strings into date objects.

AbsoluteTiming[
aCryptoCurrenciesData = Association@KeyValueMap[Function[{k, v}, k -> v[All, Join[<|"Symbol" -> k, "DateObject" -> DateObject[#Date]|>, #] &]], aCryptoCurrenciesDataRaw];
]

(*{8.27865, Null}*)

### Summary

In this section we compute the summary over all datasets:

ResourceFunction["RecordsSummary"][Join @@ Values[aCryptoCurrenciesData], "MaxTallies" -> 30]

### Plots

Here we plot the “Low” and “High” price time series for each crypto-currency for the last 120 days:

nDays = 120;
Map[
Block[{dsTemp = #[Select[AbsoluteTime[#DateObject] > AbsoluteTime[DatePlus[Now, -Quantity[nDays, "Days"]]] &]]},
DateListPlot[{
Normal[dsTemp[All, {"DateObject", "Low"}][Values]],
Normal[dsTemp[All, {"DateObject", "High"}][Values]]},
PlotLegends -> {"Low", "High"},
AspectRatio -> 1/4,
PlotRange -> All]
] &,
aCryptoCurrenciesData
]

Here we plot the volume time series for each crypto-currency for the last 120 days:

nDays = 120;
Map[
Block[{dsTemp = #[Select[AbsoluteTime[#DateObject] > AbsoluteTime[DatePlus[Now, -Quantity[nDays, "Days"]]] &]]},
DateListPlot[{
Normal[dsTemp[All, {"DateObject", "Volume"}][Values]]},
PlotLabel -> "Volume",
AspectRatio -> 1/4,
PlotRange -> All]
] &,
aCryptoCurrenciesData
]

## data.bitcoinity.org

In this section we ingest crypto-currency data from data.bitcoinity.org, [DBO1].

In this sub-section we assign different metadata elements used in data.bitcoinity.org.

The currencies and exchanges we obtained by examining the output of:

Import["https://data.bitcoinity.org/markets/price/30d/USD?t=l", "Plaintext"]

#### Assignments

lsCurrencies = {"all", "AED", "ARS", "AUD", "BRL", "CAD", "CHF", "CLP", "CNY", "COP", "CZK", "DKK", "EUR", "GBP", "HKD", "HRK", "HUF", "IDR", "ILS", "INR", "IRR", "JPY", "KES", "KRW", "MXN", "MYR", "NOK", "NZD", "PHP", "PKR", "PLN", "RON", "RUB", "RUR", "SAR", "SEK", "SGD", "THB", "TRY", "UAH", "USD", "VEF", "XAU", "ZAR"};
lsExchanges = {"all", "bit-x", "bit2c", "bitbay", "bitcoin.co.id", "bitcoincentral", "bitcoinde", "bitcoinsnorway", "bitcurex", "bitfinex", "bitflyer", "bithumb", "bitmarketpl", "bitmex", "bitquick", "bitso", "bitstamp", "btcchina", "btce", "btcmarkets", "campbx", "cex.io", "clevercoin", "coinbase", "coinfloor", "exmo", "gemini", "hitbtc", "huobi", "itbit", "korbit", "kraken", "lakebtc", "localbitcoins", "mercadobitcoin", "okcoin", "paymium", "quadrigacx", "therocktrading", "vaultoro", "wallofcoins"};
lsTimeSpans = {"10m", "1h", "6h", "24h", "3d", "30d", "6m", "2y", "5y", "all"};
lsTimeUnit = {"second", "minute", "hour", "day", "week", "month"};
aDataTypeDescriptions = Association@{"price" -> "Prince", "volume" -> "Trading Volume", "rank" -> "Rank", "bidask_sum" -> "Bid/Ask Sum", "spread" -> "Bid/Ask Spread", "tradespm" -> "Trades Per Minute"};
lsDataTypes = Keys[aDataTypeDescriptions];

### Getting BTC data

Here we make a template string that for CSV data retrieval from data.bitcoinity.org:

stDBOURL = StringTemplate["https://data.bitcoinity.org/export_data.csv?currency=currency&data_type=dataType&exchange=exchange&r=timeUnit&t=l&timespan=timeSpan"]

(*TemplateObject[{"https://data.bitcoinity.org/export_data.csv?currency=", TemplateSlot["currency"], "&data_type=", TemplateSlot["dataType"], "&exchange=", TemplateSlot["exchange"], "&r=", TemplateSlot["timeUnit"], "&t=l&timespan=", TemplateSlot["timeSpan"]}, CombinerFunction -> StringJoin, InsertionFunction -> TextString]*)

Here is an association with default values for the string template above:

aDBODefaultParameters = <|"currency" -> "USD", "dataType" -> "price", "exchange" -> "all", "timeUnit" -> "day", "timeSpan" -> "all"|>;

Remark: The metadata assigned above is used to form valid queries for the query string template.

Remark: Not all combinations of parameters are “fully respected” by data.bitcoinity.org. For example, if a data request is with time granularity that is too fine over a large time span, then the returned data is with coarser granularity.

#### Price for a particular currency and exchange pair

Here we retrieve data by overwriting the parameters for currency, time unit, time span, and exchange:

dsBTCPriceData =
ResourceFunction["ImportCSVToDataset"][stDBOURL[Join[aDBODefaultParameters, <|"currency" -> "EUR", "timeUnit" -> "hour", "timeSpan" -> "7d", "exchange" -> "coinbase"|>]]]

Here is a summary:

ResourceFunction["RecordsSummary"][dsBTCPriceData]

#### Volume data

Here we retrieve data by overwriting the parameters for data type, time unit, time span, and exchange:

dsBTCVolumeData =
ResourceFunction["ImportCSVToDataset"][stDBOURL[Join[aDBODefaultParameters, <|"dataType" -> Volume, "timeUnit" -> "day", "timeSpan" -> "30d", "exchange" -> "all"|>]]]

Here is a summary:

ResourceFunction["RecordsSummary"][dsBTCVolumeData]

### Plots

#### Price data

Here we extract the non-time columns in the tabular price data obtained above and plot the corresponding time series:

DateListPlot[Association[# -> Normal[dsBTCPriceData[All, {"Time", #}][Values]] & /@Rest[Normal[Keys[dsBTCPriceData[[1]]]]]], AspectRatio -> 1/4, ImageSize -> Large]

#### Volume data

Here we extract the non-time columns (corresponding to exchanges) in the tabular volume data obtained above and plot the corresponding time series:

DateListPlot[Association[# -> Normal[dsBTCVolumeData[All, {"Time", #}][Values]] & /@ Rest[Normal[Keys[dsBTCVolumeData[[1]]]]]], PlotRange -> All, AspectRatio -> 1/4, ImageSize -> Large]

## References

[DBO1] https://data.bitcoinity.org.

[WK1] Wikipedia entry, Cryptocurrency.

[YF1] Yahoo! Finance, Cryptocurrencies.

# Time series search engines over COVID-19 data

## Introduction

In this article we proclaim the preparation and availability of interactive interfaces to two Time Series Search Engines (TSSEs) over COVID-19 data. One TSSE is based on Apple Mobility Trends data, [APPL1]; the other on The New York Times COVID-19 data, [NYT1].

Here are links to interactive interfaces of the TSSEs hosted (and publicly available) at shinyapps.io by RStudio:

Motivation: The primary motivation for making the TSSEs and their interactive interfaces is to use them as exploratory tools. Combined with relevant data analysis (e.g. [AA1, AA2]) the TSSEs should help to form better intuition and feel of the spread of COVID-19 and related data aggregation, public reactions, and government polices.

The rest of the article is structured as follows:

1. Brief descriptions the overall process, the data
2. Brief descriptions the search engines structure and implementation
3. Discussions of a few search examples and their (possible) interpretations

## The overall process

For both search engines the overall process has the same steps:

1. Ingest the data
2. Do basic (and advanced) data analysis
3. Make (and publish) reports detailing the data ingestion and transformation steps
4. Enhance the data with transformed versions of it or with additional related data
5. Make a Time Series Sparse Matrix Recommender (TSSMR)
6. Make a Time Series Search Engine Interactive Interface (TSSEII)
7. Make the interactive interface easily accessible over the World Wide Web

Here is a flow chart that corresponds to the steps listed above:

## Data

### The Apple data

The Apple Mobility Trends data is taken from Apple’s site, see [APPL1]. The data ingestion, basic data analysis, time series seasonality demonstration, (graph) clusterings are given in [AA1]. (Here is a link to the corresponding R-notebook .)

The weather data was taken using the Mathematica function WeatherData, [WRI1].

(It was too much work to get the weather data using some of the well known weather data R packages.)

### The New York Times data

The New York Times COVID-19 data is taken from GitHub, see [NYT1]. The data ingestion, basic data analysis, and visualizations are given in [AA2]. (Here is a link to the corresponding R-notebook .)

## The search engines

The following sub-sections have screenshots of the TSSE interactive interfaces.

I did experiment with combining the data of the two engines, but did not turn out to be particularly useful. It seems that is more interesting and useful to enhance the Apple data engine with temperature data, and to enhance The New Your Times engine with the (consecutive) differences of the time series.

### Structure

The interactive interfaces have three panels:

• Nearest Neighbors
• Gives the time series nearest neighbors for the time series of selected entity.
• Has interactive controls for entity selection and filtering.
• Trend Finding
• Gives the time series that adhere to a specified named trend.
• Has interactive controls for trend curves selection and entity filtering.
• Notes
• Gives references and data objects summary.

### Implementation

Both TSSEs are implemented using the R packages “SparseMatrixRecommender”, [AAp1], and “SparseMatrixRecommenderInterfaces”, [AAp2].

The package “SparseMatrixRecommender” provides functions to create and use Sparse Matrix Recommender (SMR) objects. Both TSSEs use underlying SMR objects.

The package “SparseMatrixRecommenderInterfaces” provides functions to generate the server and client functions for the Shiny framework by RStudio.

As it was mentioned above, both TSSEs are published at shinyapps.io. The corresponding source codes can be found in [AAr1].

The Apple data TSSE has four types of time series (“entities”). The first three are normalized volumes of Apple maps requests while driving, transit transport use, and walking. (See [AA1] for more details.) The fourth is daily mean temperature at different geo-locations.

Here are screenshots of the panels “Nearest Neighbors” and “Trend Finding” (at interface launch):

### The New York Times COVID-19 Data Search Engine

The New York Times TSSE has four types of time series (aggregated) cases and deaths, and their corresponding time series differences.

Here are screenshots of the panels “Nearest Neighbors” and “Trend Finding” (at interface launch):

## Examples

In this section we discuss in some detail several examples of using each of the TSSEs.

### Apple data search engine examples

Here are a few observations from [AA1]:

• The COVID-19 lockdowns are clearly reflected in the time series.
• The time series from the Apple Mobility Trends data shows strong weekly seasonality. Roughly speaking, people go to places they are not familiar with on Fridays and Saturdays. Other work week days people are more familiar with their trips. Since much lesser number of requests are made on Sundays, we can conjecture that many people stay at home or visit very familiar locations.

Here are a few assumptions:

• Where people frequently go (work, school, groceries shopping, etc.) they do not need directions that much.
• People request directions when they have more free time and will for “leisure trips.”
• During vacations people are more likely to be in places they are less familiar with.
• People are more likely to take leisure trips when the weather is good. (Warm, not raining, etc.)

#### Nice, France vs Florida, USA

Consider the results of the Nearest Neighbors panel for Nice, France.

Since French tend to go on vacation in July and August ([SS1, INSEE1]) we can see that driving, transit, and walking in Nice have pronounced peaks during that time:

Of course, we also observe the lockdown period in that geographical area.

Compare those time series with the time series from driving in Florida, USA:

We can see that people in Florida, USA have driving patterns unrelated to the typical weather seasons and vacation periods.

(Further TSSE queries show that there is a negative correlation with the temperature in south Florida and the volumes of Apple Maps directions requests.)

#### Italy and Balkan countries driving

We can see that according to the data people who have access to both iPhones and cars in Italy and the Balkan countries Bulgaria, Greece, and Romania have similar directions requests patterns:

(The similarities can be explained with at least a few “obvious” facts, but we are going to restrain ourselves.)

### The New York Times data search engine examples

In Broward county, Florida, USA and Cook county, Illinois, USA we can see two waves of infections in the difference time series:

## References

### Data

[APPL1] Apple Inc., Mobility Trends Reports, (2020), apple.com.

[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.

[WRI1] Wolfram Research (2008), WeatherData, Wolfram Language function.

### Articles

[AA1] Anton Antonov, “Apple mobility trends data visualization (for COVID-19)”, (2020), SystemModeling at GitHub/antononcube.

[AA2] Anton Antonov, “NY Times COVID-19 data visualization”, (2020), SystemModeling at GitHub/antononcube.

[INSEE1] Institut national de la statistique et des études économiques, “En 2010, les salariés ont pris en moyenne six semaines de congé”, (2012).

[SS1] Sam Schechner and Lee Harris, “What Happens When All of France Takes Vacation? 438 Miles of Traffic”, (2019), The Wall Street Journal

### Packages, repositories

[AAp1] Anton Antonov, Sparse Matrix Recommender framework functions, (2019), R-packages at GitHub/antononcube.

[AAp2] Anton Antonov, Sparse Matrix Recommender framework interface functions, (2019), R-packages at GitHub/antononcube.

[AAr1] Anton Antonov, Coronavirus propagation dynamics, (2020), SystemModeling at GitHub/antononcube.

# NY Times COVID-19 data visualization (Update)

## Introduction

This post is both an update and a full-blown version of an older post — “NY Times COVID-19 data visualization” — using NY Times COVID-19 data up to 2021-01-13.

The purpose of this document/notebook is to give data locations, data ingestion code, and code for rudimentary analysis and visualization of COVID-19 data provided by New York Times, [NYT1].

The following steps are taken:

• Ingest data
• Take COVID-19 data from The New York Times, based on reports from state and local health agencies, [NYT1].
• Take USA counties records data (FIPS codes, geo-coordinates, populations), [WRI1].
• Merge the data.
• Make data summaries and related plots.
• Make corresponding geo-plots.
• Do “out of the box” time series forecast.
• Analyze fluctuations around time series trends.

Note that other, older repositories with COVID-19 data exist, like, [JH1, VK1].

Remark: The time series section is done for illustration purposes only. The forecasts there should not be taken seriously.

## Import data

### NYTimes USA states data

dsNYDataStates = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"];
dsNYDataStates = dsNYDataStates[All, AssociationThread[Capitalize /@ Keys[#], Values[#]] &];
dsNYDataStates[[1 ;; 6]]
ResourceFunction["RecordsSummary"][dsNYDataStates]

### NYTimes USA counties data

dsNYDataCounties = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"];
dsNYDataCounties = dsNYDataCounties[All, AssociationThread[Capitalize /@ Keys[#], Values[#]] &];
dsNYDataCounties[[1 ;; 6]]
ResourceFunction["RecordsSummary"][dsNYDataCounties]

### US county records

dsUSACountyData = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/antononcube/SystemModeling/master/Data/dfUSACountyRecords.csv"];
dsUSACountyData = dsUSACountyData[All, Join[#, <|"FIPS" -> ToExpression[#FIPS]|>] &];
dsUSACountyData[[1 ;; 6]]
ResourceFunction["RecordsSummary"][dsUSACountyData]

## Merge data

Verify that the two datasets have common FIPS codes:

Length[Intersection[Normal[dsUSACountyData[All, "FIPS"]], Normal[dsNYDataCounties[All, "Fips"]]]]

(*3133*)

Merge the datasets:

dsNYDataCountiesExtended = Dataset[JoinAcross[Normal[dsNYDataCounties], Normal[dsUSACountyData[All, {"FIPS", "Lat", "Lon", "Population"}]], Key["Fips"] -> Key["FIPS"]]];

Add a “DateObject” column and (reverse) sort by date:

dsNYDataCountiesExtended = dsNYDataCountiesExtended[All, Join[<|"DateObject" -> DateObject[#Date]|>, #] &];
dsNYDataCountiesExtended = dsNYDataCountiesExtended[ReverseSortBy[#DateObject &]];
dsNYDataCountiesExtended[[1 ;; 6]]

## Basic data analysis

We consider cases and deaths for the last date only. (The queries can be easily adjusted for other dates.)

dfQuery = dsNYDataCountiesExtended[Select[#Date == dsNYDataCountiesExtended[1, "Date"] &], {"FIPS", "Cases", "Deaths"}];
dfQuery = dfQuery[All, Prepend[#, "FIPS" -> ToString[#FIPS]] &];
Total[dfQuery[All, {"Cases", "Deaths"}]]

(*<|"Cases" -> 22387340, "Deaths" -> 355736|>*)

Here is the summary of the values of cases and deaths across the different USA counties:

ResourceFunction["RecordsSummary"][dfQuery]

The following table of plots shows the distributions of cases and deaths and the corresponding Pareto principle adherence plots:

opts = {PlotRange -> All, ImageSize -> Medium};
Rasterize[Grid[
Function[{columnName},
{Histogram[Log10[#], PlotLabel -> Row[{Log10, Spacer[3], columnName}], opts], ResourceFunction["ParetoPrinciplePlot"][#, PlotLabel -> columnName, opts]} &@Normal[dfQuery[All, columnName]]
] /@ {"Cases", "Deaths"},
Dividers -> All, FrameStyle -> GrayLevel[0.7]]]

A couple of observations:

• The logarithms of the cases and deaths have nearly Normal or Logistic distributions.
• Typical manifestation of the Pareto principle: 80% of the cases and deaths are registered in 20% of the counties.

Remark: The top 20% counties of the cases are not necessarily the same as the top 20% counties of the deaths.

### Distributions

Here we find the distributions that correspond to the cases and deaths (using FindDistribution ):

ResourceFunction["GridTableForm"][List @@@ Map[Function[{columnName},
columnName -> FindDistribution[N@Log10[Select[#, # > 0 &]]] &@Normal[dfQuery[All, columnName]]
], {"Cases", "Deaths"}], TableHeadings -> {"Data", "Distribution"}]

### Pareto principle locations

The following query finds the intersection between that for the top 600 Pareto principle locations for the cases and deaths:

Length[Intersection @@ Map[Function[{columnName}, Keys[TakeLargest[Normal@dfQuery[Association, #FIPS -> #[columnName] &], 600]]], {"Cases", "Deaths"}]]

(*516*)

## Geo-histogram

lsAllDates = Union[Normal[dsNYDataCountiesExtended[All, "Date"]]];
lsAllDates // Length

(*359*)
Manipulate[
DynamicModule[{ds = dsNYDataCountiesExtended[Select[#Date == datePick &]]},
GeoHistogram[
Normal[ds[All, {"Lat", "Lon"}][All, Values]] -> N[Normal[ds[All, columnName]]],
Quantity[150, "Miles"], PlotLabel -> columnName, PlotLegends -> Automatic, ImageSize -> Large, GeoProjection -> "Equirectangular"]
],
{{columnName, "Cases", "Data type:"}, {"Cases", "Deaths"}},
{{datePick, Last[lsAllDates], "Date:"}, lsAllDates}]

## Heat-map plots

An alternative of the geo-visualization is to use a heat-map plot. Here we use the package “HeatmapPlot.m”, [AAp1].

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/HeatmapPlot.m"]

### Cases

Cross-tabulate states with dates over cases:

matSDC = ResourceFunction["CrossTabulate"][dsNYDataStates[All, {"State", "Date", "Cases"}], "Sparse" -> True];

Make a heat-map plot by sorting the columns of the cross-tabulation matrix (that correspond to states):

HeatmapPlot[matSDC, DistanceFunction -> {EuclideanDistance, None}, AspectRatio -> 1/2, ImageSize -> 1000]

### Deaths

Cross-tabulate states with dates over deaths and plot:

matSDD = ResourceFunction["CrossTabulate"][dsNYDataStates[All, {"State", "Date", "Deaths"}], "Sparse" -> True];
HeatmapPlot[matSDD, DistanceFunction -> {EuclideanDistance, None}, AspectRatio -> 1/2, ImageSize -> 1000]

## Time series analysis

### Cases

#### Time series

For each date sum all cases over the states, make a time series, and plot it:

tsCases = TimeSeries@(List @@@ Normal[GroupBy[Normal[dsNYDataCountiesExtended], #DateObject &, Total[#Cases & /@ #] &]]);
opts = {PlotTheme -> "Detailed", PlotRange -> All, AspectRatio -> 1/4,ImageSize -> Large};
DateListPlot[tsCases, PlotLabel -> "Cases", opts]
ResourceFunction["RecordsSummary"][tsCases["Path"]]

Logarithmic plot:

DateListPlot[Log10[tsCases], PlotLabel -> Row[{Log10, Spacer[3], "Cases"}], opts]

#### “Forecast”

Fit a time series model to log 10 of the time series:

tsm = TimeSeriesModelFit[Log10[tsCases]]

Plot log 10 data and forecast:

DateListPlot[{tsm["TemporalData"], TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]

Plot data and forecast:

DateListPlot[{tsCases, 10^TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]

### Deaths

#### Time series

For each date sum all cases over the states, make a time series, and plot it:

tsDeaths = TimeSeries@(List @@@ Normal[GroupBy[Normal[dsNYDataCountiesExtended], #DateObject &, Total[#Deaths & /@ #] &]]);
opts = {PlotTheme -> "Detailed", PlotRange -> All, AspectRatio -> 1/4,ImageSize -> Large};
DateListPlot[tsDeaths, PlotLabel -> "Deaths", opts]
ResourceFunction["RecordsSummary"][tsDeaths["Path"]]

#### “Forecast”

Fit a time series model:

tsm = TimeSeriesModelFit[tsDeaths, "ARMA"]

Plot data and forecast:

DateListPlot[{tsm["TemporalData"], TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]

## Fluctuations

We want to see does the time series data have fluctuations around its trends and estimate the distributions of those fluctuations. (Knowing those distributions some further studies can be done.)

This can be efficiently using the software monad QRMon, [AAp2, AA1]. Here we load the QRMon package:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicQuantileRegression.m"]

### Fluctuations presence

Here we plot the consecutive differences of the cases:

DateListPlot[Differences[tsCases], ImageSize -> Large, AspectRatio -> 1/4, PlotRange -> All]

Here we plot the consecutive differences of the deaths:

DateListPlot[Differences[tsDeaths], ImageSize -> Large, AspectRatio -> 1/4, PlotRange -> All]

From the plots we see that time series are not monotonically increasing, and there are non-trivial fluctuations in the data.

### Absolute and relative errors distributions

Here we take interesting part of the cases data:

tsData = TimeSeriesWindow[tsCases, {{2020, 5, 1}, {2020, 12, 31}}];

Here we specify QRMon workflow that rescales the data, fits a B-spline curve to get the trend, and finds the absolute and relative errors (residuals, fluctuations) around that trend:

qrObj =
QRMonUnit[tsData]⟹
QRMonEchoDataSummary⟹
QRMonRescale[Axes -> {False, True}]⟹
QRMonEchoDataSummary⟹
QRMonQuantileRegression[16, 0.5]⟹
QRMonSetRegressionFunctionsPlotOptions[{PlotStyle -> Red}]⟹
QRMonDateListPlot[AspectRatio -> 1/4, ImageSize -> Large]⟹
QRMonErrorPlots["RelativeErrors" -> False, AspectRatio -> 1/4, ImageSize -> Large, DateListPlot -> True]⟹
QRMonErrorPlots["RelativeErrors" -> True, AspectRatio -> 1/4, ImageSize -> Large, DateListPlot -> True];

Here we find the distribution of the absolute errors (fluctuations) using FindDistribution:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> False]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[All, 2]]]

(*CauchyDistribution[6.0799*10^-6, 0.000331709]*)

Absolute errors distributions for the last 90 days:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> False]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[-90 ;; -1, 2]]]

(*ExtremeValueDistribution[-0.000996315, 0.00207593]*)

Here we find the distribution of the of the relative errors:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> True]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[All, 2]]]

(*StudentTDistribution[0.0000511326, 0.00244023, 1.59364]*)

Relative errors distributions for the last 90 days:

lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> True]⟹QRMonTakeValue)[0.5];
FindDistribution[lsNoise[[-90 ;; -1, 2]]]

(*NormalDistribution[9.66949*10^-6, 0.00394395]*)

## References

[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.

[WRI1] Wolfram Research Inc., USA county records, (2020), System Modeling at GitHub.

[JH1] CSSE at Johns Hopkins University, COVID-19, (2020), GitHub.

[VK1] Vitaliy Kaurov, Resources For Novel Coronavirus COVID-19, (2020), community.wolfram.com.

[AA1] Anton Antonov, “A monad for Quantile Regression workflows”, (2018), at MathematicaForPrediction WordPress.

[AAp1] Anton Antonov, Heatmap plot Mathematica package, (2018), MathematicaForPrediciton at GitHub.

[AAp2] Anton Antonov, Monadic Quantile Regression Mathematica package, (2018), MathematicaForPrediciton at GitHub.

# A monad for Epidemiologic Compartmental Modeling Workflows

## Introduction

In this document we describe the design and demonstrate the implementation of a (software programming) monad, [Wk1], for Epidemiology Compartmental Modeling (ECM) workflows specification and execution. The design and implementation are done with Mathematica / Wolfram Language (WL). A very similar implementation is also done in R.

Monad’s name is “ECMMon”, which stands for “Epidemiology Compartmental Modeling Monad”, and its monadic implementation is based on the State monad package “StateMonadCodeGenerator.m”, [AAp1, AA1], ECMMon is implemented in the package [AAp8], which relies on the packages [AAp3-AAp6]. The original ECM workflow discussed in [AA5] was implemented in [AAp7]. An R implementation of ECMMon is provided by the package [AAr2].

The goal of the monad design is to make the specification of ECM workflows (relatively) easy and straightforward by following a certain main scenario and specifying variations over that scenario.

We use real-life COVID-19 data, The New York Times COVID-19 data, see [NYT1, AA5].

The monadic programming design is used as a Software Design Pattern. The ECMMon monad can be also seen as a Domain Specific Language (DSL) for the specification and programming of epidemiological compartmental modeling workflows.

Here is an example of using the ECMMon monad over a compartmental model with two types of infected populations:

The table above is produced with the package “MonadicTracing.m”, [AAp2, AA1], and some of the explanations below also utilize that package.

As it was mentioned above the monad ECMMon can be seen as a DSL. Because of this the monad pipelines made with ECMMon are sometimes called “specifications”.

### Contents description

The document has the following structure.

• The sections “Package load” and “Data load” obtain the needed code and data.
• The section “Design consideration” provide motivation and design decisions rationale.
• The section “Single site models” give brief descriptions of certain “seed” models that can be used in the monad.
• The section “Single-site model workflow demo”, “Multi-site workflow demo” give demonstrations of how to utilize the ECMMon monad .
• Using concrete practical scenarios and “real life” data.
• The section “Batch simulation and calibration process” gives methodological preparation for the content of the next two sections.
• The section “Batch simulation workflow” and “Calibration workflow” describe how to do most important monad workflows after the model is developed.
• The section “Future plans” outlines future directions of development.

Remark: One can read only the sections “Introduction”, “Design consideration”, “Single-site models”, and “Batch simulation and calibration process”. That set of sections provide a fairly good, programming language agnostic exposition of the substance and novel ideas of this document.

In this section we load packages used in this notebook:

Import["https://raw.githubusercontent.com/antononcube/SystemModeling/master/Projects/Coronavirus-propagation-dynamics/WL/MonadicEpidemiologyCompartmentalModeling.m"]; Import["https://raw.githubusercontent.com/antononcube/SystemModeling/master/Projects/Coronavirus-propagation-dynamics/WL/MultiSiteModelSimulation.m"] Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicTracing.m"] Import["https://raw.githubusercontent.com/antononcube/ConversationalAgents/master/Packages/WL/ExternalParsersHookup.m"]

Remark: The import commands above would trigger some other package imports.

In this section we ingest data using the “umbrella function” MultiSiteModelReadData from [AAp5]:

AbsoluteTiming[   aData = MultiSiteModelReadData[];   ]  (*{38.8635, Null}*)

### Data summaries

ResourceFunction["RecordsSummary"] /@ aData

### Transform data

Here we transform the population related data in a form convenient for specifying the simulations with it:

aPopulations = Association@Map[{#Lon, #Lat} -> #Population &, Normal[aData["CountyRecords"]]]; aInfected = Association@Map[{#Lon, #Lat} -> #Cases &, Normal[aData["CasesAndDeaths"]]]; aDead = Association@Map[{#Lon, #Lat} -> #Deaths &, Normal[aData["CasesAndDeaths"]]];

### Geo-visualizations

Using the built-in function GeoHistogram we summarize the USA county populations, and COViD-19 infection cases and deaths:

Row@MapThread[GeoHistogram[KeyMap[Reverse, #1], Quantity[140, "Miles"], PlotLabel -> #2, PlotTheme -> "Scientific", PlotLegends -> Automatic, ImageSize -> Medium] &, {{aPopulations, aInfected, aDead}, {"Populations", "Infected", "Dead"}}]

(Note that in the plots above we have to reverse the keys of the given population associations.)

Using the function HextileHistogram from [AAp7 ] here we visualize USA county populations over a hexagonal grid with cell radius 2 degrees ((\approx 140) miles (\approx 222) kilometers):

HextileHistogram[aPopulations, 2, PlotRange -> All, PlotLegends -> Automatic, ImageSize -> Large]

In this notebook we prefer using HextileHistogram because it represents the simulation data in geometrically more faithful way.

## Design considerations

### The big picture

The main purpose of the designed epidemic compartmental modeling framework (i.e. software monad) is to have the ability to do multiple, systematic simulations for different scenario play-outs over large scale geographical regions. The target end-users are decision makers at government level and researchers of pandemic or other large scale epidemic effects.

Here is a diagram that shows the envisioned big picture workflow:

### Large-scale modeling

The standard classical compartmental epidemiology models are not adequate over large geographical areas, like, countries. We design a software framework – the monad ECMMon – that allows large scale simulations using a simple principle workflow:

1. Develop a single-site model for relatively densely populated geographical area for which the assumptions of the classical models (approximately) hold.
2. Extend the single-site model into a large-scale multi-site model using statistically derived traveling patterns; see [AA4].
3. Supply the multi-site model with appropriately prepared data.
4. Run multiple simulations to see large scale implications of different policies.
5. Calibrate the model to concrete observed (or faked) data. Go to 4.

### Flow chart

The following flow chart visualizes the possible workflows the software monad ECMMon:

### Two models in the monad

• An ECMMon object can have one or two models. One of the models is a “seed”, single-site model from [AAp1], which, if desired, is scaled into a multi-site model, [AA3, AAp2].
• Workflows with only the single-site model are supported.
• Say, workflows for doing sensitivity analysis, [AA6, BC1].
• Scaling of a single-site model into multi-site is supported and facilitated.
• Workflows for the multi-site model include preliminary model scaling steps and simulation steps.
• After the single-site model is scaled the monad functions use the multi-site model.
• The workflows should be easy to specify and read.

### Single-site model workflow

1. Make a single-site model.
2. Assign stocks initial conditions.
3. Assign rates values.
4. Simulate.
5. Plot results.
6. Go to 2.

### Multi-site model workflow

1. Make a single-site model.
3. Scale the single-site model into a multi-site model.
1. The single-site assigned rates become “global” when the single-site model is scaled.
2. The scaling is based on assumptions for traveling patterns of the populations.
3. There are few alternatives for that scaling:
1. Using locations geo-coordinates
2. Using regular grids covering a certain area based on in-habited locations geo-coordinates
3. Using traveling patterns contingency matrices
4. Using “artificial” patterns of certain regular types for qualitative analysis purposes
4. Enhance the multi-site traveling patterns matrix and re-scale the single site model.
1. We might want to combine traveling patterns by ground transportation with traveling patterns by airplanes.
2. For quarantine scenarios this might a less important capability of the monad.
1. Hence, this an optional step.
5. Assign stocks initial conditions for each of the sites in multi-scale model.
6. Assign rates for each of the sites.
7. Simulate.
8. Plot global simulation results.
9. Plot simulation results for focus sites.

## Single-site models

We have a collection of single-site models that have different properties and different modeling goals, [AAp3, AA7, AA8]. Here is as diagram of a single-site model that includes hospital beds and medical supplies as limitation resources, [AA7]:

### SEI2HR model

In this sub-section we briefly describe the model SEI2HR, which is used in the examples below.

Remark: SEI2HR stands for “Susceptible, Exposed, Infected Two, Hospitalized, Recovered” (populations).

Detailed description of the SEI2HR model is given in [AA7].

#### Verbal description

We start with one infected (normally symptomatic) person, the rest of the people are susceptible. The infected people meet other people directly or get in contact with them indirectly. (Say, susceptible people touch things touched by infected.) For each susceptible person there is a probability to get the decease. The decease has an incubation period: before becoming infected the susceptible are (merely) exposed. The infected recover after a certain average infection period or die. A certain fraction of the infected become severely symptomatic. If there are enough hospital beds the severely symptomatic infected are hospitalized. The hospitalized severely infected have different death rate than the non-hospitalized ones. The number of hospital beds might change: hospitals are extended, new hospitals are build, or there are not enough medical personnel or supplies. The deaths from infection are tracked (accumulated.) Money for hospital services and money from lost productivity are tracked (accumulated.)

The equations below give mathematical interpretation of the model description above.

#### Equations

Here are the equations of one the epidemiology compartmental models, SEI2HR, [AA7], implemented in [AAp3]:

ModelGridTableForm[SEI2HRModel[t], "Tooltips" -> False]["Equations"] /. {eq_Equal :> TraditionalForm[eq]}

The equations for Susceptible, Exposed, Infected, Recovered populations of SEI2R are “standard” and explanations about them are found in [WK2, HH1]. For SEI2HR those equations change because of the stocks Hospitalized Population and Hospital Beds.

The equations time unit is one day. The time horizon is one year. In this document we consider COVID-19, [Wk2, AA1], hence we do not consider births.

## Single-site model workflow demo

In this section we demonstrate some of the sensitivity analysis discussed in [AA6, BC1].

Make a single-site model, SEI2HR:

model1 = SEI2HRModel[t, "InitialConditions" -> True, "RateRules" -> True, "TotalPopulationRepresentation" -> "AlgebraicEquation"];

Make an association with “default” parameters:

aDefaultPars = <|     aip -> 26,      aincp -> 5,      \[Beta][ISSP] -> 0.5*Piecewise[{{1, t < qsd}, {qcrf, qsd <= t <= qsd + ql}}, 1],      \[Beta][INSP] -> 0.5*Piecewise[{{1, t < qsd}, {qcrf, qsd <= t <= qsd + ql}}, 1],      qsd -> 60,      ql -> 21,      qcrf -> 0.25,      \[Beta][HP] -> 0.01,      \[Mu][ISSP] -> 0.035/aip,      \[Mu][INSP] -> 0.01/aip,      nhbr[TP] -> 3/1000,      lpcr[ISSP, INSP] -> 1,      hscr[ISSP, INSP] -> 1     |>;

Execute the workflow multiple times with different quarantine starts:

qlVar = 56; lsRes =   Map[    ECMMonUnit[]⟹      ECMMonSetSingleSiteModel[model1]⟹      ECMMonAssignRateRules[Join[aDefaultPars, <|qsd -> #, ql -> qlVar|>]]⟹      ECMMonSimulate[365]⟹      ECMMonPlotSolutions[{"Infected Severely Symptomatic Population"}, 240,         "Together" -> True, "Derivatives" -> False,         PlotRange -> {0, 12000}, ImageSize -> 250,         Epilog -> {Orange, Dashed, Line[{{#1, 0}, {#1, 12000}}], Line[{{#1 + qlVar, 0}, {#1 + qlVar, 12000}}]},         PlotLabel -> Row[{"Quarantine start:", Spacer[5], #1, ",", Spacer[5], "end:", Spacer[5], #1 + qlVar}],         "Echo" -> False]⟹      ECMMonTakeValue &, Range[25, 100, 5]];

Plot the simulation solutions for “Infected Severely Symptomatic Population”:

Multicolumn[#[[1, 1]] & /@ lsRes, 4]

Both theoretical and computational details of the workflow above are given [AA7, AA8].

## Multi-site workflow demo

In this section we demonstrate the multi-site model workflow using COVID-19 data for USA, [WRI2, NYT1].

Here a “seed”, single-site model is created:

model1 = SEI2HRModel[t, "InitialConditions" -> True, "RateRules" -> True, "TotalPopulationRepresentation" -> "AlgebraicEquation"];

Here we specify a multi-site model workflow (the monadic steps are separated and described with purple print-outs):

ecmObj =     ECMMonUnit[]⟹     ECMMonSetSingleSiteModel[model1]⟹     ECMMonAssignRateRules[      <|       aip -> 26,        aincp -> 5,        \[Beta][ISSP] -> 0.5*Piecewise[{{1, t < qsd}, {qcrf, qsd <= t <= qsd + ql}}, 1],        \[Beta][INSP] -> 0.5*Piecewise[{{1, t < qsd}, {qcrf, qsd <= t <= qsd + ql}}, 1],        qsd -> 0,        ql -> 56,        qcrf -> 0.25,        \[Beta][HP] -> 0.01,        \[Mu][ISSP] -> 0.035/aip,        \[Mu][INSP] -> 0.01/aip,        nhbr[TP] -> 3/1000,        lpcr[ISSP, INSP] -> 1,        hscr[ISSP, INSP] -> 1       |>      ]⟹     ECMMonEcho[Style["Show the single-site model tabulated form:", Bold, Purple]]⟹     ECMMonEchoFunctionContext[Magnify[ModelGridTableForm[#singleSiteModel], 1] &]⟹     ECMMonMakePolygonGrid[Keys[aPopulations], 1.5, "BinningFunction" -> Automatic]⟹     ECMMonEcho[Style["Show the grid based on population coordinates:", Bold, Purple]]⟹     ECMMonPlotGrid["CellIDs" -> True, ImageSize -> Large]⟹     ECMMonExtendByGrid[aPopulations, 0.12]⟹     ECMMonAssignInitialConditions[aPopulations, "Total Population", "Default" -> 0]⟹     ECMMonAssignInitialConditions[DeriveSusceptiblePopulation[aPopulations, aInfected, aDead], "Susceptible Population", "Default" -> 0]⟹     ECMMonAssignInitialConditions[<||>, "Exposed Population", "Default" -> 0]⟹     ECMMonAssignInitialConditions[aInfected, "Infected Normally Symptomatic Population", "Default" -> 0]⟹     ECMMonAssignInitialConditions[<||>, "Infected Severely Symptomatic Population", "Default" -> 0]⟹     ECMMonEcho[Style["Show total populations initial conditions data:", Bold, Purple]]⟹     ECMMonPlotGridHistogram[aPopulations, ImageSize -> Large, PlotLabel -> "Total populations"]⟹     ECMMonEcho[Style["Show infected and deceased initial conditions data:", Bold,Purple]]⟹     ECMMonPlotGridHistogram[aInfected, ColorFunction -> ColorData["RoseColors"], "ShowDataPoints" -> False, ImageSize -> Large, PlotLabel -> "Infected"]⟹     ECMMonPlotGridHistogram[aDead, ColorFunction -> ColorData["RoseColors"], "ShowDataPoints" -> False, ImageSize -> Large, PlotLabel -> "Deceased"]⟹     ECMMonEcho[Style["Simulate:", Bold, Purple]]⟹     ECMMonSimulate[365]⟹     ECMMonEcho[Style["Show global population simulation results:", Bold, Purple]]⟹     ECMMonPlotSolutions[__ ~~ "Population", 365]⟹     ECMMonEcho[Style["Show site simulation results for Miami and New York areas:", Bold, Purple]]⟹     ECMMonPlotSiteSolutions[{160, 174}, __ ~~ "Population", 365]⟹     ECMMonEcho[Style["Show deceased and hospitalzed populations results for Miami and New York areas:", Bold, Purple]]⟹     ECMMonPlotSiteSolutions[{160, 174}, {"Deceased Infected Population", "Hospitalized Population","Hospital Beds"}, 300, "FocusTime" -> 120];

Theoretical and computational details about the multi-site workflow can be found in [AA4, AA5].

## Batch simulations and calibration processes

In this section we describe the in general terms the processes of model batch simulations and model calibration. The next two sections give more details of the corresponding software design and workflows.

### Definitions

Batch simulation: If given a SD model (M), the set (P) of parameters of (M), and a set (B) of sets of values (P), (B\text{:=}\left{V_i\right}), then the set of multiple runs of (M) over (B) are called batch simulation.

Calibration: If given a model (M), the set (P) of parameters of (M), and a set of (k) time series (T\text{:=}\left{T_i\right}_{i=1}^k) that correspond to the set of stocks (S\text{:=}\left{S_i\right}_{i=1}^k) of (M) then the process of finding concrete the values (V) for (P) that make the stocks (S) to closely resemble the time series (T) according to some metric is called calibration of (M) over the targets (T).

### Roles

• There are three types of people dealing with the models:
• Modeler, who develops and implements the model and prepares it for calibration.
• Calibrator, who calibrates the model with different data for different parameters.
• Stakeholder, who requires different features of the model and outcomes from different scenario play-outs.
• There are two main calibration scenarios:
• Modeler and Calibrator are the same person
• Modeler and Calibrator are different persons

### Process

Model development and calibration is most likely going to be an iterative process.

For concreteness let us assume that the model has matured development-wise and batch simulation and model calibration is done in a (more) formal way.

Here are the steps of a well defined process between the modeling activity players described above:

1. Stakeholder requires certain scenarios to be investigated.
2. Modeler prepares the model for those scenarios.
3. Stakeholder and Modeler formulate a calibration request.
4. Calibrator uses the specifications from the calibration request to:
1. Calibrate the model
2. Derive model outcomes results
3. Provide model qualitative results
4. Provide model sensitivity analysis results
5. Modeler (and maybe Stakeholder) review the results and decides should more calibration be done.
1. I.e. go to 3.
6. Modeler does batch simulations with the calibrated model for the investigation scenarios.
7. Modeler and Stakeholder prepare report with the results.

See the documents [AA9, AA10] have questionnaires that further clarify the details of interaction between the modelers and calibrators.

### Batch simulation vs calibration

In order to clarify the similarities and differences between batch simulation and calibration we list the following observations:

• Each batch simulation or model calibration is done either for model development purposes or for scenario play-out studies.
• Batch simulation is used for qualitative studies of the model. For example, doing sensitivity analysis; see [BC1, AA7, AA8].
• Before starting the calibration we might want to study the “landscape” of the search space of the calibration parameters using batch simulations.
• Batch simulation is also done after model calibration in order to evaluate different scenarios,
• For some models with large computational complexity batch simulation – together with some evaluation metric – can be used instead of model calibration.

## Batch simulations workflow

In this section we describe the specification and execution of model batch simulations.

Batch simulations can be time consuming, hence it is good idea to

In the rest of the section we go through the following steps:

1. Make a model object
2. Batch simulate over a few combinations of parameters and show:
1. Plots of the simulation results for all populations
2. Plots of the simulation results for a particular population
3. Batch simulate over the Cartesian (outer) product of values lists of a selected pair of parameters and show the corresponding plots of all simulations

### Model (object) for batch simulations

Here we make a new ECMMon object:

ecmObj2 =   ECMMonUnit[]⟹    ECMMonSetSingleSiteModel[model1]⟹    ECMMonAssignRateRules[aDefaultPars];

### Direct specification of combinations of parameters

#### All populations

Here we simulate the model in the object of different parameter combinations given in a list of associations:

res1 =   ecmObj2⟹    ECMMonBatchSimulate[___ ~~ "Population", {<|qsd -> 60, ql -> 28|>, <|qsd -> 55, ql -> 28|>, <|qsd -> 75, ql -> 21, \[Beta][ISSP] -> 0|>}, 240]⟹    ECMMonTakeValue;

Remark: The stocks in the results are only stocks that are populations – that is specified with the string expression pattern ___~~”Population”.

Here is the shape of the result:

Short /@ res1

Here are the corresponding plots:

ListLinePlot[#, PlotTheme -> "Detailed", ImageSize -> Medium, PlotRange -> All] & /@ res1

#### Focus population

We might be interested in the batch simulations results for only one, focus populations. Here is an example:

res2 =     ecmObj2⟹     ECMMonBatchSimulate["Infected Normally Symptomatic Population", {<|qsd -> 60, ql -> 28|>, <|qsd -> 55, ql -> 28|>, <|qsd -> 75, ql -> 21, \[Beta][ISSP] -> 0|>}, 240]⟹     ECMMonTakeValue;

Here is the shape of the result:

Short /@ res2

Here are the corresponding plots:

Multicolumn[  KeyValueMap[   ListLinePlot[#2, PlotLabel -> #1, PlotTheme -> "Detailed",      Epilog -> {Directive[Orange, Dashed],        Line[{Scaled[{0, -1}, {#1[qsd], 0}], Scaled[{0, 1}, {#1[qsd], 0}]}],        Line[{Scaled[{0, -1}, {#1[qsd] + #1[ql], 0}], Scaled[{0, 1}, {#1[qsd] + #1[ql], 0}]}]},      ImageSize -> Medium] &, res2]]

### Outer product of parameters

Instead of specifying an the combinations of parameters directly we can specify the values taken by each parameter using an association in which the keys are parameters and the values are list of values:

res3 =     ecmObj2⟹     ECMMonBatchSimulate[__ ~~ "Population", <|qsd -> {60, 55, 75}, ql -> {28, 21}|>, 240]⟹     ECMMonTakeValue;

Here is the shallow form of the results

Short /@ res3

Here are the corresponding plots:

Multicolumn[  KeyValueMap[   ListLinePlot[#2, PlotLabel -> #1, PlotTheme -> "Detailed",      Epilog -> {Directive[Gray, Dashed],        Line[{Scaled[{0, -1}, {#1[qsd], 0}], Scaled[{0, 1}, {#1[qsd], 0}]}],        Line[{Scaled[{0, -1}, {#1[qsd] + #1[ql], 0}], Scaled[{0, 1}, {#1[qsd] + #1[ql], 0}]}]},      ImageSize -> Medium] &, res3]]

## Calibration workflow

In this section we go through the computation steps of the calibration of single-site SEI2HR model.

Remark: We use real data in this section, but the presented calibration results and outcome plots are for illustration purposes only. A rigorous study with discussion of the related assumptions and conclusions is beyond the scope of this notebook/document.

### Calibration steps

Here are the steps performed in the rest of the sub-sections of this section:

1. Ingest data for infected cases, deaths due to disease, etc.
2. Choose a model to calibrate.
3. Make the calibration targets – those a vectors corresponding to time series over regular grids.
1. Consider using all of the data in order to evaluate model’s applicability.
2. Consider using fractions of the data in order to evaluate model’s ability to predict the future or reconstruct data gaps.
4. Choose calibration parameters and corresponding ranges for their values.
5. If more than one target choose the relative weight (or importance) of the targets.
6. Calibrate the model.
7. Evaluate the fitting between the simulation results and data.
1. Using statistics and plots.
8. Make conclusions. If insufficiently good results are obtained go to 2 or 4.

Remark: When doing calibration epidemiological models a team of people it is better certain to follow (rigorously) well defined procedures. See the documents:

Remark: We plan to prepare have several notebooks dedicated to calibration of both single-site and multi-site models.

### USA COVID-19 data

Here data for the USA COVID-19 infection cases and deaths from [NYT1] (see [AA6] data ingestion details):

lsCases = {1, 1, 1, 2, 3, 5, 5, 5, 5, 6, 7, 8, 11, 11, 11, 12, 12, 12,12, 12, 13, 13, 14, 15, 15, 15, 15, 25, 25, 25, 27, 30, 30, 30, 43, 45, 60, 60, 65, 70, 85, 101, 121, 157, 222, 303, 413, 530, 725,976, 1206, 1566, 2045, 2603, 3240, 4009, 5222, 6947, 9824, 13434, 17918, 23448, 30082, 37696, 46791, 59678, 73970, 88796, 103318, 119676, 139165, 160159, 184764, 212033, 241127, 262275, 288195, 314991, 341540, 370689, 398491, 423424, 445213, 467106, 490170, 512972, 539600, 566777, 590997, 613302, 637812, 660549, 685165, 714907, 747741, 777098, 800341, 820764, 844225, 868644, 895924, 927372, 953923, 977395, 998136, 1020622, 1043873, 1069587, 1095405,1118643, 1137145, 1155671, 1176913, 1196485, 1222057, 1245777, 1267911, 1285105, 1306316, 1326559, 1349019, 1373255, 1395981, 1416682, 1436260, 1455183, 1473813, 1491974, 1513223, 1536848, 1559020, 1578876, 1600414, 1620096, 1639677, 1660303, 1688335, 1709852, 1727711, 1745546, 1763803, 1784049, 1806724, 1831494, 1855870, 1874023, 1894074, 1918373, 1943743, 1970066, 2001470, 2031613, 2057493, 2088420, 2123068, 2159633, 2199841, 2244876, 2286401, 2324563, 2362875, 2411709, 2461341, 2514500, 2573030, 2622980, 2667278, 2713656, 2767129, 2825865, 2885325, 2952393, 3012349, 3069369, 3129738, 3194944, 3263077, 3338308, 3407962, 3469137, 3529938, 3588229, 3653114, 3721574, 3790356, 3862588, 3928575, 3981476, 4039440, 4101329, 4167741, 4235717, 4303663, 4359188, 4408708, 4455340, 4507370, 4560539, 4617036, 4676822, 4730639, 4777548, 4823529, 4876038, 4929115, 4981066, 5038637, 5089258, 5130147, 5166032, 5206970, 5251824, 5297150, 5344322, 5388034, 5419494, 5458726, 5497530, 5541830, 5586297, 5631403, 5674714, 5707327, 5742814, 5786178, 5817338, 5862014, 5917466, 5958619, 5988001, 6012054, 6040456, 6073671, 6110645, 6157050, 6195893, 6228601, 6264192, 6301923, 6341145, 6385411, 6432677, 6472474, 6507345, 6560827, 6597281, 6638806, 6682079, 6734971, 6776512, 6812354, 6847745, 6889421, 6930523, 6975693, 7027692, 7073962, 7107992, 7168298, 7210171, 7261433, 7315687, 7373073, 7422798, 7466501, 7513020, 7565839, 7625285, 7688761, 7757326, 7808123, 7853753, 7916879, 7976530, 8039653, 8113165, 8193658, 8270925, 8328369, 8401001, 8473618, 8555199, 8642599, 8737995, 8814233, 8892933, 8983153, 9074711, 9182627, 9301455, 9436244, 9558668, 9659817, 9784920, 9923082, 10065150, 10222441, 10405550, 10560047, 10691686, 10852769, 11011484, 11183982, 11367840, 11561152, 11727724, 11864571, 12039323, 12213742, 12397014, 12495699, 12693598, 12838076, 12972986, 13135728, 13315143, 13516558, 13728192, 13958512, 14158135, 14325555, 14519697, 14731424, 14954596, 15174109, 15447371, 15647963, 15826415, 16020169, 16218331, 16465552, 16697862, 16941306, 17132902, 17305013, 17498382, 17694678, 17918870, 18106293, 18200349, 18410644, 18559596, 18740591, 18932346, 19157710};
lsDeaths = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 6, 10, 12, 12, 15, 19, 22, 26, 31, 37, 43, 50, 59, 63, 84, 106, 137, 181, 223, 284, 335, 419, 535, 694, 880, 1181, 1444, 1598, 1955, 2490, 3117, 3904, 4601, 5864, 6408, 7376, 8850, 10159, 11415,12924, 14229, 15185, 16320, 18257, 20168, 21941, 23382, 24617, 26160, 27535, 29821, 31633, 33410, 35104, 36780, 37660, 38805, 40801, 42976, 44959, 46552, 48064, 49122, 50012, 52079, 54509, 56277, 57766, 59083, 59903, 60840, 62299, 63961, 65623, 67143, 68260, 68959, 69633, 71042, 72474, 73718, 74907, 75891, 76416, 76888, 77579, 79183, 80329, 81452, 82360, 82904, 83582, 84604, 85545, 86487, 87559, 88256, 88624, 89235, 90220, 91070, 91900, 92621, 93282, 93578, 93988, 94710, 95444, 96123, 96760, 97297, 97514, 97832, 98638, 99372, 101807, 102525, 103018, 103258, 103594,104257, 104886, 105558, 106148, 106388, 106614, 106998, 107921, 108806, 109705, 110522, 111177, 111567, 111950, 112882, 113856, 114797, 115695, 116458, 116854, 117367, 118475, 119606, 120684, 121814, 122673, 123105, 124782, 126089, 127465, 128720, 130131, 131172, 131584, 132174, 133517, 134756, 135271, 136584, 137547, 138076, 138601, 140037, 141484, 142656, 143793, 144819, 145315, 145831, 147148, 148517, 149550, 150707, 151641, 152065, 152551, 153741, 154908, 156026, 157006, 157869, 158235, 158714, 159781, 160851, 161916, 162869, 163573, 163970, 164213, 164654, 165811, 166713, 167913, 168594, 168990, 169423, 170674, 171663, 172493, 173408, 174075, 174281, 174689, 175626, 176698, 177559, 178390, 179145, 179398, 179736, 180641, 181607, 182455, 183282, 183975, 184298, 184698, 185414, 186359, 187280, 188168, 188741, 189165, 189501, 190312, 191267, 192077, 192940, 193603, 193976, 194481, 195405, 196563, 197386, 198271, 199126, 199462, 199998, 200965, 201969, 202958, 203878, 204691, 205119, 205623, 206757, 208331, 209417, 210829, 211838, 212277, 212989, 214442, 215872, 217029, 218595, 219791, 220402, 221165, 222750, 224641, 226589, 228452, 229868, 230695, 231669, 233856, 236127, 237284, 238636, 239809, 240607, 241834, 244430, 247258, 250102, 252637, 254784, 255857, 257282, 260038, 263240, 266144, 268940, 271172, 272499, 274082, 277088, 280637, 283899, 286640, 289168, 290552, 292346, 295549, 298953, 301713, 302802, 304403, 305581, 307396, 310994, 314654};

Remark: The COVID-19 data was ingested from [NYT1] on 2020-12-31,

### Calibration targets

From the data we make the calibration targets association:

aTargets = <|{ISSP -> 0.2 lsCases, INSP -> 0.8 lsCases, DIP -> lsDeaths}|>;

Remark: Note that we split the infection cases into 20% severely symptomatic cases and 80% normally symptomatic cases.

Here is the corresponding plot:

ListLogPlot[aTargets, PlotTheme -> "Detailed", PlotLabel -> "Calibration targets", ImageSize -> Medium]

Here we prepare a smaller set of the targets data for the calibration experiments below:

aTargetsShort = Take[#, 170] & /@ aTargets;

### Model creation

modelSEI2HR = SEI2HRModel[t, "TotalPopulationRepresentation" -> "AlgebraicEquation"];

Here are the parameters we want to experiment with (or do calibration with):

lsFocusParams = {aincp, aip, sspf[SP], \[Beta][HP], qsd, ql, qcrf, nhbcr[ISSP, INSP], nhbr[TP]};

Here we set custom rates and initial conditions:

aDefaultPars = <|     \[Beta][ISSP] -> 0.5*Piecewise[{{1, t < qsd}, {qcrf, qsd <= t <= qsd + ql}}, 1],      \[Beta][INSP] -> 0.5*Piecewise[{{1, t < qsd}, {qcrf, qsd <= t <= qsd + ql}}, 1],      qsd -> 60,      ql -> 8*7,      qcrf -> 0.25,      \[Beta][HP] -> 0.01,      \[Mu][ISSP] -> 0.035/aip,      \[Mu][INSP] -> 0.01/aip,      nhbr[TP] -> 3/1000,      lpcr[ISSP, INSP] -> 1,      hscr[ISSP, INSP] -> 1     |>;

Remark: Note the piecewise functions for (\beta [\text{ISSP}]) and (\beta [\text{INSP}]).

### Calibration

Here is the USA population number we use for calibration:

usaPopulation = QuantityMagnitude@CountryData["UnitedStates", "Population"]  (*329064917*)

Here is we create a ECMMon object that has default parameters and initial conditions assigned above:

AbsoluteTiming[   ecmObj3 =      ECMMonUnit[]⟹      ECMMonSetSingleSiteModel[modelSEI2HR]⟹      ECMMonAssignInitialConditions[<|TP[0] -> usaPopulation, SP[0] -> usaPopulation - 1, ISSP[0] -> 1|>]⟹      ECMMonAssignRateRules[KeyDrop[aDefaultPars, {aip, aincp, qsd, ql, qcrf}]]⟹      ECMMonCalibrate[       "Target" -> KeyTake[aTargetsShort, {ISSP, DIP}],        "StockWeights" -> <|ISSP -> 0.8, DIP -> 0.2|>,        "Parameters" -> <|aip -> {10, 35}, aincp -> {2, 16}, qsd -> {60, 120}, ql -> {20, 160}, qcrf -> {0.1, 0.9}|>,        DistanceFunction -> EuclideanDistance,        Method -> {"NelderMead", "PostProcess" -> False},        MaxIterations -> 1000       ];   ]  (*{28.0993, Null}*)

Here are the found parameters:

calRes = ecmObj3⟹ECMMonTakeValue  (*{152516., {aip -> 10., aincp -> 3.67018, qsd -> 81.7067, ql -> 111.422, qcrf -> 0.312499}}*)

#### Using different minimization methods and distance functions

In the monad the calibration of the models is done with NMinimize. Hence, the monad function ECMMonCalibrate takes all options of NMinimize and can do calibrations with the same data and parameter search space using different global minima finding methods and distance functions.

Remark: EuclideanDistance is an obvious distance function, but use others like infinity norm and sum norm. Also, we can use a distance function that takes parts of the data. (E.g. between days 50 and 150 because the rest of the data is, say, unreliable.)

### Verification of the fit

maxTime = Length[aTargets[[1]]];
ecmObj4 =     ECMMonUnit[]⟹     ECMMonSetSingleSiteModel[modelSEI2HR]⟹     ECMMonAssignInitialConditions[<|TP[0] -> usaPopulation, SP[0] -> usaPopulation - 1, ISSP[0] -> 1|>]⟹     ECMMonAssignRateRules[Join[aDefaultPars, Association[calRes[[2]]]]]⟹     ECMMonSimulate[maxTime]⟹     ECMMonPlotSolutions[___ ~~ "Population" ~~ ___, maxTime, ImageSize -> Large, LogPlot -> False];

aSol4 = ecmObj4⟹ECMMonGetSolutionValues[Keys[aTargets], maxTime]⟹ECMMonTakeValue;
Map[ListLogPlot[{aSol4[#], aTargets[#]}, PlotLabel -> #, PlotRange -> All, ImageSize -> Medium, PlotLegends -> {"Calibrated model", "Target"}] &, Keys[aTargets]]

### Conclusions from the calibration results

We see the that with the calibration found parameter values the model can fit the data for the first 200 days, after that it overestimates the evolution of the infected and deceased popupulations.

We can conjecture that:

• The model is too simple, hence inadequate
• That more complicated quarantine policy functions have to be used
• That the calibration process got stuck in some local minima

## Future plans

In this section we outline some of the directions in which the presented work on ECMMon can be extended.

### More unit tests and random unit tests

We consider the preparation and systematic utilization of unit tests to be a very important component of any software development. Unit tests are especially important when complicated software package like ECMMon are developed.

For the presented software monad (and its separately developed, underlying packages) have implemented a few collections of tests, see [AAp10, AAp11].

We plan to extend and add more complicated unit tests that test for both quantitative and qualitative behavior. Here are some examples for such tests:

• Stock-vs-stock orbits produced by simulations of certain epidemic models
• Expected theoretical relationships between populations (or other stocks) for certain initial conditions and rates
• Wave-like propagation of the proportions of the infected populations in multi-site models over artificial countries and traveling patterns
• Finding of correct parameter values with model calibration over different data (both artificial and real life)
• Expected number of equations for different model set-ups
• Expected (relative) speed of simulations with respect to model sizes

Further for the monad ECMMon we plant to develop random pipeline unit tests as the ones in [AAp12] for the classification monad ClCon, [AA11].

### More comprehensive calibration guides and documentation

We plan to produce more comprehensive guides for doing calibration with ECMMon and in general with Mathematica’s NDSolve and NMinimize functions.

### Full correspondence between the Mathematica and R implementations

The ingredients of the software monad ECMMon and ECMMon itself were designed and implemented in Mathematica first. The corresponding design and implementation was done in R, [AAr2]. To distinguish the two implementations we call the R one ECMMon-R and Mathematica (Wolfram Language) one ECMMon-WL.

At this point the calibration is not implemented in ECMMon-R, but we plan to do that soon.

Using ECMMon-R (and the RStudio’s Shiny ecosystem) allows for highly shareable interactive interfaces to be programed. Here is an example: https://antononcube.shinyapps.io/SEI2HR-flexdashboard/ .

(With Mathematica similar interactive interfaces are presented in [AA7, AA8].)

### Model transfer between Mathematica and R

We are very interested in transferring epidemiological models from Mathematica to R (or Python, or Julia.)

This can be done in two principle ways: (i) using Mathematica expressions parsers, or (ii) using matrix representations. We plan to investigate the usage of both approaches.

### Conversational agent

Consider the making of a conversational agent for epidemiology modeling workflows building. Initial design and implementation is given in [AA13, AA14].

Consider the following epidemiology modeling workflow specification:

lsCommands = "create with SEI2HR;assign 100000 to total population;set infected normally symptomatic population to be 0;set infected severely symptomatic population to be 1;assign 0.56 to contact rate of infected normally symptomatic population;assign 0.58 to contact rate of infected severely symptomatic population;assign 0.1 to contact rate of the hospitalized population;simulate for 240 days;plot populations results;calibrate for target DIPt -> tsDeaths, over parameters contactRateISSP in from 0.1 to 0.7;echo pipeline value";

Here is the ECMMon code generated using the workflow specification:

ToEpidemiologyModelingWorkflowCode[lsCommands, "Execute" -> False, "StringResult" -> True]  (*"ECMMonUnit[SEI2HRModel[t]] ⟹ECMMonAssignInitialConditions[<|TP[0] -> 100000|>] ⟹ECMMonAssignInitialConditions[<|INSP[0] -> 0|>] ⟹ECMMonAssignInitialConditions[<|ISSP[0] -> 1|>] ⟹ECMMonAssignRateRules[<|\\[Beta][INSP] -> 0.56|>] ⟹ECMMonAssignRateRules[<|\\[Beta][ISSP] -> 0.58|>] ⟹ECMMonAssignRateRules[<|\\[Beta][HP] -> 0.1|>] ⟹ECMMonSimulate[\"MaxTime\" -> 240] ⟹ECMMonPlotSolutions[ \"Stocks\" -> __ ~~ \"Population\"] ⟹ECMMonCalibrate[ \"Target\" -> <|DIP -> tsDeaths|>, \"Parameters\" -> <|\\[Beta][ISSP] -> {0.1, 0.7}|> ] ⟹ECMMonEchoValue[]"*)

Here is the execution of the code above:

Block[{tsDeaths = Take[lsDeaths, 150]}, ToEpidemiologyModelingWorkflowCode[lsCommands]];

#### Different target languages

Using the natural commands workflow specification we can generate code to other languages, like, Python or R:

ToEpidemiologyModelingWorkflowCode[lsCommands, "Target" -> "Python"]  (*"obj = ECMMonUnit( model = SEI2HRModel())obj = ECMMonAssignInitialConditions( ecmObj = obj, initConds = [TPt = 100000])obj = ECMMonAssignInitialConditions( ecmObj = obj, initConds = [INSPt = 0])obj = ECMMonAssignInitialConditions( ecmObj = obj, initConds = [ISSPt = 1])obj = ECMMonAssignRateValues( ecmObj = obj, rateValues = [contactRateINSP = 0.56])obj = ECMMonAssignRateValues( ecmObj = obj, rateValues = [contactRateISSP = 0.58])obj = ECMMonAssignRateValues( ecmObj = obj, rateValues = [contactRateHP = 0.1])obj = ECMMonSimulate( ecmObj = obj, maxTime = 240)obj = ECMMonPlotSolutions( ecmObj = obj, stocksSpec = \".*Population\")obj = ECMMonCalibrate( ecmObj = obj,  target = [DIPt = tsDeaths], parameters = [contactRateISSP = [0.1, 0.7]] )"*)

## References

### Articles

[Wk1] Wikipedia entry, Monad.

[Wk2] Wikipedia entry, “Compartmental models in epidemiology”.

[Wk3] Wikipedia entry, “Coronavirus disease 2019”.

[BC1] Lucia Breierova, Mark Choudhari, An Introduction to Sensitivity Analysis, (1996), Massachusetts Institute of Technology.

[JS1] John D.Sterman, Business Dynamics: Systems Thinking and Modeling for a Complex World. (2000), New York: McGraw.

[HH1] Herbert W. Hethcote (2000). “The Mathematics of Infectious Diseases”. SIAM Review. 42 (4): 599–653. Bibcode:2000SIAMR..42..599H. doi:10.1137/s0036144500371907.

[AA1] Anton Antonov, ”Monad code generation and extension”, (2017), MathematicaForPrediction at GitHub/antononcube.

[AA2] Anton Antonov, “Coronavirus propagation modeling considerations”, (2020), SystemModeling at GitHub/antononcube.

[AA3] Anton Antonov, “Basic experiments workflow for simple epidemiological models”, (2020), SystemModeling at GitHub/antononcube.

[AA4] Anton Antonov, “Scaling of Epidemiology Models with Multi-site Compartments”, (2020), SystemModeling at GitHub/antononcube.

[AA5] Anton Antonov, “WirVsVirus hackathon multi-site SEI2R over a hexagonal grid graph”, (2020), SystemModeling at GitHub/antononcube.

[AA6] Anton Antonov, “NY Times COVID-19 data visualization”, (2020), SystemModeling at GitHub/antononcube.

[AA7] Anton Antonov, “SEI2HR model with quarantine scenarios”, (2020), SystemModeling at GitHub/antononcube.

[AA8] Anton Antonov, “SEI2HR-Econ model with quarantine and supplies scenarios”, (2020), SystemModeling at GitHub/antononcube.

[AA9] Anton Antonov, Modelers questionnaire, (2020), SystemModeling at GitHub/antononcube.

[AA10] Anton Antonov, Calibrators questionnaire, (2020), SystemModeling at GitHub/antononcube.

[AA11] Anton Antonov, A monad for classification workflows, (2018), MathematicaForPrediction at WordPress.

### Repositories, packages

[WRI1] Wolfram Research, Inc., “Epidemic Data for Novel Coronavirus COVID-19”, WolframCloud.

[WRI2] Wolfram Research Inc., USA county records, (2020), System Modeling at GitHub.

[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.

[AAr1] Anton Antonov, Coronavirus propagation dynamics project, (2020), SystemModeling at GitHub/antononcube.

[AAr2] Anton Antonov, Epidemiology Compartmental Modeling Monad R package, (2020), ECMMon-R at GitHu/antononcube.

[AAp1] Anton Antonov, State monad code generator Mathematica package, (2017), MathematicaForPrediction at GitHub/antononcube.

[AAp2] Anton Antonov, Monadic tracing Mathematica package, (2017), MathematicaForPrediction at GitHub/antononcube.

[AAp3] Anton Antonov, Epidemiology models Mathematica package, (2020), SystemModeling at GitHub/antononcube.

[AAp4] Anton Antonov, Epidemiology models modifications Mathematica package, (2020), SystemModeling at GitHub/antononcube.

[AAp5] Anton Antonov, Epidemiology modeling visualization functions Mathematica package, (2020), SystemModeling at GitHub/antononcube.

[AAp6] Anton Antonov, System dynamics interactive interfaces functions Mathematica package, (2020), SystemModeling at GitHub/antononcube.

[AAp7] Anton Antonov, Multi-site model simulation Mathematica package, (2020), SystemModeling at GitHub/antononcube.

[AAp8] Anton Antonov, Monadic Epidemiology Compartmental Modeling Mathematica package, (2020), SystemModeling at GitHub/antononcube.

[AAp9] Anton Antonov, Hextile bins Mathematica package, (2020), MathematicaForPrediction at GitHub/antononcube.

[AAp10] Anton Antonov, Monadic Epidemiology Compartmental Modeling Mathematica unit tests, (2020), SystemModeling at GitHub/antononcube.

[AAp11] Anton Antonov, Epidemiology Models NDSolve Mathematica unit tests, (2020), SystemModeling at GitHub/antononcube.

[AAp12] Anton Antonov, Monadic contextual classification random pipelines Mathematica unit tests, (2018), MathematicaForPrediction at GitHub/antononcube.

[AAp13] Anton Antonov, Epidemiology Modeling Workflows Raku package, (2020), Raku-DSL-English-EpidemiologyModelingWorkflows at GitHu/antononcube.

[AAp14] Anton Antonov, External Parsers Hookup Mathematica package, (2019), ConversationalAgents at GitHub.