## Introduction

*This post is both an update and a full-blown version of an older post — “NY Times COVID-19 data visualization” — using NY Times COVID-19 data up to 2021-01-13.*

The purpose of this document/notebook is to give data locations, data ingestion code, and code for rudimentary analysis and visualization of COVID-19 data provided by New York Times, [NYT1].

The following steps are taken:

- Ingest data
- Take COVID-19 data from The New York Times, based on reports from state and local health agencies, [NYT1].
- Take USA counties records data (FIPS codes, geo-coordinates, populations), [WRI1].

- Merge the data.
- Make data summaries and related plots.
- Make corresponding geo-plots.
- Do “out of the box” time series forecast.
- Analyze fluctuations around time series trends.

Note that other, older repositories with COVID-19 data exist, like, [JH1, VK1].

**Remark:** The time series section is done for illustration purposes only. The forecasts there should not be taken seriously.

## Import data

### NYTimes USA states data

```
-19-data/master/us-states.csv"];
dsNYDataStates = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/nytimes/covidAll, AssociationThread[Capitalize /@ Keys[#], Values[#]] &];
dsNYDataStates = dsNYDataStates[1 ;; 6]] dsNYDataStates[[
```

` ResourceFunction["RecordsSummary"][dsNYDataStates]`

### NYTimes USA counties data

```
-19-data/master/us-counties.csv"];
dsNYDataCounties = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/nytimes/covidAll, AssociationThread[Capitalize /@ Keys[#], Values[#]] &];
dsNYDataCounties = dsNYDataCounties[1 ;; 6]] dsNYDataCounties[[
```

` ResourceFunction["RecordsSummary"][dsNYDataCounties]`

### US county records

```
dsUSACountyData = ResourceFunction["ImportCSVToDataset"]["https://raw.githubusercontent.com/antononcube/SystemModeling/master/Data/dfUSACountyRecords.csv"];All, Join[#, <|"FIPS" -> ToExpression[#FIPS]|>] &];
dsUSACountyData = dsUSACountyData[1 ;; 6]] dsUSACountyData[[
```

` ResourceFunction["RecordsSummary"][dsUSACountyData]`

## Merge data

Verify that the two datasets have common FIPS codes:

```
Length[Intersection[Normal[dsUSACountyData[All, "FIPS"]], Normal[dsNYDataCounties[All, "Fips"]]]]
3133*) (*
```

Merge the datasets:

`Normal[dsNYDataCounties], Normal[dsUSACountyData[All, {"FIPS", "Lat", "Lon", "Population"}]], Key["Fips"] -> Key["FIPS"]]]; dsNYDataCountiesExtended = Dataset[JoinAcross[`

Add a “DateObject” column and (reverse) sort by date:

```
All, Join[<|"DateObject" -> DateObject[#Date]|>, #] &];
dsNYDataCountiesExtended = dsNYDataCountiesExtended[
dsNYDataCountiesExtended = dsNYDataCountiesExtended[ReverseSortBy[#DateObject &]];1 ;; 6]] dsNYDataCountiesExtended[[
```

## Basic data analysis

We consider cases and deaths for the last date only. (The queries can be easily adjusted for other dates.)

```
Select[#Date == dsNYDataCountiesExtended[1, "Date"] &], {"FIPS", "Cases", "Deaths"}];
dfQuery = dsNYDataCountiesExtended[All, Prepend[#, "FIPS" -> ToString[#FIPS]] &]; dfQuery = dfQuery[
```

```
Total[dfQuery[All, {"Cases", "Deaths"}]]
22387340, "Deaths" -> 355736|>*) (*<|"Cases" ->
```

Here is the summary of the values of cases and deaths across the different USA counties:

` ResourceFunction["RecordsSummary"][dfQuery]`

The following table of plots shows the distributions of cases and deaths and the corresponding Pareto principle adherence plots:

```
PlotRange -> All, ImageSize -> Medium};
opts = {Rasterize[Grid[
Function[{columnName},
Histogram[Log10[#], PlotLabel -> Row[{Log10, Spacer[3], columnName}], opts], ResourceFunction["ParetoPrinciplePlot"][#, PlotLabel -> columnName, opts]} &@Normal[dfQuery[All, columnName]]
{
] /@ {"Cases", "Deaths"}, Dividers -> All, FrameStyle -> GrayLevel[0.7]]]
```

A couple of observations:

- The logarithms of the cases and deaths have nearly Normal or Logistic distributions.
- Typical manifestation of the Pareto principle: 80% of the cases and deaths are registered in 20% of the counties.

**Remark:** The top 20% counties of the cases are not necessarily the same as the top 20% counties of the deaths.

### Distributions

Here we find the distributions that correspond to the cases and deaths (using `FindDistribution`

):

```
List @@@ Map[Function[{columnName},
ResourceFunction["GridTableForm"][Select[#, # > 0 &]]] &@Normal[dfQuery[All, columnName]]
columnName -> FindDistribution[N@Log10[TableHeadings -> {"Data", "Distribution"}] ], {"Cases", "Deaths"}],
```

### Pareto principle locations

The following query finds the intersection between that for the top 600 Pareto principle locations for the cases and deaths:

```
Length[Intersection @@ Map[Function[{columnName}, Keys[TakeLargest[Normal@dfQuery[Association, #FIPS -> #[columnName] &], 600]]], {"Cases", "Deaths"}]]
516*) (*
```

## Geo-histogram

```
Union[Normal[dsNYDataCountiesExtended[All, "Date"]]];
lsAllDates = Length
lsAllDates //
359*) (*
```

```
Manipulate[
DynamicModule[{ds = dsNYDataCountiesExtended[Select[#Date == datePick &]]},
GeoHistogram[Normal[ds[All, {"Lat", "Lon"}][All, Values]] -> N[Normal[ds[All, columnName]]],
150, "Miles"], PlotLabel -> columnName, PlotLegends -> Automatic, ImageSize -> Large, GeoProjection -> "Equirectangular"]
Quantity[
],
{{columnName, "Cases", "Data type:"}, {"Cases", "Deaths"}}, Last[lsAllDates], "Date:"}, lsAllDates}] {{datePick,
```

## Heat-map plots

An alternative of the geo-visualization is to use a heat-map plot. Here we use the package “HeatmapPlot.m”, [AAp1].

`Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Misc/HeatmapPlot.m"]`

### Cases

Cross-tabulate states with dates over cases:

`All, {"State", "Date", "Cases"}], "Sparse" -> True]; matSDC = ResourceFunction["CrossTabulate"][dsNYDataStates[`

Make a heat-map plot by sorting the columns of the cross-tabulation matrix (that correspond to states):

`DistanceFunction -> {EuclideanDistance, None}, AspectRatio -> 1/2, ImageSize -> 1000] HeatmapPlot[matSDC, `

### Deaths

Cross-tabulate states with dates over deaths and plot:

```
All, {"State", "Date", "Deaths"}], "Sparse" -> True];
matSDD = ResourceFunction["CrossTabulate"][dsNYDataStates[DistanceFunction -> {EuclideanDistance, None}, AspectRatio -> 1/2, ImageSize -> 1000] HeatmapPlot[matSDD,
```

## Time series analysis

### Cases

#### Time series

For each date sum all cases over the states, make a time series, and plot it:

```
List @@@ Normal[GroupBy[Normal[dsNYDataCountiesExtended], #DateObject &, Total[#Cases & /@ #] &]]);
tsCases = TimeSeries@(PlotRange -> All, AspectRatio -> 1/4,ImageSize -> Large};
opts = {PlotTheme -> "Detailed", DateListPlot[tsCases, PlotLabel -> "Cases", opts]
```

` ResourceFunction["RecordsSummary"][tsCases["Path"]]`

Logarithmic plot:

`DateListPlot[Log10[tsCases], PlotLabel -> Row[{Log10, Spacer[3], "Cases"}], opts]`

#### “Forecast”

Fit a time series model to log 10 of the time series:

`Log10[tsCases]] tsm = TimeSeriesModelFit[`

Plot log 10 data and forecast:

`DateListPlot[{tsm["TemporalData"], TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]`

Plot data and forecast:

`DateListPlot[{tsCases, 10^TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]`

### Deaths

#### Time series

For each date sum all cases over the states, make a time series, and plot it:

```
List @@@ Normal[GroupBy[Normal[dsNYDataCountiesExtended], #DateObject &, Total[#Deaths & /@ #] &]]);
tsDeaths = TimeSeries@(PlotRange -> All, AspectRatio -> 1/4,ImageSize -> Large};
opts = {PlotTheme -> "Detailed", DateListPlot[tsDeaths, PlotLabel -> "Deaths", opts]
```

` ResourceFunction["RecordsSummary"][tsDeaths["Path"]]`

#### “Forecast”

Fit a time series model:

` tsm = TimeSeriesModelFit[tsDeaths, "ARMA"]`

Plot data and forecast:

`DateListPlot[{tsm["TemporalData"], TimeSeriesForecast[tsm, {10}]}, opts, PlotLegends -> {"Data", "Forecast"}]`

## Fluctuations

We want to see does the time series data have fluctuations around its trends and estimate the distributions of those fluctuations. (Knowing those distributions some further studies can be done.)

This can be efficiently using the software monad `QRMon`

, [AAp2, AA1]. Here we load the `QRMon`

package:

`Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicQuantileRegression.m"]`

### Fluctuations presence

Here we plot the consecutive differences of the cases:

`DateListPlot[Differences[tsCases], ImageSize -> Large, AspectRatio -> 1/4, PlotRange -> All]`

Here we plot the consecutive differences of the deaths:

`DateListPlot[Differences[tsDeaths], ImageSize -> Large, AspectRatio -> 1/4, PlotRange -> All]`

From the plots we see that time series are not monotonically increasing, and there are non-trivial fluctuations in the data.

### Absolute and relative errors distributions

Here we take interesting part of the cases data:

`2020, 5, 1}, {2020, 12, 31}}]; tsData = TimeSeriesWindow[tsCases, {{`

Here we specify QRMon workflow that rescales the data, fits a B-spline curve to get the trend, and finds the absolute and relative errors (residuals, fluctuations) around that trend:

```
qrObj =
QRMonUnit[tsData]⟹
QRMonEchoDataSummary⟹Axes -> {False, True}]⟹
QRMonRescale[
QRMonEchoDataSummary⟹16, 0.5]⟹
QRMonQuantileRegression[PlotStyle -> Red}]⟹
QRMonSetRegressionFunctionsPlotOptions[{AspectRatio -> 1/4, ImageSize -> Large]⟹
QRMonDateListPlot[False, AspectRatio -> 1/4, ImageSize -> Large, DateListPlot -> True]⟹
QRMonErrorPlots["RelativeErrors" -> True, AspectRatio -> 1/4, ImageSize -> Large, DateListPlot -> True]; QRMonErrorPlots["RelativeErrors" ->
```

Here we find the distribution of the absolute errors (fluctuations) using `FindDistribution`

:

```
False]⟹QRMonTakeValue)[0.5];
lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> All, 2]]]
FindDistribution[lsNoise[[
CauchyDistribution[6.0799*10^-6, 0.000331709]*) (*
```

Absolute errors distributions for the last 90 days:

```
False]⟹QRMonTakeValue)[0.5];
lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> 90 ;; -1, 2]]]
FindDistribution[lsNoise[[-
ExtremeValueDistribution[-0.000996315, 0.00207593]*) (*
```

Here we find the distribution of the of the relative errors:

```
True]⟹QRMonTakeValue)[0.5];
lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> All, 2]]]
FindDistribution[lsNoise[[
StudentTDistribution[0.0000511326, 0.00244023, 1.59364]*) (*
```

Relative errors distributions for the last 90 days:

```
True]⟹QRMonTakeValue)[0.5];
lsNoise = (qrObj⟹QRMonErrors["RelativeErrors" -> 90 ;; -1, 2]]]
FindDistribution[lsNoise[[-
NormalDistribution[9.66949*10^-6, 0.00394395]*) (*
```

## References

[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.

[WRI1] Wolfram Research Inc., USA county records, (2020), System Modeling at GitHub.

[JH1] CSSE at Johns Hopkins University, COVID-19, (2020), GitHub.

[VK1] Vitaliy Kaurov, Resources For Novel Coronavirus COVID-19, (2020), community.wolfram.com.

[AA1] Anton Antonov, “A monad for Quantile Regression workflows”, (2018), at MathematicaForPrediction WordPress.

[AAp1] Anton Antonov, Heatmap plot Mathematica package, (2018), MathematicaForPrediciton at GitHub.

[AAp2] Anton Antonov, Monadic Quantile Regression Mathematica package, (2018), MathematicaForPrediciton at GitHub.