Pareto principle adherence examples

This post (document) provides examples of the Pareto principle's manifestation in different datasets.

The Pareto principle is an interesting law that manifests in many contexts. It is also known as the "Pareto law", "the law of the vital few", and "the 80-20 rule".

For example:

  • "80% of the land is owned by 20% of the population",

  • "10% of all lakes contain 90% of all lake water."

For extensive discussion and studied examples see the Wikipedia entry "Pareto principle", [4].

It is a good idea to check for which parts of the analyzed data the Pareto principle manifests. Testing for the Pareto principle is usually simple. For example, assume that we have the GDP values of all countries:

countries = CountryData["Countries"];
gdps = {CountryData[#, "Name"], CountryData[#, "GDP"]} & /@ countries;
gdps = DeleteCases[gdps, {_, _Missing}] /. Quantity[x_, _] :> x;

Grid[{RecordsSummary[gdps, {"country", "GDP"}]}, Alignment -> Top, Dividers -> All]

GDPUnsorted1

In order to test for the manifestation of the Pareto principle we have to (i) sort the GDP values in descending order, (ii) find the cumulative sums, (iii) normalize the obtained vector by the sum of all values, and (iv) plot the result. These steps are done with the following two commands:

t = Reverse@Sort@gdps[[All, 2]];
ListPlot[Accumulate[t]/Total[t], PlotRange -> All, GridLines -> {{0.2} Length[t], {0.8}}, Frame -> True]

GDPPlot1

In this document we are going to use the special function ParetoLawPlot defined in the next section and the package [1]. Most of the examples use data that is internally accessible within Mathematica. Several external data examples are considered.

See the package [1] for the function RecordsSummary. See the source file [2] for R functions that facilitate the plotting of Pareto principle graphs. See the package [3] for the outlier detection functions used below.

Definitions

This simple function makes a list plot that helps assess the manifestation of the Pareto principle. It takes the same options as ListPlot.

Clear[ParetoLawPlot]
Options[ParetoLawPlot] = Options[ListPlot];
ParetoLawPlot[dataVec : {_?NumberQ ..}, opts : OptionsPattern[]] := ParetoLawPlot[{Tooltip[dataVec, 1]}, opts];
ParetoLawPlot[dataVecs : {{_?NumberQ ..} ..}, opts : OptionsPattern[]] := ParetoLawPlot[MapThread[Tooltip, {dataVecs, Range[Length[dataVecs]]}], opts];
ParetoLawPlot[dataVecs : {Tooltip[{_?NumberQ ..}, _] ..}, opts : OptionsPattern[]] :=
  Block[{t, mc = 0.5},
   t = Map[Tooltip[(Accumulate[#]/Total[#] &)[SortBy[#[[1]], -# &]], #[[2]]] &, dataVecs];
   ListPlot[t, opts, PlotRange -> All, GridLines -> {Length[t[[1, 1]]] Range[0.1, mc, 0.1], {0.8}}, Frame -> True, FrameTicks -> {{Automatic, Automatic}, {Automatic, Table[{Length[t[[1, 1]]] c, ToString[Round[100 c]] <> "%"}, {c, Range[0.1, mc, 0.1]}]}}]
  ];

This function is useful for coloring the outliers in the list plots.

ClearAll[ColorPlotOutliers]
ColorPlotOutliers[] := # /. {Point[ps_] :> {Point[ps], Red, Point[ps[[OutlierPosition[ps[[All, 2]]]]]]}} &;
ColorPlotOutliers[oid_] := # /. {Point[ps_] :> {Point[ps], Red, Point[ps[[OutlierPosition[ps[[All, 2]], oid]]]]}} &;
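For example, assuming the GDP data gdps computed earlier, the top outliers of the sorted GDP values can be highlighted along these lines (a usage sketch, not taken from the original post):

(* Usage sketch: highlight the top outliers of the sorted GDP values; assumes gdps as computed above. *)
ColorPlotOutliers[TopOutliers@*SPLUSQuartileIdentifierParameters]@
 ListPlot[Sort[gdps[[All, 2]]], PlotRange -> All, Frame -> True]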

These definitions can also be obtained by loading the packages MathematicaForPredictionUtilities.m and OutlierIdentifiers.m; see [1,3].

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MathematicaForPredictionUtilities.m"]
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/OutlierIdentifiers.m"]

Units

Below we are going to use the metric system of units. (If preferred we can easily switch to the imperial system.)

$UnitSystem = "Metric";(*"Imperial"*)

CountryData

We are going to consider a typical Pareto principle example — wealth and income distribution.

GDP

This code finds the Gross Domestic Product (GDP) of different countries:

gdps = {CountryData[#, "Name"], CountryData[#, "GDP"]} & /@CountryData["Countries"];
gdps = DeleteCases[gdps, {_, _Missing}] /. Quantity[x_, _] :> x;

The corresponding Pareto plot (note the default grid lines) shows that 10% of countries have 90% of the wealth:

ParetoLawPlot[gdps[[All, 2]], ImageSize -> 400]

GDPPlot2

Here is the log histogram of the GDP values.

Histogram[Log10@gdps[[All, 2]], 20, PlotRange -> All]

GDPHistogram1

The following code shows a log plot of the countries' GDP values and the identified outliers.

Manipulate[
 DynamicModule[{data = Transpose[{Range[Length[gdps]], Sort[gdps[[All, 2]]]}], pos},
  pos = OutlierPosition[modFunc@data[[All, 2]], tb@*opar];
  If[Length[pos] > 0,
   ListLogPlot[{data, data[[pos]]}, PlotRange -> All, PlotTheme -> "Detailed", FrameLabel -> {"Index", "GDP"}, PlotLegends -> SwatchLegend[{"All", "Outliers"}]],
   ListLogPlot[{data}, PlotRange -> All, PlotTheme -> "Detailed", FrameLabel -> {"Index", "GDP"}, PlotLegends -> SwatchLegend[{"All", "Outliers"}]]
  ]
 ],
 {{opar, SPLUSQuartileIdentifierParameters, "outliers detector"}, {HampelIdentifierParameters, SPLUSQuartileIdentifierParameters}},
 {{tb, TopOutliers, "bottom|top"}, {BottomOutliers, TopOutliers}},
 {{modFunc, Identity, "data modifier function"}, {Identity, Log}}
]

Outliers1

This table gives the values for the countries with the highest GDP.

Block[{data = gdps[[OutlierPosition[gdps[[All, 2]], TopOutliers@*SPLUSQuartileIdentifierParameters]]]},
 Row[Riffle[#, " "]] &@Map[Grid[#, Dividers -> All, Alignment -> {Left, "."}] &, Partition[SortBy[data, -#[[-1]] &], Floor[Length[data]/3]]]
]

HighestGDP1

Population

Similar data retrieval and plots can be made for countries' populations.

pops = {CountryData[#, "Name"], CountryData[#, "Population"]} & /@CountryData["Countries"];
unit = QuantityUnit[pops[[All, 2]]][[1]];
pops = DeleteCases[pops, {_, _Missing}] /. Quantity[x_, _] :> x;

In the following Pareto plot we can see that 15% of countries have 80% of the total population:

ParetoLawPlot[pops[[All, 2]], PlotLabel -> Row[{"Population", ", ", unit}]]

PopPlot1

Here are the countries with the most people:

Block[{data = pops[[OutlierPosition[pops[[All, 2]], TopOutliers@*SPLUSQuartileIdentifierParameters]]]},
 Row[Riffle[#, " "]] &@Map[Grid[#, Dividers -> All, Alignment -> {Left, "."}] &, Partition[SortBy[data, -#[[-1]] &], Floor[Length[data]/3]]]
]

HighestPop1

Area

We can also see that the Pareto principle holds for the countries' areas:

areas = {CountryData[#, "Name"], CountryData[#, "Area"]} & /@CountryData["Countries"];
areas = DeleteCases[areas, {_, _Missing}] /. Quantity[x_, _] :> x;
ParetoLawPlot[areas[[All, 2]]]

AreaPlot1

Block[{data = areas[[OutlierPosition[areas[[All, 2]], TopOutliers@*SPLUSQuartileIdentifierParameters]]]},
 Row[Riffle[#, " "]] &@Map[Grid[#, Dividers -> All, Alignment -> {Left, "."}] &, Partition[SortBy[data, -#[[-1]] &], Floor[Length[data]/3]]]
]

HighestArea1

Time series-wise

An interesting diagram plots together the GDP growth curves of different countries. We can see that China and Poland have had rapid growth.

res = Table[
    (t = CountryData[countryName, {{"GDP"}, {1970, 2015}}];
     t = Reverse@Sort[t["Path"][[All, 2]] /. Quantity[x_, _] :> x];
     Tooltip[t, countryName])
    , {countryName, {"USA", "China", "Poland", "Germany", "France", "Denmark"}}];

ParetoLawPlot[res, PlotRange -> All, Joined -> True, PlotLegends -> res[[All, 2]]]

GDPGrowth1

Manipulate

This dynamic interface can be used to compare, for a given country, (i) the evolution of its GDP in time and (ii) the corresponding Pareto plot.

Manipulate[
 DynamicModule[{ts, t},
  ts = CountryData[countryName, {{"GDP"}, {1970, 2015}}];
  t = Reverse@Sort[ts["Path"][[All, 2]] /. Quantity[x_, _] :> x];
  Grid[{{"Date list plot of GDP values", "GDP Pareto plot"}, {DateListPlot[ts, ImageSize -> Medium],
     ParetoLawPlot[t, ImageSize -> Medium]}}]
 ], {countryName, {"USA", "China", "Poland", "Germany", "France", "Denmark"}}]

GDPGrowth2

Country flag colors

The following code demonstrates that the colors of the pixels in country flags also adhere to the Pareto principle.

flags = CountryData[#, "Flag"] & /@ CountryData["Countries"];

flags[[1 ;; 12]]

Flags1

ids = ImageData /@ flags;

(* Pool the pixel color values of all flags into one list. *)
pixels = Apply[Join, Flatten[ids, 1]];

(* ToBinFunc maps a channel value in [0, 1] to one of ten equal-width bins. *)
Clear[ToBinFunc]
ToBinFunc[x_] := Evaluate[Piecewise[MapIndexed[{#2[[1]], #1[[1]] < x <= #1[[2]]} &, Partition[Range[0, 1, 0.1], 2, 1]]]];

(* Discretize each color channel and tally the discretized colors. *)
pixelsInt = Transpose@Table[Map[ToBinFunc, pixels[[All, i]]], {i, 1, 3}];

pixelsIntTally = SortBy[Tally[pixelsInt], -#[[-1]] &];

ParetoLawPlot[pixelsIntTally[[All, 2]]]

FlagsPlot1

TunnelData

Looking at the lengths in the tunnel data we can see the manifestation of an exaggerated Pareto principle.

tunnelLengths = TunnelData[All, {"Name", "Length"}];
tunnelLengths // Length

(* 1552 *)

t = Reverse[Sort[DeleteMissing[tunnelLengths[[All, -1]]] /. Quantity[x_, _] :> x]];

ParetoLawPlot[t]

TunnelsPlot1

Here is the logarithmic histogram of the lengths:

Histogram[Log10@t, PlotRange -> All, PlotTheme -> "Detailed"]

TunnelsHist1

LakeData

The following code gathers the data and makes the Pareto plots of the surface areas, volumes, and commercial fish catch values of lakes. We can see that the lake volumes show an exaggerated Pareto principle.

lakeAreas = LakeData[All, "SurfaceArea"];
lakeVolumes = LakeData[All, "Volume"];
lakeFishCatch = LakeData[All, "CommercialFishCatch"];

data = {lakeAreas, lakeVolumes, lakeFishCatch};
t = N@Map[DeleteMissing, data] /. Quantity[x_, _] :> x;

opts = {PlotRange -> All, ImageSize -> Medium}; MapThread[ParetoLawPlot[#1, PlotLabel -> Row[{#2, ", ", #3}], opts] &, {t, {"Lake area", "Lake volume", "Commercial fish catch"}, DeleteMissing[#][[1, 2]] & /@ data}]

LakesPlot1

City data

One of the examples given in [5] is that city areas obey a power law. Since the Pareto principle is a kind of power law, we can confirm that observation using Pareto principle plots.

The following grid of Pareto principle plots is for areas and population sizes of cities in selected states of USA.

res = Table[
    (cities = CityData[{All, stateName, "USA"}];
     t = Transpose@Outer[CityData, cities, {"Area", "Population"}];
     t = Map[DeleteMissing[#] /. Quantity[x_, _] :> x &, t, {1}];
     ParetoLawPlot[MapThread[Tooltip, {t, {"Area", "Population"}}], PlotLabel -> stateName, ImageSize -> 250])
    , {stateName, {"Alabama", "California", "Florida", "Georgia", "Illinois", "Iowa", "Kentucky", "Ohio", "Tennessee"}}];

Legended[Grid[ArrayReshape[res, {3, 3}]], SwatchLegend[Cases[res[[1]], _RGBColor, Infinity], {"Area", "Population"}]]

CitiesPlot1

Movie ratings in MovieLens datasets

Looking into the MovieLens 20M dataset, [6], we can see that the Pareto principle holds for (1) the most rated movies and (2) the most active users. We can also see the manifestation of an exaggerated Pareto law — 90% of all ratings are for 10% of the movies.

"MovieLens20M-MDensity-and-Pareto-plots"

4-digit passwords

The following plot, taken from the blog post "PIN analysis", [7], shows that the four-digit passwords (PINs) people use adhere to the Pareto principle: the first 20% of the (unique) most frequently used passwords correspond to 70% of all password usage.

ColorNegate[Import["http://www.datagenetics.com/blog/september32012/c.png"]]

Cumulative-4-Digit-Password-Usages-ColorNegated

References

[1] Anton Antonov, "MathematicaForPrediction utilities", (2014), source code MathematicaForPrediction at GitHub, https://github.com/antononcube/MathematicaForPrediction, package MathematicaForPredictionUtilities.m.

[2] Anton Antonov, Pareto principle functions in R, source code MathematicaForPrediction at GitHub, https://github.com/antononcube/MathematicaForPrediction, source code file ParetoLawFunctions.R .

[3] Anton Antonov, Implementation of one dimensional outlier identifying algorithms in Mathematica, (2013), MathematicaForPrediction at GitHub, URL: https://github.com/antononcube/MathematicaForPrediction/blob/master/OutlierIdentifiers.m .

[4] Wikipedia entry, "Pareto principle", URL: https://en.wikipedia.org/wiki/Pareto_principle .

[5] Wikipedia entry, "Power law", URL: https://en.wikipedia.org/wiki/Power_law .

[6] GroupLens Research, MovieLens 20M Dataset, (2015).

[7] DataGenetics, "PIN analysis", (2012), URL: http://www.datagenetics.com/blog/september32012/ .


Finding local extrema in noisy data using Quantile Regression

Introduction

This blog post (article) describes an algorithm for finding local extrema in noisy data using Quantile Regression. The problem formulation and a solution for it using polynomial model fitting (through LinearModelFit) were taken from Mathematica StackExchange — see “Finding Local Minima / Maxima in Noisy Data”, [1].

The proposed Quantile Regression algorithm is a version of the polynomial fitting solution (proposed by Leonid Shifrin in [1]) and has the following advantages: (i) it requires less parameter tweaking, and (ii) more importantly, it is more robust with very noisy and oscillating data. That robustness is achieved by using two regression-fitted curves: one close to the local minima and another close to the local maxima, computed for low and high quantiles respectively. (Quantile Regression is uniquely able to do that.)

The code for this blog post is available at [6].

A complete version of this blog post is available as a PDF document here: Finding local extrema in noisy data using Quantile Regression.pdf.

The problem

Here is the problem formulation from [1].

Problem 1: We have a list of pairs of numbers representing measurements of a scalar variable on a regular time grid. The measurements have noise in them.

As a data example for this problem the author of the question in [1] has provided the following code:

temptimelist = Range[200]/10;
tempvaluelist = Sinc[#] &@temptimelist + RandomReal[{-1, 1}, 200]*0.02;
data1 = Transpose[{temptimelist, tempvaluelist}];
ListPlot[data1, PlotRange -> All, Frame -> True, ImageSize -> 500]

QRforFindingLocalExtrema-Data1

In this article we are going to consider a more general problem, hinted at in the discussion of the solutions in [1].

Problem 2: Assume that the data in Problem 1 is collected several times and that noise is present both in the measured values and in the measurement times. Also, the data can be highly oscillatory in nature.

Consider this data generation as an example for Problem 2:

n = 1000;
xs = N@Rescale[Range[n], {1, n}, {0, 60}];
data2 = Flatten[
 Table[Transpose[{xs + 0.1 RandomReal[{-1, 1}, Length[xs]], 
 Map[5 Sinc[#] + Sin[#] + 4 Sin[1/4 #] &, xs] + 
 1.2 RandomVariate[SkewNormalDistribution[0, 1, 0.9], 
 Length[xs]]}], {10}], 1];
ListPlot[data2, PlotRange -> All, Frame -> True, ImageSize -> 500]

QRforFindingLocalExtrema-Data2

Note that this data has 10,000 points and is much larger than the data for Problem 1.

Extrema location approximation by model fitting followed by nearest neighbors search

Several solutions are given in [1]. A couple use wavelets or Gaussian filtering for de-noising. The one we are going to focus on in this article uses polynomial fitting for extrema location approximation and then finds the actual data extrema by Nearest Neighbors (NN's) search. It is provided by Leonid Shifrin.

Let us list the steps of that algorithm:

1. Fit a polynomial through the data (using LinearModelFit).
2. Find the local extrema of the fitted polynomial. (We will call them fit estimated extrema.)
3. Around each of the fit estimated extrema find the most extreme point in the data by nearest neighbors search (by using Nearest).

We are going to refer to this algorithm as LMFFindExtrema. Its implementation is available here: QuantileRegressionForLocalExtrema.m, [6].
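Here is a minimal sketch of those steps (illustrative only, not Leonid Shifrin's code nor the implementation in [6]); the polynomial degree and the number of examined nearest neighbors are assumed parameters:

Clear[LMFFindExtremaSketch]
LMFFindExtremaSketch[data_, degree_: 10, nNNs_: 20] :=
 Module[{fit, x, sols, nf},
  (* 1. Fit a polynomial through the data. *)
  fit = LinearModelFit[data, Table[x^i, {i, 0, degree}], x];
  (* 2. Find the local extrema of the fitted polynomial within the data range. *)
  sols = Select[Flatten[{x /. NSolve[D[fit[x], x] == 0, x, Reals]}],
    NumericQ[#] && Min[data[[All, 1]]] <= # <= Max[data[[All, 1]]] &];
  (* 3. Around each fit estimated extremum take the most extreme of the nNNs nearest data points:
     a maximum if the fitted curve is concave there, a minimum otherwise. *)
  nf = Nearest[data[[All, 1]] -> data];
  Map[If[(D[fit[x], {x, 2}] /. x -> #) < 0,
     First@MaximalBy[nf[#, nNNs], Last],
     First@MinimalBy[nf[#, nNNs], Last]] &, sols]
 ]

For example, LMFFindExtremaSketch[data1, 10, 10] can be applied to the data of Problem 1.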

Here is the application of the algorithm to the data example of Problem 1:

QRforFindingLocalExtrema-LMFFindExtrema-Data1

(The continuous line shows the fitted curve.)

Here is the application of the algorithm to the data example of Problem 2:

QRforFindingLocalExtrema-LMFFindExtrema-Data2

It does not help to just increase the number of basis polynomials and the number of examined NN's:

QRforFindingLocalExtrema-LMFFindExtrema-Data2-60

We can also see that increasing the number of examined NN's for each of the fit estimated extrema would make some of the final result points be "borrowed" from neighboring peaks in the data.

Experiments with fitting trigonometric basis functions showed very good fit to the data but the calculations of the fit estimated extrema were very slow.

Using Quantile Regression

It is trivial to rewrite the algorithm LMFFindExtrema to use Quantile Regression instead of linear model fitting of polynomials. We are going to use the implementation in the Mathematica package [2], described in [3,4]. The function QuantileRegression provided by [2] uses a B-spline basis [5] to find the quantile regression curves (also known as regression quantiles).

We are going to call this algorithm QRFindExtrema. Again its implementation is available here: QuantileRegressionForLocalExtrema.m, [6].

The algorithm QRFindExtrema has the following parameters: the data, the number of B-spline knots, the interpolation order, and the quantiles (corresponding to the curves to be fitted).

QRFindExtrema returns a list of regression quantile functions and a list of lists with extrema estimates.

More importantly though, QRFindExtrema can use two curves for finding the local extrema: one for local minima, and one for local maxima. (This feature is justified below.)

Here is the application of QRFindExtrema to the example data of Problem 1:

QRforFindingLocalExtrema-QRFindExtrema-Data1

Here is application of QRFindExtrema to the data of Problem 2:

QRforFindingLocalExtrema-QRFindExtrema-Data2-12-120-0.5

We can see that the regression quantile for 0.5 is too flat to get good judgment of the local extrema. We can get better results if we increase the number of knots for the B-spline basis built by QuantileRegression.

QRforFindingLocalExtrema-QRFindExtrema-Data2-24-120-0.5

We can see that the results look "almost right": the horizontal locations of the peaks are more-or-less correctly identified, but the resulting extrema are too close to the fitted curve. Just increasing the number of examined NN's for the fit estimated extrema does not produce good results, because points from neighboring peaks are being chosen as final extrema estimates.

QRforFindingLocalExtrema-QRFindExtrema-Data2-24-1500-0.5

In order to solve this problem we use two regression quantiles. For local minima we use a regression quantile for a low quantile number, say, 0.02; for the local maxima we use a regression quantile for a large quantile number, say, 0.98 .

QRforFindingLocalExtrema-QRFindExtrema-Data2-24-200-low-high

The use of two (or more) curves to be fitted is a unique capability of Quantile Regression. With this algorithm feature, by construction, the lower regression quantile is close to the local minima and the higher regression quantile is close to the local maxima.

Also, since we find two regression quantile curves, we can use two nearest-neighbors functions: one with the points below the low regression quantile, and one with the points above the high regression quantile. The implementation in [6] takes an option specifying whether the nearest-neighbors functions for finding the extrema should be constructed using all data points or just the outliers (the points outside of the found regression quantiles).
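To make the two-curve idea concrete, here is a minimal sketch (not the implementation in [6]) assuming the function QuantileRegression from [2] is loaded; the grid-based extremum search, the default number of knots, the quantiles, and nNNs are illustrative choices:

Clear[QRFindExtremaSketch]
QRFindExtremaSketch[data_, knots_: 24, {qLow_, qHigh_} : {0.02, 0.98}, nNNs_: 20] :=
 Module[{fLow, fHigh, xmin, xmax, xs, argExtrema, nf},
  {xmin, xmax} = {Min[#], Max[#]} &[data[[All, 1]]];
  (* Fit a low and a high regression quantile with QuantileRegression from [2]. *)
  {fLow, fHigh} = QuantileRegression[data, knots, {qLow, qHigh}];
  (* Approximate local extrema of a fitted curve via sign changes of its sampled slope. *)
  xs = Range[xmin, xmax, (xmax - xmin)/1000];
  argExtrema[f_, s_] := xs[[1 + Flatten@Position[Differences[Sign[Differences[f /@ xs]]], s]]];
  (* Around each estimated extremum pick the most extreme of the nNNs nearest data points. *)
  nf = Nearest[data[[All, 1]] -> data];
  {Map[First@MinimalBy[nf[#, nNNs], Last] &, argExtrema[fLow, 2]],
   Map[First@MaximalBy[nf[#, nNNs], Last] &, argExtrema[fHigh, -2]]}
 ]

For example, with the Problem 2 data one can try QRFindExtremaSketch[data2, 24, {0.02, 0.98}, 200].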

More examples

Here is an example with more timid noisy and oscillating data, again with around 10,000 points.

QRforFindingLocalExtrema-QRFindExtrema-Data3-24-200-low-high

References

[1] Mathematica StackExchange discussion. “Finding Local Minima / Maxima in Noisy Data”,
URL: http://mathematica.stackexchange.com/questions/23828/finding-local-minima-maxima-in-noisy-data/ .

[2] Anton Antonov, Quantile regression Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package QuantileRegression.m, (2013).

[3] Anton Antonov, Quantile regression through linear programming, usage guide at GitHub, https://github.com/antononcube/MathematicaForPrediction, in the directory “Documentation”, (2013).

[4] Anton Antonov, Quantile regression through linear programming, “Mathematica for prediction algorithms” blog at WordPress.com, 12/16/2013.

[5] Anton Antonov, Quantile regression with B-splines, “Mathematica for prediction algorithms” blog at WordPress.com, 1/1/2014.

[6] Anton Antonov, QuantileRegressionForLocalExtrema Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package QuantileRegressionForLocalExtrema.m, (2015).
The package is in the directory “Applications”.

Simple time series conversational engine

Introduction

In this blog post I am going to discuss the creation (design and programming) of a simple conversational engine for time series analysis. The conversational engine responds to commands for:
1. loading time series data (weather data, stock data, or data files),
2. finding outliers,
3. analysis by curve fitting,
4. plotting, and
5. help and state changes.

This blog post is related to the blog post [1] and uses the package with functional parsers [2] and the Quantile regression package [3].

I programmed this engine ten months ago and here is a link to a movie that demonstrates it, [4]: https://www.youtube.com/watch?v=wlZ5ANglVI4 .

A package with all of the code of the conversational engine can be downloaded from GitHub, [12], and executed with Import:


Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Examples/SimpleTimeSeriesConversationalEngine.m"]

More about functional parsers, which were essential for the building of the conversational engine, can be read in "Functional parsers" by Fokker, [6].

Design of the conversational engine

The conversational engine is a finite state machine. It has a command recognizer state and a command application state. The switching between the states is done upon completion of the states' functions. The recognized commands are English sentences. Some of the sentences include Mathematica code.

Let us describe the structure of the conversational engine.

States

1. Acceptor state
The Acceptor state parses and recognizes commands. If the given command is successfully parsed then the parsing result is given to the Transducer state. Otherwise a message "Unknown command" is emitted and no state change is made (i.e. we are still in the Acceptor state).

2. Transducer state
Interprets the command parsed by the Acceptor state. This state emits different messages according to the loaded data and accumulated graphics objects.

The states are named according to the classification given in the Wikipedia article about finite-state machines, [5]. According to that classification the conversational engine is a Mealy machine.
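Schematically, one interaction step looks like this (a sketch only; ParseCommand and ApplyCommand are hypothetical stand-ins for the actual parsing and interpretation functions):

(* Sketch of one interaction step; ParseCommand and ApplyCommand are hypothetical names. *)
EngineStep[command_String] :=
 Module[{parsed},
  parsed = ParseCommand[command]; (* Acceptor state: parse and recognize the command *)
  If[parsed === $Failed,
   Speak["Unknown command"], (* no state change; we stay in the Acceptor state *)
   ApplyCommand[parsed] (* Transducer state: emit the spoken and graphics responses *)
  ]
 ]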

Responses

The results of the applied time series analysis are graphical. The Transducer state accumulates the graphs produced by the commands and displays them with the data. The conversational engine responses have a spoken part and a graphics part. The responses always have at least a spoken part; the graphics part can be omitted.

Commands

The weather data commands can be for temperature, wind speed, and pressure over a specified city. For example, “load temperature of Atlanta” or “load the Chicago wind speeds”.

The financial data commands can be for stock price or trading volume. For example, “load trade volume of NYSE:GOOG” or “load stock price of NYSE:APPL”.

The weather and stocks data loading commands do not take a time period; they take only city and company specifications for WeatherData and FinancialData respectively.

The loading of data files is done with commands like “load file ‘~/LogDataWithSkewedNoise.csv’ “.

The time series analysis commands we consider are: curve fitting by a list of specified functions, finding regression quantiles, and finding outliers. For example, "compute least squares fit of {1,x,Sin[x/10^6]}", "compute 5 regression quantiles", "find top outliers".

The graphics commands are “plot data”, “plot(s) joined”, “plot(s) not joined”, “clear graphics”.

In addition, the conversational engine takes global, service commands like "help", "all commands", "start over", etc.

Grammar specification

In order to respond to the commands that are English sentences we have to be able to parse these sentences into concrete state or message specifications. This means that we have to define the grammatical structures of the command sentences and the words in them.

This section describes the grammar of the commands using Extended Backus-Naur Form (EBNF). The package FunctionalParsers.m, [2], has Mathematica functions that generate parsers from EBNF of a grammar (see [9] and the next section). The resulting parsers are used in the conversational engine described above.

Explanations of the commands for loading data

Let us explain the derivation and rationale of the grammar rules for the data loading commands. (I am not going to describe all the language considerations and decisions I have made for the commands grammar. These decisions can be traced in the EBNF grammar specification.)

We want to parse and interpret commands like:
1. load data file ‘timeTable.csv’ ,
2. load temperature of Atlanta ,
3. load the wind speeds of Chicago ,
4. load the stock price of NYSE:APPL ,
5. load trading volume of NYSE:GOOG .

We can see that we have three different types of data loading commands: for files, for weather data, and for financial data. Each of these commands when parsed would be easy to interpret using Mathematica’s functions Import, WeatherData, and FinancialData respectively. Note that — because we want to keep the grammar simple — for the weather and financial data there is no time period specification. For the weather data we are going to load data for the last 60 days; for the financial data for the last 365 days. Also, for simplicity, we are not going to consider data loading commands that start with the phrases “get”, “get me”, “please load”, etc.
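For example, once parsed, the three kinds of commands would map to calls along these lines (a sketch; the 60 and 365 day windows are the assumptions stated above):

(* Illustrative target interpretations of the load commands (not the engine code): *)
Import["timeTable.csv"] (* load data file 'timeTable.csv' *)
WeatherData["Atlanta", "Temperature", {DatePlus[Date[], -60], Date[], "Day"}] (* load temperature of Atlanta *)
FinancialData["NYSE:GOOG", "Volume", {DatePlus[Date[], -365], Date[]}] (* load trading volume of NYSE:GOOG *)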

To this end the general load data command structure is:

DidacticLoadDataGrammarRules

Note that in the grammar rules above some of the words are optional; the optional words are put within square brackets. With the first rule we have separated the file loading from the weather and financial data loading. The file loading is further specified with rule 2. Rule 3 separates the preamble of the weather and financial data loading commands from the data specification. The latter is partitioned in rule 4. Rule 5 gives all of the possible data type requests for the weather data: temperature, pressure, and wind speed. Rule 6 gives the two possible types of financial data requests: stock price and trade volume. The file name, city, and company specifications are left out, but the rules for them are easy to formulate.
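Written out in the EBNF notation used by FunctionalParsers.m, those six rules look roughly as follows (an illustrative reconstruction, not the exact grammar in [8]; the non-terminals <file-name>, <city>, and <company> are the left-out specifications mentioned above, and the variable name is hypothetical):

didacticLoadDataEBNF = "
  <load-data-command> = <load-file-command> | <load-weather-finance-command> ;
  <load-file-command> = 'load' , [ 'data' ] , 'file' , <file-name> ;
  <load-weather-finance-command> = 'load' , [ 'the' ] , <data-spec> ;
  <data-spec> = <weather-spec> | <finance-spec> ;
  <weather-spec> = ( 'temperature' | 'pressure' | ( 'wind' , 'speed' ) ) , [ 'of' ] , <city> ;
  <finance-spec> = ( ( 'stock' , 'price' ) | ( 'trade' , 'volume' ) ) , [ 'of' ] , <company> ;
";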

The command grammar used in the conversational engine uses additional data loading definitions and allows “reverse” data load commands like: “load the Chicago wind speeds”.

Full grammar specification

In this sub-section the full grammar specification is given. Note that some parts of the commands are specified to be dropped from the parsing results using the symbols “\[RightTriangle]” and “\[LeftTriangle]”. (For example ” ‘trade’ \[RightTriangle] ‘volume’ ” means “parse the sequence of words ‘trade’ ‘volume’ and return ‘volume’ as a result”.) The specification ‘_LetterString’ is an implementation shortcut for matching words and identifiers with the package FunctionalParsers.m, [2].

TimeSeriesGrammarEBNF

The rules were simplified for clarity; the actual conversational engine grammar can be retrieved from [8].

The grammar specified above takes complicated commands like “find regression quantiles over temperature of Atlanta”.

(In certain cases I prefer reading grammars in EBNF using Emacs and ebnf-mode.el, [7].)

Parsers for the grammar

Using the grammar specification [8] and the package FunctionalParsers.m, [2], we can generate the parsers for that grammar. A detailed description of the parser generation process is given in [9].

Here is the generation step:


tokens = ToTokens[timeSeriesEBNFCode];
res = GenerateParsersFromEBNF[tokens];
res // LeafCount

Out[4]= 3069

The generation step creates a set of parsers:

In[16]:= Magnify[Shallow[res[[1, 1, 2]], 6], 0.8]

TimeSeriesGrammarEBNFParseTree

Here are the parsing results returned by the “master” rule pTSCOMMAND:
TableOfParsingCommands1

Here is another list for the engine’s “service” commands:
TableOfParsingCommands2

And here is their parsing outcome using the function ParsingTestTable provided by FunctionalParsers.m:

questions = {"load data file '~/example.csv'",
"load file '~/anotherExample.csv'"};
ParsingTestTable[ParseShortest[pLOADFILECOMMAND], questions]

LoadDataFileParseTable

Note that the parser takes commands with Mathematica expressions parts if those expressions are written without spaces:

questions = {"least squares fit with x+Sin[x]"};
ParsingTestTable[ParseShortest[pTSCOMMAND], questions]

LeastSquaresFitParseTable

This type of command is specified with the grammar rule:

= ( ‘least’ , ‘squares’ , [ ‘fit’ ] , [ ‘with’ | ‘of’ ] ) \[RightTriangle] ‘_String’ ;

Interpreters (for the parsers)

Next we have to write interpreters for the parsing results. Note that the parsing results are Mathematica expressions with heads that hint at the semantic meaning behind the parsed commands. Since we are viewing the conversational engine as a finite state machine we can say that for each parsing result we have to write a function that "reacts" to the result's head and content. This is a fairly straightforward process, very similar to the one described by the Interpreter design pattern, [10].
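For example, such interpreters can be set up as down-values that dispatch on the head of the parsing result (a sketch; the heads LoadWeatherData and PlotData and the variable $tsData are hypothetical, not the actual heads produced by the grammar):

(* Sketch of head-based interpreter dispatch; the heads and $tsData are hypothetical. *)
ClearAll[InterpretCommand];
InterpretCommand[LoadWeatherData[{city_String, property_String}]] :=
  ($tsData = WeatherData[city, property, {DatePlus[Date[], -60], Date[], "Day"}];
   Speak["Loaded " <> property <> " of " <> city]);
InterpretCommand[PlotData[joined : (True | False)]] :=
  DateListPlot[$tsData, Joined -> joined];
InterpretCommand[___] := Speak["Unknown command"];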

A package with all of the code of the conversational engine can be downloaded from GitHub, [12]. The interpreters code can be found in the section “Interpreters”.

The conversational engine programming

The conversational engine was programmed using Mathematica's dynamic evaluation functions and mechanisms. The implementation follows the Acceptor and Transducer state design described above in the section "Design of the conversational engine".

The engine has three panels: for command input, for spoken output, and for graphical output. It is fully demonstrated with the movie “Time series conversational engine” posted on YouTube, [4].

As it was mentioned in the introduction the engine can be run using Import:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Examples/SimpleTimeSeriesConversationalEngine.m"]

SimpleTimeSeriesConversationalEngineExample

Future plans

The engine can be extended in many directions depending on the final goals for it.

One must-have extension is the ability to take time period specifications for the weather and financial data.

The time series analysis can be extended with more detailed analysis using functions like TimeSeriesModelFit or heteroscedasticity analysis using multiple regression quantiles, [11].

A third set of extensions is the ability to load additional kinds of data, like, the evolutions of countries populations or gross domestic products.

References

[1] Anton Antonov, Natural language processing with functional parsers, blog post at WordPress, (2014).

[2] Anton Antonov, Functional parsers Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package FunctionalParsers.m, (2014).

[3] Anton Antonov, Quantile regression Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package QuantileRegression.m, (2013).

[4] Anton Antonov, Time series conversational engine, a YouTube movie, URL: https://www.youtube.com/watch?v=wlZ5ANglVI4 .

[5] Wikipedia entry: Finite-state Machine, http://en.wikipedia.org/wiki/Finite-state_machine .

[6] Jeroen Fokker, Functional parsers, In: Advanced Functional Programming, Tutorial text of the First international spring school on advanced functional programming techniques in Båstad, Sweden, may 1995. (Johan Jeuring and Erik Meijer, eds). Berlin, Springer 1995 (LNCS 925), pp. 1-23. http://www.staff.science.uu.nl/~fokke101/article/parsers/index.html .

[7] Jeramey Crawford, Extended Backus-Naur Form Mode, ebnf-mode.el, emacs LISP at Github, https://github.com/jeramey/ebnf-mode .

[8] Anton Antonov, Time series conversational engine grammar in EBNF, TimeSeriesConversationalEngineGrammar.ebnf at the MathematicaForPrediction project at GitHub.

[9] Anton Antonov, Functional parsers for an integration requests language, PDF document at GitHub, https://github.com/antononcube/MathematicaForPrediction/tree/master/Documentation , (2014).

[10] Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995, ISBN 0-201-63361-2, http://en.wikipedia.org/wiki/Design_Patterns .

[11] Anton Antonov, Estimation of conditional density distributions, “Mathematica for prediction algorithms” blog at WordPress, (2014).

[12] Anton Antonov, Simple Time Series Conversational Engine Mathematica package, SimpleTimeSeriesConversationalEngine.m at the MathematicaForPrediction project at GitHub.

Estimation of conditional density distributions

Assume we have temperature data for a given location and we want to predict today’s temperature at that location using yesterday’s temperature. More generally, the problem discussed in this blog post can be stated as “How to estimate the conditional density of the predicted variable given a value of the conditioning covariate?”

One way to answer this question is to provide a family of regression quantiles and from them to estimate the Cumulative Distribution Function (CDF) for a given value of the covariate. In other words, given data pairs {(x_i, y_i), i = 1, ..., n} and a value x0, we find CDF_x0(y). From the estimated CDF we can estimate the Probability Density Function (PDF). We can go further and use Monte Carlo type of simulations with the obtained PDF's (this will be discussed in another post).

The experiments in this blog post follow the example sub-sub-section “Daily Melbourne Temperatures” of the book “Quantile regression” by Roger Koenker.

Consider the temperature time series of Atlanta, GA, USA from 2006.01.01 to 2014.01.13:

location = {"Atlanta", "GA"};
tempData = WeatherData[location, "Temperature", {{2006, 1, 1}, {2014, 1, 12}, "Day"}];
tempData // Length

Atlanta temperature time series 2006-2014

We can see that this time series is heteroscedastic — the range of temperatures is wider in the winter months.

Using the time series data let us make pairs of yesterday and today data:

tempPData = Partition[tempData[[All, 2]], 2, 1];

(We pair up two consecutive temperatures.)

Atlanta yesterday-vs-today temperatures

Let us calculate the regression quantiles for a number of quantiles. (The package QuantileRegression.m used is discussed in previous posts of this blog, and it can be downloaded from MathematicaForPrediction at GitHub.)

Get["~/MathFiles/MathematicaForPrediction/QuantileRegression.m"]

The considered set of quantiles is qs = {0.02, 0.05, 0.1, ..., 0.95, 0.98}:

In[397]:= qs = N@Join[{0.02}, FindDivisions[{0, 1}, 20][[2 ;; -2]], {0.98}]
qs // Length

Out[397]= {0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.98}
Out[398]= 21

Next we calculate the regression quantiles using 3rd order B-spline basis functions over 5 knots.

In[399]:= AbsoluteTiming[qFuncs = QuantileRegression[tempPData, 5, qs];]

Atlanta regression quantiles for yesterday-vs-today temperatures

Given a temperature value, say t0=2°C, we can estimate CDF(t) using the quantiles and the values at t0 of the corresponding regression quantile functions.

t0 = 2;
xs = Through[qFuncs[t0]];
cdfPairs = Transpose[{xs, qs}];

We can get a first order (linear) approximation by simply connecting the consecutive points {(x_i, y_i)}, 1 <= i <= |qs|, as shown in the following plot.

Atlanta estimated CDF for 2C

On the plot the dashed vertical grid lines are for the quantiles {0.05, 0.25, 0.5, 0.75, 0.95}; the solid vertical gray grid line is for t0.

Let us define functions for the CDF and PDF approximation and the corresponding plots:

CDFEstimate[t0_] := CDFEstimate[qs, qFuncs, t0];
CDFEstimate[qs_, qFuncs_, t0_] :=
  Interpolation[Transpose[{Through[qFuncs[t0]], qs}], InterpolationOrder -> 1];

CDFPDFPlot[t0_?NumberQ, qCDFInt_InterpolatingFunction, qs : {_?NumericQ ..} : qs, opts : OptionsPattern[]] :=
  Block[{qsGL = {0.05, 0.25, 0.5, 0.75, 0.95}, xsGL},
   xsGL = Pick[Flatten@qCDFInt["Grid"], MemberQ[qsGL, #] & /@ qs];
   Plot[{qCDFInt[x], qCDFInt'[x]}, {x, qCDFInt["Domain"][[1, 1]], qCDFInt["Domain"][[1, 2]]},
    GridLines -> {Prepend[MapThread[{#2, {Dashed, Blend[{Blue, Green, Pink}, #1]}} &, {qsGL, xsGL}], {t0, GrayLevel[0.6]}], None},
    Axes -> False, Frame -> True,
    PlotLabel -> "Estimated CDF and PDF for " <> ToString[t0] <> "\[Degree]C", opts]
  ];

Using these definitions let us plot the estimated CDF's and PDF's for a collection of temperatures. Note that the quantiles are given with vertical dashed grid lines; the median is colored green. The values of the conditioning covariate are given in the plot labels and are marked with a solid, gray vertical line.
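The grid itself can be produced with the definitions above along the following lines (the particular collection of temperatures is an illustrative choice):

(* Sketch of producing the grid of CDF-and-PDF plots; the temperature values are illustrative. *)
Grid[Partition[
  Map[CDFPDFPlot[#, CDFEstimate[#], qs, ImageSize -> 250] &, {-5, 0, 5, 10, 15, 20, 25, 30}],
  4], Dividers -> All]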
Atlanta estimated CDF's and PDF's

In the grid of CDF-and-PDF plots we can see that:
1. when the temperature is lower there is higher probability that the next day the temperature is going to be higher,
2. when yesterday’s temperature t0 is within the range [10,25]C the median almost coincides with t0,
3. for medium and high temperatures we have distributions similar to the skew normal distribution.

As I mentioned in the beginning of the post I followed the exposition for Melbourne’s temperatures in Koenker’s book. I did the calculations described above with temperature time series data for Melbourne. Here is the plot with the fitted regression quantiles:
Melbourne regression quantiles for yesterday-vs-today temperatures

Here is the grid of plots over a collection of temperatures. Note that they lead to different conclusions than the ones for Atlanta, GA, USA.
Melbourne estimated CDF's and PDF's