# Simple time series conversational engine

## Introduction

In this blog post I am going to discuss the creation (design and programming) of a simple conversational engine for time series analysis. The conversational engine responds to commands for:
1. loading time series data (weather data, stock data, or data files),
2. finding outliers,
3. analysis by curve fitting,
4. plotting, and
5. help and state changes.

This blog post is related to the blog post [1] and uses the package with functional parsers [2] and the Quantile regression package [3].

I programmed this engine ten months ago and here is link to a movie that demonstrates it, [4]: https://www.youtube.com/watch?v=wlZ5ANglVI4 .

A package with all of the code of the conversational engine can be downloaded from GitHub, [12] and executed with `Import`:

``` Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/Examples/SimpleTimeSeriesConversationalEngine.m"] ```

More about functional parsers, which were essential for the building of the conversational engine, can be read in “Functional parsers” by Fokker, [5].

## Design of the conversational engine

The conversational engine is a finite state machine. It has a command recognizer state and command application state. The switching between the states is done upon completion states functions. The recognized commands are English sentences. Some of the sentences include Mathematica code.

Let use describe the structure of the conversational engine.

### States

1. Acceptor state
The Acceptor state parses and recognizes commands. If the given command is successfully parsed then the parsing result is given to the Transducer state. Otherwise a message “Unknown command” is emitted and no state change is made (i.e.we are still in the Acceptor state).

2. Transducer state
Interprets the parsed a command by the Acceptor state. This state emits different messages according to the loaded data and accumulated graphics objects.

The states are named according to the classification given in the Wikipedia article about finite-state machines, [5]. According to that classification the conversational engine is a Mealy machine.

### Responses

The results of the applied time series analysis are graphical. The Transducer state accumulates the graphs of the commands and displays them with the data. The conversational engine responses have a spoken part and a graphics part. The responses have at least a spoken part. The graphics part can omitted.

### Commands

The weather data commands can be for temperature, wind speed, and pressure over a specified city. For example, “load temperature of Atlanta” or “load the Chicago wind speeds”.

The financial data commands can be for stock price or trading volume. For example, “load trade volume of NYSE:GOOG” or “load stock price of NYSE:APPL”.

The weather and stocks data loading commands do not take time period, they take only city and company specifications for WeatherData and FinancialData respectively.

The time series analysis commands we consider are: curve fitting by a list specified functions, finding regression quantiles, and finding outliers. For example, “compute least squares fit of {1,x,Sin[x/10^6]}”, “compute 5 regression quantiles”, “find top outliers”.

The graphics commands are “plot data”, “plot(s) joined”, “plot(s) not joined”, “clear graphics”.

In addition, the conversational engine takes global, service commands like “help”, “all commands”, “start over’, etc.

## Grammar specification

In order to respond to the commands that are English sentences we have to be able to parse these sentences into concrete state or message specifications. This means that we have to define the grammatical structures of the command sentences and the words in them.

This section describes the grammar of the commands using Extended Backus-Naur Form (EBNF). The package FunctionalParsers.m, [2], has Mathematica functions that generate parsers from EBNF of a grammar (see [9] and the next section). The resulting parsers are used in the conversational engine described above.

Let us explain the derivation and rationale of the grammar rules for the data loading commands. (I am not going to describe all the language considerations and decisions I have made for the commands grammar. These decisions can be traced in the EBNF grammar specification.)

We want to parse and interpret commands like:
1. load data file ‘timeTable.csv’ ,
2. load temperature of Atlanta ,
3. load the wind speeds of Chicago ,
4. load the stock price of NYSE:APPL ,

We can see that we have three different types of data loading commands: for files, for weather data, and for financial data. Each of these commands when parsed would be easy to interpret using Mathematica’s functions Import, WeatherData, and FinancialData respectively. Note that — because we want to keep the grammar simple — for the weather and financial data there is no time period specification. For the weather data we are going to load data for the last 60 days; for the financial data for the last 365 days. Also, for simplicity, we are not going to consider data loading commands that start with the phrases “get”, “get me”, “please load”, etc.

To this end the general load data command structure is:

Note that in the grammar rules above some of the words are optional; the optional words are put within square brackets. With the first rule we have separated the file loading from the weather and financial data loading. The file loading is further specified with rule 2. Rule 3 separates the preamble of the weather and financial data loading commands from the data specification. The latter is partitioned in rule 4. Rule 5 gives all of the possible data type requests for the weather data: temperature, pressure, and wind speed. Rule 6 gives the two possible types for financial data requests: stock price and trade volume. The file name, city, and company specifications are left out, but the rules for them are be easy to formulate.

### Full grammar specification

In this sub-section the full grammar specification is given. Note that some parts of the commands are specified to be dropped from the parsing results using the symbols “\[RightTriangle]” and “\[LeftTriangle]”. (For example ” ‘trade’ \[RightTriangle] ‘volume’ ” means “parse the sequence of words ‘trade’ ‘volume’ and return ‘volume’ as a result”.) The specification ‘_LetterString’ is an implementation shortcut for matching words and identifiers with the package FunctionalParsers.m, [2].

The rules were simplified for clarity, the actual conversational engine code can be retrieved, [8].

The grammar specified above takes complicated commands like “find regression quantiles over temperature of Atlanta”.

(In certain cases I prefer reading grammars in EBNF using Emacs and ebnf-mode.el, [7].)

## Parsers for the grammar

Using the grammar specification [8] and the package FunctionalParsers.m, [2], we can generate the parsers for that grammar. A detailed description of the parser generation process is given in [9].

Here is the generation step:

``` tokens = ToTokens[timeSeriesEBNFCode]; res = GenerateParsersFromEBNF[tokens]; res // LeafCount```

``` ```

```Out[4]= 3069 ```

The generation step creates a set of parsers:

`In[16]:= Magnify[Shallow[res[[1, 1, 2]], 6], 0.8]`

Here are the parsing results returned by the “master” rule `pTSCOMMAND`:

Here is another list for the engine’s “service” commands:

And see their parsing outcome using the function ParsingTestTable provided by FunctionalParsers.m:

```questions = {"load data file '~/example.csv'", "load file '~/anotherExample.csv'"}; ParsingTestTable[ParseShortest[pLOADFILECOMMAND], questions]```

Note that the parser takes commands with Mathematica expressions parts if those expressions are written without spaces:

```questions = {"least squares fit with x+Sin[x]"}; ParsingTestTable[ParseShortest[pTSCOMMAND], questions]```

This type of commands are specified with the grammar rule:

= ( ‘least’ , ‘squares’ , [ ‘fit’ ] , [ ‘with’ | ‘of’ ] ) \[RightTriangle] ‘_String’ ;

## Interpreters (for the parsers)

Next we have to write interpreters for the parsing results. Note that the parsing results are Mathematica expressions with heads that hint to the semantic meaning behind the parsed commands. Since we are viewing the conversational engine as a finite state machine we can say that for each parsing result we have to write a function that “reacts” to result’s head and content. This is a fairly straightforward process very similar to the one described by the Interpreter design pattern, [10].

A package with all of the code of the conversational engine can be downloaded from GitHub, [12]. The interpreters code can be found in the section “Interpreters”.

## The conversational engine programming

The conversational engine was programmed using Mathematica’s dynamic evaluation functions and mechanisms. The implementation follows the Absorber and Transductor state design described above in the section “Design of the conversational engine”.

The engine has three panels: for command input, for spoken output, and for graphical output. It is fully demonstrated with the movie “Time series conversational engine” posted on YouTube, [4].

As it was mentioned in the introduction the engine can be run using `Import`:

## Future plans

The engine can be extended in many directions depending on the final goals for it.

One must have extension is the ability to take time period specifications for the weather and financial data.

The time series analysis can be extended with more detailed analysis using functions like TimeSeriesModelFit or heteroscedasticity analysis using multiple regression quantiles, [11].

A third set of extensions is the ability to load additional kinds of data, like, the evolutions of countries populations or gross domestic products.

## References

[1] Anton Antonov, Natural language processing with functional parsers, blog post at WordPress, (2014).

[2] Anton Antonov, Functional parsers Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package FunctionalParsers.m, (2014).

[3] Anton Antonov, Quantile regression Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package QuantileRegression.m, (2013).

[4] Anton Antonov, Time series conversational engine, a YouTube movie.

[5] Wikipedia entry: Finite-state Machine, http://en.wikipedia.org/wiki/Finite-state_machine .

[6] Jeroen Fokker, Functional parsers, In: Advanced Functional Programming, Tutorial text of the First international spring school on advanced functional programming techniques in Båstad, Sweden, may 1995. (Johan Jeuring and Erik Meijer, eds). Berlin, Springer 1995 (LNCS 925), pp. 1-23. http://www.staff.science.uu.nl/~fokke101/article/parsers/index.html .

[7] Jeramey Crawford, Extended Backus-Naur Form Mode, ebnf-mode.el, emacs LISP at Github, https://github.com/jeramey/ebnf-mode .

[8] Anton Antonov, Time series conversational engine grammar in EBNF, TimeSeriesConversationalEngineGrammar.ebnf at the MathematicaForPrediction project at GitHub.

[9] Anton Antonov, Functional parsers for an integration requests language, PDF document at GitHub, https://github.com/antononcube/MathematicaForPrediction/tree/master/Documentation , (2014).

[10] Erich GammaTaligent, Richard Helm, Ralph Johnson, John Vlissides, Design patterns: elements of reusable object-oriented softwareAddison-Wesley Longman Publishing Co., Inc. Boston, MA, USA ©1995, ISBN:0-201-63361-2, http://en.wikipedia.org/wiki/Design_Patterns .

[11] Anton Antonov, Estimation of conditional density distributions, “Mathematica for prediction algorithms” blog at WordPress, (2014).

[12] Anton Antonov, Simple Time Series Conversational Engine Mathematica package, SimpleTimeSeriesConversationalEngine.m at the MathematicaForPrediction project at GitHub.

# Directional quantile envelopes in 3D

## Introduction

This blog post was mostly made as a response to a comment of my previous blog post “Directional quantile envelopes”, [1]:

This looks extremely useful and elegant – any plans on generalizing to 3 (or more) dimensions?
–Jan Liphardt

Since I did say in the previous blog post that the algorithm can be easily extended to 3D and higher dimensions, I decided to illustrate that with an example. The 3D implementation is really easy using Mathematica’s geometric computation functions (i) for derivation of rotation matrices and (ii) for regions manipulation. (The latter are new in version 10.)

The algorithm shown below is a slightly modified version of the algorithm in “Directional quantile envelopes”. Instead of the step that does the intersection of directional quantile lines with Reduce, we can use ImplicitRegion to specify the region enclosed by the directional quantile planes. We can use other built-in functions for region predicates and manipulation to derive, visualize, or utilize the directional quantile regions found by the algorithm. (All these are exemplified below.)

## Data generation

Let us first generate some data.

```npoints = 10000; data = {RandomReal[{0, 2 Pi}, npoints], RandomReal[{-Pi, Pi}, npoints], RandomVariate[PoissonDistribution[4.8], npoints]}; data = MapThread[#3*{Cos[#1] Cos[#2], Sin[#1] Cos[#2], Sin[#2]} &, data];```

Let us plot the data

```Block[{qs = 12}, qs = Map[Quantile[#, Range[0, 1, 1/(qs - 1)]] &, Transpose[data]]; ListPointPlot3D[data, PlotRange -> All, PlotTheme -> "Detailed", FaceGrids -> {{{0, 0, -1}, Most[qs]}, {{0, 1, 0}, qs[[{1, 3}]]}, {{-1, 0, 0}, Rest[qs]}}, ImageSize -> 700] ]```

On the plot the grid lines are derived from the quantiles of x, y and z coordinates of the data set.

## Algorithm application step-by-step

1. Standardize the data
1.1. This step is generally not needed and in this implementation would complicate things.

2. Create a set of uniform angles. (Here we generate vectors that define 3D directions.)

```nqs = 20; dirs = N@Flatten[ Table[{Cos[\[Theta]] Cos[\[Phi]], Sin[\[Theta]] Cos[\[Phi]], Sin[\[Phi]]}, {\[Theta], 2 \[Pi]/(10 nqs), 2 \[Pi], 2 \[Pi]/nqs}, {\[Phi], -\[Pi], \[Pi], 2 \[Pi]/nqs}], 1];```

`Graphics3D[Map[Arrow[{{0, 0, 0}, #}] &, dirs]]`

3. Rotate the data for each direction vector, a ∈ A. (Here we just generate the rotation matrices.)

```In[672]:= rmats = RotationMatrix[{{1, 0, 0}, #}] & /@ dirs; Length[rmats]```

``` ```

`Out[673]= 420`

4. Find the q-quantile of the z-coordinates of the data rotated at direction a. Denote that point with Z[a,q].

```In[674]:= qs = {0.95}; qDirPoints = Flatten[Map[Function[{m}, Quantile[(m.Transpose[data])[[3]], qs]], rmats]];```

5. Find the points z[a,q] corresponding to Z[a,q] in the original coordinate system.
5.1. In this implementation this is step is redundant, see the next step.

6. Find the quantile planes intersections
6.1. Instead of doing this step we simply use ImplicitRegion.

```In[676]:= qRegion = ImplicitRegion[ MapThread[(#1.{x, y, z})[[3]] <= #2 &, {rmats, qDirPoints}], {x, y, z}];```

`In[677]:= Shallow[qRegion]`

``` ```

```Out[677]//Shallow= ImplicitRegion[ LessEqual[<>] && LessEqual[<>] && LessEqual[<>] && LessEqual[<>] && LessEqual[<>] && LessEqual[<>] && LessEqual[<>] && LessEqual[<>] && LessEqual[<>] && LessEqual[<>] && <>, {x, y, z}]```

## Wrapping it in a function

Here is a definition of a function that wraps all of the algorithms steps in the previous section.

``` Clear[QuantileEnvelopeRegion] QuantileEnvelopeRegion[points_?MatrixQ, quantile_?NumberQ, numberOfDirections_Integer] := Block[{nd = numberOfDirections, dirs, rmats, qDirPoints, qRegion}, dirs = N@Flatten[ Table[{Cos[\[Theta]] Cos[\[Phi]], Sin[\[Theta]] Cos[\[Phi]], Sin[\[Phi]]}, {\[Theta], 2 \[Pi]/(10 nd), 2 \[Pi], 2 \[Pi]/nd}, {\[Phi], -\[Pi], \[Pi], 2 \[Pi]/nd}], 1]; rmats = RotationMatrix[{{1, 0, 0}, #}] & /@ dirs; qDirPoints = Flatten[Map[ Function[{m}, Quantile[(m.Transpose[points])[[3]], quantile]], rmats]]; qRegion = ImplicitRegion[ MapThread[(#1.{x, y, z})[[3]] <= #2 &, {rmats, qDirPoints}], {x, y, z}]; qRegion ] /; Dimensions[points][[2]] == 3 && 0 < quantile <= 1; ```

## Visualizing

From the implicit region specification we can generate an approximation of the boundary surface using the built-in function BoundaryDiscretizeRegion.

```qRegion = QuantileEnvelopeRegion[data, 0.95, 20]; qRegion2 = QuantileEnvelopeRegion[data, 0.8, 20];```

```qSurface = BoundaryDiscretizeRegion[qRegion]; qSurface2 = BoundaryDiscretizeRegion[qRegion2]; ```

`Grid[{{qSurface, qSurface2}}]`

Now we can visualize the quantile surface together with the data points:

Block[{c = 3, opts, pntsgr},
``` opts = {PlotRange -> {{-c, c}, {0, c}, {-c, c}}, Boxed -> True, Axes -> True, ImageSize -> {Automatic, 500}}; pntsgr = Graphics3D[Point[data]]; Grid[{{Show[{qSurface, pntsgr}, opts], Show[{qSurface2, pntsgr}, opts]}}] ]```

## Check

Now we can answer the question how many data points are inside the quantile directions regions. Again, this is easy with the built-in region functionalities of Mathematica version 10.

```In[656]:= inPoints = Select[data, # \[Element] qRegion &]; Length[inPoints]/Length[data] // N```

``` ```

`Out[657]= 0.5974`

```In[700]:= inPoints = Select[data, # \[Element] qRegion2 &]; Length[inPoints]/Length[data] // N```

``` ```

`Out[701]= 0.1705`

(In these code lines I kept “[\Element]” instead of replacing it with “∈” in order the lines to be copy and pasted. )
As we can see the algorithm would produce a region Subscript[R, q] which contains much a smaller number of points than the requested quantile, q. This property of the algorithm is discussed in [2]. (In the two dimensional space this property is less pronounced.)

Obviously, though, we would expect the bias of the algorithm to be monotonic with respect to the quantiles requested. Given a set of points P and quantiles u and v, 0<u<v<=1, for the respectively produced by the algorithm regions Ru and Rv we have |P ∩ Ru| < |P ∩ Rv| .

## References

[1] Anton Antonov, Directional quantile envelopes, “Mathematica for prediction algorithms” blog at WordPress.com, 11/3/2014 .

[2] Linglong Kong, Ivan Mizera, “Quantile tomography: using quantiles with multivariate data”, 2013, arXiv:0805.0056v2 [stat.ME] URL: http://arxiv.org/abs/0805.0056 .

# Directional quantile envelopes

## Introduction

In this blog post I am going to discuss the computation of the so called directional quantile envelopes; see [5] for definitions, theorems, and concrete examples. The Mathematica package QuantileRegression.m, [1] has an experimental implementation, the function `QuantileEnvelope`.

This type of implementation investigation is a natural extension of the quantile regression implementations and utilization I have done before. (See the package [1] and the related guide [2]; blog posts [3,4].) I was looking for a 2D quantile regression technique that can be used as a tool for finding 2D outliers. From the exposition below it should be obvious that the technique can be generalized in higher dimensions.

The idea of directional quantile envelopes is conceptually simple (and straightforward to come up with). The calculation is also relatively simple: over a set of uniformly distributed directions we find the lines that separate the data according a quantile parameter q, 0<q<1, and with those lines we approximate the enveloping curve for data that corresponds to q.

## The algorithm

Here is the full algorithm. (The figure below can be used as a visual aid).

Parameters: data as a matrix of two columns, a quantile parameter q, 0<q<1.
1. Standardize the data.
1.1. Choose a center of the data (e.g. means of the coordinates) and a scale (e.g. inter-quartile distances).
1.2. Transpose the data to the center and scale the obtained coordinates.
2. Create a set if uniform angles, A:={i*2*Pi/k}, 0<=i<=k-1 .
2.1. Denote with a[i]:=i*2*Pi/k, i∈[0,k-1].
3. Rotate the data for each angle, a[i]∈A.
4. Find the q-quantile of the y-coordinates of the data rotated at angle a. Denote that point with r[a,q].
5. Find the points y[a,q] corresponding to r[a,q] in the original coordinate system.
6. Find the quantile lines intersections
6.1. For each angle a∈A we denote with n[a] the vector corresponding to a.
6.2. For each a∈A we have a line l[a,q] with normal n[a] and passing through y[a,q].
6.3. Find the intersection points P[q] of the lines l[a,q].
6.3.1. Each point p[j]∈P[q], j∈[1,k] is an intersection of lines at consecutive angles.
6.3.2. In other words p[j]=l[a[j-1],q] ∩ l[a[j],q], j∈[1,k].
7. Apply the reverse of the transformation done at step 3 to P[q].
8. Connecting the consecutive points of P[q] gives the directional quantile envelope approximation.

Click this thumbnail for an image with better math notation:

## Algorithm visual aid

Here is an example of an application of the described algorithm. The points in blue are the data points, the points in red are the points y[a,q], the red lines are the lines l[a,q], the points in black are the points P.

## Remarks

Remark 1: The function QuantileEnvelope of [1] follows the described algorithm. The rotations are done using matrix multiplication; the function Reduce is used to find the intersections between the lines.

Remark 2: A possible alternative algorithm is to use “the usual” quantile regression with polar coordinates. (The article [5] refers to and briefly discusses such an approach.)

# Example data

In order to demonstrate the function `QuantileEnvelope` let us define a couple of data sets.

The first data set is generated with the function `SkewNormalDistribution`, so the data set name is `sndData`.

```npoints = 12000; data = {RandomReal[{0, 2 \[Pi]}, npoints], RandomVariate[SkewNormalDistribution[0.6, 0.3, 4], npoints]}; data = MapThread[#2*{Cos[#1], Sin[#1]} &, data]; rmat = RotationMatrix[-\[Pi]/2.5].DiagonalMatrix[{2, 1}]; data = Transpose[rmat.Transpose[data]]; data = TranslationTransform[{-Norm[Abs[#]/3], 0}][#] & /@ data; sndData = Standardize[data];```

Here is a plot of sndData with grid lines derived from the quantiles of x and y coordinates of the data set.

```ListPlot[sndData, PlotRange -> All, AspectRatio -> 1, PlotTheme -> "Detailed", GridLines -> Map[Quantile[#, Range[0, 1, 1/19]] &, Transpose[sndData]], ImageSize -> 700]```

The second uses the function `PoissonDistribution`, so the data set name is `pdData`.

```npoints = 12000; data = RandomReal[NormalDistribution[12, 5.6], npoints]; data = Transpose[{data, data + RandomVariate[PoissonDistribution[12], npoints] + RandomReal[{-0.2, 0.2}, npoints]}]; data = Select[data, #[[2]] > 0 &]; data[[All, 2]] = Log[data[[All, 2]]]; pdData = data;```

Here is a plot of `pdData` with grid lines derived from the quantiles of x and y coordinates of the data set.

```ListPlot[pdData, PlotRange -> All, AspectRatio -> 1, PlotTheme -> "Detailed", GridLines -> Map[Quantile[#, Range[0, 1, 1/19]] &, Transpose[pdData]], ImageSize -> 700]```

## Examples of application (with QuantileEnvelope)

Let us demonstrate the function `QuantileEnvelope`. The function arguments are a data matrix, a quantile specification, and number of directions. The quantile specification can a number or a list of numbers. The function returns the intersection points of the quantile lines at different directions.

Here is an application to the dataset `sndData` over 8 different quantiles over 60 different uniformly spread directions.

```In[630]:= qs = {0.7, 0.75, 0.8, 0.85, 0.90, 0.95, 0.98, .99} ; AbsoluteTiming[qsPoints = QuantileEnvelope[sndData, qs, 60];] Out[631]= {0.150104, Null}```

The result is a list of 8 lists of 60 points.

```In[632]:= Dimensions[qsPoints] Out[632]= {8, 60, 2}```

Let us plot the data and the direction quantile envelopes.

```Block[{data = sndData}, Show[{ListPlot[data, AspectRatio -> 1, PlotTheme -> {"Detailed"}, GridLines -> Map[Quantile[#, Range[0, 1, 1/(20 - 1)]] &, Transpose[data]], PlotLegends -> SwatchLegend[ Blend[{Red, Orange}, Rescale[#1, {Min[qs], Max[qs]}, {0, 1}]] & /@ qs, qs]], Graphics[{PointSize[0.005], Thickness[0.0025], MapThread[{Blend[{Red, Orange}, Rescale[#1, {Min[qs], Max[qs]}, {0, 1}]], Tooltip[Line[Append[#2, #2[[1]]]], #1], Point[#2]} &, {qs, qsPoints}]}] }, ImageSize -> 700]]```

Let us similar calculations for the dataset pdData.

```In[647]:= qs = {0.7, 0.75, 0.8, 0.85, 0.90, 0.95, 0.98, 0.99} ; AbsoluteTiming[qsPoints = QuantileEnvelope[pdData, qs, 60];] Out[648]= {0.154227, Null} ```

```In[649]:= Dimensions[qsPoints] Out[649]= {8, 60, 2}```

Here is the plot of the dataset `pdData` and directional quantile envelopes.

## Some more quantile curves

Here are couple of more plots of the quantile lines at different directions with a larger number of directions.

Lines for quantile 0.95 in 40 directions:

Lines for quantiles 0.95 and 0.8 in 40 directions:

## References

[1] Anton Antonov, Quantile regression Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package QuantileRegression.m, (2013).

[2] Anton Antonov, Quantile regression through linear programming, usage guide at GitHub, https://github.com/antononcube/MathematicaForPrediction, in the directory “Documentation”, (2013).

[3] Anton Antonov, Quantile regression through linear programming, “Mathematica for prediction algorithms” blog at WordPress.com, 12/16/2013.

[4] Anton Antonov, Quantile regression with B-splines, “Mathematica for prediction algorithms” blog at WordPress.com, 1/1/2014.

[5] Linglong Kong, Ivan Mizera, “Quantile tomography: using quantiles with multivariate data”, 2013, arXiv:0805.0056v2 [stat.ME] URL: http://arxiv.org/abs/0805.0056 .

# Removal of sub-trees in tries

I recently updated the package TriesWithFrequencies.m with two new functions `TrieRemoval` and TrieNodeFrequencies. The function `TrieNodeFrequencies` reverses the outcomes of `TrieNodeProbabilities`; it was made to facilitate the implementation and usage of `TrieRemoval`.

The removal of a sub-tree from a trie is useful when we want to alternate the learning stored in the trie.

The new functions are illustrated with examples below. (These examples reuse and extend the ones in the blog post “Tries with frequencies for data mining”.)
Consider the trie made with the list of words

`{"war", "ward", "warden", "work", "car", "cars", "carbs"}`

The function `TrieNodeFrequencies` can be used to reverse the effect of `TrieNodeProbabilities`. Its first argument is a trie, the second one is a scaling parameter (which is 1 by default). Here is an example:

`TrieForm[TrieNodeFrequencies[TrieNodeProbabilities[mytrie], 14]]`

The function `TrieRemove` can be used to remove sub-parts of a trie that correspond to a specified “word”. Here is an example using a trie with frequencies:

`TrieForm[TrieRemove[mytrie, "car"]]`

Here is an example using a trie with probabilities:

`TrieForm[TrieRemove[TrieNodeProbabilities[mytrie], "car"]]`

`TrieRemove` has two options “HowManyTimes” and “FrequencyValues” with default values `Automatic`. The option “HowManyTimes” takes a number or `Automatic` and it is still under development (its functionality is partially correct). The option “FrequencyValues” takes `True | False | Automatic` and it is used for the interpretation of the sub-tree removal. If the values are frequencies the removal process is more straightforward. If the values are probabilities the removal process has to take into account that the values on the same level of the sub-tree are conditional probabilities that sum to 1. `TrieRemove` has two different implementations one for tries with frequencies and one for tries with probabilities. The option setting `"FrequencyValues"->True|False` is used to redirect to the correct implementation. When `"FrequencyValues"->Automatic` is given a simple heuristic is used to determine is the trie argument a trie with frequencies or a trie with probabilities.

# Find Fit for Non-linear data

Three weeks ago I replied to a question in the Wolfram Community site about finding a function fit to points of non-linear data.

Here is the data:

I proposed a different way of doing the non-linear fitting using Quantile regression with B-splines. The advantages of the approach are that it does not need guessing of the fit functions and it requires less experimentation.

`Needs["QuantileRegression`"]`

After several experiments with the number of knots we can find a regression quantile using 3d order B-splines of 18 knots that approximates the data well.
``` qfunc = Simplify[QuantileRegression[data, 18, {0.5}, InterpolationOrder -> 2][[1]]];```

``` ```

```qfGr = Plot[qfunc[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}, PlotPoints -> 700]; Show[{dGr, qfGr}, Frame -> True, ImageSize -> 700] ```

Here is the relative error estimate:
``` ListPlot[Map[{#[[1]], (qfunc[#1[[1]]] - #1[[2]])/#1[[2]]} &, data], Filling -> Axis, Frame -> True, ImageSize -> 700] ```

I would like to point out that if we put large number of knots we can get aliasing errors effect:
``` qfunc = Simplify[ QuantileRegression[data, 40, {0.5}, InterpolationOrder -> 2][[1]]];```

``` ```

```qfGr = Plot[qfunc[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}, PlotPoints -> 700]; Show[{dGr, qfGr}, Frame -> True, ImageSize -> 700] ```

If we select a non-uniform grid of knots according to gradients of the data we get a good approximation with one attempt.

``` In[107]:= knots = Join[ Rescale[Range[0, 1, 0.2], {0, 1}, {Min[data[[All, 1]]], Max[data[[All, 1]]] 3/4}], Rescale[Range[0, 1, 0.1], {0, 1}, {Max[data[[All, 1]]] 3/4, Max[data[[All, 1]]]}] ]```

``` Out[107]= {0., 3.75, 7.5, 11.25, 15., 18.75, 18.75, 19.375, 20., 20.625, 21.25, 21.875, 22.5, 23.125, 23.75, 24.375, 25.} ```

```In[108]:= qfunc = QuantileRegression[data, knots, {0.5}][[1]]; ```

The grid lines in the plot below are made with the knots.
``` qfGr = Plot[qfunc[x], {x, Min[data[[All, 1]]], Max[data[[All, 1]]]}]; Show[{dGr, qfGr}, GridLines -> {knots, None}, GridLinesStyle -> Directive[GrayLevel[0.8], Dashed], Frame -> True, ImageSize -> 700] ```

Here are the relative errors:
``` ListPlot[Map[{#[[1]], (qfunc[#1[[1]]] - #1[[2]])/#1[[2]]} &, data], Filling -> Axis, Frame -> True, ImageSize -> 700] ```

Here is the approximation function after using PiecewiseExpand and Simplify:
``` qfunc = Evaluate[Simplify[PiecewiseExpand[qfunc[[1]]]]] &; qfunc ```

### Follow-up

If it is desired to make the fitting with a certain, known model function then the function `QuantileRegressionFit` can be used. `QuantileRegressionFit` is provided by the same package, QuantileRegression.m, used in the blog post.

This older blog post of mine discusses the use of `QuantileRegressionFit` and points to a PDF with further details (theory and Mathematica commands).

# Classification and association rules for census income data

### Introduction

In this blog post I am going to show (some) analysis of census income data — the so called “Adult” data set, [1] — using three types of algorithms: decision tree classification, naive Bayesian classification, and association rules learning. Mathematica packages for all three algorithms can be found at the project MathematicaForPrediction hosted at GitHub, [2,3,4].

(The census income data set is also used in the description of the R package “arules”, [7].)

In the census data every record represents a person with 14 attributes, the last element of a record is one of the labels {“>=50K”,”<50K"}. The relationships between the categorical variables in that data set was described in my previous blog post, "Mosaic plots for data visualization".

For this data the questions I am most interested in are:
Question 1: Which of the variables (age, occupation, sex, etc.) are most decisive for determining the income of a person?
Question 2: Which values for which variables form conditions that would imply high income or low income? (I.e. ">50K" or "<=50K".)
Question 3: What conclusions or confirmations we can get from answering the previous two questions?

One way to answer Question 1 is to use following steps, [8].
1. Build a classifier with the training set.
2. Verify using the test set that good classification results are obtained.
3. If the number of variables (attributes) is k for each i, 1<=i<=k :
3.1. Shuffle the values of the i-th column of the test data and find the classification success rates.
4. Compare the obtained k classification success rates between each other and with the success rates obtained by the un-shuffled test data.
5. The variables for which the classification success rates are the worst are the most decisive.

Following these steps with a decision tree classifier, [2], I found that “marital-status” and “education-num” (years of education) are most decisive to give good prediction for the “>50K” label. Using a naive Bayesian classifier, [3], the most significant variables are “marital-status” and “relationship”. (More details are given in the sections “Application of decision trees” and “Application of naive Bayesian classifier”.)

One way to answer Question 2 is to find which values of the variables (e.g. “Wife”, “Peru”, “HS-grad”, “Exec-managerial”) associate most frequently with “>50K” and “<=50K" respectively and apply different Bayesian probability statistics on them. This is what the application of Associative rules learning gives, [9]. Another way is to use mosaic plots, [5,9], and prefix trees (also known as "tries") [6,11,12].

In order to apply Association rule learning we need to make the numerical variables categorical — we need to partition them into non-overlapping intervals. (This derived, "all categorical" data is also amenable to be input data for mosaic plots and prefix trees.)

### Data set

The data set can be found and taken from http://archive.ics.uci.edu/ml/datasets/Census+Income, [1].

(This data set was also used in my previous blog post “Mosaic plots for data visualization”.)

The description of the data set is given in the file “adult.names” of the data folder. The data folder provides two sets with the same type of data “adult.data” and “adult.test”; the former is used for training, the latter for testing.

The total number of records in the file “adult.data” is 32561; the total number of records in the file “adult.test” is 16281.

Here is how the data looks like:

Since I did not understand the meaning of the column “fnlwgt” I dropped it from the data.

Here is a summary of the data:

As it was mentioned in the introduction, only 24% of the labels are “>50K”. Also note that 2/3 of the records are for males.

### Scatter plots and mosaic plots

Often scatter plots and mosaic plots can give a good idea of the general patterns that hold in the data. This sub-section has a couple of examples, but presenting extensive plots is beyond the scope of this blog post. Let me point out that it might be very beneficial to use these kind of plots with Mathematica‘s dynamic features (like Manipulate and Tooltip), or make a grid of mosaic plots.

Mosaic plots of the categorical variables of the data can be seen in my previous blog post “Mosaic plots for data visualization”.

Here is a table of the histograms for “age”, “education-num”, and “hours-per-week”:

Here is a table with scatter plots for all numerical variables of the data:

### Application of decision trees

The building and classification with decision trees is straightforward. Since the label “>50K” is only a quarter of the records I consider the classification success rates for “>50K” to be more important.

I experimented with several sets of parameters for decision tree building. I did not get a classification success rate for “>50K” better than 0.644 . Using pruning based on the Minimal Description Length (MDL) principle did not give better results. (I have to say I find MDL pruning to be an elegant idea, but I am not convinced that it works that
well. I believe decision tree pruning based on test data would produce much better results. Only the MDL decision tree pruning is implemented in [2].)

The overall classification success rate is in line with the classification success ratios listed in explanation of the data set; see the file “adult.names” in [1].

Here is a table with the results of the column shuffling experiments described in the introduction (in red is the name of the data column shuffled):

Here is a plot of the “>50K” success rates from the table above:

We can see from the table and the plot that variables “marital-status”, “education-num”, “capital-gain”, “age”, and “occupation” are very decisive when it comes to determining high income. The variable “marital-status” is significantly more decisive than the others.

While considering the decisiveness of the variable “marital-status” we can bring the following questions:
1. Do people find higher paying jobs after they get married?
2. Are people with high paying jobs more likely to marry and stay married?

Both questions are probably answered with “Yes” and probably that is why “marital-status” is so decisive. It is hard to give quantified answers to these questions just using decision trees on this data — we would need to know the salary and marital status history of the individuals (different data) or to be able to imply it (different algorithm).

We can see the decisiveness of “age”, “education-num”, “occupation”, and “hours-per-week” as natural. Of course one is expected to receive a higher pay if he has studied longer, has a high paying occupation, is older (more experienced), and works more hours per week. Note that this statement explicitly states the direction of the correlation: we do assume that longer years of study bring higher pay. It is certainly a good idea to consider the alternative direction of the correlation, that people first get high paying jobs and that these high paying jobs allow them to get older and study longer.

### Application of naive Bayesian classifiers

The naive Bayesian classifier, [3], produced better classification results than the decision trees for the label “>50K”:

Here is a table with the results of the column shuffling experiments described in the introduction (in red is the name of the data column shuffled):

Here is a plot of the “>50K” success rates from the table above:

In comparison with the decision tree importance of variables experiments we can notice that:
1. “marital-status” is very decisive and it is the second most decisive variable;
2. the most decisive variable is “relationship” but it correlates with “marital-status”;
3. “age”, “occupation”, “hours-per-week”, “capital-gain”, and “sex” are decisive.

### Shuffled classification rates plots comparison

Here are the two shuffled classification rates plots stacked together for easier comparison:

### Data modification

In order to apply the association rules finding algorithm Apriori, [4], the data set have to be modified. The modification is to change the numerical variables “age”, “education-num”, and “age” into categorical. I just partitioned them into non-overlapping intervals, labeled the intervals, and assigned the labels according the variable values. Here is the summary of the modified data for just these variables:

### Finding association rules

Using the modified data I found a large number of association rules with the Apriori algorithm, [4]. I used the measure called “confidence” to extract the most significant rules. The confidence of an association rule AC with antecedent A and consequent C is defined to be the ratio P(AC)/P(C). The higher the ratio the more confidence we have in the rule. (If the ratio is 1 we have a logical rule, CA.)

Here is a table showing the rules with highest confidence for the consequent being “>50K”:

From the table we can see for example that 2.1% of the data records (or 693 records) show that for a married man who has studied 14 years and originally from USA there is a 0.79 probability that he earns more than \$50000.

Here is a table showing the rules with highest confidence for the consequent being “<=50K":

The association rules in these tables confirm the findings with the classifiers: marital status, age, and education are good predictors of income labels “>50K” and “<=50K".

### Conclusion

The analysis confirmed (and quantified) what is considered common sense:

Age, education, occupation, and marital status (or relationship kind) are good for predicting income (above a certain threshold).

Using the association rules we see for example that
(1) if a person earns more than \$50000 he is very likely to be a married man with large number of years of education;
(2) single parents, younger than 25 years, who studied less than 10 years, and were never-married make less than \$50000.

### References

[1] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. Census Income Data Set, URL: http://archive.ics.uci.edu/ml/datasets/Census+Income .

[2] Antonov, A., Decision tree and random forest implementations in Mathematica, source code at https://github.com/antononcube/MathematicaForPrediction, package AVCDecisionTreeForest.m, (2013).

[3] Antonov, A., Implementation of naive Bayesian classifier generation in Mathematica, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package NaiveBayesianClassifier.m, (2013).

[4] Antonov, A., Implementation of the Apriori algorithm in Mathematica, source code at https://github.com/antononcube/MathematicaForPrediction, package AprioriAlgorithm.m, (2013).

[5] Antonov, A., Mosaic plot for data visualization implementation in Mathematica, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package MosaicPlot.m, (2014).

[6] Antonov, A., Tries with frequencies Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package TriesWithFrequencies.m, (2013).

[7] Hahsler, M. et al., Introduction to arules \[Dash] A computational environment for mining association rules and frequent item sets, (2012).

[8] Breiman, L. et al., Classification and regression trees, Chapman & Hall, 1984.

[9] Wikipedia, Association rules learning, http://en.wikipedia.org/wiki/Association_rule_learning .

[10] Antonov, A., Mosaic plots for data visualization, (March, 2014), URL: https://github.com/antononcube/MathematicaForPrediction/blob/master/Documentation/Mosaic%20plots%20for%20data%20visualization.pdf .

[11] Wikipedia, Trie, http://en.wikipedia.org/wiki/Trie .

[12] Antonov, A., Tries, (December, 2013), URL: https://github.com/antononcube/MathematicaForPrediction/blob/master/Documentation/Tries.pdf .

# Enhancements of MosaicPlot

I made the following enhancements of the function `MosaicPlot` which I described (and proclaimed the implementation of) in my previous blog post:

1. Tooltips with precise contingency statistics.
2. If the last data column is numerical then `MosaicPlot` can use it as pre-computed contingency statistics.
3. Coloring of the rectangles according to a list of index->color rules.

The document “Mosaic plots for data visualization” hosted at MathematicaForPrediction at GitHub, combines the information of this blog post and previous one. The document also has Mathematica code examples of usage and description of `MosaicPlot`‘s options.

### Tooltips with precise contingency statistics

I already proclaimed in my previous blog post the tooltips functionality — when hovering with the mouse over the rectangles then `MosaicPlot`, using `Tooltip`, gives a table with the exact co-occurrence (contingency) values. Here is an example:

### Visualizing categorical columns + a numerical column

If the last data column is numerical then `MosaicPlot` can use it as pre-computed contingency statistics. This functionality is specified with the option “ExpandLastColumn”->True.

In order to explain the functionality we are going to use following interpretation. If the last of column of the data is numerical then we can treat the data as a contracted version of a longer list of records made only of the categorical columns. For example, consider the following table with observations of people’s hair and eyes color:

The table above can be considered as a contracted version of this table:

Setting the option “ExpandLastColumn” to True gives a mosaic plot corresponding to that latter, observations-expanded table:

The last data column (which is numerical) does not need to be made of integers:

### Rectangle coloring

The rectangles can be colored using the option ColorRules which specifies how the colors of the rectangles are determined from the indices of the data columns.

More precisely, the values of the option ColorRules should be a list of rules, {i1->c1,i2->c2,…}, matching the form

`{(_Integer->(_RGBColor|_GrayLevel))..} .`

If coloring for only one column index is specified the value of ColorRules can be of the form

`{_Integer->{(_RGBColor|_GrayLevel)..}} .`

The colors are used with `Blend` in order to color the rectangles according to the order of the unique values of the specified data columns.

The default value for `ColorRules` is `Automatic`. When `Automatic` is given to `ColorRules`, `MosaicPlot` finds the data column with the largest number of unique values and colors them according to their order using `ColorData[7,"ColorList"]`.

The grid of plots below shows mosaic plots of the same data with different values for the option `ColorRules` (given as plot labels).