# Quantile regression robustness

This blog post shows examples of quantile regression robustness. The quantile regression curves (called “regression quantiles”) are computed with the Mathematica package QuantileRegression.m hosted at the MathematicaForPrediction project at GitHub. Quantile regression was introduced by Koenker and Bassett in 1978; detailed theoretical descriptions and discussions are given in the book “Quantile regression” by Koenker.

This blog post extends the previous blog post.

Here is a scattered points graph based on a simple deterministic model with a simple heteroscedasticity (the variance varies with x): The data was generated with the Mathematica command:

{#, 5 + Log[#] + RandomReal[SkewNormalDistribution[0, Log[#]/5, 12]]} & /@ Range[10,200,0.126751]

Looking at the plot we would assume that the model for the data is

Y = β0 +β1 * X + β3 * log(X).

Here is a plot of the data and the regression quantiles: Let us demonstrate the robustness of the regression quantiles with the data of the previous example. Suppose that for some reason 50% of the data y-values greater than 11.25 are altered by multiplying them with a some greater than 1 factor, say, α = 1.8 . Then the altered data looks like this: Here is a plot of altered data and the regression quantiles and least squares fit for the altered data: Let us pair up the old regression quantile formulas with the new ones. We can see that the new regression quantiles computed for 0.05, 0.25, and 0.5 have not changed significantly: Also, there is a significant change in least squares fit:

(i) original data: 5.02011 + 0.000708203 x + 1.14048 Log[x],

(ii) altered dataL 6.60508 + 0.0183379 x + 0.494545 Log[x].

Here is a table of applying the altered regression quantiles to the original data: Now let us do a different type of alternation of the original data. Suppose that for some reason 70 % of the data Y-values above 0.95 regression quantile are the altered by multiplying them with a some greater than 1 factor, say, α = 10 . Then the altered data looks like this (using a log-plot): Here is a plot of the altered data and all fitted functions: Note that the least squares fit is quite off (the plot has a logarithmic scale on the y-axis). We can see that the new regression quantiles computed for 0.05, 0.25, 0.5, 0.75, and 0.95 have not changed significantly: Here is a table of applying the altered regression quantiles to the original data: The examples clearly demonstrate the robustness of quantile regression when compared to the least squares fit. As in the single distribution case, computing quantiles can be very useful for identifying outliers. For example, we can do the regression analogue of standardizing the data by subtracting the median and dividing by the interquartile distances, and declare any point outside of a specified range as an outlier.

# Outlier detection in a list of numbers

## Introduction

I added today a Mathematica package with outlier detection algorithms in the project MathematicaForPrediction at GitHub. I also wrote and uploaded a guide to using the functions in this package — see “Outlier detection in a list of numbers“.

I frequently include outlier identification in the interfaces and algorithms I make for search and prediction engines. Of course, outlier identification is also indispensable for data cleaning, normalization, and analysis.

My first introduction to outlier detection was through the book “Mining Imperfect Data: Dealing with Contamination and Incomplete Records” by Ronald K. Pearson. (Here is also a link to that book at amazon.com.)

## Outlier detection basics

The purpose of the outlier detection algorithms is to find those elements in a list of numbers that have values significantly higher or lower than the rest of the values.

Taking a certain number of elements with the highest values is not the same as an outlier detection, but it can be used as a replacement.

Let us consider the following set of 50 numbers:

``````pnts = RandomVariate[GammaDistribution[5, 1], 50]

(* {7.59113, 5.57539, 3.77879, 3.14141, 5.25833, 7.4036, 5.7324, 4.3612, \
3.04052, 4.30872, 7.79725, 8.12759, 6.75185, 4.39845, 8.12002, \
2.95435, 9.55785, 4.06057, 3.27145, 10.2521, 3.52952, 6.25076, \
1.82362, 2.8844, 4.22082, 1.94356, 2.39944, 3.97935, 6.11422, \
4.80236, 8.56683, 3.61667, 2.53776, 3.60869, 2.56662, 4.05317, \
2.27279, 4.46844, 4.57981, 5.00396, 4.17491, 10.4406, 4.31504, \
4.83232, 4.56361, 5.38989, 2.66085, 2.37754, 3.21949, 3.53067} *)``````

If we sort those numbers descendingly and plot them we get:

``````pnts = pnts // Sort // Reverse;
ListPlot[pnts, PlotStyle -> {PointSize[0.015]}, Filling -> Axis, PlotRange -> All]
`````` Let us use the following outlier detection algorithm:
1. find all values in the list that are larger than the mean value multiplied by 1.5;
2. then find the positions of these values in the list of numbers.

We can implement this algorithm in the following way.

``````pos = Flatten[Map[Position[pnts, #] &, Select[pnts, # > 1.5 Mean[pnts] &]]]

(* {1, 2, 3, 4, 5, 6, 7, 8, 9} *)``````

Lets plot all points in blue and the outliers we found in red:

``````ListPlot[{pnts, Transpose[{pos, pnts[[pos]]}]},
PlotStyle -> {PointSize[0.015], PointSize[0.009]}, Filling -> Axis,
PlotRange -> All]
`````` Instead of the mean value we can use another reference point, like the median value. Obviously, we can also use a multiplier different than 1.5.

## Using the implementation

First let us load the outlier identification package:

``Get["https://github.com/antononcube/MathematicaForPrediction/blob/master/OutlierIdentifiers.m"]``

We can find the outliers in a list of numbers with the function OutlierIdentifier:

``````OutlierIdentifier[pnts, HampelIdentifierParameters]

(* {10.4406, 10.2521, 9.55785, 8.56683, 8.12759, 8.12002, 7.79725, 7.59113, \
7.4036, 6.75185, 6.25076, 2.39944, 2.37754, 2.27279, 1.94356, 1.82362} *)``````

The package has three functions for the calculation of outlier identifier parameters over a list of numbers

``````HampelIdentifierParameters[pnts]
(* {2.50757, 6.11619} *)

SPLUSQuartileIdentifierParameters[pnts]
(* {-0.549889, 9.50178} *)

QuartileIdentifierParameters[pnts]
(* {1.79581, 6.82164} *)``````

Elements of the number list that are outside of the numerical interval made by one of these pairs of numbers are considered outliers.

In many cases we want only the top outliers or only the bottom outliers. We can use the functions `TopOutliers` and `BottomOutliers` for that.

``````OutlierIdentifier[pnts, TopOutliers[HampelIdentifierParameters[#]] &]

(* {10.4406, 10.2521, 9.55785, 8.56683, 8.12759, 8.12002, 7.79725, 7.59113, 7.4036, 6.75185, 6.25076} *)
``````