Finding local extrema in noisy data using Quantile Regression

Introduction

This blog post describes an algorithm for finding local extrema in noisy data using Quantile Regression. The problem formulation and a solution based on polynomial model fitting (through LinearModelFit) were taken from Mathematica StackExchange — see “Finding Local Minima / Maxima in Noisy Data”, [1].

The proposed Quantile Regression algorithm is a version of the polynomial fitting solution (proposed by Leonid Shifrin in [1]) and has the following advantages: (i) it requires less parameter tweaking, and (ii) more importantly, it is more robust with very noisy and oscillating data. That robustness is achieved by using two fitted regression curves: one close to the local minima and another close to the local maxima, computed for a low and a high quantile respectively. (Quantile Regression is uniquely able to do that.)

The code for this blog post is available at [6].

A complete version of this blog post is available as a PDF document here: Finding local extrema in noisy data using Quantile Regression.pdf.

The problem

Here is the problem formulation from [1].

Problem 1: We have a list of pairs of numbers representing measurements of a scalar variable on a regular time grid. The measurements have noise in them.

As a data example for this problem the author of the question in [1] has provided the following code:

temptimelist = Range[200]/10;
tempvaluelist = Sinc[#] &@temptimelist + RandomReal[{-1, 1}, 200]*0.02;
data1 = Transpose[{temptimelist, tempvaluelist}];
ListPlot[data1, PlotRange -> All, Frame -> True, ImageSize -> 500]

[Figure: QRforFindingLocalExtrema-Data1]

In this article we are going to consider a more general problem hinted at in the discussion of the solutions in [1].

Problem 2: Assume that the data in Problem 1 is collected several times and that noise is present in both the measured values and the times of measurement. Also, the data can be highly oscillatory in nature.

Consider this data generation as an example for Problem 2:

n = 1000;
xs = N@Rescale[Range[n], {1, n}, {0, 60}];
data2 = Flatten[
   Table[Transpose[{xs + 0.1 RandomReal[{-1, 1}, Length[xs]],
      Map[5 Sinc[#] + Sin[#] + 4 Sin[1/4 #] &, xs] +
       1.2 RandomVariate[SkewNormalDistribution[0, 1, 0.9], Length[xs]]}],
    {10}], 1];
ListPlot[data2, PlotRange -> All, Frame -> True, ImageSize -> 500]

[Figure: QRforFindingLocalExtrema-Data2]

Note that this data has 10,000 points and is much larger than the data for Problem 1.

Extrema location approximation by model fitting followed by nearest neighbors search

Several solutions are given in [1]. A couple use wavelets or Gaussian filtering for de-noising. The one we are going to focus on in this article, provided by Leonid Shifrin, uses polynomial fitting to approximate the extrema locations and then finds the actual data extrema by Nearest Neighbors (NN’s) search.

Let us list the steps of that algorithm:

1. Fit a polynomial through the data (using LinearModelFit).
2. Find the local extrema of the fitted polynomial. (We will call them fit estimated extrema.)
3. Around each of the fit estimated extrema find the most extreme point in the data by nearest neighbors search (using Nearest).

We are going to refer to this algorithm as LMFFindExtrema. Its implementation is available here: QuantileRegressionForLocalExtrema.m, [6].
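For concreteness, here is a minimal sketch of these three steps. The helper name LMFFindExtremaSketch, the polynomial degree parameter, and the classification of each extremum via the second derivative are illustrative choices, not the exact implementation in [6]:

Clear[LMFFindExtremaSketch];
LMFFindExtremaSketch[data_, degree_Integer, nNNs_Integer] :=
  Module[{x, lmf, f, roots, nf, pick},
   (* 1. Fit a polynomial through the data. *)
   lmf = LinearModelFit[data, Table[x^i, {i, 1, degree}], x];
   f = lmf["Function"];
   (* 2. Find the local extrema of the fitted polynomial as the real
      roots of its derivative inside the data range. *)
   roots = Select[x /. NSolve[f'[x] == 0, x, Reals],
     Min[data[[All, 1]]] <= # <= Max[data[[All, 1]]] &];
   (* 3. Around each fit estimated extremum take the most extreme data
      point among its nNNs nearest neighbors. *)
   nf = Nearest[data[[All, 1]] -> data];
   pick[r_] := If[f''[r] < 0,
     First@MaximalBy[nf[r, nNNs], Last],
     First@MinimalBy[nf[r, nNNs], Last]];
   pick /@ roots];

For the data of Problem 1, a call like LMFFindExtremaSketch[data1, 10, 5] would be a reasonable starting point.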

Here is the application of the algorithm to the data example of Problem 1:

[Figure: QRforFindingLocalExtrema-LMFFindExtrema-Data1]

(The continuous line shows the fitted curve.)

Here is the application of the algorithm to the data example of Problem 2:

[Figure: QRforFindingLocalExtrema-LMFFindExtrema-Data2]

Simply increasing the number of basis polynomials and the number of examined NN points does not help:

[Figure: QRforFindingLocalExtrema-LMFFindExtrema-Data2-60]

We can also see that increasing the number of examined NN’s for each of the fit estimated extrema would make some of the final result points be “borrowed” from neighboring peaks in the data.

Experiments with fitting trigonometric basis functions showed a very good fit to the data, but the calculation of the fit estimated extrema was very slow.

Using Quantile Regression

It is trivial to rewrite the algorithm LMFFindExtrema to use Quantile Regression instead of linear model fitting of polynomials. We are going to use the Mathematica package [2] implementation described in [3,4]. The function QuantileRegression provided by [2] uses a B-spline basis [5] to find the quantile regression curves (also known as regression quantiles).

We are going to call this algorithm QRFindExtrema. Again its implementation is available here: QuantileRegressionForLocalExtrema.m, [6].

The algorithm QRFindExtrema has the following parameters: the data, the number of B-spline knots, the interpolation order, and the quantiles (corresponding to the curves to be fitted).

QRFindExtrema returns a list of regression quantile functions and a list of lists with extrema estimates.

More importantly though, QRFindExtrema can use two curves for finding the local extrema: one for local minima, and one for local maxima. (This feature is justified below.)
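Here is a hypothetical call that follows the parameter order listed above; the exact signature is defined in QuantileRegressionForLocalExtrema.m, [6]:

(* data, number of B-spline knots, interpolation order, quantiles;
   the argument order is assumed from the description above *)
{qFuncs, extremaEstimates} = QRFindExtrema[data1, 12, 3, {0.5}];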

Here is the application of QRFindExtrema to the example data of Problem 1:

[Figure: QRforFindingLocalExtrema-QRFindExtrema-Data1]

Here is the application of QRFindExtrema to the data of Problem 2:

[Figure: QRforFindingLocalExtrema-QRFindExtrema-Data2-12-120-0.5]

We can see that the regression quantile for 0.5 is too flat to provide a good estimate of the local extrema. We can get better results if we increase the number of knots for the B-spline basis built by QuantileRegression.

[Figure: QRforFindingLocalExtrema-QRFindExtrema-Data2-24-120-0.5]

We can see that the results look “almost right”: the horizontal locations of the peaks are more-or-less correctly identified, but the resulting extrema are too close to the fitted curve. Just increasing the number of examined NN’s for the fit estimated extrema does not produce good results, because points from neighboring peaks are chosen as final extrema estimates.

[Figure: QRforFindingLocalExtrema-QRFindExtrema-Data2-24-1500-0.5]

In order to solve this problem we use two regression quantiles. For the local minima we use a regression quantile for a low quantile, say, 0.02; for the local maxima we use a regression quantile for a high quantile, say, 0.98.
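A hypothetical two-quantile call, with the same assumed signature as above, where the extrema estimates come back as one list per fitted quantile:

{qFuncs, {minimaEstimates, maximaEstimates}} =
  QRFindExtrema[data2, 24, 3, {0.02, 0.98}];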

[Figure: QRforFindingLocalExtrema-QRFindExtrema-Data2-24-200-low-high]

The use of two (or more) fitted curves is a unique capability of Quantile Regression. With this algorithm feature, by construction, the lower regression quantile is close to the local minima and the higher regression quantile is close to the local maxima.

Also, since we find two regression quantile curves, we can use two nearest neighbors functions: one built from the points below the low regression quantile, and one built from the points above the high regression quantile. The implementation in [6] takes an option specifying whether the nearest neighbors functions for finding the extrema should be constructed over all data points or over just the outliers (the points outside of the found regression quantiles).

More examples

Here is the result for tamer noisy and oscillating data with around 10,000 points.

[Figure: QRforFindingLocalExtrema-QRFindExtrema-Data3-24-200-low-high]

References

[1] Mathematica StackExchange discussion. “Finding Local Minima / Maxima in Noisy Data”,
URL: http://mathematica.stackexchange.com/questions/23828/finding-local-minima-maxima-in-noisy-data/ .

[2] Anton Antonov, Quantile regression Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package QuantileRegression.m, (2013).

[3] Anton Antonov, Quantile regression through linear programming, usage guide at GitHub, https://github.com/antononcube/MathematicaForPrediction, in the directory “Documentation”, (2013).

[4] Anton Antonov, Quantile regression through linear programming, “Mathematica for prediction algorithms” blog at WordPress.com, 12/16/2013.

[5] Anton Antonov, Quantile regression with B-splines, “Mathematica for prediction algorithms” blog at WordPress.com, 1/1/2014.

[6] Anton Antonov, QuantileRegressionForLocalExtrema Mathematica package, source code at GitHub, https://github.com/antononcube/MathematicaForPrediction, package QuantileRegressionForLocalExtrema.m, (2015).
The package is in the directory “Applications”.


Quantile regression robustness

This blog post shows examples of quantile regression robustness. The quantile regression curves (called “regression quantiles”) are computed with the Mathematica package QuantileRegression.m hosted at the MathematicaForPrediction project at GitHub. Quantile regression was introduced by Koenker and Bassett in 1978; detailed theoretical descriptions and discussions are given in the book “Quantile regression” by Koenker.

This blog post extends the previous blog post.

Here is a scatter plot of data generated by a simple deterministic model with heteroscedastic skewed noise (the variance varies with x):

[Figure: Logarithmic data with heteroscedastic skewed noise]

The data was generated with the Mathematica command:

{#, 5 + Log[#] + RandomReal[SkewNormalDistribution[0, Log[#]/5, 12]]} & /@ Range[10, 200, 0.126751]

Looking at the plot we would assume the following model for the data:

Y = β0 + β1 X + β2 log(X).
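For reference, here is how such fits can be computed. LinearModelFit is built into Mathematica; QuantileRegressionFit is assumed to be the basis-function fitting routine of package [2] (check the package for its exact signature):

(* least squares fit of the assumed model; data holds the {x, y} pairs *)
lsm = LinearModelFit[data, {x, Log[x]}, x];

(* regression quantiles for the same basis *)
qs = {0.05, 0.25, 0.5, 0.75, 0.95};
qFits = QuantileRegressionFit[data, {1, x, Log[x]}, x, qs];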

Here is a plot of the data and the regression quantiles:

[Figure: Logarithmic data with regression quantiles and least squares]

Let us demonstrate the robustness of the regression quantiles with the data of the previous example. Suppose that for some reason 50% of the data y-values greater than 11.25 are altered by multiplying them by some factor greater than 1, say, α = 1.8.
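A minimal sketch of that alteration, using the cutoff and factor just stated (data holds the original {x, y} pairs):

(* multiply roughly half of the y-values above 11.25 by alpha *)
alpha = 1.8;
alteredData = Map[
   If[#[[2]] > 11.25 && RandomReal[] < 0.5, {#[[1]], alpha #[[2]]}, #] &,
   data];

The altered data looks like this: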

[Figure: Logarithmic data with heteroscedastic skewed noise with outliers]

Here is a plot of the altered data together with the regression quantiles and the least squares fit computed for it:

[Figure: Logarithmic data with outliers with regression quantiles and least squares]

Let us pair up the old regression quantile formulas with the new ones. We can see that the new regression quantiles computed for 0.05, 0.25, and 0.5 have not changed significantly:

[Figure: Regression quantiles for logarithmic data with outliers]

Also, there is a significant change in the least squares fit:

(i) original data: 5.02011 + 0.000708203 x + 1.14048 Log[x];

(ii) altered data: 6.60508 + 0.0183379 x + 0.494545 Log[x].

Here is a table of applying the altered regression quantiles to the original data:

[Figure: Table of separation for regression quantiles, logarithmic data with outliers]

Now let us do a different type of alteration of the original data. Suppose that for some reason 70% of the data y-values above the 0.95 regression quantile are altered by multiplying them by some factor greater than 1, say, α = 10. Then the altered data looks like this (using a log-plot):

[Figure: Logarithmic data with heteroscedastic skewed noise and regression based outliers]

Here is a plot of the altered data and all fitted functions:

[Figure: Logarithmic data with regression based outliers with regression quantiles and least squares]

Note that the least squares fit is quite off (the plot has a logarithmic scale on the y-axis). We can see that the new regression quantiles computed for 0.05, 0.25, 0.5, 0.75, and 0.95 have not changed significantly:

[Figure: Regression quantiles for logarithmic data with regression outliers]

Here is a table of applying the altered regression quantiles to the original data:

[Figure: Table of separation for regression quantiles, logarithmic data with regression outliers]

The examples clearly demonstrate the robustness of quantile regression compared to the least squares fit. As in the single distribution case, computing quantiles can be very useful for identifying outliers. For example, we can do the regression analogue of standardizing the data (subtract the median curve and divide by the local interquartile distance) and declare any point outside of a specified range an outlier.
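Here is a minimal sketch of that idea, assuming QuantileRegression from [2] returns a list of regression quantile functions for the given quantiles (the knot count and the 1.5 cutoff are illustrative):

(* regression quantiles at 0.25, 0.5, 0.75 over 12 B-spline knots *)
{q25, q50, q75} = QuantileRegression[data, 12, {0.25, 0.5, 0.75}];

(* regression analogue of standardizing: subtract the median curve and
   divide by the local interquartile distance *)
zValues = (#[[2]] - q50[#[[1]]])/(q75[#[[1]]] - q25[#[[1]]]) & /@ data;

(* declare points outside the specified range outliers *)
outlierPoints = Pick[data, Abs[#] > 1.5 & /@ zValues];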

Outlier detection in a list of numbers

Introduction

Today I added a Mathematica package with outlier detection algorithms to the project MathematicaForPrediction at GitHub. I also wrote and uploaded a guide to using the functions in this package — see “Outlier detection in a list of numbers”.

I frequently include outlier identification in the interfaces and algorithms I make for search and prediction engines. Of course, outlier identification is also indispensable for data cleaning, normalization, and analysis.

My first introduction to outlier detection was through the book “Mining Imperfect Data: Dealing with Contamination and Incomplete Records” by Ronald K. Pearson. (Here is also a link to that book at amazon.com.)

Outlier detection basics

The purpose of outlier detection algorithms is to find those elements in a list of numbers whose values are significantly higher or lower than the rest of the values.

Taking a certain number of elements with the highest values is not the same as outlier detection, but it can be used as a substitute for it.

Let us consider the following set of 50 numbers:

pnts = RandomVariate[GammaDistribution[5, 1], 50]

(* {7.59113, 5.57539, 3.77879, 3.14141, 5.25833, 7.4036, 5.7324, 4.3612, \
3.04052, 4.30872, 7.79725, 8.12759, 6.75185, 4.39845, 8.12002, \
2.95435, 9.55785, 4.06057, 3.27145, 10.2521, 3.52952, 6.25076, \
1.82362, 2.8844, 4.22082, 1.94356, 2.39944, 3.97935, 6.11422, \
4.80236, 8.56683, 3.61667, 2.53776, 3.60869, 2.56662, 4.05317, \
2.27279, 4.46844, 4.57981, 5.00396, 4.17491, 10.4406, 4.31504, \
4.83232, 4.56361, 5.38989, 2.66085, 2.37754, 3.21949, 3.53067} *)

If we sort those numbers in descending order and plot them we get:

pnts = pnts // Sort // Reverse;
ListPlot[pnts, PlotStyle -> {PointSize[0.015]}, Filling -> Axis, PlotRange -> All]

[Figure: ListPlotOfSorted50Numbers]

Let us use the following outlier detection algorithm:
1. Find all values in the list that are larger than the mean value multiplied by 1.5;
2. then find the positions of these values in the list of numbers.

We can implement this algorithm in the following way.

pos = Flatten[Map[Position[pnts, #] &, Select[pnts, # > 1.5 Mean[pnts] &]]]

(* {1, 2, 3, 4, 5, 6, 7, 8, 9} *)

Let us plot all points in blue and the outliers we found in red:

ListPlot[{pnts, Transpose[{pos, pnts[[pos]]}]}, 
 PlotStyle -> {PointSize[0.015], PointSize[0.009]}, Filling -> Axis, 
 PlotRange -> All]

[Figure: ListPlotOfSorted50NumbersWithOutliers]

Instead of the mean value we can use another reference point, like the median value. Obviously, we can also use a multiplier different than 1.5.
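For example, here is the same computation with the median as the reference point and a multiplier of 2 (both choices are illustrative):

pos = Flatten[Map[Position[pnts, #] &, Select[pnts, # > 2 Median[pnts] &]]]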

Using the implementation

First let us load the outlier identification package:

Get["https://github.com/antononcube/MathematicaForPrediction/blob/master/OutlierIdentifiers.m"]

We can find the outliers in a list of numbers with the function OutlierIdentifier:

OutlierIdentifier[pnts, HampelIdentifierParameters]

(* {10.4406, 10.2521, 9.55785, 8.56683, 8.12759, 8.12002, 7.79725, 7.59113, \
7.4036, 6.75185, 6.25076, 2.39944, 2.37754, 2.27279, 1.94356, 1.82362} *)

The package has three functions for computing outlier identifier parameters over a list of numbers:

HampelIdentifierParameters[pnts]
(* {2.50757, 6.11619} *)

SPLUSQuartileIdentifierParameters[pnts]
(* {-0.549889, 9.50178} *)

QuartileIdentifierParameters[pnts]
(* {1.79581, 6.82164} *)

Elements of the number list that are outside of the numerical interval made by one of these pairs of numbers are considered outliers.
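For example, selecting the points outside the Hampel identifier interval reproduces the OutlierIdentifier result shown above:

(* points outside the Hampel identifier interval are outliers *)
{lower, upper} = HampelIdentifierParameters[pnts];
Select[pnts, # < lower || # > upper &]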

In many cases we want only the top outliers or only the bottom outliers. We can use the functions TopOutliers and BottomOutliers for that.

OutlierIdentifier[pnts, TopOutliers[HampelIdentifierParameters[#]] &]

(* {10.4406, 10.2521, 9.55785, 8.56683, 8.12759, 8.12002, 7.79725, 7.59113, 7.4036, 6.75185, 6.25076} *)
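The bottom outliers are obtained analogously:

OutlierIdentifier[pnts, BottomOutliers[HampelIdentifierParameters[#]] &]

(* {2.39944, 2.37754, 2.27279, 1.94356, 1.82362} *)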