Outlier detection in a list of numbers

Introduction

I added today a Mathematica package with outlier detection algorithms in the project MathematicaForPrediction at GitHub. I also wrote and uploaded a guide to using the functions in this package — see “Outlier detection in a list of numbers“.

I frequently include outlier identification in the interfaces and algorithms I make for search and prediction engines. Of course, outlier identification is also indispensable for data cleaning, normalization, and analysis.

My first introduction to outlier detection was through the book “Mining Imperfect Data: Dealing with Contamination and Incomplete Records” by Ronald K. Pearson. (Here is also a link to that book at amazon.com.)

Outlier detection basics

The purpose of the outlier detection algorithms is to find those elements in a list of numbers that have values significantly higher or lower than the rest of the values.

Taking a certain number of elements with the highest values is not the same as an outlier detection, but it can be used as a replacement.

Let us consider the following set of 50 numbers:

pnts = RandomVariate[GammaDistribution[5, 1], 50]

(* {7.59113, 5.57539, 3.77879, 3.14141, 5.25833, 7.4036, 5.7324, 4.3612, \
3.04052, 4.30872, 7.79725, 8.12759, 6.75185, 4.39845, 8.12002, \
2.95435, 9.55785, 4.06057, 3.27145, 10.2521, 3.52952, 6.25076, \
1.82362, 2.8844, 4.22082, 1.94356, 2.39944, 3.97935, 6.11422, \
4.80236, 8.56683, 3.61667, 2.53776, 3.60869, 2.56662, 4.05317, \
2.27279, 4.46844, 4.57981, 5.00396, 4.17491, 10.4406, 4.31504, \
4.83232, 4.56361, 5.38989, 2.66085, 2.37754, 3.21949, 3.53067} *)

If we sort those numbers descendingly and plot them we get:

pnts = pnts // Sort // Reverse;
ListPlot[pnts, PlotStyle -> {PointSize[0.015]}, Filling -> Axis, PlotRange -> All]

ListPlotOfSorted50Numbers

Let us use the following outlier detection algorithm:
1. find all values in the list that are larger than the mean value multiplied by 1.5;
2. then find the positions of these values in the list of numbers.

We can implement this algorithm in the following way.

pos = Flatten[Map[Position[pnts, #] &, Select[pnts, # > 1.5 Mean[pnts] &]]]

(* {1, 2, 3, 4, 5, 6, 7, 8, 9} *)

Lets plot all points in blue and the outliers we found in red:

ListPlot[{pnts, Transpose[{pos, pnts[[pos]]}]}, 
 PlotStyle -> {PointSize[0.015], PointSize[0.009]}, Filling -> Axis, 
 PlotRange -> All]

ListPlotOfSorted50NumbersWithOutliers

Instead of the mean value we can use another reference point, like the median value. Obviously, we can also use a multiplier different than 1.5.

Using the implementation

First let us load the outlier identification package:

Get["https://github.com/antononcube/MathematicaForPrediction/blob/master/OutlierIdentifiers.m"]

We can find the outliers in a list of numbers with the function OutlierIdentifier:

OutlierIdentifier[pnts, HampelIdentifierParameters]

(* {10.4406, 10.2521, 9.55785, 8.56683, 8.12759, 8.12002, 7.79725, 7.59113, \
7.4036, 6.75185, 6.25076, 2.39944, 2.37754, 2.27279, 1.94356, 1.82362} *)

The package has three functions for the calculation of outlier identifier parameters over a list of numbers

HampelIdentifierParameters[pnts]
(* {2.50757, 6.11619} *)

SPLUSQuartileIdentifierParameters[pnts]
(* {-0.549889, 9.50178} *)

QuartileIdentifierParameters[pnts]
(* {1.79581, 6.82164} *)

Elements of the number list that are outside of the numerical interval made by one of these pairs of numbers are considered outliers.

In many cases we want only the top outliers or only the bottom outliers. We can use the functions TopOutliers and BottomOutliers for that.

OutlierIdentifier[pnts, TopOutliers[HampelIdentifierParameters[#]] &]

(* {10.4406, 10.2521, 9.55785, 8.56683, 8.12759, 8.12002, 7.79725, 7.59113, 7.4036, 6.75185, 6.25076} *)
Advertisements

One thought on “Outlier detection in a list of numbers

  1. Pingback: Comparison of PCA and NNMF over image de-noising | Mathematica for prediction algorithms

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s