Comparison of PCA and NNMF over image de-noising

This post compares two dimension reduction techniques Principal Component Analysis (PCA) / Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NNMF) over a set of images with two different classes of signals (two digits below) produced by different generators and overlaid with different types of noise.

PCA/SVD produces somewhat good results, but NNMF often provides great results. The factors of NNMF allow interpretation of the basis vectors of the dimension reduction, PCA in general does not. This can be seen in the example below.

Data

First let us get some images. I am going to use the MNIST dataset for clarity. (And because I experimented with similar data some time ago — see Classification of handwritten digits.).

MNISTdigits = ExampleData[{"MachineLearning", "MNIST"}, "TestData"];
{testImages, testImageLabels} = 
  Transpose[List @@@ RandomSample[Cases[MNISTdigits, HoldPattern[(im_ -> 0 | 4)]], 100]];
testImages

enter image description here

See the breakdown of signal classes:

Tally[testImageLabels]
(* {{4, 48}, {0, 52}} *)

Verify the images have the same sizes:

Tally[ImageDimensions /@ testImages]
dims = %[[1, 1]]

Add different kinds of noise to the images:

noisyTestImages6 = 
  Table[ImageEffect[
    testImages[[i]], 
    {RandomChoice[{"GaussianNoise", "PoissonNoise", "SaltPepperNoise"}], RandomReal[1]}], {i, Length[testImages]}];
RandomSample[Thread[{testImages, noisyTestImages6}], 15]

enter image description here

Since the important values of the signals are 0 or close to 0 we negate the noisy images:

negNoisyTestImages6 = ImageAdjust@*ColorNegate /@ noisyTestImages6

enter image description here

Linear vector space representation

We unfold the images into vectors and stack them into a matrix:

noisyTestImagesMat = (Flatten@*ImageData) /@ negNoisyTestImages6;
Dimensions[noisyTestImagesMat]

(* {100, 784} *)

Here is centralized version of the matrix to be used with PCA/SVD:

cNoisyTestImagesMat = 
  Map[# - Mean[noisyTestImagesMat] &, noisyTestImagesMat];

(With NNMF we want to use the non-centralized one.)

Here confirm the values in those matrices:

Grid[{Histogram[Flatten[#], 40, PlotRange -> All, 
     ImageSize -> Medium] & /@ {noisyTestImagesMat, 
    cNoisyTestImagesMat}}]

enter image description here

SVD dimension reduction

For more details see the previous answers.

{U, S, V} = SingularValueDecomposition[cNoisyTestImagesMat, 100];
ListPlot[Diagonal[S], PlotRange -> All, PlotTheme -> "Detailed"]
dS = S;
Do[dS[[i, i]] = 0, {i, Range[10, Length[S], 1]}]
newMat = U.dS.Transpose[V];
denoisedImages = 
  Map[Image[Partition[# + Mean[noisyTestImagesMat], dims[[2]]]] &, newMat];

enter image description here

Here are how the new basis vectors look like:

Take[#, 50] &@
 MapThread[{#1, Norm[#2], 
    ImageAdjust@Image[Partition[Rescale[#3], dims[[1]]]]} &, {Range[
    Dimensions[V][[2]]], Diagonal[S], Transpose[V]}]

enter image description here

Basically, we cannot tell much from these SVD basis vectors images.

Load packages

Here we load the packages for what is computed next:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/NonNegativeMatrixFactorization.m"]

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/OutlierIdentifiers.m"]

The blog post “Statistical thesaurus from NPR podcasts” discusses an example application of NNMF and has links to documents explaining the theory behind the NNMF utilization.

The outlier detection package is described in a previous blog post “Outlier detection in a list of numbers”.

NNMF

This command factorizes the image matrix into the product W H :

{W, H} = GDCLS[noisyTestImagesMat, 20, "MaxSteps" -> 200];
{W, H} = LeftNormalizeMatrixProduct[W, H];

Dimensions[W]
Dimensions[H]

(* {100, 20} *)
(* {20, 784} *)

The rows of H are interpreted as new basis vectors and the rows of W are the coordinates of the images in that new basis. Some appropriate normalization was also done for that interpretation. Note that we are using the non-normalized image matrix.

Let us see the norms of $H$ and mark the top outliers:

norms = Norm /@ H;
ListPlot[norms, PlotRange -> All, PlotLabel -> "Norms of H rows", 
  PlotTheme -> "Detailed"] // 
 ColorPlotOutliers[TopOutliers@*HampelIdentifierParameters]
OutlierPosition[norms, TopOutliers@*HampelIdentifierParameters]
OutlierPosition[norms, TopOutliers@*SPLUSQuartileIdentifierParameters]

enter image description here

Here is the interpretation of the new basis vectors (the outliers are marked in red):

MapIndexed[{#2[[1]], Norm[#], Image[Partition[#, dims[[1]]]]} &, H] /. (# -> Style[#, Red] & /@ 
   OutlierPosition[norms, TopOutliers@*HampelIdentifierParameters])

enter image description here

Using only the outliers of $H$ let us reconstruct the image matrix and the de-noised images:

pos = {1, 6, 10}
dHN = Total[norms]/Total[norms[[pos]]]*
   DiagonalMatrix[
    ReplacePart[ConstantArray[0, Length[norms]], 
     Map[List, pos] -> 1]];
newMatNNMF = W.dHN.H;
denoisedImagesNNMF = 
  Map[Image[Partition[#, dims[[2]]]] &, newMatNNMF];

Often we cannot just rely on outlier detection and have to hand pick the basis for reconstruction. (Especially when we have more than one classes of signals.)

Comparison

At this point we can plot all images together for comparison:

imgRows = 
  Transpose[{testImages, noisyTestImages6, 
    ImageAdjust@*ColorNegate /@ denoisedImages, 
    ImageAdjust@*ColorNegate /@ denoisedImagesNNMF}];
With[{ncol = 5}, 
 Grid[Prepend[Partition[imgRows, ncol], 
   Style[#, Blue, FontFamily -> "Times"] & /@ 
    Table[{"original", "noised", "SVD", "NNMF"}, ncol]]]]

enter image description here

We can see that NNMF produces cleaner images. This can be also observed/confirmed using threshold binarization — the NNMF images are much cleaner.

imgRows =
  With[{th = 0.5},
   MapThread[{#1, #2, Binarize[#3, th], 
      Binarize[#4, th]} &, {testImageLabels, noisyTestImages6, 
     ImageAdjust@*ColorNegate /@ denoisedImages, 
     ImageAdjust@*ColorNegate /@ denoisedImagesNNMF}]];
With[{ncol = 5}, 
 Grid[Prepend[Partition[imgRows, ncol], 
   Style[#, Blue, FontFamily -> "Times"] & /@ 
    Table[{"label", "noised", "SVD", "NNMF"}, ncol]]]]

enter image description here

Usually with NNMF in order to get good results we have to do more that one iteration of the whole factorization and reconstruction process. And of course NNMF is much slower. Nevertheless, we can see clear advantages of NNMF’s interpretability and leveraging it.

Gallery with other experiments

In those experiments I had to hand pick the NNMF basis used for the reconstruction. Using outlier detection without supervision would not produce good results most of the time.

Further comparison with Classify

We can further compare the de-noising results by building signal (digit) classifiers and running them over the de-noised images.

For such a classifier we have to decide:

  1. do we train only with images of the two signal classes or over a larger set of signals;
  2. how many signals we train with;
  3. with what method the classifiers are built.

Below I use the default method of Classify with all digit images in MNIST that are not in the noised images set. The classifier is run with the de-noising traced these 0-4 images:

Get the images:

{trainImages, trainImageLabels} = Transpose[List @@@ Cases[MNISTdigits, HoldPattern[(im_ -> _)]]];
pred = Map[! MemberQ[testImages, #] &, trainImages];
trainImages = Pick[trainImages, pred];
trainImageLabels = Pick[trainImageLabels, pred];
Tally[trainImageLabels]
    
(* {{0, 934}, {1, 1135}, {2, 1032}, {3, 1010}, {4, 928}, {5, 892}, {6, 958}, {7, 1028}, {8, 974}, {9, 1009}} *)

Build the classifier:

digitCF = Classify[trainImages -> trainImageLabels]

Compute classifier measurements for the original, the PCA de-noised, and NNMF de-noised images:

origCM = ClassifierMeasurements[digitCF, Thread[(testImages) -> testImageLabels]]
    
pcaCM = ClassifierMeasurements[digitCF, 
      Thread[(Binarize[#, 0.55] &@*ImageAdjust@*ColorNegate /@ denoisedImages) -> testImageLabels]]
    
nnmfCM = ClassifierMeasurements[digitCF, 
      Thread[(Binarize[#, 0.55] &@*ImageAdjust@*ColorNegate /@  denoisedImagesNNMF) -> testImageLabels]]

Tabulate the classification measurements results:

Grid[{{"Original", "PCA", "NNMF"},
{Row[{"Accuracy:", origCM["Accuracy"]}], Row[{"Accuracy:", pcaCM["Accuracy"]}], Row[{"Accuracy:", nnmfCM["Accuracy"]}]},
{origCM["ConfusionMatrixPlot"], pcaCM["ConfusionMatrixPlot"], nnmfCM["ConfusionMatrixPlot"]}}, 
Dividers -> {None, {True, True, False}}]

PCAvsNNMF-for-images-0-4-AllDigitClassifierMeasurement-binarize

Finding outliers in 2D and 3D numerical data

Introduction

This blog post describes a method of finding outliers in 2D and 3D data using Quantile Regression Envelopes discussed in previous blog posts: “Directional quantile envelopes”, “Directional quantile envelopes in 3D”.

Data

In order to provide a good example of the method application it would be better to use “real life” data. I found this one: “UCI Online Retail Data Set”. That dataset has columns for online purchase transactions (quantity and price).

I had problems loading the data with this command:

(*data=Import["http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx","XLSX"];*)

so I downloaded the XLSX file and saved it into a CSV file, then imported it:

data = Import["~/Datasets/UCI Online Retail Data Set/Online Retail.csv", 
   "IgnoreEmptyLines" -> True, "HeaderLines" -> 0];
columnNames = data[[1]];
data = Rest[data];

Here are the dimensions of the dataset (seems “realistically” large):

Dimensions[data]
(* {65499, 8} *)

Here is a summary of the quantative columns:

Grid[{RecordsSummary[N@data[[All, {4, 6}]], columnNames[[{4, 6}]]]}, Dividers -> All]

enter image description here

Adding a “discount” column

Since we have only two quantative columns in the data let us add a third one, and make it have 3 outliers.

dvec = RandomReal[SkewNormalDistribution[1, 2, 4], Dimensions[data][[1]]];
Block[{inds = RandomSample[Range[Length[dvec]], 3]}, 
  dvec[[inds]] = 10*dvec[[inds]]];
testData = MapThread[Append, {N@data[[All, {4, 6}]], dvec}];

Here is the summary:

Grid[{RecordsSummary[testData]}, Dividers -> All]

enter image description here

Standardizing

It is a good idea to standardize the data. This is not necessary for the outlier finding procedure used below, but it makes the data more convenient for visualization or other exploration.

sTestData = Transpose[Standardize /@ Transpose[N@testData]];

Because of the outliers plotting the data might produce uninformative plots. We can use logarithms of the point coordinates and for that we have to shift the standardized data to be positive. This is done with this command:

Block[{offset = -2 (Min /@ Transpose[sTestData])}, 
  sTestData = Map[# + offset &, sTestData]];

Let use show the standardized data summary and visualize:

Grid[{RecordsSummary[sTestData]}, Dividers -> All]

enter image description here

opts = {PlotRange -> All, ImageSize -> Medium, 
  PlotTheme -> "Detailed"}; Grid[{{ListPointPlot3D[sTestData, opts], 
   ListPointPlot3D[Log10@sTestData, opts]}}]

enter image description here

Using Quantile Regression envelopes to find outliers

We are going to find the outliers by computing envelopes around the dataset points that contain almost all points (e.g. 99.7% of them).

The finding of directional quantile envelopes in 2D and 3D is explained in these blog posts:

  1. “Directional quantile envelopes”,

  2. “Directional quantile envelopes in 3D”.

I am a big fan of Quantile Regression (QR) and I have implemented a collection of functions and applications of QR. See these blog posts.

This command imports the package QuantileRegression.m :

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/QuantileRegression.m"]

Here is an example of an envelope found using directional quantiles over a small sample of the points:

Block[{testData = RandomSample[sTestData, 2000], qreg},
 qreg = QuantileEnvelopeRegion[testData, 0.997, 10];
 Show[{Graphics3D[{Red, Point[testData]}, Axes -> True], 
   BoundaryDiscretizeRegion[qreg]}]
 ]

enter image description here

The outlier points (in red) are outside of the envelope.

The example (and plot) above are just for illustration purposes. We calculate a quantile evelope region using all points.

Block[{testData = sTestData},
 qreg = QuantileEnvelopeRegion[testData, 0.9997, 10];
]

(The function QuantileEnvelopeRegion works also for 2D. For 2D nothing else changes in the steps below except using 2D graphics.)

This command makes a function to test does a point belong to the found envelope region or not:

rmFunc = RegionMember[qreg];

This calculates the membership predicates for all points:

AbsoluteTiming[
 pred = rmFunc /@ sTestData;
]

(* {15.6485, Null} *)

And we can see the membership breakdown:

Tally[pred]
(* {{True, 65402}, {False, 97}} *)

and visualize it (using Pick and taking logarithms):

Graphics3D[{Gray, Point[Log10@Pick[sTestData, pred]], Red, 
  Point[Log10@Pick[sTestData, Not /@ pred]]}, Axes -> True]

enter image description here

The plot above contains both top and bottom outliers (in red). If we are intereseted only in the top outliers we can find these thresholds:

topThresholds = Quantile[#, 0.95] & /@ Transpose[testData]
(* {25, 10.95, 4.89433} *)

and use them to select the top outliers:

Select[Pick[testData, Not /@ pred], 
 Total[Thread[# > topThresholds] /. {True -> 1, False -> 0}] > 1 &]

(* {{1, 836.14, 6.48564}, {1, 16.13, 8.71455}, {5, 25.49, 8.47351}, {-1, 1126, 5.25211}, {1000, 0, 5.4212}, {1, 15.79, 9.14674}, {-1, 544.4, 6.86266}} *)

Note that the quantile regression envelope and the membership predicates were computed over the standardized data, and the predicates were used to retrieve the outliers of the original data.