In this notebook we show information retrieval and clustering techniques over images of the Unicode collection of Chinese characters. Here is the outline of the notebook’s exposition:
Get Chinese character images.
Cluster “image vectors” and demonstrate that the obtained
clusters have certain explainability elements.
Apply Latent Semantic Analysis (LSA) workflow to the character
set.
Show visual thesaurus through a recommender system. (That uses
Cosine similarity.)
Discuss graph and hierarchical clustering using LSA matrix
factors.
Demonstrate approximation of “unseen” character images with an
image basis obtained through LSA over a small set of (simple)
images.
Redo character approximation with more “interpretable” image
basis.
In this section we cluster “image vectors” and demonstrate that the
obtained clusters have certain explainability elements. Expected Chinese
character radicals are observed using image multiplication.
Cluster the image vectors and show a summary of the cluster lengths:
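Here is a minimal sketch of this step, assuming aCImageVecs is the association of image vectors made above; the clustering method and the number of clusters are illustrative choices, and lsClusters is the cluster list referred to below:
SeedRandom[3];
lsClusters = FindClusters[(Normal /@ Values[aCImageVecs]) -> Keys[aCImageVecs], 40, Method -> {"KMeans"}];
Tally[Length /@ lsClusters]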
Remark: We can see that the clustering above produced “semantic” clusters – most of the multiplied images show meaningful Chinese character radicals and their “expected positions.”
Here is one of the clusters with the radical “mouth”:
KeyTake[aCImages, lsClusters[[26]]]
LSAMon application
In this section we apply the “standard” LSA workflow, [AA1, AA4].
Make a matrix with named rows and columns from the image vectors:
mat = ToSSparseMatrix[SparseArray[Values@aCImageVecs], "RowNames" -> Keys[aCImageVecs], "ColumnNames" -> Automatic]
The following Latent Semantic Analysis (LSA) monadic pipeline is used in [AA2]:
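Here is a minimal sketch of such a pipeline over the matrix mat made above; the names LSAMonSetDocumentTermMatrix, LSAMonExtractTopics, LSAMonTakeW and the option values are assumptions based on the LSAMon design described later in this document:
lsaObj =
  LSAMonUnit[]⟹
   LSAMonSetDocumentTermMatrix[mat]⟹
   LSAMonApplyTermWeightFunctions["None", "None", "Cosine"]⟹
   LSAMonExtractTopics["NumberOfTopics" -> 40, Method -> "SVD", "MaxSteps" -> 20];
W2 = lsaObj⟹LSAMonTakeW; (* reduced dimension representation used below *)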
I experimented with clustering and approximation using WL’s function FeatureExtraction. The results are fairly similar to the ones above; the timings are different (a few times slower).
Visual thesaurus
In this section we use Cosine similarity to find visual nearest
neighbors of Chinese character images.
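Here is a minimal Nearest-based sketch of such a computation (the notebook itself uses a recommender system), assuming the associations aCImages and aCImageVecs from above; the focus character and the number of neighbors are illustrative:
nnFunc = Nearest[(Normal /@ Values[aCImageVecs]) -> Keys[aCImageVecs], DistanceFunction -> CosineDistance];
focusKey = RandomChoice[Keys[aCImageVecs]];
KeyTake[aCImages, nnFunc[Normal[aCImageVecs[focusKey]], 12]]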
Remark: By careful observation of the clusters and
graph connections we can convince ourselves that the similarities are
based on pictorial sub-elements (i.e. radicals) of the characters.
Hierarchical clustering
In this section we apply hierarchical clustering to the reduced
dimension representation of the Chinese character images.
Here is a heat-map plot with hierarchical clustering dendrogram (with
tool-tips):
gr = HeatmapPlot[W2[[lsFocusIDs, All]], DistanceFunction -> {CosineDistance, None}, Dendrogram -> {True, False}];
gr /. Map[# -> Tooltip[Style[#, FontSize -> 16], Style[#, Bold, FontSize -> 36]] &, lsFocusIDs]
Remark: The plot above has tooltips with larger
character images.
Representing all characters with a smaller set of basic ones
In this section we demonstrate that a relatively small set of simpler Chinese character images can be used to represent (or approximate) the rest of the images.
Remark: We use the following heuristic: the simpler Chinese characters have the smallest number of white pixels.
Obtain a training set of images – the darkest ones – and show a sample of that set:
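Here is a minimal sketch of that selection, assuming the association aCImages from above; the training-set size and sample size are illustrative:
aImageWhiteness = Total[Flatten[ImageData[#]]] & /@ aCImages;
lsTrainingKeys = Keys[TakeSmallest[aImageWhiteness, 500]]; (* fewest white pixels, i.e. darkest *)
KeyTake[aCImages, RandomSample[lsTrainingKeys, 20]]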
Remark: By applying the approximation procedure to all characters in the testing set we can convince ourselves that the small training set provides good retrieval. (Not done here.)
In this presentation we discuss the application of different dimension reduction algorithms over collections of random mandalas. We discuss and compare the derived image bases and show how those bases explain the underlying collection structure. The presented techniques and insights (1) are applicable to any collection of images, and (2) can be included in larger, more complicated machine learning workflows. The former is demonstrated with a handwritten digits recognition
application; the latter with the generation of random Bethlehem stars. The (parallel) walk-through of the core demonstration is in all three programming languages: Mathematica, Python, and R.
This document/notebook is inspired by the Mathematica Stack Exchange (MSE) question “Plotting the Star of Bethlehem”, [MSE1]. That MSE question requests efficient and fast plotting of a certain mathematical function that (maybe) looks like the Star of Bethlehem, [Wk1]. Instead of doing what the author of the question suggests, I decided to use a generative art program and workflows from three of the most important Machine Learning (ML) sub-cultures: Latent Semantic Analysis, Recommendations, and Classification.
Although we discuss the making of Bethlehem Star-like images, the ML workflows and corresponding code presented in this document/notebook have general applicability – in many situations we have to make classifiers based on data that has to be “feature engineered” through a pipeline of several types of ML transformative workflows, and that feature engineering requires multiple iterations of re-examination and tuning in order to achieve the set goals.
The document/notebook is structured as follows:
Target Bethlehem Star images
Simplistic approach
Elaborated approach outline
Sections that follow the elaborated approach outline:
Remark: The plot above looks prettier in a notebook converted with the resource function DarkMode.
Elaborated approach
Assume that we want to automate the simplistic approach described in the previous section.
One way to automate is to create a Machine Learning (ML) classifier that is capable of discerning which RandomMandala objects look like Bethlehem Star target images and which do not. With such a classifier we can write a function BethlehemMandala that applies the classifier to multiple results from RandomMandala and returns those mandalas that the classifier says are good.
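Here is a minimal sketch of such a function, assuming a trained ClassifierFunction (cfBStar below, hypothetical) that returns True for Bethlehem Star-like mandalas; the resource function RandomMandala is used for the generation:
Clear[BethlehemMandala];
BethlehemMandala[n_Integer, cf_ClassifierFunction, opts : OptionsPattern[]] :=
  Module[{candidates},
   (* generate candidate mandalas and keep the ones the classifier approves *)
   candidates = Table[Image[ResourceFunction["RandomMandala"][opts]], n];
   Select[candidates, TrueQ[cf[#]] &]
  ];
BethlehemMandala[12, cfBStar]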
Here are the steps of building the proposed classifier:
Generate a large enough Random Mandala Images Set (RMIS)
Create a feature extractor from a subset of RMIS
Assign features to all of RMIS
Make a recommender with the RMIS features and other image data (like pixel values)
Apply the RMIS recommender over the target Bethlehem Star images and determine and examine image sets that are:
the best recommendations
the worst recommendations
With the best and worst recommendation sets compose training data for classifier making (see the sketch after this list)
Train a classifier
Examine classifier application to (filtering of) random mandala images (both in RMIS and not in RMIS)
If the results are not satisfactory redo some or all of the steps above
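As a minimal sketch of the training data composition and classifier training steps, assume aGoodMandalas and aBadMandalas are (hypothetical) lists of images selected through the best and worst recommendations; the classification method is an illustrative choice:
trainingData = Join[Thread[aGoodMandalas -> True], Thread[aBadMandalas -> False]];
cfBStar = Classify[trainingData, Method -> "NearestNeighbors"];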
Remark: If the results are not satisfactory we should consider using the obtained classifier at the data generation phase. (This is not done in this document/notebook.)
Remark: The elaborated approach outline and flow chart have general applicability, not just for generation of random images of a certain type.
Flow chart
Here is a flow chart that corresponds to the outline above:
A few observations for the flow chart follow:
The flow chart has a feature extraction block that shows that the feature extraction can be done in several ways.
The application of LSA is a type of feature extraction which this document/notebook uses.
If the results are not good enough the flow chart shows that the classifier can be used at the data generation phase.
If the results are not good enough there are several alternatives to redo or tune the ML algorithms.
Changing or tuning the recommender implies training a new classifier.
Changing or tuning the feature extraction implies making a new recommender and a new classifier.
Data generation and preparation
In this section we generate random mandala graphics, transform them into images and corresponding vectors. Those image-vectors can be used to apply dimension reduction algorithms. (Other feature extraction algorithms can be applied over the images.)
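Here is a minimal sketch of that generation and vectorization, assuming the resource function RandomMandala; the image size and the number of mandalas are illustrative:
SeedRandom[33];
lsMandalaImages = Table[ColorConvert[ImageResize[Image[ResourceFunction["RandomMandala"][]], {64, 64}], "Grayscale"], 100];
matMandalas = N[Flatten[ImageData[#]] & /@ lsMandalaImages]; (* one image per row *)
Dimensions[matMandalas]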
Remark: Note the weights assigned to the pixels and the topics in the recommender object above. Those weights were derived by examining the recommendation results shown below.
Here is the image we want to find most similar mandala images to – the target image:
Remark: Note that although a higher rotational symmetry order is used the highly scored results still seem relevant – they have the features of the target Bethlehem Star images.
In this document we describe the design and implementation of a (software programming) monad, [Wk1], for Latent Semantic Analysis workflows specification and execution. The design and implementation are done with Mathematica / Wolfram Language (WL).
What is Latent Semantic Analysis (LSA)? : A statistical method (or a technique) for finding relationships in natural language texts that is based on the so called Distributional hypothesis, [Wk2, Wk3]. (The Distributional hypothesis can be simply stated as “linguistic items with similar distributions have similar meanings”; for an insightful philosophical and scientific discussion see [MS1].) LSA can be seen as the application of Dimensionality reduction techniques over matrices derived with the Vector space model.
The goal of the monad design is to make the specification of LSA workflows (relatively) easy and straightforward by following a certain main scenario and specifying variations over that scenario.
The data for this document is obtained from WL’s repository and it is manipulated into a certain ready-to-utilize form (and uploaded to GitHub.)
The monadic programming design is used as a Software Design Pattern. The LSAMon monad can also be seen as a Domain Specific Language (DSL) for the specification and programming of latent semantic analysis workflows.
Here is an example of using the LSAMon monad over a collection of documents that consists of 233 US State of the Union speeches.
The table above is produced with the package “MonadicTracing.m”, [AAp2, AA1], and some of the explanations below also utilize that package.
As it was mentioned above the monad LSAMon can be seen as a DSL. Because of this the monad pipelines made with LSAMon are sometimes called “specifications”.
Remark: In this document with “term” we mean “a word, a word stem, or other type of token.”
Remark: LSA and Latent Semantic Indexing (LSI) are considered more or less to be synonyms. I think that “latent semantic analysis” sounds more universal and that “latent semantic indexing” as a name refers to a specific Information Retrieval technique. Below we refer to “LSI functions” like “IDF” and “TF-IDF” that are applied within the generic LSA workflow.
Contents description
The document has the following structure.
The sections “Package load” and “Data load” obtain the needed code and data.
The sections “Design consideration” and “Monad design” provide motivation and design decisions rationale.
The sections “LSAMon overview”, “Monad elements”, and “The utilization of SSparseMatrix objects” provide technical descriptions needed to utilize the LSAMon monad. (Using a fair number of examples.)
The section “Unit tests” describes the tests used in the development of the LSAMon monad.
(The random pipelines unit tests are especially interesting.)
The section “Future plans” outlines future directions of development.
The section “Implementation notes” just says that LSAMon’s development process and this document follow the ones of the classifications workflows monad ClCon, [AA6].
Remark: One can read only the sections “Introduction”, “Design consideration”, “Monad design”, and “LSAMon overview”. That set of sections provides a fairly good, programming language agnostic exposition of the substance and novel ideas of this document.
Package load
The following commands load the packages [AAp1–AAp7]:
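A minimal sketch of that loading, assuming the package layout of the MathematicaForPrediction repository at GitHub (only two of the packages are shown):
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MonadicProgramming/MonadicLatentSemanticAnalysis.m"];
Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MathematicaForPredictionUtilities.m"];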
In this section we load data that is used in the rest of the document. The text data was obtained through WL’s repository, transformed into a certain more convenient form, and uploaded to GitHub.
The text summarization and plots are done through LSAMon, which in turn uses the function RecordsSummary from the package “MathematicaForPredictionUtilities.m”, [AAp7].
In some of the examples below we want to explicitly specify the stop words. Here are stop words derived using the built-in functions DictionaryLookup and DeleteStopwords.
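A minimal sketch of that derivation (the whole dictionary word list is used as the universe):
stopWords = Complement[DictionaryLookup["*"], DeleteStopwords[DictionaryLookup["*"]]];
Length[stopWords]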
Here is a quote from [Wk1] that fairly well describes why we choose to make a classification workflow monad and hints on the desired properties of such a monad.
[…] The monad represents computations with a sequential structure: a monad defines what it means to chain operations together. This enables the programmer to build pipelines that process data in a series of steps (i.e. a series of actions applied to the data), in which each action is decorated with the additional processing rules provided by the monad. […]
Monads allow a programming style where programs are written by putting together highly composable parts, combining in flexible ways the possible actions that can work on a particular type of data. […]
Remark: Note that the quote from [Wk1] refers to chained monadic operations as “pipelines”. We use the terms “monad pipeline” and “pipeline” below.
Monad design
The monad we consider is designed to speed-up the programming of LSA workflows outlined in the previous section. The monad is named LSAMon for “Latent Semantic Analysis Monad”.
We want to be able to construct monad pipelines of the general form:
LSAMon-Monad-Design-formula-1
LSAMon is based on the State monad, [Wk1, AA1], so the monad pipeline form (1) has the following more specific form:
LSAMon-Monad-Design-formula-2
This means that some monad operations will not just change the pipeline value but they will also change the pipeline context.
In the monad pipelines of LSAMon we store different objects in the contexts for at least one of the following two reasons.
The object will be needed later on in the pipeline, or
The object is (relatively) hard to compute.
Such objects are the document-term matrix, the dimensionality reduction factors, and the related topics.
Let us list the desired properties of the monad.
Rapid specification of non-trivial LSA workflows.
The text data supplied to the monad can be: (i) a list of strings, or (ii) an association with string values.
The monad uses the Linear vector space model.
The document-term frequency matrix can be created after removing stop words and/or word stemming.
It is easy to specify and apply different LSI weight functions. (Like “IDF” or “GFIDF”.)
The monad can do dimension reduction with SVD and NNMF and the corresponding matrix factors are retrievable with monad functions.
Documents (or query strings) external to the monad are easily mapped into monad’s Linear vector space of terms and Linear vector space of topics.
The monad allows cursory examination and summarization of the data.
The pipeline values can be of different types. (Most monad functions modify the pipeline value; some modify the context; some just echo results.)
It is easy to obtain the pipeline value, context, and different context objects for manipulation outside of the monad.
It is easy to tabulate extracted topics and related statistical thesauri.
The LSAMon components and their interactions are fairly simple.
The main LSAMon operations implicitly put in the context or utilize from the context the following objects:
document-term matrix,
the factors obtained by matrix factorization algorithms,
LSI weight functions specifications,
extracted topics.
Note that the set of types of LSAMon pipeline values is fairly heterogenous and certain awareness of “the current pipeline value” is assumed when composing LSAMon pipelines.
Obviously, we can put in the context any object through the generic operations of the State monad of the package “StateMonadGenerator.m”, [AAp1].
LSAMon overview
When using a monad we lift certain data into the “monad space”, using monad’s operations we navigate computations in that space, and at some point we take results from it.
With the approach taken in this document the “lifting” into the LSAMon monad is done with the function LSAMonUnit. Results from the monad can be obtained with the functions LSAMonTakeValue, LSAMonContext, or with the other LSAMon functions with the prefix “LSAMonTake” (see below.)
Here is a corresponding diagram of a generic computation with the LSAMon monad:
LSAMon-pipeline
Remark: It is a good idea to compare the diagram with formulas (1) and (2).
Let us examine a concrete LSAMon pipeline that corresponds to the diagram above. In the following table each pipeline operation is combined together with a short explanation and the context keys after its execution.
Here is the output of the pipeline:
The LSAMon functions are separated into four groups:
operations,
setters and droppers,
takers,
State Monad generic functions.
Monad functions interaction with the pipeline value and context
An overview of those functions is given in the tables in the next two sub-sections. The next section, “Monad elements”, gives details and examples for the usage of the LSAMon operations.
In this section we show that LSAMon has all of the properties listed in the previous section.
The monad head
The monad head is LSAMon. Anything wrapped in LSAMon can serve as monad’s pipeline value. It is better though to use the constructor LSAMonUnit. (Which adheres to the definition in [Wk1].)
The fundamental model of LSAMon is the so called Vector space model (or the closely related Bag-of-words model.) The document-term matrix is a linear vector space representation of the documents collection. That representation is further used in LSAMon to find topics and statistical thesauri.
Here is an example of ad hoc construction of a document-term matrix using a couple of paragraphs from “Hamlet”.
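Here is a minimal ad hoc sketch; textHamlet is the list of “Hamlet” text lines used in this document, and the chosen paragraph indices and row labels are illustrative:
paragraphs = ToLowerCase /@ textHamlet[[{10, 19}]];
words = TextWords /@ paragraphs;
terms = Union[Flatten[words]];
docTermCounts = Outer[Count[#1, #2] &, words, terms, 1];
MatrixForm[docTermCounts, TableHeadings -> {{"id.0010", "id.0019"}, terms}]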
When we construct the document-term matrix we (often) want to stem the words and (almost always) want to remove stop words. LSAMon’s function LSAMonMakeDocumentTermMatrix makes the document-term matrix and takes specifications for stemming and stop words.
After making the document-term matrix we will most likely apply LSI weight functions, [Wk2], like “GFIDF” and “TF-IDF”. (This follows the “standard” approach used in search engines for calculating weights for document-term matrices; see [MB1].)
Frequency matrix
We use the following definition of the frequency document-term matrix F.
Each entry f_(i,j) of the matrix F is the number of occurrences of term j in document i.
Weights
Each entry of the weighted document-term matrix M derived from the frequency document-term matrix F is expressed with the formula
m_(i,j) = g_j l_(i,j) d_i,
where g_j is the global term weight, l_(i,j) is the local term weight, and d_i is the normalization weight.
Various formulas exist for these weights and one of the challenges is to find the right combination of them when using different document collections.
Here is a table of weight functions formulas.
LSAMon-LSI-weight-functions-table
Computation specifications
LSAMon function LSAMonApplyTermWeightFunctions delegates the LSI weight functions application to the package “DocumentTermMatrixConstruction.m”, [AAp4].
Here are summaries of the non-zero values of the weighted document-term matrix derived with different combinations of global, local, and normalization weight functions.
Streamlining topic extraction is one of the main reasons LSAMon was implemented. The topic extraction corresponds to the so called “syntagmatic” relationships between the terms, [MS1].
Theoretical outline
The original weighted document-term matrix M is decomposed into the matrix factors W and H:
M ≈ W.H, W ∈ ℝ^(m×k), H ∈ ℝ^(k×n).
The i-th row of M is expressed with the i-th row of W multiplied by H.
The rows of H are the topics. SVD produces orthogonal topics; NNMF does not.
The i-th document of the collection corresponds to the i-th row of W. Finding the Nearest Neighbors (NN’s) of the i-th document using the row similarities of the matrix W gives document NN’s through topic similarity.
The terms correspond to the columns of H. Finding NN’s based on similarities of H’s columns produces statistical thesaurus entries.
The term groups provided by H’s rows correspond to “syntagmatic” relationships. Using similarities of H’s columns we can produce term clusters that correspond to “paradigmatic” relationships.
Computation specifications
Here is an example using the play “Hamlet” in which we specify additional stop words.
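A minimal sketch of such a pipeline follows; the topic-extraction names (LSAMonExtractTopics, LSAMonEchoTopicsTable) and their options are assumptions based on the monad design above, and the additional stop words are illustrative:
SeedRandom[2381];
lsaHamlet =
  LSAMonUnit[textHamlet]⟹
   LSAMonMakeDocumentTermMatrix["StemmingRules" -> Automatic, "StopWords" -> Join[stopWords, {"enter", "exit", "exeunt", "scene"}]]⟹
   LSAMonApplyTermWeightFunctions⟹
   LSAMonExtractTopics["NumberOfTopics" -> 12, Method -> "NNMF"]⟹
   LSAMonEchoTopicsTable["NumberOfTerms" -> 10];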
One of the most natural operations is to find the representation of an arbitrary document (or sentence or a list of words) in monad’s Linear vector space of terms. This is done with the function LSAMonRepresentByTerms.
Here is an example in which a sentence is represented as a one-row matrix (in that space.)
obj =
lsaHamlet⟹
LSAMonRepresentByTerms["Hamlet, Prince of Denmark killed the king."]⟹
LSAMonEchoValue;
Here we display only the non-zero columns of that matrix.
obj⟹
LSAMonEchoFunctionValue[MatrixForm[Part[#, All, Keys[Select[SSparseMatrix`ColumnSumsAssociation[#], # > 0& ]]]]& ];
Transformation steps
Assume that LSAMonRepresentByTerms is given a list of sentences. Then that function performs the following steps.
1. The sentence is split into a list of words.
2. If monad’s document-term matrix was made by removing stop words the same stop words are removed from the list of words.
3. If monad’s document-term matrix was made by stemming the same stemming rules are applied to the list of words.
4. The LSI global weights and the LSI local weight and normalizer functions are applied to the sentence’s contingency matrix.
Equivalent representation
Let us convince ourselves that documents used in the monad to build the weighted document-term matrix have the same representation as the corresponding rows of that matrix.
Here is an association of documents from monad’s document collection.
inds = {6, 10};
queries = Part[lsaHamlet⟹LSAMonTakeDocuments, inds];
queries
(* <|"id.0006" -> "Getrude, Queen of Denmark, mother to Hamlet. Ophelia, daughter to Polonius.",
"id.0010" -> "ACT I. Scene I. Elsinore. A platform before the Castle."|> *)
lsaHamlet⟹
LSAMonRepresentByTerms[queries]⟹
LSAMonEchoFunctionValue[MatrixForm[Part[#, All, Keys[Select[SSparseMatrix`ColumnSumsAssociation[#], # > 0& ]]]]& ];
Another natural operation is to find the representation of an arbitrary document (or a list of words) in monad’s Linear vector space of topics. This is done with the function LSAMonRepresentByTopics.
Here is an example.
inds = {6, 10};
queries = Part[lsaHamlet⟹LSAMonTakeDocuments, inds];
Short /@ queries
(* <|"id.0006" -> "Getrude, Queen of Denmark, mother to Hamlet. Ophelia, daughter to Polonius.",
"id.0010" -> "ACT I. Scene I. Elsinore. A platform before the Castle."|> *)
lsaHamlet⟹
LSAMonRepresentByTopics[queries]⟹
LSAMonEchoFunctionValue[MatrixForm[Part[#, All, Keys[Select[SSparseMatrix`ColumnSumsAssociation[#], # > 0& ]]]]& ];
In order to clarify what the function LSAMonRepresentByTopics is doing let us go through the formulas it is based on.
The original weighted document-term matrix M is decomposed into the matrix factors W and H:
M ≈ W.H, W ∈ ℝ^(m×k), H ∈ ℝ^(k×n).
The i-th row of M is expressed with the i-th row of W multiplied by H:
m_i ≈ w_i.H.
For a query vector q_0 ∈ ℝ^n we want to find its topics representation vector x ∈ ℝ^k:
q_0 ≈ x.H.
Denote with H^(-1) the inverse or pseudo-inverse matrix of H. We have:
q_0.H^(-1) ≈ (x.H).H^(-1) = x.(H.H^(-1)) = x.I,
x ∈ ℝ^k, H^(-1) ∈ ℝ^(n×k), I ∈ ℝ^(k×k).
In LSAMon, for SVD H^(-1) = H^T; for NNMF H^(-1) is the pseudo-inverse of H.
The vector x is the topics representation obtained with LSAMonRepresentByTopics.
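As a small illustration of the formulas above, assume the factors W and H have been taken out of the monad (e.g. with the corresponding “LSAMonTake” takers) and q0 is a query vector in the terms space:
x = q0.Transpose[H];     (* SVD: H has orthonormal rows, so H^(-1) = H^T *)
x = q0.PseudoInverse[H]; (* NNMF: use the pseudo-inverse of H *)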
Tags representation
Sometimes we want to find the topics representation of tags associated with monad’s documents and the tag-document associations are one-to-many. See [AA3].
Let us consider a concrete example – we want to find what topics correspond to the different presidents in the collection of State of Union speeches.
Here we find the document tags (president names in this case.)
There are several algorithms we can apply for finding the most important documents in the collection. LSAMon utilizes two types of algorithms: (1) graph centrality measures based, and (2) matrix factorization based. With certain graph centrality measures the two algorithms are equivalent. In this sub-section we demonstrate the matrix factorization algorithm (that uses SVD.)
Definition: The most important sentences have the most important words and the most important words are in the most important sentences.
That definition can be used to derive an iterations-based model that can be expressed with SVD or eigenvector finding algorithms, [LE1].
Here we pick an important part of the play “Hamlet”.
focusText =
First@Pick[textHamlet, StringMatchQ[textHamlet, ___ ~~ "to be" ~~ __ ~~ "or not to be" ~~ ___, IgnoreCase -> True]];
Short[focusText]
(* "Ham. To be, or not to be- that is the question: Whether 'tis ....y.
O, woe is me T' have seen what I have seen, see what I see!" *)
LSAMonUnit[StringSplit[ToLowerCase[focusText], {",", ".", ";", "!", "?"}]]⟹
LSAMonMakeDocumentTermMatrix["StemmingRules" -> {}, "StopWords" -> Automatic]⟹
LSAMonApplyTermWeightFunctions⟹
LSAMonFindMostImportantDocuments[3]⟹
LSAMonEchoFunctionValue[GridTableForm];
LSAMon-Find-most-important-documents-table
Setters, droppers, and takers
The values from the monad context can be set, obtained, or dropped with the corresponding “setter”, “dropper”, and “taker” functions as summarized in a previous section.
For example:
p = LSAMonUnit[textHamlet]⟹LSAMonMakeDocumentTermMatrix[Automatic, Automatic];
p⟹LSAMonTakeMatrix
If other values are put in the context they can be obtained through the (generic) function LSAMonTakeContext, [AAp1]:
Short@(p⟹LSAMonTakeContext)["documents"]
(* <|"id.0001" -> "1604", "id.0002" -> "THE TRAGEDY OF HAMLET, PRINCE OF DENMARK", <<220>>, "id.0223" -> "THE END"|> *)
Another generic function from [AAp1] is LSAMonTakeValue (used many times above.)
Here is an example of the “data dropper” LSAMonDropDocuments:
(The “droppers” simply use the state monad function LSAMonDropFromContext, [AAp1]. For example, LSAMonDropDocuments is equivalent to LSAMonDropFromContext[“documents”].)
The utilization of SSparseMatrix objects
The LSAMon monad heavily relies on SSparseMatrix objects, [AAp6, AA5], for internal representation of data and computation results.
An SSparseMatrix object is a matrix with named rows and columns.
In some cases we want to show only columns of the data or computation results matrices that have non-zero elements.
Here is an example (similar to other examples in the previous section.)
lsaHamlet⟹
LSAMonRepresentByTerms[{"this country is rotten",
"where is my sword my lord",
"poison in the ear should be in the play"}]⟹
LSAMonEchoFunctionValue[ MatrixForm[#1[[All, Keys[Select[ColumnSumsAssociation[#1], #1 > 0 &]]]]] &];
In the pipeline code above: (i) from the list of queries a representation matrix is made, (ii) that matrix is assigned to the pipeline value, (iii) in the pipeline echo-value function the non-zero columns are selected by using the keys of the non-zero elements of the association obtained with ColumnSumsAssociation.
Similarities based on representation by terms
Here is a way to compute the similarity matrix of different sets of documents that are not required to be in monad’s document collection.
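A minimal sketch, assuming lsaHamlet from above and that SSparseMatrix objects support Dot and Transpose, [AAp6]; the query sentences are illustrative and a plain dot product of the terms representations is used:
queries1 = {"this country is rotten", "where is my sword my lord"};
queries2 = {"poison in the ear should be in the play", "O that this too too solid flesh would melt"};
smat1 = lsaHamlet⟹LSAMonRepresentByTerms[queries1]⟹LSAMonTakeValue;
smat2 = lsaHamlet⟹LSAMonRepresentByTerms[queries2]⟹LSAMonTakeValue;
MatrixForm[smat1.Transpose[smat2]]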
Similarly to the weighted Boolean similarities matrix computation above we can compute a similarity matrix using the topics representations. Note that an additional normalization step is required.
Note the differences with the weighted Boolean similarity matrix in the previous sub-section – the similarities that are less than 1 are noticeably larger.
Unit tests
The development of LSAMon was done with two types of unit tests: (i) directly specified tests, [AAp7], and (ii) tests based on randomly generated pipelines, [AA8].
The unit test package should be further extended in order to provide better coverage of the functionalities and illustrate – and postulate – pipeline behavior.
Since the monad LSAMon is a DSL it is natural to test it with a large number of randomly generated “sentences” of that DSL. For the LSAMon DSL the sentences are LSAMon pipelines. The package “MonadicLatentSemanticAnalysisRandomPipelinesUnitTests.m”, [AAp9], has functions for generation of LSAMon random pipelines and running them as verification tests. A short example follows.
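A minimal sketch of generating such pipelines; the name of the generation function is an assumption that follows the package’s naming pattern:
SeedRandom[234];
pipelines = MakeLSAMonRandomPipelines[100];
Length[pipelines]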
AbsoluteTiming[
res = TestRunLSAMonPipelines[pipelines, "Echo" -> False];
]
From the test report results we see that a dozen tests failed with messages, all of the rest passed.
rpTRObj = TestReport[res]
(The message failures, of course, have to be examined – some bugs were found in that way. Currently the actual test messages are expected.)
Future plans
Dimension reduction extensions
It would be nice to extend the Dimension reduction functionalities of LSAMon to include other algorithms like Independent Component Analysis (ICA), [Wk5]. Ideally with LSAMon we can do comparisons between SVD, NNMF, and ICA like the image de-noising based comparison explained in [AA8].
Another direction is to utilize Neural Networks for the topic extraction and making of statistical thesauri.
Conversational agent
Since LSAMon is a DSL it can be relatively easily interfaced with a natural language interface.
Here is an example of natural language commands parsed into LSA code using the package [AAp13].
The implementation methodology of the LSAMon monad packages [AAp3, AAp9] followed the methodology created for the ClCon monad package [AAp10, AA6]. Similarly, this document closely follows the structure and exposition of the ClCon monad document “A monad for classification workflows”, [AA6].
A lot of the functionalities and signatures of LSAMon were designed and programmed through considerations of natural language commands specifications given to a specialized conversational agent.
This document discusses concrete algorithms for two different approaches to the generation of mandala images, [1]: direct construction with graphics primitives, and the use of machine learning algorithms.
to show some pretty images exploiting symmetry and multiplicity (see this album),
to provide an illustrative example of comparing dimension reduction methods,
to give a set-up for further discussions and investigations on mandala creation with machine learning algorithms.
Two direct construction algorithms are given: one uses "seed" segment rotations, the other superimposes layers of different types. The following plots show the order in which different mandala parts are created with each of the algorithms.
In this document we use several algorithms for dimension reduction applied to collections of images following the procedure described in [4,5]. We are going to show that with Non-Negative Matrix Factorization (NNMF) we can use mandalas made with the seed segment rotation algorithm to extract layer types and superimpose them to make colored mandalas. Using the same approach with Singular Value Decomposition (SVD) or Independent Component Analysis (ICA) does not produce good layers and the superimposition produces more "watered-down", less diverse mandalas.
From a more general perspective this document compares the statistical approach of "trying to see without looking" with the "direct simulation" approach. Another perspective is the creation of "design spaces"; see [6].
The idea of using machine learning algorithms is appealing because there is no need to make the mental effort of understanding, discerning, approximating, and programming the principles of mandala creation. We can "just" use a large collection of mandala images and generate new ones using the "internal knowledge" data of machine learning algorithms. For example, a Neural network system like Deep Dream, [2], might be made to dream of mandalas.
Direct algorithms for mandala generation
In this section we present two different algorithms for generating mandalas. The first sees a mandala as being generated by rotation of a "seed" segment. The second sees a mandala as being generated by different component layers. For other approaches see [3].
The request of [3] is for generation of mandalas for coloring by hand. That is why the mandala generation algorithms are in the grayscale space. Coloring the generated mandala images is a secondary task.
By seed segment rotations
One way to come up with mandalas is to generate a segment and then produce a mandala with an appropriate number of rotations of that segment.
Here is a function and an example of random segment (seed) generation:
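Here is a minimal sketch of such a function; the zig-zag polygon construction and the parameter choices are illustrative simplifications of the code used in this document:
Clear[MakeSeedSegment];
MakeSeedSegment[radius_, angle_, n_Integer: 10] :=
  Module[{onRay, onAxis},
   (* random radial positions on the two rays bounding the seed sector *)
   onRay = Table[r {Cos[angle], Sin[angle]}, {r, Sort[RandomReal[{0, radius}, n]]}];
   onAxis = Table[{r, 0}, {r, Sort[RandomReal[{0, radius}, n]]}];
   (* zig-zag polygon between the two rays *)
   Polygon[Riffle[onAxis, onRay]]
  ];
SeedRandom[17];
seed = MakeSeedSegment[10, Pi/6];
Graphics[{GrayLevel[0.3], seed}]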
Here is a more concise way to generate symmetric segment mandalas:
Multicolumn[Table[Image@MakeMandala[], {12}], 5]
Note that with this approach the programming of the mandala coloring is not that trivial — weighted blending of colorized mandalas is the easiest thing to do. (Shown below.)
"For this one I’ve defined three types of layer, a flower, a simple circle and a ring of small circles. You could add more for greater variety."
The coloring approach with image blending given below did not work well for this algorithm, so I modified the original code in order to produce colored mandalas.
The most interesting results are obtained with the image blending procedure coded below over mandala images generated with the seed segment rotation algorithm.
In this section we are going to apply the dimension reduction algorithms Singular Value Decomposition (SVD), Independent Component Analysis (ICA), and Non-Negative Matrix Factorization (NNMF) to a linear vector space representation (a matrix) of an image dataset. In the next section we use the bases generated by those algorithms to make mandala images.
We are going to use the packages [7,8] for ICA and NNMF respectively.
The linear vector space representation of the images is simple — each image is flattened to a vector (row-wise), and the image vectors are put into a matrix.
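Here is a minimal sketch of the vectorization and of obtaining the SVD basis, assuming mandalaImages is the list of generated mandala images; the image size and the number of basis vectors are illustrative:
imgSize = 64;
vecs = N[Flatten[ImageData[ColorConvert[ImageResize[#, {imgSize, imgSize}], "Grayscale"]]]] & /@ mandalaImages;
{U, S, V} = SingularValueDecomposition[vecs, 20];
(* each column of V is a basis vector; reshape it back into an image *)
svdBasisImages = Map[ImageAdjust[Image[Partition[#, imgSize]]] &, Transpose[V]];
Multicolumn[svdBasisImages, 5]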
The SVD basis has an average mandala image as its first vector and the other vectors are "differences" to be added to that first vector.
The SVD and ICA bases are structured similarly. That is because ICA and SVD are both based on orthogonality — ICA factorization uses an orthogonality criterion based on Gaussian noise properties (which is more relaxed than SVD’s standard orthogonality criterion).
As expected, the NNMF basis images have black background because of the enforced non-negativity. (Black corresponds to 0, white to 1.)
Compared to the SVD and ICA bases the images of the NNMF basis are structured in a radial manner. This can be demonstrated using image binarization.
We can see that binarizing of the NNMF basis images shows them as mandala layers. In other words, using NNMF we can convert the mandalas of the seed segment rotation algorithm into mandalas generated by an algorithm that superimposes layers of different types.
Blending with image bases samples
In this section we just show different blending images using the SVD, ICA, and NNMF bases.
What would be the outcomes of the above procedures applied to mandala images found in the World Wide Web (WWW)?
Those WWW images are most likely man-made or curated.
The short answer is that the results are not that good. Better results might be obtained using a larger set of WWW images (than just 100 in the experiment results shown below.)
Here is a sample from the WWW mandala images:
Here are the results obtained with NNMF basis:
Future plans
My other motivation for writing this document is to set up a basis for further investigations and discussions on the following topics.
Having a large image database of "real world", human made mandalas.
Utilization of Neural Network algorithms for mandala creation.
Utilization of Cellular Automata for mandala generation.
Investigate mandala morphing and animations.
Making a domain specific language of specifications for mandala creation and modification.
The idea of using machine learning algorithms for mandala image generation was further supported by an image classifier that recognizes fairly well (suitably normalized) mandala images obtained in different ways:
Here are the bases built for the two different classifiers:
Singular Value Decomposition (SVD)
Non-Negative Matrix Factorization (NNMF)
Here are the confusion matrices of the two classifiers:
SVD
NNMF
The blog post "Classification of handwritten digits" (published 2013) has a related, more elaborate discussion over a much smaller database of handwritten digits.
Concrete steps
The concrete steps taken in scripts and documents of this project follow.
Ingest the binary data files into arrays that can be visualized as digit images.
We have two sets: 60,000 training images and 10,000 testing images.
Make a linear vector space representation of the images by simple unfolding.
For each digit find the corresponding representation matrix and factorize it.
Store the matrix factorization results in a suitable data structure. (These results comprise the classifier training.)
One of the matrix factors is seen as a new basis.
For a given test image (and its linear vector space representation) find the basis that approximates it best. The corresponding digit is the classifier prediction for the given test image.
Evaluate the classifier(s) over all test images and compute accuracy, F-Scores, and other measures.
Scripts
There are scripts going through the steps listed above:
I figured out first in R how to ingest the data in the binary files of the MNIST database. There are several online resources (blog posts, GitHub repositories) that discuss the ingestion of the MNIST binary files.
After that making the corresponding code in Mathematica was easy.
Classification results
The classification results are the same in Mathematica and R for SVD and NNMF. (As expected.)
Both Mathematica and R have relatively simple set-up of parallel computations.
Graphics
It was not very straightforward to come up in R with visualizations for MNIST images. The Mathematica visualization is much more flexible when it comes to plot labeling.
Going further
Comparison with other classifiers
Using Mathematica’s built-in classifiers it was easy to compare the SVD and NNMF classifiers with neural network ones and others. (The SVD and NNMF classifiers are much faster to build and they bring comparable precision.)
It would be nice to repeat that in R using one or several of the neural network classifiers provided by Google, Microsoft, H2O, Baidu, etc.
In a previous blog post, [1], I compared Principal Component Analysis (PCA) / Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NNMF) over a collection of noised images of digit handwriting from the MNIST data set, [3], which is available in Mathematica.
This blog post adds to that comparison the use of Independent Component Analysis (ICA) proclaimed in my previous blog post, [1].
Computations
The ICA related additional computations to those in [1] follow.
PCA/SVD produces somewhat good results, but NNMF often provides great results. The factors of NNMF allow interpretation of the basis vectors of the dimension reduction; PCA in general does not. This can be seen in the example below.
Data
First let us get some images. I am going to use the MNIST dataset for clarity. (And because I experimented with similar data some time ago — see Classification of handwritten digits.)
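A minimal sketch of getting such images; the ExampleData property names may differ across versions, and the digit classes and sample size are illustrative:
mnistTrain = ExampleData[{"MachineLearning", "MNIST"}, "TrainingData"];
imgs = Keys[RandomSample[Select[mnistTrain, MemberQ[{0, 4}, Last[#]] &], 100]];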
The blog post “Statistical thesaurus from NPR podcasts” discusses an example application of NNMF and has links to documents explaining the theory behind the NNMF utilization.
The rows of H are interpreted as new basis vectors and the rows of W are the coordinates of the images in that new basis. Some appropriate normalization was also done for that interpretation. Note that we are using the non-normalized image matrix.
Let us see the norms of H and mark the top outliers:
norms = Norm /@ H;
ListPlot[norms, PlotRange -> All, PlotLabel -> "Norms of H rows",
PlotTheme -> "Detailed"] //
ColorPlotOutliers[TopOutliers@*HampelIdentifierParameters]
OutlierPosition[norms, TopOutliers@*HampelIdentifierParameters]
OutlierPosition[norms, TopOutliers@*SPLUSQuartileIdentifierParameters]
Here is the interpretation of the new basis vectors (the outliers are marked in red):
Often we cannot just rely on outlier detection and have to hand pick the basis for reconstruction. (Especially when we have more than one class of signals.)
Comparison
At this point we can plot all images together for comparison:
Usually with NNMF, in order to get good results, we have to do more than one iteration of the whole factorization and reconstruction process. And of course NNMF is much slower. Nevertheless, we can see the clear advantage of NNMF’s interpretability and leverage it.
Gallery with other experiments
In those experiments I had to hand pick the NNMF basis used for the reconstruction. Using outlier detection without supervision would not produce good results most of the time.
Further comparison with Classify
We can further compare the de-noising results by building signal (digit) classifiers and running them over the de-noised images.
For such a classifier we have to decide:
do we train only with images of the two signal classes or over a larger set of signals;
how many signals we train with;
with what method the classifiers are built.
Below I use the default method of Classify with all digit images in MNIST that are not in the noised images set. The classifier is run over the de-noised versions of these 0-4 images:
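A minimal sketch of that comparison, assuming denoisedImages holds the de-noised images and mnistTrain the image-to-digit rules from above (the exclusion of the noised originals from the training data is omitted here):
cf = Classify[mnistTrain]; (* default method *)
Tally[cf /@ denoisedImages]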
Five months ago I worked with transcripts of National Public Radio (NPR) podcasts. The transcripts are available at http://www.npr.org — see for example “From child actor to artist…“.
Using nearly 5000 transcripts I experimented with topic extraction and statistical thesaurus derivation. The topics are too bulky to show here, but I am going to show some of the statistical thesaurus entries.
First let me describe the data. The collection has 5123 transcripts.
Here is a sample of the transcripts (only the first 400 characters of each are taken):
Here is the distribution of the string lengths of the transcripts:
I removed custom selected stop words from the transcripts. I also stemmed the words using the stemmer called snowball, see http://snowball.tartarus.org. The stemmed words are called “terms” below.
Here are descriptive statistics and the distribution of the number of transcripts per term:
Here are descriptive statistics and the distribution of the number of terms per transcript:
I did not compute the whole statistical thesaurus. Instead I made a function that computes the thesaurus entry of a given word using the right NNMF factor with proper normalization.
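Here is a minimal sketch of such a function, assuming H is the NNMF topics-by-terms factor and terms is the list of term names aligned with H’s columns; the normalization and the number of returned entries are illustrative:
Clear[ThesaurusEntry];
ThesaurusEntry[word_String, n_Integer: 12] :=
  Module[{j, cols, sims},
   j = First[FirstPosition[terms, word]];
   (* unit-normalize the term columns of H and rank terms by similarity to the given term *)
   cols = Map[If[Norm[#] > 0, #/Norm[#], #] &, Transpose[H]];
   sims = cols.cols[[j]];
   terms[[Reverse[Ordering[sims, -n]]]]
  ];
ThesaurusEntry["retriev"] (* the terms are word stems *)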
Here are sample results of the thesaurus entry “retrieval” (note that the right column contains word stems):
In this blog post I show some experiments with algorithmic recognition of images of handwritten digits.
I followed the algorithm described in Chapter 10 of the book “Matrix Methods in Data Mining and Pattern Recognition” by Lars Elden.
The algorithm described uses the so called thin Singular Value Decomposition (SVD).
Training phase
1.1. Rasterize each training image into an array of 16 x 16 pixels.
1.2. Each raster image is linearized — the rows are aligned into a one-dimensional array. In other words, each raster image is mapped into the vector space R^256. We will call these one-dimensional arrays raster vectors.
1.3. From each set of images corresponding to a digit make a matrix with 256 columns of the corresponding raster vectors.
1.4. Using the matrices from step 1.3, use thin SVD to derive orthogonal bases that describe the image data for each digit.
Recognition phase
2.1. Given an image of an unknown digit derive its raster vector, R.
2.2. Find the residuals of the approximations of R with each of the bases found in 1.4.
2.3. The digit with the minimal residual is the recognition result.
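Here is a minimal sketch of the two phases above, assuming trainVecsByDigit is an association from digit to a matrix whose rows are that digit’s training raster vectors; the number of basis vectors k and the test vector are illustrative:
Clear[MakeDigitBases, RecognizeDigit];
MakeDigitBases[trainVecsByDigit_Association, k_Integer: 15] :=
  Map[SingularValueDecomposition[N[#], k][[3]] &, trainVecsByDigit]; (* columns of V span each digit's subspace *)
RecognizeDigit[bases_Association, v_?VectorQ] :=
  First[Keys[TakeSmallest[Map[Norm[v - #.(Transpose[#].v)] &, bases], 1]]]; (* minimal residual *)
aBases = MakeDigitBases[trainVecsByDigit];
RecognizeDigit[aBases, testRasterVector]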
The algorithm is programmed very easily with Mathematica. I did some experiments using training and test digit drawings made with the iPad app Zen Brush. I applied both the SVD recognition algorithm described above and decision trees, in the same way as described in the previous blog post.
Here is a table of the training images:
And here is a table of the test images:
Note that the third row is with images drawn with a thinner brush, and the fourth row is with images drawn with a thicker brush.
Here are raster images of the top row of the test drawings:
Here are several plots showing raster vectors:
As I mentioned earlier, raster vectors are very similar to the wave samples described in the previous blog post, so we can apply decision trees to them.
The SVD algorithm misclassified only 3 images out of 36 test digit images, success ratio 92%. Here is a table with digit drawings and list plots of the residuals:
It is interesting to look at the residuals obtained for different recognition instances. For example, the plot on the first row and first column for the recognition of a drawing of “2” shows that the residual corresponding to 2 is the smallest and the residual for 8 is the next smallest one. The residual for 2 is the clear outlier. On the second row and third column we can see that a drawing of “4” has been classified correctly as 4, but the residual for 9 is very close to the residual for 4; we almost had a misclassification. We can see that for the other three test images with “4” the residuals for 4 are clearly separated from the rest, which can be explained by those “4”s being drawn more slanted and their angles being more pronounced. Examining the misclassifications in a similar way explains why they occurred.
Here are the misclassified images:
Note the misclassified image of 7 is quite different from the training images for 7.
The decision tree misclassified 42% of the images; here is a table of them:
Note that the decision trees would probably perform better if a larger training set is used, not just nine drawings per digit. I also experimented with building the classifiers over the “negative” images and aligning the columns of the raster images instead of aligning the rows. The classification results were not better.
Some details about the image preprocessing follow.
As I said, I drew the images using the Zen Brush app. For each digit I drew nine instances on Zen Brush’ canvas and exported to an image — here is an example:
Then I used Mathematica’s function ImagePartition to partition the image into 9 single digit drawings, and then applied ImageCrop to all of them. Of course the same procedure is done for the testing images.