Tries with frequencies for data mining

In computer science a trie (or a prefix tree) is a data structure that allows quick retrieval of associations between a set of keys and a set of values. It is assumed that each key is a sequence of elements. (Most often the keys are considered to be strings i.e. sequences of characters.) The trie nodes do not store the keys. Instead the position of a node in a trie corresponds to a key.

For example, if we have the list of words

{“war”, “ward”, “warden”, “work”, “car”, “cars”, “carbs”}

we can make the trie:

We can further shrink the trie in order to find the internal “prefixes”:
Shrunk trie

Another transformation of a trie is to replace the numbers of appearances with probabilities — each value of a node is the probability to arrive to that node from the parent:

Shrunk trie with probabilities

Using this last trie we can take a Bayesian perspective and ask questions like: Given that a word starts with “ca” what is the probability the word to end with “s” or “r”? (Within the set of words we built the trie with.)

This document has an introduction to tries with frequencies and it is also a guide to using the package described below. The document has lots of examples and illustrative graph plots; it also discusses how to answer the Bayesian questions.

Several years ago I started experimenting with applying tries to data mining problems. During the last few months I productized my Mathematica trie implementations into a package — see the package TriesWithFrequencies.m provided by the MathematicaForPrediction project at GitHub. The package is for creation, manipulation, and visualization of tries in which the key values are frequencies of appearance in the data that is data mined. The package can also be used to build and use tries as associative arrays.

Using tries for data mining comes from the general idea of utilizing NLP techniques and algorithms in non-language problems. Such an utilization is possible because, philosophically speaking, in the current Big Data + Facebook phase of our civilization every entity can be, and it is, characterzied with tags. If we have a collection of events and they are tagged properly with a good system of tags we can data mine sequences of events with prefix trees.


4 thoughts on “Tries with frequencies for data mining

  1. Pingback: Mosaic plots for data visualization | Mathematica for prediction algorithms

  2. Pingback: Removal of sub-trees in tries | Mathematica for prediction algorithms

  3. Pingback: Mathematica vs. R at GitHub | Mathematica for prediction algorithms

  4. Pingback: Tries with frequencies in Java | Mathematica for prediction algorithms

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s