# Tries with frequencies for data mining

In computer science a trie (or a prefix tree) is a data structure that allows quick retrieval of associations between a set of keys and a set of values. It is assumed that each key is a sequence of elements. (Most often the keys are considered to be strings i.e. sequences of characters.) The trie nodes do not store the keys. Instead the position of a node in a trie corresponds to a key.

For example, if we have the list of words

{“war”, “ward”, “warden”, “work”, “car”, “cars”, “carbs”}

we can make the trie:

We can further shrink the trie in order to find the internal “prefixes”:

Another transformation of a trie is to replace the numbers of appearances with probabilities — each value of a node is the probability to arrive to that node from the parent:

Using this last trie we can take a Bayesian perspective and ask questions like: Given that a word starts with “ca” what is the probability the word to end with “s” or “r”? (Within the set of words we built the trie with.)

This document has an introduction to tries with frequencies and it is also a guide to using the package described below. The document has lots of examples and illustrative graph plots; it also discusses how to answer the Bayesian questions.

Several years ago I started experimenting with applying tries to data mining problems. During the last few months I productized my Mathematica trie implementations into a package — see the package TriesWithFrequencies.m provided by the MathematicaForPrediction project at GitHub. The package is for creation, manipulation, and visualization of tries in which the key values are frequencies of appearance in the data that is data mined. The package can also be used to build and use tries as associative arrays.

Using tries for data mining comes from the general idea of utilizing NLP techniques and algorithms in non-language problems. Such an utilization is possible because, philosophically speaking, in the current Big Data + Facebook phase of our civilization every entity can be, and it is, characterzied with tags. If we have a collection of events and they are tagged properly with a good system of tags we can data mine sequences of events with prefix trees.

## 7 thoughts on “Tries with frequencies for data mining”

This site uses Akismet to reduce spam. Learn how your comment data is processed.