# Simplifying trees via grammar induction

# Background

You have some important data and you’re eager to share it with thinkers and decision-makers. Like a lot of important data, your data is hierarchical, structured like a tree — maybe it’s nested investment portfolios, or supply chain information.

It would be nice if this data could be easily presented on a slide, but the tree is often too big and awkward, with too many nodes; few people can get much usable information from looking at a tree diagram that might have hundreds of tiny nodes. So, you need to simplify that tree data in order to produce a good visualization.

This is a very common scenario.

You could create a hand-made infographic, of course, but that introduces bias and opinion and it doesn’t scale. The challenge, then, is to algorithmically reduce the complexity of the tree, such that it can be put on a slide and understood quickly, while retaining as much as possible of its structure.

This is a difficult and open-ended data visualization problem, one that resides on the very boundary of art and science, which is always an interesting place to work. Happily there are several very effective algorithmic approaches for us to consider. Something I find especially interesting is that grammar induction algorithms, usually associated with data compression, can be extremely useful in this area of data visualization.

In this article, I’ll:

- Present a suitable example dataset
- Explore relative straightforward tree simplification algorithms
- Introduce grammar induction as a family of algorithms
- Show how applying grammar induction to the example data can produce a useful result.

As ever, code is available on request.

# Example data

Our example data for this article needs to be some tree-like data from the real world that is:

- Sufficiently complex and ugly that it cannot easily be visualized as-is
- Sufficiently small that the resulting trees fit in the article nicely
- Boring

Why should the example data be boring? Well, if the data is interesting, we’ll be tempted to use our knowledge of the *meaning *of the data to simplify the tree — whereas for a purely algorithmic approach we can’t do that.

Our example tree, then, will be ugly and boring; it is a somewhat anonymized subset of the parking revenue of a local government unit in the UK. Each node represents some grouping of parking revenue sources, with some amount of total revenue (represented by node size). The full tree is like this:

Fig. 1: Original tree data

As it stands, this tree is too complex to put on a slide and show to decision-makers. We will now look at algorithmic approaches to simplifying it.

# Simple algorithms

First, I’d like to look at some of the more straightforward — but still useful — algorithms we can employ.

## Node-filtering algorithms

Perhaps the most naive approach to tree simplification is to simply filter out some of the nodes based on a rule. Let’s see what our parking revenues tree looks like when we filter out all nodes with a revenue lower than a particular threshold.

This isn’t very useful at all. It loses lots of important structure; the fact that apparently parking is divided up into the three towns of Framley, Wickton and Alderley is lost, and worse yet, entire categories of revenue have vanished with no warning to the decision-maker that they were ever there. Perhaps if we set a more generous threshold and remove only the very small nodes, the result will be better.

This is a very poor result, demonstrating a common problems with algorithmic data simplification — what’s kept is not always what’s important. The simplified tree has the same problems as our first attempt, but now it retains strange aspects of the original tree — such as those long strings of one-child nodes — that probably obscure the information content even further.

Node-filtering algorithms are useful when simplifying very large graph (see this paper for an example), but they aren’t necessarily a good fit for visually summing up a (comparatively) small tree.

## Depth-limiting algorithms

Another basic approach to tree simplification is to reduce the depth of the tree, such that no node can be more than *n* nodes from the root. In theory, we’d expect this to often yield fairly good results, because it should preserve nodes that represent large aggregations of data, and remove their less-important descendants. Let’s try limiting the depth to 5 nodes.

This isn’t quite as awful as the simple node-filtering algorithm but it has major flaws. For one thing, it compresses the tree so that all aspects of it appear to have the same depth — hiding the fact that apparently some types of parking revenue are more complex than others. Another problem is that it gives the illusion that ‘Framley’ is in some way the same ‘kind’ of node as ‘outstanding’, by presenting them both as leaf nodes. It also retains lots of detail under ‘Fines’, just because ‘Fines’ happened to already have a lot of nodes at depth 5.

Note that there are several less naive ways to limit tree depth; two popular ones are:

**Leaf pruning**: eliminate all leaf nodes. Repeat until max depth is as desired. This does a better job of retaining some of the shape of the tree.**Bottom-up depth limiting**: limit depth by removing nodes from near the root, instead of removing nodes far from the root. This approach is quite complex but yields good results for certain use cases.

## Breadth-limiting algorithms

Now we try a somewhat more computationally challenging set of approaches. There are various breadth-limiting algorithms, which generally rely on two steps:

- calculate a maximum number of nodes that can exist at a particular depth
- at each depth, remove, move, or combine sibling nodes until that maximum is no longer exceeded

For example, let’s try a breadth-limiting approach in which sibling nodes representing less revenue are combined, into an ‘Other’ node, until no node has more than two siblings.

That worked much, much better than the filtering and depth-limiting approaches! For the first time, something of the shape of the tree has been preserved (never mind that the visual spacing isn’t very good — that’s just a layout issue). Perhaps the result would be even better if we set harsher parameters, forcing more nodes to be combined into ‘Other’.

Well, never mind. But the breadth-limiting algorithm with moderate settings was certainly effective.

## Topology-preserving algorithms

When simplifying any graph, it’s possible to preserve the graph’s overall topology by removing any order-2 nodes — i.e. nodes that have exactly 2 connections. That’s only one kind of topology-preserving simplification algorithm, but it’s a powerful one. When applied to trees, it comes down to removing all nodes that have exactly 1 child node, preserving the child node and joining it to its grandparent.

This is called ‘singleton removal’, it’s a very simple and widely-useful algorithm, and in our case the result is this:

This hasn’t affected most of the tree but it’s really helped with those long ‘strings’ that were there in the original tree and which seemed to convey very little useful information. By itself it hasn’t produced a useful tree, but perhaps it will shine in combination with other algorithms.

## Stacking algorithms

All tree simplification algorithms can be stacked — applied sequentially, hopefully improving the visualization each time. The order in which they are applied is extremely important. Applying successive simplifications to a tree is rather like applying successive lossy compression algorithms to an image; the details of how it’s done can determine whether the useful information is kept and the noise thrown away… or vice versa.

In this example, the tree has been subjected to first a minimum amount filter on each node, and then a depth-limiting step:

This isn’t perfect but it’s much better than what either approach produced by itself. Some algorithms (such as singleton removal) are very often best performed as a final step; some work better as a first step.

# Complex algorithms

We have looked at various fairly simple and intuitive approaches to tree simplification and seen how they can be combined. These approaches are useful but one thing they have in common, which limits their power, is that they do not take the *whole tree* into account. Each one looks at a specific node, or at a set of siblings, and decides whether to perform a node removal or a node combining operation. We could achieve much better results, even on our dull parking revenue tree, by using algorithms that take the whole structure of the tree into account.

Grammar induction algorithms are one family of algorithms that can achieve this.

## What is grammar induction?

Grammar induction is the process of deriving a grammar from a set of observations such that the grammar can then, in turn, produce the observations.

If this doesn’t sound useful, consider that it’s what you do when you compress a file; the algorithm in your compression program is generating a grammar (in some sense) which can then be applied to some small amount of starting data to re-create the original file.

Grammar induction also has many other uses. The grammar can be interpreted as a catalog of what the important (i.e. frequently recurring) parts of the input are; this can allow us to distinguish more-random or more-repetitive areas of data in large unstructured datasets, which has lots of uses in forensics, telecoms, and the military.

## Grammar induction over sequential data: LZW

Perhaps the best-known and most widely used grammar induction algorithm is Lempel-Ziv-Welch, aka LZW, the algorithm used in .gif files. LZW achieves speed by processing a list of input symbols sequentially, building up a dictionary of commonly-encountered symbol sequences as it goes. This is a very fast way to compress a file, but not that relevant for data visualization, as there’s nothing particularly sequential about out tree data.

## Grammar induction over random-access data: Sequitur

Sequitur is a simple yet brilliant algorithm that operates over a list of input symbols which need not be accessed sequentially. As a result, Sequitur and its derivatives are able to build a tree-like grammar rather than a dictionary; the nodes in this tree represent commonly recurring patterns in the input. Sequitur is dramatically useful for interpreting lists of symbols about which nothing is known.

## Grammar induction over a tree

In our case, though, we *start *with a tree, so much of the work Sequitur does is not useful to us. In fact, by starting with a tree, we’ve been handed a grammar on a plate; it’s just not a very useful grammar since the grammar tree has one node for every node in the original tree. But we can process that grammar in several ways to shorten the grammar — thus producing a simpler tree that has many of the properties of the original tree. One thing we can do is to identify the important nodes in that grammar, the nodes that would be applied most if we were actually using the grammar to reproduce the original tree. To do that, we’ll be using an algorithm of this form:

- Define a ‘subtree summary’ function whose input is a subtree and whose output is a summary — a data item that can be compared and that represents the structure of the subtree.
- Apply that function to each node in the tree.
- Examine the tree and compare the subtree summaries with a ‘subtree match’ function. Identify nodes whose summaries match.
- Replace those nodes with placeholder nodes indicating that a recurring structure was found.
- Go back to step 2, and repeat as many times as desired.

A huge amount of parameterization is possible here.

(Note to lazy and curious readers — yes, it would also be possible to serialize the tree as a string, run Sequitur on it, and then further process the Sequitur-generated tree, which for large real-world use cases will usually be much smaller than the original tree. This approach has some problems, though, in many real-world use cases, because Sequitur can’t easily do a ‘fuzzy match’, whereas in real life we’d often want to identify close but not completely similar subtrees.)

## Applying grammar induction to our example tree

For our simple example, we’ll use these functions and parameters:

- Subtree summary function: the ‘summary’ is a string, created by serializing the subtree, including the names of subtree nodes. In a real-world use case, it would usually be better not to incorporate node names into the summary.
- Subtree match function: We compare the strings to see if they are equal.
- Other parameters: We make only one pass through the tree, and all nodes that match other nodes are replaced with summary nodes.

Summary nodes are given names ending in ‘…’, so you can see which ones they are. If we were bothering with proper layout, we’d do something more complicated.

The result is as follows:

It hasn’t done anything about those long strings of singleton nodes, but that’s not a problem as we already have a solution for those.

What the grammar function *has *done successfully is realize that the various roads and zones are all the ‘same kind of thing’ and replace them with placeholders, ignoring the frequently-repeated details about residents, trade, machines and so on. As a result, the tree has fewer nodes, but the important structures are mostly retained.

Essentially, we’ve applied a level of grammar induction based compression to the tree but then (because our aim is to visualize a simplified tree, not to re-create the original tree) we’ve thrown out the grammar itself, i.e. the tree of subtree summaries.

# Simplifying the example tree

Now let’s bring all the techniques together, stacking our simple grammar induction algorithm with some of the other algorithms we’ve discussed. Here’s what I think is a pretty good simplification of the tree, using grammar induction and singleton removal:

And here’s another interpretation, drastically reducing the node count with a combination of grammar induction, singleton removal, and breadth reduction, to produce a greatly simplified tree that still has much of the shape of the original:

It’s not perfect — and the visuals could certainly do with improvement — but it’s an awful lot more suitable for presenting on a slide than our original tree.

# Next Steps

I hope you enjoyed those algorithms. Of course, algorithms aren’t everything; there are many important techniques which you’d use alongside purely topographical algorithms and which this article has *not* touched on:

**Visuals and layout**. There is a huge amount that can be done to make a tree clearer at the actual display stage.**Semantic processing of the tree**. In real life, domain knowledge or just common-sense knowledge of what the tree represents is very important.**Data shaping and data quality**. Like many real-world data sets, our parking revenue data set is ugly, with long strings of singletons, unnecessary detail, and incomprehensible node labels in some places. Fixing the actual data upstream, when possible, is a powerful tool for producing better decision support downstream!

Nevertheless, in this article, I’ve presented a catalog of tree simplification approaches, all of which are fully automatic and can be used together to reduce the node count of a tree while preserving its shape, thus helping to visualize the data and make decisions.

I’ve also tried to demonstrate how compression algorithms are one of the many powerful techniques that can help summarize and visualize complex data. Over the last decade, humankind has collected petabyte after petabyte of complex graph-like data, and yet our exploration of techniques for visualizing and understanding that data is only just beginning.

Thank you for reading.