Alluvial plots and Sankey diagrams: what’s the difference?
Alluvial plots and Sankey diagrams are both useful visualization tools. They look superficially similar, but internally they are completely different solutions with very different underlying assumptions about the data. This post summarizes the differences, both in how they are made and in when they are used.
First, examples of an alluvial plot and a Sankey diagram:
These diagrams have some superficial features in common, which leads to confusion. For example:
- In both diagrams, nodes (or the nearest equivalent to nodes) are represented as vertical bars whose height conveys the idea of ‘count’ or ‘volume’.
- In both diagrams, links between nodes consist of curves whose thickness is related to the height of the nodes and encodes ‘count’ or ‘volume’ just as the node height does.
- In both diagrams, the user will usually need to highlight a ‘stripe’ across the diagram and see how wide that stripe is at various points.
But here, the similarities end.
The two diagrams above say very different kinds of things. The alluvial plot example shows what proportion of a population has various characteristics: for example, we can easily see that Sex=Female and Survived=Yes are closely associated, giving a large population who are both female and survivors. In other words, this diagram takes a set of facts (voyagers on the Titanic) and breaks them down by a set of categorical dimensions (sex, survival, etc.). Nodes are lined up in columns because each node represents one possible value of a dimension and must sit alongside the other possible values of that dimension. This is a typical alluvial plot use case.
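To make the data shape behind an alluvial plot concrete, here is a minimal Python sketch. The records, the dimension names, and the helper functions `node_heights` and `ribbon_widths` are all invented for illustration; the numbers are not real Titanic figures, and this is not how any particular plotting library works.

```python
from collections import Counter

# Hypothetical records: each one is a "fact" described by categorical dimensions.
# The values are invented for illustration and are NOT real Titanic figures.
records = [
    {"sex": "Female", "class": "1st", "survived": "Yes"},
    {"sex": "Female", "class": "3rd", "survived": "Yes"},
    {"sex": "Female", "class": "3rd", "survived": "No"},
    {"sex": "Male", "class": "1st", "survived": "Yes"},
    {"sex": "Male", "class": "3rd", "survived": "No"},
    {"sex": "Male", "class": "3rd", "survived": "No"},
]


def node_heights(records, dimension):
    """Count the records behind each value (node) in one dimension (column)."""
    return Counter(r[dimension] for r in records)


def ribbon_widths(records, left, right):
    """Count the records shared by each pair of values in two columns."""
    return Counter((r[left], r[right]) for r in records)


# The column order is a presentation choice; any order is valid.
dimensions = ["sex", "class", "survived"]
heights = {d: node_heights(records, d) for d in dimensions}
ribbons = ribbon_widths(records, "sex", "survived")
```

Every column in this sketch adds up to the same total, because each column is just a different way of slicing the same population. A question like ‘how many are female and survived?’ is then a single lookup such as `ribbons[("Female", "Yes")]`.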
By contrast, the Sankey diagram shows a flow from energy producers on the left to energy consumers on the right. We don’t learn anything about what dimensions or properties energy might be associated with; we learn how it moves from one form to another. This diagram takes a set of objects (generators, consumers, etc.) and draws the quantity of flow between them. Because the diagram is not broken into dimensions, nodes can appear at any point from left to right, and need to be laid out as we would lay out any flow or process diagram.
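The input to a Sankey diagram looks quite different: it is essentially a weighted edge list. The sketch below assumes a hypothetical set of energy flows (names and quantities are invented) and a made-up helper, `node_totals`, to show how node heights could be derived from the links rather than from a population of records.

```python
from collections import defaultdict

# Hypothetical flows as (source, target, quantity); names and numbers are invented.
flows = [
    ("Coal", "Electricity", 40.0),
    ("Gas", "Electricity", 25.0),
    ("Gas", "Heating", 30.0),
    ("Electricity", "Homes", 35.0),
    ("Electricity", "Industry", 20.0),
    ("Electricity", "Losses", 10.0),
    ("Heating", "Homes", 30.0),
]


def node_totals(flows):
    """Sum the quantity entering and leaving each node."""
    inflow = defaultdict(float)
    outflow = defaultdict(float)
    for source, target, quantity in flows:
        outflow[source] += quantity
        inflow[target] += quantity
    return dict(inflow), dict(outflow)


inflow, outflow = node_totals(flows)
nodes = set(inflow) | set(outflow)

# One common convention (assumed here): a node is as tall as the larger of
# what flows into it and what flows out of it.
node_height = {n: max(inflow.get(n, 0.0), outflow.get(n, 0.0)) for n in nodes}
```

Note that nothing here is a ‘dimension’: a node such as `"Electricity"` is both a target and a source, which is exactly what an alluvial column cannot express.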
To summarize, then:
An alluvial plot:
- Shows how a population of facts is allocated across categorical dimensions.
- Left/right position has no particular significance; dimensions could be in any order.
- ‘Nodes’ are lined up in columns.
- Is useful for showing how features of a population are related — for example, answering questions like ‘how many people have features A and B, compared to how many have B but not A?’
A Sankey diagram:
- Shows how quantities flow from one state to another.
- Left/right position shows movement or change.
- ‘Nodes’ could be anywhere, and must be laid out by an algorithm.
- Is useful for showing flows or processes where the amount, size, or population of something needs to be tracked — for example, answering questions like ‘out of the energy in system A, how much came from systems B and C and where will most of it go?’
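The point that Sankey nodes ‘must be laid out by an algorithm’ can be illustrated with one simple approach: place each node in the column given by its longest path from any source node. This is a sketch under that assumption, not a description of any specific library’s layout; the function name `assign_columns` and the example links are hypothetical, and the sketch only handles flows without loops.

```python
from collections import defaultdict, deque


def assign_columns(links):
    """Map each node to a column index: its longest path from any source node.

    `links` is an iterable of (source, target) pairs. Assumes no cycles.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for source, target in links:
        successors[source].append(target)
        indegree[target] += 1
        nodes.update((source, target))

    column = {n: 0 for n in nodes}
    # Start from the nodes nothing flows into, then work left to right.
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in successors[node]:
            # A node sits one column to the right of its furthest-right input.
            column[nxt] = max(column[nxt], column[node] + 1)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if visited != len(nodes):
        raise ValueError("links contain a cycle; this sketch handles acyclic flows only")
    return column


# Hypothetical links, invented for illustration.
columns = assign_columns([
    ("Gas", "Electricity"),
    ("Gas", "Heating"),
    ("Electricity", "Homes"),
    ("Heating", "Homes"),
    ("Gas", "Homes"),
])
```

In an alluvial plot no such step is needed, because the column of every node is fixed by the dimension it belongs to.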
These are profoundly different types of diagram. The alluvial plot is a member of a large family of visualizations designed to show facts across several dimensions at once, whereas the Sankey diagram is at heart a general-purpose flow diagram in which the height of elements is bound to some concept like ‘amount’. In fact, the term ‘Sankey diagram’ really refers to any diagram in which the width of a link is proportional to the quantity visualized; Sankey diagrams with loops and non-linear layouts exist.