Automatically inferring relational structure from files
Background
They’ve called you in because they need insight. A plan for the future, perhaps, or a critical evaluation of the present — either way, your vision will need to be supported by serious concrete evidence. And to achieve that, you’ll need to source some raw data and turn that data into knowledge fast.
What lands on your desk, in practice, is likely to be a bundle of CSV files, hastily assembled by business analysts, or discovered in a dusty server in the basement.
We need to know what these files mean, but we can’t easily employ standard ML algorithms for that purpose, because the kind of business data we’re likely to be looking at is highly categorical — the nouns will be legal entities, staff members, sales, products and customers rather than numerical observations.
In this article, we’ll look at how to quickly achieve insight into the underlying structure of those CSV files, using Python, pandas, and NetworkX. Back-end code (not especially neat) for the functionality discussed in the article is available on request.
Use Case Example
For the sake of argument let’s take this realistic (and realistically dull…) CSV file describing sales.
Clearly some of these columns have particular relationships to each other. A given ‘Country’ is probably always in the same ‘Region’. ‘Region’ and ‘RegID’ seem to be roughly the same thing. It’s not immediately clear whether ‘Currency’ always matches ‘Country’ or ‘City’, or whether ‘ProdGrp’ has any relationship to anything else. ‘SaleID’ looks like a good candidate for a primary key.
What we want is a tool that will produce something like my pencil-and-paper analysis:
Only we want it instant, correct, and presentable. The rest of this article describes an approach to solving this problem.
There are lots of techniques we can use to infer an approximation of the data structure, but broadly we’ll need a few basic building blocks:
- a way to measure how much one column depends on another
- a way to build a graph of column relationships based on that measure
- a way to simplify and present that graph
As a matter of fact, those above three steps are characteristic of many, if not all, challenges around the visualization of relationship information. We’ll now look in turn at defining a measure, building a graph, and presenting useful conclusions from the graph.
Measures
This section considers ways to measure relatedness between columns, and proposes a custom measure, the ‘wharf coefficient’. To infer a relational structure, we mostly want to answer the questions:
- ‘Is column A likely to be a key for columns B, C and D?’
- ‘Is column A a dimension of which column B might be an attribute or subdimension?’
Correlation, Chi-Squared and their friends
We need a measure that tells us whether, and how much, columns are related. Our ML instincts might tell us to look for correlations between the columns. This isn’t very effective, though, since pure correlation isn’t really applicable between categorical variables.
So, we look for a categorical version of correlation. That’s what the common chi-squared measure is; it provides us with a numerical measure of the independence of two categorical variables.
Fig 3: Chi-squared formula
In practice, chi-squared is a horrible measure of dependence in the relational sense as it is not scaled and it’s hard to know how we should react differently to a larger or smaller chi-squared value. To find a scaled measure of relatedness between columns, we can use a measure called Cramer’s V; this measure is a scaled chi-squared:
Wikipedia suggests an additional scaling step for Cramer’s V here, which results in what is probably the ‘most correct’ correlation-like measure for categorical fields. And yet, in practise it’s not that useful for the current task.
The trouble with these measures is that they are reciprocal measures of relatedness. In order to answer our questions, we want a directional measure; we’re looking for asymmetrical relationships in which column A implies column B but not (usually) vice versa.
Bayesian
To measure the degree to which column A implies column B, we could use Bayesianlogic. This has two great benefits:
- It produces a ‘confidence level’, which is about what we’re looking for; we want to know how confident we can be about column B’s value given column A’s value.
- It is a one-way relationship that easily expresses the case in which A implies B but not vice versa.
In practice, Bayesian logic is useful for our problem, but it has some drawbacks. In particular, most available implementations of Bayesian models don’t work especially naturally with categorical inputs. For example, to use the very useful Bernoullian Bayes implementation in sklearn, We have to use what’s called ‘one hot’ encoding to represent our input columns as a set of Booleans:
This is a particular case of the general problem of transforming categorical data into a state suitable for vanilla ML algorithms, which is an interesting problem in its own right (see this Towards Data Science articlefor an introduction).
While Bayesian approaches should certainly not be rejected for structure inference — in fact, they can be very useful — the aim here is to build a quick, simple tool. In particular, to infer structure across the whole file, which might be large, we’ll need to compare every column to almost every other, and that would take time with Bayes. Furthermore, the Bayesian confidence level produces a number that’s hard to interpret when one column almost implies the value of another.
Wharf Coefficient
For these reasons, for the task at hand it’s worth suggesting a new measure, the ‘wharf coefficient’ (so called because it’s useful in Canary Wharf, a part of London that is rich in large CSV files). This is a simple measure answering the question:
‘If I guessed the value of column B using the value of column A, what proportion of the time would I be correct?’
The wharf coefficient is found as follows:
…where a are the possible values of our target column, b are the possible values of our dependent column, and Nba is the number of records that have a particular pair of values in the two columns. In other words, it’s the sum of the most frequent b values for each possible a value.
The wharf coefficient is, of course, 1 when column A implies column B. If it’s marginally below 1, that suggests that column A implies column B except for a specific number of data errors. If the value is lower, then some correlation between the columns is suggested and some tuning will be required to make a good guess as to whether there is an actual relationship.
An advantage of the wharf coefficient is that it can be expressed as a single fast-running line of pandas (strange mental image, there) which is ideal for our use case.
w = (df.groupby([cb,ca])[ca].count().max(level=0)).sum()/df.shape[0]
Another advantage is that it expresses a business-meaningful truth: ‘If you assume column A does imply column B, you’ll get the wrong answer exactly (1-w)*100 percent of the time’.
Building a graph
Having selected a measure of the dependence of one column on another, we go on to build a graph of the inferred underlying data structure. We need the graph because in order to extract knowledge from the implication matrix, we will use graph techniques to simplify the graph for display.
Building an implication matrix
We build an implication matrix by measuring the degree of implication between each pair of columns. In the case of our boring example sales data, the implication matrix happens to be:
Threshold
We can make a decision here as to how weak a relationship is worth considering. We can do this by defining a threshold below which the relationship between two columns is ignored. Assuming we use the wharf coefficient, the following threshold values are a good starting point:
- 1 — use this if we are looking for full relational integrity and we assume the data is not corrupted.
- 0.98 — use this if we are looking for full relational integrity but we expect that the underlying relational structure may be slightly obscured by data quality problems.
- 0.85 — use this we are looking for strong relationships that don’t necessarily amount to actual foreign key constraints
Column removal
Some types of input columns should be removed from the data set before building the matrix. For example:
- Columns that have very low cardinality may not be useful. Columns with a cardinality of 1 are pointless, since every other column implies them.
- Columns that look like numeric measures, e.g. floating point numbers, probably don’t form part of the relational structure.
- Columns that have very high cardinality but are large are probably text values that don’t form part of the relational structure.
In real life, this step is almost always required. In our example data such columns have already been pruned out for the sake of brevity.
Cleansing
It may be necessary to fill NaNs, correct whitespace problems introduced in Excel, and do other cleaning. Fortunately, Python/pandas is a powerful tool for this kind of task.
Building a graph from the matrix
Our ‘implication matrix’ has cells that express the strength of the one-way relationship between columns. This happens to mean that it is also a graph adjacency matrix (in which relationship strength is column weight).
Building a graph representation of the relationships is therefore easy; we wave the python magic wand and the highly useful NetworkX library builds us a graph in one step:
g = nx.from_numpy_matrix(mat,create_using=nx.DiGraph)
In real life, there are a few bookkeeping tasks we have to do here — notably, the original DataFrame column names need to become node names in the NetworkX graph.
Now we have access to a range of techniques (some already available in Python and some not) to extract decision-making inputs, i.e. knowledge, from the graph.
Analysis and presentation of the graph
In this section, we consider steps to present a quick, readable output from our graph of column relationships.
Primary keys
We can easily identify primary key candidates; they’ll be columns which imply every other column.
For several purposes we’ll want to remove primary keys so they don’t clutter up the graph with their many relationships.
Synonyms
We can identify groups of columns that are synonyms fairly easily too; two columns are synonyms if they each imply each other. This happens a lot in real life since so many files contain fields like ‘Car Serial Number’, ‘Car Licence Number’, ‘Car ID’ and so on which we expect to vary together. While finding pairs of columns that are synonyms is easy, identifying large groups of synonyms requires a little bit of graph theory which NetworkX can perform for us easily:
filter(lambda x:len(x)>1, nx.strongly_connected_components(g))
What we’re doing here is identifying ‘strongly connected components’, which are subgraphs of the graph in which all nodes have a two-way relationship. If they have a two-way relationship, they’re synonyms, so each resulting subgraph is a set of synonyms. We then filter out subgraphs with just one node.
Our next step of graph simplification is to compress together all the synonyms. For this we can use various methods but the ‘most correct’ is probably to use the extremely powerful PowerGraph approach. This approach compresses nodes that have similar relationships into single nodes. The following diagram gives an idea:
A full discussion of PowerGraph, and the fascinating world of hypergraphs, is beyond the scope of this article. The cola project contains a fairly accessible implementation of PowerGraph. In my own code that goes with this article, I use a more linear approach; unfortunately, NetworkX doesn’t give us a pre-packaged algorithm this time.
Attributes
Having found keys, and compressed synonyms, we can now look for dimensions. In fact, every column that’s left that is not a key is a dimension (since we cleaned away all the measures earlier on), but merely knowing that isn’t very useful. In real life, interesting dimensions have a structure; for instance, in our example, we expect ‘City’ to have an attribute ‘Country’ and probably an attribute ‘Language’ and/or ‘Currency’, even though we aren’t quite sure how they join together.
We discover these attributes using more handy NetworkX algorithms.
dimensions = list(nx.weakly_connected_components(r))
Weakly connected components are subgraphs joined by one-way relationships; this is exactly what we expect each dimension to be, since the attributes in a dimension are connected by one-way ‘is an attribute of’ relationships.
When we perform this step we find another opportunity for simplification. Our dimensions will wind up with excess relationships in. For example, if the real underlying structure is that ‘city implied country, which implies region’ then the implication matrix will also show that city implies region, even though that’s not useful information for the human reader. We need to remove these excess relationships, as shown below.
To do this, we can use another NetworkX algorithm to find a ‘minumum spanning tree’ of the subgraph that represents the dimension. This is a representation of the graph with as few edges as possible, and results in unnecessary relationships being removed.
Layout
The most common way to represent a graph visually is by what the venerable and vital d3project calls a ‘force directed layout’. If you use NetworkX and matplotlib, this is what you’ll get (although NetworkX calls it a ‘spring layout’). This is generally not a very useful way to present small graphs of business-meaningful items (nor even a useful way to present most graphs in general, I’d argue). By default, NetworkX and matplotlib yield the following visualization for our example graph:
This is just barely adequate for checking that the code works and we have a graph, but not useful for presentation or decision-making.
An appropriate layout choice, in this case, would be to take advantage of the fact that we know we have zero-or-more primary keys in the middle and zero-or-more branching dimensions connected to them. A radial layout like GraphViz twopi would be better but can still introduce visual noise — variations in positioning that look meaningful but are a product of the layout logic rather than of the graph. I’ve implemented a task-specific layout that constrains dimensions to a straight line (with branches) radiating from the primary keys.
Results
Our final output looks like this:
Fig 11: Final result generated in DataSmith, showing the dimensional structure of the input file
Not too far from what we’d expect, but not a bad result for the couple of seconds it took (on a 7000 row input file). Three dimensions are visible, and we can see that while currency is per-country, language is per-city, which is something we might not have picked up on otherwise. A few synonyms have been identified and compressed together. The tool has used a layout that’s suitable for displaying dimension information and it’s added some subtle color to help us group the columns visually.
Further Steps
I’ve covered only the basics here. Further work might go along the following lines:
- Relating data across multiple input files. This involves looking for correlation IDs across files, a non-trivial task but one that’s applicable to many use cases.
- Using Bayesian logic better. I think the wharf coefficient is ideal for data which is fundamentally relational, but for data that is only statistically related, and for making sense of measures rather than dimensions, it would be interesting to blend Bayesian logic in.
- Data Quality use cases. The techniques described here could form a very useful part of a data quality chain, especially for replacing pivot-table based operations.
Get in touch if you’re interested in discussing problems like this. As noted above, the back-end code for the functions shown in this article is available on request (with no guarantees)! This is an area that offers huge opportunities to apply recent advances in analytics and visualization in a way that reduces costs and makes life easier.