Automatically inferring relational structure from files

Background

They’ve called you in because they need insight. A plan for the future, perhaps, or a critical evaluation of the present — either way, your vision will need to be supported by serious concrete evidence. And to achieve that, you’ll need to source some raw data and turn that data into knowledge fast.

For the sake of argument, let’s take this realistic (and realistically dull…) CSV file describing sales.

Fig 1: an example CSV file in its natural habitat.

 

Fig 2: two-minute pencil and paper analysis of the CSV file

From that quick analysis, we need three things:

  • a measure of how related one column is to another
  • a way to build a graph of column relationships based on that measure
  • a way to simplify and present that graph

Measures

This section considers ways to measure relatedness between columns, and proposes a custom measure, the ‘wharf coefficient’. To infer a relational structure, we mostly want to answer the question:

  • ‘Is column A a dimension of which column B might be an attribute or subdimension?’

Correlation, Chi-Squared and their friends

We need a measure that tells us whether, and how much, columns are related. Our ML instincts might tell us to look for correlations between the columns. This isn’t very effective, though, since correlation is defined for numeric variables and isn’t really applicable between categorical columns. The classical alternative for categorical data is the chi-squared statistic, together with its normalised relative, Cramér’s V.

Fig 3: Chi-squared formula

 

Fig 4: Cramér’s V — a scaled chi-squared
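For reference, since the formula images are not reproduced here, the standard definitions are:

\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad V = \sqrt{\frac{\chi^2 / n}{\min(k - 1,\ r - 1)}}

where O_{ij} and E_{ij} are the observed and expected counts in the contingency table of the two columns, n is the number of rows, and k and r are the numbers of distinct values in each column.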

Bayesian

To measure the degree to which column A implies column B, we could use Bayesian logic. This has a notable benefit:

  • It is a one-way relationship that easily expresses the case in which A implies B but not vice versa.

Fig 5: One-hot encoding; deciding whether ‘Language’ depends on ‘City’
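As a rough sketch of that idea in pandas, skipping the one-hot encoding shown in Fig 5 and estimating the conditional frequencies directly (the file name and the ‘City’/‘Language’ column names are assumptions taken from the example, not code from the original):

import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical file name

# Empirical P(Language | City): for each city, the share of rows taken by its
# most common language. Values near 1.0 suggest City determines Language.
p_lang_given_city = (
    df.groupby("City")["Language"]
      .apply(lambda s: s.value_counts(normalize=True).iloc[0])
)
print(p_lang_given_city)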

Wharf Coefficient

For these reasons, for the task at hand it’s worth suggesting a new measure, the ‘wharf coefficient’ (so called because it’s useful in Canary Wharf, a part of London that is rich in large CSV files). This is a simple measure answering the question: if I know the value in column B, for what fraction of rows would I correctly guess the value in column A by always picking the most common A for that B?

 

Fig 6: Wharf coefficient
# Fraction of rows whose value in ca is the most common ca for their value in cb
w = df.groupby([cb, ca])[ca].count().groupby(level=0).max().sum() / df.shape[0]
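To make that concrete, here is a minimal self-contained sketch; the wharf helper name and the toy data are illustrative, not from the original:

import pandas as pd

def wharf(df, cb, ca):
    # Wharf coefficient: how strongly the values in column cb determine the
    # values in column ca (1.0 means cb fully implies ca).
    counts = df.groupby([cb, ca])[ca].count()
    return counts.groupby(level=0).max().sum() / df.shape[0]

toy = pd.DataFrame({
    "City":    ["London", "Leeds", "Paris", "Paris", "Berlin"],
    "Country": ["UK",     "UK",    "France", "France", "Germany"],
})
print(wharf(toy, "City", "Country"))  # 1.0 - every city sits in exactly one country
print(wharf(toy, "Country", "City"))  # 0.8 - knowing the country does not pin down the city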

Building a graph

Having selected a measure of the dependence of one column on another, we go on to build a graph of the inferred underlying data structure. We need the graph because extracting knowledge from the implication matrix relies on graph techniques to simplify the structure for display.

Building an implication matrix

We build an implication matrix by measuring the degree of implication between each pair of columns. In the case of our boring example sales data, the implication matrix happens to be:

Fig 7: The implication matrix for the example data. We see that ‘City’ nearly implies ‘Store’ but not quite.
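A minimal way to build such a matrix, reusing the hypothetical wharf helper sketched earlier (column removal and cleansing, described below, would normally happen first):

import numpy as np

cols = list(df.columns)
mat = np.zeros((len(cols), len(cols)))
for i, cb in enumerate(cols):
    for j, ca in enumerate(cols):
        if i != j:
            # mat[i, j]: how strongly column cb implies column ca
            mat[i, j] = wharf(df, cb, ca)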

Threshold

We can make a decision here as to how weak a relationship is worth considering. We can do this by defining a threshold below which the relationship between two columns is ignored. Assuming we use the wharf coefficient, the following threshold values are a good starting point:

  • 0.98 — use this if we are looking for full relational integrity but we expect that the underlying relational structure may be slightly obscured by data quality problems.
  • 0.85 — use this if we are looking for strong relationships that don’t necessarily amount to actual foreign key constraints.
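Applying the chosen threshold to the matrix from the earlier sketch is then a single masking step:

# Ignore relationships weaker than the threshold
threshold = 0.98
mat[mat < threshold] = 0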

Column removal

Some types of input columns should be removed from the data set before building the matrix. For example:

  • Columns that look like numeric measures, e.g. floating point numbers, probably don’t form part of the relational structure.
  • Columns with very high cardinality and long values are probably free-text fields that don’t form part of the relational structure either. A sketch of both checks follows below.
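A rough pandas version of those two filters; the cardinality and length cut-offs are illustrative guesses rather than values from the original:

import pandas as pd

keep = []
for col in df.columns:
    if pd.api.types.is_float_dtype(df[col]):
        continue  # looks like a numeric measure
    values = df[col].astype(str)
    if values.nunique() > 0.9 * len(df) and values.str.len().mean() > 30:
        continue  # high-cardinality long text, probably free-form values
    keep.append(col)
df = df[keep]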

Cleansing

It may be necessary to fill NaNs, correct whitespace problems introduced in Excel, and do other cleaning. Fortunately, Python/pandas is a powerful tool for this kind of task.
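A couple of typical clean-ups, as a sketch (what counts as ‘clean’ depends entirely on the file at hand):

# Strip stray whitespace that Excel tends to introduce
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()

# Treat missing values as an explicit category rather than NaN
df = df.fillna("(missing)")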

Building a graph from the matrix

Our ‘implication matrix’ has cells that express the strength of the one-way relationship between columns. This happens to mean that it is also a graph adjacency matrix (in which relationship strength is edge weight).

import networkx as nx
g = nx.from_numpy_array(mat, create_using=nx.DiGraph)  # from_numpy_matrix was removed in NetworkX 3.x
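One small follow-up step worth noting: from_numpy_array labels the nodes 0..n-1, so it helps to relabel them with the column names (cols as in the matrix-building sketch above):

# Replace integer node labels with the original column names
g = nx.relabel_nodes(g, dict(enumerate(cols)))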

Analysis and presentation of the graph

In this section, we consider steps to present a quick, readable output from our graph of column relationships.

Primary keys

We can easily identify primary key candidates; they’ll be columns which imply every other column.
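In graph terms that is just an out-degree check, assuming edges exist only where the implication survived the threshold:

# A key candidate implies (has an edge to) every other column
key_candidates = [n for n in g.nodes if g.out_degree(n) == len(g) - 1]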

Synonyms

We can identify groups of columns that are synonyms fairly easily too; two columns are synonyms if they each imply each other. This happens a lot in real life since so many files contain fields like ‘Car Serial Number’, ‘Car Licence Number’, ‘Car ID’ and so on which we expect to vary together. While finding pairs of columns that are synonyms is easy, identifying large groups of synonyms requires a little bit of graph theory which NetworkX can perform for us easily:

# Groups of mutually-implying columns (synonyms) are the strongly connected components
synonyms = [c for c in nx.strongly_connected_components(g) if len(c) > 1]

Fig 8: PowerGraph approach used to simplify a graph that contains synonyms
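One way to perform that compression is NetworkX’s condensation, which collapses each strongly connected component into a single node; this is a rough stand-in for the PowerGraph-style simplification of Fig 8, and the result could serve as the reduced graph r used below:

# Collapse each synonym group into one node; the 'members' node attribute
# records which original columns each condensed node represents
r = nx.condensation(g)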

Attributes

Having found keys, and compressed synonyms, we can now look for dimensions. In fact, every column that’s left that is not a key is a dimension (since we cleaned away all the measures earlier on), but merely knowing that isn’t very useful. In real life, interesting dimensions have a structure; for instance, in our example, we expect ‘City’ to have an attribute ‘Country’ and probably an attribute ‘Language’ and/or ‘Currency’, even though we aren’t quite sure how they join together.

# r is the reduced graph, with keys and synonym groups already compressed
dimensions = list(nx.weakly_connected_components(r))

Fig 9: Excess relationships removed from a dimension by finding a spanning tree
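A simple way to get the effect Fig 9 shows is a maximum spanning tree per dimension. This sketch works on an undirected copy, which throws away edge direction but is enough for display; if the implication weights did not survive into r, all edges count equally and an arbitrary spanning tree is returned:

trees = [
    nx.maximum_spanning_tree(r.subgraph(d).to_undirected(), weight="weight")
    for d in dimensions
]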

Layout

The most common way to represent a graph visually is what the venerable and vital d3 project calls a ‘force-directed layout’. If you use NetworkX and matplotlib, this is what you’ll get (although NetworkX calls it a ‘spring layout’). This is generally not a very useful way to present small graphs of business-meaningful items (nor even a useful way to present most graphs in general, I’d argue). By default, NetworkX and matplotlib yield the following visualization for our example graph:
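For reference, the default rendering is just:

import matplotlib.pyplot as plt

# Spring ('force-directed') layout, drawn with matplotlib
nx.draw(g, pos=nx.spring_layout(g), with_labels=True)
plt.show()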

Fig 10: Grim result of a naive force-directed layout of the unsimplified graph

Results

Our final output looks like this:

Fig 11: Final result generated in DataSmith, showing the dimensional structure of the input file

Further Steps

I’ve covered only the basics here. Further work might go along the following lines:

  • Using Bayesian logic better. I think the wharf coefficient is ideal for data which is fundamentally relational, but for data that is only statistically related, and for making sense of measures rather than dimensions, it would be interesting to blend Bayesian logic in.
  • Data Quality use cases. The techniques described here could form a very useful part of a data quality chain, especially for replacing pivot-table based operations.


Thanks to Yenson Lau.
