Tidy Data in Finance
Tidy Data: the concept
“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data”
That’s how Hadley Wickham’s seminal 2014 paper, Tidy Data, begins. In it, he sets out a collection of rules and guidelines for making data easier to analyze — for reducing that 80% figure so we can spend more time on exploiting data and less on getting it ready. The paper is freely available here:
Hadley Wickham’s Tidy Data paper
Tidy Data in Finance
The Tidy Data paper is concerned with scientific data, which comes from different studies, different journals, and different institutions, all with their own standards. You might think that in the case of, say, risk data in a financial institution, preparing it for analysis would be much easier — after all, it’s all the output of a single pipeline, controlled by a single organization.
In my experience, though, the figure is at least 80%, what with enrichment, population selection, de-duplication, interpretation of complex business terms from different areas of the business, matching of market data and model parameters…
Perhaps the Tidy Data principles can be specialized to help with financial data? Starting small, I chose a part of Tidy Data that contains guidelines around dataset structure, specifically around spotting problems (messy data) in datasets. Following are those problems, as described in Tidy Data, with some thoughts on why we find them in financial data.
Messy data: Column headers are values, not variable names
This happens a lot in risk data. Often, it’s caused by the way regulators and risk managers would like to see data in a report, as opposed to the way data should be arranged for analysis. For example, a report to the Fed might contain a table like this:
Total Exposure by Loan Size
Region | < $1M | $1 - $5M | $5 - $10M | $10 - $100M | > $100M
----------------------------------------------------------------
N America | $50M | $324M | $1045M | $941M | $1200M
EMEA | $10M | $121M | $77M | $80M | $0M
...
The columns here are mostly ranges of values. This isn’t great for analysis; if we want to pivot this freely, or rebucket the loans in some other way, we wind up going back upstream to get a raw table of loan details. Almost always, the underlying cause of this kind of malformed dataset is that data prepared for a specific reporting use case has been provided as if it were general purpose. This happens all too often as teams scramble to meet regulation after regulation — an improved collaborative analytics workflow is often the solution.
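As a sketch of the tidying this implies (the frame below is typed in from the table above; the column names LoanSize and ExposureUSDm are my own illustrative choices), pandas’ melt turns the size buckets back into an ordinary variable:

import pandas as pd

# Report-shaped table: loan-size buckets sit in the column headers,
# i.e. the headers are values rather than variable names (exposures in USD millions)
report = pd.DataFrame({
    "Region":      ["N America", "EMEA"],
    "< $1M":       [50, 10],
    "$1 - $5M":    [324, 121],
    "$5 - $10M":   [1045, 77],
    "$10 - $100M": [941, 80],
    "> $100M":     [1200, 0],
})

# Melt the bucket columns into a single LoanSize variable,
# giving one row per region and loan-size bucket
tidy = report.melt(id_vars="Region", var_name="LoanSize", value_name="ExposureUSDm")
print(tidy)

The long shape pivots and rebuckets far more easily, although the real fix is to publish the underlying loan-level data rather than the report.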
Messy data: Multiple variables are stored in one column
This is extremely common in market risk data. Why? Because a market risk data point expresses the intersection of a complex set of dimensions describing the financial position, the underlying market factor, and the kind of sensitivity to this factor. An extreme but real-life case is shown below:
Market Risk Sensitivities
Position ID | Exposure | Sensitivity
----------------------------------------------------
USA001034 | Treas_5Y_90d_pv01_rel | 1.2
USA001034 | Treas_10Y_1y_pv01_rel | 1.1
UK002097 | Eq_gamma_Rel_1pct_term | 0.31
UK002097 | Eq_Gamma_Rel_1pct_inst | 0.302
‘Exposure’ will have to be decomposed into risk factor type, tenor, relative/absolute, and myriad other dimensions, by the look of it. This will be laborious, but unavoidable if we want to be able to pivot this data to meet arbitrary regulatory needs. This kind of dataset ‘shape’ often arises when the front office is able to label risk points however it wishes, and the middle office isn’t able to push back and enforce a standard representation — perhaps something that could be addressed by governance processes and better glossary management.
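A minimal sketch of the decomposition, assuming (optimistically) that the labels follow a consistent underscore-delimited convention; the target column names RiskFactorType, Tenor, Horizon, Measure and RelOrAbs are my own guesses at the dimensions, not a given in real data:

import pandas as pd

sens = pd.DataFrame({
    "PositionID":  ["USA001034", "USA001034"],
    "Exposure":    ["Treas_5Y_90d_pv01_rel", "Treas_10Y_1y_pv01_rel"],
    "Sensitivity": [1.2, 1.1],
})

# Split the composite label into separate variables; this only works while
# every label really does follow the same delimiter-and-order convention
parts = sens["Exposure"].str.split("_", expand=True)
parts.columns = ["RiskFactorType", "Tenor", "Horizon", "Measure", "RelOrAbs"]
tidy = pd.concat([sens.drop(columns="Exposure"), parts], axis=1)
print(tidy)

The equity rows in the example above follow a different convention entirely, which is exactly why the real work is standardizing the labels upstream rather than parsing them downstream.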
Messy data: Variables are stored in both rows and columns
Tidy Data describes this as ‘the most complicated kind of messy data’. It means tables in which some variables have been assigned to columns, but others are ‘dynamic’, like this:
Bond Portfolio Summary
Portfolio | Bond Type | USA | UK | Japan
--------------------------------------------------------
A000D | Municipals | 10M | 0M | 0M
A000D | Corporate | 40M | 5M | 10M
A000E | Treasury | 80M | 10M | 25M
There are two features of a portfolio here: bond type and issue country. For some reason, issue country has been split into columns whereas bond type is dynamic. No doubt there was a good reason for this from the point of view of whoever originally composed the dataset, but building quirks like that into a data mart isn’t a good idea — either a flattened, sparse format with bond type spread across columns or a fully dynamic format with a ‘bond type’ and a ‘country’ column would be better. Data platforms should ideally be cleaned of such subtle built-in assumptions, so that they can support varied analytics use cases easily.
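For comparison, here is what the fully dynamic version might look like, again as a sketch with my own column names (IssueCountry, NotionalUSDm) rather than anything from a real mart:

import pandas as pd

summary = pd.DataFrame({
    "Portfolio": ["A000D", "A000D", "A000E"],
    "BondType":  ["Municipals", "Corporate", "Treasury"],
    "USA":   [10, 40, 80],
    "UK":    [0, 5, 10],
    "Japan": [0, 10, 25],
})

# Move issue country out of the column headers so that both features of a
# portfolio, bond type and issue country, become ordinary variables
tidy = summary.melt(id_vars=["Portfolio", "BondType"],
                    var_name="IssueCountry", value_name="NotionalUSDm")
print(tidy)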
Messy data: Multiple types of observational units are stored in the same table
This is a classic type of denormalization. In financial data, it’s often caused by datasets that contain a built-in join with some reference data tables:
Bonds Reference Data
IssuerParent | Issuer | IssuerCcy | ISIN | Maturity
--------------------------------------------------------------------
GM | GMAC | USD | US1234567890 | 10Y
GM | GMAC | USD | US2345678910 | 5Y
Is this a file of issuers or of instruments? I don’t think denormalized flat files like this are usually a problem; in finance we have a relatively well understood universe of positions, parties, issuers, and other such data entities, and sometimes it’s handy, not risky, to denormalize. Just as long as there’s a data model upstream that does articulate the relationships between those entities…
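For illustration, the normalization that such an upstream data model would articulate looks something like this: one table per observational unit, linked by the issuer. The split below is my own sketch, with column names taken from the example above.

import pandas as pd

bonds = pd.DataFrame({
    "IssuerParent": ["GM", "GM"],
    "Issuer":       ["GMAC", "GMAC"],
    "IssuerCcy":    ["USD", "USD"],
    "ISIN":         ["US1234567890", "US2345678910"],
    "Maturity":     ["10Y", "5Y"],
})

# Issuers and instruments are different observational units;
# keep one table for each, joinable on 'Issuer'
issuers     = bonds[["IssuerParent", "Issuer", "IssuerCcy"]].drop_duplicates()
instruments = bonds[["ISIN", "Issuer", "Maturity"]]
print(issuers)
print(instruments)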
Messy data: A single observational unit is stored in multiple tables
This happens naturally when inputs are large or come from many sources. For example, information about a counterparty — a single observational unit — might be split between a due diligence data store, a corporate structure/events system, a marketing system, and others (with probably some extra dimensions of splitting caused by GDPR and data residency rules). Provided all these sources have the same primary keys and the same number of records, everything is fine; but that’s rarely the case in practice, and it’s the job of data architecture to ensure that the separate repositories that a financial institution has to have are joinable with one another, and follow the same understanding of data quality and master data management (MDM).
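A sketch of why joinability matters, using hypothetical extracts keyed on a shared CounterpartyID (the store names and columns are invented for illustration):

import pandas as pd

# Hypothetical slices of one observational unit held in separate stores
due_diligence = pd.DataFrame({"CounterpartyID": ["C001", "C002"],
                              "KYCStatus":      ["Complete", "Pending"]})
corp_structure = pd.DataFrame({"CounterpartyID": ["C001", "C003"],
                               "UltimateParent": ["GM", "VW"]})

# An outer join keeps counterparties that appear in only one source visible,
# which is usually where the data quality work starts
combined = due_diligence.merge(corp_structure, on="CounterpartyID", how="outer")
print(combined)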
Conclusions
Tidying up data in this sense is a matter of best practice and common sense; but the trouble is that those things are understood differently from person to person, whereas this is an area where consistency is key. Taking general-purpose principles from the wider world of data analysis, especially scientific analysis, and specializing them for financial data is a valid way to produce useful, well-defined local best practices. In this example I’ve looked at a fairly small use case (dataset shapes that are easy to analyze); the more interesting use cases lie in areas such as standardizing the semantics of datasets that originate from different modelling systems. But that’s a story for another day…