Did you hear the one about the Berkeleyan idealist who tried to reduce all of quantum mechanics to mutually recursive lambda expressions?
He literally disappeared up his own arse!
I live in a society in which the Government is restrained from taking any action which can be denounced in a 200-word article in The Sun. How British democracy has attained this pitch of excellence need not concern us here; rather we must ask: how can we increase the explanatory power of articles in The Sun? To do otherwise is to favour the obstruction of public policies exclusively on grounds of simplicity rather than demerit, with the result that any bad policy may be adopted if its proponents first trouble to complicate it unnecessarily.
Newspaper articles and narratives are but the finished product in a long supply chain of knowledge. The Open Knowledge movement should conceive of itself as like unto the resourceful colonial railway companies of old, which opened up vast new territories and connected raw materials with factories.
Project Ronald, named for Ronalds Coase and McDonald, is my effort to connect up all the world's tabular data. The aim is to enable speedy answers to the question "are these things correlated?", where answering involves extracting information from public datasets published by different organisations. I take "speedy" to mean "in a couple of seconds", and I have created a short video demonstrating the process from the moment someone selects successful search terms to find a dataset, through to being able to do regressions in R, a statistics system; the video is heavily padded to show I'm not cheating.
The overall approach of the project is the New Jersey / UNIX philosophy: get a simple architecture of loosely connected components working together, and worry about obtaining the best implementation of each component later. Subsequent blog posts will describe the current implementation and future challenges.
The first objective of Project Ronald is to make it easy to connect tabular datasets quickly: given two openly-licensed tabular datasets containing a common field, but published by different organisations, it ought to be possible to get them downloaded and joined together in a few seconds. The approach is to identify the components of a system which would do this, implement a minimal version of each, check that the system works as a whole, and then go about replacing each component with better tools, preferably ones already written and matured by someone else.
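The end state can be sketched in a few lines of plain Python. The data below is made up, but it shows the essential move: once two independently published tables share a standardised ward code, joining them is a one-liner.

```python
# Sketch only: the figures are invented for illustration.
votes = {"E05000001": 41.2, "E05000002": 38.7}      # candidate A % by ward
ethnicity = {"E05000001": 22.5, "E05000002": 30.1}  # group X % by ward

# Join the two tables on the wards they have in common.
joined = {w: (votes[w], ethnicity[w]) for w in votes.keys() & ethnicity.keys()}
print(joined["E05000001"])  # (41.2, 22.5)
```

Everything else in the system exists to get real published CSVs into a state where this join is possible.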
The research question motivating my example is a minor local political controversy: is there a correlation between ethnicity and voting behaviour in London mayoral elections? What I have done with the data is not statistically rigorous at all; it serves only to demonstrate the data acquisition and linking tools.
We start with two datasets: the London mayoral election results by ward, and the census ethnicity breakdown by ward. As it happens, I had to wrangle these both to be in the form of percentages, but they contained a standardised code identifying the ward, which helped considerably. Two pieces of metadata are needed for each tabular dataset in CSV form: the CSV dialect and the CSV schema.
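To illustrate what these two pieces of metadata amount to, here is a sketch using Python's standard csv module: the dialect (delimiter, quoting and so on) can be guessed by `csv.Sniffer`, and a schema can be as simple as a column-name-to-type mapping applied while reading. The sample text and column names are illustrative, not the project's.

```python
import csv
import io

# An illustrative semicolon-delimited CSV fragment.
sample = "ward_code;candidate_a_pct\nE05000001;41.2\nE05000002;38.7\n"

# The dialect: csv.Sniffer guesses delimiter and quoting from a sample.
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ;

# The schema: a mapping from column name to type, applied row by row.
schema = {"ward_code": str, "candidate_a_pct": float}
reader = csv.DictReader(io.StringIO(sample), dialect=dialect)
rows = [{name: schema[name](value) for name, value in row.items()}
        for row in reader]
print(rows[0])  # {'ward_code': 'E05000001', 'candidate_a_pct': 41.2}
```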
The CSV file and its two metadata objects are registered with a Data Management System. The CKAN DMS allows me to attach arbitrary attributes to a dataset, so I use this facility to record which resource is which, in a manner understandable by an automatic tool. It is then trivial to create a tool which knows this convention and can download a CSV with dialect and schema metadata given a dataset name. This information suffices to create a table in a relational database (I used MySQL), and the statistics system, R, has a MySQL driver; both MySQL and R can efficiently join two datasets using a common field.
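The join itself is plain SQL. As a self-contained sketch I use Python's built-in sqlite3 in place of MySQL, and invented table and column names; the shape of the query is the same in either database.

```python
import sqlite3

# In-memory stand-in for the MySQL database the project actually uses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE votes (ward_code TEXT, candidate_a_pct REAL);
    CREATE TABLE ethnicity (ward_code TEXT, group_x_pct REAL);
    INSERT INTO votes VALUES ('E05000001', 41.2), ('E05000002', 38.7);
    INSERT INTO ethnicity VALUES ('E05000001', 22.5), ('E05000002', 30.1);
""")

# The standardised ward code is the common field that makes the join trivial.
rows = conn.execute("""
    SELECT v.ward_code, v.candidate_a_pct, e.group_x_pct
    FROM votes v JOIN ethnicity e ON v.ward_code = e.ward_code
    ORDER BY v.ward_code
""").fetchall()
print(rows)  # [('E05000001', 41.2, 22.5), ('E05000002', 38.7, 30.1)]
```

From R's side, the equivalent is a single merge or SQL query through the MySQL driver, after which the joined table is ready for regression.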
With such a system, you can get the data into your statistics package and start running regressions in seconds, which is much harder where the data must be licensed or the tools are not coupled together. In addition to open licensing and tool connectivity, the following steps need to have been taken, not necessarily by the same person:
- the dialect of the CSV file must be determined
- the schema of the CSV file must be determined
- the joinable fields in the CSV table must be consistently coded
- the CSV file needs to be clean, in various senses
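The consistent-coding step above can be sketched as a small normalisation pass applied before the join. The rules here (trim whitespace, unify case) are illustrative; real publishers diverge in more inventive ways.

```python
def normalise_ward_code(raw: str) -> str:
    """Illustrative cleaning: trim stray whitespace and unify case so that
    codes from different publishers compare equal when the tables are joined."""
    return raw.strip().upper()

# The same ward, as it might appear in two differently sloppy publications.
print(normalise_ward_code(" e05000001 "))   # E05000001
print(normalise_ward_code("E05000001\n"))   # E05000001
```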
Much of this work is automatable, which will be the topic of a later post.
The source code for Project Ronald is now available on GitHub.