The first objective of Project Ronald is to make it easy to connect tabular datasets quickly: given two openly-licensed tabular datasets containing a common field, but published by different organisations, it ought to be possible to get them downloaded and joined together in a few seconds. The approach is to identify the components of a system which would do this, implement a minimal version of each, check that the system works as a whole, and then go about replacing each component with better tools, preferably ones already written and matured by someone else.
The example research question to motivate my example is about a minor local political controversy: is there a correlation between ethnicity and voting behaviour in London mayoral elections? What I have done with the data is not statistically rigorous at all; it is only to demonstrate the data acquisition and linking tools.
We start with two datasets: the London mayoral election results by ward, and the census ethnicity breakdown by ward. As it happens, I had to wrangle these both to be in the form of percentages, but they contained a standardised code identifying the ward, which helped considerably. Two pieces of metadata are needed for each tabular dataset in CSV form: the CSV dialect and the CSV schema.
The CSV file, and its two metadata objects are registered with a Data Management System the CKAN DMS allows me to attach arbitrary attributes to a dataset, so I use this facility to record which resource is which, in a manner understandable by an automatic tool. It is then trivial to create a tool which knows this convention and can download a CSV with dialect and schema metadata given a dataset name. This information suffices to create a table in a relational database (I used MySQL) and the statistics system, R, has a MySQL driver; both MySQL and R can efficiently join two datasets using a common field.
With such a system, you can get the data into your statistics package and start running regressions in seconds, which is much harder where the data must be licensed or the tools are not coupled together. In addition to open licensing and tool connectivity, the following steps need to have been taken, not necessarily by the same person:
- the dialect of the CSV file must be determined
- the schema of the CSV file must be determined
- the joinable fields in the CSV table must be consistently coded
- the CSV file needs to be clean, in various senses
Much of this work is automatable, which is the topic of a later post