Scaling across data types

2012-06-25

Whereas much work has gone into scaling storage and processing as the size of datasets increases, we don't have a clear vision of how to manage a group of vastly diverse datasets, all saying similar things in slightly different ways at different levels of quality and completeness.

The Open Knowledge Foundation has two projects, CKAN and OpenSpending, which deal in part with these issues, plus the additional problem of soliciting dataset contributions from the general public. They succeed by shoehorning certain common aspects of the datasets into a standard schema, while accepting and supporting broad variation elsewhere on a best-effort basis.

To represent tabular data of a single consistent schema, one may use a conventional relational database table. For multiple schemas, one may specify a namespace policy and then use multiple tables. This leaves open the question of how to handle very similar datasets: e.g., datasets whose schemas differ by a single member. SQL-style relational databases permit the use of NULL values where information is missing, but this has severe drawbacks, and is unhelpful where the schemas differ markedly. One ideally wants arbitrary unions of dissimilar tables.
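As a rough illustration of the NULL-padding approach and one of its drawbacks, here is a minimal sketch using Python's built-in sqlite3 module. The table names (grants_a, grants_b), the columns and the rows are all invented for the example; they are assumptions, not taken from CKAN or OpenSpending.

```python
import sqlite3

# Two hypothetical datasets whose schemas differ by a single member (region).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE grants_a (recipient TEXT, amount REAL);
    CREATE TABLE grants_b (recipient TEXT, amount REAL, region TEXT);
    INSERT INTO grants_a VALUES ('Alpha Trust', 1000.0);
    INSERT INTO grants_b VALUES ('Beta Fund', 2500.0, 'North');
    INSERT INTO grants_b VALUES ('Gamma Org', 400.0, NULL);
""")

# Union the two tables by padding the missing column with NULL.
# The 'source' column is an extra tag recording where each row came from.
rows = conn.execute("""
    SELECT 'a' AS source, recipient, amount, NULL AS region FROM grants_a
    UNION ALL
    SELECT 'b' AS source, recipient, amount, region FROM grants_b
    ORDER BY recipient
""").fetchall()

for row in rows:
    print(row)
```

In the result, the region of 'Alpha Trust' is NULL because its schema has no such column, while the region of 'Gamma Org' is NULL because the value is unknown. Without the source tag the two cases cannot be told apart, and the padding grows quickly as schemas diverge.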

That is, sometimes we want to say that each item in this dataset is "either an X or a Y"; this is effectively the dual of algebraic data types and indeed of the abstract base class. In the case of big-data-scale datasets, applications tend to be beyond the point where object-relational mapping systems are feasible, but otherwise it is very useful to know that information unmarshalled from the data stream does not need to be completely checked at run-time before instantiation with the appropriate subtype.
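The "either an X or a Y" idea can be sketched in Python as a tagged union. This is only one possible encoding, under assumptions: the Person and Company types, the "kind" tag and the field names are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Person:
    name: str


@dataclass(frozen=True)
class Company:
    name: str
    registration_no: str


# Each item in the dataset is "either a Person or a Company".
Owner = Union[Person, Company]


def decode_owner(record: dict) -> Owner:
    """Pick the subtype from the (hypothetical) 'kind' tag, then instantiate it."""
    kind = record.get("kind")
    if kind == "person":
        return Person(name=record["name"])
    if kind == "company":
        return Company(name=record["name"],
                       registration_no=record["registration_no"])
    raise ValueError(f"unknown kind: {kind!r}")


def describe(owner: Owner) -> str:
    """Consumers branch on the subtype rather than probing for NULL fields."""
    if isinstance(owner, Company):
        return f"{owner.name} (company {owner.registration_no})"
    return f"{owner.name} (person)"


records = [
    {"kind": "person", "name": "A. Smith"},
    {"kind": "company", "name": "Acme Ltd", "registration_no": "12345"},
]
owners = [decode_owner(r) for r in records]
descriptions = [describe(o) for o in owners]
```

Once the tag has selected a subtype, only that subtype's fields are read, so downstream code never has to guess whether an absent field means "not applicable" or "unknown".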

Within or across organisational boundaries, reconciliation and verification processes ought to be isolatable from each other, even at the cost of performance. Suppose we are verifying the contents of a particular cadastral claim, e.g., a title deed in the United States: we will have names and addresses, which must be resolved to counties, states and so on. It should be unnecessary for the automated tools that retrieve and cache the supporting data to reside within the same domain of administrative control. Technologically and economically, however, we are a long way from distributable data reconciliation.