Sunday, May 22, 2011

A Co-Relational Model of Data for Large Shared Data Banks

A Co-Relational Model of Data for Large Shared Data Banks is a paper by Erik Meijer (a used programming language salesman) and Gavin Bierman, in Communications of the ACM, vol. 54, no. 4 (April 2011; originally published in Queue vol. 9, no. 3). The paper is rather pretentious, but fascinating nonetheless.

The authors correctly compare the situation in the current noSQL market with that of the early database days, before Codd's landmark 1970 paper heralded relational databases. Customers can choose among relational (SQL) database providers on product merits while relying on a standard interface, which makes switching relatively easy and keeps the market competitive. No such choice exists in the world of noSQL. While there are multiple noSQL products (Cassandra, Dynamo, Hibari, and many others), there is no unifying standard, and once a project picks a noSQL database, it becomes tightly bound to it.

Meijer and Bierman claim that noSQL should actually be named coSQL, since, in terms of category theory, the two models are duals of each other. Here's some quick background: category theory explores math by viewing the world as objects and arrows between them (for example, domains and functions). Given a category T of objects and arrows, co(T) has the same objects, with the direction of every arrow reversed. In SQL, the arrows point from the properties of objects to the objects themselves (e.g., if a book has multiple authors, then each row of the table AUTHORS holds a foreign key into the table BOOK). In co(SQL), the arrows would run from the book to its authors, and indeed this is the case in noSQL, where (using the in-memory representation) a "Book" object has pointers or references (arrows) to one or more "Author" objects. Ergo, noSQL is coSQL.
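To make the arrow reversal concrete, here is a minimal Python sketch. It is my own illustration, not code from the paper: the sample data, the field names (`id`, `book_id`, `authors`), and the helper `reverse_arrows` are all hypothetical. The first half holds the data in the SQL shape, where each author row points at its book; the helper then flips the arrows so that each book points at its authors.

```python
from collections import defaultdict

# SQL shape: the arrow runs from the child row to its parent.
# Each author row carries a foreign key (book_id) into the books table.
# All data and field names here are made up for illustration.
books = [
    {"id": 1, "title": "Book A"},
    {"id": 2, "title": "Book B"},
]
author_rows = [
    {"name": "Author X", "book_id": 1},
    {"name": "Author Y", "book_id": 1},
    {"name": "Author Z", "book_id": 2},
]


def reverse_arrows(books, author_rows):
    """Flip child -> parent foreign keys into parent -> children references."""
    authors_by_book = defaultdict(list)
    for row in author_rows:
        # Drop the foreign key: the author no longer points at the book.
        authors_by_book[row["book_id"]].append({"name": row["name"]})
    # coSQL shape: each book now holds references to its authors.
    return [
        {"title": book["title"], "authors": authors_by_book[book["id"]]}
        for book in books
    ]


documents = reverse_arrows(books, author_rows)
for doc in documents:
    print(doc["title"], "->", [a["name"] for a in doc["authors"]])
```

The same facts are stored in both shapes; only the direction of the references differs, which is the sense in which the two models mirror each other.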

The authors then go into further mathematical details, and in particular discuss how, mathematically speaking, a MapReduce operation over a noSQL data store is effectively a monad in category theory. Ultimately, their claim is that obtaining a standardized, mathematical view of noSQL (just as Codd did for pre-SQL databases, using relational algebra) will result in a universal standard for noSQL products, and a truly competitive and thriving market. They do not hesitate to share their belief that this mathematical formalization of noSQL will lead to economic growth similar to that which followed the relational model. In other words, what we have here is a paper that claims in advance to be a landmark paper.

And it might just be right.
