Tuesday, December 30, 2008

What is Recombinant Data?

I'm kicking-off this blog with a discussion of a general theme, but one that will come up again in subsequent topics. In fact, it's the name of this blog site: Recombinant Data. The reason I went so far as to name this site accordingly, is because the idea behind "Recombinant Data" is very powerful, yet it is counter to practices by software developers for the last several years. It therefore really deserves its own web site for clarification and building on examples, as well as ongoing community discussions. The first mention of Recombinant Data was by Eric Miller, while he was the W3C liaison for the Healthcare and Life Sciences Interest Group. Since then, I've used it countless times in presentations to various groups, since it is an essential cornerstone of the Semantic Web initiative [the topic of many future posts].

First, a little bit of background: The established way of thinking about software and data has been that an application is the primary point of user experience and the data it creates (and reads) is a persistent artifact whose (user) value depends very much on the application "to read it and to know what to do with it". In other words, data semantics is interpreted by a specific application, and therefore only within the context of that app. Consequently, the efficient re-use of data (data interoperability) is impeded, and it is now at the mercy of specialized contracts or "standards" that must be created between application sets (e.g., Adobe-PDF or Office Suite).

Perhaps this model is good-enough for apps used always the same way by millions of consumers for things like word processing or presentations. But if there is to be any hope for improved interoperability in emerging and complex areas such as healthcare, scientific research, or other knowledge managing fields, waiting for the "right standards" to emerge is like waiting for bacteria to grow wings...[more on standard in another blog]. Standards aren't wrong; they should (from now on) be about practice and semantics, rather than data formats and APIs!

Recombinant Data (RD) takes a very different starting point: it is about structuring data with minimum syntactic rules (MSR), yet with enough semantics so that the data output from one app can be easily read and handled by another app, even though neither app has any specific contract apart form the MSR. And though semantics are necessary for understanding what the data is, only knowledge of enough semantics (patients are a kind of person) is required by an external app (myMail) to use the necessary part of the data (patient identifiers about me). Being able to use the right subset of semantics for additional operations by various apps allows for the semantic-invariant mixing and separation data: no matter what gets pulled together from different sources or apps, the collective set (merged graph) is consistent and logically meaningful. And here is where RD gets its name, borrowing heavily from the biological concept of Recombinant DNA: "two sets of genomes can recombine with one another, without losing or destroying any of their genetic code". In Recombinant Data's case, the logic within the data content is preserved.

Implicit here is the free and open access of semantic definitions, such that an app (or the developer) "can learn more about a given data's semantics" when necessary. This translates into the open publishing of semantic schemas and ontologies, to be used from anywhere on the web. Another requirement is for open-world logic assumptions: not having something does not mean it doesn't exist (e.g., just because a data set does not state "my nickname is Phaedrus" doesn't mean it isn't). Recombinant Data does alter some of the basics about trusting the completeness of data, but this can be re-established through other mechanisms (provenance tracking, verification, proofs, NamedGraphs)... but that's for another day. As each issue is sufficiently addressed, we will see data become "application independent", epitomizing true and sustainable interoperability. Applications that can work with RD will also become much more powerful and beneficial to users, and could spawn a new generation of cool, incrementally extensible apps (hint to you vendors!). I also plan to discuss some of these possibilities in the future as well...

In closing of this inaugural blog, I see the emergence of the Semantic Web strongly requiring the rethinking of the relations between applications and data. This applies evenly to both commercial and open source software and resources. In fact, it has some fascinating implications for apps running on personal laptops and hand-helds (that should be addressed on another blog). I will also point out that there are forces that are trying to prevent this from happening. Since current thinking with commercial vendors is that income is associated with licensing apps, and app independent data will free users from data-format imposed lock-in, they will view Recombinant Data as being antithema to their objectives. However, this is completely wrong, since improved app functionality is what people really want, and Recombinant Data should always trump other approaches for improving apps. We just need to get the eco-system positioned properly so that basic market forces can take over...