Tuesday, October 24, 2017

Wrecks at the Bottom of Data Lakes

What are Data Lakes?

Along with all the activity and marketing hype around Big Data, there are still troubling loose ends to contend with: how do we associate disparate but overlapping data with each other if we are simply to “pour” data together? Using the lake paradigm, how is one to fish out the specific data that match some criterion, along with anything else associated with it? Some explanations point to adding type information, but this limits how data from different collections (related but not identical types) can be cross-linked when necessary. We can choose to link entities across types using constrained rules or semantics. However, if we are to rely on some form of data semantics to associate related things, how are these semantics to be established, added to the lake, and then managed? The lake metaphor quickly begins to get murky…

But what happened to semantic data, a.k.a. linked data, and to the ability to link data from multiple sources across an organization or even the Internet? What of all the promises of truly interlinked data, independent of where they arise? Is the data lake the replacement paradigm? One notable shift has been the localization of data within an organization’s auspices, rather than relying on outbound links as championed by semantic web standards. But is the lake terminology right for this? In the sciences, there are always external resources that need to be updated and merged with the internal sets. If one is not properly using linked data identifier (URI) semantics, what then? What is really being offered here?

To many, the lake analogy affords a serene image of lazy afternoons of sailing and fishing; but it is deceptive nonetheless. Are things best discovered by using simple tags, and are these controlled? Are unique relations the key to identifying special objects? Is it a particular tangle of linked things that helps fish out a prize catch? Do large assemblages of multiple facts come out whole in a meaningful way, or as a jumble of stringy facts? It is not a far stretch to conjure up the thought of an Edmund Fitzgerald[1]-size data wreck if one does not take the time to structure the inserted data. Some things dumped into the lake may never see the light of day again. Has data depth now become a good thing or a bad thing? In this article, we will take a deeper dive into the challenges facing data aggregation and structuring, and some new ideas on how to better organize growing and evolving data resources.

A concept introduced in a previous article is the Yoneda lemma (from abstract algebra), which formally ties all records of entities (including keys) from any table to each other to create one large network of composite relations. It makes it possible to define a query algebra (e.g., SQL, SPARQL) that works with any schema for a dataset. In the case of data lakes, this foundation is missing, or at least has not been formally introduced, so a large uncertainty exists about what the formal basis will be to ensure data integrity for insertions, updates, and queries. Currently, data lakes appear to be a convenient option for handling a large influx of datasets coming in varied, disjoint structural forms. Sean Martin of Cambridge Semantics said of current efforts [1]: “We see customers creating big data graveyards, dumping everything into HDFS [Hadoop Distributed File System] and hoping to do something with it down the road. But then they just lose track of what’s there”.

An alternative generalized model is the concept of what I call a Datacomb[2], which relies on both efficiency and logic (à la geometric algebras) for storage, structure, and discoverability. Here any typed real-world entity (RWE), or conjunction of RWEs, can be mapped using single or multiple keys. The latter is usually associated with JOIN results (Patient + Primary Physician), but these can be automatically typed as a Cartesian Product (CP) using the existing atomic entities: PATIENT×PPHYSICIAN.

Such a relation instance materializes if a fact exists about a patient having a primary physician, as in any join, but now a compound typed object exists as well. This compound object may uniquely contain data on when the patient first began going to this doctor, and the circumstances of the first visit. The actual visits are also compositionally typed (and linked) as VISIT = PATIENT×PPHYSICIAN×DATE, which would include the location[3], any tests performed, and the diagnosis. Cartesian products have the basic ability to be decomposed (projected) into the set of atomic entities ((PATIENT, PPHYSICIAN), DATE), with their original associated (row) data. If we wish to include prescribed drug therapies, we can organize this by extending the previous objects thusly: PATIENT×PPHYSICIAN×DATE×THERAPY_START. For every a ∈ PATIENT, b ∈ PPHYSICIAN, c ∈ DATE, and d ∈ THERAPY_START, a 3-simplex (4 vertices) is created, where each combination of 1–4 of these conjunctions (15 in total) has compositional semantic meaning:
[Figure: the 3-simplex PATIENT×PPHYSICIAN×DATE×THERAPY_START, with 1 cell, 4 faces, 6 edges, and 4 vertices.]
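
To make the counting concrete, here is a minimal sketch (in Python; the entity names above are used purely as labels) that enumerates every compositional type of the 3-simplex. It is illustrative only, not a prescribed implementation.

```python
# Enumerate every compositional type (face) of the 3-simplex built over
# the four atomic entity types: 4 vertices, 6 edges, 4 faces, 1 cell = 15.
from itertools import combinations

ATOMS = ("PATIENT", "PPHYSICIAN", "DATE", "THERAPY_START")

def faces(atoms):
    """Yield every non-empty subset of atoms as a compound (Cartesian-product) type."""
    for k in range(1, len(atoms) + 1):
        for combo in combinations(atoms, k):
            yield combo

by_dim = {}
for face in faces(ATOMS):
    by_dim.setdefault(len(face), []).append("x".join(face))

for k, names in sorted(by_dim.items()):
    print(f"{k}-way types ({len(names)}): {names}")
# 1-way: 4 vertices, 2-way: 6 edges, 3-way: 4 faces, 4-way: 1 cell
```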

Simplicial Databases

The ability to compose and decompose objects is very useful and mathematically sound, and it makes databases quite flexible. In fact, any set of k-joined entities can, if needed, be decomposed into k subsets of (k-1)-way CP entities, which can in turn be decomposed into k(k-1)/2 subsets of (k-2)-way CP entities, and so on, until we arrive at the k atomic entities. This structure is commonly known as a simplex, the data instance constructs are known as simplicial sets, and their many uses in data storage were first described by David Spivak [2]. One application is in statistical inference, when computing and analyzing joint and marginal frequencies or probabilities of mixed combinations of similar events or attributes. For example, if a patient has a tumor containing the somatic mutations [EGFR amp, P53, PTEN], a mutation simplex is defined that may be part of a larger mutation pattern [EGFR amp, CDK4, P53, PTEN] that some patients have, as well as subsuming smaller patterns of others: [EGFR amp, PTEN] and [EGFR amp, P53]. The entities are different subsets of co-occurring mutations, and each may carry the incidence counts for that combination as found in patients, or an identified molecular interaction between the co-occurring mutations. This is a numeric example, which can be further combined with other data.

It is worth noting that the actual physical storage implementation of a simplicial database [2] does not have to allocate every possible mutation combination, nor every combination that exists within sets of patients. The logical constraints are complete, so the model may need to allocate only those entities with which useful data can be associated (e.g., therapies). This can be considered a form of storage caching and compression, for faster look-ups and associations. Nonetheless, a simple analysis of real genomic data from ~1000 cancer patients required only a few million unique simplicial entities to be allocated and linked, which makes this highly tractable in today’s large-scale storage systems. Moreover, in data spaces where events are strongly mutually associated, the combinatorics is not unbounded, and simplicial sets often become saturated (relatively sparse) at the intermediate and lower levels.

Note that the hierarchy of entities, from large mutation combinations down to smaller subsets, forms a “sieve”. Each patient’s pattern is linked to the top (complete) entity and then filters down to all the subsets contained within that pattern, providing information on which patients share a particular sub-pattern. If these mutation distributions are not statistically independent, this provides evidence that an underlying mechanism is at work [3]. The simplicial database makes it very efficient to find all cases of shared patterns, compared to a query filter (for each) in a relational DB or an edge traversal in a data graph. The mutation simplex is formed directly by calculating and indexing the patterns from each patient’s list of mutations, and it becomes cost-efficient once most patterns are captured.
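
As a rough illustration of the sieve, here is a small sketch (the patient IDs and mutation lists are made up) that indexes every sub-pattern of each patient’s mutation set, so any shared sub-pattern can be retrieved with a single lookup.

```python
# The "sieve" index: every sub-pattern (face) of a patient's mutation simplex
# is indexed, so all patients sharing a sub-pattern come back in one lookup.
from itertools import combinations
from collections import defaultdict

patients = {                       # illustrative data only
    "P1": {"EGFR amp", "CDK4", "P53", "PTEN"},
    "P2": {"EGFR amp", "P53", "PTEN"},
    "P3": {"EGFR amp", "PTEN"},
}

sieve = defaultdict(set)           # sub-pattern -> patients sharing it
for pid, mutations in patients.items():
    for k in range(1, len(mutations) + 1):
        for sub in combinations(sorted(mutations), k):
            sieve[frozenset(sub)].add(pid)

# Which patients share the [EGFR amp, PTEN] sub-pattern?
print(sieve[frozenset({"EGFR amp", "PTEN"})])   # {'P1', 'P2', 'P3'}
```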

Returning to our original PATIENT×PPHYSICIAN×DATE example, one can build a simplicial model around the PATIENT×PPHYSICIAN pair (an edge) linked to a sequence of dates (vertices) to create an implicit series of visits (= PATIENT×PPHYSICIAN×DATE), i.e., triangular faces. This structure includes a PPHYSICIAN×DATE edge, which maps to all the patients that doctor has seen on the same day. A clear advantage of this form of database is that all key combinations are pre-computed (i.e., pre-joined), so a simple canonical n-way hash of the values can find the full set of data in a single lookup; this is very well suited for fast analytics, where multiple lookups are the equivalent of query caching. Another advantage is that the CP entities have clear automatic types and can be handled exactly by type-dependent downstream processes, specifically by descriptive algebras supporting CP entities (e.g., MUTATION_SIMPLEX×DISEASE×THERAPY → DISEASE×RESPONSE). The combined simplicial set naturally lends itself to analytics for effective treatments based on genomics and disease types.
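
To show what a single-lookup access might look like, here is a minimal sketch of one possible canonical keying scheme; the hashing choice and field names are my own assumptions, not a specification.

```python
# A pre-joined compound entity stored under one canonical key derived from its
# typed components, so retrieval is a hash lookup rather than a run-time join.
import hashlib

def canonical_key(**typed_keys):
    """Order-independent key for a compound (Cartesian-product) entity."""
    parts = sorted(f"{t}={v}" for t, v in typed_keys.items())
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

store = {}
k = canonical_key(PATIENT="p-001", PPHYSICIAN="dr-42", DATE="2017-10-24")
store[k] = {"tests": ["MRI"], "diagnosis": "..."}      # pre-joined row data

# Later: one lookup recovers the full visit record, regardless of key order.
print(store[canonical_key(DATE="2017-10-24", PATIENT="p-001", PPHYSICIAN="dr-42")])
```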

Datacomb

The basis for the ideas presented here arises from Category Theory (CT), which ensures logical consistency within a data model’s schema. The interconnected set of simplicial entities is described as a simplicial complex (partial overlaps of different simplicial elements), a well-defined object in CT[4], and it is at the heart of the formal definition of what we call a Datacomb. The complex possesses a formal query algebra over any subset of simplicial entities, and can be used to extract any geometric (connected) subset of data, including measurable things like frequencies. Note also that any graph data model is automatically a subset of a datacomb, since it is just the 1-D skeleton (vertices and edges) of the complex. The datacomb model can be implemented on top of several different storage technologies, such as multi-array DBs, RDBs, key-value NoSQL DBs, graph DBs, and (materialized) column stores (relational systems may not be practical, since they require explicit types and type-specific keying). The simplicial logic required to interface with them can be layered on top of the existing technologies, so that a common API can be installed on different storage technologies. In fact, RDF could actually be used as a universal description for internal structures in any data system (not only triplestores). All in all, the datacomb approach is a more rigorously defined solution for complex data sets than the data lake meme offers, one with real definable specifications and multiple analytic and mining applications.
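
To give a flavor of what such a common layer might look like, here is a minimal interface sketch; the class and method names are my own assumptions for illustration, not a published API.

```python
# The simplicial layer exposes the same operations regardless of whether the
# backing store is a key-value DB, an RDB, or a graph DB.
from abc import ABC, abstractmethod
from typing import FrozenSet, Iterable, Mapping

Face = FrozenSet[str]   # a set of typed components (e.g., "PATIENT=p-001") naming one cell

class SimplicialStore(ABC):
    @abstractmethod
    def put(self, face: Face, data: Mapping) -> None:
        """Attach data to a simplicial cell (any Cartesian-product entity)."""

    @abstractmethod
    def get(self, face: Face) -> Mapping:
        """Fetch the data stored on one cell."""

    @abstractmethod
    def cofaces(self, face: Face) -> Iterable[Face]:
        """All larger cells containing this one (e.g., every visit of a patient-physician pair)."""

class InMemoryStore(SimplicialStore):
    def __init__(self):
        self._cells = {}
    def put(self, face, data):
        self._cells[face] = dict(data)
    def get(self, face):
        return self._cells[face]
    def cofaces(self, face):
        return (f for f in self._cells if face < f)   # proper-subset test
```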

The datacomb can be applied in several different settings. Most naturally, it can be mapped onto any existing data-array storage systems already in place, with the extension that complex-typed objects are handled more flexibly and automatically, which is useful for precomputing data for downstream analytics. In relational DB instances, frequently materialized joins can be more formally and efficiently captured and accessed using a datacomb framework, making it easier and faster to query on conjoined content, as well as to recall the atomic entities on demand. Datacombs serve as the common superset for both data arrays and relational data, and therefore form a powerful higher-order framework that covers both data analytics and full sets of non-numeric data. As such, the datacomb offers many advantages for organizing and defining datasets for machine-learning tasks, by flexibly formatting raw data into the pre-processed structures required by many ML platforms.

In addition, when dealing with closely related entities (e.g., lists of genes and their coded proteins), instead of ambiguously choosing one identifier or the other (e.g., P204392) for recalling the whole set of related data records, a simplex of the related entities would provide a much more even and efficient way to get all the matches. It could then be keyed by any one entity (a vertex), or by the hashed sum of the full set (the k-cell). This would go a long way toward solving the biomedical disambiguation problem. It is the formal equivalent of earlier attempts like SRS[5] to connect multiple related molecular entities.
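
A toy sketch of the idea follows; the identifiers (a gene symbol, an Ensembl gene ID, and a UniProt accession) are illustrative stand-ins for any family of related entities.

```python
# A gene, its transcript/gene record, and its protein form one small simplex;
# any one identifier (or the whole set) resolves to the same bundle of records.
related = [{"EGFR", "ENSG00000146648", "P00533"}]   # one simplex per entity family

vertex_index = {}
for simplex in related:
    key = frozenset(simplex)                 # the k-cell key (could also be hashed)
    for ident in simplex:
        vertex_index[ident] = key            # any vertex resolves to the same cell

records = {frozenset(related[0]): ["expression rows...", "variant rows...", "structure rows..."]}

# Query by protein accession or by gene symbol: both hit the same record set.
print(records[vertex_index["P00533"]] is records[vertex_index["EGFR"]])  # True
```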

Datacombs can also handle non-local data by serving as local caches of all the intra- and inter-relations between data records (e.g., genomic data references), providing something much more substantial in function and structure than existing data lake models, analogous to a universal data switchboard. A cloud-based implementation should be very effective at managing all the relations between simple and complex entities from thousands (or more) of different sources. It would then effectively solve what the semantic web initiative always alluded to but never delivered: the explicit handling of complex entity logic (indexing, typing, and filtering) for data that resides in multiple sources, something usually thought to be in the purview of ontologies, yet still unsupported.

Many organizations intending to utilize their collections of data more effectively are positioning themselves around big data. Yet most data environments are a mixture of different classes of technologies, developed and installed at different times, for different goals, and accessed and managed by different groups. Trying to unify this heterogeneous mix will have a broad range of costs depending on the type of technology used, the urgency for completing it, and of course the thoroughness of the solution. This can easily range from hundreds of thousands of dollars to millions; but the cost of doing this incorrectly under a deadline may be orders of magnitude greater (over $100 million), due to the business impact of a non-optimal solution and the added cost and time of doing it right the second time. The looming challenge facing many organizations means they need to properly and confidently choose the best approach, fully considering both the maturity of the technologies and enhanced paradigms for reducing development and maintenance costs. There is concern that no database product from any traditional company is quite ready for the challenge. Consumers must therefore rely on their own knowledge of their precise needs and determine what level of innovation they are willing to invest in. A brave new world is emerging for information technologies.

References
1 – Stein, Brian; Morrison, Alan (2014). Data Lakes and the Promise of Unsiloed Data (Report). Technology Forecast: Rethinking Integration. PricewaterhouseCoopers.
2 – Spivak, David I. (2009). Simplicial Databases. https://arxiv.org/abs/0904.2012
3 – Fichtenholtz, AM; Camarda, ND; Neumann, EK (2016). Knowledge-Based Bioinformatics: Predicting Significance of Unknown Variants in Glial Tumors Through Sub-Class Enrichment. Pacific Symposium on Biocomputing 2016, pp. 297-308.




[2] Regularized structures that are semantically flexible, as with honeycombs in beehives
[3] One could argue that EVENT=DATE×LOCATION should be used rather than DATE, but often it is not needed since location does not change within a day.
[4] They are at the heart of new methodologies including topological data analysis (TDA)

Wednesday, June 2, 2010

Is Linked Data too brittle?

"Once we've linked all public data together using RDF, the world will have unprecedented access to real usable data and then things will begin to happen." - OH

Sounds great on the surface, but so far my experience suggests this solves less than 25% of the information problem - here are my thoughts why...
  • powerful data access demands powerful data interfaces - we aren't there yet by a long shot!
  • non-standard URIs prevent commercial acceptance (e.g. in the life sciences) - a social issue!
  • and most importantly: semantic linking offers little improvement if one simply converts one data syntax (tabular) to another (RDF) - here's where I think we can improve things now!
Most tabular semantics are terrible (and often absent altogether); they were defined to quickly arrange and store information so as to operate within a row-column access protocol (for a more in-depth discussion see From Tables to RDF). But just as bad is simply re-manifesting a data table to work across the Web. As RDF, the data now becomes a kind of global Truth, when it really is most often just one Facet of contextual facts associated with some of the contained objects.

Some may argue that NamedGraphs and Reification can come to the rescue here by providing appropriate Fact Semantics. Perhaps, but unfortunately they do not appear to be part of any projects like LOD, and from what I can tell these possibilities are non-normative, which is the opposite of what public efforts need. Projecting data into RDF without consideration of Context or Fact Semantics leads to creating mountains of brittle data that can only be used in limited cases, i.e., only around the context they were created under, such as a gene expression study. Researchers trying to build up knowledge about genes in general will have a tougher time separating universal truths from contextual ones (e.g., experimental results). And since RDF conversions are happening now all over the web, if one does not take care, we all could get contaminated with irregular facts based on the brittleness of the implied data semantics: a very real Tower of Semantic Babel!

In the case of gene expression data, which contains genes and their tissue-specific expression measurements, such data must be viewed in the context of the experiment (i.e., the conditions, interventions, tissue sampling, background genetics, etc.). Simply turning an expression set into gene-expression-value RDF triples would be an inappropriate form for web publishing: it makes the gene information brittle and of limited use! Unfortunately, I have not seen any recorded discussion on how to address this, since a lot of efforts are about convincing as many people as possible "to convert their data to RDF". I think this is a dangerous prescription, and a data integrity bubble is growing that will eventually burst!

Let's step back a bit and review the history...

The shift in describing the Semantic Web from a system of information semantics to linking data across resources was a technically subtle but strategically important move. Strong efforts by the W3C trying for many years to explain the need for information semantics were met with confusion and disagreement as to what semantics meant (the irony of poor semantics of semantics is not missed).

At the end of the day, the message of reducing syntactical ambiguity of information (every data type needed a different parser, e.g., XML) was lost on most people (parsers keep people employed!). The notion of turning HTML links from formless web links into clear relation types was not obvious to many. Basically, people felt the web obviously "looked" as if it had semantics (the blue colored links were situated at meaningful locations in text), so why all this extra semantic work? Who really needs machine-readable data? It already goes through web servers and browsers so isn't it already machine-readable?

By shifting focus to "linking data", those individuals involved in data interchange and storage (the IT folks) were brought into the discussion, and they seemed better able to grasp the significance of using a standard like RDF. By saying that linked data enables the open connecting and handling of data from diverse locations on the web, many of the subtleties of the Semantic Web began to make more real-world sense to people. Specifically, most IT developers have struggled for years within companies to provide standardized means of integrating their databases, with few practical results to show. This Linked Data idea actually looked like it might have promise... hurray!!

Still, there was some confusion around "what is a URI exactly?" Is it an identifier? Is it a web location? What do I find when I go there? IMHO this could have been handled better (another post eventually), by discussing the semantic theoretics of URIs before moving on to RDF (TBL's design discussions on URIs were not very intuitive to most data experts). I think the issues around URIs have begun to get settled and most people are OK with them now; for the most part, the religious wars around LSIDs and other URN approaches seem to have subsided.

However, all these discussions have focused primarily on mapping existing data structures (linked tables) to a web-based way of doing things. That is OK for some, but many in the life sciences need the newly converted information to be in a form that is ready for day-to-day research (e.g., tab-delimited formats), and not just for public sector data. Data semantics should clearly empower informaticists beyond what they can quickly do with tables and Perl scripts - they need gene information to be readily applied to SNP analysis, gene expression studies, or molecular structure analyses. If commercial groups are to get involved, the issues around fact semantics and data brittleness need to be addressed ASAP!

My own efforts involving Data Articulation try to address this by offering a strategy that recognizes there is no single way of describing connected information: some forms may be more appropriate for public resource publishing, while others are better suited for deep computational analytics and mining. Data articulation provides a method for taking contextualized data forms (including Named Graphs) and generating internal forms (e.g., workspaces) optimized for computational objectives. In addition, while this approach can take advantage of ontologies, it cannot by itself be captured in any single ontology (it's actually meta-ontological). That's because data articulation is really about applying the right rule transform (a SPARQL CONSTRUCT) for the right semantics and context. In fact, it may not even require any complex ontologies to be available to, or part of, the data sets; perhaps ontologies can be "injected" at the time they are actually required rather than being non-modal and global.

A good example comes from mining and analyzing pathway data that can be obtained in the BioPax OWL format from Reactome and other sources. BioPax supports a lot of semantic structures including recursive protein complex structures; data articulation allows us to create reaction steps from Reactome-BioPax that include the proteins as direct participants of a reaction. This allows much faster pathway queries and traversals and improved pathway visualizations (topic for another blog). These efficient forms are not necessarily what you would wish to publish, but could be (with the proper context) included explicitly within the set. Indeed, I think there is a strong relation between data articulation and semantic data visualization, which I am in the midst of exploring with BBN (yeah, the original guys).
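
To make this concrete, here is a sketch of the kind of rule transform I mean, written as a SPARQL CONSTRUCT run through rdflib. It assumes BioPAX Level 3 property names (bp:left, bp:right, bp:component), and the ex:participant predicate and file path are purely illustrative, not the actual transforms used in my projects.

```python
# Promote proteins buried inside complexes to direct participants of a
# reaction, producing a flattened "workspace" graph for fast traversal.
from rdflib import Graph

ARTICULATE = """
PREFIX bp: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX ex: <http://example.org/articulated#>
CONSTRUCT { ?rxn ex:participant ?protein . }
WHERE {
  ?rxn a bp:BiochemicalReaction .
  { ?rxn bp:left ?p } UNION { ?rxn bp:right ?p }
  OPTIONAL { ?p bp:component+ ?member }
  BIND(COALESCE(?member, ?p) AS ?protein)
}
"""

source = Graph().parse("reactome-sample.owl")   # a BioPAX Level 3 export (path is illustrative)
workspace = Graph()
for triple in source.query(ARTICULATE):
    workspace.add(triple)                       # the fast, flattened internal form
```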

I strongly believe data articulation is key to taking data in context-rich forms from around the Web and flexibly transforming them into the proper scientific semantic forms for a specific task. For this to work, the initial source forms on the web must explicitly include all contextual and fact semantics, and we'll need to develop proper semantic standards that work correctly with the different data domains coming from their corresponding communities (life sciences, financial, news media, etc.). For now, data articulation is a de facto part of the solutions my company provides, but I hope it becomes commonplace. There is strong demand for it from my clients when they are presented with the issues of data utilization and life cycles.

As more new linked data apprentices convert their tables into RDF, the piles of brittle data will continue to grow, and may actually impede the uptake and use of linked data. For those of us who have advocated semantic approaches for over 10 years, this is a serious concern. We need to be making realistic plans about what kind of semantics must accompany public and proprietary data sets when they are converted. Perhaps we should propose a new semantic linked data challenge?

From Tables to RDF

A lot of us have converted basic tabular data into RDF in our local projects, but beyond these simple examples, discussions of how best to transform table data seem to be limited. For instance, when should column-based values be treated as direct predicates of a row subject, and when should cells be treated as objects linked by double-key predicates (e.g., one from a gene-probe object and one from a sample object)? In the case of gene expression data, the latter is clearly preferred. But where on the web are these useful rules and patterns written down for interested SW newbies? I hope the following discussion may somehow promote the formation of better RDF data pattern resources...
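
As a quick sketch of the two patterns using rdflib (the namespace, property names, and identifiers are all illustrative, not a recommended vocabulary):

```python
from rdflib import Graph, Namespace, Literal, BNode

EX = Namespace("http://example.org/expr#")
g = Graph()

# Pattern 1 (usually wrong for expression data): the cell value hung directly
# off the gene, as if it were a context-free property of the gene.
g.add((EX.TP53, EX.expressionValue, Literal(7.2)))

# Pattern 2 (preferred): the cell becomes a measurement node keyed by BOTH the
# gene-probe and the sample, so the value keeps its experimental context.
m = BNode()
g.add((m, EX.probe, EX.probe_0042))
g.add((m, EX.sample, EX.sample_17))
g.add((m, EX.value, Literal(7.2)))
```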

Most tabular data semantics are quite poor or even non-existent. They are defined to work within particular established information technologies like RDBMSs, with minimal focus on content meaning (i.e., technology before content). This can be clearly understood from an economics point of view, where selling a DB technology scales better than building a superior "intelligent" solution for each data set (maybe that's why the SW took so long to catch on?). In any case, existing data tables probably lack the semantics most of us in the SW community are used to expecting. In some cases, better semantics can be added, since it's a class-general object-attribute adjustment; in other cases, it may require metadata and context that were never properly captured and are now lost for good. Nonetheless, we need to be aware of these issues when going forward with RDF-izing both legacy and new data systems.

As a useful example, a table containing rows of patients with certain symptoms or adverse events to drugs SHOULD NOT be RDF-encoded with the patient having a direct symptom attribute! Why? Because the symptom or AE occurred and was observed at a specific time; therefore the patient should really be linked to an observation with metadata on time, place, test, and physician, and the observed symptoms should then be linked into the observation object. CDISC's SDTM was designed to handle this context of clinic visits and clinical findings; much of this comes from the BRIDG model that SDTM, HL7 RCRIM, and NCI follow. In this case semantics are available, but it also means that converting SDTM data as row-column-cell triples will not work, since implicit (anonymous) finding and observation nodes need to be properly inserted in between attributes (described in the draft DSE Note at the W3C HCLSIG).
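
Here is a sketch of that observation pattern in rdflib; the property names are illustrative placeholders, not the actual SDTM/BRIDG vocabulary.

```python
# The symptom hangs off an observation node that carries its own time and
# visit context, not directly off the patient.
from rdflib import Graph, Namespace, Literal, BNode
from rdflib.namespace import XSD

EX = Namespace("http://example.org/clinical#")
g = Graph()

obs = BNode()                                   # the implicit "finding" node
g.add((EX.patient_017, EX.hasObservation, obs))
g.add((obs, EX.observedSymptom, EX.Nausea))
g.add((obs, EX.observedAt, Literal("2009-02-23", datatype=XSD.date)))
g.add((obs, EX.duringVisit, EX.visit_017_03))
g.add((obs, EX.recordedBy, EX.dr_smith))
```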

But other cases exist where the semantics have yet to be defined. For example, DrugBank is a data model for critical information about a drug at the time of its approval. It would therefore make sense to "date" the individual records with this approval date, but that means associating the "creation date" with the DrugCard record, not with the approved Drug itself. Any new facts gained about a drug over time, such as new indications, label changes, and adverse events, should be associated with the drug in the proper context (possibly a versioned DrugCard). DrugBank therefore has at least two classes of primary subjects: DrugCard records and Drugs, both of which require URIs. In addition, DrugCard records will need to be versioned and linked to their previous forms. This often cannot be inferred by non-domain informaticists looking just at the data tables; it requires working side-by-side with drug experts in the domain. To date, this has been woefully unsupported, even in groups like HCLS, where input from pharmaceutical experts is occasional.
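
A small sketch of that two-subject modeling (the URIs are illustrative; dcterms:created and dcterms:replaces are standard Dublin Core terms):

```python
# The approval-time snapshot dates the DrugCard record, not the Drug itself,
# and record versions are chained to their predecessors.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import DCTERMS, XSD

EX = Namespace("http://example.org/drugbank#")
g = Graph()

g.add((EX.drugcard_DB00945_v2, EX.describes, EX.drug_aspirin))
g.add((EX.drugcard_DB00945_v2, DCTERMS.created, Literal("2008-12-30", datatype=XSD.date)))
g.add((EX.drugcard_DB00945_v2, DCTERMS.replaces, EX.drugcard_DB00945_v1))
# New facts learned later attach to the drug, each within its own context/record.
```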

Serious LOD efforts should work more closely with the key domain experts to properly preserve and correct the source data semantics. It is also clear we need to create URIs under the proper authorities rather than the current quickie approach, since it doesn't look convincing to a company to see "www.fu-berlin.de" as the data domain. I have had to convert many of the LODD source data sets into proper semantic forms, mainly because the available LODD sets contain flaws and poor semantics that prevent their commercial use. These are necessary principles to follow if we intend to offer more data value by using semantic linked data standards.

Monday, February 23, 2009

The Graveyards of Knowledge

Web content has its blessings: it is easy to publish and style-edit. The rise of wikis and blogs indicates the Web has come of age... 

But there is a dark side to some of this as well, and there are lessons to learn from approaches that are not so successful. Content Management is an essential part of any company's existence. Tools that easily enable users to create spaces for uploading thematic content have been gratefully embraced. Yet too often it is easy to upload a document, send a notice to all that you have done so, and then lose track of it. We think we're putting it in a safe and accessible place, but humans by themselves can't keep track of thousands of digital assets.

One colleague of mine at Aventis called a commonly used content management system "a Graveyard of Knowledge". Technical folks also refer to this as "a technology mouse trap": information goes in but it rarely comes out. Of course many of us have been told "that's what search engines are for". But what do you 'search on' to find precisely that one doc you only sort of remember in bits and pieces? Once your content management system holds a reasonable 10,000 items, the word pairings used in the search won't always work completely. You find some docs, not quite the right ones, miss the important ones, and what's worse: you can't even estimate how much was not recovered! And if it's about the metadata and links, who is responsible for that? IT can't do it, since it's about knowing the content.

Governance, stewardship, ownership
There is no substitute for taking responsibility for handling content you've either created or requested. You, as owner, know what it contains and what it is relevant for. Every digital creation should have a strong link back to its author (yes, I do mean RDF triples). This puts the 'human value' back into the digital equation. Not only does it allow a reader to go back to the source, it can also provide information on the circumstances and resolutions of the discussed issues.

Data Stewardship has a special meaning in these days of content management and linking data: data, metadata, and annotations should be the responsibility of each contributor. In the case of some internal databases, this translates into knowing a lot about the content, how it is updated, what domain QA principles are in place (rather than simply checking for completed data fields), and most importantly how well data consumers are able to utilize it.

The support provided by RDF and data linking could be applied along specific policies to improve these issues. By themselves they won't solve them, since there needs to be an accompanying change in the culture, and not only within IT. The scientific producers and consumers should be taking up the stewardship role more often, since it is their content, and so the technologies must become usable enough to make their tasks possible.

All scientists from now on need to become Data Stewards. Consequently, all support systems need to be designed to work easily within their domain, i.e., with no need for additional complicated applications or configuration tasks. And there are great examples of this already happening: internal Knowledge Wikis. One example is Pfizerpedia, a system heavily used by Pfizer's researchers and based on MediaWiki. Scientists already use them and, in many cases, demand access to them.

This wave is promising and should be allowed to grow, but a major element is still missing: easily allowing direct links within these wikis to data records and metadata descriptors. These links will serve not only to improve human-requested searches, but machine-driven discovery as well, which is enormously scalable. Once these are integrated into existing content systems, real Knowledge Environments will begin to take shape in companies, and their usage should have a pronounced benefit for company innovation. Perhaps one research group in the company will be able to find the results of work already performed by another group 8 months earlier, and successfully find a new therapy in half the time? Who can afford not to improve these days?

Tuesday, December 30, 2008

What is Recombinant Data?

I'm kicking off this blog with a discussion of a general theme, one that will come up again in subsequent topics. In fact, it's the name of this blog site: Recombinant Data. The reason I went so far as to name this site accordingly is that the idea behind "Recombinant Data" is very powerful, yet it runs counter to the practices of software developers over the last several years. It therefore really deserves its own web site for clarification, for building on examples, and for ongoing community discussions. The first mention of Recombinant Data was by Eric Miller, while he was the W3C liaison for the Healthcare and Life Sciences Interest Group. Since then, I've used it countless times in presentations to various groups, since it is an essential cornerstone of the Semantic Web initiative [the topic of many future posts].

First, a little bit of background: The established way of thinking about software and data has been that an application is the primary point of user experience and the data it creates (and reads) is a persistent artifact whose (user) value depends very much on the application "to read it and to know what to do with it". In other words, data semantics is interpreted by a specific application, and therefore only within the context of that app. Consequently, the efficient re-use of data (data interoperability) is impeded, and it is now at the mercy of specialized contracts or "standards" that must be created between application sets (e.g., Adobe-PDF or Office Suite).

Perhaps this model is good enough for apps that are always used the same way by millions of consumers, for things like word processing or presentations. But if there is to be any hope for improved interoperability in emerging and complex areas such as healthcare, scientific research, or other knowledge-managing fields, waiting for the "right standards" to emerge is like waiting for bacteria to grow wings... [more on standards in another blog]. Standards aren't wrong; they should (from now on) be about practice and semantics, rather than data formats and APIs!

Recombinant Data (RD) takes a very different starting point: it is about structuring data with minimum syntactic rules (MSR), yet with enough semantics so that the data output from one app can be easily read and handled by another app, even though neither app has any specific contract apart from the MSR. And though semantics are necessary for understanding what the data is, only knowledge of enough semantics (patients are a kind of person) is required by an external app (myMail) to use the necessary part of the data (patient identifiers about me). Being able to use the right subset of semantics for additional operations by various apps allows for the semantic-invariant mixing and separation of data: no matter what gets pulled together from different sources or apps, the collective set (merged graph) is consistent and logically meaningful. And here is where RD gets its name, borrowing heavily from the biological concept of Recombinant DNA: "two sets of genomes can recombine with one another, without losing or destroying any of their genetic code". In Recombinant Data's case, the logic within the data content is preserved.
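
A tiny sketch of that merge property using rdflib (all URIs and values are illustrative):

```python
# Triples written independently by two apps merge into one graph, and nothing
# in either set is lost or rewritten in the process.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/people#")

clinic_app = Graph()
clinic_app.add((EX.me, EX.isA, EX.Patient))
clinic_app.add((EX.me, EX.patientId, Literal("PT-0042")))

mail_app = Graph()
mail_app.add((EX.me, EX.nickname, Literal("Phaedrus")))

merged = Graph()
for triple in clinic_app:
    merged.add(triple)
for triple in mail_app:
    merged.add(triple)

print(len(merged))   # 3 -- the union keeps every statement from both sources
```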

Implicit here is the free and open access of semantic definitions, such that an app (or the developer) "can learn more about a given data's semantics" when necessary. This translates into the open publishing of semantic schemas and ontologies, to be used from anywhere on the web. Another requirement is for open-world logic assumptions: not having something does not mean it doesn't exist (e.g., just because a data set does not state "my nickname is Phaedrus" doesn't mean it isn't). Recombinant Data does alter some of the basics about trusting the completeness of data, but this can be re-established through other mechanisms (provenance tracking, verification, proofs, NamedGraphs)... but that's for another day. As each issue is sufficiently addressed, we will see data become "application independent", epitomizing true and sustainable interoperability. Applications that can work with RD will also become much more powerful and beneficial to users, and could spawn a new generation of cool, incrementally extensible apps (hint to you vendors!). I also plan to discuss some of these possibilities in the future as well...

In closing this inaugural post, I see the emergence of the Semantic Web strongly requiring a rethinking of the relations between applications and data. This applies equally to commercial and open source software and resources. In fact, it has some fascinating implications for apps running on personal laptops and hand-helds (to be addressed in another blog). I will also point out that there are forces trying to prevent this from happening. Since the current thinking among commercial vendors is that income is associated with licensing apps, and app-independent data will free users from data-format lock-in, they will view Recombinant Data as anathema to their objectives. However, this is completely wrong, since improved app functionality is what people really want, and Recombinant Data should always trump other approaches for improving apps. We just need to get the ecosystem positioned properly so that basic market forces can take over...