Wednesday, June 2, 2010

Is Linked Data too brittle?

"Once we've linked all public data together using RDF, the world will have unprecedented access to real usable data and then things will begin to happen." - OH

Sounds great on the surface, but so far my experience suggests this solves less than 25% of the information problem - here are my thoughts on why...
  • powerful data access demands powerful data interfaces - we aren't there yet by a long shot!
  • non-standard URIs prevent commercial acceptance (e.g. in the life sciences) - a social issue!
  • and most importantly: semantic linking offers little improvement if one simply converts one data syntax (tabular) to another (RDF) - here's where I think we can improve things now!
Most tables have terrible semantics (and are often devoid of any): they were defined to quickly arrange and store information within a row-column access protocol (for a more in-depth discussion see From Tables to RDF). But just as bad is manifesting a data table, as-is, across the Web. As RDF, the data now becomes a kind of global Truth, when it really is most often just one Facet of contextual facts associated with some of the contained objects.
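
To make this concrete, here is a minimal sketch of what a typical row-to-triples conversion produces (the ex: vocabulary, gene, and values are hypothetical, chosen to anticipate the gene expression discussion below) - the row's cells become bare global assertions:

    # A hypothetical expression-table row (GENE=BRCA1, TISSUE=breast, VALUE=7.2)
    # naively flattened into triples.
    PREFIX ex: <http://example.org/vocab#>
    INSERT DATA {
      ex:BRCA1  ex:expressedIn      ex:BreastTissue ;
                ex:expressionValue  7.2 .
    }

Nothing in these triples records that the value was measured in one experiment, under one treatment, on one tissue sample.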

Some may argue that Named Graphs and reification can come to the rescue here by providing appropriate Fact Semantics - perhaps, but unfortunately neither appears to be part of projects like LOD, and from what I can tell both remain non-normative, which is the opposite of what public efforts need. Projecting data into RDF without consideration of Context or Fact Semantics creates mountains of brittle data that can be used only in limited cases, i.e., only within the context they were created under, such as a gene expression study. Researchers trying to build up knowledge about genes in general will have a tougher time separating universal truths from contextual ones (e.g., experimental results). And since RDF conversions are now happening all over the web, if we do not take care, we could all become contaminated with irregular facts rooted in brittle implied data semantics - a very real Tower of Semantic Babel!
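
For contrast, here is a hedged sketch (same hypothetical vocabulary) of how a Named Graph could carry the missing context: the observations are quoted inside a graph of their own, and that graph is then described as an experimental result rather than a universal truth:

    PREFIX ex: <http://example.org/vocab#>
    INSERT DATA {
      # the same observations, now held in their own graph...
      GRAPH ex:experiment42 {
        ex:BRCA1  ex:expressedIn      ex:BreastTissue ;
                  ex:expressionValue  7.2 .
      }
      # ...and the graph itself described: these are experimental
      # results under stated conditions, not universal gene facts.
      ex:experiment42  ex:derivedFrom  ex:MicroarrayStudy7 ;
                       ex:condition    ex:TamoxifenTreatment .
    }

The graph URI gives the facts a handle, which is exactly what Fact Semantics needs to hang on to.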

In the case of gene expression data, which contains genes and their tissue-specific expression measurements, such data must be viewed in the context of the experiment (i.e., the conditions, interventions, tissue sampling, background genetics, etc.). Simply turning an expression set into gene-expression-value RDF triples, as in the first sketch above, would be an inappropriate form for web publishing: it makes the gene information brittle and of limited use! Unfortunately, I have not seen any recorded discussion of how to address this, since a lot of efforts are about convincing as many people as possible "to convert their data to RDF". I think this is a dangerous prescription, and a data integrity bubble is growing that will eventually burst!
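
Once the context is explicit, consumers can at least ask for it. Here is a sketch of a query (again with my hypothetical vocabulary) that returns expression values only together with the conditions under which they were measured:

    PREFIX ex: <http://example.org/vocab#>
    # Return each expression value alongside the experimental
    # condition recorded for its source graph.
    SELECT ?gene ?value ?condition
    WHERE {
      GRAPH ?exp { ?gene ex:expressionValue ?value . }
      ?exp ex:condition ?condition .
    }

A researcher hunting for universal gene facts can then filter on the graph's description, instead of inheriting every contextual measurement as if it were a global truth.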

Let's step back a bit and review the history...

The shift in describing the Semantic Web from a system of information semantics to linking data across resources was a technically subtle but strategically important move. Strong efforts by the W3C, trying for many years to explain the need for information semantics, were met with confusion and disagreement as to what semantics even meant (the irony of the poor semantics of "semantics" is not lost on me).

At the end of the day, the message of reducing the syntactic ambiguity of information (every data type needed a different parser, e.g., for each XML format) was lost on most people (parsers keep people employed!). The notion of turning formless HTML links into clear relation types was not obvious to many. Basically, people felt the web obviously "looked" as if it had semantics (the blue-colored links were situated at meaningful locations in text), so why all this extra semantic work? Who really needs machine-readable data? It already goes through web servers and browsers, so isn't it machine-readable already?

By shifting focus to "linking data", those individuals involved in data interchange and storage (the IT guys) were brought into the discussion, and they seemed better able to grasp the significance of using a standard like RDF. By saying that linked data enables the open connecting and handling of data from diverse locations on the web, many of the subtleties of the Semantic Web began to make more real-world sense to folks. Specifically, most IT developers have struggled for years to provide their companies with standardized means of integrating databases, with few practical results to show for it. This Linked Data idea actually looked like it might have promise... hurray!!

Still, there was some confusion around "what is a URI exactly?" - is it an identifier, is it a web location, what do I find when I go there? IMHO this could have been handled better (another post eventually) by discussing the semantic theory of URIs before moving on to RDF (TBL's design discussions on URIs were not very intuitive to most data experts). I think the issues around URIs have begun to get settled and most people are OK with them now - for the most part, the religious wars around LSIDs and other URN approaches seem to have subsided.

However, all these discussions have focused primarily on mapping existing data structures (linked tables) to a web-based way of doing things. That is fine for some, but many in the life sciences need the newly converted information in a form that is ready for day-to-day research (e.g., tab-delimited formats), and not just for public sector data. Data semantics should clearly empower informaticists beyond what they can quickly do with tables and Perl scripts - they need gene information that can be readily applied to SNP analysis, gene expression studies, or molecular structure analyses. If commercial groups are to get involved, the issues around fact semantics and data brittleness need to be addressed ASAP!

My own efforts involving Data Articulation try to address this by offering a strategy that recognizes there is no single way of describing connected information: some forms may be more appropriate for public resource publishing, while others are better suited for deep computational analytics and mining. Data articulation provides a method of taking contextualized data forms (including Named Graphs) and generating internal forms (e.g., workspaces) optimized for computational objectives. In addition, while this approach can take advantage of ontologies, it cannot by itself be captured in any single ontology (it is actually meta-ontological). That's because data articulation is really about applying the right rule transform (a SPARQL CONSTRUCT) for the right semantics and context. In fact, it may not even require complex ontologies to be available to, or part of, the data sets; perhaps ontologies can be "injected" at the time they are actually required, rather than being non-modal and global.
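
As a rough sketch of what such a rule transform could look like (the vocabulary is again hypothetical, not actual product code), here is a SPARQL CONSTRUCT that articulates the context-rich named-graph form above into a flat, workspace-ready form for one analysis task:

    PREFIX ex: <http://example.org/vocab#>
    # Articulation rule: pull expression facts out of their source graphs
    # into a flat analysis form, carrying the condition along as a plain
    # property so downstream tools need no named-graph machinery.
    CONSTRUCT {
      ?gene  ex:observedValue   ?value ;
             ex:underCondition  ?condition .
    }
    WHERE {
      GRAPH ?exp { ?gene ex:expressionValue ?value . }
      ?exp ex:condition ?condition .
    }

The published, contextualized form stays authoritative; the constructed form is disposable and tuned to the job at hand.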

A good example comes from mining and analyzing pathway data, which can be obtained in the BioPax OWL format from Reactome and other sources. BioPax supports many semantic structures, including recursive protein complex structures; data articulation allows us to create reaction steps from Reactome-BioPax that include the proteins as direct participants of a reaction. This allows much faster pathway queries and traversals, and improved pathway visualizations (a topic for another post). These efficient forms are not necessarily what you would wish to publish, but they could be included explicitly within the set (with the proper context). Indeed, I think there is a strong relation between data articulation and semantic data visualization, which I am in the midst of exploring with BBN (yeah, the original guys).
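
Here is a hedged sketch of that flattening, assuming standard BioPax Level 3 class and property names (the ex: output vocabulary is mine, and the real transforms are surely more involved): a CONSTRUCT that walks nested complexes with a property path and attaches each protein directly to its reaction:

    PREFIX bp: <http://www.biopax.org/release/biopax-level3.owl#>
    PREFIX ex: <http://example.org/vocab#>
    # Flatten recursive complexes: any protein reachable through zero or
    # more bp:component links becomes a direct participant of the reaction.
    CONSTRUCT {
      ?reaction  ex:directParticipant  ?protein .
    }
    WHERE {
      ?reaction  a  bp:BiochemicalReaction ;
                 bp:left|bp:right  ?entity .
      ?entity  bp:component*  ?protein .
      ?protein  a  bp:Protein .
    }

The design point is that the expensive traversal is paid once, at articulation time, rather than in every pathway query thereafter.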

I strongly believe data articulation is key to taking context-rich data forms from around the Web and flexibly transforming them into the proper scientific semantic forms for a specific task. For this to work, the initial source forms on the web must explicitly include all contextual and fact semantics, and we will need to develop proper semantic standards that work correctly with the different data domains coming from their corresponding communities (life sciences, financial, news media, etc.). For now, data articulation is a de facto part of the solutions my company provides, but I hope it becomes commonplace. There is strong demand for it from my clients once they are presented with the issues of data utilization and life cycles.

As more new linked data apprentices convert their tables into RDF, the piles of brittle data will continue to grow and may actually impede the uptake and use of linked data. For some of us who have advocated semantic approaches for over 10 years, this is a serious concern. We need to be making realistic plans about what kinds of semantics need to accompany public and proprietary data sets when they are converted. Perhaps we should propose a new semantic linked data challenge?
