Wednesday, June 2, 2010

From Tables to RDF

A lot of us have converted basic tabular data into RDF in our local projects, but going beyond these simple examples, the discussions of how best to transform table data seem to be limited. For instance, when should column-based values be treated as direct predicates of a row subject, and when should cells be treated as objects linked by double-key predicates (e.g. one from a gene-probe object and one from a sample object). In the case of gene expression data, clearly the latter should is preferred. But where on the web are these useful rules and pattern written down for interested SW newbies? I hope the following discussion may somehow promote the formation of better RDF data pattern resources...

Most data tabular semantics are quite poor or even non-existent. They are defined to work within particular established information technologies like RDBM, and minimize focus on content meaning (i.e., technologies before content). This can be clearly understood from an economics point of view, where selling a DB technology scales better than building the superior "intelligent" solution for each data set (maybe that's why SW took so long to catch on?). In any case, existing data tables probably lack the necessary semantics most of us in the SW community are used to expecting. In some cases, better semantics can be added since it's a class-general object-attribute adjustment; in other cases, it may require metadata and context that was never properly captured and now is lost for good. Nonetheless, we need to be aware of these issues going forward with RDF-izing both legacy and new data systems.

As a useful example, a table containing rows of patients with certain symptoms or adverse events to drugs SHOULD NOT be RDF-encoded where the patient has a direct symptom-attribute! Why, because the symptom or AE occurred and was observed at a specific time, therefore the patient should really be linked to an observation with metadata on time, place, test, and physician, and then the observed symptoms linked into the observation object. CDISCs SDTM was designed to handled this context of visits to the clinic and clinical findings; much of this comes from the BRIDG model that SDTM, HL7 RCRIM, and NCI follow. In this case semantics are available, but it also means converting SDTM data as row-column-cell triples will not work, since implicit (anonymous) finding and observation nodes need to be properly inserted in between attributes (described in the draft DSE Note at W3C HCLSIG).

But other cases exist where the semantics have yet to be defined. For example DrugBank is a data model for critical information about a drug at the time of its approval. It would therefore make sense to "date" the individual records with this approval date, but that means associating the "creation date" with the DrugCard record, not the approved Drug itself. Any new facts gained about a drug over time, new indications, label changes, adverse events, etc, should be associated to the drug with the proper context, (possibly a versioned DrugCard). Therefore DrugBank has at least 2 classes of primary subjects: DrugCard records and Drugs, both which require URIs. In addition, DrugCard records will need to be versioned and linked to previous forms.  This often cannot be inferred by non-domain informaticists looking just at the data tables, but rather requires working side-by-side with drug experts in the domain. To date, this has been woefully unsupported, even in groups like HCLS where input from pharmaceutical experts in occasional.

Serious LOD efforts should work more closely with the key domain experts to properly preserve and correct the source data semantics. It is also clear we need to create URIs with the proper authorities than the current quickie approach, since it doens't look convincing for a company to see "www.fu-berlin.de" as the data domain.  I have had to convert many of the LODD source data sets into proper semantic forms mainly because the available LODD sets contain flaws and poor semantics that prevent commercial use of them. These are necessary principles to follow if we intend to offer more data value by using semantic linked data standards.

1 comment:

  1. Hi Eric, thanks for referring back to our old DSE note 0) with some early ideas on a semantic web representation of CDISC SDTM standard.

    Would be great to progress this into "useful rules and pattern written down for interested SW newbies" working with observational data. By pragmatically 1) applying the realism based rigorness urged by Barry Smith and Werner Ceusters, 2) using the Information Artifact Ontology proposed by Alan Ruttenberg et al for information content entities, and 3) assuming the basic principles for real world entities as Kingsley Uyi Idehen propose in the Data 3.0 Manifesto.

    0) W3C note on Drug Safety and Efficacy Note on CDISC's Study Data Tabulation Model (SDTM) http://bit.ly/doZZkT
    1) Ontologies in healthcare and the vision of personalized medicine, especially the slides on "Ontology of General Medical Science" and on "Adverse Event Ontology" http://www.referent-tracking.com/RTU/sendfile/?file=VUBleerstoel20092010-L3.ppt
    2) http://code.google.com/p/information-artifact-ontology/
    3) Data 3.0 Manifesto for Platform Agnostic Structured Data http://bit.ly/9Z7JpE

    ReplyDelete