But there is also a dark side to some of this. There are lessons to learn from approaches that have not been so successful. Content management is an essential part of any company's existence. Tools that let users easily create spaces for uploading thematic content have been gratefully embraced. Yet too often it is easy to upload a document, send a notice to everyone that you have done so, and then lose track of it. We think we're putting it in a safe and accessible place, but humans by themselves can't keep track of thousands of digital assets.
One colleague of mine at Aventis called a commonly used content management system "a Graveyard of Knowledge". Technical folks also refer to this as "a technology mousetrap": information goes in, but it rarely comes out. Of course, many of us have been told "that's what search engines are for". But what do you search on to find precisely the one document you only half remember, in bits and pieces? Once your content management system holds a reasonable 10,000 items, the word pairings used in a search won't always work. You find some documents, not quite the right ones, miss the important ones, and, what's worse, you can't even estimate how much was not recovered! And if the answer lies in the metadata and links, who is responsible for those? IT can't do it, since it requires knowing the content.
Governance, stewardship, ownership
There is no substitute for taking responsibility for the content you have either created or requested. You, as the owner, know what it contains and why it is relevant. Every digital creation should have a strong link back to its author (yes, I do mean RDF triples). This puts the human value back into the digital equation. Not only does it allow a reader to go back to the source, it can also provide information on the circumstances and resolution of the issues discussed.
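To make the idea concrete, here is a minimal sketch of what such an author link looks like as RDF triples. The document and person URIs are hypothetical examples, and the serializer is a toy; a real deployment would use a proper RDF library and the company's own URI scheme, though the Dublin Core `creator` predicate shown is a standard choice.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# Here a document is linked back to its author with Dublin Core terms.
DCT = "http://purl.org/dc/terms/"

triples = [
    ("http://example.com/docs/report-42", DCT + "creator",
     "http://example.com/people/jdoe"),                       # who made it
    ("http://example.com/docs/report-42", DCT + "created",
     "2008-06-15"),                                           # when
    ("http://example.com/people/jdoe",
     "http://xmlns.com/foaf/0.1/name", "J. Doe"),             # who that is
]

def to_ntriples(triples):
    """Serialize triples as N-Triples; objects starting with http:// become URIs."""
    lines = []
    for s, p, o in triples:
        obj = f"<{o}>" if o.startswith("http://") else f'"{o}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

print(to_ntriples(triples))
```

Because each triple names its subject explicitly, a reader (or a program) holding only the document's URI can follow the `creator` link back to a person, and from there to anything else asserted about them.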
Data Stewardship has a special meaning in these days of content management and linked data: data, metadata, and annotations should be the responsibility of each contributor. For some internal databases this translates into knowing a great deal about the content: how it is updated, what domain QA principles are in place (rather than simply checking for completed data fields), and, most importantly, how well data consumers are able to use it.
The support provided by RDF and data linking, applied alongside specific policies, could improve these issues. By themselves they won't solve them: there must be an accompanying change in culture, and not only within IT. The scientific producers and consumers should take up the stewardship role more often, since it is their content, and so the technologies must become usable enough to make that task practical.
All scientists from now on need to become Data Stewards. Consequently, all support systems need to be designed to work easily within their domain, i.e. with no need for additional complicated applications or configuration tasks. And there are great examples of this already happening: internal knowledge wikis. One example is Pfizerpedia, a MediaWiki-based system heavily used by Pfizer's researchers. Scientists already use such wikis and, in many cases, demand access to them.
This wave is promising and should be allowed to grow, but a major missing element is the ability to link directly from within these wikis to data records and metadata descriptors. Such links would serve not only human-requested searches but machine-driven discovery as well, which is enormously scalable. Once these are integrated into existing content systems, real Knowledge Environments will begin to take shape in companies, and their use should have a pronounced benefit for company innovation. Perhaps one research group in the company will be able to find the results of work already performed by another group eight months earlier, and successfully deliver a new therapy in half the time? Who can afford not to improve these days?
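The machine-driven discovery mentioned above amounts to pattern matching over such links. A minimal sketch, assuming hypothetical wiki-page and data-record identifiers and an invented `references` predicate: given pages annotated with triples pointing at data records, a program can find every page that cites a given assay, with no human search query involved.

```python
# Toy triple store: wiki pages annotated with links to data records.
# All identifiers and the "references" predicate are hypothetical.
REFS = "http://example.com/vocab/references"

triples = [
    ("wiki/KinasePage",     REFS, "data/assay-118"),
    ("wiki/OncologyReview", REFS, "data/assay-118"),
    ("wiki/SafetyNotes",    REFS, "data/assay-207"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Which wiki pages cite assay 118?
pages = [s for s, _, _ in match(triples, p=REFS, o="data/assay-118")]
print(pages)  # both KinasePage and OncologyReview turn up
```

The same pattern query scales from three triples to millions, which is precisely why machine traversal of these links outperforms keyword search once a repository grows large.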