What are Data Lakes?
Along with all the activity and marketing hype around Big
Data, there are still troubling loose ends to contend with: how do we associate
disparate but overlapping data with each other if we are simply to “pour” data
together? Using the lake paradigm, how is one to fish out the specific data
that match some criterion, along with anything else associated with them?
Some explanations point to adding some type information, but this limits how
data from different collections (related but not exactly equal types) can be
cross-linked when necessary. We can choose to link entities across types using constrained
rules or semantics. However, if we are to rely on some form of data semantics
to associate related things, how are those semantics to be established,
added to the lake, and then managed? The metaphor of the lake quickly begins
to get murky…
But what happened to semantic data, a.k.a. linked data, and to the
ability to link data from multiple sources across an organization or even the
Internet? What of all the promises of truly interlinked data, independent of
where they arise? Is the data lake the replacement paradigm? One notable shift
has been toward localizing data within an organization’s own auspices,
rather than relying on the outbound links championed by semantic web standards.
But is the lake terminology right for
this? In the sciences, there are always external resources that need to be
updated and merged with the internal sets. If linked-data identifier
(URI) semantics are not properly used, what then? What is really being offered here?
To many, the lake analogy affords a serene image of lazy
afternoons of sailing and fishing; but it is deceptive nonetheless. Are things
best discovered by using simple tags, and if so, are those tags controlled? Are unique
relations the key to identifying special objects? Is it a particular tangle of
linked things that helps fish out a prize catch? Do large assemblages of
multiple facts come out whole in a meaningful way, or as a jumble of stringy
facts? It is not a far stretch to conjure up the thought of an Edmund
Fitzgerald[1]-sized
data wreck if one does not take the time to structure the inserted data. Some
things dumped into the lake may never see the light of day again. Has data depth
now become a good thing or a bad thing? In this article, we will take a deeper
dive into the challenges facing data aggregation and structuring, and some new
ideas for better organizing growing and evolving data resources.
A concept introduced in a previous article is the
Yoneda lemma (from category theory), which formally ties all records of entities
(including keys) from any table to each other, creating one large network of composite
relations. It makes it possible to define a query algebra (e.g., SQL, SPARQL)
that works with any schema for a dataset. In the case of data lakes, this
foundation is missing, or at least has not been formally introduced, so large
uncertainty exists about what the formal basis will be for ensuring data integrity
during insertions, updates, and queries. Currently, data lakes appear to be a convenient
option for handling a large influx of datasets that arrive in varied, disjoint
structural forms. Sean Martin of Cambridge Semantics said of current efforts
[1]: “We see customers creating big data graveyards, dumping everything into
HDFS [Hadoop Distributed File System] and hoping to do something with it down
the road. But then they just lose track of what’s there”.
An alternative generalized model is the concept of what I
call a Datacomb[2],
which relies on both efficiency and logic (à la geometric algebras) for storage,
structure, and discoverability. Here any typed real-world entity (RWE), or
conjunction of RWEs, can be mapped using single or multiple keys. The latter is
usually associated with JOIN results (Patient + Primary Physician), but it
can be automatically typed as a Cartesian product (CP) of existing atomic entities:
PATIENT×PPHYSICIAN.
Such a relation instance materializes if a fact exists
about a patient having a primary physician, as in any join, but now a compound typed object
exists as well. This compound object may uniquely contain data on when the
patient first began seeing this doctor, and what the circumstances of the
first visit were. The actual visits are also compositionally typed (and linked) as
VISIT ≝
PATIENT×PPHYSICIAN×DATE, which would include the location[3],
any tests performed, and the diagnosis. Cartesian products have the
basic ability to be decomposed (projected) back into their atomic entities
(PATIENT, PPHYSICIAN, DATE), each with its original associated (row) data. If we
wish to include prescribed drug therapies, we can extend
the previous objects as follows: PATIENT×PPHYSICIAN×DATE×THERAPY_START. For every a
∊ PATIENT,
b ∊
PPHYSICIAN, c ∊ DATE, and d ∊ THERAPY_START, a 3-simplex (4
vertices) is created, where each combination of 1 to 4 entities (15 in
total) has compositional semantic meaning.
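To make this concrete, the short Python sketch below (illustrative only; the type names come from the example above, and nothing here is a reference implementation) enumerates those 15 typed sub-combinations, i.e., the faces of the 3-simplex:

```python
# A minimal sketch: enumerate every non-empty sub-combination (face) of a
# compound CP entity such as PATIENT×PPHYSICIAN×DATE×THERAPY_START.
from itertools import combinations

ATOMIC_TYPES = ("PATIENT", "PPHYSICIAN", "DATE", "THERAPY_START")

def faces(entity_types):
    """Yield every non-empty combination of atomic types (2^k - 1 of them)."""
    for size in range(1, len(entity_types) + 1):
        for combo in combinations(entity_types, size):
            yield combo

all_faces = list(faces(ATOMIC_TYPES))
# 4 vertices + 6 edges + 4 triangles + 1 tetrahedron = 15 typed sub-entities
assert len(all_faces) == 2 ** len(ATOMIC_TYPES) - 1
for face in all_faces:
    print("×".join(face))
```

Each of these composite types (PATIENT×DATE, PPHYSICIAN×THERAPY_START, and so on) can carry its own associated data, which is the point of the construction.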
Simplicial Databases
The ability to compose and decompose objects is very useful
and mathematically sound, and it enables databases to be quite flexible. In fact,
any set of k joined entities can be
(if one needs to) decomposed into k
CP entities of k-1 components each, which
in turn can be decomposed into k(k-1)/2
CP entities of k-2 components, and so on,
until we arrive at the k atomic
entities. This structure is commonly known as a simplex, the corresponding data instance constructs are known as simplicial sets, and their many uses in data storage were first described
by David Spivak [2]. One application of
them is in statistical inference, when computing and analyzing joint and marginal frequencies
or probabilities of mixed combinations of similar events or attributes. For
example, if a patient has a tumor containing the somatic mutations [EGFR amp, P53,
PTEN], a mutation simplex is defined
that may be part of a larger mutation pattern [EGFR amp, CDK4, P53, PTEN] found in
some patients, while also subsuming the smaller patterns of others: [EGFR amp,
PTEN] and [EGFR amp, P53]. The entities are the different subsets of co-occurring
mutations, and each may carry the incidence counts for that
combination across patients, or an identified molecular interaction between the
co-occurring mutations. This is a numeric example, which can be further
combined with other data.
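As a rough illustration (the patient data below are invented for the sketch), such joint and marginal counts can be accumulated by allocating a simplicial entity only for each mutation combination actually observed:

```python
# Hypothetical sketch: count the incidence of every observed co-occurring
# mutation sub-pattern and record which patients share it.
from collections import defaultdict
from itertools import combinations

patients = {                                   # invented example data
    "patient_01": {"EGFR amp", "P53", "PTEN"},
    "patient_02": {"EGFR amp", "CDK4", "P53", "PTEN"},
    "patient_03": {"EGFR amp", "PTEN"},
}

incidence = defaultdict(int)   # simplex (mutation combination) -> patient count
members = defaultdict(set)     # simplex -> patients sharing that sub-pattern

for patient_id, mutations in patients.items():
    for size in range(1, len(mutations) + 1):
        for simplex in combinations(sorted(mutations), size):
            incidence[simplex] += 1
            members[simplex].add(patient_id)

# The edge [EGFR amp, PTEN] is subsumed by all three patterns above:
print(incidence[("EGFR amp", "PTEN")], sorted(members[("EGFR amp", "PTEN")]))
```

Note that only combinations present in the data are ever allocated, a point taken up below.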
It is worth noting that the actual physical
storage implementation of a simplicial database
[2] does not have to allocate every possible mutation combination, nor every
combination that exists within the sets of patients. The logical constraints are
complete, so the model may only need to allocate those combinations with which useful data
can be associated (e.g., therapies). This can be considered a form of storage caching
and compression for faster look-ups and associations. Nonetheless, a simple analysis of real
genomic data from ~1000 cancer patients required only a few million unique
simplicial entities to be allocated and linked, which makes this highly
tractable in today’s large-scale storage systems. Moreover, in some data spaces
where events are strongly mutually associated, the combinatorics is not
unbounded, and simplicial sets often become saturated (relatively sparse) at
intermediate and lower levels.
Note that the hierarchy of entities, from large mutation
combinations down to smaller subsets, forms a “sieve”. Each patient’s pattern is
linked to the top (complete) entity and then filters down to all the subsets
contained within that pattern, providing information about which patients share a
particular sub-pattern. If these mutation distributions are not statistically
independent, this provides evidence that there is an underlying mechanism at work [see
Fichtenholtz, 2016]. The simplicial database makes it very efficient to find
all cases of shared patterns, compared to running a query filter (for each pattern) in a
relational DB or an edge traversal in a data graph. The mutation simplex is
formed directly by calculating and indexing the patterns from each patient’s list of
mutations, and it becomes cost efficient once most
patterns have been captured.
Returning to our original PATIENT×PPHYSICIAN×DATE example,
one can build a simplicial model around the PATIENT×PPHYSICIAN pair (an edge)
linked to a sequence of dates (vertices) to create an implicit series of visits
(= PATIENT×PPHYSICIAN×DATE), i.e., triangular faces. This structure includes a
PPHYSICIAN×DATE edge, which maps to all the patients that doctor has seen on the
same day. A clear advantage of this form of database is that all key
combinations are pre-computed (pre-joined), so a simple canonical n-way hash
of the key values can find the full set of data in a single lookup; this is very
well suited for fast analytics, where multiple lookups are equivalent to query
caching. Another advantage is that the CP entities have clear, automatic types
and can be handled exactly by type-dependent downstream processes, specifically
by descriptive algebras supporting CP entities (e.g., MUTATION_SIMPLEX⊗DISEASE⊗THERAPY
→
DISEASE⊗RESPONSE).
The combined simplicial set naturally lends itself to analytics for effective
treatments based on genomics and disease types.
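A minimal sketch of the single-lookup idea, assuming an in-memory key-value store and made-up record values, might look like this:

```python
# Sketch only: pre-joined CP entities keyed by a canonical, order-independent
# key so that any compound fact (or one of its faces) is a single lookup away.
prejoined = {}   # canonical key -> associated row data (stand-in for a KV store)

def canonical_key(entity_type, **components):
    """Build a canonical, hashable key from a CP type and its component keys."""
    return (entity_type,) + tuple(sorted(components.items()))

# A VISIT = PATIENT×PPHYSICIAN×DATE fact, plus one of its edges (faces):
prejoined[canonical_key("PATIENT×PPHYSICIAN×DATE",
                        patient="p42", physician="dr_jones", date="2016-03-01")] = {
    "tests": ["MRI"], "diagnosis": "glioma"}
prejoined[canonical_key("PPHYSICIAN×DATE",
                        physician="dr_jones", date="2016-03-01")] = {
    "patients_seen": ["p42", "p17"]}

# Single lookup, no run-time join:
visit = prejoined[canonical_key("PATIENT×PPHYSICIAN×DATE",
                                patient="p42", physician="dr_jones", date="2016-03-01")]
print(visit["diagnosis"])
```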
Datacomb
The basis for the ideas presented here arises from Category
Theory (CT), which ensures logical consistency within a data model schema. The
interconnected set of simplicial entities is described as a simplicial complex (partial overlaps of
different simplicial elements), a well-defined object in CT[4],
and it is at the heart of the formal definition of what we call a Datacomb. The complex possesses a
formal query algebra over any subset of simplicial entities, and it can be used to
extract any geometric (connected) subset of data, including measurable things
like frequencies. Note also that any graph data model is automatically a subset
of a datacomb, since a graph is just the 1-D skeleton (vertices and edges) of the complex.
The datacomb model can be implemented on top of a few different storage
technologies, such as multi-array DBs, RDBs, key-value NoSQL DBs, graph DBs,
and (materialized) column stores (traditional relational systems may not be practical, though,
since they require explicit types and type-specific keying). The simplicial logic
required to interface with them can be layered on top of the existing
technologies, so that a common API can be installed on different storage
back-ends. In fact, RDF could be used as a universal description
of the internal structures of any data system (not only triplestores). All in all,
the datacomb approach is a more rigorously defined solution for complex data
sets than the data lake meme offers, one with real, definable specifications
and multiple analytic and mining applications.
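To illustrate the 1-D skeleton claim, here is a hedged sketch (with invented entity labels, not a datacomb API): a simplicial complex stored as a set of frozensets, with its graph extracted directly.

```python
# Sketch: a tiny simplicial complex and its 1-D skeleton (an ordinary graph).
from itertools import combinations

simplicial_complex = {   # each simplex is a set of typed entity keys (invented)
    frozenset({"PATIENT:p42", "PPHYSICIAN:dr_jones", "DATE:2016-03-01"}),
    frozenset({"PATIENT:p17", "PPHYSICIAN:dr_jones", "DATE:2016-03-01"}),
}

def one_skeleton(simplices):
    """Return the (vertices, edges) contained in the complex, i.e., its graph."""
    vertices, edges = set(), set()
    for simplex in simplices:
        vertices |= simplex
        edges |= {frozenset(pair) for pair in combinations(simplex, 2)}
    return vertices, edges

vertices, edges = one_skeleton(simplicial_complex)
print(len(vertices), "vertices,", len(edges), "edges")   # 4 vertices, 5 edges
```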
The datacomb can be applied in several different settings. Most
naturally, it can be mapped onto any existing data-array storage systems already
in place, extending them to handle complex-typed objects more flexibly and
automatically, which is useful for precomputing data for downstream analytics. In
relational DB instances, frequently materialized joins can be captured and accessed
more formally and efficiently using a datacomb framework, making it easier and
faster to query conjoined content, as well as to recall the atomic entities
on demand. Datacombs serve as a common superset of both data-arrays and
relational data, and therefore form a powerful higher-order framework that covers
both data analytics and full sets of non-numeric data. As such, the datacomb
offers many advantages for organizing and defining datasets for machine-learning
tasks, flexibly formatting raw data into the pre-processed structures required
by many ML platforms.
In addition, when dealing with closely related entities
(e.g., lists of genes and their encoded proteins), instead of ambiguously
choosing one identifier or another (e.g., P204392) to recall the whole
set of related data records, a simplex of the related entities would provide a
much more even and efficient way to get all the matches. It could then be keyed
by any one entity (a vertex) or by the hashed sum of the full set (the k-cell). This
would go a long way toward solving the biomedical disambiguation problem. It is
the formal equivalent of earlier attempts, such as SRS[5],
to connect multiple related molecular entities.
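A minimal sketch of that keying scheme (the identifiers below, apart from P204392 from the text, are illustrative placeholders):

```python
# Sketch: a simplex of closely related identifiers, indexed so that any single
# member, or a hash of the whole set, recalls the same group of records.
related = frozenset({"P204392", "GENE:EGFR", "TRANSCRIPT:ENST0001"})  # placeholder IDs

index = {}
index[hash(related)] = related        # k-cell key: hash over the full set
for member in related:                # vertex keys: any one identifier works
    index[member] = related

# Either route retrieves the same related set of entities:
assert index["P204392"] is index[hash(related)]
print(sorted(index["P204392"]))
```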
Datacombs can also handle non-local data by serving as local
caches of all the intra- and inter-relations between data records (e.g.,
genomic data references), providing something much more substantial in function
and structure than existing data lake models, analogous to a universal data
switchboard. A cloud-based implementation should be very effective at managing
all the relations between simple and complex entities from thousands (or more)
of different sources. It would then effectively solve what the semantic web
initiative had always alluded to but never delivered: the explicit handling of
complex entity logic (indexing, typing, and filtering) for data that resides
in multiple sources, something usually thought to be in the (as yet unsupported)
purview of ontologies.
Many organizations intending to utilize their collections of
data more effectively are positioning themselves around big data. Yet most of
their data environments are a mixture of different classes of technologies,
developed and installed at different times, for different goals, and
accessed and managed by different groups. Trying to unify this heterogeneous mix
will have a broad range of costs depending on the type of technology used and the
urgency of completing it (and, of course, the thoroughness of the solution). This
can easily range from hundreds of thousands of dollars to millions; but the cost of doing it incorrectly
within a time limit may be orders of magnitude greater (over $100
million) due to the business impact of a non-optimal solution, and the added
cost, and additional time, of doing it right the second time. The looming
challenge facing many organizations means they need to choose the best approach
properly and confidently, fully considering both the maturity of the
technologies and the enhanced paradigms for reducing development and maintenance
costs. There is concern that no database product from any traditional vendor
is quite ready for the challenge. Consumers must therefore rely on their own
knowledge of their precise needs and determine what level of innovation
they are willing to invest in. A brave new world is emerging for
information technologies.
References
1 – Stein, Brian; Morrison, Alan (2014). Data Lakes and the Promise of Unsiloed Data (Report).
Technology Forecast: Rethinking Integration. PricewaterhouseCoopers.
2 – Spivak, David I. (2009). Simplicial Databases. arXiv preprint.
3 – Fichtenholtz AM, Camarda ND, Neumann EK (2016). Knowledge-Based
Bioinformatics: Predicting Significance of Unknown Variants in Glial Tumors
through Sub-Class Enrichment. Pacific Symposium on Biocomputing 2016, pp. 297-308.
[2] Regularized structures that are semantically flexible, as with honeycombs in beehives.
[3] One could argue that EVENT=DATE×LOCATION should be used rather than DATE, but often it is not needed, since the location does not change within a day.
[4] They are at the heart of new methodologies, including topological data analysis (TDA).