Recombinant Data

Monday, October 28, 2024

Incompleteness in Biology and its Implications for Bioengineering

initially published July 30, 2021

This year (2021) marks the 90th anniversary of Kurt Gödel’s seminal and math-shattering paper "On Formally Undecidable Propositions of Principia Mathematica and Related Systems I" (1931) that irrefutably proves the incompleteness of formal logic (initially for first-order logic, but extended to all higher forms). After encoding the formal limits by defining mathematical completeness, he went on to show the absolute limits of any mathematical system in a most ingenious way, creating and using what we often hear described as Gödel Numbering. His paper deeply impacted the vision for mathematics as defined by-then current leader David Hilbert, demonstrating that not all Truths that can be stated in a logic system are Decidable or Provable (there is a subtle distinction between these, but not relevant for us). This forever altered the direction mathematics would take including the newly emerging field of computer science (see Turing's Halting Problem).

Now that we have arrived to the 21st-century exploring the mechanisms behind life, Gödel's work may be about to influence the directions biology and biomedicine are heading in, including molecular genomics, viral adaptation, molecular basis of diseases, and genetic/molecular engineering. All these areas and more are affected by biological constraints and opportunities; they are also affected by the intrinsic incompleteness of biological systems and their built-in (logic) capacity to handle all situations, from aging, to inflammatory disorders, to infections, and eventually to cancers.

The argument to firmly establish this would take more room than is available here, but the essence can be described briefly as the inherent limitations of any organism along with its genomics, to be prepared and able to survive any situation and even handle genomic alterations. A particular aspect that arises, as it relates to Gödel’s original proof, is the challenge of any living system to distinguish Self (innate as well as what arises by random recombination) from Non-Self. Here undecidability is replaced by Self-verification. This is quite universal for all living things, from protection against bacteriophages, to corals competing, to pollen rejection implants, to immune responses. Vertebrates put a lot of emphasis on this with their histocompatibility genes and antibodies, as well as the energy required by this subsystem. More recently, research into immuno-oncology attempts to leverage what normally could/should take place: a subsystem to safeguard against attacks by microorganisms or genomically-altered tissue. A gödelian view would suggest that auto-immune disorders and cancer's evasion of immune responses are two sides of the same Self vs Non-Self problem.

The formal concept of completeness is described as the ability to derive all valid statements (e.g., theorems from proofs) or downstream derivations from a finite set of axioms and rules; the concept of soundness is the converse and requires all statements to be logically consistent without any contradictions. Gödel proved if a formal system is sound, its statements (reifications) can be true, but they cannot all be derived from initial assertions. This interpretation is clear for mathematics, but what does it mean in biology? What for example is incompleteness or an inconsistency in a biological system?

Recall, Gödel’s logic proof doesn't only apply to some mathematical or logical systems, but all formally defined logic system (FDLS). One may assert biology is not an FDLS, but that is most likely not true: life's logic is formal (consistent and reproducible) since each example of life has rules it has engendered, encoded, and follows via the genetic coding and the regularized causality of molecular interactions all life relies on. It certainly isn't trivial logic, and must support a logic (meta)language that can be demonstrated in various ways, such as insertion of new genetic codes by random natural processes or human design. We have cracked some of that language, but still just a tip of the iceberg.

As a metamathematical aside, living systems can even perform some of what we associate with basic math: they can count and process sets of elements (e.g. Y chromosome presence, copies of DNA, operons, number of cells in fly's ommatidium, nerve tracks), perform negation (using gene or protein regulation), do multiplication by cell division or by regulation via molecular interactions (product of concentrations) of 2 or more reactants, and importantly store symbolic states (cell states in differentiation, phosphorylated proteins, recombined DNA). A complete equivalence isn't easy to show or even necessary, but a correspondence may be provable.

As we enter this new age of bioengineering and genomic therapies, we will need to understand the ability and consequences of biosystems to determine what is (newly engineered) Self and what should be Non-Self. The Incompleteness theorem implies that for logical systems this is not always possible, and some genomic versions of a biosystem (e.g., mutations) may have downstream effects that are not be foreseeable. Clearly some diseases have been anticipated by our immune response, but it cannot be guaranteed to work (from within) 100% of the time. However, the future role of a computational physician will be able to augment and overcome these limitations much of the time. And this corresponds to Gödel's extension of logical systems by inserting new axioms and rules, introduced by agents outside of the biologic system.

Admittedly, this is a different approach to thinking about biomedicine where one typically applies all the latest biomolecular technology, elucidates reductionist biological cause-and-effects, and trains machine-learning algorithms to find therapies for diseases our society deems relevant (either by public, researchers, or investors) for a focused assault.

There's however an alternative approach, which I feel is more appropriate for this century that relies less on biohacking and chance. We will need to take a new take on defining fundamental bioprocesses and how these are encoded, not modeled as contemporary computer programs but as biological algebras based on the fundamental molecular and cellular principles that can be composed together. How things go wrong might also be abstracted from this, such that understanding diseases may not just be about collecting sets of genotypes/phenotypes but spaces around normal functions that can get perturbed in multiple directions!

In actuality, diseases may not even be separate challenges, but tied together by the underlying cross-domain logic. We already see this in efforts to understand/control cancer, immune, and infectious diseases. I specifically suspect that many/most diseases will be related to the different forms of completeness/consistency errors Gödel had already categorized, e.g., undecidable yet True (attack Self ~ false positive) and inconsistent but False (ignore Not Self ~ false negative). I've even come to wonder if the oncogene model in cancer is not the best way of looking at cancer.

So stepping back for a moment, what activities could a gödelian perspective influence?

Bioinformatics? Yes, but certainly in positive ways by expanding how we model and represent biological systems. Perhaps some of the deeper laws reside undetected in yet to be discovered genomic patterns and structures.

Computational Modeling? Very likely, since it would offer us new formalisms (e.g. algebras) that can be defined and even proved logically, and then evaluated (disproven, not proven) experimentally. Perhaps these would enable us to model different immune responses and non-responses.

Biomedicine? Very much so, I think it will offer more reusable concepts and mechanisms for defining diseases and potential therapeutic paths that go beyond ontological definitions (which mainly address data and less about models).

Bioengineering? Indeed, and this is where I think things will get really interesting. The world of CRISPR and other bioengineering technologies bear a subtle echo with the goals set out by David Hilbert a century ago. As a quick example, recall that for every k (specific) interactions a protein p or its gene g (total N) has with other cellular factors, a consistent biosystem (organism) must AVOID interactions with ~N*(N-k) other components, as well as disrupt their interactions. This is a tough requirement for newly designed biomolecules (~half a billion per each bioentity).

There are also some insights from biology into metamathematics that may come to surface. Although there are limits to how a (bio)system can compute decidability or general solutions of repairing itself or treating a disease, it also points to how non-deterministic processes like evolution that can circumvent gödelian limitations via selection. Evolution is how living things get around previous constraints, and the growing genome is basically an accumulation of more necessary axioms; it could explain why organisms with small genomics still survive quite well, while others have continued growing the genomic size to overcome constraints. Indeed, even the power of adaptive immune responses by the evolution of lymphocytes itself creates the opportunity for a novel class of tumors: lymphomas. Each new genetic axiom often requires another axiom to keep it in check! As for Gödel's numbering system and its role, evolution came up with its own: DNA/RNA.

A gödelian formalism for biology will have many implications: Technologies and knowledge will continue to advance and have great importance to mankind, but we will better realize how biology is directed by bio-logic to take advantage of faster and hopefully safer biomedical solutions that rely less on lengthy investments and reduce errors. Time will tell if that will have an impact on the research and industry.

Sunday, June 30, 2019

Interpretable AI: Reasoning about Why (...and why a solution is the right solution)

In the Sherlock Holmes novels, Conan Doyle’s hero is said to use his deductive power to infer by whom and how a crime was committed. He gathers the facts and then proceeds to deduce their logical conclusion. Ideally, given rules [A→B, B→C, …Y→Z] and fact (antecedent) A, Z can be deduced using the rules transitively. But in each of his cases there are gaps, not just in facts, but in available explanations. He therefore has to propose new explanations, since much of the crime was done without any witnesses. He applies abductive reasoning rather than deductive reasoning, to infer, or abduce, the cause and explanation for a certain set of given resulting facts.

Sherlock’s favorite phrase is “Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth”. This is about finding explanations for the result B, beginning with an open set of antecedents {A1,A2,A3…}; it is not simple deduction from A to B to C. And if all explanations are found impossible (possibly even by using deduction, but possible by other means), than the remaining single one must be the answer. But what if the case isn’t so discrete? What if your elimination reduces to a set of several solutions, not just one (e.g., different but overlapping genotypes)? Then you must find the most likely explanation using some oth3re means, like posterior (Bayesian) probabilities. This is precisely what abductive reasoning is all about: to find the best explanation or set of best explanations, not deducing the exact correct solution!

Truth Table: A→B		A
Truth Table: A→B		True	False
B	True	True	True
B	False	False	True

Let’s break this down further. Logically, deduction is sound if the implications (rules) used are all sound. For the implication A→B , modus ponens exactly states: if (A→B) & A then B. That is, given a rule and the knowing the antecedent (A) is true, then B must be true if everything is sound. However, the inverse (if (A→B) & B then A) is not necessarily true (see truth table, purple text). On the other hand, it is also not necessarily false to predict A from B, tough it isn’t exactly (always) sound, and is therefore referred to as the fallacy of the converse. Yet this reasoning when done over the complete set of possible explanations, with them being ranked by which is most probable, it can be used to infer real, possible explanations. This is at the heart of abductive reasoning, which has been cited by many [] as what scientists frequently apply. Researchers using Bayesian Inference to propose explanations or mechanisms are indeed applying abduction.

Abductive reasoning requires a high bar for tools that support it, since it is not only a matter of being able to proffer a few different explanations about a phenomenon B, but to have sufficient coverage on all possible explanations so that the best ones can be ranked (using posterior probabilities), which often requires knowing the sum of (almost) all probabilities. I recommend this being the high-bar for what we have been calling knowledge bases. One can argue that practical knowledge should include verifiable explanations or the ability to find such explanations for evidence-based discovery. The guidance for this should be openly discussed and agreed on soon due to the large number of recent knowledge graph/base related offerings, some which may not meet this requirement. Knowledge systems should serve both human queries and machine-driven interrogations and inference. Currently, there are no well-defined objectives of their use, making recommendations and selection by enterprises and institutions very ambiguous.

In addition, the AI/ML community needs to address the relevance and benefits of using such knowledge resources, and whether further alignment (e.g., APIs) are needed. Specifically, knowledge systems could be used to address interpretability of AI solutions, in order to inject both context and non-technical access to their overall benefits. So far the AI community often views knowledge as something an AI system finds but doesn't require itself to take advantage of, while those developing knowledge graphs view their semantic forms as their interpretation of AI and feeding learning systems transformed data from within the graph. Both views are unproductive, and the real benefits will emerge out of considering how both technologies can be more fundamentally integrated.

Examples of logical inference

Deduction

All oncogenes have the potential of becoming mutated and driving oncogenesis.
Gene W is an oncogene.

∴ Gene W can cause cancer if mutated.

Abduction

All oncogenes have the potential of becoming mutated and driving oncogenesis.
Gene Y is observed to be mutated in a tumor

∴ Gene Y is an oncogene.

→ NO! Counter-example: Altered Gene Y can also affect oncogenesis if it’s a Tumor Suppressor. It could even be passive and simply incidental.

→ However, if one continues to see an association of Y mutants in similar classes of tumors and it appears to be a gain of function and these mutations are rarer in other cases, then the proposition may be verified.

Induction

Genes W,X,Y,Z are RTKs.
Gene Y is observed to be mutated in 20% of lung carcinomas

∴ RTKs can drive oncogenesis in lung when mutated

Tuesday, October 24, 2017

Wrecks at the Bottom of Data Lakes

What are Data Lakes?

Along with all the activity and marketing hype around Big Data, there are still troubling loose ends to contend with: how do we associate disparate but overlapping data to each other if we’re simply to “pour” data together? Using the lake paradigm, how is one to fish out the specific data that match some form of criterion, as well as anything else associated to it? Some explanations point to adding some type information, but this limits how data from different collections (related but not exactly equal types) can be cross-linked when necessary. We can choose to link entities across types using constrained rules or semantics. However, if we are to rely on some form of data semantics to associate related things, how is this data semantics to be established, added to the lake, and then managed? The metaphor for the lake quickly begins to get murky…

But what happened to semantic data aka linked data, to the ability to link data from multiple sources across an organization or even the Internet? What of all the promises of truly interlinked data independent of where they arise? Is the data lake the replacement paradigm? One notable shift has been to the localization of data to within an organization’s auspices, rather than relying on outbound links, as championed by semantic web standards. But is the lake terminology right for this? In the sciences, there are always external resources that need to be updated and merged with the internal sets. If not properly using linked data identifier (URI) semantics, what then? What is really offered here?

To many, the lake analogy affords a serene image of lazy afternoons of sailing and fishing; but it is deceptive nonetheless. Are things best discovered by using simple tags, are these controlled? Are unique relations the key in identifying special objects? Is it a particular tangle of linked things that help fish out a prize catch? Do large assemblages of multiple facts come out whole in a meaningful way, or is it a jumble of stringy facts? It is not a far stretch to conjure up the thought of an Edmund Fitzgerald[1]-size data wreck if one does not take the time to structure the inserted data. Some things dumped into the lake may never see the light of day again. Is data depth now become a good thing or a bad thing? In this article, we will take a deeper dive into the challenges facing data aggregation and struxturing, and some new ideas of how to better organize growing and evolving data resources.

A concept that was introduced in a previous article, is the Yoneda lemma (abstract algebras), which formally ties all records of entities (including keys) from any table to each other to create one large network of composite relations. It makes it possible to define a query algebra (e.g., SQL, SPARQL) that works with any schema for a dataset. In the case of data lakes, this foundation is missing or at least has not been formally introduced, so a large uncertainty exists on what the formal basis will be to ensure data integrity for insertions, updates and queries. Currently data lakes appear to be a convenient option for handling large influx of datasets, coming in varied, disjoint structural forms. Sean Martin of Cambridge Semantics said of current efforts [1]: “We see customers creating big data graveyards, dumping everything into HDFS [Hadoop Distributed File System] and hoping to do something with it down the road. But then they just lose track of what’s there”.

An alternative generalized model is the concept of what I call a Datacomb[2], which relies on both efficiency and logic (ala geometric algebras) for storage, structure, and discoverability. Here any typed real-world entity (RWE) or conjunction of RWEs, can be mapped using single or multiple keys. The latter is usually associated with JOIN results (Patient + Primary Physician), but which can be automatically typed as a Cartesian Product (CP) using existing atomic entities:

PATIENT×PPHYSICIAN.

Such a relation instance materializes if a fact exists about a patient having a primary physician, as in any join, but now a compound typed-object exists as well. This compound object may uniquely contain data on when the patient first began going to this doctor, and what was the circumstance of the first visit. The actual visits are also compositionally typed (and linked) as VISIT ≝ PATIENT×PPHYSICIAN×DATE, which would include the location[3], any tests performed, and what was the diagnosis. Cartesian products have the basic ability to be decomposed (projected) into the set of atomic entities ((PATIENT, PPHYSICIAN), DATE), with their original associated (row) data. If we wish to include prescribed drug therapies, we can organize this by extending the previous objects thusly: PATIENT×PPHYSICIAN×DATE×THERAPY_START. For every a ∊ PATIENT, b ∊ PPHYSICIAN, c ∊ DATE, and d ∊ THERAPY_START, a 3-simplex (4 vertices) is created, where each combination from 1-4 conjunctions (total of 15) has compositional semantic meaning:

Simplicial Databases

The ability to compose and decompose objects is very useful and mathematically sound, and enables databases to be quite flexible. In fact, any set of k-joined entities can be (if one needs to) decomposed generally into k subsets of k-1 CP entities, which then can be decomposed into k(k-1)/2 subsets of k-2 CP entities, etc, until we arrive at the k atomic entities. This structure is commonly known as a Simplex and the data instance constructs are known as Simplicial Sets, and was first described by David Spivak as having many uses in data storage [2]. One application of them is in statistical inference when computing/analyzing joint and marginal frequencies or probabilities of mixed combinations of similar events or attributes. For example, if a patient has a tumor containing somatic mutations [EGFR amp, P53, PTEN], a mutation simplex is defined that may be part of a larger mutation pattern [EGFR amp, CDK4, P53, PTEN] that some patients have, as well as subsuming smaller patterns of others: [EGFR amp, PTEN] and [EGFR amp, P53]. The entities are different subsets of mutations that are co-occurring, and may each contain the incidence counts for each combination found in patients, or an identified molecular interaction between the co-occurring mutations. This is a numeric example, which can be further combined with other data.

It is worthwhile considering that the actual physical storage implementation of a Simplicial database [1] does not have to allocate every mutation combination possible, nor every combination that exists within sets of patients. The logical constraints are complete, so the model may need to only allocate those for which useful data can be associated (e.g., therapies). This can be considered a form of storage caching and compression, for faster look-ups and associations. Nonetheless, a simple analysis of real genomic data from ~1000 cancer patients required only a few million unique simplicial entities to be allocated and linked, which makes this highly tractable in today’s large-scale storage systems. Moreover, in some data spaces where events are strongly mutually associated, the combinatorics is not unbounded, and often simplicial sets become saturated (relatively sparse) at intermediate and lower levels.

Note the hierarchy of entities from large mutation combinations to smaller subsets form a “sieve”. Each patient’s pattern is linked to the top (complete) entity, and then filters down to all the subsets contained within that pattern, providing information of which patients share a particular sub-pattern. If these mutation distributions are not statistically independent, it provides evidence there is an underlying mechanism at work [see Fichtenholtz, 2016]. The simplicial database makes it very efficient to find all cases of shared patterns, compared to a query filter (for each) in a relational DB or an edge traversal in a data graph. The mutation simplex is formed directly from calculating and indexing the patterns from a list of mutations for each patient’s analysis, and is cost efficient after most patterns are captured.

Returning to our original PATIENT×PHYSICIAN×DATE example, one can build a simplicial model around the PATIENT×PHYSICIAN pair (edge) linked to a sequence of dates (vertices) to create an implicit series visits (=PATIENT×PHYSICIAN×DATE), i.e., triangular faces. This structure includes a PHYSICIAN×DATE edge, which maps to all the patients that doctor has seen on the same day. A clear advantage of this form of database, is that all key combinations are pre-computed (aka pre-joined), so a simple canonical n-way hash of the values can find the full set of data in a single lookup; this is very well-suited for fast analytics, where multiple lookups are equivalent to query caching. Another advantage is that the CP entities have clear automatic types and can be handled exactly by type-dependent downstream processes, specifically by descriptive algebras supporting CP entities (e.g., MUTATION_SIMPLEX⊗DISEASE⊗THERAPY à DISEASE⊗RESPONSE). The combined simplicial set naturally lends itself to analytics for effective treatments based on genomics and disease types.

Datacomb

The basis for the ideas presented here arise from Category Theory (CT), which ensures logical consistency within data model schema. The interconnected set of simplicial entities is described as a simplicial complex (partial overlaps of different simplicial elements) and is a well-defined object in CT[4], and is at the heart of the formal definition of what we call a Datacomb. The complex possesses a formal query algebra for any subset of simplicial entities, and can be used to extract any geometric (connected) subset of data, including measurable things like frequencies. Note also that any graph data-model is automatically a subset of a datacomb since it is just the 1-D skeleton (vertices and edges) of the complex. The datacomb model can be implemented on top of a few different storage technologies, such as multi-array DBs, RDBs, key-valued NOSQL DBs, graph DBs, and (materialized) column-stores (relational systems may not be practical since they require explicit types and type-specific keying). The simplicial logic that is required to interface with them can be layered on top of the existing technologies, so that a common API can be installed on different storage technologies. In fact, RDF could actually be used as a universal description for internal structures in any data system (not only triplestores). All in all, the datacomb approach is a more rigorously defined solution for complex data sets than offered by the data lake meme, one with real definable specifications and multiple analytic and mining applications.

The datacomb can be applied to several different settings: most naturally, it can be mapped to any existing data-array storage systems already in place, with the extension to more flexibly and automatically handle complex-typed objects, useful for precomputing data for use in downstream analytics. In relational DB instances, frequently materialized joins can be more formally and efficiently captured and accessed using a datacomb framework, making it easier and faster to query on conjoined content, as well as recalling the atomic entities on demand. Datacombs serve as the common superset for both data-arrays and relational data, and therefore form a powerful higher-order framework that covers both data analytics and full sets of non-numeric data. Inasmuch, the datacomb offers a lot of advantages to organize and define datasets for any machine-learning tasks, by flexibly formatting raw data into pre-processed structures required by many ML platforms.

In addition, when dealing with closely related entities (e.g., lists of genes and their coded proteins), instead of ambiguously choosing one or the other identifier (e.g., P204392) for recalling the whole set of related data records, a simplex of the related entities would provide a much more even and efficient way to get all the matches. It would then be keyed by any one entity (vertices), or the hashed-sum of the full set (k-cell). This would go a long way to solving the biomedical disambiguation problem. This is the formal equivalent of earlier attempts like SRS[5] to connect multiple related molecular entities.

Datacombs can also handle non-local data by serving as local caches of all the intra- and inter-relations between data records (e.g., genomic data references), providing something much more substantial in function and structure than existing data lake models, analogous to a universal data switchboard. A cloud-based implementation should be very effective by managing all the relations between simple and complex entities from thousands (or more) of different sources. It would then effectively solve what the semantic web initiative had always alluded to do but never did: explicitly handling of complex entity logic (indexing, typing, and filtering) from data that resides in multiple sources, which are usually thought to be (yet unsupported) in the purview of ontologies.

Many organizations intending to utilize their collections of data more effectively are positioning themselves around big data. Yet most of the data environments are a mixture of different classes of technologies, developed/installed at different times, for different goals, and accessed/managed by different groups. Trying to unify this heterogeneous mix will have a broad range of costs depending on the type of technology used and the urgency for completing it (and of course thoroughness of the solution). This easily can range from $100,000’s to $millions; but the cost of doing this incorrectly within a time limit may be even orders of magnitude greater (over $100 millions) due to the business impact of a non-optimal solution, and the new added cost—and additional time—of doing it right the second time. The looming challenge facing many organizations means they need to properly and confidently choose the best approach, fully considering both the maturity of the technologies, and enhanced paradigms for reducing development and maintenance costs. There is concern that no database product from any traditional company is quite ready for the challenge. The consumer must therefore rely on their own knowledge of their precise needs and determine what level of innovation in which they will be willing to invest. A brave new world is emerging for information technologies.

References

1 –Stein, Brian; Morrison, Alan (2014). in Data lakes and the promise of unsiloed data (pdf) (Report). Technology Forecast: Rethinking integration. PricewaterhouseCooper.

2 – David I. Spivak. Simplicial Databases, https://arxiv.org/abs/0904.2012, 2009

3 - Fichtenholtz, AM, Camarda, ND, Neumann EK. Knowledge-Based Bioinformatics Predicting Significance of Unknown Variants In Glial Tumors Through Sub-Class Enrichment. pp 297-308, Pacific Symposium on Biocomputing 2016.

[1] See https://en.wikipedia.org/wiki/SS_Edmund_Fitzgerald

[2] Regularized structures that are semantically flexible, as with honeycombs in beehives

[3] One could argue that EVENT=DATE×LOCATION should be used rather than DATE, but often it is not needed since location does not change within a day.

[4] They are at the heart of new methodologies including topological data analysis (TDA)

[5] https://www.ncbi.nlm.nih.gov/pubmed/12176845

Recombinant Data

Monday, October 28, 2024

Incompleteness in Biology and its Implications for Bioengineering

Incompleteness in Biology and its Implications for Bioengineering

Sunday, June 30, 2019

Interpretable AI: Reasoning about Why (...and why a solution is the right solution)

Examples of logical inference

Tuesday, October 24, 2017

Wrecks at the Bottom of Data Lakes

What are Data Lakes?

Simplicial Databases

Datacomb

Blog Archive

Interesting Blogs