Biology's complexity problem



In highly complex systems such as computers or living things,

* complexity exists at many scales, and it is nontrivial to figure out appropriate coarse-grained or abstracted descriptions of the system's operation
* high levels of variability may exist on multiple scales, but certain features are conserved. Knowing the difference between relevant and superfluous variability for a given question allows dimensionality reduction and is essential for reducing the difficulty of understanding these systems
* "understanding" can be defined in many ways, and the level of abstraction allowed will depend on the specific questions or problems at hand. Knowing the most effective "levers" to manipulate the system at multiple scales could be a general goal of efforts to solve problems in a given system.

Biology's complexity problem

Biological research is still in its infancy in terms of performing precisely controlled experiments: as recently as around 40 years ago, we did not even have the tools to perturb the informational language of cells - recombinant DNA and genetic engineering. Now that these exist and are rapidly improving, biology has entered an exciting age of discovery and advancement. However, we still largely lack the tools to precisely manipulate small numbers of cells in complex ways, or to make highly precise measurements of many parameters simultaneously. In the first case, desired or unexpected outcomes may have been missed because the "right" perturbation may not be simple. In the second case, an interesting outcome could have been missed because the "right" parameters (possibly thousands) were not measured. In my opinion, biology needs to be able to induce highly complex perturbations and measure simple outcomes, or induce simple perturbations and measure highly complex outcomes. Right now, we are mostly inducing simple perturbations and measuring simple outcomes.

Aliens discover a Dell

I think the following thought experiment demonstrates well the difficulties of studying biological complexity. There is no reason to suspect our brains evolved to introspect into their own inner workings as we have been attempting. To model a similar scenario, imagine that a race of aliens visits Earth and discovers our personal computers. Though they're intelligent, they have very different physiological and sensory parameters from ours, and their equivalent of a computer is nothing like ours. Perhaps they interact with it not through touch and sight, but through electromagnetic radiation they emit, or precise mechanical vibrations. They don't necessarily use written language and certainly don't understand ours. Anyway, they decide to perform experiments on some earthly computers they found to try to understand what they do and how they work. How would this turn out?

Because they have a very different way of interacting with their environment, their intuition will be mostly wrong. They will probably try first to interact with our computer as they do with theirs. This could lead to breaking a bunch of test objects, as their mechanical force or radiation or whatever is not compatible with the construction of our computers. Perhaps eventually they find the right scale of force to apply without breaking stuff. Assuming they have a properly-connected keyboard and mouse, maybe they try pressing many combinations of buttons at different times. Eventually they are lucky enough to figure out that pushing one particular button - the power button - results in a dramatic response: the display starts emitting light in complicated, time-dependent patterns! This would be an alien Nobel-worthy achievement. They now can seek to understand how the inputs correspond with the light patterns on the display, and how this is all coordinated under the hood.

However, the space of possible inputs is gigantic - will they try every combination of the many keyboard buttons across many time points before or after the pressing of the power button? That will take a tremendous amount of time. We humans know that almost all of their attempts will do nothing meaningful. However, suppose they eventually somehow log in to the desktop. What a breakthrough! But they will find a much higher level of complexity in the on-screen patterns, and moreover that the patterns seem to change stochastically. What if the computer needs to update? What if Chrome opens on startup? It would be very easy to erroneously attribute these 'random' changes to the detailed sequence of perturbations (i.e. button presses, etc.) that the experimenters are using, because they do not know the underlying principles.
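To get a feel for how quickly this input space grows, here is a rough back-of-the-envelope sketch. The key count (100), the rate of one trial per second, and the function names are illustrative assumptions, not figures from any real experiment.

```python
# Rough count of distinct key-press sequences the aliens would face.
# Assumptions (illustrative only): 100 keys, one key pressed per step,
# order matters, and keys may repeat.

SECONDS_PER_YEAR = 365 * 24 * 3600


def count_sequences(num_keys: int, length: int) -> int:
    """Number of ordered key sequences of exactly `length` presses."""
    return num_keys ** length


def years_to_try_all(num_keys: int, length: int, trials_per_second: float = 1.0) -> float:
    """Years needed to try every sequence at a fixed trial rate."""
    return count_sequences(num_keys, length) / trials_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    for length in (1, 4, 8):
        n = count_sequences(100, length)
        print(f"{length} presses: {n:.1e} sequences, "
              f"{years_to_try_all(100, length):.1e} years at one trial per second")
```

Under these assumptions, sequences of just eight key presses already number 10^16, which would take hundreds of millions of years to exhaust at one trial per second. Timing, mouse input, and key combinations would only make the space larger.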

If the experimenters understand electricity, maybe they can grasp from their dissection efforts that the innards of the computer conduct electric current in highly complex and precise ways. They could then take a more bottom-up approach by trying to figure out which channels and components are important for the apparent function of the computer. This too is an enormous problem space - there are literally billions of transistors in a modern CPU alone. Of course, what readout they use of 'function' is very important: if all they have is whether the display lights up when the power button is pressed, they will not gain much information about how the component in question actually contributes to the computer's function.

Let's say the experimenters had a few different types of computers - Windows and Mac, laptops and desktops, maybe even smartphones. How will they know the commonalities? From zero knowledge, different types of computers look very distinct. It should become clear that they all have a display, and some keyboard interface to input information: these are conserved features which can hint at their importance.

However, let's also say they have two different models of Windows PCs - a Dell laptop and a Lenovo laptop, for example. While they look mostly similar, the layout and dimensions of the keyboards are noticeably different, there are some indicator lights and buttons that are different between them, and their on-screen light patterns differ somewhat. How will the experimenters determine if these differences are important or not? This illustrates the challenge of identifying significant variability. The significance of a given type of variability always depends on one's purpose - for certain questions, it matters a lot whether you have a Mac or PC, but for other more "conserved" features it doesn't. If you want to browse the web, the hardware of the computer doesn't really matter; if you want to perform complex climate modeling, it matters a great deal.


The above few examples illustrate some basic absurdities and problems encountered when studying a complex multiscale system even in relatively productive lines of inquiry - knowing that the computer works by orchestrating the flow of electric current, and figuring out how to power it on and use the keyboard for inputs are correct tracks. Similarly, we have made great progress with identifying conserved low-level elements of biology in the form of the genetic code, the central dogma, and molecular evolution.

Most people with a decent grasp of coding and basic hardware would probably say they "know how a computer works", yet they do not know or understand the complete circuit diagram of any part of the machine. But coding and basic hardware knowledge are the levers which most effectively manipulate computers at most scales in a general way. A complete reductionist understanding of the atomistic parts isn't necessary because we have added layers of abstraction on top of it that allow us to effectively manipulate the computer's function without directly dealing with this complexity. The closest one can get to the circuits on most computers is assembly language, which still doesn't deal directly with transistors and logic gates. In a system as complex as a modern computer, it would simply require too much information to deal with these details. Similarly, changes to these low-level details tend not to affect the higher-level behavior in very noticeable ways because the abstraction layers are designed to smooth over and adapt to many different low-level configurations. Changes at all levels seem to become more important when the system is pushed to a limit, though: when running extremely intensive computations, the CPU architecture, assembly language, cache sizes and other low-level features start to matter a lot. Perhaps similarly, the function of many genes or other biological components becomes most apparent in stressed or diseased organisms.

Layers of abstraction also allow us to disregard certain levels of variability. While the circuit diagrams of two different brands of motherboards or CPUs may differ wildly, only the bottom-most levels of the system interact with them, and they present a consistent coarse-grained picture to the upper levels so that the overall operation of the system is mostly unchanged. Abstraction and coarse-graining allow underspecification of low-level details, like how macroscopic thermodynamic observables describe a many-component system without specifying the position and momentum of each component. I think a major challenge of contemporary biology is determining which variability is important for a given question and which can be coarse-grained out. With increasing ability to generate huge, high-dimensional datasets, we always observe variability at many levels. Without further insights into which quantities or characteristics are conserved or important for given questions, we are stuck in a data-rich, understanding-poor regime. We are also prone to erroneously attribute causal power to coincidental correlations. Think of the aliens recording the shape and size of every key on a Mac versus a PC and positing that these are causally relevant to the differences between the two.
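The thermodynamic analogy can be made concrete with a toy sketch. Everything here is an illustrative assumption: unit-mass particles, one-dimensional velocities drawn from a Gaussian, and hypothetical function names. It shows two microstates that differ in every detail yet share the same coarse-grained observable.

```python
# Two different "microstates" that share the same coarse-grained observable.
# Toy illustration only: particles of unit mass with 1-D velocities.

import random


def mean_kinetic_energy(velocities: list, mass: float = 1.0) -> float:
    """Coarse-grained observable: average kinetic energy per particle."""
    return sum(0.5 * mass * v * v for v in velocities) / len(velocities)


def shuffled_and_flipped(velocities: list, seed: int = 0) -> list:
    """A different microstate: reassign velocities among particles, flip signs at random."""
    rng = random.Random(seed)
    new = velocities[:]
    rng.shuffle(new)
    return [v if rng.random() < 0.5 else -v for v in new]


if __name__ == "__main__":
    rng = random.Random(42)
    state_a = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    state_b = shuffled_and_flipped(state_a)

    print("microstates identical:", state_a == state_b)
    print("mean kinetic energy A:", mean_kinetic_energy(state_a))
    print("mean kinetic energy B:", mean_kinetic_energy(state_b))
```

The two energies agree up to floating-point rounding even though the particle-by-particle description has changed. Which particle carries which velocity is superfluous variability for this particular observable, though it could matter for a different question.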

Fixing most problems in a computer also doesn't require detailed knowledge of the low-level details. The command line and GUI present the levers for manipulating things at many levels. Hardware problems are usually handled by modularity - instead of tracking down the exact circuit that is causing a GPU failure, the entire unit can be removed and replaced without understanding any of the details. Medicine does a similar thing with organ transplants - we don't know how to grow a new heart yet or exactly understand how it works at every level, but we have identified empirical "boundaries" that allow it and other organs to be treated (imperfectly) as modules. Similarly in software, portions of the RAM content of a modern computer can be examined in relative isolation as the operating system logically allocates them to processes.
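As a minimal sketch of what replacing a whole module looks like in software terms, consider the toy example below. All class and method names are hypothetical and chosen only for illustration; the point is that the computer depends on an interface, not on the internals of the unit behind it.

```python
# Toy sketch of modular replacement: the computer depends only on an interface,
# so a failed unit can be swapped without knowing its internals.
# All class and method names here are hypothetical illustrations.

from typing import Protocol


class GraphicsUnit(Protocol):
    def render(self, frame: str) -> str: ...


class BrokenGpu:
    def render(self, frame: str) -> str:
        # The exact faulty circuit is unknown and never needs to be found.
        raise RuntimeError("GPU failure somewhere in the circuitry")


class ReplacementGpu:
    def render(self, frame: str) -> str:
        # Internals may differ completely from the old unit.
        return f"[B] {frame}"


class Computer:
    def __init__(self, gpu: GraphicsUnit) -> None:
        self.gpu = gpu

    def show(self, frame: str) -> str:
        return self.gpu.render(frame)

    def replace_gpu(self, new_gpu: GraphicsUnit) -> None:
        # Whole-module swap: no diagnosis of the faulty circuit is needed.
        self.gpu = new_gpu


if __name__ == "__main__":
    pc = Computer(BrokenGpu())
    try:
        pc.show("desktop")
    except RuntimeError as err:
        print("failure:", err)
    pc.replace_gpu(ReplacementGpu())
    print(pc.show("desktop"))
```

The swap works only because the boundary of the module is known in advance. In biology, finding where such boundaries lie is itself the hard empirical problem.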

I wonder if understanding "modules" at many levels and the levers that manipulate them will be a fruitful approach for human diseases. Traditional pharmaceutical approaches tend to target low-level elements such as specific proteins that are thought to be compromised in a disease. These approaches rest on the assumption that molecular targets are an appropriate module and lever for modulating the system's behavior. They work best when this assumption is most true - digestive enzyme deficiencies and cystic fibrosis come to mind. But they tend to fail or have mixed results with conditions that are multiscale or involve high-order interactions, like mental health conditions. Imagine trying to fix a problem in Microsoft Word by inactivating a certain capacitor on the motherboard that interacts with Word. It might sort of fix the problem (or at least its symptoms), but many other programs depend on that component, and Word is affected by many subsystems of the computer beyond that component. High-level interventions are showing promise for certain extremely complex conditions, especially neurodegenerative and mental health conditions for which typical reductionist approaches fall short.

The major challenge is that unlike computers, biology was not designed by humans and had no reason to designate scales, modules, and abstraction levels according to any scheme that is intuitive or sensical to our brains. It verges on an epistemological question whether we can ever fully grasp these partitions, if they exist. It seems to me that using mathematical principles to remove our flawed intuition and biases is important (however, mathematics may be a construct reflecting the structure of the human mind as well; perhaps a future post on this).

Conclusion: where from here?

At the beginning I mentioned the scale of perturbations and measurements we're able to make. It would seem one way the aliens could proceed is by high-dimensional perturbations followed by simple readouts, such as whether the computer turns on. Another way is simple perturbations followed by high-dimensional readouts, such as contents of the display, voltage recordings on various components, etc. Neither of these options is great. They both involve 'brute-force' explorations of enormous spaces of possible perturbations or measurements. Because we designed computers to make sense, we know exactly which perturbations and measurements one should make. But with our own biology, we have only the beginnings of ideas of what scales and levers are most relevant to perturb and measure. I tend to prefer the one perturbation-many measurements regime, which is what I work on in my PhD and what a lot of modern biology focuses on, but I'm not sure it's objectively superior. Both spaces, of perturbations and measurements, are basically infinite a priori, but the space of practical measurements is limited/defined by the current experimental methods. This makes it seem more tractable to try to understand effects of certain simple perturbations at many scales as measurement methods improve. Plus it seems easier to associate cause with a simple perturbation than with many, where further dissection would be required to identify necessary and accessory elements.

Both perturbation and measurement techniques encounter difficulties applying across scales, for example to in vitro cultured cells then to whole organisms. Measuring low-level parameters such as individual genes in individual cells is possible in vitro and in small organismal systems such as parts of mouse organs or embryos; but it becomes virtually intractable at the scale of whole mouse (let alone human) organs, since these datasets scale with the volume of the system. The neuroscience subfield of connectomics encounters these scaling problems dramatically. This field attempts to produce a connection map of all the synapses of an organism's nervous system. So far, only the ~6000 connections in the millimeter-scale nematode Caenorhabditis elegans have been fully mapped. Heroic efforts are currently underway to map the millions of connections in the fruit fly, producing an enormous amount of data. Scaling up to mouse and eventually human seems impractical at the current rate of experiments and the current capacity of data storage and analysis: the human brain has on the order of $10^{15}$ synapses. Furthermore, this "circuit diagram" approach would seem to encounter a big problem of inter-individual variability. C. elegans has a stereotyped nervous system that forms about the same connections each time - each of its few hundred neurons has its own name and neighbors. More complex organisms will not provide such a luxury. Once the incredible effort to map the fruit fly or rodent connectome is complete, the question will remain: what is the significance of each connection? How stereotyped are they? The C. elegans connectome paved the way for much smarter questioning and is an invaluable reference, but it by no means "solved" the animal's neuroscience.
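The size of that jump can be estimated with some quick arithmetic. The sketch below uses only the rough connection counts quoted above; the storage cost of 16 bytes per connection is a made-up assumption (two 64-bit neuron IDs), not a real connectomics data format, and it ignores the raw imaging data entirely.

```python
# Back-of-the-envelope scaling from the C. elegans connectome to a human one.
# Connection counts are the rough figures quoted in the text; BYTES_PER_CONNECTION
# is a hypothetical assumption (two 64-bit neuron IDs), not a real format.

CONNECTIONS = {
    "C. elegans": 6_000,   # ~6000 connections, as quoted in the text
    "human": 10**15,       # order-of-magnitude synapse count, as quoted in the text
}
BYTES_PER_CONNECTION = 16  # hypothetical: (pre_id, post_id) as two 8-byte integers


def edge_list_bytes(num_connections: int,
                    bytes_per_connection: int = BYTES_PER_CONNECTION) -> int:
    """Bytes needed to store a bare list of connections, nothing else."""
    return num_connections * bytes_per_connection


def scale_factor(small: int, large: int) -> float:
    """How many times more connections the larger system has."""
    return large / small


if __name__ == "__main__":
    for name, n in CONNECTIONS.items():
        print(f"{name}: {edge_list_bytes(n):.2e} bytes for the bare edge list")
    ratio = scale_factor(CONNECTIONS["C. elegans"], CONNECTIONS["human"])
    print(f"human / C. elegans: {ratio:.1e}x more connections")
```

Under these assumptions the worm's connection list fits in about 96 kilobytes, while the human one needs about 16 petabytes before any imaging data, annotations, or synaptic properties are stored. The gap is roughly eleven orders of magnitude.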

Similarly, knowing the incredibly complex circuit diagram of my Dell laptop will not mean the aliens have finished their task and understand human-made computers. Moreover, this knowledge will hardly transfer to a different brand and architecture of computer, even though we know that these share many abstract similarities. Some knowledge of the conserved features that provide for abstraction and coarse-graining is necessary.

Overall, we probably need to take measurements at multiple scales, and consider how layers of abstraction might be built into biology for robustness and efficiency. What "interfaces" do low-level components provide for larger-scale manipulation? What assumptions do higher-level modules make about lower-level ones? I think we have enough knowledge to make some educated first guesses about some of this. I'm not sure if we have the measurement tools yet to directly query them. That is perhaps why this type of complex systems thinking applied to biology has mostly been limited to theoretical treatments and not solving real problems.

Further Reading

I am currently on the hunt for more to read on this topic, as I am new to it. I have found some very interesting stuff from the folks at the Santa Fe Institute, specifically at the Collective Computation Group.