Thursday, April 28, 2011

In defense of 'Omics

High-throughput studies tend to have a bad reputation. They are often derided as little more than fishing expeditions. Few have summarized these feelings as sharply as Sydney Brenner:
"So we now have a culture which is based on everything must be high-throughput.I like to call it low-input, high-throughput, no-output biology"
Having dealt with this type of data for so long, I am often in the strange position of having to defend the approaches. Since I was in real need of some procrastination, I decided to try to write a few of these thoughts down.

Error rates
One of the biggest complaints directed at large-scale methods is that they have very high error rates. Usually these complaints come from scientists interested in studying system X or protein Y, who dig into these datasets only to find that their protein of interest is missing. Are the error rates high? While this might be true for some methods, it is important to note that the error rates are almost always quantified, and that those developing the methods keep pushing the rates down.
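To make the point about quantification concrete, here is a minimal sketch of how an error rate for a high-throughput interaction screen might be estimated by benchmarking against curated positive and negative reference pairs. The protein names and reference sets are entirely made up for illustration.

```python
# Sketch (hypothetical data): quantifying the error rate of a
# high-throughput interaction screen against a gold-standard reference.

def precision_recall(predicted, gold_positive, gold_negative):
    """Estimate precision and recall of a predicted interaction set
    using curated positive and negative reference pairs."""
    tp = len(predicted & gold_positive)   # correctly recovered pairs
    fp = len(predicted & gold_negative)   # pairs known NOT to interact
    fn = len(gold_positive - predicted)   # true pairs the screen missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: interactions represented as unordered protein pairs.
pair = lambda a, b: frozenset((a, b))
screen = {pair("A", "B"), pair("A", "C"), pair("D", "E")}
gold_pos = {pair("A", "B"), pair("D", "E"), pair("F", "G")}
gold_neg = {pair("A", "C")}

p, r = precision_recall(screen, gold_pos, gold_neg)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```

The estimated precision is exactly the "quantified error rate" discussed above: a downstream user can decide how much to trust the dataset, or threshold it more stringently, instead of discarding it wholesale.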

When thinking about 'small-scale' studies I could equally ask: why should I trust a single western blot image? How many westerns were put in the garbage bin before you got that really nice one featured in the paper? In fact, some methods for reducing error only become feasible when operating at high throughput. For example, when conducting pull-down experiments to determine protein-protein interactions, nonspecific binding becomes much easier to call once many experiments can be compared. This has led to the development of analysis tools that cannot be applied to a single pull-down experiment.
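The pull-down argument can be sketched in a few lines: a prey protein recovered across most unrelated pull-downs is probably a "sticky" contaminant rather than a genuine partner, a judgment that is impossible with one experiment but trivial with dozens. The proteins, threshold, and function name below are illustrative assumptions, not any specific published tool.

```python
# Sketch (hypothetical data): flag likely nonspecific binders by their
# frequency across many unrelated pull-down experiments — a filter that
# only works when results are pooled at scale.
from collections import Counter

def flag_nonspecific(pulldowns, max_fraction=0.5):
    """Return prey proteins observed in more than max_fraction of all
    pull-down experiments (likely sticky/contaminant proteins)."""
    counts = Counter(prey for preys in pulldowns for prey in set(preys))
    n = len(pulldowns)
    return {prey for prey, c in counts.items() if c / n > max_fraction}

# Four pull-downs with different baits; "HSP70" co-purifies in three,
# so it is flagged, while the bait-specific preys P1-P4 are kept.
pulldowns = [
    {"HSP70", "P1"},
    {"HSP70", "P2"},
    {"HSP70", "P3"},
    {"P4"},
]
print(flag_nonspecific(pulldowns))  # {'HSP70'}
```

With a single pull-down there is no frequency to compute; the filter's power comes entirely from the breadth of the high-throughput dataset.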

So, by quantifying the error rates and driving them down via experimental or analytical improvements, 'omics research is, in fact, at the forefront of data quality. At the very least, you know what the error rate is and can use that information accordingly. Once the methods are improved to the point that the errors are negligible or manageable, they quietly stop being considered "omics". The best example of this, I think, is genome sequencing. Even with the current issues with next-gen sequencing, few put 'traditional' genome sequencing in the same bag as the other 'omics tools, although it too has quantifiable errors.

Related to error quantification is standardization. To put it simply, large-scale data is typically deposited in databases and is available for re-use. What is the point of really careful experiments if they only become re-usable, in any significant way, once a (potentially sloppy) curator digs the information out of papers? This availability fuels research by others who are not set up to perform the measurements themselves, and it is one of the reasons why bioinformatics thrives. The limiting factor becomes the ideas, not the experimental observations and measurements. Anyone can sit down, think of a problem, and, with some luck, find that the required measurements (or a proxy for them) have already been made by others for some unrelated purpose. This is why publications of large-scale studies are so highly cited: they are re-used over and over again.

Engineering mindset and costs
One other very common complaint about these methods is cost. It is common to feel that 'omics research is 'trendy', expensive, and consumes too much of the science budget. While the part about budget allocation might be true, the concern about cost is misplaced. Large-scale methods are developed by people with an engineering mindset. The central problems in this type of research are how to make the methods work effectively, which includes making them cheaper, smaller, faster, etc. 'Omics research drives costs down.

Cataloging diversity
Besides these technical points, the highest barrier when discussing these methods with others is a conceptual one. Is there such a thing as 'hypothesis-free' research? To address this point, let me go off on a small tangent. I am currently reading a neuroscience book - Beyond Boundaries - by Miguel Nicolelis, a researcher at Duke University. I will leave a proper review for a later post, but at some point Nicolelis discusses the work of Santiago Ramon y Cajal. Ramon y Cajal is usually referred to as the father of the neuron doctrine, which postulates that the nervous system is made up of fundamental discrete units (neurons). His drawings of neuronal circuits from different species are famous and easily recognizable. The amazing level of detail and effort he put into these drawings underscores his devotion to cataloging diversity. These observations inspired a revolution in neuroscience, much the same way Darwin's catalogs of diversity transformed biology. Should we not build catalogs of protein interactions, gene expression, post-translational modifications, and so on? I would argue that we must. 'Omics research drives error rates and prices down and creates catalogs of easily accessible, re-usable observations that fuel research. I actually think that it frees researchers: while a few specialize in method development, others are free to dream up biological problems to solve, with the data-gathering effort shortened to a digital query.

So why the negative connotations? Part of it is simple backlash against the hype. As we know, most technologies tend to follow a hype cycle in which early exaggerated excitement is followed by disappointment and backlash when they fail to deliver. A second important factor is simply a lack of understanding of how to make use of the available data. This model, in which data generation is separated from problem solving and analysis, only makes sense if researchers can query the repositories and integrate the data into their own research. It is sad to note that this capacity is far from universal. While new generations are likely to bring a different mindset with them, those developing the large-scale methods should also bear the responsibility of improving the re-usability of the data.