Wednesday, March 28, 2012

Individual genomics of yeast

Nature Genetics used to be one of my favorite science journals. It consistently had papers that I found exciting. That changed about 5 years ago when they made a very clear editorial shift toward genome-wide association studies (GWAS). Don't get me wrong, I think GWAS are important and useful, but I don't find it very exciting to have lists of regions of DNA that might be associated with a phenotype. I want to understand how variation at the level of DNA gets propagated through structures and interaction networks to cause these differences in phenotype. I mostly stayed out of GWAS since I was focusing on the evolution of post-translational networks using proteomics data, but I always felt that this line of research was not making full use of what we already know about how a cell works.

In this context, I want to tell you about a paper from Ben Lehner's lab that finally made me excited about individual variation, and why I think it is such a great study. I was playing around with a similar idea when the paper came out, so I will start with the (very) preliminary work I did and then continue with their paper. I hope it can serve as a small validation of their approach.

As I just mentioned, I think we can make use of what we know about cell biology to interpret the consequences of genetic variation. Instead of using association studies to map DNA regions that might be linked to a phenotype, we can take a full genome and try to guess which changes could be deleterious and what their consequences would be. It is clear that full genome sequences for individuals are going to be the norm, so how do we start to interpret the genetic variations that we see? For human genetic variation, this is a highly complex and challenging task.

Understanding the consequences of human genetic variation from DNA to phenotype requires knowledge of how variation affects protein stability, expression and kinetics; how this in turn changes interaction networks; how these changes are reflected in the function of each tissue; and ultimately how they lead to a fitness difference, disease phenotype or response to drugs. Eventually we would like to be able to do all of this, but we can start with something simpler. We can take unicellular species (like yeast) and start by understanding cellular phenotypes before we move to more complex species.

To start we need full genome sequences for many different individuals of the same species. For S. cerevisiae we have genome sequences for 38 different isolates from Liti et al. We then need phenotypic differences across these individuals. For S. cerevisiae there was a great study published in June last year by Warringer and colleagues, where they tested the growth rate of these isolates under ~200 conditions. With these data together we can attempt to predict how the observed mutations might result in the differences in growth. As a first attempt we can look at the non-synonymous coding mutations. For these 38 isolates there are something like 350 thousand non-synonymous coding mutations. We can predict the impact of these mutations on a protein either by analyzing sequence alignments or by using structures and statistical potentials. There are advantages and disadvantages to both approaches, but I think they end up being complementary. The sequence analysis requires large alignments, while the structural methods require a decent structural model of the protein. I think we will need a mix of both to achieve good coverage of the proteome.

I started with the sequence approach as it was faster. I aligned 2329 S. cerevisiae proteins that have more than 15 orthologs in other fungal species and used MAPP from the Sidow lab at Stanford to calculate how constrained each position is. I got about 50K non-synonymous mutations scored with MAPP, of which about 1 to 8 thousand could be called potentially deleterious depending on the cut-off. To these we can add mutations that introduce STOP codons, in particular if they occur early in the protein (~710 of these fall within the first 50 AAs of proteins).
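
To make the filtering step concrete, here is a minimal Python sketch of how mutations could be flagged as potentially deleterious. It is only an illustration under stated assumptions: the `Mutation` record, the function names and the gene names are hypothetical, and `score` is a stand-in for a MAPP-style constraint score where higher means more constrained. It does not reproduce MAPP's actual output format.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class Mutation:
    """Hypothetical record for one non-synonymous coding change."""
    gene: str
    position: int           # 1-based amino-acid position in the protein
    alt: str                # mutant residue; "*" marks an introduced STOP codon
    score: Optional[float]  # assumed constraint score; None if the protein was not aligned


def is_potentially_deleterious(m: Mutation, score_cutoff: float,
                               early_stop_window: int = 50) -> bool:
    """Flag early STOP codons and substitutions scoring at or above the cut-off."""
    if m.alt == "*":
        # Only premature stops near the start of the protein are counted here.
        return m.position <= early_stop_window
    # Unscored mutations (no usable alignment) are never flagged.
    return m.score is not None and m.score >= score_cutoff


def deleterious_genes(mutations: Iterable[Mutation], score_cutoff: float) -> Set[str]:
    """Collect the genes carrying at least one flagged mutation."""
    return {m.gene for m in mutations if is_potentially_deleterious(m, score_cutoff)}


# Made-up example data, for illustration only.
example = [
    Mutation("GENE_A", 12, "*", None),   # early STOP codon
    Mutation("GENE_B", 300, "*", None),  # late STOP codon, ignored by this rule
    Mutation("GENE_C", 45, "P", 22.0),   # highly constrained position
    Mutation("GENE_D", 80, "A", 3.0),    # weakly constrained position
]
print(sorted(deleterious_genes(example, score_cutoff=10.0)))
```

Raising or lowering `score_cutoff` is what moves the number of calls between the low and high ends of the range quoted above.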

So up to here we have a way to predict if a mutation is likely to impact negatively on a protein's function and/or stability. How do we go from here to a phenotype like a decreased growth rate in the presence of stress X? This is exactly the question that chemical-genetic studies try to address. Many labs, including our own, have used knock-out collections (of lab strains) to measure chemical-genetic interactions, which give a quantitative measure of the relative importance of each protein in a given condition. So, we can make the *huge* simplification that we can take all deleterious mutations and just sum up the effects, assuming a linear combination of the effects of the knock-outs.
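
The linear simplification can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the gene names and the numbers are made up, and it assumes a sign convention where a larger knock-out value means a stronger growth defect of that knock-out in the condition of interest. Real chemical-genetic scores may use the opposite sign.

```python
from typing import Dict, Iterable


def predict_growth_defect(hit_genes: Iterable[str],
                          knockout_defect: Dict[str, float]) -> float:
    """Sum the knock-out defects over all genes hit by a deleterious mutation.

    Genes without a chemical-genetic measurement contribute nothing.
    """
    return sum(knockout_defect.get(gene, 0.0) for gene in hit_genes)


# Made-up chemical-genetic scores for one condition (higher = knock-out grows worse).
condition_scores = {"GENE_A": 1.5, "GENE_C": 0.5, "GENE_X": 2.0}

# Genes carrying predicted deleterious mutations in one hypothetical strain.
strain_hits = {"GENE_A", "GENE_C", "GENE_D"}

print(predict_growth_defect(strain_hits, condition_scores))
```

Treating each mutated gene as a full knock-out and ignoring interactions between genes is exactly the simplification described above, so a score like this is at best a rough ranking of strains within one condition.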

To test this idea I picked 4 conditions (out of the ~200 mentioned above) for which we have chemical-genetic information (from Parsons et al.) and where there is high growth rate variation across the 38 strains. With everything together I can test how well we can predict the measured growth rates under these conditions (relative to a lab strain):
Each entry in the plot represents 1 strain in a given condition. Higher values report worse predicted/experimental growth (relative to a lab strain). There is a highly significant correlation between measured and predicted growth defects (~0.57) overall, but cisplatin growth differences are not well predicted by these data. Given the many simplifications and the poor coverage of some of the methods used, I was surprised to see a correlation at all. This tells us that, at least for some conditions, we can use mutations found in coding regions and appropriately selected gene sets to predict growth differences.
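
For readers who want to run the same kind of comparison on their own numbers, a correlation between predicted and measured defects can be computed as sketched below. The post does not say which coefficient was used, so Pearson is an assumption here, and the function name and example values are made up for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: one entry per (strain, condition) pair.
predicted = [0.0, 1.5, 2.0, 0.5, 3.0]
measured = [0.2, 1.0, 2.5, 0.1, 2.0]
print(round(pearson(predicted, measured), 2))
```

Computing the coefficient separately for each condition, as well as over all pairs pooled, would show whether a single condition is dragging the overall value down.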

This is exactly the message of Rob Jelier's paper from Ben Lehner's lab. When they started their work, the phenotypic dataset from Warringer and colleagues was not yet published, so they had to generate their own measurements for this study. In addition, their study is much more careful in several ways. For example, they only used the sequences for 19 strains that they say have higher coverage and accuracy. They also tried to estimate the impact of indels, and they tried to increase the size of the alignments (a crucial step in this process) by searching for distant homologs. If you are interested in making use of "personal" genomes you should really read this paper.

Stepping back a bit, I think I was excited about this paper because it finally connects the work that has been done on high-throughput characterization of a model organism with the diversity across individuals of that species. It serves as a bridge for many people to come and work in this area. There are a large number of immediate questions, like: how much do we really need to know to make good/better predictions? What kind of interactions (transcriptional, genetic, conditional genetic) do we need to know to capture most of the variation? Can we select gene sets and gene weights in other species without the conditional-genetics information (by homology)?

As we are constantly told, the deluge of genome sequences will continue, so there are plenty of opportunities and data to analyze (I wish I had more time ;). Some recent examples of interest include the sequencing of 162 D. melanogaster lines with associated phenotypic data and the (somewhat narcissistic) personal 'omics study of Michael Snyder. To start to make the jump to human, I think it would be great to have cellular phenotypic data (growth rate/survival under different conditions) for the same cells/tissue across a number of human individuals with a sequenced genome. Maybe in a couple of years I won't be as skeptical as I am now about our fortune cookie genomes.