Friday, March 29, 2019

Research summary - Predicting phenotypes of individuals based on missense variants and prior knowledge of gene function

I have been meaning to write blog posts summarising different aspects of the work from our group over the past 6 years, putting it into context with other works and describing also some future perspectives. I have just been at the CSHL Network Biology meeting with some interesting talks that prompted me to put some thoughts to words regarding the issue of mapping genotypes to phenotypes, making use of prior cell biology knowledge. Skip to the last section if you just want a more general take and perspective on the problem. 

Most of the work of our group over the past 6 years has been related to the study of kinase signalling. One smaller thread of research has been devoted to the relation between genotypes and phenotypes of individuals of the same species. My interest in this comes from the genetic and chemical genetic work in S. cerevisiae that I contributed while a postdoc (in Nevan Krogan’s lab). My introduction to genetics was from studies of gene deletion phenotypes in a single strain (i.e. individual) of a model organism. Going back to the works of Charlie Boone and Brenda Andrews, this research always emphasised that, despite rare, non-additive genetic and environment-gene interactions are numerous and constrained in predictable ways by cell biology. To me, this view of genetics still stands in contrast to genome-wide association studies (GWAS) that emphasise a simpler association model between genomic regions and phenotypes. In the GWAS world-view, genetic interactions are ignored and knowledge of cell biology is most often not considered as prior knowledge for associations (I know I am am exaggerating here). 

Predicting phenotypes of individuals from coding variants and gene deletion phenotypes

Over 7 years ago, some studies of strains (i.e. individuals) of S. cerevisiae made available genome and phenotypic traits. Given all that we knew about the genetics and cell biology of S. cerevisiae I thought it would not be crazy to take the genome sequences, predict the impact of the variants on proteins of these strains and then use the protein function information to predict fitness traits. I was brilliantly scooped on these ideas by Rob Jelier (Jelier et al. Nat Genetics 2011) while he was in Ben Lehner’s lab (see previous blog post). Nevertheless, I though this was an interesting direction to explore and when Marco Galardini (group profile, webpage)  joined our group as a postdoc he brought his own interests in microbial genotype-to-phenotype associations and which led to a fantastic collaboration with the Typas lab in Heidelberg pursuing this research line. 

Marco set out to scale up the initial results from Ben’s lab with an application to E. coli. This entailed finding a large collection of strains from diverse sources, by sending emails to the community begging them to send us their collections. We compiled publicly available genome sequences, sequence some more and performed large scale growth profiling of these strains in different conditions. From the genome sequences, Marco calculated the impact of variants, relative to the reference genome and used variant effect predictors to identify likely deleterious variants. Genomes, phenotypes and variant effect predictions are available online for reuse. For the lab reference strain of E. coli, we had also quantitative data of the growth defects caused by deleting each gene in a large panel of conditions. We then tested the hypothesis that the poor growth of a strain of E. coli (in a given condition) could be predicted from deleterious variants in genes known to be important in that same condition (Galardini et al. eLife 2017). While our growth predictions were significantly related to experimental observations the predictive power was very weak. We discuss the potential reasons in the paper but the most obvious would be errors in the variant effect predictions and differences in the impact of gene deletion phenotypes in different genomic contexts (see below). 

Around the same time Omar Wagih (group profile, twitter), a former PhD student, started the construction of a collection of variant effect predictors, expanding on the work that Marco was doing to try to generalise to multiple mechanisms of variant effects and to add predictors for S. cerevisiae and H. sapiens. The result of this effort was the resource (Wagih et al. MSB 2018). Given a set of variants for a genome in one of the 3 species mutfunc will try to say which variants may have an impact on protein stability, protein interactions, conserved regions, PTMs, linear motifs and TF binding sites. There is a lot of work that went into getting all the methods together and a lot of computational time spent on pre-computing the potential consequence of every possible variant. We illustrate in the mutfunc paper some examples of how it can be used. 

Modes of failure – variant effect predictions and genetic background dependencies

One of the potential reasons why the growth phenotypes of individual stains may be hard to predict based on loss of function mutations could be that the variant effect predictors are simply not good enough. We have looked at recent data on deep mutational scanning experiments and we know there is a lot of room for improvement. For example, the predictors (e.g. FoldX, SIFT) can get the trends for single variants but really fail for more than one missense variant. We will try to work on this and the increase in mutational scanning experiments will provide a growing set of examples on which to derive better computational methods. 

A second potential reason why loss of function of genes may not cause predictable growth defects would be that the gene deletion phenotypes depends on the rest of the genetic background. Even if we were capable of predicting perfectly when a missense variant causes loss of function  we can’t really assume that the gene deletion phenotypes will be independent of the other variants in the genome. To test this we have recently measured gene deletion phenotypes in 4 different genetic backgrounds of S. cerevisiae. We observed 16% to 42% deletion phenotypes changing between pairs of strains and described the overall findings in this preprint that is currently under review. This  is consistent with other works, including RNAi studies in C. elegans where 20% of 1,400 genes tested had different phenotypes across two backgrounds. Understanding and taking into account these genetic background dependencies is not going to be trivial.

Perspectives and different directions on genotype-to-phenotype mapping

Where do we go from here ? How do make progress in mapping how genotype variants impact on phenotypes ? Of course, one research path that is being actively worked on is the idea that one can perform association studies between genotypes and phenotypes via “intermediate” traits such as gene expression and all other sorts of large scale measurements. The hope is that by jointly analysing such associations there can be a gain in power and mechanistic understanding. Going back to the Network Biology meeting this line of research was represented with a talk by Daifeng Wang describing the PsychENCODE Consortium with data for the adult brain across 1866 individuals with measurements across multiple different omics (Wang et al. Science 2018). My concern with this line of research is that it still focuses on fairly frequent variants and continues not to make full use of prior knowledge of biology. If combinations of rare or individual variants contribute significantly to the variance of phenotypes such association approaches will be inherently limited. 

A few talks at the meeting included deep mutational scanning experiments where the focus is mapping (exhaustively) genotype-to-phenotype on much simpler systems, sometimes only a single protein. This included work from Fritz Roth and Ben Lehner labs. For example, Guillaume Diss (now a PI at FMI), described his work in Ben’s lab where they studied the impact of  >120,000 pairs of mutations on an protein interaction (Diss & Lehner eLife 2018). Ben’s lab has several other examples where they have look in high detail and these fitness maps for specific functions (e.g. splicing code, tRNA function). From these, one can imagine slowly increasing the system complexity including for example pathway models. This is illustrated in a study of natural variants of the GAL3 gene in yeast (Richard et al. MSB 2018). This path forward is slower than QTL everything but the hope would be that some models will start to generalise well enough to apply them computationally at a larger scale. 

Yet another take on this problem was represented by Trey Ideker at the meeting. He covered a lot of ground on his keynote but he showed how we can take the current large scale (unbiased) protein-protein functional association networks to create a hierarchical view of the cellular functions, or a cellular ontology (Dutkowski et al. Nat Biotech 2013 , Then this hierarchical ontology can be used to learn how perturbations of gene functions combine in unexpected ways and at different levels of the hierarchy (Ma et al. Nat Methods 2018). The notion being that higher levels in the hierarchy could represent the true cellular cause of a phenotype. In other words, DNA damage repair deficiency could be underlying cause of a given disease and there are multiple ways by which such deficiency can be caused by mutations. Instead of performing linear associations between DNA variants and the disease, the variants can be interpreted at the level of this hierarchical view of gene function to predict the DNA damage repair deficiency and then associate that deficiency with the phenotype. The advantages of this line of research would be to be able to make use of prior cell biology knowledge and in a framework that explicitly considers genetic interactions and can interpret rare variants.  

I think these represent different directions to address the same problem. Although they are all viable, as usual, I don't think they are equally funded and explored.