Cellular Consequences of Genetic variation

Thursday, June 10, 2021

A not so bold proposal for the future of scientific publishing

Around 15 years ago I wrote a blog post about how we could open up more of the scientific process. The particular emphasis that I had in mind was to increase the modularity of the process in order to make it easier to change parts of it without needing a revolution. The idea was that manuscripts would be posted to preprint servers, where they could accumulate comments and be revised until they were considered suitable for accreditation as a peer-reviewed publication. At the time I also thought we could be even more extreme and have all lab notebooks open to anyone, which I no longer consider to be necessarily useful.

Around 15 years have passed and while I was on point with the direction of travel I was very off the mark in terms of how long it would take us to get there. Quite a lot has happened in the last 15 years, with the biggest changes being the rise of open access, preprint servers and social media. PLoS One started as a journal that wanted us to do post-publication peer review. It started with peer review focused on accuracy, wanting then to leverage the magic of internet 2.0 to rank articles by how important they were through likes and active commenting by other scientists. The post-publication peer review aspect was a total failure but the journal was an economic success that led to the great PLoS One Clone Wars with consequences that are still being felt today - just go and see how many new journals your favourite publisher opened this year.

The rise of preprint servers has been the real magic for me. We live in each other's scientific past by at least 2 years or so. If you sit down and have a science chat with me I can tell you about all of the work that we are doing which won't be public for some 2 years. If I didn't put our group's papers out as preprints you would be waiting at least another 6-12 months to know about them. Preprint servers are a time machine: they move everyone forward in time by 12 months and speed up the exchange of ideas as they are being generated around the globe. If you don't post your manuscripts as preprints you are letting others live in the past and you are missing out on increased visibility of your own research.

Preprint servers also serve the crucial need to dissociate the act of making a manuscript public from the process of peer review, certification as a peer-reviewed paper and dissemination. This is important because it allows the whole scientific publishing system to innovate. This is needed because we waste too much money and time on a system that is currently not working to serve the authors or readers efficiently.

So after nearly 15 years the updated version of the proposal is almost unchanged:

I no longer think it would be that useful to have lab notebooks freely available to anyone to read. There are parts of research that are too unclear and I suspect that the noise to information ratio would be too high for this to be of value. However, useful datasets that are not yet published could be more readily made available prior to publication. Along these lines, the ideas in the form of funded grant proposals should be disclosed after the funding period has lapsed. As for the flow from manuscript to publication, the main ideas remain and the systems already exist to make these more than just ideas. There are already independent peer review systems like Review Commons. Such systems could eventually be paid and could lead to the establishment of professional paid peer reviewers. Such costs would then be deducted from other publishing costs depending on how the accreditation was done. Eventually "traditional" publishing could be replaced by overlay journals, like preLights, whose job would be to identify peer-reviewed preprints that are of interest to a certain community.

Social media for me has been the most surprising change in scientific communication. I didn't expect so many scientists to join online discussions via social media. Then again, I didn't foresee the geekification of society. In many ways social media is already acting as a "publishing" system in the sense of distribution. Most of the articles I read today I find through Twitter or Google Scholar recommendations. As we are all limited by the attention we can give, I think one day, instead of complaining about how impact factors distort hiring decisions, we will be complaining about how social media biases distort what we think is high value science.

So finally, what can you do to move things along if you feel it is important? If you think we have too many wasteful rounds of peer review across different journals, that the cost of open access publishing is too high, or even simply that publicly funded research should be free to read and openly available to mine, then the best single thing you can do today is make your manuscripts available via preprint servers.

Friday, May 21, 2021

Lab move to ETH Zurich, the job search and fixed term PI positions

ETH Zurich (credit)

Next January, after 9 years at the EMBL, I will be joining ETH Zurich as a tenured faculty member in the Department of Biology with my research group hosted at the Institute for Molecular and Systems Biology (IMSB). I am really excited about this move and I think the IMSB is a perfect fit for the type of research that we do. We primarily use computational approaches to study the relation between genotype and phenotype with a specific focus on post-translational regulatory systems (more on the EBI website or my GScholar page). IMSB has a long tradition of method development in large scale measurements of biological systems with a current interest in mechanistically explaining trait variation. The smaller experimental component of our group uses yeast genetics, which is also a great fit for the groups around us, including our future neighbours in the Institute of Biochemistry. Research wise the group will remain focused on: studying the evolution and functional importance of post-translational regulation; and determining the regulatory networks of a cell and how they change under different conditions, including disease. More broadly we also study the mechanisms that underlie trait variation across individuals of the same species. In terms of methods it will remain primarily computational with around 30% of the group devoted to lab work. The lab will be fully equipped for large scale yeast genetics with the exciting addition of having funding for an MS instrument for the proteomics.

Teaching, scientific integration and group structure

With any move there are always some thoughts about the challenges ahead. Professionally, the types of things on my mind are that I will need to set up the group, integrate myself scientifically and prepare myself for teaching. Setting up the group and integrating myself within the local environment won't be new experiences. I feel I was too slow with both of these things when I first joined EMBL-EBI so I am curious if I will be able to move things along faster this time. Coming from EMBL and the local EBI/Sanger campus I have the impression that ETH is less collaborative, but there were clearly many people interested in collaborating just from the small sample I got during interviews. There is an interesting difference in group structure between EMBL and ETH: at ETH a group can have sub-groups with junior PIs that can have varying degrees of independence as per the decision of the more senior PI. Organising a lab in this way will be something new. Finally, I will have to teach at the undergraduate level for the first time. I have always said that students coming out of biology or related topics need to have better training in bioinformatics. While daunting, this will be my chance to contribute to this training directly.

The interview process and decisions

For those less familiar with the EMBL, group leaders are hired for a maximal period of 9 years with only a few exceptions (around 10%) that end up having an open-ended contract. We get generous core funding and get to tap into a great scientific network which more than compensates for the lack of tenure. This means that around year 7 your thoughts start moving into the future. At faculty presentations I would often write how many years I had left in the title slide as a personal reminder. Towards the end of year 7 I started applying and spent most of year 8 applying and interviewing. The first time I applied for PI positions it was all very unidirectional, with myself looking broadly for possible places. This time it felt more like dating a potential future university/institute with expressions of interest on both sides. One of the issues in going into this is that I didn't really know what my value would be in the market. I knew I had a good CV and would certainly find a job, I just didn't know where I could aim for in terms of seniority and resources. That became clearer only after the first interview and the expression of interest of places I felt were really fantastic.

The second half of 2020 then became about trying to find the best place professionally and personally. I ended up applying to 10 places, interviewed in 8 and received 5 offers. I tried to find a job in my home country (Portugal) but of the two places I was interested in, one picked another candidate and the other could not make an offer that was not fixed term. The decision ended up being among 3 places with the major differentiation factor being between 2 offers that had less core funding but higher management responsibilities and ETH with incredibly generous core funding and the best scientific fit (but less seniority). Personally the decisions were about staying in the UK or moving to France or Switzerland. There is quite a lot to be said about this choice (safety, adventure, integration, kid friendly, jobs for partner, etc) and in the end we went with Switzerland. While excited I am also anxious about yet another move to what will be my 5th home country, the now almost familiar sense of uprooting and new beginnings. But this is not yet time for goodbyes.

Non-tenure group leader positions (in Europe)

I don't know who invented the fixed term, non tenure track, group leader positions in academia. It may have been EMBL and this model has clearly spread across Europe with many research institutes having some form of junior positions that have a variable number of years (5 to 12) to set up a group and then necessarily need to move on to a different place. EMBL does this because it is funded by many member state countries to train the next generation of "academic leaders" that will lead research groups across the member states. The obvious advantage of hosting these positions is that it keeps the institute forever young if you manage the turnover well. I think these positions can work well if they remain a relatively small proportion of the total PI/faculty positions; there is some level of support to at least kick start the group; and the positions last a sufficient number of years. Having gone through this at EMBL my impression is that 7 years would be the bare minimum and 9-10 years would be ideal. This also depends on the level of support beyond the PI salary. If conditions are not met then it is not worth setting up people for failure with the selfish goal of using the higher turnover to bring in new ideas/methods. Don't give people super postdoc positions for 3-5 years with no funding and no chances of tenure just because you want fresher ideas around. If there is some mechanism for tenure or open ended contract then it should be crystal clear from the start how (un)likely this is and what are the transparent criteria for achieving it.

Friday, January 29, 2021

State of the lab 7 & 8 - The last years at EMBL

This is usually part of a yearly series of posts where I note down thoughts related to managing a research group in academia over the years. This post covers years 7 and 8 and it brings me now to the start of year 9, my last at EMBL. While I usually do one of these posts every year, with all of the craziness of 2020 I ended up skipping one.

Year 7, group turnover

2019 was the year where the group fully turned over all lab members that were with us since the earlier years with 2 postdocs (Haruna Imamura and David Ochoa) and 3 PhD students (David Bradley, Claudia Hernandez-Armenta and Marta Strumillo) leaving. Haruna is now a Research Scientist at the Systems Biology Institute in Japan, David O is the platform coordinator at Open Targets and Claudia and David B are now doing postdocs. Marta is finding her way through consulting. We were joined by 2 postdocs (David Burke and Miguel Correa) and 2 PhD students (Eirini Petsalaki and Rosana Garrido). This constant turnover of group members is quite difficult to manage both personally and professionally. Year 7 was really the year with the largest amount of changes in the group and there is something to be considered about trying to make sure that changes remain gradual. However, it is not always possible to plan for this to happen. While I think that this change in academia is generally positive for science, I do wonder what could be achieved if this was not a requirement (see earlier post).

Managing research focus over the years

Over the last few years, the research topics of the group have become somewhat dispersed. At the start, the group was named "Evolution of cellular interactions" with a primary focus on the evolution and functional relevance of protein phosphorylation. While this remained the central focus there were other areas we worked on including cancer genomics and genetics of human disease and microbial trait diversity. We also have work that is not yet visible on drug mode-of-action predictions. This led me to change the group name to "Cellular consequences of genetic variation" which could better serve as an umbrella for the different topics. This is, at least in part, a simple reflection of funding opportunities but also a reflection of true movement in my research interests and the environment I have been working in (Genome Campus). On one hand I feel this dispersion is detrimental in that we could do more with a single minded focus, but on the other hand these extensions have not really been the majority of our work and also act as a way for the group to explore new directions. My visual reference for this is a cell sending out protrusions in some directions to feel out the environment around. On some of these new areas (e.g. microbial trait diversity) I feel we have done enough, even with a small total investment, to make the work stand on its own.

I have to say that, without explicitly planning for it, the dispersion worked to my advantage when applying for positions last year as it allowed me to present the group through slightly different lenses depending on where I was interviewing. Of course, this is only beneficial if there is sufficient research progress made by the group not to appear superficial or unfocused. I suspect that this movement in research topics is normal but I haven't had many deep conversations with others about how this has happened to them in their research groups. In some cases, the changes in topics for some groups seem more abrupt from the outside but it could be just a perception. I will soon have an opportunity to rethink where we put most of our research efforts and likely cut back on some of these extensions.

Year 8 - A new group, the pandemic and the job market

At the start of last year, I was finally getting comfortable with the idea that the group had changed so much and I was truly excited about the new beginning. Just as the year was starting and I was enjoying this excitement the pandemic hit. As I had described before, we ended up devoting some effort in the group to work on SARS-CoV-2 projects which I think was also good for group morale. However, the changes in working conditions, the effort on the SARS projects and my need to go back to the job market made me less capable of keeping up with some of the projects in the group. While most of the work has kept going there are at least 3 projects/manuscripts that have been neglected simply for my own lack of time/effort. We all know these stories of PIs that let work pile up on their desk and I feel it as a failure although I can rationalise why I really didn't have the time to fully keep up.

Finally, over the last year I was fully back on the job market and I am so relieved that this is now over. Since there is nothing official that I can announce I will wait to write up in detail what the process was like and compare it to my first attempt to secure a PI position. I can at least say that I will leave EMBL-EBI at the end of this year and I will certainly write more about the 9 years at EMBL. I do want to look back to all that has been good (mostly) and bad, make a summary of what I feel were the biggest advances we made, perhaps discuss the finances, and more broadly go over the issues of this lack of tenure for junior PIs now implemented in so many European research institutions.

Friday, December 04, 2020

A year of SARS-CoV-2 research

This post may be premature but I feel like writing down some thoughts about the roller coaster that this year has been. At the start of the year, with the number of reported cases rising in Europe, the EMBL and our institute (EMBL-EBI) decided to send everyone home as a precautionary measure. As most of our group is computational, this has meant we have been working from home for most of this year. Early on, somewhat frustrated by not being able to help, I emailed a few people that could be working on the virus. Nevan Krogan replied saying our help would be useful and we joined the global effort to contribute to solving this crisis.

Science at science fiction speed

Over the course of 9 months we took part in 4 projects, some of these being the most thrilling science I have ever taken part in. We condensed what would easily be a 3 to 5 year research project into something done in 3-4 months, involving typically 10-20 research groups with a few key people helping to direct the research. We were collecting data, analysing and suggesting new experiments in the span of days with some of the best scientists in the world. Contributing to the direction of this level of resources has been an amazing experience that I wish every scientist could try at least once in their life. These projects were all geared towards studying how SARS-CoV-2 takes control of its target cells, in order to suggest human-targeting drugs that could counter the infection. Several of the compounds identified in these studies are in clinical trials for COVID-19 so I feel the projects met their main objective.

While this has been my perspective from working on these specific projects we are all aware of the amazing scientific progress that has been made over the course of this year. I remember seeing the movie Contagion and almost laughing at the unrealistically fast pace of research in the movie. However, SARS-CoV-2 research has in fact happened at an incredibly fast pace that probably matches the movie.

Why don't we do this for disease X?

One discussion point that has come up often is whether we can learn from this period and apply it to research into other diseases. Science is an international endeavour but the degree of collaboration for SARS-CoV-2 research has been higher than usual. The effort put into this was also high among the projects I have seen personally but this eventually results in some exhaustion and it is not sustainable. I don't think this is easy to repeat for other diseases without the same external sense of urgency. Most scientists won't just drop what they are working on to fully focus on some other research question. Maybe it is an argument for an even higher degree of collaboration, in particular between academia and biotech/pharma. There may be some small increase in productivity of collaborations through the use of online tools like Slack and Zoom but overall I don't see that the way we do science has been dramatically changed going forward.

The case for higher spending in research

I'm gonna have to science the s**t out of this

Jeremy Farrar has often said that science is our exit strategy for this crisis, from testing and tracking the spread to treatments and vaccines. It is this single minded effort of so much of the world's research capacity that will lead to a long lasting solution. This already looks to be within reach with some treatment options, new ways of testing and critically, what appear to be effective vaccines. Soon enough we will be looking back and asking ourselves if there is something we could have done better. As trained scientists our reflex is to pause and think carefully about all the things that could have worked better. Were we efficient? Did we deal well with the deluge of studies? Was the peer-review too shallow and quick? It is our instinct to be critical but maybe we should be more vocal about how amazing the response of the scientific community has been. More importantly, this is the time to demand higher funding rates. If society can't see how important science is during a pandemic, when are we going to make our case? This is the capacity of a research infrastructure that is funded by 1-2% of national budgets; what could humanity achieve if we were to double it?

Over the last 10 years academic science budgets have been squeezed and a lot has been said about how academic science needs to be more applied and how much we should justify the investment that is being made. This week, DeepMind, a private research institute funded by what is essentially an advertising company (Alphabet/Google), has made headlines with their impressive research into predicting the structure of a protein from its sequence. An advertising company finds the money to invest into what are fundamental biological problems, and in the middle of a pandemic that is being solved by a global scientific infrastructure we can't get the EU science budget to increase. We should be ready to make our case over the course of the next months.

Thursday, May 30, 2019

PlanS, the cost of publishing, diversity in publishing and unbundling of services

A few days ago I had another conversation about PlanS with someone involved in a non-profit scientific publisher. I am still sometimes surprised that these publishers have been very much reacting to the changes in the landscape. In hindsight I can understand that the flipping of the revenue model to author fees has been threatened for a long time but always seemed to be moving along slowly. Without going into PlanS at all, the issue for many of the smaller publishers is that they simply cannot survive under an author fee model because their revenue from the subscription would translate to an unacceptable cost per article (given that they reject most articles). These smaller publishers typically use their profit to then fund community activities (e.g. EMBO press). The big publishers will do just fine because they have a structure that captures most articles in *some* journal so their average cost per article would end up being acceptable in a world without subscriptions.
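The arithmetic behind this argument can be sketched in a few lines. This is only an illustration: the function name and all of the numbers below are made up for the example and do not describe any real publisher. It assumes that the author fee has to replace the subscription revenue and that only accepted articles pay.

```python
def fee_per_accepted_article(annual_revenue, submissions, acceptance_rate):
    """Author fee needed per accepted article to replace a given revenue.

    Hypothetical helper: assumes rejected submissions pay nothing, so the
    whole revenue has to be recovered from the accepted articles alone.
    """
    accepted = submissions * acceptance_rate
    if accepted <= 0:
        raise ValueError("at least one article must be accepted")
    return annual_revenue / accepted


# Made-up numbers: same revenue and same number of submissions,
# but very different rejection rates.
selective = fee_per_accepted_article(2_000_000, 2_000, 0.10)
broad = fee_per_accepted_article(2_000_000, 2_000, 0.80)

print(f"selective journal: {selective:,.0f} per article")
print(f"broad portfolio:   {broad:,.0f} per article")
```

With these invented inputs the selective journal would need a fee around eight times higher than a publisher that ends up publishing most of what it receives, which is the asymmetry described above.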

I don’t want to go into the specifics of PlanS at all but I see clearly the perspective of the funders and wider society of wanting to have open access and even reducing the costs of publishing. The publishers have been given quite a lot of time to adapt and maybe some amount of disruption is now needed. One potential outcome of fully flipping the paying model might be that we simply lose the smaller publishers and consequently lose also their community activities if they can’t find alternative ways to fund them. There are enough journals in scientific publishing that, to be honest, I think the disruption will not be large.

Fewer publishers means less innovation in publishing

What I fear we will lose with the reduction in the number of publishers is the potential to generate new ideas in scientific publishing. Publishers like EMBO press, eLife and others have been a great engine for positive change. Examples include more transparent peer review, protection from scooping, cross-commenting among peer-reviewers, checks on image manipulation, and surfacing the data underlying the figures (see SourceData). While this innovation tends to spread across all publishers it is not rewarded by the market. Scientific publishing does not work within a well-functioning economic market. We submit to the journals that have the highest perceived “impact” and such perceived impact is then self-sustaining. It would take an extraordinary amount of innovation to disrupt leaders in the market. For me, this is a core problem of publishing, the fact that the market is not sensitive to innovation.

To resolve this problem we would have to continue the work to reduce the evaluation of scientists by the journals they publish in. Ideas around alt-metrics have not really moved the needle much. Without any data to support this, my intuition is that the culture has changed somewhat due to people discussing the issue but the change is very slow. I still feel that working on article recommendation engines would be a key part of reducing the “power” of journal brands (see previous post). Surprisingly, preprints and Twitter are already working for me in terms of getting reasonable recommendations, but peer-review is still a critically important aspect of science.

Potential solutions for small publishers

Going back to the small publishers, one thing that has been on my mind is how they can survive the coming change in revenue model. Several years ago I think the recommendation could have been to just grow and find a way to capture more articles across a scale of perceived impact (previous post). However, there might not be space for other PLOS One clones. An alternative to growing in scale would be to merge with other like-minded publishers. This is probably not achievable in practice but some cooperation is being tested, as for example in the Life Science Alliance journal. Another thought I had was then to try to get the market to appreciate the costs around some of the added value of publishing. This is essentially the often discussed idea of unbundling the services provided by publishers (the Ryanair model?).

Maybe the most concrete example of unbundling of a valuable service could be the checks on non-ethical behavior such as image manipulation or plagiarism. These checks are extremely valuable but right now their costs are not really considered as part of the cost of publishing. Publishers could consider developing a package of such checks, that they use internally, as a service that could be sold to institutions that would like to have their outgoing publications checked. Going forward, some journals could start demanding some certification of ethical checks or funding agencies could also demand such checks to be made on articles resulting from their funded research. Other services could be considered for unbundling in the same way (e.g. peer review) but these checks on non-ethical practices seem the most promising.

(disclosures: I currently serve on the editorial board of Life Science Alliance and the Publications Advisory Board for EMBO Press)

Friday, March 29, 2019

Research summary - Predicting phenotypes of individuals based on missense variants and prior knowledge of gene function

I have been meaning to write blog posts summarising different aspects of the work from our group over the past 6 years, putting it into context with other works and describing also some future perspectives. I have just been at the CSHL Network Biology meeting with some interesting talks that prompted me to put some thoughts to words regarding the issue of mapping genotypes to phenotypes, making use of prior cell biology knowledge. Skip to the last section if you just want a more general take and perspective on the problem.

Most of the work of our group over the past 6 years has been related to the study of kinase signalling. One smaller thread of research has been devoted to the relation between genotypes and phenotypes of individuals of the same species. My interest in this comes from the genetic and chemical genetic work in S. cerevisiae that I contributed to as a postdoc (in Nevan Krogan’s lab). My introduction to genetics was from studies of gene deletion phenotypes in a single strain (i.e. individual) of a model organism. Going back to the works of Charlie Boone and Brenda Andrews, this research always emphasised that, although rare, non-additive genetic and environment-gene interactions are numerous and constrained in predictable ways by cell biology. To me, this view of genetics still stands in contrast to genome-wide association studies (GWAS) that emphasise a simpler association model between genomic regions and phenotypes. In the GWAS world-view, genetic interactions are ignored and knowledge of cell biology is most often not considered as prior knowledge for associations (I know I am exaggerating here).

Predicting phenotypes of individuals from coding variants and gene deletion phenotypes

Over 7 years ago, some studies of strains (i.e. individuals) of S. cerevisiae made genome sequences and phenotypic traits available. Given all that we knew about the genetics and cell biology of S. cerevisiae I thought it would not be crazy to take the genome sequences, predict the impact of the variants on proteins of these strains and then use the protein function information to predict fitness traits. I was brilliantly scooped on these ideas by Rob Jelier (Jelier et al. Nat Genetics 2011) while he was in Ben Lehner’s lab (see previous blog post). Nevertheless, I thought this was an interesting direction to explore and when Marco Galardini (group profile, webpage) joined our group as a postdoc he brought his own interests in microbial genotype-to-phenotype associations, which led to a fantastic collaboration with the Typas lab in Heidelberg pursuing this research line.

Marco set out to scale up the initial results from Ben’s lab with an application to E. coli. This entailed finding a large collection of strains from diverse sources, by sending emails to the community begging them to send us their collections. We compiled publicly available genome sequences, sequenced some more and performed large scale growth profiling of these strains in different conditions. From the genome sequences, Marco calculated the impact of variants relative to the reference genome and used variant effect predictors to identify likely deleterious variants. Genomes, phenotypes and variant effect predictions are available online for reuse. For the lab reference strain of E. coli, we also had quantitative data on the growth defects caused by deleting each gene in a large panel of conditions. We then tested the hypothesis that the poor growth of a strain of E. coli (in a given condition) could be predicted from deleterious variants in genes known to be important in that same condition (Galardini et al. eLife 2017). While our growth predictions were significantly related to experimental observations, the predictive power was very weak. We discuss the potential reasons in the paper but the most obvious would be errors in the variant effect predictions and differences in the impact of gene deletion phenotypes in different genomic contexts (see below).
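To make the hypothesis concrete, here is a toy sketch of how such a prediction could be scored. It is not the method used in the paper: the scoring rule (probability that at least one condition-important gene is disrupted, assuming genes are independent), the function name and the gene names are all assumptions made for illustration.

```python
from typing import Dict, Set


def growth_defect_score(p_lof: Dict[str, float], important_genes: Set[str]) -> float:
    """Toy score for how likely a strain is to grow poorly in one condition.

    p_lof: hypothetical per-gene probability that the strain's variants
        disrupt the gene (e.g. derived from a variant effect predictor).
    important_genes: genes whose deletion causes a growth defect in this
        condition in the reference strain.

    Returns the probability that at least one important gene is disrupted,
    under the simplifying assumption that genes are independent.
    """
    p_all_intact = 1.0
    for gene in important_genes:
        # Genes without predicted deleterious variants count as intact.
        p_all_intact *= 1.0 - p_lof.get(gene, 0.0)
    return 1.0 - p_all_intact


# Made-up example: geneA is likely disrupted and matters in this condition,
# geneB is mildly affected but irrelevant here, geneC has no variants.
strain = {"geneA": 0.9, "geneB": 0.1}
condition_genes = {"geneA", "geneC"}
print(round(growth_defect_score(strain, condition_genes), 3))
```

Strains would then be ranked by this score within each condition and compared with the measured growth. Both failure modes discussed in this post enter directly: errors in the per-gene disruption probabilities, and the assumption that the list of important genes measured in the reference strain transfers to other strains.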

Around the same time Omar Wagih (group profile, twitter), a former PhD student, started the construction of a collection of variant effect predictors, expanding on the work that Marco was doing to try to generalise to multiple mechanisms of variant effects and to add predictors for S. cerevisiae and H. sapiens. The result of this effort was the www.mutfunc.com resource (Wagih et al. MSB 2018). Given a set of variants for a genome in one of the 3 species mutfunc will try to say which variants may have an impact on protein stability, protein interactions, conserved regions, PTMs, linear motifs and TF binding sites. There is a lot of work that went into getting all the methods together and a lot of computational time spent on pre-computing the potential consequence of every possible variant. We illustrate in the mutfunc paper some examples of how it can be used.

Modes of failure – variant effect predictions and genetic background dependencies

One of the potential reasons why the growth phenotypes of individual strains may be hard to predict based on loss of function mutations could be that the variant effect predictors are simply not good enough. We have looked at recent data on deep mutational scanning experiments and we know there is a lot of room for improvement. For example, the predictors (e.g. FoldX, SIFT) can get the trends for single variants but really fail for more than one missense variant. We will try to work on this and the increase in mutational scanning experiments will provide a growing set of examples on which to derive better computational methods.

A second potential reason why loss of function of genes may not cause predictable growth defects would be that the gene deletion phenotypes depend on the rest of the genetic background. Even if we were capable of predicting perfectly when a missense variant causes loss of function we can’t really assume that the gene deletion phenotypes will be independent of the other variants in the genome. To test this we have recently measured gene deletion phenotypes in 4 different genetic backgrounds of S. cerevisiae. We observed 16% to 42% of deletion phenotypes changing between pairs of strains and described the overall findings in this preprint that is currently under review. This is consistent with other works, including RNAi studies in C. elegans where 20% of 1,400 genes tested had different phenotypes across two backgrounds. Understanding and taking into account these genetic background dependencies is not going to be trivial.

Perspectives and different directions on genotype-to-phenotype mapping

Where do we go from here? How do we make progress in mapping how genotype variants impact on phenotypes? Of course, one research path that is being actively worked on is the idea that one can perform association studies between genotypes and phenotypes via “intermediate” traits such as gene expression and all other sorts of large scale measurements. The hope is that by jointly analysing such associations there can be a gain in power and mechanistic understanding. Going back to the Network Biology meeting this line of research was represented with a talk by Daifeng Wang describing the PsychENCODE Consortium with data for the adult brain across 1866 individuals with measurements across multiple different omics (Wang et al. Science 2018). My concern with this line of research is that it still focuses on fairly frequent variants and continues not to make full use of prior knowledge of biology. If combinations of rare or individual variants contribute significantly to the variance of phenotypes such association approaches will be inherently limited.

A few talks at the meeting included deep mutational scanning experiments where the focus is mapping (exhaustively) genotype-to-phenotype on much simpler systems, sometimes only a single protein. This included work from the Fritz Roth and Ben Lehner labs. For example, Guillaume Diss (now a PI at FMI), described his work in Ben’s lab where they studied the impact of >120,000 pairs of mutations on a protein interaction (Diss & Lehner eLife 2018). Ben’s lab has several other examples where they have looked in high detail at these fitness maps for specific functions (e.g. splicing code, tRNA function). From these, one can imagine slowly increasing the system complexity including for example pathway models. This is illustrated in a study of natural variants of the GAL3 gene in yeast (Richard et al. MSB 2018). This path forward is slower than QTL everything but the hope would be that some models will start to generalise well enough to apply them computationally at a larger scale.

Yet another take on this problem was represented by Trey Ideker at the meeting. He covered a lot of ground in his keynote but he showed how we can take the current large scale (unbiased) protein-protein functional association networks to create a hierarchical view of the cellular functions, or a cellular ontology (Dutkowski et al. Nat Biotech 2013, www.nexontology.org). Then this hierarchical ontology can be used to learn how perturbations of gene functions combine in unexpected ways and at different levels of the hierarchy (Ma et al. Nat Methods 2018). The notion is that higher levels in the hierarchy could represent the true cellular cause of a phenotype. In other words, DNA damage repair deficiency could be the underlying cause of a given disease and there are multiple ways by which such deficiency can be caused by mutations. Instead of performing linear associations between DNA variants and the disease, the variants can be interpreted at the level of this hierarchical view of gene function to predict the DNA damage repair deficiency and then associate that deficiency with the phenotype. The advantages of this line of research would be the ability to make use of prior cell biology knowledge in a framework that explicitly considers genetic interactions and can interpret rare variants.

I think these represent different directions to address the same problem. Although they are all viable, as usual, I don't think they are equally funded and explored.

Wednesday, January 09, 2019

State of the lab 6 – group turnover and getting back in the job market

This blog post is part of a yearly series and marks the end of the 6th year as a group leader at EMBL-EBI. Continuing on the theme of the last post of this series, 2018 was a year of wrapping up projects. We finished and made available 4 preprints (plus a few collaborations) in 2018 with 4 more manuscripts ready to be submitted early this year. As in 2017 the group continued to work at full potential with most lab members having been in the group for several years. Some of the turnover I was expecting last year was postponed for the current year. This will make 2019 particularly challenging both personally and professionally with 3 postdocs and 3 PhD students leaving. I have had a few conversations about lab turnover with more senior colleagues. Their typical responses have been that while it is hard to imagine how the group can survive when experienced people leave the incoming lab members bring new ideas and are a great opportunity to start new directions. Being an optimist I look forward to this new chapter in the group although it will be certainly sad to say goodbye to so many people.

We often talk about the issues in academia that are not great: the publish-or-perish mentality, chasing the big journals, the job market, etc. Looking back through the last 2 years I really want to make the point of how great it has been to manage this team of scientists. We got to hit that sweet spot where most team members have been in the group for a few years, know each other’s capacities and there are synergies in skill sets and projects, with group members doing a mix of computational and experimental work and a knowledge base ranging from structural biology to genetics. It feels like we could aim our collective capacity at almost any problem and we would make progress. I guess this is what is expected but for me it was the first time seeing it build up within the group. I am sure the group will hit that sweet spot again with a different configuration of people and ideas but the next few years will be a period of reconfiguration.

Group leaders at EMBL typically have a maximum of 9 years and I am currently left with 3 years to move to a new position. Although it is still some time, 3 years means I am now making the last set of hires. We will have 2 postdoc and 1 PhD positions open this year and the group size will start to decrease. Besides focusing on the start of the new projects I will be very actively applying for funding with the idea of taking that funding with me when I move. Given the time it takes to interview and have decisions made for academic posts I will start applying this year if I find interesting places that will consider hiring me in a joint position or with a delay in the start time. I aim to move the group in 2021 but could start sooner as a joint appointment which would give me time to start the new group and apply to and/or move funding. The job application period at the end of the postdoc was one of the most stressful in my life so I am not looking forward to doing it again.

Scientifically there is much to write about but instead of trying to summarise what we have finished in 2018 I think it is the right time to write a few separate blog posts with a summary of what we have achieved over the last 6 years. There have been a few separate threads of research that have resulted in multiple manuscripts so I will group them, describe the work, the people that did it all and what I think are some of the open questions that we may work on in the future.

Wednesday, January 10, 2018

State of the lab 5 – in the flow with 4 years to go

This blog post is part of a yearly series and marks the end of the 5th year as a group leader at EBI. In March we had an external evaluation of all research groups at EMBL-EBI. It was an interesting experience and overall it was judged a great success for EBI. For our group it was also part of the evaluation towards the standard renewal of contract where I got the 4 year extension. Since there is essentially no tenure at EMBL this also means that I have 4 years until I have to find a senior PI position. This is still a long time but it will increasingly be on my mind going forward. I am not particularly worried but I feel like there are many more places now in Europe with fixed term junior group leader positions. The postdoc bubble will turn into the junior PI bubble and we will have another big barrier and competition in the transition between junior and senior positions.

Personally it is almost strange to stay in the same place after 5 years since I have been typically staying 4-5 years in each place during university (Coimbra), PhD (Heidelberg) and postdoc (San Francisco). It looks like I will have to find some other excuse to thin out my pile of papers on the desk instead of simply moving to a new country and trashing everything.

The end of a cycle
Last year was our most productive year so far, as measured by the number of publications. This year is going to top it based on the manuscripts that I should be working on at the moment instead of writing this post (sorry guys). The research in the group is just flowing with more synergies among the group members. Just when everything is working so well is when so many in the group are leaving. Last year our first PhD student finished (Omar, now at DeepGenomics) and two postdocs have left (Romain moved to benevolentAI and Sheriff is now a project leader at EBI). This year there will be even more people potentially leaving. It is going to be a new challenge to try to keep the science going through the turnover. On the other hand, new arrivals signal the start of new projects and are an opportunity to move the group in new directions. Just at the end of the year, we had 3 new members starting: Allistair (PhD student), Inigo (postdoc) and Abel (visiting PhD student). Abel and Inigo will be working on the impact of mutations in protein interactions and control of protein abundance while Allistair will likely work on the evolution of regulatory networks.

Highlight from 2017 – Predicting condition specific phenotypes from genomes
Most of the work in the group is focused on understanding the function and impact of genetic variants on protein post-translational regulation, in particular for phosphorylation and ubiquitin. However, we have also been working more generically on the genotype to phenotype problem. I think these analyses could use more prior knowledge information and we are trying to contribute in this direction.

Part of this work, led by Marco (GScholar, Twitter) and in collaboration with the Typas lab in Heidelberg was finally published at the end of this year. The question we wanted to address was to what extent we can predict condition specific phenotypes of a strain of E. coli based on its genome and what we know from the well-studied E. coli K-12 lab strain. This is inspired by work that Rob Jelier and Ben Lehner did in S. cerevisiae but on a larger scale. To set the project up, imagine we know that a given gene X of E. coli is required for growth under high heat. Then, if that gene X is not present or severely mutated in a strain of E. coli, we would expect that this mutated strain should not survive well in high heat. To test this at large scale we assembled a panel of hundreds of strains of E. coli for which we obtained genomes and fitness measurements under many conditions. We modelled the consequence of mutations using different methods and we collected prior knowledge of which genes are supposed to be important for each condition. In the end we could only predict which strains would tend to grow poorly for around 40% of conditions. This level of success may not be surprising since we didn't take into account, for example, issues like gene expression levels or compensation by new genes. It could be that gene function may be a lot more plastic than currently assumed but to prove this we will need different experiments.
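The gene X reasoning above can be sketched in a few lines of Python. This is only an illustration of the idea and not the scoring used in the paper: the gene names, the probabilities, the independence assumption and the function names `gene_disruption` and `strain_score` are all made up for this example.

```python
from math import prod

def gene_disruption(variant_probs):
    """Probability that at least one variant in a gene is deleterious,
    assuming (for illustration only) that variants act independently."""
    return 1.0 - prod(1.0 - p for p in variant_probs)

def strain_score(strain_variants, condition_genes):
    """Sum of disruption probabilities over the genes known to matter in a condition.

    strain_variants: dict mapping gene -> list of per-variant deleterious probabilities
    condition_genes: set of genes whose deletion impairs growth in the reference strain
    """
    return sum(
        gene_disruption(strain_variants.get(gene, []))
        for gene in condition_genes
    )

# Hypothetical toy data: one condition (high heat) and two strains.
heat_genes = {"geneX", "geneY"}
strain_a = {"geneX": [0.9, 0.5], "geneZ": [0.99]}  # geneX is hit twice
strain_b = {"geneZ": [0.99]}                       # only an irrelevant gene is hit

score_a = strain_score(strain_a, heat_genes)
score_b = strain_score(strain_b, heat_genes)
print(score_a > score_b)  # strain A is predicted to cope worse with heat
```

A higher score means more of the condition-relevant genes are likely to be broken, so the strain is predicted to grow poorly. Anything that is not captured by the reference strain's gene deletion data (expression levels, compensating genes) is invisible to a score like this.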

Besides testing the central question expressed above this collection of E. coli strains with associated data will hopefully serve as a resource for future studies. Any additional layer of molecular data (e.g. gene expression) or phenotype (e.g. motility) we measure can make use of all of the pre-existing information. We could ask if motility correlates with the growth under several drugs we tested for example. All of the resources for this collection are freely available and of course this would not be possible without the hard work of the scientists who collected the strains to begin with (listed here).

Highlights for the year ahead
We have 3 different projects that are close to completion that relate to the functional relevance of protein phosphorylation. This is probably going to be our biggest contribution of 2018. We continue to work with the cancer related datasets, primarily using these data to study protein post-translational regulation. The aim is not necessarily to better understand cancer but to make use of the large genetic and molecular variation that exists in cancer to better understand the regulatory processes of normal cells. Additionally we will have some progress to report on the evolution of protein kinases and potentially the evolution and regulation of ubiquitylation.

Friday, January 05, 2018

Group member profile - Omar Wagih

The latest instalment of this blog post series is by Omar Wagih (@omarwagih, Gscholar) who has just last month successfully defended his PhD. Along with Marco, Omar has been part of the group working on studying how DNA variants relate to phenotypes. He developed the mutfunc resource and the fantastic guess the correlation game.

What was the path that brought you to the group? Where are you from and what did you work on before arriving in the group?
My love of genetics is, in more ways than one, biologically ingrained. Growing up in a family of scientists, I was always surrounded by a wealth of information which I instinctually sought to organise. For this, I pursued my undergraduate and masters degree at the University of Toronto, majoring in computational biology and computer science, respectively. Along the way, I was fortunate to work in some of the leading computational biology labs in Canada including those of Gary Bader, Philip Kim, Charlie Boone, Brenda Andrews, Andrew Fraser and Andrew Emili. I worked on a range of projects, from analysing images of genetic screens of yeast to determining the impact of disease mutations on kinase-substrate phosphorylation. These experiences led me to develop an interest in understanding how changes in the genome translate to variability in cellular physiology, and ultimately phenotype, which prompted me to pursue my PhD.

What are you currently working on?
My current project involves working towards a deeper understanding of how changes in the genome propagate to phenotypic variability by predicting which cellular mechanisms are likely to be impacted. For the past several years I have been developing and using computational methodologies to assess the mechanistic impact of natural and disease-causing mutations. I have been applying these to yeast, human and bacteria models in hopes of streamlining hypothesis-driven variant annotation. I have also been utilising these predictions to assess the overall burden these mutations impose on gene function and putting such information towards conducting gene-phenotype associations.

What are some of the areas of research that excite you right now?
I'm intrigued by novel mutagenesis technologies that are allowing us to experimentally assess the impact of genetic variants on cellular fitness and function in a massively parallel fashion. Technologies like deep mutational scanning and CRISPR are becoming increasingly common in achieving this and their off-target effects are steadily being reduced.

With such massive amounts of mutagenesis data, I'm also interested in how machine learning methodologies such as deep learning can be applied to learn how mutations collectively impinge on cellular function and ultimately phenotype. This would significantly improve the precision of variant impact predictors and, in my opinion, will have crucial roles in shaping the development of novel and personalised drug therapies.

What sort of things do you like outside of the science?
Whether I'm skiing, hiking, camping or exploring the city, you'll more likely than not find me outdoors. I often partake in sports. During my time in Cambridge, I rowed for my college and was part of the university boxing team.

I have been fascinated by drones for a while and own a DJI Phantom 3, which I often use for aerial filming. I also enjoy landscape and portrait photography, particularly with my 50mm lens. If I still have extra time on my hands, you'll find me implementing silly ideas that come to mind into apps or games. Here are a few I've made: genewords, pubtex, and guess the correlation.

Monday, June 26, 2017

Building rockets in academia - big goals from individual projects

SpaceX just launched and landed another two rockets over the weekend. I don’t get tired of watching those images of re-entry and landing. The precision is mesmerizing and extremely inspiring. Leading a research group in academia I often look at research intensive companies and wonder about the differences and similarities between how research is done in both. I have never worked in such a company environment so these thoughts are certainly from the perspective of academia.

The big goals and peripheral bets

From reading about big tech companies and start-ups I can relate to how they appear to organize their product portfolio into a small number of main goals – their core product(s) – while at the same time experimenting with peripheral goals/products. Tesla started as a car company but may end up being a large battery company with a small side of car manufacturing. As another example, most major tech companies are today experimenting with virtual reality. In these experiments, those involved face similar questions about uncertain outcomes and timeliness of their steps as we do in academia. One of the thrills in academia is that leap into the unknown where it is crucial to ask the right question just at the right time. The speed of progress in research can be very uneven with times spent floundering in the dark and times where you just happen to walk in the right direction and find big riches. Sometimes those explorations will lead you to unintended directions, away from your core research, where it might be worth moving additional resources. Aiming in the right direction at the right time is a rare skill that a researcher must have but that we don’t spend enough time training for. Also, the balance between focusing on the core and exploring other areas of interest is difficult to set. In academia it seems easier to obtain funding to keep working on your core than to move to new areas. I wonder how companies deal with these issues. I am extremely thankful to be working in a research institute where I get core funding that, although I have to justify, I get to use to explore ideas outside the core of what we do. Such flexibility could be a bigger part of how research funding gets distributed.

Individualized contributions to group goals

While setting a big goal and exploring peripheral objectives might have a lot in common between academia and companies, there is one aspect of how we work that appears very different. In setting the big overarching questions we have to accommodate the fact that each individual group member will have to stand out. PhD students are working on their theses and postdocs are building the work on which they will stand as future group leaders. Each project has to brilliantly stand on its own while simultaneously fitting together with other group projects, contributing to an even greater goal. As each research project can be an unpredictable grasp in the dark, as a group leader I feel like I have to build an alluring house of cards. This means projecting how several research projects might move forward and creating an illusory image of how they fit together to solve THE big question. Not only will we build the rocket that will save mankind but every single contribution from each team member has to solve an important problem. It is obvious that the overarching goal will have to shift with time as some projects move to their potential unintended outcomes. In the context of being flexible to follow peripheral bets, maintaining the big picture goal may be challenging. I would not be the first to propose more career tracks in academia where professional researchers don’t have to move into management roles to keep working in academic science. It would be interesting to try it out in some research institutions to see the effect it would have on how research agendas would be organized.

Friday, April 28, 2017

Postdoc positions on context dependent cell signalling (wet and/or dry)

Why do some mutations cause cancer in some tissues and not others ? What happens to the cell signalling pathways during differentiation ? Why are some genes essential in some cell types and not others or why are some drugs more effective at killing some cell types than others ?

We think that this is a great time to be asking these questions of how the genetic background or tissue of origin changes cell states. More precisely for us, how this re-wires cell signalling. It has become routine to measure changes in phosphorylation across different conditions, including different cancer types. The Sanger and others are establishing panels of human cell lines that are being profiled with an increasing array of omics technologies with drug sensitivity and CRISPR based gene essentiality information. These panels offer a great opportunity to address these questions.

We want to combine the work we have been doing in studying human signalling with phosphoproteomic data, with variant effect predictors, microscopy based studies of cell signalling and network modelling to address this question of context dependent changes in cell signalling.

To support this research we have 2 postdoc positions available: one would be primarily computational and would involve image analysis and network modelling in collaboration with microscopy groups (see here for project and application); the second would be primarily experimental with a focus on microscopy. The latter would be available via the ESPOD fellowship scheme in collaboration with Leopold Parts group at Sanger (see here for project description and here to apply). The split between computational and experimental is open and wet/dry mixed candidates are encouraged as well to apply to both.

These projects complement existing work in the group using cancer Omics data to study the genetic determinants of changes in protein abundance and phosphorylation and will be in collaboration with work developed by the Petsalaki group at EBI that is also recruiting. Email me if you have any questions/concerns about the positions.

Monday, April 10, 2017

17 years of systems biology

I know that 17 years is not a very round number. It is also fairly arbitrary as I am assuming systems biology started around 2000 (see below). I was last week in Portugal, where every year for the past 8 years I have been teaching a week long course on Systems and Synthetic Biology to the GABBA PhD program. This might be the last year I take part in this course and so I felt it would be a good time to try to put some thoughts in a blog post. This course has been jointly co-organised from the beginning with Silvia Santos and we had several guests throughout the years including Mol Sys Bio editors Thomas Lemberger and Maria Polychronidou and other PIs: Julio Saez-Rodriguez, Andre Brown, Hyun Youk and Paulo Aguiar. Some of what I write below has been certainly influenced by discussions with them. This is not meant as an extensive review so apologies in advance for missing references.

Where did systems biology come from?

It is not contentious to say that systems biology came about in response to the ever narrower view of reductionist approaches in biology. Reductionism is still extremely important and I assume that, as a movement, it was an opposition to the idea that biology was animated by some magical force that could never be comprehended. Since the beginning of the course we have asked students to read the essay “Can a biologist fix a radio?” by Yuri Lazebnik (2002). The article captures well the limitations of reductionist research. The more we know about a system, apoptosis in Yuri's case, the more complex and non-intuitive some observations may seem. Yuri's description of how a biologist would try to understand how a radio works is comical and still very apt today:

We would “remove components one at a time or to use a variation of the method, in which a radio is shot at a close range with metal particles. In the latter case radios that malfunction (have a “phenotype”) are selected to identify the component whose damage causes the phenotype. Although removing some components will have only an attenuating effect, a lucky postdoc will accidentally find a wire whose deficiency will stop the music completely. The jubilant fellow will name the wire Serendipitously Recovered Component (Src) and then find that Src is required because it is the only link between a long extendable object and the rest of the radio.”

One of the driving forces for the advent of systems biology was this limitation, so brilliantly captured by Yuri, that reductionism can fail when we are overwhelmed with large systems of interconnected components.

Around the time that Yuri wrote this article our capacity to make measurements of biological objects was undergoing a revolution we generally call omics today. In 2001 the first drafts of the human genome were published (Lander et al. 2001; Venter et al. 2001). Between 2000 and 2002 we had the first descriptions of large scale protein-protein (Uetz et al. 2000; Ito et al. 2001; Gavin et al. 2002) and genetic interactions mapping (Tong et al. 2001). The capacity to systematically measure all aspects of biology appeared to be within our grasp. The interaction network representation of nodes connected by edges is now an icon in biology, even if not as recognisable as the double helix. This ever increasing capacity to systematically measure biology was, alongside the complexity of highly connected components, the second major driving force for the advent of systems biology.

What is systems biology?

So around 2000 biology was faced with this upcoming flood of data and highly complex nonlinear systems. Reductionism was failing because mental models were insufficient to cope with the information available. The reaction was a call for increased formalism, better ways to see how the sum of the parts really works. Perspectives were written (Kitano 2002) and institutes were born (Institute for Systems Biology). Within the apparent complexity of biology there might be emergent principles that we were not seeing simply because we were looking too narrowly and could not combine information in a formal way. Whatever the system of interest (e.g. proteins, cells, organisms, ecosystems) there must be ways to take information from one level of abstraction (e.g. proteins) and understand the relevant system's features of the abstraction layer above it (e.g. cell behaviours). This comes closest to a definition of systems biology put forward by Tony Hyman (Hyman 2011) but many others have defined it in vaguely similar ways, or maybe in similarly vague ways.

Power laws and the perils of searching for universal principles

When introducing systems biology I have been giving two examples of work that illustrate some of the benefits (network motifs) but also some of the perils (power law networks) of trying to find universal principles in biology. One of these examples was the research on the organisation of biological networks. As soon as different networks were starting to be assembled, such as protein-protein, genetic and metabolic networks, an observation was made that the distribution of interactions per gene/protein does not look like that of a random graph (as studied by Paul Erdős). Most proteins have very few interactions while some rare proteins have a disproportionately large number of interactions – dubbed “hubs”. Barabasi and many others had a series of papers describing these non-random distributions, called power-law networks (Jeong et al. 2000), in all sorts of biological networks. Analogies were drawn to other non-biological networks with similar properties and it is not an overstatement to say that there was some hype around this. The hope was that by thinking of the common processes that can give rise to such networks (e.g. preferential attachment) we would know, in some deep way, how biology is organised. I will just say that I don’t think this went very far. Modelling biological networks as nodes and edges allowed the application of graph theory approaches to biology, which has indeed been a very useful inheritance from this work. However, we didn't find deep meaning in the analogies drawn between the different biological and man-made networks, although I am sure some will disagree.
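To make the contrast between hubs and random graphs concrete, here is a toy simulation. It is a sketch under my own assumptions (one link per new node, 2000 nodes, a fixed random seed, made-up function names) and not a reanalysis of any of the published networks. It grows one network by preferential attachment and compares it to a network with the same number of edges placed between randomly chosen pairs of nodes.

```python
import random
from collections import Counter

def random_graph_degrees(n, m, rng):
    """Degrees when m edges are placed between uniformly chosen pairs of nodes
    (repeated edges are allowed, to keep the sketch short)."""
    degree = Counter()
    for _ in range(m):
        a, b = rng.sample(range(n), 2)
        degree[a] += 1
        degree[b] += 1
    return [degree[i] for i in range(n)]

def preferential_attachment_degrees(n, rng):
    """Degrees when each new node links to one existing node chosen with
    probability proportional to that node's current degree."""
    endpoints = [0, 1]  # start from a single edge; each edge lists both endpoints
    for new_node in range(2, n):
        target = rng.choice(endpoints)  # picking an endpoint is degree-proportional
        endpoints.extend([new_node, target])
    degree = Counter(endpoints)
    return [degree[i] for i in range(n)]

rng = random.Random(0)
n = 2000
pa = preferential_attachment_degrees(n, rng)
er = random_graph_degrees(n, n - 1, rng)  # same number of edges as the grown network

print("largest degree, preferential attachment:", max(pa))
print("largest degree, random placement:", max(er))
```

Both networks have the same average degree, but the grown network ends up with a few nodes that have far more links than any node in the random one, which is the "hub" pattern described above.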

Network motifs, buzzers and blinkers

Around the same time, the group of Uri Alon published very influential work describing recurring network motifs in directed networks (Milo et al. 2002; Shen-Orr et al. 2002). For example, in the E. coli transcriptional network they found some regulatory relationships between 3 different genes/operons that occurred more often than expected by chance. One example, illustrated to the right, was named a coherent feedforward loop where an activating signal was sent from an “upstream” element X to a “downstream” element Z both directly and indirectly via an intermediate third element. The observation raises the question of the usefulness of such an arrangement (Mangan and Alon 2003; Kalir et al. 2005). This has been generalised to studying the relation between any set of such directed interactions with specific reaction parameters – defined as the topology – and their potential functions. In a great review Tyson, Chen and Novak summarise some of these ideas of how regulatory networks can act, among other things, as “sniffers, buzzers, toggles and blinkers” (Tyson et al. 2003).
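Finding the feedforward loop pattern in a directed network is simple enough to sketch. The snippet below is my own minimal illustration with a made-up four node network and a hypothetical function name. It only lists the occurrences: it ignores whether edges are activating or repressing (so it cannot tell coherent from incoherent loops) and it does not do the comparison against randomised networks that is needed to call something a motif.

```python
from itertools import permutations

def feedforward_loops(edges):
    """Return every (X, Y, Z) such that X->Y, Y->Z and X->Z are all present."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    return [
        (x, y, z)
        for x, y, z in permutations(nodes, 3)
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set
    ]

# Hypothetical toy network: X regulates Z directly and also via Y.
edges = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")]
loops = feedforward_loops(edges)
print(loops)
```

Checking every ordered triple is fine for a toy example but scales with the cube of the number of nodes, so a real network would call for something smarter.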

These and other similar works showed that, within the complexity of regulatory networks, design principles can be found that encapsulate the core relationships giving rise to a behaviour. Once these rules are known, an observed behaviour will constrain the possible space of topologies that can explain it. This has led researchers to search for missing regulatory interactions that are needed to satisfy such expected constraints. For example, Holt and colleagues searched for a positive feedback that would be expected to exist for the switch-like dissolution of the sister-chromatid cohesion at the start of anaphase (Holt et al. 2008). This mapping between regulatory networks and their function can be applied to any system of interest and at any scale. The same types of regulatory interactions are used for termites building spatially organised mounds and for growing neurons seeking to form connections (as illustrated in a review by Dehmelt and Bastiaens). Different communities of scientists can come together in systems biology meetings and talk in the same language of design principles. This elegance of finding “universal” rules that seemingly explain complex behaviours across different systems and disciplines has been a great gift of systems biology. It is of course important to point out that such ideas have a much longer history from homoeostasis in biology and control theory in engineering.

Bottom-up network models

Alongside the search for design principles in regulatory interactions, the formal mathematical and computational modelling of biological systems gained prominence (e.g. Bhalla and Iyengar 1999). Mathematical models are much older than systems biology but they started to be used more extensively and visibly with the rise of systems biology. Formalising all of the past knowledge of a system was shown to be a useful way to test if what is known was sufficient to explain the behaviour of the system. Models were also perturbed in silico to find the most relevant parameters and generate novel hypotheses to be tested experimentally. This model refinement cycle has been used with success, for example, in the modelling of the cell cycle (Novak and Tyson 1993; Tyson and Novak 2001; among many others) or the circadian clock (Locke et al. 2005; Locke et al. 2010; Pokhilko et al. 2012). However, this iteration between formal modelling and experiments has not really taken off across many other systems. The reason for the lack of excitement is not clear to me, although I have the impression that often the models are not used extensively beyond asking if what we know about a system sufficiently explains all of the observed outcomes and perturbations.

Top-down systems biology and everything in between

From the start there has been a division between the researchers that identified themselves as part of the systems biology community. Bottom-up researchers have been focused on the formal modelling of systems, the discovery of design principles and emerging behaviours. Top-down researchers would argue that a truly comprehensive view of a system is needed. These scientists have been more focused on further developing and applying methods to systematically measure biological systems. The emphasis in this camp has been on developing generalizable strategies that can take large-scale observations and identify rules, regardless of the system of interest. I would say that these works, my own included, have been less powerful in identifying elegant universal rules. By this I mean, for example, those initial attempts to find common principles across biological and man-made networks. Instead of principles, what have been readily transposed across systems have been approaches such as machine learning methods. Drug screens with behavioural phenotypes, genetic interaction networks or developmental defect screens with gene knock-downs can all be analysed in the same ways. Such systematic studies have driven costs down (per observation) and, in contrast to “representative” experiments in small-scale studies, the large-scale measurements tend to be properly benchmarked for accuracy and coverage.

What is still missing are ways to bridge the divide between these two camps: ways to start from large-scale measurements and arrive at models that can be studied for design features. Studies that include perturbation experiments come closer to achieving this. Examples of network reconstruction methods have shown that it should be possible to achieve this but we are not quite there yet (Hill et al. 2016).

From systems biology to systems everything

As a scientific movement, systems biology started in cell biology, as far as I can tell, but has since then permeated many other areas of research. As examples, I have heard of systems genetics, systems neuroscience, systems medicine, evolutionary systems biology and systems structural biology. In 2017 we still face a flood of data and highly complex nonlinear systems. However, the reductionist approaches now typically go hand-in-hand with attempts to formalise knowledge in quantitative ways to identify the key relationships that explain the function of interest. In a sense, the movement of systems biology has succeeded to such an extent that it seems less exciting to me as a field in itself. It is a fantastic approach that is currently being used across most of biology but there are fewer developments that alter how we do science. I am curious as to what other researchers that identify themselves with doing systems biology think: What have been the great achievements of systems biology? What are the great challenges that are not simply applications of systems biology? Questions to think about for the (equally arbitrary) celebration of the 20 years of the field in 2020.

Friday, February 10, 2017

Predicting E3 or protease targets with paired protein & gene expression data (negative result)

Cancer datasets as a resource to study cell biology

The amazing resources that have been developed in the context of cancer biology can serve as tools to study "normal" cell biology. The genetic perturbations that happen in cancer can be viewed almost as natural experiments that we can use to ask varied questions. Different cancer consortia have produced, for the same patient samples or the same cancer cell lines, data that ranges from genomic information, such as exome sequencing, to molecular, cellular and disease traits including gene expression, protein abundance, patient survival and drug responses. These datasets are not just useful to study cancer biology but more globally to study cell biology processes. If we were interested in asking what is the impact of knocking out a gene we could look into these data to have, at least, an approximate guess of what could happen if this gene is perturbed. We can do this because it is likely that almost any given gene will have changes in copy number or deleterious mutations given a sufficiently large sample of tumours or cell lines. Of course, there will be a whole range of technical issues to deal with since it would not be a "clean" experiment comparing the KO with a control.

Studying complex assembly using protein abundance data

More recently the CPTAC consortium and other groups have released proteomics measurements for some of the reference cancer samples. Given the work that we have been doing in studying post-translational control, we started a few projects making use of these data. One idea that we tried, and have recently made available online via a pre-print, was to study gene dosage compensation. When there are copy number changes, how often are these propagated to changes in gene expression and then to protein level? This was work done by Emanuel Gonçalves (@emanuelvgo), jointly with the Julio Saez-Rodriguez lab. There were several interesting findings from this project. One of these was that we could identify members of protein complexes that indirectly control the degradation of other complex subunits. This was done by measuring, in each sample, how much of the change in a protein's abundance is not explained by its gene expression change. This residual abundance change is most likely explained either by changes in the translation or degradation rate of the protein (or noise). We think that, for protein complex subunits, this residual mainly reflects degradation rates. Emanuel then searched for complex members that had copy number changes that predicted the "degradation" rate of other subunits of the same complex. We think this is a very robust way to identify such subunits that act as rate-limiting factors for complex assembly.
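As a rough sketch of the residual idea, and not the actual code from the pre-print, one could regress the protein abundance of a subunit on its gene expression across samples and keep what is left over. The example below assumes a simple linear fit and a Pearson correlation as the measure of association, both of which are my simplifications here. All of the numbers and variable names are invented for illustration.

```python
from math import sqrt
from statistics import mean


def linear_residuals(x, y):
    """Residuals of y after an ordinary least-squares fit y ~ a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented toy values across six samples (log fold-changes), illustration only.
subunit_mrna = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
subunit_protein = [-0.9, -0.1, -0.3, 0.6, 0.4, 1.2]
regulator_copy_number = [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0]

# Protein abundance change not explained by gene expression change.
residual_abundance = linear_residuals(subunit_mrna, subunit_protein)

# Does the copy number of another subunit track this residual?
association = pearson(regulator_copy_number, residual_abundance)
```

In a real analysis one would also need to deal with covariates, missing values and multiple testing, none of which are shown here.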

Predicting E3 or protease targets

If what I described above works to find some subunits that control the "degradation" of other subunits of a complex, then why not use the exact same approach to find the targets of E3 ligases or proteases? Emanuel gave this idea a try but in some (fairly quick) tests we could not see a strong predictive signal. We collected putative E3 targets from a few studies in the literature (Kim et al. Mol Cell Biol. 2015; Burande et al. Mol Cell Proteomics. 2009; Lee et al. J Biol Chem. 2011; Coyaud et al. Mol Cell Proteomics. 2015; Emanuele MJ et al. Cell 2011). We also collected protease targets from the MEROPS database. We then tried to find a significant association between the copy number or gene expression changes of a given E3 and the proxy for degradation, as described above, of any other protein. Using the significance of the association as the predictor, we would expect a stronger association between an E3 and its putative substrates than with other random genes. Using a ROC curve as the descriptor of the predictive power, we didn't really see robust signals. The figure above shows the results when using gene expression changes in the E3 to associate with the residuals (i.e. abundance change not explained by gene expression change) of the putative targets. The best result in this case was obtained for CUL4A (AUC=0.59), but overall the predictions are close to random.
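For readers less familiar with the AUC, here is a minimal sketch of how it can be computed from a list of association scores and a list of known targets. It uses the fact that the AUC equals the probability that a randomly chosen known target scores higher than a randomly chosen non-target, with ties counted as half. The scores and labels below are invented and are not the values behind the figure.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: higher means a stronger association (e.g. -log10 of a p-value).
    labels: True for putative substrates, False for all other proteins.
    """
    positives = [s for s, is_target in zip(scores, labels) if is_target]
    negatives = [s for s, is_target in zip(scores, labels) if not is_target]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Invented example: four proteins scored against one hypothetical E3.
toy_scores = [3.0, 0.5, 2.0, 1.0]
toy_labels = [True, False, False, True]
toy_auc = roc_auc(toy_scores, toy_labels)  # 3 of 4 pairs ranked correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking, which is why a best value of 0.59 reads as a weak signal.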

A similarly poor result was generally observed for protease targets from the MEROPS database, although we didn't really make a strong effort to properly map the MEROPS interactions to all human proteins. Emanuel tried a couple of variations. For the E3s he tried restricting the potential target list to proteins that are known to be ubiquitylated in human cells but that did not improve the results. Also, surprisingly, the genes listed as putative targets of these E3s are not very enriched in genes that increase in ubiquitylation after proteasome inhibition (from Kim et al. Mol Cell. 2011), with the clearest signal observed in the E3 targets proposed by Emanuele MJ and colleagues (Emanuele MJ et al. Cell 2011).

Why doesn't it work?

There are many possible reasons for the lack of capacity to predict E3/protease targets in this way. The residuals that we calculate across samples may reflect a mixture of effects and degradation may be only a small component. The regulation of degradation is complex and, as we have shown for the complex members, it may be dependent on other factors besides the availability of the E3s/proteases. It is possible that the E3s/proteases are highly regulated and/or redundant such that we would not expect to see a simple relationship between changing the expression of one E3/protease and the abundance level of the putative substrate. The list of E3/protease targets may contain false positives and, of course, we may not have found the best way to find such associations in these data. In any case, we thought it could be useful to provide this information in some format for others that may be trying similar things.