Cellular Consequences of Genetic variation: phosphorylation

Thursday, October 13, 2016

Phylogenetic history of fungal protein phosphorylation – the anti-press release

I have long been interested in studying the rate at which protein interactions change during evolution. A new chapter in this ongoing research agenda has been published this week (article & perspective) in collaboration with the group of Judit Villén at the University of Washington and with many contributions from the labs of Maitreya J. Dunham, Eulàlia de Nadal and Francesc Posas. For the first time I tried to engage with the press by putting out a press-release and it was interesting to work with Mary Todd Bergman at EMBL-EBI to digest the work to its core message. However, to atone for my sins of not being able to give sufficient context and credit to the work that has come before this, I decided that I could use this blog to write a sort of anti-press release. Grab some coffee, get comfy and don’t expect a punchy fast message here because this manuscript has a long and branched root.

Cue flashback …

For me, this started 15 years ago (gasp, I can't believe it has been this long) when Andreas Wagner published some work trying to measure the conservation of protein interactions after gene duplication. This in turn was made possible by the first protein interaction mapping efforts. In my PhD lab I was using conservation to predict interactions for SH3 domains that bind short linear proline rich peptides. Influenced by Andreas Wagner’s papers, “linear-motif” research at EMBL and the field of evolution of gene expression, I hypothesized that domain-peptide interactions could be poorly conserved since they are mediated only by a few residues in a linear unstructured peptide. This idea was first reported in the literature in a perspective by Neduva and Russell, also at the EMBL at the time. I tried to generalize the concept that specificity and evolvability could be related such that very unspecific interactions may be more prone to change during evolution (article, blog post). Other groups have also shown that linear motif interactions can be fast evolving (e.g. Chica et al., Edwards et al.).

Mass spectrometry to the rescue

The problem with trying to compare protein interactions is that you need to measure them first. The domain-peptide interactions mediated by linear motifs are particularly hard to identify because they are usually of low affinity. So, the work described above was based on predicted interface sites for linear motifs. At this point, improvements in mass spectrometry and enrichment strategies really made a difference. The identification of protein phosphorylation sites made it possible to find, at large scale, thousands of sites that represent high-confidence interaction sites. The back-story that resulted in these developments in MS is a story our collaborator Judit Villén has been a part of and that I can’t tell as well.

Kinase-target interactions are also linear motif interactions and if the previous linear motif research was correct the phosphosites that represent these interactions should be rapidly evolving. That was exactly what I ended up testing when I started my postdoc. We were just one of several groups working on it and in 2009 several papers got published on the topic including our work (Beltrao et al., blog post) and others (Landry et al., Tan et al., Holt et al., Amoutzias et al.). All of these together made a really strong case for the fast divergence of protein phosphorylation, although other articles followed to also note the constraints (Nguyen Ba & Moses and Gray & Kumar). At this point the conversation was also shifting to the consequences of these evolutionary changes. Mirroring similar discussions around the consequences of changes in gene-expression there was a sense that some of these phosphosites, and therefore kinase-substrate interactions, do not play a functional role (Gustav E. Lienhard 2008, Landry et al. 2009). I also tried to contribute to the debate on functional relevance by trying to assign functions to PTM sites computationally and extending the conservation analysis to other PTMs (Beltrao et al. 2013, blog post).

What was left to find then?

Most of the studies mentioned so far have relied on pairwise species comparisons. What we tried to do in this more recent study was to obtain a phylogenetic history of protein phosphorylation across a very broad phylogeny. For this, Judit’s lab obtained phosphorylation data for 18 fungal species that shared a common ancestor hundreds of millions of years ago. Romain Studer in our group then tried to combine the phosphorylation observations, which are known to be incomplete, with sequence based predictions of phosphorylation potential and the species phylogenetic tree. This allowed us to predict a likely evolutionary history for thousands of phosphosites.
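To make the idea of placing a phosphosite on a species tree concrete, here is a minimal sketch. It is not the method used in the study: it ignores incomplete observations and sequence-based predictions, and simply assumes a site arose once, in the most recent common ancestor of the species where it was observed. The tree, the species names and the function names are all hypothetical.

```python
# Toy species tree as nested tuples; leaves are species names (all hypothetical).
TREE = ((("sp_A", "sp_B"), "sp_C"), "sp_D")

def leaves(node):
    """Return the set of leaf names under a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result

def mrca(node, species):
    """Return the smallest subtree that contains every species in `species`."""
    if isinstance(node, str):
        return node
    for child in node:
        if species <= leaves(child):
            return mrca(child, species)
    return node

def site_origin(tree, phospho_species):
    """Leaf set of the clade in which the site is assumed to have arisen."""
    return leaves(mrca(tree, set(phospho_species)))

# A site seen only in two sister species maps to a recent ancestor,
# while a site seen in distant species maps to the root of the tree.
recent = site_origin(TREE, ["sp_A", "sp_B"])
ancient = site_origin(TREE, ["sp_A", "sp_D"])
```

The larger the clade returned, the older the inferred origin of the site under this single-gain assumption.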

If you happen to have kept up with the literature that I mentioned above then you might expect some of the findings we observed next - most phosphosites are recent acquisitions and the small fraction of ancient phosphosites is enriched in functionally relevant sites. From the ancient sites we tested a few cases for fitness and functional consequences and we think these serve as a great resource for future cell signaling studies (and yes we are chasing that). Given the breadth of species we studied we could also measure the changes in phosphorylation “motifs” that are found across species. Kinases recognize their target sites, in part, by the sequences around the phospho-acceptor residue, so-called kinase target motifs. We could observe that the types of target motifs used across species showed changes that we think relate to changes in the types of kinases or their activities. We are now interested in better understanding what determines kinase specificity so that we can study their evolution - what did the first protein kinase look like?

So who the hell cares?

Many of the methods we are working on are useful to better understand the impact of mutations related to these signaling circuits in cancer or other diseases. We are working on this too but I care about this because I want to know how nature comes up with all these beautiful diverse mechanisms and forms. Coming up with a history of how these phosphosites have been changing across species is really just the first step. We have almost no clue as to what the thousands of observed phosphosites are doing, if anything. Are the signaling pathways changing in a neutral way that conserves the functional outcomes?

On a personal note it is fantastic to be able to connect this work to things that I did all the way back to my first PhD paper, and to connect this blog post to a chain of several other blog posts covering the research I have done and that our research group is doing now.

Friday, November 27, 2015

Predicting PTM specificities from MS data and interaction networks

Around four years ago I wrote this blog post where I suggested that it might be possible to combine protein interaction data with phosphosites from mass-spectrometry (MS) data to infer the specificity of protein kinases. I did a very simple pilot test and invited others to contribute to the idea. Nobody really picked up on it until Omar Wagih, a PhD student in the group, decided to test the limits of the approach. To his credit I didn't even ask him to do it; his main project was supposed to be on individual genomics. I am glad that he deviated long enough to get some interesting results that have now been published.

As I described four years ago, the main inspiration for this project was the work of Neduva and colleagues. They showed that motif enrichment applied to the interaction partners of peptide binding domains can reveal the binding specificity of the domain. One step of their method was to filter out regions of proteins that were unlikely to be target sequences before doing motif identification. For PTM enzymes or binding domains we should be able to take advantage of the MS derived PTM data to select the peptides for motif identification by just taking the peptide sequences around the PTM sites. This was exactly what Omar set out to do by focusing on human kinases as a test case.

To summarize the outcome of this project, the method works with some limitations. For around a third of human kinases that could be benchmarked he got very good predictions (AUC>0.7). For some kinase families the predictions are better than others and we think this is due to how specific the kinase is for the residues around the target site. It is known that kinases find their targets via multiple mechanisms (e.g. docking sites, shared interactions, co-localization, etc). This specificity prediction approach will work better for kinases that find their targets mostly by recognizing amino-acids near the phosphosite. With the help of Naoyuki Sugiyama in Yasushi Ishihama's lab we validated the specificity predictions for 4 understudied human kinases. One advantage of using this approach is that it could be very general. Omar tried it also on 14-3-3 domains, that bind phosphosites, and also on a bromodomain containing protein that is known to bind acetylated peptides. Finally, we also tried to use this to compare kinase specificity between human and mouse but given the current limitations of the method I don't think it is possible to use these predictions alone to find divergent cases of specificity.

The predictions for human kinase specificity can be found here and a tutorial on how to repeat these predictions is here. The motif enrichment was done using the motif-x algorithm. Given that we could not really use the web version Omar implemented the algorithm in R and a package is available here.

There are many other ways to predict specificities for PTM enzymes and binding domains. If you have many known target sites the best way is to train a predictor such as Netphorest or GPS. There is also the possibility of using the known target sites in conjunction with structural data to infer rules about specificity and the specificity determining residues. A great example of this is Predikin and more recently KINspect. Ongoing work in the group now aims to combine what Omar did with some aspects of Predikin to study the evolution of kinase specificity.

Going back to the beginning of the post, this idea was my second attempt at an open science project. The first attempt was a project on the evolution and function of protein phosphorylation (described here). This ended up being one of the main projects of my postdoc and now the main focus of the group. I am still curious to know if distributed open science projects will ever take off. I don't mean big project consortia but smaller scale research where several people could easily contribute with their expertise almost as "spare cycles". Often when you are an expert in some analysis or method you could easily add a contribution with little effort. However, there was much more excitement about open science a few years ago whereas now most of the discussions have shifted to pre-prints and doing away with the traditional publishing system. Maybe we just don't have time to pay attention or to contribute to such open projects.

Thursday, July 19, 2012

Evolution and Function of Post-translational Modifications

A significant portion of my postdoctoral work is finally out in the last issue of Cell (link to paper). In this study we have tried to assign a function to post-translational modifications (PTMs) that are derived from mass-spectrometry (MS). This follows directly from previous work where we looked at the evolution of phosphorylation in three fungal species (paper, blog post). We (and other groups) have seen that phosphorylation sites diverge rapidly but we don't really know if this divergence of phosphosites results in meaningful functional consequences. In order to address this we need to know the function of post-translational modifications (if they have any). Since these MS studies now routinely report several thousand PTMs per analysis we have a severe bottleneck in the functional analysis of PTMs. These issues are the motivations for this last work. We collected previously published PTMs (close to 200,000) and obtained some novel ubiquitylation sites for S. cerevisiae (in collaboration with Judit Villen's lab). We revisited the evolutionary analysis and we set up a couple of methods to prioritize those modifications that we think are more likely to be functionally important.

As an example, we have tried to assign function to PTMs by annotating those that likely occur at interface residues. One approach that turned out to be useful was to look for conservation of the modification sites within PFAM domain families. For example, in the figure above and under "Regulation of domain activity", I am depicting a kinase domain. Over 50% of the phosphorylation sites that we find in the kinase domain family occur in the well known activation loop (arrow), suggesting that this is an important regulatory region. We already know that the activation loop is an important regulatory region but we think that this conservation approach will be useful to study the regulation of many other domains. In the article we give other examples and an experimental validation using the HSP70 domain family (in collaboration with the Frydman lab).

I won't describe in detail the work as you can (hopefully) read the paper. Leave a comment or send me an email if you can't and/or if you have any questions regarding the paper or analysis. I also put up the predictions in a database (PTMfunc) for those who want to look at specific proteins. It is still very alpha, I apologize for the bugs and I will try to improve it as quickly as possible. If you want access to the underlying data just ask and I'll send the files. I am also very keen on collaborations with anyone collecting MS data or interested in the post-translational regulation of specific proteins, complexes or domain families.

Blogging and open science
Having a blog means I can also give you some of the thoughts that don't fit in a paper or press release. You can stop reading if you came for the sciency bits. One of the cool things I realized was that I have discussed in this blog three papers in the same research line, that run through my PhD and postdoc. It is fun to be able to go back not just to the papers but to the way I was thinking about these ideas at the time. Unfortunately, although I try to use this blog to promote open science this project was yet-another-failed open science project. Failed in the sense that it started with a blog post and a lot of ambition but never gained any momentum as an online collaboration. Eventually I stopped trying to push it online and as experimental collaborators joined the project I gave up on the open science side of it. I guess I will keep trying whenever it makes sense. This post closes project 1 (P1) but if you are interested in online collaborations have a look at project 2 (P2).

Publishable units and postdoc blues
This work took most of my attention during the past two years and it is probably the longest project I have worked on. Two years is not particularly long but it has certainly made me think about what is an acceptable publishable unit. As I described in the last blog post, this concept is very hard to define. While we probably all agree that a factoid in a tweet is not something I should put on my CV, we allow and even cheer for publishing outlets that accept very incremental papers. The work I described above could have easily been sliced into smaller chunks but would it have the same value? We would have put out the main ideas much faster but it could have been impossible to convince someone to test them. I feel that the combination of the different analyses and experiments has more value as a single story but an incremental approach would have been more transparent. Maybe the ideal situation would be to have the increments online in blogs, wikis and repositories and collect them in stories for publication. Maybe, just maybe, these thoughts are the consequence of postdoc blues. As I was trying to finish and publish this project I was also jumping through the academic track hoops but I will leave that for a separate post.

Wednesday, May 25, 2011

Predicting kinase specificity from phosphorylation data

Over the past few years, improvements in mass-spectrometry methods have resulted in a big increase in throughput for the identification of post-translational modifications (PTMs). It is even hard to keep up with all the phosphoproteomics papers and the accumulation of phosphorylation data. Most often, improvements in methods result in interesting challenges and opportunities. In this case, how can we make use of this explosion in PTM data? I will try to explore a fairly straightforward idea on how to use phosphorylation data to predict kinase substrate specificity. I'll describe here the general idea and just the first stab at it to show that I think it can work.

The inspiration for this is the work by Neduva and colleagues, who have shown that we can search for enriched motifs within proteins that interact with the domain of interest. For example, we can take a protein containing an SH3 domain, find all of its interaction partners and we will likely see that they are enriched for proline rich motifs of the type PXXP (X = any amino-acid), which is the known binding preference for this domain. So the very obvious application to kinases would be to take the interaction partners of a kinase and find enriched peptide motifs. The advantage of looking at kinases, over any other type of peptide binding domains, is that we can focus specifically on phosphosites.

As a test case I picked the S. cerevisiae Cdc28p (Cdk1) that is known to phosphorylate the motif [ST]PXK. I used the STRING database to identify proteins that functionally interact with Cdc28 with a cut-off of 0.9 and retrieved all currently known phosphosites within these proteins. As a quick check I used Motif-X to search for enriched motifs. The first try was somewhat disappointing but after removing phosphosites that had fewer than 5 MS spectra and/or experiments supporting them I got back this logo as the most enriched motif:

This was probably the easiest kinase to try since it is known that it typically phosphorylates its targets at multiple sites and it is heavily studied. Still, I think there is a lot of room for exploration here. If anyone is interested in collaborating on this let me know. If you're doing computational work I would be interested in some code/tools for motif enrichment. If you're doing experimental work let me know about your favorite kinases/species.
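As a rough illustration of the Cdc28 test case described above, the sketch below extracts sequence windows around phosphosites and asks whether a single candidate pattern, here [ST]PXK written as a regular expression, matches more often than expected. This is a plain one-shot binomial test on a fixed pattern, not the iterative Motif-X procedure, and the toy sequence, the site positions and the background rate of 0.1 are all made up for illustration.

```python
import re
from math import comb

def phospho_windows(sequence, sites, flank=4):
    """Return peptides centred on each 0-based phosphosite, padded with '-'."""
    padded = "-" * flank + sequence + "-" * flank
    return [padded[s:s + 2 * flank + 1] for s in sites]

def count_matches(windows, pattern, flank=4):
    """Count windows where `pattern` matches starting at the central residue."""
    regex = re.compile(pattern)
    return sum(1 for w in windows if regex.match(w, flank))

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Toy protein and 0-based phosphosite positions (hypothetical data).
sequence = "AASPAKGGTPLKAASAAA"
sites = [2, 8, 14]

windows = phospho_windows(sequence, sites)
hits = count_matches(windows, r"[ST]P.K")

# Assumed frequency of the pattern around all S/T sites in the proteome.
background_rate = 0.1
p_value = binomial_tail(hits, len(windows), background_rate)
```

In a real analysis the windows would come from the phosphosites of the kinase's interaction partners and the background rate from all known phosphosites, or all S/T residues, in the same species.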

Thursday, March 03, 2011

Structure based prediction of kinase interactions

About a year ago Ben Turk's lab published a large scale experimental effort to determine the substrate recognition preferences of most yeast kinases (Mok et al. Sci. Signal. 2010). They used a peptide screening approach to analyze 61 of about 122 known S. cerevisiae kinases in order to derive, for each one, a position specific scoring matrix (PSSM) describing their substrate recognition preference. In the figure below I show an example for the Hog1 MAPK where it is clear that this kinase prefers to phosphorylate peptides that have proline next to the S/T that is going to be phosphorylated.

Figure 1 - Example of Hog1 substrate recognition preference derived from peptide screens. Each spot in the array contains a mixture of peptides that are randomized at all positions except at the marked position (-5 to +4 relative to the phosphorylatable residue). Strong signal correlates with a preference for phosphorylating peptides containing that amino-acid at the fixed position.

As was previously known, most kinases don't appear to have very striking substrate binding preferences. Still, these matrices should allow for significant predictions of kinase-site interactions. These matrices should allow us also to benchmark previous efforts by Neil and other members of the Kobe lab on the structural based predictions of kinase substrate recognition. For this, I obtained the predicted substrate recognition matrices from the Predikin server and known kinase-site interactions from the PhosphoGrid database. I used this data to compare the predictive power of the experimentally determined kinase matrices (Mok et al.) with the predicted matrices from Predikin. This analysis was done about a year ago when the Mok et al. paper was published but I don't think Phosphogrid was significantly updated since then.
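For readers unfamiliar with position specific scoring matrices, here is a minimal sketch of how a candidate site can be scored against one. The matrix below is built from a handful of made-up aligned peptides with a uniform background and a pseudocount, which is an assumption for illustration only; the Mok et al. matrices come from peptide array signal and the Predikin ones from structure-based predictions.

```python
from math import log2

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(peptides, alphabet=ALPHABET, pseudocount=1.0):
    """Log-odds matrix (one dict per position) from aligned, equal-length
    peptides, scored against a uniform background."""
    background = 1.0 / len(alphabet)
    pssm = []
    for pos in range(len(peptides[0])):
        column = [p[pos] for p in peptides]
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            aa: log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in alphabet
        })
    return pssm

def score(pssm, peptide):
    """Sum of per-position log-odds; residues outside the alphabet score 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, peptide))

# Hypothetical training peptides, phospho-acceptor at the second position.
training = ["ASPAK", "GTPLK", "LSPEK"]
pssm = build_pssm(training)

good_site = score(pssm, "RSPQK")   # follows the S/T-P-x-K pattern
poor_site = score(pssm, "RSAQA")   # lacks the proline and the lysine
```

Known kinase-site pairs and random pairs can then be scored the same way and the two score distributions compared.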

Phosphogrid had 422 kinase-site interactions for the 61 kinases analyzed in Mok et al., of which ~50% have in-vivo evidence for kinase recognition. As expected, the known kinase-site interactions have a stronger experimental matrix score than random kinase-site assignments (Fig 2).

Figure 2 - The set of kinase-site interactions used, broken down according to the kinases with higher representation. These sites were scored using the experimental matrices along with other randomly selected phosphosites and the scores of both populations are summarized in the boxplots.

A random set of kinase-phosphosite interactions of equal size was used to quantify the predictive power of the experimental and the Predikin matrices with a ROC curve (Fig 3).

Figure 3 - Area under the ROC curve values for kinase-site predictions using both types of matrices.
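The area under the ROC curve used in this comparison can be computed directly from the two sets of scores: it equals the probability that a randomly chosen known kinase-site pair scores higher than a randomly chosen random pair. A minimal sketch, with made-up scores:

```python
def roc_auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative
    (ties count as half a win)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical matrix scores for known and random kinase-site pairs.
known_pairs = [0.9, 0.8, 0.4]
random_pairs = [0.7, 0.3, 0.2]
auc = roc_auc(known_pairs, random_pairs)
```

A value of 0.5 means the matrix does no better than chance and 1.0 means the two sets are perfectly separated. The pairwise loop is quadratic, which is fine for a few hundred sites like the set used here.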

Overall, the accuracy of the predicted matrices from Predikin matched reasonably well with those derived from the peptide array experiments with only a small difference in AROC values. I broke down the predictions for individual kinases with at least 10 sites known. Benchmarking of such low numbers becomes very unreliable but besides the Cka1 kinase, the performance of the Predikin matrices matched reasonably well the experimental results.

I am assuming here that Predikin was not updated with any information from the Mok et al study to derive their predictions. If this is true it would mean that structural based prediction of kinase recognition preferences, as implemented in Predikin, is almost as accurate as preferences derived from peptide library approaches.

Sunday, January 03, 2010

Stitching different web tools to organize a project

A little over a year ago I mentioned a project I was working on about prediction and evolution of E3 ligase targets (aka P1). As I said back then, I am free to risk as much as I want in sharing ongoing results and Nir London just asked me how the project is going via the comments of that blog post so I decided to give a bit of an update.

Essentially, the project quickly deviated from course since I realized that predicting E3 specificity and experimentally determining ubiquitylation sites in fungal species (without having to resort to strain manipulation) were not going to be easy tasks.
So, since the goal was to use these data to study the co-evolution of phosphorylation switches (phosphorylation regulating ubiquitylation) it makes little sense to restrain the analysis specifically to one form of post-translational modification (PTM). After a failed attempt to purify ubiquitylated substrates the goal has been to come up with ways to predict the functional consequences of phosphorylation. We will still need to take ubiquitylation into account but that will be a part of the whole picture.

With this goal in mind we have been collecting, for multiple species, data on phosphorylation as well as other forms of PTMs from databases and the literature and we have been trying to come up with ways to predict the function of these phosphorylation events. These predictions can be broken down mostly into three types:
- phosphorylation regulating domain activity
- phosphorylation regulating domain-domain interactions (globular domain interfaces)
- phosphorylation regulating linear motif interactions (phosphorylation switches in disordered regions)

We have set up a notebook where we will be putting some of the results and ways to access the datasets. Any new experimental data and results from the analysis will be posted with a significant delay both to give us some protection against scooping and also to try to guarantee that we don't push out things that are obviously wrong. This brings us to a disclaimer... all data and analysis in that notebook is to be considered preliminary and not peer reviewed, it probably contains mistakes and can change quickly.

I am currently collaborating with Raik Gruenberg on this project and we are open to collaborators that bring new skills to the project. We are particularly interested in experimentalists working in cell biology and cell signalling that could be interested in testing some of the predictions we are getting out of this study.

I won't talk much (yet) about the results we have so far but instead mention some of the tools we are using or planning to use:
- The notebook of the project hosted in openwetware
- The datasets/files are shared via Dropbox
- If need arises code will be shared via Google Code (currently empty)
- Literature will be shared via a Zotero group library
- The papers and other items can be discussed in a Friendfeed group

This will be all for now. I think we are getting interesting results from this analysis on the evolution of the functional consequences of phosphorylation events but we will update the notebook when we are a bit more confident that we ruled out most of the potential artifacts. I think the hardest part about exposing ongoing projects is having to explain to potential collaborators that we intend to do so. This still scares people away.

I'll end with a pretty picture. This is an image of a homology model for the Tup1-Hhf1 interaction. Highlighted are two residues that are predicted by the model to be in the interface and are phosphorylated in two different fungal species. This exemplifies how the functional consequence of a phosphorylation event can be conserved although the individual phosphorylation sites (apparently) are not.

Tuesday, August 11, 2009

Translationally optimal codons do not appear to significantly associate with phosphorylation sites

I recently read an interesting paper about codon bias at structurally important sites that sent me on a small detour from my usual activities. Tong Zhou, Mason Weems and Claus Wilke described how translationally optimal codons are associated with structurally important sites in proteins, such as the protein core (Zhou et al. MBE 2009). This work is a continuation of the work from this same lab on what constrains protein evolution. I have written here before a short review of the literature on the subject. As a reminder, it was observed that the expression level is the strongest constraint on a protein's rate of change with highly expressed genes coding for proteins that diverge slower than lowly expressed ones (Drummond et al. MBE 2006). It is currently believed that selection against translation errors is the main driving force restricting this rate of change (Drummond et al. PNAS 2005, Drummond et al. Cell 2008). It has been previously shown that translation errors are introduced, on average, at an order of about 1 to 5 per 10000 codons and that different codons can differ in their error rates by 4 to 9 fold, influenced by translational properties like the availability of their tRNAs (Kramer et al. RNA 2007).

Given this background of information, what Zhou and colleagues set out to do was test if codons that are associated with highly expressed genes tend to be over-represented at structurally important sites. The idea being that such codons, defined as "optimal codons", are less error prone and therefore should be favored at positions that, when mistranslated, could destabilize proteins. In this work they defined a measure of codon optimality as the odds ratio of codon usage between highly and lowly expressed genes. Without going into many details they showed, in different ways and for different species, that indeed, codon optimality is correlated with the odds of being at a structurally important site.
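To make the odds ratio definition concrete, here is a minimal sketch of how such a codon optimality score could be computed from two sets of coding sequences. The details are assumptions: this version compares each codon against all other codons (the original measure may well be restricted to synonymous codons), adds a pseudocount to avoid division by zero, and uses made-up sequences.

```python
from collections import Counter

def codon_counts(cds_list):
    """Count in-frame codons over a list of coding sequences."""
    counts = Counter()
    for cds in cds_list:
        for i in range(0, len(cds) - len(cds) % 3, 3):
            counts[cds[i:i + 3]] += 1
    return counts

def codon_odds_ratio(codon, high_counts, low_counts, pseudocount=0.5):
    """Odds of using `codon` in highly expressed genes divided by the
    odds of using it in lowly expressed genes."""
    a = high_counts[codon] + pseudocount
    b = sum(high_counts.values()) - high_counts[codon] + pseudocount
    c = low_counts[codon] + pseudocount
    d = sum(low_counts.values()) - low_counts[codon] + pseudocount
    return (a / b) / (c / d)

# Hypothetical coding sequences for the two expression classes.
high = codon_counts(["GCTGCTGCTGCC"])
low = codon_counts(["GCCGCCGCCGCT"])

gct_score = codon_odds_ratio("GCT", high, low)   # above 1: "optimal"
gcc_score = codon_odds_ratio("GCC", high, low)   # below 1: "non-optimal"
```

Taking the log of the score makes it symmetric around zero.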

I decided to test if I could also see a significant association between codon optimality and sites of post-translational modifications. I defined a window of plus or minus 2 amino-acids surrounding a phosphorylation site (of S. cerevisiae) as associated with post-translational modification. The rationale would be that selection for translational robustness could constrain codon usage near a phosphorylation site when compared with other Serine or Threonine sites. For simplification I mostly ignored tyrosine phosphorylation, which in S. cerevisiae is a very small fraction of the total phosphorylation observed to date.
For each codon I calculated its over representation at these phosphorylation windows compared to similar windows around all other S/T sites and plotted this value against the log of the codon optimality score calculated by Zhou and colleagues.
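The over-representation calculation can be sketched as follows for a single coding sequence. This is an illustration under assumptions rather than the exact script used: the sequence and site positions are made up, windows are clipped at the ends of the sequence, and codons absent from the background windows are simply skipped.

```python
from collections import Counter

def window_codons(cds, aa_index, flank=2):
    """Codons for residues aa_index-flank .. aa_index+flank, clipped to the CDS."""
    n_codons = len(cds) // 3
    start = max(0, aa_index - flank)
    end = min(n_codons, aa_index + flank + 1)
    return [cds[3 * i:3 * i + 3] for i in range(start, end)]

def codon_enrichment(cds, phosphosites, other_st_sites, flank=2):
    """Ratio of each codon's frequency in phosphosite windows to its
    frequency in windows around the other S/T sites."""
    fg = Counter(c for s in phosphosites for c in window_codons(cds, s, flank))
    bg = Counter(c for s in other_st_sites for c in window_codons(cds, s, flank))
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    return {c: (fg[c] / fg_total) / (bg[c] / bg_total) for c in fg if bg[c] > 0}

# Hypothetical CDS: serine codons (TCT) at residues 2 and 7 (0-based).
cds = "".join(["GCT", "GCT", "TCT", "GCT", "GCT",
               "GCT", "GCC", "TCT", "GCC", "GCC"])
enrichment = codon_enrichment(cds, phosphosites=[2], other_st_sites=[7])
```

In the full analysis the counts would be pooled over all proteins before taking the ratio, and each codon's ratio would then be plotted against its log optimality score.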

Figure 1 - Over-representation of optimal codons at phosphosites

At first impression it would appear that there is a significant correlation between codon optimality and phosphorylation sites. However, as I will try to describe below this is mostly due to differences in gene expression. Given the relatively small number of phosphorylation sites per protein, it is hard to test this association for each protein independently as it was done by Zhou and colleagues for the structurally important sites. The alternative is therefore to try to take into account the differences in gene expression. I first checked if phosphorylated proteins tend to be coded by highly expressed genes.

Figure 2 - Distribution of gene expression of phosphorylated proteins

In figure 2 I plot the distribution of gene expression for phosphorylated and non-phosphorylated proteins. There is only a very small difference observed with phosphoproteins having a marginally higher median gene expression when compared to other proteins. However this difference is small and a KS test does not rule out that they are drawn from the same distribution.

The next possible expression related explanation for the observed correlation would be that highly expressed genes tend to have more phosphorylation sites. Although there is no significant correlation between the gene expression level and the absolute number of phosphorylation sites, what I observed was that highly expressed proteins tend to be smaller in size. This means that there is a significant positive correlation between the fraction of phosphorylated Serine and Threonine sites and gene expression.

Figure 3 - Expression level correlates with fraction of phosphorylated ST sites

Unfortunately, I believe this correlation explains the result observed in figure 1. In order to properly control for this observation I calculated the correlation observed in figure 1 randomizing the phosphorylation sites within each phosphoprotein. To compare I also randomized the phosphorylation sites keeping the total number of phosphorylation sites fixed but not restricting the number of phosphorylation sites within each specific phosphoprotein.

Figure 4 - Distribution of R-squared for randomized phosphorylation sites

When randomizing the phosphorylation sites within each phosphoprotein, keeping the number of phosphorylation sites in each specific phosphoprotein constant, the average R-squared is higher than that observed with the experimentally determined phosphorylation sites (pink curve). This would mean that the correlation observed in figure 1 is not due to functional constraints acting on the phosphorylation sites but instead is probably due to the correlation observed in figure 3 between the expression level and the fraction of phosphorylated S/T residues.
The observed correlation would appear to be significantly higher than random if we allow the random phosphorylation sites to be drawn from any phosphoprotein without constraining the number of phosphorylation sites in each specific protein (blue curve). I added this because I thought it was a striking example of how a relatively subtle change in assumptions can change the significance of a score.
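The two randomization schemes compared here differ only in what is held fixed, which a short sketch makes explicit. The data structures and names below are hypothetical: `st_positions` maps each phosphoprotein to its S/T positions and `n_sites` to its number of observed phosphosites.

```python
import random

def shuffle_within_proteins(st_positions, n_sites, rng):
    """Resample phosphosites keeping each protein's site count fixed."""
    return {p: sorted(rng.sample(st_positions[p], n_sites[p])) for p in n_sites}

def shuffle_across_proteins(st_positions, n_sites, rng):
    """Resample the same total number of phosphosites from the pooled
    S/T positions of all phosphoproteins, ignoring per-protein counts."""
    pool = [(p, pos) for p in n_sites for pos in st_positions[p]]
    picked = {p: [] for p in n_sites}
    for p, pos in rng.sample(pool, sum(n_sites.values())):
        picked[p].append(pos)
    return {p: sorted(v) for p, v in picked.items()}

# Hypothetical input: S/T positions and phosphosite counts per protein.
st_positions = {"protA": [3, 10, 25, 40], "protB": [5, 8, 60, 61, 90, 120]}
n_sites = {"protA": 3, "protB": 1}

rng = random.Random(0)
within = shuffle_within_proteins(st_positions, n_sites, rng)
across = shuffle_across_proteins(st_positions, n_sites, rng)
```

Repeating each scheme many times and recomputing the R-squared for every randomized set gives the two null distributions. The first scheme preserves the link between expression level and the fraction of phosphorylated S/T residues, whereas the second one breaks it.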

I also tested if conserved phosphorylation sites tend to be coded by optimal codons when compared with non-conserved phosphorylation sites. For each phosphorylation site I summed over the codon optimality in a window around the site and compared the distribution of this sum for phosphorylation sites that are conserved in zero, one or more than one species. The conservation was defined based on an alignment window of +/- 10AAs of S. cerevisiae proteins against orthologs in C. albicans, S. pombe, D. melanogaster and H. sapiens.
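
As a rough illustration of the windowed score, the sketch below sums a per-codon optimality value around a site and then averages the sums by conservation class. The per-codon score table, the window half-width and the function names are hypothetical; the actual optimality measure and window size used for figure 5 are not given here.

```python
from collections import defaultdict
from statistics import mean


def window_optimality(cds, site, scores, half_width=10):
    """Sum per-codon optimality scores in a window centred on a phosphosite.

    cds        : coding sequence as a string of nucleotides
    site       : 0-based residue index of the phosphosite
    scores     : dict mapping codon -> optimality score (hypothetical values)
    half_width : residues on each side of the site (an assumed default)
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    start = max(0, site - half_width)
    end = min(len(codons), site + half_width + 1)
    # Codons missing from the table contribute nothing to the sum.
    return sum(scores.get(codon, 0.0) for codon in codons[start:end])


def mean_by_conservation(records):
    """Average window sums for sites conserved in 0, 1 or more than 1 species.

    records: iterable of (window_sum, n_species_where_site_is_conserved).
    """
    groups = defaultdict(list)
    for total, n_conserved in records:
        label = "0" if n_conserved == 0 else "1" if n_conserved == 1 else ">1"
        groups[label].append(total)
    return {label: mean(values) for label, values in groups.items()}
```

Comparing the full distributions per class, as in the figure, would need the list of sums in each group and not just the means.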

Figure 5 - Distribution of codon optimality scores versus phospho-site conservation

I observe a higher sum of codon optimality for conserved phosphorylation sites (fig 5A) but this difference is not maintained if the codon optimality score of each peptide is normalized by the expression level of the source protein (fig 5B).

In summary, when gene expression levels are taken into account, there does not appear to be an association between translationally optimal codons and the region around phosphorylation sites. This is consistent with the weak functional constraints observed in the analysis performed by Landry and colleagues.

Friday, June 26, 2009

Reply: On the evolution of protein length and phosphorylation sites

Lars just pointed out in a blog post that the average protein length of a group of proteins is a strong predictor of the average number of phosphorylation sites. Although this is intuitive, it is something that I honestly had not fully considered. As Lars mentions, this has potential implications for some of the calculations in our recently published study on the evolution of phosphorylation in yeast species.

One potential concern relates to figure 1a. We found that, although protein phosphorylation appears to diverge quickly, there is a high conservation of the relative number of phosphosites per protein for different GO groups. Lars suggests that, at least in part, this could be due to relative differences in average protein size for these different groups that in turn is highly conserved across species.

To test this hypothesis more directly I tried to correct for differences in the average protein size of different functional groups by calculating the average number of phosphorylation sites per amino-acid, instead of psites per protein. These values were then corrected for the average number of phosphorylation sites per AA in the proteome.
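
A minimal sketch of this length correction is shown below. Two things are assumptions on my part: the density of a group is taken as total psites over total residues (and not the mean of per-protein ratios), and the "correction" is taken to be a ratio to the proteome-wide density. The group labels and numbers are made up.

```python
from math import sqrt


def psites_per_residue(proteins):
    """proteins: iterable of (length_in_residues, n_phosphosites) pairs."""
    proteins = list(proteins)
    total_length = sum(length for length, _ in proteins)
    total_sites = sum(n_sites for _, n_sites in proteins)
    return total_sites / total_length


def normalized_density(groups, proteome):
    """Psite density of each GO group relative to the proteome-wide density."""
    background = psites_per_residue(proteome)
    return {go: psites_per_residue(members) / background
            for go, members in groups.items()}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy example: the same three hypothetical GO groups in two species.
species_1 = {"GO:A": 1.25, "GO:B": 0.83, "GO:C": 1.02}
species_2 = {"GO:A": 1.40, "GO:B": 0.75, "GO:C": 0.95}
shared = sorted(set(species_1) & set(species_2))
r = pearson([species_1[g] for g in shared], [species_2[g] for g in shared])
```

The cross-species comparison then reduces to computing `pearson` over the GO groups shared by each pair of species.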

As before, there is still a high cross-species correlation for the average number of psites per amino-acid for different GO groups. The correlations are only somewhat smaller than before. The individual correlation coefficients among the three species changed from: S. cerevisiae versus C. albicans – R~0.90 to 0.80; S. cerevisiae versus S. pombe – R~0.91 to 0.84; S. pombe versus C. albicans – R~0.88 to 0.83. It would seem that differences in protein length explain only a small part of the observed correlations. Results in figure 1b are also not qualitatively affected by this normalization, suggesting that the observed differences are not due to potential changes in the average size of proteins. In fact the number of amino acids per GO group is almost perfectly correlated across species.

Another potential concern relates to the sequence based prediction of phosphorylation. As explained in the methods, one of the two approaches used to predict if a protein was phosphorylated was the sum over multiple phosphorylation site predictors for the same sequence. Given the correlation shown by Lars, could it be that, at least for one of the methods, we are mostly predicting the average protein size? To test this I normalized the phosphorylation prediction for each S. cerevisiae protein by its length. I re-tested the predictive power of this normalized value using ROC curves and the known phosphoproteins of S. cerevisiae as positives. The AROC values changed from 0.73 to 0.68. This shows that the phosphorylation propensity is not just predicting protein size although, as expected from Lars' blog post, size alone is actually a decent predictor for phosphorylation (AROC=0.66). The normalized phosphorylation propensity does not correlate with the protein size (CC~0.05), suggesting that there might be ways to improve the predictors we used.
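
For anyone who wants to repeat this kind of check, the comparison can be sketched as follows. The area under the ROC curve is computed here from its rank interpretation (the probability that a random positive scores higher than a random negative). The variable names and toy numbers are invented and are not the values from the paper.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparisons.

    Equals the probability that a randomly chosen positive has a higher
    score than a randomly chosen negative; ties count as one half.
    Quadratic in the number of proteins, which is fine for a sketch.
    """
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy data (made up): protein length, summed predictor score, and whether
# the protein is a known phosphoprotein.
lengths = [200, 800, 400, 1000]
summed_score = [4.0, 12.0, 2.0, 9.0]
is_phosphoprotein = [1, 1, 0, 0]

# Length-normalized propensity: summed prediction divided by protein length.
per_residue = [s / n for s, n in zip(summed_score, lengths)]

auc_raw = auroc(summed_score, is_phosphoprotein)
auc_normalized = auroc(per_residue, is_phosphoprotein)
auc_length_only = auroc(lengths, is_phosphoprotein)
```

Scoring the raw propensity, the per-residue propensity and length alone against the same set of positives is what allows the three AROC values to be compared directly.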

Nature or method bias?
Are larger proteins more likely to be phosphorylated in a cell or are they more likely to be detected in a mass-spec experiment? It is likely that what we are observing is a combination of both effects but it would be nice to know how much of this observed correlation is due to potential MS bias. I am open to suggestions for potential tests.
This is also important for what I am planning to work on next. A while ago I had noticed that prediction of phosphorylation propensity could also predict ubiquitination and vice-versa. It is possible that they are mostly related by protein size. I will try to look at this in future posts.

Tuesday, June 23, 2009

Comparative analysis of phosphoproteins in yeast species

My first postdoctoral project has just appeared online in PLoS Biology. It is about the evolution of phosphoregulation in yeast species. This analysis follows from a previous work I had done during my PhD on the evolution of protein-protein interactions after gene duplication (paper / blog post). One of the conclusions from that previous work was that interactions of lower specificity, such as those mediated by short peptides, would be more prone to change. In fact, one of the protein domains that we found associated with high rates of change of protein-protein interactions was the kinase domain.
Given that the substrate specificity of a kinase is usually determined by a few key amino-acids surrounding the target phosphosite, it is easy to imagine how kinase-substrate interactions can be created and destroyed with few mutations. It is also well known that these phosphorylation events can have important functional consequences. We therefore postulated that changes in phosphorylation are an important source of phenotypic diversity.

To test this, we collected by mass-spectrometry in vivo phosphorylation sites for 3 yeast species (S. cerevisiae, C. albicans and S. pombe). These were compared in order to estimate the rate of change of kinase-substrate interactions. Since changes in gene expression are generally regarded as one of the main sources of phenotypic diversity we compared these estimates with similar calculations for the rate of change of transcription factor (TF) interactions to promoters. Depending on how we define a divergence of phosphorylation we estimate that kinase-substrate interactions change either at similar rates or at most 2 orders of magnitude slower than TF-promoter interactions.

Although these changes in kinase-substrate interactions appear to be fast, groups of functionally related proteins tend to maintain the same levels of phosphorylation across broad time scales. We could identify a few functional groups and protein complexes with a significant divergence in phosphorylation and we tried to predict the most likely kinases responsible for these changes.

Finally we compiled recently published genetic interaction data for S. pombe (from Assen Roguev's work) and for S. cerevisiae (from Dorothea Fiedler's work) in addition to some novel genetic data produced for this work. We used this information to study the relative conservation of genetic interactions for protein kinases and transcription factors. We observed that both protein kinases and TFs show a lower than average conservation of genetic interactions.

We think these observations strongly support the initial hypothesis that divergence in kinase-substrate interactions contributes significantly to phenotypic diversity.

Technology opening doors
For me personally it really feels like I was in the right place at the right time. Many of the experimental methods we used are still under heavy development but I was lucky to be very literally next door to the right people. I had the chance to collaborate with Jonathan Trinidad who works for the UCSF Mass Spectrometry Facility directed by Alma Burlingame. I also arrived at a time when the Krogan lab, more specifically Assen Roguev (twitter feed), has been working to develop genetic interaction assays for S. pombe (Roguev A 2007). As we describe in the introduction, these technological developments really allow us to map out the functional and physical interactions of a cell at an incredible rate. What I am hoping for is that soon they are seen in much the same light as genome sequencing. We can and should be using these tools to study, simultaneously, groups of species and not just the same usual model organisms that diverged from each other more than 1 billion years ago.

Evolution of signalling
There are many more protein interactions that are determined by short linear peptide motifs (Neduva PLoS Bio 2005). A large fraction of these determine protein post-translational modifications and are crucial for signal transduction systems. For the next couple of years I will try to continue to study the evolution of signal transduction systems. There are certainly many experimental and computational challenges to address. I am particularly interested in looking at the co-regulation by combinations of post-translational modifications and their co-evolution. I will do my best to share some of that work as it happens here in the blog.