Cellular Consequences of Genetic variation: P2

Around four years ago I wrote this blog post where I suggested that it might be possible to combine protein interaction data with phosphosites from mass-spectrometry (MS) data to infer the specificity of protein kinases. I did a very simple pilot test and invited others to contribute to the idea. Nobody really picked up on it until Omar Wagih, a PhD student in the group, decided to test the limits of the approach. To his credit, I didn't even ask him to do it; his main project was supposed to be on individual genomics. I am glad that he deviated long enough to get some interesting results that have now been published.

As I described four years ago, the main inspiration for this project was the work of Neduva and colleagues. They showed that motif enrichment applied to the interaction partners of peptide-binding domains can reveal the binding specificity of the domain. One step of their method was to filter out regions of proteins that were unlikely to be target sequences before doing motif identification. For PTM enzymes or binding domains we should be able to take advantage of MS-derived PTM data to select the peptides for motif identification, simply by taking the peptide sequences around the PTM sites. This was exactly what Omar set out to do, focusing on human kinases as a test case.
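As a rough illustration of "taking the peptide sequences around the PTM sites", here is a minimal Python sketch. It is not the code used for the paper: the function name `site_window`, the flank size, the `_` padding character and the toy protein are all assumptions made for the example.

```python
def site_window(sequence, position, flank=7, pad="_"):
    """Return the residues centred on a modified site.

    `position` is 1-based, as PTM sites are usually reported.
    Windows that run past either end of the protein are padded
    with `pad` so that every window has length 2 * flank + 1.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position lies outside the sequence")
    padded = pad * flank + sequence + pad * flank
    centre = position - 1 + flank  # index of the site in the padded string
    return padded[centre - flank:centre + flank + 1]


# Toy protein with hypothetical phosphosites at S4 and S12.
protein = "MKTSPEKAAALSPRKDE"
phosphosites = [4, 12]

windows = [site_window(protein, pos, flank=3) for pos in phosphosites]
print(windows)
```

Fixed-width windows like these are the kind of input a motif enrichment tool expects, with the modified residue always at the central position.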

To summarize the outcome of this project: the method works, with some limitations. For around a third of the human kinases that could be benchmarked he got very good predictions (AUC>0.7). The predictions are better for some kinase families than for others, and we think this is due to how specific the kinase is for the residues around the target site. It is known that kinases find their targets via multiple mechanisms (e.g. docking sites, shared interactions, co-localization, etc.). This specificity prediction approach will work better for kinases that find their targets mostly by recognizing amino acids near the phosphosite. With the help of Naoyuki Sugiyama in Yasushi Ishihama's lab we validated the specificity predictions for 4 understudied human kinases. One advantage of this approach is that it could be very general. Omar also tried it on 14-3-3 domains, which bind phosphosites, and on a bromodomain-containing protein that is known to bind acetylated peptides. Finally, we also tried to use this to compare kinase specificity between human and mouse, but given the current limitations of the method I don't think it is possible to use these predictions alone to find divergent cases of specificity.
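For readers less familiar with the AUC figure quoted above, here is a minimal Python sketch of how an AUC can be computed when a predicted specificity model scores known target sites against background sites. The scores are made up, and the exact benchmarking procedure used in the paper is not described in this post, so treat this only as an illustration of the metric.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Equivalent to the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count half).
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical motif scores for known substrates and for background sites.
known_sites = [0.9, 0.8, 0.4]
background_sites = [0.7, 0.3, 0.2, 0.1]

print(round(auc(known_sites, background_sites), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why values above 0.7 were treated here as very good predictions.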

The predictions for human kinase specificity can be found here and a tutorial on how to repeat these predictions is here. The motif enrichment was done using the motif-x algorithm. Given that we could not really use the web version, Omar implemented the algorithm in R and a package is available here.

There are many other ways to predict specificities for PTM enzymes and binding domains. If you have many known target sites, the best way is to train a predictor such as NetPhorest or GPS. There is also the possibility of using the known target sites in conjunction with structural data to infer rules about specificity and the specificity-determining residues. A great example of this is Predikin and, more recently, KINspect. Ongoing work in the group now aims to combine what Omar did with some aspects of Predikin to study the evolution of kinase specificity.

Going back to the beginning of the post, this idea was my second attempt at an open science project. The first attempt was a project on the evolution and function of protein phosphorylation (described here). This ended up being one of the main projects of my postdoc and is now the main focus of the group. I am still curious to know if distributed open science projects will ever take off. I don't mean big project consortia but smaller-scale research where several people could easily contribute their expertise, almost as "spare cycles". Often when you are an expert in some analysis or method you could easily add a contribution with little effort. However, there was much more excitement about open science a few years ago, whereas now most of the discussions have shifted to pre-prints and doing away with the traditional publishing system. Maybe we just don't have time to pay attention or to contribute to such open projects.

Over the past few years, improvements in mass-spectrometry methods have resulted in a big increase in throughput for the identification of post-translational modifications (PTMs). It is even hard to keep up with all the phosphoproteomics papers and the accumulation of phosphorylation data. Most often, improvements in methods result in interesting challenges and opportunities. In this case, how can we make use of this explosion in PTM data? I will try to explore a fairly straightforward idea for how to use phosphorylation data to predict kinase substrate specificity. I'll describe here the general idea and just the first stab at it to show that I think it can work.

The inspiration for this is the work by Neduva and colleagues, who have shown that we can search for enriched motifs within proteins that interact with the domain of interest. For example, we can take a protein containing an SH3 domain and find all of its interaction partners, and we will likely see that they are enriched for proline-rich motifs of the type PXXP (X = any amino acid), which is the known binding preference for this domain. So the very obvious application to kinases would be to take the interaction partners of a kinase and find enriched peptide motifs. The advantage of looking at kinases, over any other type of peptide-binding domain, is that we can focus specifically on phosphosites.
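To make the SH3 example concrete, here is a minimal Python sketch that scans a set of interaction partners for PXXP occurrences with a regular expression. The partner names and sequences are invented, and this only counts matches to a known pattern; a real enrichment analysis would discover the motif and compare it against a background set.

```python
import re

# PXXP with X = any residue; the lookahead lets overlapping matches be reported.
PXXP = re.compile(r"(?=(P..P))")


def find_motif(sequence, pattern=PXXP):
    """Return (1-based start, matched peptide) for every motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence)]


# Hypothetical interaction partners of an SH3 domain protein.
partners = {
    "partner1": "AAPPLPPRAA",
    "partner2": "MKTAYIAK",
}

hits = {name: find_motif(seq) for name, seq in partners.items()}
fraction_with_motif = sum(1 for h in hits.values() if h) / len(partners)

print(hits)
print(fraction_with_motif)
```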

As a test case I picked the S. cerevisiae Cdc28p (Cdk1), which is known to phosphorylate the motif [ST]PXK. I used the STRING database to identify proteins that functionally interact with Cdc28 with a cut-off of 0.9 and retrieved all currently known phosphosites within these proteins. As a quick check I used Motif-X to search for enriched motifs. The first try was somewhat disappointing, but after removing phosphosites that had fewer than 5 MS spectra and/or experiments supporting them I got back this logo as the most enriched motif:

This was probably the easiest kinase to try since it is known that it typically phosphorylates its targets at multiple sites and it is heavily studied. Still, I think there is a lot of room for exploration here. If anyone is interested in collaborating on this let me know. If you're doing computational work I would be interested in some code/tools for motif enrichment. If you're doing experimental work let me know about your favorite kinases/species.
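The filtering steps of the Cdc28 test case above can be sketched in a few lines of Python. Everything in the data below is made up for illustration (protein names, interaction scores, site windows and spectral counts), and the "5 spectra" rule is applied here to a single support count per site, which is an assumption. The final step only checks the known [ST]PXK motif rather than discovering it the way Motif-X does.

```python
import re

CDK_MOTIF = re.compile(r"[ST]P.K")  # [ST]PXK, X = any residue
FLANK = 3  # residues on each side of the phosphosite in the toy windows

# Hypothetical STRING-style interaction scores with the kinase.
partner_scores = {"protA": 0.95, "protB": 0.92, "protC": 0.99,
                  "protD": 0.97, "protE": 0.50}

# Hypothetical phosphosites: (protein, window centred on the site, supporting spectra).
sites = [
    ("protA", "AAASPEK", 12),
    ("protB", "GLKTPAK", 6),
    ("protC", "EEDSDEE", 9),
    ("protD", "RRRSPRK", 2),   # dropped: too few spectra
    ("protE", "KKLSPVK", 20),  # dropped: interaction score below cut-off
]


def select_sites(sites, partner_scores, min_score=0.9, min_spectra=5):
    """Keep well-supported sites on high-confidence interaction partners."""
    return [s for s in sites
            if partner_scores.get(s[0], 0.0) >= min_score and s[2] >= min_spectra]


def matches_motif(window, flank=FLANK):
    """True if the motif starts exactly at the central (phosphorylated) residue."""
    return CDK_MOTIF.match(window, flank) is not None


selected = select_sites(sites, partner_scores)
motif_hits = [s for s in selected if matches_motif(s[1])]
print(len(selected), len(motif_hits))
```

Anchoring the match at the central residue matters: a motif elsewhere in the window would not say anything about the phosphorylated site itself.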

Friday, November 27, 2015

Predicting PTM specificities from MS data and interaction networks

Wednesday, May 25, 2011

Predicting kinase specificity from phosphorylation data