Cellular Consequences of Genetic variation: specificity

Friday, November 27, 2015

Predicting PTM specificities from MS data and interaction networks

Around four years ago I wrote this blog post where I suggested that it might be possible to combine protein interaction data with phosphosites from mass-spectrometry (MS) data to infer the specificity of protein kinases. I did a very simple pilot test and invited others to contribute to the idea. Nobody really picked up on it until Omar Wagih, a PhD student in the group, decided to test the limits of the approach. To his credit, I didn't even ask him to do it; his main project was supposed to be on individual genomics. I am glad that he deviated long enough to get some interesting results that have now been published.

As I described four years ago, the main inspiration for this project was the work of Neduva and colleagues. They showed that motif enrichment applied to the interaction partners of peptide binding domains can reveal the binding specificity of the domain. One step of their method was to filter out regions of proteins that were unlikely to be target sequences before doing motif identification. For PTM enzymes or binding domains we should be able to take advantage of the MS derived PTM data to select the peptides for motif identification by just taking the peptide sequences around the PTM sites. This was exactly what Omar set out to do by focusing on human kinases as a test case.
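The peptide selection step described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the code used in the paper: the function names, the window size of 7 residues on each side, the `_` padding character and the toy protein data are all made up for the example.

```python
def site_windows(sequence, positions, flank=7, pad="_"):
    """Return fixed-width windows centred on 1-based PTM site positions.

    Windows that run past either end of the protein are padded with `pad`
    (the padding character and default flank size are assumptions).
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for pos in positions:
        centre = pos - 1 + flank  # index of the site in the padded string
        windows.append(padded[centre - flank:centre + flank + 1])
    return windows


def candidate_peptides(enzyme, interactions, proteome, ptm_sites, flank=7):
    """Collect PTM-centred windows from all interaction partners of `enzyme`."""
    peptides = []
    for partner in sorted(interactions.get(enzyme, ())):
        sequence = proteome.get(partner)
        if sequence is None:
            continue  # partner has no sequence in this toy proteome
        peptides.extend(site_windows(sequence, ptm_sites.get(partner, []), flank))
    return peptides


# Hypothetical toy data: one kinase with two interaction partners.
interactions = {"KIN1": {"P1", "P2"}}
proteome = {"P1": "MKSPTAAA", "P2": "AARSLSPEE"}
ptm_sites = {"P1": [3], "P2": [6]}  # 1-based phosphosite positions

peptides = candidate_peptides("KIN1", interactions, proteome, ptm_sites, flank=2)
print(peptides)
```

The resulting windows are what would then be handed to a motif finder, with the PTM site always in the central column.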

To summarize the outcome of this project: the method works, with some limitations. For around a third of human kinases that could be benchmarked he got very good predictions (AUC>0.7). For some kinase families the predictions are better than for others, and we think this is due to how specific the kinase is for the residues around the target site. It is known that kinases find their targets via multiple mechanisms (e.g. docking sites, shared interactions, co-localization, etc). This specificity prediction approach will work better for kinases that find their targets mostly by recognizing amino-acids near the phosphosite. With the help of Naoyuki Sugiyama in Yasushi Ishihama's lab we validated the specificity predictions for 4 understudied human kinases. One advantage of using this approach is that it could be very general. Omar also tried it on 14-3-3 domains, which bind phosphosites, and on a bromodomain-containing protein that is known to bind acetylated peptides. Finally, we also tried to use this to compare kinase specificity between human and mouse, but given the current limitations of the method I don't think it is possible to use these predictions alone to find divergent cases of specificity.

The predictions for human kinase specificity can be found here and a tutorial on how to repeat these predictions is here. The motif enrichment was done using the motif-x algorithm. Given that we could not really use the web version, Omar implemented the algorithm in R and a package is available here.
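To give a feel for what the enrichment step involves, here is a simplified Python sketch of the core idea as I understand it: for each residue/position pair, ask how surprising its count in the foreground peptides is given its frequency in a background set, using a binomial tail probability. This is not the R package's API and it leaves out the iterative part of motif-x (fixing the best pair, filtering the peptides and repeating). The function names, the pseudocount and the toy peptides are assumptions made for the example.

```python
from collections import Counter
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def most_enriched_pair(foreground, background, centre):
    """Return (offset, residue, p-value) for the most enriched residue/position pair.

    `centre` is the 0-based index of the modified residue, which is skipped.
    A pseudocount (1 over 20 amino-acids, an assumption) keeps background
    frequencies away from zero.
    """
    n_fg, n_bg = len(foreground), len(background)
    width = len(foreground[0])
    best = None
    for pos in range(width):
        if pos == centre:
            continue
        fg_counts = Counter(pep[pos] for pep in foreground)
        bg_counts = Counter(pep[pos] for pep in background)
        for residue, k in fg_counts.items():
            p_bg = (bg_counts[residue] + 1) / (n_bg + 20)
            p_value = binom_tail(k, n_fg, p_bg)
            if best is None or p_value < best[2]:
                best = (pos - centre, residue, p_value)
    return best


# Hypothetical toy peptides (width 5, phosphosite at index 2).
foreground = ["AASPA", "GTSPK", "LKSPE", "RRSPD"]
background = ["AASAA", "GTSGK", "LKSPE", "RRSDD",
              "QQSLL", "EESKK", "MMSVV", "NNSTT"]

offset, residue, p_value = most_enriched_pair(foreground, background, centre=2)
print(offset, residue, p_value)
```

In this toy set the proline at +1 stands out because it is present in every foreground peptide but rare in the background.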

There are many other ways to predict specificities for PTM enzymes and binding domains. If you have many known target sites the best way is to train a predictor such as NetPhorest or GPS. There is also the possibility of using the known target sites in conjunction with structural data to infer rules about specificity and the specificity determining residues. A great example of this is Predikin and more recently KINspect. Ongoing work in the group now aims to combine what Omar did with some aspects of Predikin to study the evolution of kinase specificity.

Going back to the beginning of the post, this idea was my second attempt at an open science project. The first attempt was a project on the evolution and function of protein phosphorylation (described here). This ended up being one of the main projects of my postdoc and is now the main focus of the group. I am still curious to know if distributed open science projects will ever take off. I don't mean big project consortia but smaller-scale research where several people could easily contribute with their expertise almost as "spare cycles". Often when you are an expert in some analysis or method you could easily add a contribution with little effort. However, there was much more excitement about open science a few years ago, whereas now most of the discussions have shifted to pre-prints and doing away with the traditional publishing system. Maybe we just don't have time to pay attention or to contribute to such open projects.

Thursday, March 03, 2011

Structure based prediction of kinase interactions

About a year ago Ben Turk's lab published a large scale experimental effort to determine the substrate recognition preferences of most yeast kinases (Mok et al. Sci. Signal. 2010). They used a peptide screening approach to analyze 61 of about 122 known S. cerevisiae kinases in order to derive, for each one, a position specific scoring matrix (PSSM) describing their substrate recognition preference. In the figure below I show an example for the Hog1 MAPK where it is clear that this kinase prefers to phosphorylate peptides that have proline next to the S/T that is going to be phosphorylated.

Figure 1 - Example of Hog1 substrate recognition preference derived from peptide screens. Each spot in the array contains a mixture of peptides that are randomized at all positions except at the marked position (-5 to +4 relative to the phosphorylatable residue). Strong signal correlates with a preference for phosphorylating peptides containing that amino-acid at the fixed position.
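As a rough illustration of how array signals like these can become a scoring matrix, the Python sketch below normalizes each position by its mean signal, takes a log2 ratio, and sums the matching entries to score a peptide. This is one common way of handling positional scanning data and is an assumption on my part; it is not necessarily the normalization used by Mok et al. The intensity values are invented and only two positions and four residues are shown.

```python
from math import log2


def pssm_from_intensities(intensities, pseudo=1e-6):
    """Convert {offset: {residue: signal}} into log2(signal / mean signal).

    `pseudo` avoids log2(0) for empty spots (an assumption of this sketch).
    """
    pssm = {}
    for offset, row in intensities.items():
        mean = sum(row.values()) / len(row)
        pssm[offset] = {
            residue: log2((signal + pseudo) / (mean + pseudo))
            for residue, signal in row.items()
        }
    return pssm


def score_peptide(pssm, peptide, centre):
    """Sum the matrix entries for each residue; unknown entries contribute 0."""
    score = 0.0
    for i, residue in enumerate(peptide):
        offset = i - centre  # position relative to the phosphoacceptor
        score += pssm.get(offset, {}).get(residue, 0.0)
    return score


# Invented signals for a proline-directed kinase at positions -1 and +1.
intensities = {
    -1: {"A": 2.0, "G": 2.0, "E": 2.0, "P": 2.0},
    1: {"A": 1.0, "G": 1.0, "E": 2.0, "P": 8.0},
}
pssm = pssm_from_intensities(intensities)

print(score_peptide(pssm, "ASP", centre=1))  # proline at +1
print(score_peptide(pssm, "ASA", centre=1))  # alanine at +1
```

A position where all residues give a similar signal, like -1 here, ends up contributing close to zero, so only the positions with real preferences drive the score.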

As was previously known, most kinases don't appear to have very striking substrate binding preferences. Still, these matrices should allow for significant predictions of kinase-site interactions. They should also allow us to benchmark previous efforts by Neil and other members of the Kobe lab on structure-based prediction of kinase substrate recognition. For this, I obtained the predicted substrate recognition matrices from the Predikin server and known kinase-site interactions from the PhosphoGrid database. I used this data to compare the predictive power of the experimentally determined kinase matrices (Mok et al.) with the predicted matrices from Predikin. This analysis was done about a year ago when the Mok et al. paper was published but I don't think PhosphoGrid has been significantly updated since then.

PhosphoGrid had 422 kinase-site interactions for the 61 kinases analyzed in Mok et al., of which ~50% have in-vivo evidence for kinase recognition. As expected, the known kinase-site interactions have a stronger experimental matrix score than random kinase-site assignments (Fig 2).

Figure 2 - The set of kinase-site interactions used, broken down according to the kinases with higher representation. These sites were scored using the experimental matrices along with other randomly selected phosphosites and the scores of both populations are summarized in the boxplots.

A random set of kinase-phosphosite interactions of equal size was used to quantify the predictive power of the experimental and the Predikin matrices with a ROC curve (Fig 3).

Figure 3 - Area under the ROC curve values for kinase-site predictions using both types of matrices.
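For readers less familiar with the measure, the area under the ROC curve is the probability that a randomly chosen known kinase-site pair scores higher than a randomly chosen random pair, with ties counting as half. A minimal Python sketch of that calculation follows; the function name and the scores are made up for illustration and are not the values behind the figure.

```python
def area_under_roc(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a correct ranking. This pairwise form is quadratic in
    the number of scores, which is fine for a few hundred sites.
    """
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented matrix scores for known and random kinase-site pairs.
known = [0.9, 0.4]
random_pairs = [0.5, 0.1]
print(area_under_roc(known, random_pairs))
```

A value of 0.5 means the matrix cannot tell known sites from random ones, and 1.0 means every known site outscores every random one.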

Overall, the accuracy of the predicted matrices from Predikin matched that of the matrices derived from the peptide array experiments reasonably well, with only a small difference in AROC values. I broke down the predictions for individual kinases with at least 10 known sites. Benchmarking with such low numbers becomes very unreliable but, apart from the Cka1 kinase, the performance of the Predikin matrices matched the experimental results reasonably well.

I am assuming here that Predikin was not updated with any information from the Mok et al. study to derive their predictions. If this is true, it would mean that structure-based prediction of kinase recognition preferences, as implemented in Predikin, is almost as accurate as preferences derived from peptide library approaches.