I want to be able to predict what proteins in a proteome are more likely to be regulated by phosphorylation and hopefully use mostly sequence information. This post is a quick note to show what I have tried and maybe get some feedback from people that might have tried this before.
The most straightforward way to predict the phospho-proteins is to use existing phospho-site predictors in some way. I have used the GPS 2.0 predictor on the S. cerevisiea proteome with medium cutoff and including only Serine/Threonine kinases. The fraction of tyrosine phosphosites in S. cerevisiae is very low so I decided to for now not try to predict tyrosine phosphorylation.
This produces a ranked list of 4E6 putative phosphosites for the roughly 6000 proteins scored according to the predictor (each site is scored for multiple kinases). My question is how to best make use of these predictions if I mostly want to know what proteins are phosphorylated and not the exact sites. Using a set of known phosphorylated proteins in S. cerevisiae (mostly taken from expasy) I computed different final scores as a function of the of all phospho-site scores:
1) the sum
2) the highest value
3) the average
4) the sum of putative scores if they were above a threshold (4,6,10)
5) the sum of putative phosphosite scores if they were outside ordered protein segments as defined by a secondary structure predictor and above a score threshold
The results are summarized with the area under the ROC curve (known phosphoproteins were considered positives and all other negatives) :
In summary, the sum of all phospho-site scores is the best way that I found so far to predict what proteins are phospho-regulated. My interpretation is that phospho-regulated proteins tend to be multi-phosphorylated and/or regulated by multiple kinases so the maximum site score will not work as well as the sum. As a side note, although there are abundance biases in mass-spec data (the source of most of the phospho-data) protein abundance is a very poor predictor of phospho-regulation (AROC=0.55).
Disregarding putative sites outside predicted secondary structured protein segments did not improve the predictions as I would expect but I should try a few disorder predictors.
Ideas for improvements are welcomed, in particular sequence based methods. I would also like to avoid comparative genomics for now.