Saturday, November 10, 2007

Predicting functional association using mRNA localization

About a month ago Lécuyer and colleagues published a paper in Cell describing an extensive study of mRNA localization in Drosophila embryos during development. The main conclusion of this study was that a very large fraction (71%) of the genes they analyzed (2314) had localization patterns during some stage of the embryonic development. This includes both embryonic localization or sub-cellular localizations.

There is a lot of information that was gathered in this analysis and it should serve as resource for further studies. There is information for different developmental stages so it should also be possible to look for the dynamics of localization of the mRNAs. Another application of this data would be to use it as information source to predict functional association between genes.

Protein localization information as been used in the past for prediction of protein-protein interactions (both physical and genetic interactions). Typically this is done by integrating localization with other data sources in probabilistic analysis [Jansen R et al. 2003, Rhodes DR et al. 2005, Zhong W & Sternberg PW, 2006].

To test if mRNA localization could be used in the same way I took from this website the localization information gathered in the Cell paper and available genetic and protein interaction information for D.melanogaster genes/proteins (can be obtained for example in BioGRID among others). For this analysis I grouped physical and genetic interactions together to have a larger number of interactions to test. The underlying assumption is that both should imply some functional association of the gene pair.

The very first simple test is to have a look at all pairs of genes (with available localization information) and test how the likelihood that they interact depends on the number of cases where they were found to co-localized (see figure below). I discarded any gene for each no interaction was known.
As seen in the figure there is a significant correlation (r=0.63,N=21,p<0.01) between the likelihood of interaction and the number of co-localizations observed for the pair. At this point I did not exclude any localization term but since images were annotated using an hierarchical structure these terms are in some cases very broad.

More specific patterns should be more informative so I removed very broad terms by checking the fraction of genes annotated to each term. I created two groups of more narrow scope, one excluding all terms annotated to more than 50% of genes (denominated "localizations 50") and a second excluding all terms annotated to more than 30% of genes (localizations 30). In the figure below I binned gene pairs according to the number of co-localizations observed in the three groups of localization terms and for each bin calculated the fraction that interact.

As expected, more specific mRNA localization terms (localizations 30) are more informative for prediction of functional association since fewer terms are required to obtain the same or higher likelihood of interaction. The increased likelihood does not come at a cost of fewer pairs annotated. For example, there are similar number of gene pairs in bin "10-14" of the more specific localization terms (localizations 30) as in the bin ">20" for all localization terms (see figure below).
It is important to keep in mind that mRNA localization alone is a very poor predictor of genetic or physical interaction. I took the number of co-localization of each pair (using the terms in "localizations 30") and plotted a ROC curve to determine the area under the ROC curve (AROC or AUC). The AROC value calculated was 0.54, with a 95% confidence lower bound of 0.52 and a p value of 6E-7 of the true area being 0.5. So it is not random (that would be 0.5) but by itself is a very poor predictor.

In summary:
1) the degree of mRNA co-localization significantly correlates with the likelihood of genetic or physical association.
2) less ubiquitous mRNA localization patterns should be more informative for interaction prediction
3) the degree of mRNA co-localization is by itself a poor predictor of interaction but it should be possible to use this information to improve statistical methods to predict genetic/physical interactions.

This was a quick analysis, not thoroughly tested and just meant to confirm that mRNA localization should be useful for genetic/physical interaction predictions. I am not going to pursue this but if there is anyone interested I suggest that it could be interesting to see what terms have more predictive power with the idea of integrating this information with other data sources or also possibly directing future localization studies. Perhaps there is little point of tracking different developmental stages or maybe embryonic localization patterns are not as informative as sub-cellular localizations to predict functional association.

Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science. 2003 Oct 17;302(5644):449-53.
Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram S, Ghosh D, Pandey A, Chinnaiyan AM. Probabilistic model of the human protein-protein interaction network.Nat Biotechnol. 2005 Aug;23(8):951-9.
Zhong W, Sternberg PW. Genome-wide prediction of C. elegans genetic interactions.Science. 2006 Mar 10;311(5766):1481-4.