Showing posts with label hypothesis. Show all posts

Saturday, November 10, 2007

Predicting functional association using mRNA localization

About a month ago Lécuyer and colleagues published a paper in Cell describing an extensive study of mRNA localization in Drosophila embryos during development. The main conclusion of this study was that a very large fraction (71%) of the genes they analyzed (2314) showed localization patterns during some stage of embryonic development. This includes both embryonic and sub-cellular localization patterns.

A lot of information was gathered in this analysis and it should serve as a resource for further studies. Information is available for different developmental stages, so it should also be possible to look at the dynamics of mRNA localization. Another application of this data would be to use it as an information source to predict functional association between genes.

Protein localization information has been used in the past for prediction of protein-protein interactions (both physical and genetic interactions). Typically this is done by integrating localization with other data sources in a probabilistic analysis [Jansen R et al. 2003, Rhodes DR et al. 2005, Zhong W & Sternberg PW, 2006].

To test if mRNA localization could be used in the same way I took from this website the localization information gathered in the Cell paper, together with available genetic and protein interaction information for D. melanogaster genes/proteins (which can be obtained, for example, from BioGRID among others). For this analysis I grouped physical and genetic interactions together to have a larger number of interactions to test. The underlying assumption is that both should imply some functional association of the gene pair.

The very first simple test is to look at all pairs of genes (with available localization information) and test how the likelihood that they interact depends on the number of cases where they were found to co-localize (see figure below). I discarded any gene for which no interaction was known.
As seen in the figure there is a significant correlation (r=0.63, N=21, p<0.01) between the likelihood of interaction and the number of co-localizations observed for the pair. At this point I did not exclude any localization term, but since the images were annotated using a hierarchical structure these terms are in some cases very broad.
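A minimal sketch of this first test, on made-up data. The gene names, the localization terms and the helper functions below are all hypothetical and are not the scripts used for the figure; they only illustrate counting shared terms per gene pair, discarding genes without any known interaction, and correlating the co-localization count with the fraction of interacting pairs.

```python
from itertools import combinations
from math import sqrt


def colocalization_counts(annotations, interactions):
    """Count shared localization terms for every pair of genes.

    Genes without any known interaction are discarded first.
    """
    with_partner = {gene for pair in interactions for gene in pair}
    genes = sorted(g for g in annotations if g in with_partner)
    return {
        (a, b): len(annotations[a] & annotations[b])
        for a, b in combinations(genes, 2)
    }


def interaction_fraction(counts, interactions):
    """Fraction of interacting pairs for each co-localization count."""
    totals, hits = {}, {}
    for pair, n_shared in counts.items():
        totals[n_shared] = totals.get(n_shared, 0) + 1
        if pair in interactions:
            hits[n_shared] = hits.get(n_shared, 0) + 1
    return {n: hits.get(n, 0) / totals[n] for n in sorted(totals)}


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: gene -> set of localization terms.
annotations = {
    "geneA": {"t1", "t2", "t3"},
    "geneB": {"t1", "t2", "t3"},
    "geneC": {"t1", "t2"},
    "geneD": {"t1", "t4"},
    "geneE": {"t1"},  # no known interaction, so it is discarded
}
# Known interactions, stored as alphabetically sorted gene pairs.
interactions = {("geneA", "geneB"), ("geneA", "geneC"), ("geneB", "geneD")}

counts = colocalization_counts(annotations, interactions)
fractions = interaction_fraction(counts, interactions)
r = pearson_r(list(fractions), list(fractions.values()))
print(fractions, round(r, 2))
```

With real data the pairs would number in the millions, so the pair counts would probably have to be streamed rather than kept in a dictionary.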

More specific patterns should be more informative, so I removed very broad terms by checking the fraction of genes annotated to each term. I created two groups of narrower scope, one excluding all terms annotated to more than 50% of genes (named "localizations 50") and a second excluding all terms annotated to more than 30% of genes ("localizations 30"). In the figure below I binned gene pairs according to the number of co-localizations observed in the three groups of localization terms and for each bin calculated the fraction that interact.
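The filtering step could look something like the sketch below. The function name, the term names and the toy annotations are assumptions made for illustration; only the 50% and 30% thresholds come from the analysis described above.

```python
def filter_broad_terms(annotations, max_fraction):
    """Drop terms annotated to more than max_fraction of all genes."""
    n_genes = len(annotations)
    term_counts = {}
    for terms in annotations.values():
        for term in terms:
            term_counts[term] = term_counts.get(term, 0) + 1
    kept = {t for t, c in term_counts.items() if c / n_genes <= max_fraction}
    return {gene: terms & kept for gene, terms in annotations.items()}


# Hypothetical annotations: "broad" covers 100% of genes,
# "medium" 50% and "narrow" 25%.
annotations = {
    "geneA": {"broad", "medium", "narrow"},
    "geneB": {"broad", "medium"},
    "geneC": {"broad"},
    "geneD": {"broad"},
}

localizations_50 = filter_broad_terms(annotations, 0.5)
localizations_30 = filter_broad_terms(annotations, 0.3)
print(localizations_50["geneA"], localizations_30["geneA"])
```

A term sitting exactly at the threshold is kept here, since the rule was to exclude terms annotated to more than the given fraction of genes.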

As expected, more specific mRNA localization terms (localizations 30) are more informative for prediction of functional association, since fewer terms are required to obtain the same or higher likelihood of interaction. The increased likelihood does not come at the cost of fewer annotated pairs. For example, there is a similar number of gene pairs in bin "10-14" of the more specific localization terms (localizations 30) as in bin ">20" for all localization terms (see figure below).
It is important to keep in mind that mRNA localization alone is a very poor predictor of genetic or physical interaction. I took the number of co-localizations of each pair (using the terms in "localizations 30") and plotted a ROC curve to determine the area under the ROC curve (AROC or AUC). The AROC value calculated was 0.54, with a 95% confidence lower bound of 0.52 and a p value of 6E-7 for the null hypothesis that the true area is 0.5. So it is not random (that would be 0.5) but by itself it is a very poor predictor.
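For readers who want to reproduce this kind of number, here is a small sketch of how an area under the ROC curve can be computed from co-localization counts. It uses the pairwise (Mann-Whitney) formulation, which is not necessarily how the value above was obtained, and the scores and labels are invented. It does not reproduce the confidence bound or the p value.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen interacting pair has
    a higher score than a randomly chosen non-interacting pair, with
    ties counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: co-localization count per gene pair and whether
# the pair is known to interact (1) or not (0).
scores = [3, 2, 2, 1, 1, 1]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```

The double loop is quadratic in the number of pairs, so for a real data set a rank-based implementation from a statistics library would be the practical choice.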

In summary:
1) The degree of mRNA co-localization significantly correlates with the likelihood of genetic or physical association.
2) Less ubiquitous mRNA localization patterns should be more informative for interaction prediction.
3) The degree of mRNA co-localization is by itself a poor predictor of interaction, but it should be possible to use this information to improve statistical methods to predict genetic/physical interactions.

This was a quick analysis, not thoroughly tested and just meant to confirm that mRNA localization should be useful for genetic/physical interaction predictions. I am not going to pursue this, but if anyone is interested I suggest it could be interesting to see which terms have more predictive power, with the idea of integrating this information with other data sources or possibly directing future localization studies. Perhaps there is little point in tracking different developmental stages, or maybe embryonic localization patterns are not as informative as sub-cellular localizations for predicting functional association.


Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science. 2003 Oct 17;302(5644):449-53.
Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, Kalyana-Sundaram S, Ghosh D, Pandey A, Chinnaiyan AM. Probabilistic model of the human protein-protein interaction network. Nat Biotechnol. 2005 Aug;23(8):951-9.
Zhong W, Sternberg PW. Genome-wide prediction of C. elegans genetic interactions. Science. 2006 Mar 10;311(5766):1481-4.

Tuesday, May 15, 2007

Protein evolution

What constrains and determines the rate of protein evolution? This topic has received a great deal of attention in bioinformatics. Many reports have found significant correlations between protein evolutionary rate and expression levels, codon adaptation index (CAI), protein interactions (see below), protein length, protein dispensability and centrality in protein interaction networks. To complicate matters further, there are known cross-correlations between some of these factors. For example, it has been observed that the number of protein interactions correlates (weakly) with protein length and with the probability that a protein is essential to the cell.

This highlights the importance of thinking about the amount of variance explained by a correlation and of controlling for possible cross-correlations. In fact it has been shown that, when controlling for gene expression, some of the other factors have a weaker correlation (or none at all) with the rate of protein evolution (Csaba Pál et al 2003). Using principal component regression, Drummond and colleagues have shown that a single component dominated by expression, CAI and protein abundance accounted for 43% of the variance of the non-synonymous mutation rate (dN). The other known factors account for only a few percent of the observed variance in dN.
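To make the idea of controlling for a confounding factor concrete, here is a sketch using a first-order partial correlation. This is a simpler stand-in and not the principal component regression used by Drummond and colleagues. The numbers are invented and were built so that the number of interactions and dN are both driven by expression only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_r(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical values in arbitrary units, one entry per protein.
expression = [1, 2, 3, 4, 5, 6, 7, 8]
n_interactions = [2, 1, 2, 5, 6, 5, 6, 9]
dn = [10, 9, 6, 5, 4, 3, 4, 3]

raw = pearson_r(n_interactions, dn)
controlled = partial_r(n_interactions, dn, expression)
print(round(raw, 2), round(abs(controlled), 2))
```

In this toy case the strong raw correlation between interactions and dN vanishes once expression is accounted for, which is the kind of effect the studies above were checking for.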

Two questions might come to mind when thinking about these observations. One is why expression values, CAI and protein abundance would constrain protein evolution. The other is why the number of protein interactions explains so little (or none at all) of the variance in protein evolutionary rates. Intuitively, the number of protein interactions is related to the functional density of a protein, and proteins with high functional density should have a lower dN.

Drummond and colleagues proposed in a PNAS paper an explanation for the first question. They first list three possible reasons why expression levels should have such a strong effect on protein evolution: functional loss, translational efficiency and translational robustness. The functional loss hypothesis, postulated by Rocha and Danchin, holds that highly expressed proteins have lower dN because they are under strong selection to minimize the impact of mistranslation, which would create a large pool of inefficient proteins and reduce the fitness of the cells. A second hypothesis, proposed by Akashi, links protein evolutionary rates with gene expression through the efficiency of translation. Highly expressed proteins have optimal codon usage for efficient translation and therefore a lower dN and dS. Drummond and colleagues added a third hypothesis that they called translational robustness. Given the costs of misfolding and aggregation, the higher the number of translation errors that might lead to misfolding and aggregation, the higher the cost for the cell. Therefore there might be strong selection for keeping highly expressed genes robust against mistranslation.

The difference between translational robustness and functional loss is that the first implies that the number of translation events is the important factor, while the second puts the emphasis on protein concentration. Using protein abundance and mRNA expression the authors showed that translational robustness seems to be the most important factor determining the rate of protein evolution.

In fact, a recent paper (Tartaglia et al, 2007) reported a correlation between in vitro aggregation rates and in vivo expression levels. Highly expressed proteins tend to have a lower aggregation rate measured in vitro (r=0.97, N=12). The number of proteins analyzed was small and the aggregation rates were not always obtained under the same conditions, but it does fit with the translational robustness hypothesis.

Even if the number of translational events is such a strong constraint, one would expect that, after accounting for it, one would still see an effect of functional density on protein evolution. Yet the correlation between a proxy for functional density - number of protein interactions - and dN has been under strong debate (yes there is, no there isn't, yes, no, yes, maybe, ...).

The answer to this dispute might in the end be that the number of protein interactions is not a good proxy for functional density. A protein might have many protein interactions using a single interface. This is why the work of Kim and colleagues from the Gerstein lab is important. Using structural information they predicted the most likely interface for protein interactions in S. cerevisiae. They could then show that protein evolutionary rate correlates better with adjusted interface surface area than with the number of protein interactions. Also, the relationship between interface surface area and evolutionary rate appears to be independent of protein expression level.

The overall picture so far seems to be that translational robustness is the main driving force shaping protein evolutionary rates. Functional constraints are also important but are much more localized, explaining a smaller fraction of the overall variance for whole proteins.

Where can we go further? As I mentioned above, translational robustness predicts that expression levels should correlate with overall stability, designability (the number of sequences that fit the structure) and avoidance of aggregation-prone sequences. Bloom and colleagues have shown that the density of inter-residue contacts (a proxy for designability) does not correlate with expression, but the study was limited to roughly 200 proteins so this might not be the final answer.

So, a clear hypothesis is that a computational measure that sums a protein's stability, tendency for aggregation and designability should correlate with gene expression levels.

Further reading:
An integrated view of protein evolution (Nature Reviews Genetics)

Friday, January 26, 2007

Not so silent mutations


DNA mutations that do not change the encoded amino acid are often referred to as "silent mutations", or synonymous mutations, because they are less likely to result in a change in function. Synonymous mutations are often considered to be evolutionarily neutral, and the ratio of non-synonymous substitutions (Ka) to synonymous substitutions (Ks) is used to study sequence evolution. It can be used, for example, to search for DNA regions targeted by selection (see review and a practical application).
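As an illustration of the distinction, the sketch below classifies codon differences between two aligned coding sequences as synonymous or non-synonymous using the standard genetic code. It is only a raw count of differing codons and not a Ka/Ks estimate, which also normalizes by the number of synonymous and non-synonymous sites and corrects for multiple substitutions. The sequences and function name are made up.

```python
from itertools import product

# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, non_synonymous) counts of differing codons."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("sequences must be aligned, in frame and equal length")
    synonymous = non_synonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


# Hypothetical sequences: ATC -> ATT keeps isoleucine (synonymous),
# GGA -> GCA changes glycine to alanine (non-synonymous).
differences = count_codon_differences("ATGATCGGA", "ATGATTGCA")
print(differences)
```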


In the last issue of Science Kimchi-Sarfaty and colleagues found a synonymous mutation in a transport protein that has an effect on the protein's function. They have shown, at least in cell lines, that the mutation affects neither mRNA levels nor the produced protein sequence. Finally, the authors showed that the mutation might change the protein's conformation by comparing the sensitivity of the wild-type and mutant proteins to trypsin digestion.

The authors speculate that the usage of that particular codon, even if it does not change the encoded amino acid, might change the translation rate and the folding of the protein. It had already been shown in E. coli that synonymous mutations can affect the in vivo folding of a protein. Here the authors have shown a case where a silent mutation can change the substrate specificity of a transporter.

Because of these codon preferences it is important to adjust for codon selection pressures when studying synonymous substitutions. Codon preferences are usually considered to be due to differences in the pool of cognate tRNAs, but other studies have shown that codon bias might also arise from codon context. In E. coli, codon pair preferences were observed to affect in vivo translation. Also, these codon pair preferences are species specific and are, at least in part, influenced by nucleotide positions within A-site tRNA sequences.
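One simple way to look for codon pair preferences is to compare how often two codons appear next to each other with how often that would be expected from their individual frequencies. The sketch below does this for a single made-up reading frame. It is a simplification: the function name and sequence are hypothetical, and published codon pair analyses use more careful normalizations (for example for amino acid pair frequency).

```python
from collections import Counter


def codon_pair_ratios(seq):
    """Observed/expected ratio for each adjacent codon pair in one ORF."""
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    pairs = list(zip(codons, codons[1:]))
    codon_counts = Counter(codons)
    pair_counts = Counter(pairs)
    n_codons, n_pairs = len(codons), len(pairs)
    ratios = {}
    for (first, second), observed in pair_counts.items():
        # Expected count if neighbouring codons were chosen independently.
        expected = (
            (codon_counts[first] / n_codons)
            * (codon_counts[second] / n_codons)
            * n_pairs
        )
        ratios[(first, second)] = observed / expected
    return ratios


# Hypothetical reading frame alternating between two codons.
ratios = codon_pair_ratios("GCTGAAGCTGAAGCTGAA")
print(ratios)
```

A ratio above 1 means the pair occurs more often than expected from single codon usage, and a ratio below 1 means it is avoided. Real analyses would pool counts over many genes before drawing any conclusion.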

Hypothesis: If codon pairs can be selected due to tRNA structural constraints on the ribosome P and A sites, then it might be necessary to correct for these codon pair preferences when studying synonymous mutations.