Tuesday, May 15, 2007

Protein evolution

What constrains and determines the rate of protein evolution? This topic has received a great deal of attention in bioinformatics. Many reports have found significant correlations between protein evolutionary rate and expression levels, codon adaptation index (CAI), protein interactions (see below), protein length, protein dispensability and centrality in protein interaction networks. To complicate matters further, there are known cross-correlations between some of these factors. For example, it has been observed that the number of protein interactions correlates with protein length (weakly) and with the probability that a protein is essential to the cell.

This highlights the importance of thinking about the amount of variance explained by each correlation and of controlling for possible cross-correlations. In fact, it has been shown that, when controlling for gene expression, some of the other factors have a weaker correlation (or none at all) with the rate of protein evolution (Csaba Pál et al 2003). Using principal component regression, Drummond and colleagues have shown that a single component dominated by expression, CAI and protein abundance accounted for 43% of the variance of the non-synonymous substitution rate (dN). The other known factors account for only a few percent of the observed variance in dN.
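To make "controlling for gene expression" concrete, here is a minimal sketch of a partial correlation, which is one common way to do it and not necessarily the method used in the papers cited above. The data are synthetic and invented for illustration: both the number of interactions and dN are made to depend on expression but not on each other, and the names `partial_corr`, `interactions` and `dn` are my own.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing z out of both."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    design = np.column_stack([np.ones_like(z), z])  # intercept + control variable
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

# Synthetic data (an assumption, not real measurements): expression drives
# both variables, and there is no direct link between interactions and dN.
rng = np.random.default_rng(0)
n = 2000
expression = rng.normal(size=n)
interactions = 0.6 * expression + rng.normal(size=n)
dn = -0.7 * expression + rng.normal(size=n)

raw_r = np.corrcoef(interactions, dn)[0, 1]
partial_r = partial_corr(interactions, dn, expression)
print(f"raw r = {raw_r:.2f}, r controlling for expression = {partial_r:.2f}")
```

In this toy setting the raw correlation is clearly negative while the partial correlation is close to zero, which is the pattern one would expect if a factor only appears to constrain dN because it co-varies with expression.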

Two questions might come to mind when thinking about these observations. One is why expression values, CAI and protein abundance would constrain protein evolution. The other is why the number of protein interactions explains so little (or none at all) of the variance in protein evolutionary rates. Intuitively, the number of protein interactions is related to the functional density of a protein, and proteins with high functional density should have a lower dN.

Drummond and colleagues proposed an explanation for the first question in a PNAS paper. They first list three possible reasons why expression levels should have such a strong effect on protein evolution: functional loss, translational efficiency and translational robustness. Functional loss, postulated by Rocha and Danchin, hypothesizes that highly expressed proteins have lower dN because they are under strong selection to minimize the impact of mistranslation, which would create a large pool of inefficient proteins and reduce the fitness of the cells. A second hypothesis, proposed by Akashi, links protein evolutionary rates with gene expression through the efficiency of translation. Highly expressed genes have optimal codon usage for efficient translation and therefore a lower dN and dS. Drummond and colleagues added a third hypothesis that they called translational robustness. Given the costs of misfolding and aggregation, the higher the number of translation errors that might lead to misfolding and aggregation, the higher the cost for the cell. Therefore there might be strong selection for keeping highly expressed genes robust against mistranslation.

The difference between translational robustness and functional loss is that the first implies that the number of translation events is the important factor, while the second puts the emphasis on protein concentration. Using protein abundance and mRNA expression, the authors showed that translational robustness seems to be the most important factor determining the rate of protein evolution.

In fact, a recent paper (Tartaglia et al, 2007) reported a correlation between in vitro aggregation rates and in vivo expression levels. Highly expressed proteins tend to have a lower aggregation rate measured in vitro (correlation coefficient 0.97, N=12). The number of proteins analyzed was small and the aggregation rates were not always obtained under the same conditions, but the result does fit with the translational robustness hypothesis.

Even if the number of translational events is such a strong constraint, one would expect still to see an effect of functional density on protein evolution after accounting for it. Yet the correlation between a proxy for functional density - number of protein interactions - and dN has been under strong debate. (yes there is, no there isn't, yes, no, yes, maybe, ...)

The answer to this dispute might in the end be that the number of protein interactions is not a good proxy for functional density. A protein might have many protein interactions using a single interface. This is why the work of Kim and colleagues from the Gerstein lab is important. Using structural information, they predicted the most likely interface for protein interactions in S. cerevisiae. They could then show that protein evolutionary rate correlates better with adjusted interface surface area than with the number of protein interactions. Also, the relationship of evolutionary rate with interface surface area appears to be independent of protein expression level.

The overall picture so far seems to be that translational robustness is the main driving force shaping protein evolutionary rates. Functional constraints are also important but are much more localized, explaining a smaller fraction of the overall variance across whole proteins.

Where can we go further? As I mentioned above, translational robustness predicts that expression levels should correlate with overall stability, designability (the number of sequences that fit the structure) and avoidance of aggregation-prone sequences. Bloom and colleagues have shown that the density of inter-residue contacts (a proxy for designability) does not correlate with expression, but the study was limited to roughly 200 proteins, so this might not be the final answer.

So, a clear hypothesis is that a computational measure that sums a protein's stability, tendency for aggregation and designability should correlate with gene expression levels.
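One way such a measure could be put together is sketched below. This is only an illustration under my own assumptions: each property is standardized and the three are summed with equal weight, higher stability and designability values are taken to mean "more robust", aggregation propensity is subtracted, and a rank correlation is used for the comparison with expression. The function names and the five toy values are made up, not real measurements.

```python
import numpy as np

def zscore(values):
    """Standardize to zero mean and unit variance so properties are comparable."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()

def composite_score(stability, aggregation, designability):
    """Equal-weight sum; aggregation propensity counts against the score."""
    return zscore(stability) - zscore(aggregation) + zscore(designability)

def spearman(a, b):
    """Rank correlation (simple version: assumes no tied values)."""
    def rank(v):
        return np.argsort(np.argsort(v))
    return np.corrcoef(rank(a), rank(b))[0, 1]

# Toy values for five hypothetical proteins, chosen to be perfectly monotone.
stability = [1.0, 2.0, 3.0, 4.0, 5.0]
aggregation = [5.0, 4.0, 3.0, 2.0, 1.0]
designability = [2.0, 3.0, 5.0, 7.0, 11.0]
expression = [10.0, 20.0, 40.0, 80.0, 160.0]

score = composite_score(stability, aggregation, designability)
print(f"Spearman correlation with expression: {spearman(score, expression):.2f}")
```

With real data the interesting questions would be how to weight the three terms and whether the composite score explains more of the variance in expression than any single property alone.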

Further reading:
An integrated view of protein evolution (Nature Reviews Genetics)