Sunday, January 03, 2010

Stitching different web tools to organize a project

A little over a year ago I mentioned a project I was working on about the prediction and evolution of E3 ligase targets (aka P1). As I said back then, I am free to risk as much as I want in sharing ongoing results, and Nir London just asked me, via the comments of that blog post, how the project is going, so I decided to give a bit of an update.

Essentially, the project quickly deviated from its course once I realized that predicting E3 specificity and experimentally determining ubiquitylation sites in fungal species (without having to resort to strain manipulation) were not going to be easy tasks.
Since the goal was to use these data to study the co-evolution of phosphorylation switches (phosphorylation regulating ubiquitylation), it made little sense to restrict the analysis to a single form of post-translational modification (PTM). After a failed attempt to purify ubiquitylated substrates, the goal has been to come up with ways to predict the functional consequences of phosphorylation. We will still need to take ubiquitylation into account, but as part of the whole picture.

With this goal in mind we have been collecting data on phosphorylation, as well as other forms of PTM, for multiple species from databases and the literature, and we have been trying to come up with ways to predict the function of these phosphorylation events. These predictions can be broken down into three main types (see the sketch after this list):
- phosphorylation regulating domain activity
- phosphorylation regulating domain-domain interactions (globular domain interfaces)
- phosphorylation regulating linear motif interactions (phosphorylation switches in disordered regions)
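As a rough illustration of how these three categories can be represented, here is a minimal Python sketch; all names and fields are hypothetical and not the project's actual code:

    # Hypothetical sketch only; names and fields are illustrative.
    from dataclasses import dataclass
    from enum import Enum

    class PhosphoEffect(Enum):
        DOMAIN_ACTIVITY = "regulates domain activity"
        DOMAIN_INTERFACE = "regulates a domain-domain interaction"
        LINEAR_MOTIF = "creates/destroys a linear motif interaction"

    @dataclass
    class PhosphositePrediction:
        protein: str           # systematic gene name, e.g. "YGR218W"
        position: int          # residue position of the phosphosite
        residue: str           # "S", "T" or "Y"
        effect: PhosphoEffect  # one of the three types above
        score: float           # confidence of the prediction

    site = PhosphositePrediction("YGR218W", 123, "S",
                                 PhosphoEffect.LINEAR_MOTIF, 0.8)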

We have set up a notebook where we will be putting some of the results and ways to access the datasets. Any new experimental data and results from the analysis will be posted with a significant delay, both to give us some protection against scooping and to try to guarantee that we don't push out things that are obviously wrong. This brings us to a disclaimer: all data and analysis in that notebook should be considered preliminary and not peer reviewed; it probably contains mistakes and can change quickly.

I am currently collaborating with Raik Gruenberg on this project and we are open to collaborators who bring new skills to it. We are particularly interested in experimentalists working in cell biology and cell signalling who might be interested in testing some of the predictions coming out of this study.

I won't talk much (yet) about the results we have so far but instead mention some of the tools we are using or planning to use:
- The notebook of the project is hosted on OpenWetWare
- The datasets/files are shared via Dropbox
- If the need arises, code will be shared via Google Code (currently empty)
- Literature will be shared via a Zotero group library
- The papers and other items can be discussed in a Friendfeed group

That will be all for now. I think we are getting interesting results from this analysis of the evolution of the functional consequences of phosphorylation events, but we will update the notebook when we are a bit more confident that we have ruled out most of the potential artifacts. I think the hardest part about exposing ongoing projects is having to explain to potential collaborators that we intend to do so. This still scares people away.

I'll end with a pretty picture. This is an image of a homology model of the Tup1-Hhf1 interaction. Highlighted are two residues that are predicted by the model to be in the interface and that are phosphorylated in two different fungal species. This exemplifies how the functional consequence of a phosphorylation event can be conserved although the individual phosphorylation sites (apparently) are not.


Thursday, December 17, 2009

Name that lab ...

In the latest editorial in Nature, the need for an author ID is introduced with the simple notion that each one of us has a specific set of skills:
In his classic book Management Teams, UK psychologist Meredith Belbin used extensive empirical evidence to argue that effective teams require members who can cover nine key roles. These roles range from the creative 'plants' who generate novel ideas, to the disciplined 'implementers' who turn plans into action and the big-picture 'coordinators' who keep everyone working together.
From this perspective the author ID is a tool that might help us get appropriate credit for skill sets that are currently undervalued. This sort of argument reminds me of a discussion I have had several times in the past about the management structure of academic labs. Why is it that we have a single leader in each lab who has to handle all sorts of different management tasks? Is it ego? That we all need to have our own lab, named after ourselves?

It does not take long to notice that all supervisors have their strengths and weaknesses, and we talk about this openly. Some are better at grant writing, some have good people skills and keep the lab well balanced, and a few (rare ones :) still know what they are talking about when they help you troubleshoot your method/protocol. If it were possible to have the same person do all these things, companies would not have come up with their more complicated management structures.

So why is it that we name our labs after ourselves and do a poor job of management, instead of having multiple PIs handle different aspects of a lab that is named after what it actually studies?

Friday, August 21, 2009

PLoS Currents - rapid dissemination of knowledge

PLoS recently unveiled an initiative called PLoS Currents. It is an experiment in the rapid dissemination of research built on top of Google Knol. Essentially, a community of people dedicated to a specific topic can use PLoS Currents to describe their ongoing work before it is submitted to a peer-reviewed journal. They have focused their initial efforts on influenza research, where the speed of dissemination of information might be crucial.

The content of PLoS Currents: Influenza is not peer reviewed but is moderated by a panel of scientists who will strive to keep the content on topic. There is a FAQ explaining the initiative in more detail. These articles are archived and citable, they can be revised, and they should not be considered peer-reviewed publications. For this reason, PLoS encourages authors to eventually submit these works to a peer-reviewed journal. It remains to be seen how other publishers will react to submissions that are already available in these rapid dissemination portals.

PLoS Currents vs Nature Precedings
This initiative is somewhat related to preprint archives like Nature Precedings and arXiv. The main differences seem to be a stronger emphasis on community moderators and the use of 3rd-party technology (Google Knol). The community moderators, who I assume are researchers working on influenza, could be a decisive factor in ensuring that other researchers in the field at least know about the project. Using Google Knol lets PLoS focus on the community and hopefully get technical support from Google to develop new tools as they are needed. However, the website currently looks a little bit like a hack, which is the downside of using a 3rd-party technology. For example, we can click the edit button and see options to change the main website, although obviously the permissions do not allow us to save these changes.

I think it is an interesting experiment and hopefully more bio-related researchers will get comfortable with sharing and discussing ongoing research before publication. I still believe this would reduce wasteful overlaps. As usual, I only fear that experiments like these tend to fragment the critical mass required for such a community site to work.

Tuesday, August 11, 2009

Translationally optimal codons do not appear to significantly associate with phosphorylation sites

I recently read an interesting paper about codon bias at structurally important sites that sent me on a small detour from my usual activities. Tong Zhou, Mason Weems and Claus Wilke described how translationally optimal codons are associated with structurally important sites in proteins, such as the protein core (Zhou et al. MBE 2009). This work is a continuation of the work from this same lab on what constrains protein evolution. I have written a short review of the literature on the subject here before. As a reminder, it was observed that expression level is the strongest constraint on a protein's rate of change, with highly expressed genes coding for proteins that diverge more slowly than lowly expressed ones (Drummond et al. MBE 2006). It is currently believed that selection against translation errors is the main driving force restricting this rate of change (Drummond et al. PNAS 2005, Drummond et al. Cell 2008). It has previously been shown that translation errors are introduced, on average, at a rate of about 1 to 5 per 10,000 codons and that different codons can differ in their error rates by 4- to 9-fold, influenced by translational properties like the availability of their tRNAs (Kramer et al. RNA 2007).

Given this background, what Zhou and colleagues set out to do was test whether codons that are associated with highly expressed genes tend to be over-represented at structurally important sites. The idea is that such codons, defined as "optimal codons", are less error prone and should therefore be preferred at positions that, when mistranslated, could destabilize the protein. In this work they defined a measure of codon optimality as the odds ratio of codon usage between highly and lowly expressed genes. Without going into many details, they showed, in different ways and for different species, that codon optimality is indeed correlated with the odds of being at a structurally important site.
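For illustration, here is a minimal Python sketch of a score in this spirit. It uses a log frequency ratio with pseudocounts, which is my simplification and not necessarily the exact definition used in the paper:

    # Sketch: codon "optimality" as the log ratio of codon frequencies
    # between highly and lowly expressed genes (the pseudocount is my
    # own simplification, not the paper's definition).
    from collections import Counter
    import math

    def codon_counts(seqs):
        """Count codons over a list of in-frame coding sequences."""
        counts = Counter()
        for s in seqs:
            counts.update(s[i:i + 3] for i in range(0, len(s) - 2, 3))
        return counts

    def codon_optimality(high_seqs, low_seqs, pseudo=1.0):
        high, low = codon_counts(high_seqs), codon_counts(low_seqs)
        n_high, n_low = sum(high.values()), sum(low.values())
        return {c: math.log(((high[c] + pseudo) / n_high) /
                            ((low[c] + pseudo) / n_low))
                for c in set(high) | set(low)}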

I decided to test whether I could also see a significant association between codon optimality and sites of post-translational modification. I defined a window of plus or minus 2 amino acids around each (S. cerevisiae) phosphorylation site as associated with post-translational modification. The rationale is that selection for translational robustness could constrain codon usage near a phosphorylation site when compared with other serine or threonine sites. For simplicity I mostly ignored tyrosine phosphorylation, which in S. cerevisiae is a very small fraction of the total phosphorylation observed to date.
For each codon I calculated its over-representation in these phosphorylation windows compared to similar windows around all other S/T sites, and plotted this value against the log of the codon optimality score calculated by Zhou and colleagues.
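Here is a sketch of that window comparison (my reconstruction, not the original script); in practice the counts are aggregated over all phosphoproteins, and the per-codon enrichment is then the same log frequency ratio as in the optimality sketch above, with phosphosite windows as foreground and the other S/T windows as background:

    # Sketch: collect codons in +/-2 residue windows around given sites.
    from collections import Counter

    def window_codons(cds, centers, w=2):
        """Codons within +/-w residues of each center (0-based AA index)."""
        counts, n_aa = Counter(), len(cds) // 3
        for c in centers:
            for i in range(max(0, c - w), min(n_aa, c + w + 1)):
                counts[cds[3 * i:3 * i + 3]] += 1
        return counts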
Figure 1 - Over-representation of optimal codons at phosphosites
At first glance it would appear that there is a significant correlation between codon optimality and phosphorylation sites. However, as I will try to describe below, this is mostly due to differences in gene expression. Given the relatively small number of phosphorylation sites per protein, it is hard to test this association for each protein independently, as was done by Zhou and colleagues for the structurally important sites. The alternative is therefore to try to take the differences in gene expression into account. I first checked whether phosphorylated proteins tend to be coded by highly expressed genes.
Figure 2 - Distribution of gene expression of phosphorylated proteins

In figure 2 I plot the distribution of gene expression for phosphorylated and non-phosphorylated proteins. There is only a very small difference, with phosphoproteins having a marginally higher median gene expression than other proteins. The difference is small and a KS test does not rule out that the two samples are drawn from the same distribution.

The next possible expression-related explanation for the observed correlation would be that highly expressed genes tend to have more phosphorylation sites. Although there is no significant correlation between expression level and the absolute number of phosphorylation sites, I observed that highly expressed proteins tend to be smaller. This means that there is a significant positive correlation between the fraction of phosphorylated serine and threonine sites and gene expression.
Figure 3 - Expression level correlates with fraction of phosphorylated ST sites

Unfortunately, I believe this correlation explains the result observed in figure 1. To properly control for it, I recalculated the correlation of figure 1 after randomizing the phosphorylation sites within each phosphoprotein. For comparison, I also randomized the phosphorylation sites keeping the total number of phosphorylation sites fixed but without restricting the number of phosphorylation sites within each specific phosphoprotein.

Figure 4 - Distribution of R-squared for randomized phosphorylation sites

When randomizing the phosphorylation sites within each phosphoprotein, keeping the number of phosphorylation sites in each specific phosphoprotein constant, the average R-squared is higher than that observed with the experimentally determined phosphorylation sites (pink curve). This would mean that the correlation observed in figure 1 is not due to functional constraints acting on the phosphorylation sites but is instead probably due to the correlation observed in figure 3 between expression level and the fraction of phosphorylated S/T residues.
The observed correlation would appear to be significantly higher than random if we allowed the random phosphorylation sites to be drawn from any phosphoprotein without constraining the number of phosphorylation sites in each specific protein (blue curve). I added this because I thought it was a striking example of how a relatively subtle change in assumptions can change the significance of a score.
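For reference, this is roughly how the two randomization schemes differ (a sketch of the approach; implementation details are my assumptions):

    # Sketch of the two null models: shuffle sites among each protein's
    # own S/T positions (per-protein scheme, pink curve) or across the
    # pooled S/T positions of all phosphoproteins (blue curve).
    import random

    def randomize_within(proteins):
        """proteins: {name: (st_positions, n_psites)}."""
        return {p: random.sample(st, k) for p, (st, k) in proteins.items()}

    def randomize_pooled(proteins):
        pool = [(p, s) for p, (st, _) in proteins.items() for s in st]
        total = sum(k for _, (_, k) in proteins.items())
        sites = {p: [] for p in proteins}
        for p, s in random.sample(pool, total):
            sites[p].append(s)
        return sites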

I also tested whether conserved phosphorylation sites tend to be coded by optimal codons when compared with non-conserved phosphorylation sites. For each phosphorylation site I summed the codon optimality over a window around the site and compared the distribution of this sum for phosphorylation sites that are conserved in zero, one or more than one species. Conservation was defined based on an alignment window of +/- 10 AAs of S. cerevisiae proteins against orthologs in C. albicans, S. pombe, D. melanogaster and H. sapiens.
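A sketch of this calculation (the window size for the optimality sum is my assumption; the +/- 10 AA window above refers to the alignments):

    # Sketch: sum codon optimality around each phosphosite and bin the
    # sums by how many species conserve the site.
    def window_optimality(opt, cds, site, w=2):
        """Sum of per-codon optimality in a +/-w residue window."""
        n_aa = len(cds) // 3
        return sum(opt.get(cds[3 * i:3 * i + 3], 0.0)
                   for i in range(max(0, site - w), min(n_aa, site + w + 1)))

    def group_by_conservation(scored_sites):
        """scored_sites: iterable of (window_score, n_species_conserved)."""
        groups = {0: [], 1: [], 2: []}   # key 2 stands for "more than one"
        for score, n in scored_sites:
            groups[min(n, 2)].append(score)
        return groups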
Figure 5 - Distribution of codon optimality scores versus phospho-site conservation

I observe a higher sum of codon optimality for conserved phosphorylation sites (fig 5A), but this difference is not maintained if the codon optimality score of each peptide is normalized by the expression level of the source protein (fig 5B).

In summary, when gene expression levels are taken into account, there does not appear to be an association between translationally optimal codons and the regions around phosphorylation sites. This is consistent with the weak functional constraints observed in the analysis performed by Landry and colleagues.

Saturday, August 01, 2009

Drug synergies tend to be context specific

A little over a year ago I mentioned a paper published in MSB on how drug combinations could be used to study pathways. Recently, some of the same authors have published a study in Nature Biotech analyzing drug combinations in different contexts (i.e. different tissues, different species, different outputs, etc).

The underlying methodology of the study is essentially the same as in the above-mentioned paper. The authors study the effect of combining drugs on specific phenotypes. One example of a phenotype could be the inhibition of growth of a pathogenic strain. Different concentrations of two drugs are combined in a matrix, as described in figure 1a (reproduced below), and the phenotype is measured for each case. Two drugs are said to be synergistic if the measured impact of the combined drugs on the phenotype is greater than expected from a neutral model.
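To make the idea concrete, here is a toy version of such a scoring, using Bliss independence as the neutral model; the reference models actually used in the paper differ in detail:

    # Toy synergy score: measured inhibition minus the Bliss-independence
    # expectation; positive excess indicates synergy.
    import numpy as np

    def bliss_excess(combo, effect_a, effect_b):
        """combo[i, j]: fractional inhibition at dose i of drug A with
        dose j of drug B; effect_a/effect_b: single-agent inhibitions
        at those doses (all values in [0, 1])."""
        ea = np.asarray(effect_a)[:, None]
        eb = np.asarray(effect_b)[None, :]
        expected = ea + eb - ea * eb
        return np.asarray(combo) - expected

The excess over the whole dose matrix can then be summarized (e.g. summed where positive) and compared between test and control phenotypes to give a selectivity score.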
The authors ask whether drug synergy is context dependent. This is an important question for combinatorial therapeutics, since we would like to have treatments that are context dependent (i.e. specific). The most straightforward example would be drug treatments against pathogens. Ideally, combinations of drugs would act synergistically against the pathogen but not against the host. Another example would be drug combinations targeting the expression of a particular gene (e.g. TNF-alpha) without showing synergy against general cell viability.

To test this, the authors performed simulations of E. coli metabolism growing under different conditions and an astonishing panel of ~94,000 experimental dose matrices covering several different types of therapeutic conditions. In each experiment, two drugs are tested against a control and a test phenotype, and the synergy is measured and compared. The results are summarized as the synergy of the two drugs in the test case and the selectivity of this synergy towards the test phenotype. In other words, for each experiment the authors tested whether drug pairs that are synergistic in the test phenotype (e.g. inhibition of growth of the pathogen) also act in synergy on the control phenotype (e.g. inhibition of growth of host cells).
Above I reproduce fig 2b, with the results from the flux balance simulations of E. coli metabolism. In these simulations "drugs" were implemented as ideal enzyme inhibitors that reduce the flux through their targets. Each cross on this figure represents a "drug" pair targeting two enzymes of E. coli metabolism. The test and control phenotypes are, in this case, fermentation versus aerobic conditions. In this plot the authors show that drug pairs that are synergistic under fermentation tend to have a high selectivity for that condition when compared to aerobic conditions.

The authors then went on to show that this was also the case for most of the experimental conditions studied. Some of the experimental cases included cell lines derived from different tissues, highlighting the complexity of drug interactions in multicellular organisms. These results are consistent with the observation that negative genetic interactions are poorly conserved across species (Tischler et al. Nat Genet. 2008, Roguev et al. Science 2008). Although these results are promising with respect to the usefulness of combinatorial therapeutic strategies, they emphasize the degree of divergence of cellular interaction networks across species and perhaps even tissues. I am obviously biased, but I think that fundamental studies of chemogenomics across species will help us better understand the potential of combinatorial therapeutics.

There are several examples in this paper of specific, interesting drug synergies, but most of the results are in the supplementary materials. Given that most of the authors are affiliated with a company, I expect that there will be little real therapeutic value in the data. Still, it looks like an interesting set for anyone interested in studying drug-gene networks.

Lehár, J., Krueger, A., Avery, W., Heilbut, A., Johansen, L., Price, E., Rickles, R., Short III, G., Staunton, J., Jin, X., Lee, M., Zimmermann, G., & Borisy, A. (2009). Synergistic drug combinations tend to improve therapeutically relevant selectivity. Nature Biotechnology, 27(7), 659-666. DOI: 10.1038/nbt.1549

Friday, June 26, 2009

Reply: On the evolution of protein length and phosphorylation sites

Lars just pointed out in a blog post that the average protein length of a group of proteins is a strong predictor of the group's average number of phosphorylation sites. Although intuitive, this is something I honestly had not fully considered. As Lars mentions, it has potential implications for some of the calculations in our recently published study on the evolution of phosphorylation in yeast species.

One potential concern relates to figure 1a. We found that, although protein phosphorylation appears to diverge quickly, the relative number of phosphosites per protein is highly conserved for different GO groups. Lars suggests that, at least in part, this could be due to relative differences in the average protein size of these groups, which is itself highly conserved across species.

To test this hypothesis more directly, I tried to correct for differences in the average protein size of different functional groups by calculating the average number of phosphorylation sites per amino acid instead of psites per protein. These values were then normalized by the average number of phosphorylation sites per AA in the proteome.
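This is roughly the calculation (my notation, assuming per-group totals; not the actual analysis script):

    # Sketch: psites per amino acid for each GO group, relative to the
    # proteome-wide rate.
    def relative_psites_per_aa(groups, proteome_psites, proteome_aa):
        """groups: {go_term: (n_psites, total_length_in_aa)}."""
        baseline = proteome_psites / proteome_aa
        return {go: (n / aa) / baseline for go, (n, aa) in groups.items()}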

As before, there is still a high cross-species correlation of the average number of psites per amino acid for different GO groups. The correlations are only somewhat smaller than before. The individual correlation coefficients among the three species changed from: S. cerevisiae versus C. albicans – R~0.90 to 0.80; S. cerevisiae versus S. pombe – R~0.91 to 0.84; S. pombe versus C. albicans – R~0.88 to 0.83. It would seem that differences in protein length explain only a small part of the observed correlations. The results in figure 1b are also not qualitatively affected by this normalization, suggesting that the observed differences are not due to potential changes in the average size of proteins. In fact, the number of amino acids per GO group is almost perfectly correlated across species.

Another potential concern relates to the sequence-based prediction of phosphorylation. As explained in the methods, one of the two approaches used to predict whether a protein was phosphorylated was a sum over multiple phosphorylation site predictors for the same sequence. Given the correlation shown by Lars, could it be that, at least for one of the methods, we are mostly predicting average protein size? To test this I normalized the phosphorylation prediction for each S. cerevisiae protein by its length. I re-tested the predictive power of this normalized value using ROC curves, with the known phosphoproteins of S. cerevisiae as positives. The AROC value changed from 0.73 to 0.68. This shows that the phosphorylation propensity is not just predicting protein size although, as expected from Lars' blog post, size alone is actually a decent predictor of phosphorylation (AROC=0.66). The normalized phosphorylation propensity does not correlate with protein size (CC~0.05), suggesting that there might be ways to improve the predictors we used.
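The check itself is simple; here is a sketch (using scikit-learn for the ROC computation, with hypothetical variable names):

    # Sketch: length-normalize each protein's summed phosphorylation
    # propensity and measure the AROC against known phosphoproteins.
    from sklearn.metrics import roc_auc_score

    def length_normalized_aroc(propensity, length, known_phospho):
        """propensity/length: {protein: value}; known_phospho: set of
        proteins used as positives."""
        prots = list(propensity)
        y_true = [p in known_phospho for p in prots]
        y_score = [propensity[p] / length[p] for p in prots]
        return roc_auc_score(y_true, y_score)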

Nature or method bias ?
Are larger proteins more likely to be phosphorylated in a cell, or are they more likely to be detected in a mass-spec experiment? It is likely that what we are observing is a combination of both effects, but it would be nice to know how much of this observed correlation is due to potential MS bias. I am open to suggestions for potential tests.
This is also important for what I am planning to work on next. A while ago I noticed that predictions of phosphorylation propensity could also predict ubiquitination and vice-versa. It is possible that they are mostly related through protein size. I will try to look at this in future posts.

Tuesday, June 23, 2009

Comparative analysis of phosphoproteins in yeast species

My first postdoctoral project has just appeared online in PLoS Biology. It is about the evolution of phosphoregulation in yeast species. This analysis follows from previous work I did during my PhD on the evolution of protein-protein interactions after gene duplication (paper / blog post). One of the conclusions of that work was that interactions of lower specificity, such as those mediated by short peptides, would be more prone to change. In fact, one of the protein domains that we found associated with high rates of change of protein-protein interactions was the kinase domain.
Given that the substrate specificity of a kinase is usually determined by a few key amino acids surrounding the target phosphosite, it is easy to imagine how kinase-substrate interactions can be created and destroyed with few mutations. It is also well known that these phosphorylation events can have important functional consequences. We therefore postulated that changes in phosphorylation are an important source of phenotypic diversity.

To test this, we collected in vivo phosphorylation sites for 3 yeast species (S. cerevisiae, C. albicans and S. pombe) by mass spectrometry. These were compared in order to estimate the rate of change of kinase-substrate interactions. Since changes in gene expression are generally regarded as one of the main sources of phenotypic diversity, we compared these estimates with similar calculations for the rate of change of transcription factor (TF) interactions with promoters. Depending on how we define divergence of phosphorylation, we estimate that kinase-substrate interactions change either at similar rates or at most 2 orders of magnitude more slowly than TF-promoter interactions.

Although these changes in kinase-substrate interactions appear to be fast, groups of functionally related proteins tend to maintain the same levels of phosphorylation across broad time scales. We could identify a few functional groups and protein complexes with a significant divergence in phosphorylation and we tried to predict the most likely kinases responsible for these changes.

Finally, we compiled recently published genetic interaction data for S. pombe (from Assen Roguev's work) and for S. cerevisiae (from Dorothea Fiedler's work), in addition to some novel genetic data produced for this work. We used this information to study the relative conservation of genetic interactions for protein kinases and transcription factors. We observed that both protein kinases and TFs show a lower than average conservation of genetic interactions.

We think these observations strongly support the initial hypothesis that divergence in kinase-substrate interactions contributes significantly to phenotypic diversity.

Technology opening doors
For me personally, it really feels like I was in the right place at the right time. Many of the experimental methods we used are still under heavy development, but I was lucky to be, quite literally, next door to the right people. I had the chance to collaborate with Jonathan Trinidad, who works at the UCSF Mass Spectrometry Facility directed by Alma Burlingame. I also arrived at a time when the Krogan lab, more specifically Assen Roguev (twitter feed), had been working to develop genetic interaction assays for S. pombe (Roguev A 2007). As we describe in the introduction, these technological developments really allow us to map out the functional and physical interactions of a cell at an incredible rate. What I am hoping is that they are soon seen in much the same light as genome sequencing. We can and should be using these tools to study groups of species simultaneously, and not just the usual model organisms that diverged from each other more than 1 billion years ago.

Evolution of signalling
There are many more protein interactions that are determined by short linear peptide motifs (Neduva PLoS Bio 2005). A large fraction of these determine protein post-translational modifications and are crucial for signal transduction systems. For the next couple of years I will try to continue studying the evolution of signal transduction systems. There are certainly many experimental and computational challenges to address. I am particularly interested in looking at co-regulation by combinations of post-translational modifications and their co-evolution. I will do my best to share some of that work here on the blog as it happens.

Thursday, June 11, 2009

HFSP fellows meeting (Tokyo 2009)

I spent last week in Japan attending the fellows meeting of the Human Frontier Science Program. I was fortunate enough to get a postdoc fellowship from HFSP to support my current interest in the evolution of signalling systems. The meeting took place in Tokyo and brought together people from all sorts of different fields and at different stages of their careers. This program funds postdocs but also provides funding to young investigators setting up their labs and for teams of PIs working on interdisciplinary projects.

This year marks the 20th anniversary of the program, which also coincides with a period of change in leadership. Ernst-Ludwig Winnacker, current Secretary General of the European Research Council, will take over the role of Secretary General of HFSPO from Torsten Wiesel. Also, Akito Arima will replace Masao Ito as president of HFSPO (press release). Probably because of this, the meeting had plenty of political moments and speeches. Thankfully, most of the people involved in this organization appear to be very lighthearted, so these moments were not a burden.

The curse of specialization?

A core focus of HFSP is to fund interdisciplinary projects that involve people from different areas or that help researchers significantly change their field of research. There was some time for discussions about the future of the organization as well as the future of "systems biology". For me personally, these debates helped to crystallize many of my own doubts. I am a biochemist but spent 90% of my PhD doing computational work. At this point I feel very much like a jack of all trades and master of none. In my previous work I have mostly hit walls due to lack of data, so I plan to spend the next few years learning a lot more about experimental work. Still, it is hard to be sure of what is best for the future. How much should I sacrifice in productivity to learn new skills? Is it better to work as a specialist in interdisciplinary teams or to be trained as an interdisciplinary person (Eddy SR, PLoS Comp Bio 2005)?

The broad scope of HFSP was well reflected in the topics presented at the meeting (PDF of program). There were many interesting talks, like the keynote by Takao Hensch about "How experience shapes the brain", in particular during the very early stages of life. He showed amazing work on "windows of opportunity" in learning and how these can be manipulated genetically or pharmacologically. Still, when I was looking around the poster session I could not help but feel a certain lack of interest, since most of the topics were outside my previous work experience. This brings me back to the topic of specialization. Isn't it upsetting that we have to specialize so much? I don't think I can read and enjoy more than a third of a typical issue of Nature. This is for me the curse of specialization: it narrows not only your skills but also your interests and curiosity.


Tokyo/Kyoto

Aside from the science, this was my first trip to Japan. I really liked it and hope to come back one day with more time to explore. I loved the temples, gardens, food, colors and all the differences.


Sunday, April 26, 2009

Guestimating PLoS ONE impact factor (Update)

Abhishek Tiwari did some analysis of the number of citations that PLoS ONE has received so far, using the Scopus database. We had a small discussion over the numbers on FriendFeed and I ended up looking at a different set of values, also from Scopus. I tried to predict the first Impact Factor for PLoS ONE, which might be out sometime this year.

Before showing the numbers I will repeat that I think the IF of the journal where a paper is published is a very poor measure of the paper's importance. Although it is probably a good measure of the relative value of a journal (within a given field), we should be striving to pick what we read based on the value of a paper instead of the journal.

The Impact Factors that will be published this year are calculated as the total number of citations in 2008 to papers published in 2006 and 2007, divided by the number of citable units published in 2006-2007 (articles and reviews). The data I am looking at is from Scopus, so it varies a bit from ISI's. The variability comes from the decision of what to include as "citable" articles and from the journals that are covered in Scopus versus ISI.

One problem I found with the Scopus data was that, for some journals, the database has multiple entries due to small variations in article titles. For PLoS Biology, PLoS Computational Biology and PLoS Genetics the number of articles published should be less than half of what is reported. This does not appear to be the case for PLoS ONE.
I downloaded the tables of published articles and tried to remove redundancies by looking at the titles and authors. I counted only articles and reviews as citable items but used all articles published in 2006-2007 to get the number of citations in 2008. I also did the same calculation for the previous year's impact factor, to be able to compare with the data from ISI. The results were comparable but not the same.
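The calculation itself is trivial; for reference (the numbers in the example are placeholders, not the actual Scopus counts):

    def impact_factor(citations, citable_items):
        """IF for year Y = citations in year Y to items published in
        Y-1 and Y-2, divided by the citable items in those two years."""
        return citations / citable_items

    # e.g. impact_factor(8600, 3000) -> ~2.87 (made-up numbers)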



In summary, PLoS ONE might get an impact factor of about half of that expected for PLoS Computational Biology. The usual disclaimers apply: I have no idea how complete the Scopus data is or how exactly it relates to ISI.


Update:
The official impact factor for PLoS ONE for 2008 is out and it is ~4.3. I underestimated it by 1.5. It is also amazing how many people search for this online; this post is the number one source of traffic to this blog. If you are reading this and typically sit on panels that decide on new faculty, please stop evaluating people by where they publish. That way, postdocs like me can focus on doing interesting science instead of trying to get into Nature/Cell/Science.

Sunday, March 22, 2009

Thank you Nature

A while ago Euan Adie from Nature asked for help categorizing comments in PLoS ONE for analysis. A lot of people took some time to read some of the comments, and the final results of this crowdsourcing effort were made available here. They randomly selected two of the contributing users to receive some Nature branded ... stuff. I was one of the two lucky recipients. It took a while, but it arrived today:

Thank you NPG for the kind gifts, next time .. white t-shirt ?! :)

Monday, November 17, 2008

Why do we blog?

Martin Fenner asked some questions of science bloggers on Nature Network that I think are interesting. Plus, the meme is going around my blogging neighbourhood, so I thought I would join in as well:

1. What is your blog about?
It is mostly about science and technology with a particular focus on evolution, bioinformatics and the use of the web in science.

2. What will you never write about?
I will never blog about blog memes like this one. I tend to stay away from religion and politics but never is a very strong word.

3. Have you ever considered leaving science?
Does this mean academic research, research in general or science in general? In any case, no. I love problem solving and the freedom of academic research. The only thing I dislike about it is not being sure that I can keep doing this for as long as I wish.

4. What would you do instead?
If I could not do research I would probably try to work in scientific publishing. Doing research usually means that we have to focus on a very narrow field. Editors, on the other hand, are almost forced to broaden their scope, and I think I would like that. I would also be interested in the use of new technologies in publishing.

5. What do you think science blogging will be like in 5 years?
Five years is a lot of time at the pace of technological development but not a long time for cultural change. I could be wrong but, if anything, there will be only a small increase in the adoption of blogging as part of personal and group online presence, alongside the already existing web pages. I wish blogging (and other tools) would be used to further decentralize research agendas from physical location, but I don't think that will happen in 5 years.

6. What is the most extraordinary thing that happened to you because of blogging?
I have gained a lot from blogging. The most concrete example was an invitation to attend SciFoo, but there are many other things that are harder to evaluate. In some ways the benefits are similar to those of attending conferences: you get to interact with other scientists and exchange ideas, and it forces you to think through different perspectives.

7. Did you write a blog post or comment you later regretted?
I probably did but I don't remember an example right now.

8. When did you first learn about science blogging?
Like many other bioinformatics bloggers I started blogging at Nodalpoint, according to the archives in November 2001. I started this blog some two years after that.

9. What do your colleagues at work say about your blogging?
Not much really, I don't think many of them are aware of it. If anything, the responses have been generally positive, but I don't usually find many people interested in knowing more about blogging in science.

Wednesday, November 12, 2008

Open Science - just do it

My blog is 5 years old today and to celebrate I am trying to actually do some blogging. There are a couple of reasons why I have blogged less in the past months. In part it was due to FriendFeed, and in part because I was trying to finish a project on the evolution of phospho-regulation in yeast species. Nearing the end of a project should actually provide some of the most interesting blogging material, but I had not asked for permission from everyone involved to write about ongoing work.

I have to admit that although I have been discussing and evangelizing open science for over two years, I have done very little of it. I have sometimes used this blog to put up small analyses or mini-reviews, but never to describe ongoing projects. I tried to start a side project online but over-estimated the amount of "spare cycles" I had for it. So, I have talked it over with my supervisor and I am now free to "risk" as much as I want in trying out open science. The first project will be on E3 target prediction and evolution.

Prediction and evolution of E3 ubiquitin ligase targets
As I mentioned above, I have been working in the past months on the evolution of phosphorylation and kinase-substrate interactions in yeast species. I am interested in the evolution of regulatory interactions in general because I believe they are important for the evolution of novel phenotypes. This is why I will be trying to study the evolution of E3-target interactions. To get there I will first try to develop some methods to predict ubiquitination and E3 targets. Since a lot of the ideas and methodology apply to other post-translational modifications and even localization signals, I will later try to generalize the findings to other types of interactions.

Some of the questions that I will try to address:
- How accurately can we predict E3 substrates?
- How quickly do E3-target interactions change in evolution?
- Is there co-regulation by kinases and E3s of the same targets (and how does it evolve)?

Once I have something substantial I will open a code repository on Google Code.

Tuesday, September 02, 2008

Books: long tails and crowds

I read two interesting books recently that relate to how the internet is changing businesses and society in general.


“The Long Tail” by Chris Anderson ends up suffering from its own success. I was so exposed to the long tail meme before reading the book that there were very few novel ideas left in it for me. The book describes the business opportunities that come from having near-unlimited shelf space. While physical stores are forced to focus on the big hits, long tail businesses sell those big hits but also all the other niche products that only a few people will be interested in. There is a big challenge in trying to guide users to the niche products they will find interesting. Anderson provides examples of recommendation and reputation engines from several companies (e.g. Amazon, iTunes, eBay) that by now most of us are familiar with. Even for those well exposed to log normal distributions and long tail businesses, the book is still worth getting as a resource and for the very interesting historical perspective on the origins of long tail businesses.

“Here Comes Everybody” is an excellent book by Clay Shirky that describes the huge decrease in the cost of group formation that we are currently living through. Through a series of stories, Shirky demonstrates how the internet facilitates group formation and how collective actions that were impossible before are now becoming the norm. His stories touch on things as simple as the photo collections on Flickr and as serious as the coordination of regime opposition in Belarus. I appreciate the somewhat neutral stance on the phenomenon. The book covers cases where online groups almost descend into a mob-like mentality, and others where groups of consumers were able to stand up to corporations to defend their rights. The outcome of easy group formation for the future of society is not easy to predict, and this is well conveyed in the book.

The subjects and stories of these books are also interesting for scientists because they can influence the way we work. Science is a long tail of knowledge, with many niche areas that only a few people in the world care about. The recommendation and reputation engines described could help us navigate the body of knowledge to find the bits that interest us the most. Also, easy group formation might one day shift the way we work, so that innovation and research are not determined by physical location but instead organized around research problems.

Tuesday, August 12, 2008

Freebase parallax

Freebase Parallax is a new browsing interface for Freebase. It allows the user to drill into and connect sets of objects to other sets of objects within Freebase, and to draw maps and graphs with the information. This really shows the power of having well-structured data available online. Here is a video describing how it works, with great examples of data mining:

Sunday, August 10, 2008

Post-publication journals

With the increase in the number of journals and articles published every year, and the possibility of an even larger set of "gray literature" becoming available online, we face the challenge of filtering out the bits of information that are relevant to us.

Let us define "perceived impact" as the subjective measure of importance that some bit of information holds for us as scientists. This information is typically an article, but the idea could later be applied to pre-prints and database entries in general.

Every one of us creates rules to select what to pay attention to from the constant stream of scientific output. We could picture this sorting process as a triangle, with a large base of very specific knowledge that is somewhat important to us and a small amount of more general but highly important content at the top. For the majority of scientists today, these sorting rules are based on journal topic (cell biology, physics, evolution, etc) and journal impact factor. Below the base we could place the gray literature that today is mostly out of sight and is not peer-reviewed.

With the advent of the web, and in particular the social aspects of this new medium, we should expect better than evaluating articles based on the quality of the journal they were published in. In the words of Eugene Garfield, the inventor of the impact factor:
“In order to shortcut the work of looking up actual (real) citation counts for investigators the journal impact factor is used as a surrogate to estimate the count. I have always warned against this use”. Eugene Garfield (1998)
Scientific publishing is now digital, with every article having a universal digital identifier (DOI). However, as an author I can get (for free) much more information about how people use the content of this blog than about the articles I have published. Information about the number of downloads, citations in other articles, and mentions in scientific blogs or bookmarking services could help us sort through information better than relying solely on journal editors (impact factors). We should be using the social web to re-sort articles after peer review to reflect our preferences:
How would we build such a personalized sorting system? In the words of the editor-in-chief of Nature:
(…) nobody wants to have to wade through a morass of papers of hugely mixed quality, so how will the more interesting papers in such an archive get noticed as such? Philip Campbell

It is obviously challenging to use some of the metrics mentioned above as signals to rank the importance of individual articles when they are so easy to game. On the other hand, some of them are already useful and working today. I already subscribe to RSS feeds from some users of Connotea who consistently bookmark articles that I find useful. Similarly, through FriendFeed I get recommendations of articles to read from people I trust. So, although I do not have a clear solution for how to build such a system, I think there is a need for it and there are clear ideas to try.
Here is something like a mind-map of what I think would work best, a mixture of the social recommendations of FriendFeed with the pure algorithmic ideas of Google News:


These ideas of sorting based on measures of usage are already being tested by the new Frontiers journals. These are a series of open access journals published by an international not-for-profit foundation based in Switzerland. Like PLoS ONE, these journals aim to separate the peer review of quality and scientific soundness from the more subjective evaluation of impact. In practice they are doing this by publishing research in a tiered system, with articles submitted to a set of specialty journals. The articles are evaluated based on the reading activity of the users, and the top 10% advance to the next tier journal.
So far Frontiers has started with neuroscience specialty journals and a single top-tier journal (Frontiers in Neuroscience), but if this is successful they could easily add other disciplines and a third tier of very general content on top. In order to contribute to the evaluation procedure, readers must fill out their profile. This information is taken into consideration, since usage metrics are weighted differently according to each user's expertise.
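As a toy illustration of the promotion step (the 10% fraction is from the description above; everything else, including the scoring, is made up):

    def promote(articles, fraction=0.10):
        """articles: {article_id: usage_score}; returns the ids of the
        top `fraction` of articles, to be moved up to the next tier."""
        ranked = sorted(articles, key=articles.get, reverse=True)
        return set(ranked[:max(1, int(len(ranked) * fraction))])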

Summary
No single individual wants to go through all the published literature to find the useful information, but together we effectively do just that. The challenge is how to evaluate specific articles by a combination of metrics, to promote them to wider audiences, in a way that is not easy to exploit. Kevin Kelly said recently in a TED talk that "the price of total personalization is total transparency". Would this bother scientists? Let's say that a few science publishers got together with some of the scientific social sites (social networks, bookmarking sites) to mimic the Frontiers model on a larger scale. Users would install a browser plugin that would link their scientific profile and social contacts with their reading activity. The publishers could then use this information to create personal reading hubs for users.

Saturday, August 09, 2008

BioBarCamp wrapup

In the last two days I attended the first BioBarCamp, here in the bay area at the Institute for the Future. There is a lot of microblogging coverage of the event on FriendFeed and even some recorded video from Cameron Neylon (click on demand and pick BioBarCamp).

The meeting was fun due to the unstructured nature of the event and also because I got to meet a lot of people I knew only from their blogs. Two highlights were the talks by Aubrey de Grey (see notes and also Cameron's video above) and Jon Trowbridge from Google, who talked about this.

There were four parallel discussions going on, but I stuck mostly with the open science and web tools related talks. There are a couple of ideas I take away from these discussions, which I will mention below, but in general they overlap with what Shirley already covered in her post.

Pragmatic steps for Open Science and web tool adoption
Kaitlin Thaney and Cameron Neylon talked about open science and data commons. Cameron in particular is making the case that we need to demand open data the same way we demand open access to science articles. Although publishers will say that they already ask for the availability of everything required to reproduce the results, the truth is that this is not really enforced. Funding agencies should provision funds to make raw results freely available for re-use once an article is accepted for publication.

On the side of web tools for science, Ricardo Vidal (OpenWetWare), Vivek Murthy (Epernicus), and Jeremy England and Mark Kaganovich (Labmeeting) discussed user adoption. Adoption rates among scientists tend to be slow and there is a large generational gap. Here again, pragmatic steps need to be taken to promote the usage of these tools in science. Some of the current problems include fragmentation of the user base, lack of focus in tool development, and too few security restrictions.

These tools should try to focus on solving a few important problems really well. Examples include finding the person in my network that might have some expertise I need, better ways to find relevant articles, or managing my lab notebook and article library. To reduce the fragmentation of the user base it would be great if these websites found a way to share the social graph.

Finally, the question of online privacy was revisited. The idea of having open lab notebooks that anyone can see (as in OpenWetWare) might be a bit too radical and put off users that want to try the tools without the risks associated with exposing their research online. As has been discussed elsewhere, there are advantages to electronic notebooks (easier to access, share with peers and back up), but very few people will risk having their lab notebooks freely available online. Allowing for privacy should therefore increase usage.

Sunday, July 27, 2008

Some backlash on Open Science

During ISMB, thanks to Shirley Wu (FF announcement), there was an improvised BoF (Birds of a Feather) session on web tools for scientists. Given that the session was not widely announced, we were not expecting a full room. I would say we had around 20 to 30 people who stayed at least for a while. We talked in general about tools that are useful in science (things like online reference managers, pre-print archives, community wikis, FriendFeed, Second Life) and also a bit about the culture of sharing and open science.

Curiously, the most interesting discussion I had about open science was not at this BoF session but after it. The following day the subject came up again in a conversation between me and three other people (two PhD students and a PI from a different lab). I will not identify them because I don't know whether they would like that. The most striking thing about this conversation was the somewhat instinctive negative reaction against open science on the part of the two PhD students. After a long discussion they made a few interesting arguments that I will mention below, but it was the first time I had seen someone react instinctively and negatively against the concepts of open science.

One of the students in particular argued that scientists sharing their results online (prior to peer review) is not only silly on their part (the scooping argument) but would be detrimental to science as a whole. The most concrete argument he offered was that seeing someone "stake a claim" to a research problem might scare other people away from even trying to solve it. I would say that it would be better to have people collaborating on the same research problems instead of the current scenario, where a lot of scientists waste years (of their time and resources) working in parallel without even knowing about it. He argued simply that some people might not want to collaborate at all and should be allowed to work that way. I don't think scientists should be forced to put their work online before peer review; I just happen to think that doing so would improve collaborations and decrease the current waste of resources.

The second argument against sharing research ideas and results prior to peer review was more consensual. They all mentioned the problem of noise and how difficult it already is to find relevant results in the peer-reviewed literature. They suggested that this problem would only grow if more people shared their ideas and results online. I fully agree that this is a problem, but not one related to open science. This is a sorting/filtering problem that is already important today, with the large increase in journals and published articles. We do need better recommendation and filtering tools, but sharing ideas and results in blogs/wikis/online project management tools is not going to seriously increase the noise, since these are all very easily separated from peer-reviewed articles. No one is forced to track shared projects, but if they are available it becomes that much easier to start a collaboration when and if it makes sense to do so. Are open source repositories detrimental to the software industry?

It took around three years from when people started discussing the ideas of open science and open notebooks for these concepts to get some attention. It is inevitable (and healthy) that as more people are exposed to a meme, more counter-arguments emerge. I guess a backlash only means that the meme is spreading.




Thursday, July 17, 2008

ISMB 2008


I am leaving soon for Toronto to attend ISMB 2008. I usually stay away from big conferences, since in small conferences it is typically easier to find the time to talk to everyone. The nice thing about attending a big conference is that it looks like everyone is there. There is no shortage of science bloggers attending and it is going to be nice to meet the people behind some of the blogs for the first time.

There is a room on FriendFeed where several attendees have gathered, and for those not going it will probably be a good place to check for coverage of the conference. Alternatively, here is a list of bloggers attending ISMB or some of the conferences before/after it:

Saturday, July 05, 2008

On the PLoS business model

Declan Butler wrote a news article about PLoS' business model that has generated a lot of discussion. A good summary of blog reactions is available from Bora's blog and there is a long thread of discussion on FriendFeed.

It is hard to read the piece as impartial reporting due to its generally negative undertone, describing PLoS ONE as a database and referring to PLoS ONE and other lower-impact PLoS journals as "bulk, cheap publishing of lower quality papers". I have nothing against the factual content of the news piece. From that perspective it is an interesting report on the PLoS business model. According to the story, PLoS is on track to become economically self-sustainable within two years. We learn that this is possible due to the expansion of PLoS as a publisher to cover a broader range of subjects and different degrees of perceived impact. This is hardly surprising. I wrote a year ago:
"On an author pays model, the most obvious way to limit the cost per paper and still provide a solid evaluation of perceived impact, is to have journals that cover the broad spectrum of perceived impact. In this way, for the publisher, the overall rejection rates decrease, the papers are evaluated and directed to the appropriate "level" of perceived impact."

Most people agree that in principle open access publishing would benefit science. Up until now, publishers have been reluctant to admit that there is a viable business model based on author fees. Some open access publishers (including BioMed Central) were already showing that the model can work, but PLoS will be the first to have a viable business model that includes high-impact-factor journals among the journals they publish.

Two of the most interesting comments on this discussion so far have come from Timo Hannay at Nascent and from Lars Juhl Jensen.
Timo argues that PLoS has failed to show that it is possible to have a business model for a publisher that only has journals of high editorial input (high rejection rates and high perceived impact), and also that the existence of PLoS creates a barrier to entry for other science publishers interested in publishing with an open access (OA) model. There is no argument against the first statement; so far I have not seen any publisher that has managed to reduce the costs of maintaining such "high impact" journals to the point where author fees would be sufficient. I think this is possible, and the PLoS Community journals are the closest to it, but that is another discussion.
Where I disagree with Timo is the claim that PLoS somehow creates barriers to entry for other OA publishers. PLoS did require (and still requires) philanthropic grants to establish itself, but pioneers typically have a harder time than creative followers. Anyone trying to follow PLoS has access to their record of successes and failures, detailed financial reports and (I think) even the publishing infrastructure that they have developed.

Most people know that the strongest barrier to entry in scientific publishing is the perception of quality. NPG has used this fact to their advantage many times: journals carrying the Nature brand typically establish themselves quickly at the top of their field. I am sure Nature invests a lot in excellent professional editors, but without the Nature brand supporting these journals there would be no flood of submissions to choose from in the first place. NPG also publishes many more journals than the Nature-branded ones and, as Lars has pointed out, the majority of these have lower impact factors. I don't think there is financial information available, so it is hard to know what fraction of NPG's income comes from the high impact versus the lower impact journals.

Going back to one of Timo's main points, I don't agree that PLoS creates barriers to market entry for other OA publishers, and certainly not because they used philanthropic grants until they reached the break-even point. If there are barriers in this market, they are due to perception of quality and strong brand names. Here OA publishers have the added advantage that creating a strong brand is easier when most people perceive OA as something good. From the example of PLoS, and to some extent BMC, there are now clear paths for any publisher (especially one with a strong brand name) to set up a viable OA business model.

Tuesday, July 01, 2008

Bioinformatics around the globe

Did you ever want to get a global impression of the field of bioinformatics? What types of tools do people use, and how different is the work in academia versus industry? Michael Barton from Bioinformatics Zen created a survey, running for the next month (until the 1st of August), that tries to address some of these questions. The more people complete the survey, the more informative the picture will be. The survey is anonymous and all information will be made available to those interested in analyzing it.
If you have a blog you can re-post the survey there (see instructions here), or send a link to any of the blog pages that host the survey to other bioinformatics/computational biology researchers.

Saturday, June 28, 2008

Capturing biology one model at a time

Mathematical and computational modeling is (I hope) a well accepted requirement in biology. These tools allow us to formalize and study systems of a complexity that is hard to grasp through verbal reasoning alone. There have been great advances in our capacity to model different biological systems, from single components to cellular functions and tissues. Many of these efforts have progressed separately, each dealing with a particular layer of abstraction (atoms, interactions, cells, etc.), and some are now reaching a level of accuracy that rivals some experimental methods. I will try to summarize, in a series of blog posts, the main advances behind some of these models and examples of integration between them, with particular emphasis on proteins and cellular networks. I invite others to post about models in their areas of interest, to be collected in a review.

From sequence to fold
RNA and proteins, once produced, adopt structures that have different functional roles. In principle, all the information required to determine the structure is in the DNA sequence that encodes the RNA/protein. Although there has been some success in predicting RNA structure from sequence, ab initio protein folding remains a difficult challenge (see the review by R. Das and D. Baker). A more pragmatic approach has been to use the increasing structural and sequence data available in public databases to develop sequence-based models for protein domains. In this way, for well studied protein folds, it is possible to ask the reverse question: what sequences are likely to fold this way?
(To be expanded in a future post, volunteers welcome)
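In the meantime, here is a toy illustration of that reverse question: a minimal sketch that builds a position-specific scoring matrix from an invented ungapped alignment of family members and scores a candidate sequence against it. Real domain models, such as the profile HMMs behind Pfam, are far more sophisticated; this only shows the basic log-odds idea.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_seqs, pseudocount=1.0):
    """Position-specific scoring matrix (log-odds versus a uniform
    background) from an ungapped alignment of one domain family."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(aligned_seqs[0])):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
        column = {aa: math.log(((counts[aa] + pseudocount) / total) / background)
                  for aa in AMINO_ACIDS}
        pssm.append(column)
    return pssm

def score_sequence(pssm, seq):
    """Log-odds score of a candidate; higher means more family-like."""
    return sum(col[aa] for col, aa in zip(pssm, seq))

# Invented toy family: three aligned fragments of equal length.
family = ["ACDF", "ACEF", "SCDF"]
pssm = build_pssm(family)
print(score_sequence(pssm, "ACDF"))  # family-like, scores high
print(score_sequence(pssm, "WWWW"))  # unrelated, scores low
```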

Protein binding models

I am particularly interested in how proteins interact with other components (mainly other proteins and DNA) and in trying to model these interactions from sequence to function. I will leave protein-compound interactions and metabolic networks to more knowledgeable people.
As mentioned above, even without a complete ab initio folding model it is possible to predict the structure of some sequences, or to determine from comparative genomics analysis which protein/domain family a sequence belongs to. This by itself might not be very informative from a cellular perspective. We need to know how cellular components interact and how these interconnected components create useful functions in a cell.

Docking
Trying to understand and predict how two proteins interact in a complex has been a challenge of structural computational biology for more than two decades. The initial attempt to understand protein interactions through computational analysis of structural data (what is known today as docking) was published by Wodak and Janin in 1978. In this seminal study, the authors established a computational procedure to reconstitute a protein complex from simplified models of the two interacting proteins. In the decades that have followed, the complexity and accuracy of docking methods have steadily increased, but the field still faces difficult hurdles (see reviews by Bonvin et al. 2006 and Gray 2006). Docking methods start from the knowledge that two proteins interact and aim to predict the most likely binding interfaces and the conformation of the proteins in a 3D model of the complex. Ultimately, docking approaches might one day also predict new interactions by exhaustively docking all other proteins in a species' proteome, but at the moment this is still not feasible.
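The core trick behind modern grid-based docking can be sketched compactly. Below is a toy 2D version of the FFT correlation approach introduced later by Katchalski-Katzir and colleagues (not the 1978 Wodak-Janin procedure): all translations of a ligand grid are scored in bulk, rewarding overlap with the receptor surface and penalising overlap with its core. The shapes, grid size and score values are invented; real methods work in 3D, also scan rotations, and use much richer potentials. Assumes numpy and scipy.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def score_grid(mask, core=-9.0, surface=1.0):
    """Surface cells get a reward, buried core cells a clash penalty."""
    eroded = binary_erosion(mask)
    return np.where(mask & ~eroded, surface, np.where(eroded, core, 0.0))

def best_translation_score(receptor_mask, ligand_mask):
    """Circular FFT cross-correlation scores every translation at once;
    the maximum corresponds to the best-packing placement."""
    r = score_grid(receptor_mask)
    l = ligand_mask.astype(float)
    corr = np.real(np.fft.ifft2(np.fft.fft2(r) * np.conj(np.fft.fft2(l))))
    return corr.max()

# Two invented blobs on a 32x32 grid standing in for protein silhouettes.
receptor = np.zeros((32, 32), dtype=bool)
receptor[8:20, 8:14] = True
ligand = np.zeros((32, 32), dtype=bool)
ligand[10:16, 3:7] = True
print(best_translation_score(receptor, ligand))
```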

Interaction types
It should still be possible to use the 3D structures of protein complexes to understand at least particular interaction types. In a recent study, Aloy and Russell showed that it is possible to transfer structural information on protein-protein interactions by homology to other proteins with similar sequences (Aloy and Russell 2002). In this approach, the homologous proteins are aligned to the sequences of the proteins in the 3D complex structure, and substitutions in the homologous sequences are evaluated with an empirical potential to determine the likelihood of binding. A similar approach was described soon after by Lu and colleagues, and both have been applied in large-scale genomic studies (Aloy and Russell 2003; Lu et al. 2003). Like any other functional annotation by homology, this method is limited by how much the target proteins have diverged from the templates; Aloy and Russell estimated that interaction modeling is reliable above 30% sequence identity (Aloy et al. 2003). Substitutions can also be evaluated with more sophisticated energy potentials once a homology model of the interface under study is created. Examples of tools that can be used to evaluate the impact of mutations on binding propensity include Rosetta and FoldX.
Although the methods described above were mostly developed for domain-domain protein interactions, similar approaches have been developed for protein-peptide interactions (see for example McLaughlin et al. 2006) and protein-DNA interactions (see for example Kaplan et al. 2005).
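To make the transfer idea concrete, here is my own toy sketch (not the actual Aloy and Russell method): it maps interface positions from a template complex onto an aligned homologue and scores substitutions with an invented compatibility table standing in for the empirical potential.

```python
# Toy sketch of homology transfer of an interaction interface.
# The template sequence comes from a complex of known structure;
# interface_positions would be derived from that structure.
# The substitution scores are invented for illustration only; real
# methods use empirical pair potentials fit to known interfaces.

# Hypothetical template/homologue alignment (ungapped for simplicity).
template = "MKLVINTERFACEAA"
homologue = "MKLVINSERFACEAA"
interface_positions = [6, 7, 8]  # 0-based positions contacting the partner

# Invented "compatibility" scores for conservative swaps (template -> homologue).
substitution_score = {("T", "S"): 0.8, ("E", "D"): 0.9}

def interface_compatibility(template, homologue, positions):
    """Average how well the homologue preserves the template interface:
    identical residues score 1.0, listed swaps use the table,
    anything else is penalised."""
    total = 0.0
    for pos in positions:
        t, h = template[pos], homologue[pos]
        total += 1.0 if t == h else substitution_score.get((t, h), -1.0)
    return total / len(positions)

print(interface_compatibility(template, homologue, interface_positions))
```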

In summary, the accumulation of protein-protein and protein-DNA interaction data, along with structures of complexes and the ever-increasing coverage of sequence space, allows us to develop models that describe binding for some domain families. In a future blog post I will try to review the domain families that are well covered by these binding models.

Previous mini-reviews
Protein sequence evolution

Thursday, June 12, 2008

@World

(caution, fiction ahead)


I wake up in the middle of the night startled by some noise. Pulse racing, I try to focus my attention outwards. Something breaking, glass shattering? Is someone out there? I reach out with my senses but an awkward feeling nags at me, bubbling up to my consciousness. I try hard to focus; it is coming from outside the room, someone is inside my house. I close my eyes but vertigo takes over and weightlessness empowers me. I am in the living room cleaning the floor, picking up a broken glass. The nagging feeling finally assaults me fully. I am moving but I am not in control. Panic rises quickly as I watch, helpless, the simple and quiet actions of someone else. I stop picking up glass and I feel curious, only it is not exactly me; the feeling is there beside me.
- Hi, who are you?
The voice catches me by surprise and my fear goes beyond rational control. All I can think of is to escape, to go away from here. For a second time I find myself floating, as if searching for a way out. When I open my eyes again I am by the beach and I breathe a sigh of relief. The constant sound of the waves calms me down for a few seconds until my eyes start drifting to the side. No, stay there, I am in control! I look into the eyes of a total stranger who smiles back at me in recognition. Two voices ask me if I am enjoying the view and I can only scream back in confusion.

I wake up in the middle of the night startled by some noise. I immediately flex my hands in front of my eyes to make sure it was nothing but a nightmare, trying hard to calm down. What a dream. I get up and check on the noise coming from the living room, realizing that it was just the storm outside. Feeling better, I fire up my laptop and grab a glass of water from the kitchen. I open twitter and type away:
- I had the strangest dream! (cursor blinking) Our senses were all connected (enter)
I get up to open the window, drinking another sip of water. After a couple of steps a jabbing headache forces me to stop and bright spots of light blur my vision. I close my eyes in pain and the voices of some unseen crowd thunder in my ears:
- I had the same dream - they all say in unison
The sound of glass shattering on the floor is the last thing I remember before collapsing.

I wake up in the middle of the night startled by some noise (...)

(Twistori was the main motivation for this post)

Previous fiction:
The Fortune Cookie Genome

Tuesday, June 10, 2008

Why does FriendFeed work?

I have been using FriendFeed for a while and I have to say that it works surprisingly well. It is hard to define what FriendFeed is, so the only real way of understanding it is to try it for a while.

One common way to define FF would be as a life-stream aggregator. Each user defines a set of feeds (blog, Flickr, Twitter, bookmarks, comments, etc.), providing all other users with a single view of that user's online activities. Anyone can select how much to share (even nothing at all) and subscribe to any number of other users. Each item (photo, blog post, bookmark) can then serve as a spark for discussion: users can mark items as interesting or comment on them, and this propagates to everyone who subscribes to them. You can also hide sources if there is a particular part of a user's activities you don't enjoy. All of this creates a very personalized view of whoever you elect to interact with online.

I still find it striking that there are so many long threads of discussions around items that we share in FriendFeed, sometimes more than in the original site. A couple of examples:
Google code as a science repository (discussion in FF, blog post)
Into the Wonderful (discussion in FF, slideshare site)
Bursty work (discussion in FF, blog post)

Why does it work so well? One possible reason could be that a group of early-adopter scientists happened to gather around this website, creating the critical mass required to start the discussions. Still, most of those commenting were already participating on blogs, so that might not be it. There might be something about the interface: maybe the ease of adding comments, and the fact that comments can be edited, increases participation. Ongoing discussions get bumped higher in the view, so every new comment brings an item back to your attention. In this way you know who saw the item and who is thinking about it. A bit like talking about a movie you saw or a book you read with a bunch of friends.

Anyone interested in the science aspects of it should check out the Life Scientists room, currently with around 85 subscribers. Here is an introduction to some of these people, in particular to what they work on. Connecting to other scientists in this way lets you see which articles they find interesting, discuss current scientific news, and maybe even start a couple of side-projects for the fun of it.

Monday, June 09, 2008

Evaluation metrics and Pubmed Faceoff

I have recently been reading a lot about evaluation metrics for papers and authors. It started with a blog post on Action Potential (Nature Neuroscience's blog) showing a correlation between the number of downloads of a paper and its citations. From the comments on that blog post I found out about a forum on Nature Network about Citation in Science, and also the recently published group of perspectives on "The use and misuse of bibliometric indices in evaluating scholarly performance".

It could have been a coincidence, but Pierre sparked a long discussion on FriendFeed when he suggested it would be nice to be able to sort Pubmed queries by the impact factor of the journal. In reaction to this, Euan set up a very creative interface to Pubmed that he named Pubmed Faceoff. It takes several factors into account (citations from Scopus, the eigenfactor of the journal, the time the paper was published) and, for each paper returned from a Pubmed query, creates a face that describes the paper. The idea for the visualization is based on Chernoff Faces. It is a really creative idea, and I wish Pubmed could spend more resources on alternative interfaces like this, something like a "labs" section where they could play with ideas or allow others to create interfaces that they would host.
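The Chernoff-face idea is simple enough to sketch: it is just a mapping from data dimensions to facial features. Below is my own minimal matplotlib toy version (not Euan's actual implementation); the metric-to-feature mapping and the example papers are invented for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Arc, Ellipse

def draw_face(ax, citations, journal_weight, age):
    """Map three paper metrics (each scaled to [0, 1]) onto features:
    head size ~ citations, eye size ~ age, smile depth ~ journal weight."""
    head_w, head_h = 1.0 + citations, 1.2 + citations
    ax.add_patch(Ellipse((0, 0), head_w, head_h, fill=False))
    eye = 0.05 + 0.1 * age
    for x in (-0.25, 0.25):
        ax.add_patch(Ellipse((x * head_w, 0.2), eye, eye, color="black"))
    # Smile: lower half of an ellipse whose depth grows with journal weight.
    ax.add_patch(Arc((0, -0.2), 0.5, 0.1 + 0.4 * journal_weight,
                     theta1=180, theta2=360))
    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.set_aspect("equal")
    ax.axis("off")

# Three hypothetical papers: (citations, journal weight, age), all in [0, 1].
papers = [(0.9, 0.8, 0.2), (0.3, 0.5, 0.7), (0.1, 0.2, 0.9)]
fig, axes = plt.subplots(1, len(papers))
for ax, metrics in zip(axes, papers):
    draw_face(ax, *metrics)
plt.show()
```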

I won't go into the whole debate about evaluation metrics here, since there is already a lot of discussion going on at the links I mentioned.

Wednesday, May 14, 2008

Prediction of phospho-proteins from sequence

I want to be able to predict which proteins in a proteome are more likely to be regulated by phosphorylation, ideally using mostly sequence information. This post is a quick note to show what I have tried, and maybe to get some feedback from people who might have tried this before.

The most straightforward way to predict phospho-proteins is to use existing phospho-site predictors in some way. I have used the GPS 2.0 predictor on the S. cerevisiae proteome with the medium cutoff, including only serine/threonine kinases. The fraction of tyrosine phosphosites in S. cerevisiae is very low, so for now I decided not to try to predict tyrosine phosphorylation.

This produces a ranked list of around 4E6 putative phosphosites for the roughly 6000 proteins, scored according to the predictor (each site is scored for multiple kinases). My question is how to best make use of these predictions if I mostly want to know which proteins are phosphorylated rather than the exact sites. Using a set of known phosphorylated proteins in S. cerevisiae (mostly taken from Expasy), I computed different protein-level scores as a function of all the phospho-site scores:
1) the sum
2) the highest value
3) the average
4) the sum of putative scores if they were above a threshold (4,6,10)
5) the sum of putative phosphosite scores if they were outside ordered protein segments as defined by a secondary structure predictor and above a score threshold

The results are summarized by the area under the ROC curve (known phosphoproteins were considered positives and all others negatives):

[Figure: AROC values for each of the scoring schemes above]
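For concreteness, here is a minimal sketch of how aggregation schemes like these can be compared. It assumes scikit-learn for the AROC computation; the per-protein site scores and labels below are invented toy data rather than real GPS 2.0 output, and scheme 5 would additionally require a secondary structure or disorder mask over the sites.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def aggregate(site_scores, method="sum", threshold=None):
    """Collapse a protein's predicted phospho-site scores into a single
    protein-level score, mirroring schemes 1-4 above."""
    s = np.asarray(site_scores, dtype=float)
    if threshold is not None:  # scheme 4: only count sites above a cutoff
        s = s[s >= threshold]
    if s.size == 0:
        return 0.0
    return {"sum": s.sum(), "max": s.max(), "mean": s.mean()}[method]

# Toy data: predicted site scores per protein and known phospho labels.
site_scores = {"P1": [8.2, 5.1, 4.7], "P2": [2.0], "P3": [6.5, 6.1], "P4": [1.2, 0.9]}
labels = {"P1": 1, "P2": 0, "P3": 1, "P4": 0}

for method, thr in [("sum", None), ("max", None), ("mean", None), ("sum", 4)]:
    scores = [aggregate(site_scores[p], method, thr) for p in site_scores]
    truth = [labels[p] for p in site_scores]
    print(method, thr, roc_auc_score(truth, scores))
```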
In summary, the sum of all phospho-site scores is the best way I have found so far to predict which proteins are phospho-regulated. My interpretation is that phospho-regulated proteins tend to be multiply phosphorylated and/or regulated by multiple kinases, so the maximum site score does not work as well as the sum. As a side note, although there are abundance biases in mass-spec data (the source of most of the phospho-data), protein abundance is a very poor predictor of phospho-regulation (AROC = 0.55).

Disregarding putative sites that fall within ordered segments, as defined by the secondary structure predictor (scheme 5), did not improve the predictions as I expected it would, but I should try a few dedicated disorder predictors.

Ideas for improvement are welcome, in particular sequence-based methods. I would also like to avoid comparative genomics for now.