Thursday, June 07, 2007

Tangled Bank #81 is now available

I participated with a submission to the latest edition of Tangled Bank (the first science carnival blog journal around) that is available at the Behavioral Ecology Blog. Thanks to RPM at Evolgen for "peer-reviewing" my post on protein evolution :).

Nature Precedings, a pre-print server for biomedical research

It was hard to hold off from blogging about this but I can finally write about Nature Precedings, a new free service provided by the Nature Publishing Group. The official announcement is in this editorial:
"... this site will enable researchers to share, discuss and cite their early findings. It provides a lightly moderated and relatively informal channel for scientists to disseminate information, especially recent experimental results and emerging conclusions."
"...the site will host a wide range of research documents, including preprints, unpublished manuscripts, white papers, technical papers, supplementary findings, posters and presentations."


I have been participating in the beta for some months now and, as mentioned in the editorial, it will be openly available starting next week. All documents are citable (have DOIs), are not peer-reviewed (in the formal sense) and are archived under a Creative Commons license (derivatives allowed). The site has the community features (tagging/commenting/rating/RSS feeds) that you would expect and that will hopefully allow for requesting and providing comments on early findings. In summary, a nicer version of arXiv for biomedical research.

I think this is great news that serves both to improve access to research (open access by pre-print archiving) and to increase the openness of research. It can provide a place for independent time-stamping of early findings, which could then be improved (hopefully with community feedback) until they are ready for formal submission to a peer-reviewed journal.

A framework for open science (in biology) can now go from blogs/wikis to pre-print server to peer-reviewed journals. Many ideas might die along the way and many collaborations might form by connecting early findings in an unexpected way.

Of course, if you are in maths/physics you have arXiv and you are probably wondering what is taking us biomedical researchers so long to get into this.

Friday, June 01, 2007

Bio::Blogs #11

The 11th edition of Bio::Blogs is online at Nodalpoint. We tried to do something different this time. Michael Barton volunteered to host a special section dedicated to tips and tricks for bioinformatics that is hosted separately at Bioinformatics Zen. Because there were so many posts this month about personalized medicine there is also a special section on that.

There are three separate PDFs for this edition: 1) the main PDF can be found here; 2) the one on personalized medicine can be downloaded here; 3) the one for tips and tricks is available from Bioinformatics Zen. Michael did a great job with this special section, with a very cool design.

Wednesday, May 30, 2007

Presenting Blog Citations

Recently Postgenomic hit the 10k mark. Ten thousand citations to papers and books have been tracked in science-related blogs. In the post announcing the milestone, Euan asked if blog buzz could be an indication of the impact of a paper. Can science bloggers help to highlight potentially interesting research?

I decided to have a look at this and asked him to send in a list of papers published in 2003-2004 and mentioned in blog posts. For these I took from ISI Web of Science the number of citations in papers tracked by ISI (all years). There are 519 papers published in 170 journals in the period of 2003-2004 that were mentioned in blogs tracked by Postgenomic. Of these, 79 papers could not be found in ISI. Many of the papers not found in ISI were published in arXiv. These 79 were no longer considered for further analysis.

Top cited journals in blog posts

I ranked the journals according to the incoming blog citations. The top 5 are highlighted below and, apart from arXiv, which is not usually tracked as a journal (maybe it should be), the other 4 are all well-known journals publishing general science/biology. Compared to a ranking by impact factor, there is a notable absence of review and medical journals. This measure of blog citations (instead of blog citations per article) will penalize low volume journals like the Annual Review series. Regarding the low blog impact of medical journals, maybe the current journal ranking by blog citations reflects a higher proportion of biology and physics blogs currently tracked by Postgenomic.


Relation between blog citations and average literature citations

The fact that bloggers tend to cite research published in high-impact journals could be just due to the higher visibility of these journals. To test this, I analyzed the average citations per article for papers published in 2003-2004 in any journal with more than 0, 1, 2 and 3 blog citations (see table below). I compared these to papers published in Science and Nature in the same period. It is possible to conclude that: 1) papers mentioned in blogs have a higher average citation count than those published in these high-impact journals; 2) papers with increasing blog citations have on average a higher number of literature citations.

Journal or paper set    Papers in 2003-2004    Citations    Average citations per paper
Science                 5306                   148912       28.06
Nature                  5193                   145478       28.01
>0 blog citations       440                    21306        48.42
>1 blog citations       71                     3679         51.81
>2 blog citations       24                     1835         76.45
>3 blog citations       15                     1557         103.8

I did not remove non-citable items (editorials, news and views, letters, etc) from the analysis. It would be hard to come up with criteria for removing these from both the journals and from the papers tracked by Postgenomic. In any case, I suspect that bloggers tend to blog a lot about non-citable items because these are usually more engaging for discussions than research papers. Therefore, if anything, I suspect that the real measure of impact for blog-cited items should be even higher.
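
The averages in the table above are simply total citations divided by the number of papers. As a quick sanity check, here is a minimal Python sketch that recomputes them from the counts in the table. The names `counts` and `average_citations` are made up for this illustration and are not part of the original analysis. Note that the last digit can differ slightly from the table, which appears to truncate rather than round.

```python
# Papers and total citations for each set, copied from the table above.
counts = {
    "Science": (5306, 148912),
    "Nature": (5193, 145478),
    ">0 blog citations": (440, 21306),
    ">1 blog citations": (71, 3679),
    ">2 blog citations": (24, 1835),
    ">3 blog citations": (15, 1557),
}


def average_citations(papers, citations):
    """Mean number of literature citations per paper."""
    return citations / papers


averages = {name: average_citations(p, c) for name, (p, c) in counts.items()}

for name, avg in averages.items():
    print(f"{name:<20} {avg:7.2f}")
```

Even the least selective set (papers with at least one blog citation) averages well above Science and Nature, which is the comparison the argument rests on.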

Our global distributed journal club

In recent years science publishers have worked to adjust to publishing online. Most of them now offer RSS feeds for their content and some timidly started allowing readers to comment on their sites. With the exception of BioMed Central, none of the publishers make a point of prominently showing these comments, making it harder to find out about interesting ongoing discussions. This has not stopped researchers from participating in what can be called a global distributed journal club. As Euan and others have nicely noted, scientists are using blogs to discuss research. It is a very diffuse discussion but it can be aggregated in a way that would never be possible if we kept to ourselves, in the usual conferences or in our institutes/universities.

I tried to show here that this aggregated discussion conveys information regarding the potential impact of published research. This is only the tip of the iceberg of the potential benefits of aggregating and analyzing science blogs. For example, it should be possible to look for related papers from the linking patterns of science bloggers; the dynamics of communication between different science disciplines; the trends in technology development, etc.

Some publishers might be thinking of ways to reproduce these discussions on their sites. One alternative would be for science publishers to get together in the development of the aggregation technology. There should be an independent site gathering all the ongoing comments from blog posts and from the publishers' websites. This could then be used by anyone interested in the information. It could be shown next to a PubMed abstract or directly on the publisher's website. Right now this would likely be the single biggest incentive to online science discussions that science publishers could provide.

Next stop: San Francisco

I finally know for sure where I will be going for my first postdoc. I will be taking a joint postdoc position with Wendell Lim and Nevan Krogan at UCSF - Mission Bay. I will be moving to San Francisco around the end of the year or beginning of next year.

Tuesday, May 29, 2007

Reminder for Bio::Blogs#11

I will start collecting posts for the 11th edition of Bio::Blogs (monthly bioinformatics blog journal) to be hosted at Nodalpoint on the 1st of June. Anyone can participate by sending in submissions to bioblogs at gmail. This month there is going to be a special section dedicated to tips for computational biologists, that will be hosted at Bioinformatics Zen. Something like a separate insight issue :). To participate in the special section email Mike (see post) with your tips or write a post and submit the link to him. It can even be just a couple of sentences. Just think of things that you consider to be important for people working in computational biology and send it in.

Friday, May 25, 2007

The Human Microbiome Project approved

(via Jacques Ravel's blog) It looks like the NIH approved a pilot study to have a look at human microbial populations. From the NIH Roadmap:

On May 18, 2007, the IC Directors met to review and prioritize specific proposals developed by Working Groups of trans-NIH staff, led by IC Directors. Four topics were chosen to move forward as Major Roadmap Initiatives. Two of these, the Microbiome and Epigenetics Programs, were approved for immediate implementation as five year programs. (...)

* Microbiome – The goal of the proposed Human Microbiome Project is to characterize the microbial content of sites in the human body and examine whether changes in the microbiome can be related to disease.


Related posts from other blogs:
A human microbiome program? (Jonathan A. Eisen)
More on the Human Microbiome Program Workshop - Day1 (Jonathan A. Eisen)
A Human Microbiome Project? (MSB blog)

Thursday, May 24, 2007

Nature vs. Nurture in personalized medicine

Personalized medicine aims to determine the best therapy for an individual based on personal characteristics. Given that family history is a risk factor for many diseases, there is a strong motivation to search for heritable genetic variation that might provide molecular explanations for diseases. In the last couple of years, improvements in sequencing technology have helped to scale up these efforts. The HapMap project is an example of these attempts at genome-wide characterization of human genetic variation. The project aims to create a haplotype map of the human genome. This map is important because correlating a disease with a haplotype can be used to pinpoint the cause of a disease to a genome region. This map-based approach is done by first sequencing known sites of polymorphisms, spaced across the genome, in a large population and then associating disease with haplotypes (see a recent example).

Eventually sequencing costs will go down to a point where these map-based approaches are replaced by full genome re-sequencing. It looks like there is a consensus that this is just a matter of time. Also, the main sequencing centers seem to be directing more of their efforts to studying variation. If sequencing full genomes is currently too expensive, sequencing coding regions is much more affordable. In two recent papers (Greenman et al. and Sjoblom et al.) researchers have tried to identify somatic mutations in human cancer genomes by sequencing. Greenman and colleagues focused on 518 kinases and searched for mutations in these genes in 210 different human cancers (see post by Keith Robison). Sjoblom and colleagues on the other hand sequenced fewer cancer types (11 breast and 11 colorectal cancers) but did so for 13023 genes. The challenge going forward is to understand what the impact of these mutations is on cellular function.

Instead of sequencing to find new polymorphisms, it is also possible to test the association of previously identified variation with disease by high-throughput profiling. Two recent papers focused on profiling known polymorphisms in cancer tissues using either microarrays or PCR plus mass spec.

Underlying all of these efforts is the idea of genetic determinism: that if I sequence my genome I should know how each variation impacts on my health and what treatment I should use to correct it. This raises the question, however, of how much our health really depends on inherited genetic variation. This is the often revisited Nature vs. Nurture debate. The latest MSB paper highlights the impact of the environment on mammalian metabolic functions. François-Pierre J Martin and colleagues have studied how the microbial gut population affects mouse metabolism. To study this question, they used NMR metabolic profiling in conventional mice and in germ-free mice colonized by human baby flora.

Metabolic analysis of liver, plasma, urine and ileal samples from both types of mice showed a significant change in metabolites in the different compartments associated with the two microbial populations. This is a very clear example of how the environment must be taken into consideration in future efforts at personalized medical care.

This example also underscores the importance of studying the human microbial associations. As Jonathan Eisen discussed in his blog, maybe we should aim at a human microbiome program.

Nature or Nurture? In either case, abundant streams of data are forthcoming as the sequencing centers crunch away and new omics tools get directed at studying disease. There will be a lot of work to do in order to understand causal relationships and suggest therapeutic strategies. That might be why Google is taking a look at this. They keep saying they want to organize the world's information, so why not health-related data?


The picture was taken from the News and Views by Ian Wilson:
Top-down versus bottom-up—rediscovering physiology via systems biology? Molecular Systems Biology 3:113

Tuesday, May 15, 2007

Protein evolution

What constrains and determines the rate of protein evolution? This topic has received a great deal of attention in bioinformatics. Many reports have found significant correlations between protein evolutionary rate and expression levels, codon adaptation index (CAI), protein interactions (see below), protein length, protein dispensability and centrality in protein interaction networks. To complicate matters further, there are known cross correlations between some of the factors. For example, it has been observed that the number of protein interactions correlates with protein length (weakly) and with the probability that a protein is essential to the cell.

This highlights the importance of thinking about the amount of variance explained by the correlation and controlling for possible cross correlations. In fact it has been shown that, when controlling for gene expression, some of the other factors have a weaker correlation (or none at all) with the rate of protein evolution (Csaba Pál et al 2003). Using principal component regression, Drummond and colleagues have shown that a single component dominated by expression, CAI and protein abundance accounted for 43% of the variance of the non-synonymous mutation rate (dN). The other known factors account for only a few percent of the observed variance in dN.
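
To make "controlling for gene expression" concrete, here is a minimal sketch of a first-order partial correlation in plain Python. This is only an illustration of the general idea, not the method used in the papers cited above, and the numbers are hypothetical toy values constructed so that the number of interactions and dN are both driven by expression and by nothing else. The function names `pearson` and `partial_correlation` are made up for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical toy data (arbitrary units), NOT real measurements.
expression = [1, 1, 1, 1, 3, 3, 3, 3]
interactions = [1, 1, 3, 3, 3, 3, 5, 5]  # rises with expression
dn = [3, 5, 3, 5, 1, 3, 1, 3]            # falls with expression

raw = pearson(interactions, dn)
controlled = partial_correlation(interactions, dn, expression)
print(f"raw correlation:            {raw:.2f}")
print(f"controlling for expression: {controlled:.2f}")
```

In this toy case the raw correlation between interactions and dN is clearly negative, but it vanishes once expression is accounted for, which is the kind of effect a cross correlation can produce.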

Two questions might come to mind when thinking about these observations. One is why expression values, CAI and protein abundance would constrain protein evolution. The other is why the number of protein interactions explains so little (or none at all) of the variance in protein evolutionary rates. Intuitively, the number of protein interactions is related to the functional density of a protein, and proteins with high functional density should have a lower dN.

Drummond and colleagues proposed in a PNAS paper an explanation for the first question. They first list three possible reasons why expression levels should have such a strong effect on protein evolution: functional loss, translational efficiency and translational robustness. Functional loss, postulated by Rocha and Danchin, hypothesizes that highly expressed proteins have lower dN because they are under strong selection to minimize the impact of mistranslation, which would create a large pool of inefficient proteins and reduce the fitness of the cells. A second hypothesis, proposed by Akashi, links protein evolutionary rates with gene expression through the efficiency of translation. Highly expressed proteins have optimal codon usage for efficient translation and therefore a lower dN and dS. Drummond and colleagues added a third hypothesis that they called translational robustness. Given the costs of misfolding and aggregation, the higher the number of translation errors that might lead to misfolding and aggregation, the higher the cost for the cell. Therefore there might be strong selection for keeping highly expressed genes robust against mistranslation.

The difference between translational robustness and functional loss is that the first implies that the number of translation events is the important factor while the second puts the emphasis on protein concentration. Using protein abundance and mRNA expression, the authors showed that translational robustness seems to be the most important factor determining the rate of protein evolution.

In fact, in a recent paper (Tartaglia et al, 2007) a correlation between in vitro aggregation rates and in vivo expression levels was discovered. Highly expressed proteins tend to have a lower aggregation rate measured in vitro (r=0.97, N=12). The number of proteins analyzed was small and the aggregation rates were not always obtained under the same conditions, but it does fit with the translational robustness hypothesis.

Even if the number of translational events is such a strong constraint, one would expect that, when accounting for this, one would still see an effect of functional density on protein evolution. Yet the correlation between a proxy for functional density (the number of protein interactions) and dN has been under strong debate (yes there is, no there isn't, yes, no, yes, maybe, ...).

The answer to this dispute might in the end be that the number of protein interactions is not a good proxy for functional density. A protein might have many protein interactions using a single interface. This is why the work of Kim and colleagues from the Gerstein lab is important. Using structural information they predicted the most likely interface for protein interactions in S. cerevisiae. They could then show that protein evolutionary rate correlates better with adjusted interface surface area than with number of protein interactions. Also, the relationship between interface surface area and evolutionary rate appears to be independent of protein expression level.

The overall picture so far seems to be that translational robustness is the main driving force shaping protein evolutionary rates. Functional constraints are also important but are much more localized, explaining a smaller fraction of the overall variance across whole proteins.

Where can we go further? As I mentioned above, translational robustness predicts that expression levels should correlate with overall stability, designability (number of sequences that fit the structure) and avoidance of aggregation-prone sequences. Bloom and colleagues have shown that density of inter-residue contacts (a proxy for designability) does not correlate with expression, but the study was limited to roughly 200 proteins so this might not be the final answer.

So, a clear hypothesis is that a computational measure that would sum a protein's stability, tendency for aggregation and designability should correlate with gene expression levels.

Further reading:
An integrated view of protein evolution (Nature Reviews Genetics)

Friday, May 11, 2007

Science Foo Camp 2007 and other links

Nature is organizing another Science Foo Camp. There are already a couple of bloggers that have been invited (Jean-Claude Bradley, Pierre, PZ Myers, Peter MR, Andrew Walkingshaw). There is a "Going to Camp" group in Nature Network, and the scifoo tag in Connotea to explore if you want to dig deeper.

I was there last year and I can only thank Timo again for inviting me and encourage everyone that has been invited to go. It was a chance to get to know fascinating people and hear about new ideas. On the off chance that any of the organizers is reading this ... please try to get together people from Freebase (or a similar company) with the people involved in biological standards (like Nicolas Le Novère).

A quick hello to two new bioinformatics-related blogs: Beta Science by Morgan Langille and Suicyte Notes.

(via Pierre, Neil and Nautilus) In a correspondence letter published by Nature, Mark Gerstein, Michael Seringhaus and Stanley Fields discuss the implementation of structured, machine readable abstracts. As I mentioned in a comment to Neil's post, this is one of those ideas that have been around, that most people would agree to but somehow it is never implemented. In this case it would have to start on the publisher's side. As we have seen with other small technical implementations, like RSS feeds, once a main publisher sets this up others will follow.

Monday, May 07, 2007

Introducing the Systems Biology department at CRG

I am spending two weeks in Barcelona to help out with a referee report. I can't really say yet what it is about but if everything goes well, maybe I will in a couple of months (hint: evolvability). What I can do is introduce the environment. I am on the 5th floor of the Barcelona Biomedical Research Park. The building is located in front of the sea and it hosts several different institutes. I am staying at the Center for Genomic Regulation (CRG) where my supervisor Luis Serrano is heading the program for Systems Biology. The program is a partnership between CRG and EMBL and it is currently home to four groups (Luis Serrano, James Sharpe, Mark Isalan and Ben Lehner).


The department has a lot of research in development and evolutionary systems biology. I have only been here a week but the environment is great and the beach in the background is a killer plus. Have a look around the webpage for the other programs.

Friday, May 04, 2007

It's official ;), scientists enter the blogosphere

Laura Bonetta wrote an analysis piece in Cell about scientists entering the blogosphere. Laura Bonetta (could not find her blog :) does a good job of introducing science blogging in a short and easy to read essay. There is a bit of everything: science education, discussions, carnivals and open science. The only thing that is sorely lacking is a mention of Postgenomic and maybe the publishers' blogs.

Wednesday, May 02, 2007

Bio::Blogs #10

The 10th edition of Bio::Blogs, the bioinformatics blog journal, has been posted at Nodalpoint. The PDF can be downloaded from Box.net.

Sunday, April 29, 2007

Bio::Blogs #10 - reminder

The May 1st edition of Bio::Blogs will be hosted at Nodalpoint. Anyone can participate by sending in links for interesting blog posts from April (to bioblogs at gmail). If you send in links to your own blog posts please also say if you agree or not to have the post copied to the PDF version for offline reading.

Friday, April 27, 2007

Bio-science hype cycle

I found out about the Gartner's Hype Cycle today from a story at Postgenomic (via Science Notes and HealthNex).

The Gartner's Hype Cycle is meant to highlight the relative maturity of different IT technologies. The idea originated from the common pattern in how human perception of nascent technologies evolves: from the initial trigger, passing through exaggerated expectations and disillusionment, and finally to maturity and stability.

Just for the fun of it I tried to plot the same graph with some bio-science related technologies and/or ideas:


This is a very limited and biased view but for me it was interesting trying to think where to place the different technologies.

Thursday, April 26, 2007

The publisher's reaction

Sarah Cooney, Director of Publications for the Society of Chemical Industry, has issued an official reply to the gathering criticism:

"We apologise for any misunderstanding. In this situation the publisher would typically grant permission on request in order to ensure that figures and extracts are properly credited. We do not think there is any need to pursue this matter further."

The email was posted in Shelley Batts' blog and also in this blog post at Nature Network. Overall it is good news: this was an honest mistake and not a policy of the journal or the publisher. There is a hint in the official reply of some potentially abusive emails sent to the editor. Maybe, as Euan Adie suggested, this is also a lesson for science bloggers to be careful about a rising mob mentality when handling these issues. I would guess that many editors might not even be aware of their own copyright/fair use policies and this issue might at least raise the discussion.

How can a publisher be so dumb - Update

I posted a while ago about copyright policies of different science publishers regarding images. I concluded by saying that in any case we should be safe to blog images since no publisher would likely sue a blogger for using an image or two to promote one of their papers. Well ... apparently I was wrong. Shelley Batts from Retrospectacle got an email from an editor of the Journal of the Science of Food and Agriculture (published by Wiley Interscience):

"The above article contains copyrighted material in the form of a table and graphs taken from a recently published paper in the Journal of the Science of Food and Agriculture. If these figures are not removed immediately, lawyers from John Wiley & Sons will contact you with further action."

This is not a legal action but the threat is there. I cannot see what they were thinking. Are they really willing to sue a blogger for what is very likely fair use of their content? The content used is a small fraction of the whole, the blog post is educational and most likely has increased the traffic to that paper. If anything this email just bought them a lot of indignation and it will be a PR nightmare for the journal and the publisher.

Science bloggers are doing a great service by covering science news faster and in more depth than most traditional news services. Every time I have a look these days at the first page of Postgenomic I see there what are going to be the main science stories of the next day in the normal news outlets. Not only that but I will likely find someone that actually works on the subject and can give a very good explanation of what the work is about. Publishers should be fostering this by crafting policies directed at this use of their material, not the other way around.

If you want to contact the editor that made the decision the email is on Shelley's post.

There is a large number of posts reacting to this in Postgenomic.

Update - Boing Boing is giving coverage to this too. If you also think that this was a bad decision from the journal editor/publisher consider writing about it or sending them an email. Even if it is within their legal right to do so we can at least tell them that we don't find this appropriate or fair.

Tuesday, April 24, 2007

Cellular adaptation to unforeseeable events

How do cells react to changes in external conditions? It has been noted before that in many cases the immediate transcriptional response includes unspecific changes in gene expression for a large group of genes (Gasch et al, 2000). Fong and colleagues have shown that in E. coli, 20 to 40 days after the initial changes, most of the genes return to the expression levels they had prior to the modification of the environment. The differentially expressed genes at this stage are situation-specific but not necessarily always the same. In this same paper, the gene expression changes were followed for different independent populations evolving under the same changes in conditions. Out of ~1100 gene expression changes (on average) that were possibly adaptive to the new conditions, only 70 were common to all 7 parallel populations.
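
The "common to all populations" count is just the intersection of the sets of changed genes. A minimal Python sketch, using hypothetical gene names and three toy populations rather than the real data from the paper (`shared_changes` is a name made up for this example):

```python
# Hypothetical toy data: genes with changed expression in each of three
# independently evolved populations (NOT the real gene lists).
populations = {
    "pop1": {"geneA", "geneB", "geneC", "geneD"},
    "pop2": {"geneA", "geneC", "geneE"},
    "pop3": {"geneA", "geneC", "geneF", "geneG"},
}


def shared_changes(gene_sets):
    """Genes whose expression changed in every population."""
    sets = list(gene_sets)
    return set.intersection(*sets) if sets else set()


common = shared_changes(populations.values())
mean_changes = sum(len(s) for s in populations.values()) / len(populations)

print(f"average changes per population: {mean_changes:.1f}")
print(f"changes common to all populations: {sorted(common)}")
```

The more populations are included, the smaller the intersection can get, so a small shared core across 7 populations is a statement about how differently each population solved the same problem.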

A new study published in MSB adds more information to these interesting findings. In this study the authors tried to challenge S. cerevisiae with a perturbation that these cells should not have seen during their evolutionary history. They used a his3 deletion strain with a plasmid having HIS3 under the GAL1 promoter. In these cells the essential HIS3 gene should be efficiently turned off in a glucose medium. They then tracked the gene expression changes over time when the medium was changed from galactose to glucose. The cells adapted to these conditions within around 10-20 generations. Again the initial gene expression changes involved a large number of genes (~1000-1600 genes with a more than 2-fold change), with most of them (65%-70%) returning to their original expression levels in 10-20 generations. Again, different populations had different genes differentially expressed in response to the transition from gal to glu.
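
To illustrate what "more than 2-fold change" and "returning to their original expression levels" mean in practice, here is a minimal Python sketch. The expression values are hypothetical toy numbers, and treating "returned" as "no longer more than 2-fold away from the starting level" is an assumption made for this example, not necessarily the criterion used in the paper.

```python
from math import log2

# Hypothetical expression levels (arbitrary units), NOT real data:
# (before the switch, shortly after the switch, after adaptation)
expression = {
    "gene1": (100.0, 450.0, 110.0),
    "gene2": (80.0, 20.0, 85.0),
    "gene3": (50.0, 60.0, 55.0),
    "gene4": (200.0, 900.0, 850.0),
}


def changed(before, after, fold=2.0):
    """True if expression moved more than `fold`-fold, up or down."""
    return abs(log2(after / before)) > log2(fold)


# Genes that respond strongly right after the medium is switched.
responders = [g for g, (t0, t1, _) in expression.items() if changed(t0, t1)]

# Responders that are back within 2-fold of their starting level.
relaxed = [g for g in responders
           if not changed(expression[g][0], expression[g][2])]

fraction_relaxed = len(relaxed) / len(responders)
print(f"{len(responders)} responders, {fraction_relaxed:.0%} relaxed back")
```

Working on the log scale treats a 2-fold increase and a 2-fold decrease symmetrically, which is why `gene2` (down 4-fold) counts as a responder just like `gene1` (up 4.5-fold).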

There is a detailed analysis in the paper regarding the functional classes of the genes but for me these general trends were by themselves very interesting. How does the cell cope with unforeseeable events ?

Maybe there is a general mechanism that senses discrepancies between metabolic requirements and the current cellular state and, in the absence of a programmed response, drives an almost chaotic search for plausible solutions? If there is such a sensing mechanism it could provide the necessary feedback for the selection of cellular states at a physiological time scale. In an environment where frequent unpredictable changes occur, such a system could possibly be selected for.

For further reading have a look at the news and views by Eugene Koonin

Tuesday, April 17, 2007

The Seven Stones blog and more quick links

The Nature family of blogs has a new member - The Seven Stones - from the Molecular Systems Biology journal. I gave some help to set it up during my 3-month stay with the journal. Go over there and say hello to the editors :).

(via Deepak) The TED.com site was relaunched. It has one of the most amazing collections of video talks available. The current main focus is The Rise of Collaboration.

(via Konrad and Richard Akerman) There was an interesting conference organized by Allen Press - Emerging Trends in Scholarly Publishing. Both Konrad and Richard Akerman describe in their blogs what the conference was about and what they talked about.

Wednesday, April 11, 2007

Sharing notes of science papers not just on PLoS ONE

Neil just posted some examples of webtools to take notes on websites. I'll pick up on the theme to show you another example - Diigo. Diigo is a collaborative note sharing webapp that allows us to annotate a website (highlight and add notes) and share this with anyone also using Diigo. I will go through an example. I registered for the service and dragged a bookmarklet to the browser.

Then I went to have a look at this paper, that showed up in the Systems Biology Connotea feed, and clicked the bookmarklet:

The paper discusses the usefulness of machine learning methods to study dynamics of cellular pathways so I just added a link to the wikipedia entry on fuzzy logic.

Now anyone using Diigo can browse the article and see the notes I added (I think you can select to make them public or private). I do most of the reading on paper so for me this would be useful only to dissect papers with someone online. In a sense this is what PLoS ONE does but extended to any website. For example at the same time we blog about a paper, we could add the link to the blog post on the paper itself or add some of the comments directly on the paper if it makes more sense. You can also use Diigo to gather a clip to add to a blog post and create groups to share bookmarks and annotations.

The biggest drawback is that I don't know if there are any Diigo comments or annotations for a site without clicking the bookmarklet (I did not try the toolbar they offer for installation). Even after clicking the bookmarklet there is no way of knowing how many (if any) notes exist on the site without scrolling to look for them.

As for most social applications, this becomes more useful if more people start using it. On the other hand, if this (or a similar service) grows too much it will be too attractive to spammers. There are more examples of this type of tool in today's post on TechCrunch, where I found out about Diigo. With so many ways to add annotations to a webpage it would be nice to have some kind of abstract way of communicating annotations between web applications. Something like trackbacks but able to convey more information, not just that this content points to that content.