Wednesday, May 09, 2012

The Minimal Publishable Unit

What constitutes a minimal publishable unit in scientific publishing? The transition to online publishing and the proliferation of journals are creating a setting where anything can be published. Every week spam emails almost beg us to submit our next piece of research to some journal. Yes, I am looking at you Bentham and Hindawi. At the same time, the idea of a post-publication peer review system also promotes an increase in the number of publications. With the success of PLoS ONE and its many clones we are in for another large increase in the rate of scientific publishing. Publish-then-sort, as they say.

With all these outlets for publication and the pressure to build up your CV it is normal that researchers try to slice their work into as many publishable units as possible. One very common trend in high-throughput research is to see two to three publications that relate to the same work: the main paper for the dataset and biological findings and one or two off-shoots that might include a database paper and/or a data analysis methods paper. Besides these quasi-duplicated papers there are the real small bites, especially in bioinformatics research. You know, the ones you read and think to yourself that it must have taken no more than a few days to get done. So what is an acceptable publishable unit?

I mapped phosphorylation sites onto ModBase models of S. cerevisiae proteins and just sent a tweet with a small fact about protein phosphosites and surface accessibility.
Should I add that tweet to my CV? The relationship is expected and has probably already been published with a smaller dataset, but I would bet that it would not take much more work to turn it into a paper. What is stopping us from adding trivial papers to the flood of publications? I don't have an actual answer to these questions. There are many interesting and insightful "small-bite" research papers that start from a very creative question that can be quickly addressed. It is also obvious that the amount of time/work spent on a problem is not proportional to the interest and merit of a piece of research. At the same time, it is very clear that the incentives in academia and publishing are currently aligned to increase the rate of publication. This increase is only a problem if we can't cope with it, so maybe instead of fighting against these aligned incentives we should be investing heavily in filtering tools.
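For a sense of how little stands between such a tweet and a figure in a manuscript, here is a minimal sketch of that kind of analysis, assuming Biopython, a local dssp binary, and a hypothetical tab-separated file of phosphosites already mapped to ModBase model files (the file names and column layout are made up for illustration):

import csv
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

# protein, model PDB file and phosphosite position per line (hypothetical layout)
phospho = {}
with open("phosphosites.tsv") as fh:
    for prot, model_file, pos in csv.reader(fh, delimiter="\t"):
        phospho.setdefault((prot, model_file), set()).add(int(pos))

rsa_phospho, rsa_other = [], []
parser = PDBParser(QUIET=True)
for (prot, model_file), sites in phospho.items():
    structure = parser.get_structure(prot, "models/" + model_file)
    dssp = DSSP(structure[0], "models/" + model_file)  # runs the dssp binary
    for (chain_id, res_id), values in dssp.property_dict.items():
        aa, rel_acc = values[1], values[3]  # one-letter code, relative accessibility
        if aa not in "STY" or rel_acc == "NA":
            continue
        (rsa_phospho if res_id[1] in sites else rsa_other).append(rel_acc)

print("median RSA of phosphosites:", sorted(rsa_phospho)[len(rsa_phospho) // 2])
print("median RSA of other S/T/Y:", sorted(rsa_other)[len(rsa_other) // 2])

Comparing those two distributions is essentially the content of the tweet, which is exactly the point about how small a "publishable" analysis can be.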


Wednesday, March 28, 2012

Individual genomics of yeast

Nature Genetics used to be one of my favorite science journals. It consistently had papers that I found exciting. That changed about five years ago when they had a very clear editorial shift into genome-wide association studies (GWAS). Don't get me wrong, I think GWAS are important and useful but I don't find it very exciting to have lists of regions of DNA that might be associated with a phenotype. I want to understand how variation at the level of DNA gets propagated through structures and interaction networks to cause these differences in phenotype. I mostly stayed out of GWAS since I was focusing on the evolution of post-translational networks using proteomics data, but I always felt that this line of research was not making full use of what we already know about how a cell works.

In this context, I want to tell you about a paper from Ben Lehner's lab that finally made me excited about individual variation, and why I think it is such a great study. I was playing around with a similar idea when the paper came out so I will start with the (very) preliminary work I did and continue with their paper. I hope it can serve as a small validation of their approach.

As I just mentioned, I think we can make use of what we know about cell biology to interpret the consequences of genetic variation. Instead of using association studies to map DNA regions that might be linked to a phenotype, we can take a full genome and try to guess what could be deleterious changes and their consequences. It is clear that full genome sequences for individuals are going to be the norm, so how do we start to interpret the genetic variation that we see? For human genetic variation, this is a highly complex and challenging task.

Understanding the consequences of human genetic variation, from the DNA to the phenotype, requires knowledge of how variation impacts protein stability, expression and kinetics; how this in turn changes interaction networks; how this variation is reflected in the function of each tissue; and ultimately how it results in a fitness difference, disease phenotype or response to drugs. Eventually we would like to be able to do all of this, but we can start with something simpler. We can take unicellular species (like yeast) and start by understanding cellular phenotypes before we move to more complex species.

To start we need full genome sequences for many different individuals of the same species. For S. cerevisiae we have genome sequences for 38 different isolates from Liti et al. We then need phenotypic differences across these different individuals. For S. cerevisiae there was a great study published in June last year by Warringer and colleagues where they tested the growth rate of these isolates under ~200 conditions. Having these data together we can attempt to predict how the observed mutations might result in the differences in growth. As a first attempt we can look at the non-synonymous coding mutations. For these 38 isolates there are something like 350 thousand non-synonymous coding mutations. We can predict the impact of these mutations on a protein either by analyzing sequence alignments or by using structures and statistical potentials. There are advantages and disadvantages to both approaches but I think they end up being complementary. The sequence analysis requires large alignments while the structural methods require a decent structural model of the protein. I think we will need a mix of both to achieve a good coverage of the proteome.

I started with the sequence approach as it was faster. I aligned 2329 S. cerevisiae proteins with more than 15 orthologs in other fungal species and used MAPP from the Sidow lab at Stanford to calculate how constrained each position is. I got about 50K non-synonymous mutations scored with MAPP, of which about 1 to 8 thousand could be called potentially deleterious depending on the cut-off. To these we can add mutations that introduce STOP codons, in particular if they occur early in the protein (~710 of these fall within the first 50 amino acids).
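To make the bookkeeping concrete, here is a minimal sketch of this variant-calling step, assuming the MAPP output has already been parsed into a dictionary of per-substitution scores; the cut-off value, the file name and its column layout are purely illustrative:

import csv

MAPP_CUTOFF = 10.0        # illustrative threshold, not the one used above
EARLY_STOP_WINDOW = 50    # premature stops within the first 50 residues

def call_deleterious(mutations_file, mapp_scores):
    # mapp_scores: {(protein, position, mutant_aa): MAPP impact score}
    # returns {strain: set of proteins with at least one candidate deleterious hit}
    hits = {}
    with open(mutations_file) as fh:
        for strain, prot, pos, ref_aa, alt_aa in csv.reader(fh, delimiter="\t"):
            pos = int(pos)
            early_stop = (alt_aa == "*" and pos <= EARLY_STOP_WINDOW)
            bad_substitution = mapp_scores.get((prot, pos, alt_aa), 0.0) >= MAPP_CUTOFF
            if early_stop or bad_substitution:
                hits.setdefault(strain, set()).add(prot)
    return hits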

So up to here we have a way to predict whether a mutation is likely to impact negatively on a protein's function and/or stability. How do we go from here to a phenotype like a decreased growth rate in the presence of stress X? This is exactly the question that chemical-genetic studies try to address. Many labs, including our own, have used knock-out collections (of lab strains) to measure chemical-genetic interactions that give you a quantitative relative importance of each protein in a given condition. So, we can make the *huge* simplification of taking all deleterious mutations and just summing up their effects, assuming a linear combination of the effects of the corresponding knock-outs.
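In code, that simplification amounts to little more than one line (a sketch; 'chemgen' is assumed to map each condition to the relative fitness defects of the single-gene knock-outs):

def predict_growth_defect(deleterious_genes, condition, chemgen):
    # sum of knock-out effects, assuming deleterious mutations act like deletions
    # and that their effects combine linearly
    return sum(chemgen[condition].get(gene, 0.0) for gene in deleterious_genes)

# e.g. predict_growth_defect(hits[some_strain], some_condition, chemgen), with
# 'hits' coming from a variant-calling step like the one sketched above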

To test this idea I picked 4 conditions (out of the ~200 mentioned above) for which we have chemical-genetic information (from Parsons et al.) and where there is high growth rate variation across the 38 strains. With everything together I can test how well we can predict the measured growth rates under these conditions (relative to a lab strain):
Each entry in the plot represents one strain in a given condition. Higher values report worse predicted/experimental growth (relative to a lab strain). There is a highly significant correlation between measured and predicted growth defects (~0.57) overall, but cisplatin growth differences are not well predicted by these data. Given the many simplifications and the poor coverage of some of the methods used, I was surprised to see any correlation at all. This tells us that, at least for some conditions, we can use mutations found in coding regions and appropriately selected gene sets to predict growth differences.

This is exactly the message of Rob Jelier's paper from Ben Lehner's lab. When they started their work, the phenotypic dataset from Warringer and colleagues was not yet published so they had to generate their own measurements for this study. In addition, their study is much more careful in several ways. For example, they only used the sequences of the 19 strains that they say have higher coverage and accuracy. They also tried to estimate the impact of indels and to increase the size of the alignments (a crucial step in this process) by searching for distant homologs. If you are interested in making use of "personal" genomes you should really read this paper.

Stepping back a bit, I think I was excited about this paper because it finally connects the work that has been done in high-throughput characterization of a model organism with the diversity across individuals of that species. It serves as a bridge for many people to come to work in this area. There are a large number of immediate questions, like: how much do we really need to know to make good/better predictions? What kinds of interactions (transcriptional, genetic, conditional genetic) do we need to know to capture most of the variation? Can we select gene sets and gene weights in other species without the conditional-genetics information (by homology)?

As we are constantly told, the deluge of genome sequences will continue, so there are plenty of opportunities and data to analyze (I wish I had more time ;). Some recent examples of interest include the sequencing of 162 D. melanogaster lines with associated phenotypic data and the (somewhat narcissistic) personal 'omics study of Michael Snyder. To start making the jump to humans I think it would be great to have cellular phenotypic data (growth rate/survival under different conditions) for the same cells/tissue across a number of human individuals with sequenced genomes. Maybe in a couple of years I won't be as skeptical as I am now about our fortune cookie genomes.


Wednesday, February 29, 2012

Book Review - The Filter Bubble

Following my previous post I thought it was on topic to mention a book I read recently called “The Filter Bubble”. The book, authored by Eli Pariser, discusses the many applications of personalization filters in the digital world. Like several books I have read in the past couple of years, I found it via a TED talk where the author neatly summarizes the most important points. Even if you are not too interested in technology it is worth watching. I am usually very optimistic about the impact of technology on our lives but Pariser raises some interesting potential negative consequences of personalization filters.




The main premise of the book is that the digital world is increasingly being presented to us in a personalized way, a filter bubble. Examples include Facebook’s newsfeed and Google search, among many others. Because we want to avoid the flood of digital information we willingly give away commercially valuable personal information that can be used for filtering (and targeted advertisement). Conversely, the fact that so many people are giving out this information has created data mining opportunities in the most diverse markets. The book goes into many examples of how these datasets have been used by different companies, such as dating services and the intelligence community. The author also provides an interesting outlook on how these tracking methods might even find us in the offline world a la Minority Report.

If sifting through the flood of information to find the most interesting content is the positive side of personalization, what might be the downside? Eli Pariser argues that this filter “bubble”, which we increasingly find ourselves in, isolates us from other points of view. Since we are typically unaware that our view is being filtered we might get a narrow sense of reality. This would tend to reinforce our perceptions and personality. It is obvious that there are huge commercial interests in controlling our sense of reality, so keeping these filters in check is going to be increasingly important. This narrowing of reality may also stifle our creativity since novel ideas are so often found at the intersection between different ways of thinking. So, directing our attention to what might be of interest can inadvertently isolate us and make us less creative.

As much as I like content that resonates with my interests, I get a lot of satisfaction from finding out about new ideas and getting exposed to different ways of thinking. This is why I like the TED talks so much. There are few things better than a novel concept well explained - a spark that triggers a re-evaluation of your sense of the world. Even if these are ideas that I strongly disagree with, as often happens with politics here in the USA, I want to know about them if a significant proportion of people might think this way. So, even if the current filter systems are not effective to the point of isolating us, I think it is worth noting these trends and taking precautions.

The author offers immediate advice to those creating the filter bubble – let us see and tune your filters. One of the biggest issues he tries to bring up is that the filters are invisible. I know that Google personalizes my search but I have very little knowledge of how and why. The simple act of making these filters more visible should let us see the bubble. Also, if you are designing a filtering system, make it tunable. Sometimes I might want to get out of my comfort zone and see the world through a different lens.

Thursday, February 23, 2012

Academic value, jobs and PLoS ONE's mission

Becky Ward from the blog "It Takes 30" just posted a thoughtful comment regarding the Elsevier boycott. I like the fact that she adds some perspective as a former editor contributing to the ongoing discussion. This also follows from a recent blog post by Michael Eisen regarding academic jobs and impact factors. The title very much summarizes his position: "The widely held notion that high-impact publications determine who gets academic jobs, grants and tenure is wrong". Eisen is trying to play down the value of the "glamour" high impact factor magazines and fighting for the success of open access journals. It should be a no-brainer really. Scientific studies are mostly paid for by public money, they are evaluated by unpaid peers and published/read online. There is really no reason why scientific publishing should be behind pay-walls.

Obviously it is never as simple as it might appear at first glance. If putting science online were the only role publishers played I could just put all my work up on this blog. While I do write up some results as blog posts, I can guarantee you that I would soon be out of a job if that was all I did. So there must be other roles that scientific publishing plays, and even if these roles might be performed poorly or be outdated, they are needed and must be replaced before we can have a real change in scientific publishing.

The value of scientific publishing

In my view there are three main roles that scientific journals currently play: filtering, publishing and providing credit. The act of publishing itself is very straightforward and these days could easily cost close to nothing if the publishers have access to the appropriate software. But if publishing itself has benefited greatly from the shift online, filtering and credit are becoming increasingly complex in the online world.

Filtering
Moving to the digital world created a great attention crash that we are still trying to solve. What great scientific advances happened last year in my field? What about in unrelated fields that I cannot evaluate myself? I often hear that we should be able to read the literature and come up with answers to these questions directly, without regard to where the papers were published. However, try to imagine for a second that there were no journals. If PLoS ONE and its clones get what they are aiming for, this might be on the way. A quick check on PubMed tells me that 87,134 abstracts were made available in the past 30 days. That is something like 2,900 abstracts per day! Which of these are relevant for me? The current filtering system of tiered journals with increasing rejection rates is flawed, but I think it is clear that we cannot do away with it until we have another one in place.

Credit attribution
The attribution of credit is also intimately linked to the filtering process. Instead of asking about individual articles or research ideas, credit is about giving value to researchers, departments or universities. The current system is flawed because it overvalues the impact/prestige of the journals where the research gets published. Michael Eisen claims that impact factors are not taken into account when researchers are picked for group leader positions, but honestly this idea does not ring true to me. From my personal experience of applying for PI positions (more on that later), those that I see getting shortlisted for interviews tend to have papers in high-impact journals. On Twitter Eisen replied to this comment by saying "you assume interview are because of papers, whereas i assume they got papers & interviews because work is excellent". So either high impact factor journals are being incorrectly used to evaluate candidates or they are working well to filter excellent work. In either case, if we are to replace the current credit attribution system we need some other system in place.

Article level metrics
So how do we do away with the current focus on impact factors for both filtering and credit attribution? Both could be solved if we focused on evaluating articles instead of journals. The mission of PLoS ONE was exactly to develop article level metrics that would allow for a post-publication evaluation system. As they claim on their webpage, they want "to provide new, meaningful and efficient mechanisms for research assessment". To their credit, PLoS has been promoting the idea and making some article level indicators easily accessible, but I have yet to see a concrete plan to provide readers with a filtering/recommendation tool. As much as I love PLoS and try to publish in their journals as much as possible, in this regard PLoS ONE has so far been a failure. If PLoS and other open access publishers want to fight Elsevier and promote open access they have to invest heavily in filtering/recommendation engines. Partner with academic groups and private companies with similar goals (e.g. Mendeley?) if need be. With PLoS ONE they are contributing to the attention crash and making (finally) a profit off of it. It is time to change your tune, stop saying how big PLoS ONE is going to be next year and start saying how you are going to get back on track with your mission of post-publication filtering.

Summary
Without replacing the current filtering and credit attribution roles of traditional journals we won't do away with the need for a tiered structure in scientific publishing. We could still have open access tiered systems, but the current trend for open access journals appears to be the creation of large journals focused on the idea of post-publication peer review, since this is economically viable. However, without filtering systems, PLoS ONE and its many clones can only contribute to the attention crash problem and do not solve the issue of credit attribution. PLoS ONE's mission demands that they work on filtering/recommendation and I hope that, if nothing else, they can focus their message, marketing efforts and partnerships on this problem.


Wednesday, February 22, 2012

The 2012 Bioinformatics Survey

I am interrupting my current blogging hiatus to point to a great initiative by Michael Barton. He is collecting some information on those working in the fields of bioinformatics / computational biology in this survey. This is a repeat of a similar analysis done in 2008 and I think it is really worth getting a feeling for how things have been changing. We can all benefit from the end result. So far, after 2 weeks, there have been close to 400 entries to the survey but the rate of new entries is slowing down. So, if you have not done so already, go and fill it out or bug some colleague to do so.

Wednesday, May 25, 2011

Predicting kinase specificity from phosphorylation data

Over the past few years, improvements in mass-spectrometry methods have resulted in a big increase in throughput for the identification of post-translational modifications (PTMs). It is hard to even keep up with all the phosphoproteomics papers and the accumulation of phosphorylation data. Most often, improvements in methods result in interesting challenges and opportunities. In this case, how can we make use of this explosion in PTM data? I will try to explore a fairly straightforward idea on how to use phosphorylation data to predict kinase substrate specificity. I'll describe the general idea and just a first stab at it to show why I think it can work.

The inspiration for this is the work by Neduva and colleagues, who showed that we can search for motifs enriched within the proteins that interact with a domain of interest. For example, we can take a protein containing an SH3 domain, find all of its interaction partners, and we will likely see that they are enriched for proline-rich motifs of the type PxxP (x = any amino acid), the known binding preference for this domain. So the very obvious application to kinases would be to take the interaction partners of a kinase and find enriched peptide motifs. The advantage of looking at kinases, over any other type of peptide binding domain, is that we can focus specifically on phosphosites.

As a test case I picked the S. cerevisiae Cdc28p (Cdk1), which is known to phosphorylate the motif [ST]PxK. I used the STRING database to identify proteins that functionally interact with Cdc28 with a cut-off of 0.9 and retrieved all currently known phosphosites within these proteins. As a quick check I used Motif-X to search for enriched motifs. The first try was somewhat disappointing, but after removing phosphosites supported by fewer than 5 MS spectra and/or experiments I got back, as the most enriched motif, a logo matching the known Cdc28 preference:

This was probably the easiest kinase to try since it is known to typically phosphorylate its targets at multiple sites and it is heavily studied. Still, I think there is a lot of room for exploration here. If anyone is interested in collaborating on this let me know. If you're doing computational work I would be interested in some code/tools for motif enrichment. If you're doing experimental work let me know about your favorite kinases/species.
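To make the counting idea concrete (this is not Motif-X, just a crude check in the same spirit), here is a sketch that compares how often the known [ST]PxK pattern shows up in phosphosite windows from STRING interactors versus a background set of phosphosites; the window format is an assumption:

import re

CDC28_MOTIF = re.compile(r"[ST]P.K")  # known Cdc28 preference, matched from the phosphosite onwards

def motif_fraction(windows, offset=6):
    # 'windows' are 13-residue sequences centred on the phosphoacceptor (index 6)
    hits = sum(1 for w in windows if CDC28_MOTIF.match(w[offset:offset + 4]))
    return hits / float(len(windows))

# compare motif_fraction(interactor_windows) against motif_fraction(background_windows)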

Thursday, April 28, 2011

In defense of 'Omics

High-throughput studies tend to have a bad reputation. They are often derided as little more than fishing expeditions. Few have summarized these feelings as sharply as Sydney Brenner:
"So we now have a culture which is based on everything must be high-throughput.I like to call it low-input, high-throughput, no-output biology"
Having dealt with these types of data for so long, I am often in the strange position of having to defend the approaches. As I was in real need of procrastination, I decided to try to write some of these thoughts down.

Error rates
One of the biggest complaints directed at large-scale methods is that they have very high error rates. Usually these complaints come from scientists interested in studying system X or protein Y who dig into these datasets only to find that their protein of interest is missing. Are the error rates high? While this might be true for some methods, it is important to note that the error rates are almost always quantified and that those developing the methods keep pushing the rates down.

When thinking about 'small-scale' studies I could equally ask - why should I trust a single western blot image? How many westerns were put in the garbage bin before you got that really nice one that is featured in the paper? In fact, some methods for reducing the error only become feasible when operating at high throughput. As an example, when conducting pull-down experiments to determine protein-protein interactions, unspecific binding becomes much easier to call. This has led to the development of analysis tools that cannot be employed on single pull-down experiments.
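As a toy illustration of the kind of filter that only works at scale (not any particular published method), consider flagging preys that show up in a large fraction of unrelated pull-downs as likely non-specific background:

def frequent_fliers(pulldowns, max_fraction=0.3):
    # pulldowns: {bait: set of prey proteins identified in that experiment}
    counts = {}
    for preys in pulldowns.values():
        for prey in preys:
            counts[prey] = counts.get(prey, 0) + 1
    n_experiments = float(len(pulldowns))
    # preys seen in more than max_fraction of all experiments are suspect
    return {prey for prey, c in counts.items() if c / n_experiments > max_fraction}

With a handful of pull-downs this sort of frequency-based filtering is meaningless; with hundreds it becomes one of the simplest ways to call the background.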

So, by quantifying the error rates and driving them down via experimental or analysis improvements, 'omics research is, in fact, at the forefront of data quality. At the very least, you know what the error rate is and can use the information accordingly. Once the methods are improved to the extent that the errors are negligible or manageable, they are quietly no longer considered "omics". The best example of this, I think, is genome sequencing. Even with the current issues with next-gen sequencing, few put 'traditional' genome sequencing in the same bag as the other 'omics tools, although it too has quantifiable errors.

Standardization
Related to error quantification is standardization. To put it simply, large-scale data is typically deposited in databases and is available for re-use. What is the point of having really careful experiments if they will only be available for re-use, in any significant way, when a (potentially sloppy) curator digs the information out of papers? This availability fuels research by others who are not set up to perform the measurements. This is one of the reasons why bioinformatics thrives. The limitations become the ideas, not the experimental observations/measurements. Anyone can sit down, think of a problem and, with some luck, the required measurements (or a proxy of them) have been made by others for some unrelated purpose. This is why publications of large-scale studies are so highly cited: they are re-used over and over again.

Engineering mindset and costs
One other very common complaint about these methods is cost. It is common to feel that 'omics research is 'trendy', expensive and consumes too much of the science budgets. While the part about budget allocation might be true, the issue with costs is most certainly not. Large-scale methods are developed by people with an engineering mindset. The problems in this type of research are typically about how to make the methods work effectively, which includes making them cheaper, smaller, faster, etc. 'Omics research drives costs down.

Cataloging diversity
Besides these technical comments, the highest barrier to deal with when discussing these methods with others is a conceptual one. Is there such a thing as 'hypothesis free' research? To address this point let me go off on a small tangent. I am currently reading a neuroscience book - Beyond Boundaries - by Miguel Nicolelis, a researcher at Duke University. I will leave a proper review for some later post but, at some point, Nicolelis talks about the work of Santiago Ramon y Cajal. Ramon y Cajal is usually referred to as the father of the neuron theory, which postulates that the nervous system is made up of fundamental discrete units (neurons). His drawings of the neuronal circuits of different species are famous and easily recognizable. The amazing level of detail and effort that he put into these drawings really underscores his devotion to cataloging diversity. These observations inspired a revolution in neuroscience, much the same way Darwin's catalogs of diversity impacted biology. Should we not build catalogs of protein interactions, gene expression, post-translational modifications, etc.? I would argue that we must. 'Omics research drives errors and prices down and creates catalogs of easily accessible and re-usable observations that fuel research. I actually think that it frees researchers. While a few specialize in method development, others are free to dream up biological problems to solve, with the data-gathering effort shortened to a digital query.

Misunderstandings
So why the negative connotations? Part of it is simple backlash against the hype. As we know, most technologies tend to follow a hype cycle where early exaggerated excitement is usually followed by disappointment and backlash when they fail to deliver. A second important aspect is simply a lack of understanding of how to make use of the available data. This model of data generation separated from problem solving and analysis only makes sense if researchers can query the repositories and integrate the data into their research. It is sad to note that this capacity is far from universal. While new generations are likely to bring with them a different mindset, those developing the large-scale methods should also bear the responsibility of improving the re-usability of the data.

Thursday, March 03, 2011

Structure based prediction of kinase interactions

About a year ago Ben Turk's lab published a large-scale experimental effort to determine the substrate recognition preferences of most yeast kinases (Mok et al. Sci. Signal. 2010). They used a peptide screening approach to analyze 61 of the about 122 known S. cerevisiae kinases in order to derive, for each one, a position-specific scoring matrix (PSSM) describing its substrate recognition preference. In the figure below I show an example for the Hog1 MAPK, where it is clear that this kinase prefers to phosphorylate peptides that have a proline next to the S/T that is going to be phosphorylated.

Figure 1 - Example of the Hog1 substrate recognition preference derived from peptide screens. Each spot in the array contains a mixture of peptides that are randomized at all positions except the marked position (-5 to +4 relative to the phosphorylatable residue). A strong signal correlates with a preference for phosphorylating peptides containing that amino acid at the fixed position.

As was previously known, most kinases don't appear to have very striking substrate binding preferences. Still, these matrices should allow for significant predictions of kinase-site interactions. They should also allow us to benchmark previous efforts by Neil and other members of the Kobe lab on structure-based prediction of kinase substrate recognition. For this, I obtained the predicted substrate recognition matrices from the Predikin server and known kinase-site interactions from the PhosphoGrid database. I used these data to compare the predictive power of the experimentally determined kinase matrices (Mok et al.) with the predicted matrices from Predikin. This analysis was done about a year ago when the Mok et al. paper was published, but I don't think PhosphoGrid has been significantly updated since then.

PhosphoGrid had 422 kinase-site interactions for the 61 kinases analyzed in Mok et al., of which ~50% have in vivo evidence for kinase recognition. As expected, the known kinase-site interactions have higher experimental matrix scores than random kinase-site assignments (Fig 2).

Figure 2 - The set of kinase-site interactions used, broken down by the kinases with the highest representation. These sites, along with randomly selected phosphosites, were scored using the experimental matrices and the scores of both populations are summarized in the boxplots.


A random set of kinase-phosphosite interactions of equal size was used to quantify the predictive power of the experimental and the Predikin matrices with a ROC curve (Fig 3).
Figure 3 - Area under the ROC curve values for kinase-site predictions using both types of matrices.

Overall, the accuracy of the predicted matrices from Predikin matched that of the matrices derived from the peptide array experiments reasonably well, with only a small difference in AROC values. I broke down the predictions for individual kinases with at least 10 known sites. Benchmarking with such low numbers becomes very unreliable but, apart from the Cka1 kinase, the performance of the Predikin matrices matched the experimental results reasonably well.
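For readers who want the rough shape of the benchmark, here is a sketch (not the exact code used): score phosphosite windows with a kinase matrix as a sum of position-specific log-odds values and compare known substrates against random sites with an ROC curve. The matrix layout and window length are assumptions.

from sklearn.metrics import roc_auc_score

def pssm_score(window, pssm, center=5):
    # pssm: {position relative to the phosphosite: {amino acid: log-odds value}}
    # window: peptide sequence with the phosphoacceptor at index 'center'
    return sum(pssm[pos].get(window[center + pos], 0.0) for pos in pssm)

def benchmark(pssm, true_windows, random_windows):
    scores = [pssm_score(w, pssm) for w in true_windows + random_windows]
    labels = [1] * len(true_windows) + [0] * len(random_windows)
    return roc_auc_score(labels, scores)  # area under the ROC curve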

I am assuming here that Predikin was not updated with any information from the Mok et al. study to derive its predictions. If this is true, it would mean that structure-based prediction of kinase recognition preferences, as implemented in Predikin, is almost as accurate as preferences derived from peptide library approaches.

Friday, January 07, 2011

Why would you publish in Scientific Reports ?

The Nature Publishing Group (NPG) is launching a fully open access journal called Scientific Reports. Like the recently launched Nature Communications, this journal is online only and the authors cover (or can choose to cover for Nat Comm) the cost of publishing the articles in an open access format. Where 'Scientific Reports' differs most is that the journal will not reject papers based on their perceived impact. From their FAQ:
"Scientific Reports publishes original articles on the basis that they are technically sound, and papers are peer reviewed on this criterion alone. The importance of an article is determined by its readership after publication."

If that sounds familiar it should. This idea of post-publication peer review was introduced by PLoS ONE and Nature appears to be essentially copying the format of this successful PLoS journal. Even the reviewing practices are the same, whereby the academic editors can choose to accept/reject based on their own opinion or consult external peer reviewers. In fact, if I were working at PLoS I would have walked into work today with a bottle of champagne to celebrate. As they say, imitation is the sincerest form of flattery. NPG is increasing its portfolio of open access or open choice journals and hopefully it will start working on article level metrics. In all, this is a victory for the open-access movement and for science as a whole.

As I mentioned in a previous post, PLoS has shown that one way to sustain the costs of open access journals with high rejection rates is for the publisher to also run higher-volume journals. Both BioMed Central and, more recently, PLoS have also shown that high-volume open access publishing can be profitable, so Nature is now trying to get the best of both worlds: brand power from high-rejection-rate journals with a subscription model and a nice added income from higher-volume open access journals. If, by some chance, funders force a complete move to immediate open access, NPG will have a leg to stand on.

So why would you publish in Scientific Reports? Seriously, can someone tell me? Since the journal will not filter on perceived impact, they won't be playing the impact factor game. They did not go as far as naming it Nature X, so brand power will not be that high. It is priced similarly to PLoS ONE (until January 2012) and offers less author feedback information (i.e. article metrics). I really don't see any compelling reason why I would choose to send a paper to Scientific Reports over PLoS ONE.

Updated June 2013 - Many of you reach this page searching for the impact factor of Scientific Reports. It is now out and it is ~3. Yes, it is lower than PLoS ONE's, so you have yet another reason not to publish there.

Friday, December 31, 2010

End of the year with chemogenomics

Image taken from jurvetson at www.flickr.com/photos/jurvetson/3156246099/
Around this time of the year it is customary to make an assessment of the year that is ending and to make a mental list of things we wish for in the year ahead. Here is my personal (but work related :) take on this tradition.

My academic year ended with the publication of two works related to chemogenomics. Chemogenomics, or chemical genomics, studies the genome-wide response to a compound. Usually, collections of knock-outs or of strains over-expressing a large number of genes are grown in the presence or absence of a small molecule to assess the fitness cost (or advantage) that each perturbation confers under the drug treatment. This is what was done in these two works.

In the first one, Laura Kapitzky (a former postdoc colleague in the lab) used collections of KO strains in both S. cerevisiae and S. pombe to assay growth in the presence of different compounds. The objective was to study the evolution of the drug response in these distantly related fungi. In line with what was previously observed in the lab for genetic interactions and kinase-substrate interactions, we found that drug-gene functional interactions were poorly correlated across these two species. Perhaps one interesting highlight from this project was that we could combine data from both fungi to improve the prediction of the mode of action of the compounds.

The second project, in which I was only minimally involved, was a similar chemogenomic screen but at a much larger scale. As the title implies, "Phenotypic Landscape of a Bacterial Cell" (behind a paywall) is a very comprehensive study of the response of the whole E. coli knock-out library to an array of compounds and conditions. Robert, Athanasios and other members of the Carol Gross lab did an amazing job of creating this resource and picking some of the first gems from it.

Something that I wanted to highlight here was not so much what was discovered but what I was left wanting. These sorts of growth measurements tell us a lot about drug-gene relationships. We also have a growing knowledge of how genes genetically interact, either from similar growth measurements in double mutants or from predictions (as in STRING). These should then allow us to make predictions about how drugs interact. If two drugs act in synergy to decrease the growth of a bug, we should be able to rationalize that in terms of drug-gene and gene-gene interactions. I find this a very interesting area of research. Naively, this sort of data should allow us to predict drug combinations that target a specific species (e.g. a pathogen) or a diseased tissue but not the host or the healthy tissue. Here is a scientific wish for 2011: that these and other related datasets will give us a handle on this interesting problem.
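As a naive illustration of that wish (nothing more than a sketch, and not any published method), one could represent each compound by its vector of gene-level fitness scores and treat low profile correlation between two growth-inhibiting drugs as a weak hint that they hit different processes, making them candidates for synergy testing; the data layout here is assumed:

from itertools import combinations
from scipy.stats import pearsonr

def candidate_synergy_pairs(profiles, max_corr=0.2):
    # profiles: {drug: {gene: fitness score of that knock-out under the drug}}
    candidates = []
    for drug_a, drug_b in combinations(sorted(profiles), 2):
        shared = sorted(set(profiles[drug_a]) & set(profiles[drug_b]))
        r, _ = pearsonr([profiles[drug_a][g] for g in shared],
                        [profiles[drug_b][g] for g in shared])
        if r < max_corr:
            candidates.append((drug_a, drug_b, r))
    return candidates

A real attempt would obviously have to weigh the gene-gene interaction network and the direction of each effect, but even this crude profile comparison shows how naturally the chemogenomic data lends itself to the question.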

As for the future, I am entering the final year of my current funding source (thank you HFSP), so my attention is turning to finding either some more funds or another job. I will continue working on the evolution of signalling systems, in particular trying to find the function of post-translational modifications (aka P1). Unfortunately the project failed as an open science initiative, something that I have mostly given up on for now. I think the main reason it didn't work was the lack of collaborators with similar (open) interests and non-overlapping skill sets, as Greg and Neil were discussing in the Nodalpoint podcast a while ago.

See you all in 2011 !

Tuesday, December 21, 2010

The GABBA program

I was recently at the annual meeting of my former PhD program, the GABBA program, a Graduate Program in Areas of Basic and Applied Biology in Portugal. I realized that I never blogged about the Portuguese PhD programs and I thought I would share with you their somewhat unusual concept.

As in other PhD programs, GABBA students start by taking courses during the first semester of the program. The semester is divided into week-long courses on different subjects (think cell cycle, development, etc.) with invited teachers. What is different from most other programs I know of is that students then get to use their scholarship to do their research projects anywhere in the world. GABBA students get paid to do their research in any lab that accepts them, no strings attached. No return clause, not even a requirement to inform the program of research progress. There is an annual meeting where students (and alumni) get to go to Portugal to present their work but no one is obliged to go. It is also a nice opportunity to exchange tips and in some cases even start collaborations.

The annual meeting is always organized around Christmas time so most people end up going. I kept going to the meetings after finishing my PhD, mostly because I enjoy seeing the people but also because of the cool science. As you can imagine, everyone is scattered around the world in very nice labs doing research on all sorts of different biomedical subjects. This year there were a lot of talks about stem cells and an unusually high number of neurobiology-related talks. Some research of note for me included the work of Martina Bradic (Borowsky lab at NYU) on the convergent evolution of blind cave fish and the talk by Andre Sousa (Sestan lab at Yale) on the transcriptional profiling of human brain regions during development (http://hbatlas.org/).

The GABBA program takes international students as well but they are typically asked to do their research in Portugal. The applications are usually around June so keep an eye out if you are interested in applying. Have a look at the admissions page for more information.

Wednesday, November 24, 2010

This holiday season, make them spit in a tube

Black Friday is upon us and everyone here in the US is going consumer crazy. Along with the traditional discounts in the offline world, there are also tempting promotions in many online stores. One great example is the discount that 23andMe is offering until next Friday. If you have not heard about 23andMe, they are a direct-to-consumer genetics company that sells a SNP profiling service. You get to find out about your ancestry and genetic propensity for traits and some diseases. The analysis usually costs $499 (plus a one-year $5 monthly mandatory subscription) but they are having a $400 discount (use promo code UA3XJH). What better way to spend Christmas than having everyone spit into a little tube?

Wednesday, October 20, 2010

ICSB 2010 - From design principles to genome-wide and that persistent gap in-between

I am back from the 11th International Conference on Systems Biology (ICSB 2010) that was held in the lovely city of Edinburgh. A full week of talks dedicated to systems biology (however you might define it). Speaking of definitions, it was refreshing to note that no speaker spent much time trying to define the field this time around. Aside from the keynote lectures, there were constantly four parallel sessions (see program) so you were always guaranteed to miss out on something interesting. With the help of a couple of attendees we took some notes on the conference in a FriendFeed room. There are lots of notes to go through if you are interested, but here are a few of my favorite highlights.

Keynotes
Of all the keynotes, the ones I enjoyed most were from Luis Serrano, Mike Tyers and Aldons Lusis. Unfortunately, Sydney Brenner was scheduled but canceled at the last minute.

Luis Serrano talked about his lab's work on characterizing a mycoplasma species (notes here) using a multi-omics approach. They are trying to use available technologies to build parts lists and models of all aspects of this species. They have learned a lot by taking such a systematic approach on a bug with so few genes but there is no plan that I could see on how to follow up on all this work. I can see that many computational biology labs will use this data but it would be a missed opportunity if more labs don't continue to go beyond the omics limitations.

The lecture by Tyers (notes here) was also very much about omics. He talked about their recent effort (Breitkreutz et al., Science 2010) to search for kinase-protein interactions in yeast and how hard it is, in general, to study signalling pathways in this way (promiscuous interactions, complex systems, etc.). From kinases he moved to drug-gene interactions and chemogenomics. In particular, he briefly mentioned some unpublished work on the evolution and prediction of drug synergy. This is a topic that I am really interested in as an applied side of evolutionary biology (more on that hopefully soon).

Another keynote I enjoyed was from Aldons Lusis (see notes). His presentation centered on a strategy for association studies in mice (Bennett et al. Genome Research 2010). This sort of work is out of my comfort zone but I really liked all the examples he gave of using this strategy to find loci associated with clinical traits or protein/gene expression levels. Maybe I should be trying to read Nature Association Studies Genetics more often.


Parallel sessions
I went to the sessions on "Functional Genomics", "Cell signalling dynamics", "Parameterizing proteomics" and "Biological noise and cell decision making". In the Functional Genomics session, Lars Steinmetz talked about genome-wide analysis of antisense non-coding transcription and David Amberg's talk covered the use of genetic interactions to study actin mutants (an extension of Haarer et al. G&D 2007).
I really liked many of the presentations from the Cell Signaling Dynamics session, including talks by Walter Kolch, Timothy Elston and Nils Bluthgen. It was interesting to note that many presenters were following a similar approach of first enumerating different models that could achieve the function they were studying and then finding the most plausible by elimination.
From the proteomics session the highlight for me was the really cool work presented by Christian von Mering. They have essentially compiled a lot of mass-spec data and used corrected spectral counts to estimate protein abundance for many different species. The data can be found at http://pax-db.org and some of the results are published (Schrimpf et al., PLoS Biology 2009 and Weiss et al., Proteomics 2010). Overall the message appears to be that protein abundance is more conserved across species than gene expression.
Finally, from the session on noise and cell decision making I particularly liked the talk by Roy Kishony, on his lab's work on antibiotic response, and James Locke's analysis of sigma B promoters (Elowitz lab).

Bridging the gap
Besides all the cool science, I come back from this meeting with the feeling that we still have a huge gap between the -omics work and the detailed 'design principles' pathway analysis. There is even such a tension between people working in these two camps that it has become almost a joke. Maybe this is why it is so hard to define systems biology: each "type" of researcher sees it differently. Some would say that it is not systems biology if it is not genome-wide, while others will claim that we don't learn anything from omics (just a parts list). In this meeting there were great examples of both camps using established methods to attack new systems, but there is still no clear attempt to bridge the gap. How do we go from genome-wide to quantitative mechanistic understanding? Maybe next year in Heidelberg / Mannheim (ICSB 2011) we will see both camps, at least, acknowledging each other.

Tuesday, October 05, 2010

Book Review: The Visual Miscellaneum

After watching David McCandless's TED talk "The beauty of data visualization" I was sufficiently curious to go ahead and buy his most recent book "The Visual Miscellaneum". As expected, it is a very easy to "read" book containing interesting and beautifully presented trivia. The main idea behind the talk (and I assume the book) is to present information in a visual way to make data more tangible. McCandless complains that the information that is given in the news is hard to grasp without context and shows, through his infographics, how it can be improved. He keeps a website where some of the visualizations are freely available and if you enjoy them and the talk then the book is worth a look too.

What is interesting about these visualizations is that they occupy a space between art and data presentation. When I was going through the book I was wondering whether they could serve as a source of inspiration for presenting research data. Is there room for more artistic visualizations in scientific articles? Should we try to make data beautiful, or does the primary objective of conveying the result totally override any artistic intention? In a recent referee report a reviewer asked us to change some of the figures because he/she thought that there were redundant elements in them. The figure was not even about data but about the workflow of the project. This is not a complaint, just an example that probably illustrates well the current culture in academia. We are trained to be skeptical and to constantly look for tweaks or additions that obfuscate the results (e.g. weird scales, exploded pie charts). Maybe in our effort to be accurate we forget how important it is to make an image intuitive and pleasant.

Wednesday, September 22, 2010

Nodalpoint is back, as a podcast

If you read this blog then there is a very high chance that you know about Nodalpoint. It was one of the first (community) blogs related to science and where many bioinformatics bloggers, myself included, started out. Over the years, the site lost usage as people started their own independent blogs and Greg Tyrelle, the creator of Nodalpoint, eventually archived it.

The main website is back, in a way. Greg decided to start up a podcast series to discuss issues around bioinformatics and, I guess, whatever else he might be interested in. Go check it out. The first episode is a conversation with Neil Saunders, one of Nodalpoint's early users (blog, friendfeed, twitter).

Among many other things, they talk about the lack of traction that open science has among scientists. I agree with some of the points that were raised regarding the small niche of each specific research problem. It is not the full answer but it probably plays a role. There are so few people who have the skills and interest to tackle the same problem that creating an online community around any given scientific question becomes hard. Still, if we have not come together to openly share results and methods, we have at least witnessed the creation of many online communities that are working very well to discuss all sorts of different scientific issues (e.g. FriendFeed's Life Scientists, BioStar, Nature Network, etc.).

Friday, September 17, 2010

Systems Biology versus "real" biology

Scientific American has an article about this year's Lindau meeting of Nobel Laureates. It features an interesting conversation between Tim Hunt, Roland Pache (at the time a PhD student) and undergraduate Sophia Hsing-Jung Li.
Here is the video of the conversation:
The discussion centered around systems biology and Hunt was not shy about expressing his skepticism. Since I happen to see great value in both the omics and the design-principles sorts of work that characterize systems biology, my frustration grew quickly. The whole video can be neatly summarized by Hunt's advice that people working in systems biology should "spend plenty of time talking to real biologists".

Real biologists? ... I felt like writing a long rant about the findings that were made possible by the sort of work that he is so skeptical about, but then I thought about xkcd and relaxed a bit:

Tuesday, July 20, 2010

Do we still need pre-publication peer-review ?

A bit over a month ago Glyn Moody wrote a blog post arguing that the abundance of scientific publishing outlets removes the need for our current system of pre-publication peer-review. The post sparked an interesting discussion here on FriendFeed.

Glyn Moody tells us that we have now:
"yet another case of a system that was originally founded to cope with scarcity - in this case of outlets for academic papers. Peer review was worth the cost of people's time because opportunities to publish were rare and valuable and needed husbanding carefully"

Since we have an endless capacity to publish information online, Moody argues that there is no longer a need to pre-select before publication. We can leave all that behind us and do post-publication peer-review that is distributed across all of the readers, using the sorts of article level metrics that PLoS has been promoting.

More recently Duncan wrote another blog post that has some information I think is important for this discussion. He was trying to estimate how many articles have ever been published. In the process he noted an interesting number - the number of articles that are currently published per minute. PubMed keeps a table with the number of articles that they have information on per year. I don't think the last couple of years are well annotated or that the first decades are that reliable, so I just plotted here the totals between 1966 and 2007.

It is not surprising to see that the number of articles published per year is increasing; it probably matches our expectations well. I personally feel like I never have enough time to keep up with the literature. We are currently over 700,000 papers per year. A search on PubMed for articles published in 2009 returns 848,856 papers - something like 1.6 papers per minute (848,856 divided by the ~525,600 minutes in a year)!

So, although we have no scarcity of publishing outlets we have a huge scarcity of attention. It is very literally impossible to keep up with the current literature without some sophisticated filtering system. With all of the imperfections of our current System (TM) of editorial control, subjective peer review, subjective impact evaluations, impact factors and so on, we must agree that we need a lot of help filtering through these many articles.

I have read some people arguing that we should be capable of reading papers ourselves and deciding whether they are interesting/innovative or not. That is fine for the very narrow range of topics that are close to our area of interest. I have PubMed queries for my topics of interest and I do filter through these myself without relying (too much) on the journal they were published in, etc. The problem is everything else that is not within this extremely narrow range of topics, or the many papers that escape my queries. I want to be made aware of important new methods and new discoveries outside my narrow focus.

Moody and many others argue that we can do the filtering after publication through the aggregated actions of all of the readers. I totally agree: it should be possible to do the filtering after publication. It should be possible, but it is not in place yet. So, if we want to do away with the System... build a better system alongside it. Show that it works. I would pay for tools that would recommend papers for me to read. In my mind, this is where the publishers of today should be making their money: in tools that connect readers to what they want to read, not in content that should be free for anyone to read and re-use (open access).

Thursday, July 15, 2010

Review - The Shallows by Nicholas Carr

On a never-ending flight from Lisbon back to San Francisco I finished reading the latest book from Nicholas Carr: "The Shallows - What the Internet is Doing to Our Brains". The book is a very extended version of an article Carr wrote a few years ago entitled "Is Google Making Us Stupid?" that can be read online. If you like that article you will probably find the book interesting as well.

In the book (and article) Carr tries to convince the reader that the internet is reducing our capacity to read deeply. He acknowledges that there is no turning back to a world without the internet and he does not offer any solutions, just the warning. He explains how the internet, like many other communication revolutions (printing press, radio, etc.), changes how we perceive the world. In a very material way, it changes our brain as we interact with the web and learn to use it. He argues that the web promotes skimming the surface of every web page and that the constant distractions (email, social networks) are addictive. This addiction can even be explained by our species' ancient need to constantly be on the lookout for changes in the environment. So, by promoting this natural and addictive shallow intake of information, the internet is pushing aside the hard and deep type of reading that has been one of mankind's greatest achievements.

After reading all of this I should be scared. I easily spend more than ten hours a day on these interwebs and my job as a researcher depends crucially on my capacity to read other scientific works deeply, reason about them, come up with hypotheses, experiments, etc. So why am I still writing this blog post instead of sitting in some corner reading some very large book? Probably because I do not share Nicholas Carr's pessimistic view. I actually agreed with a lot more than I was expecting to before reading the book. I certainly believe that, like any other tool, the internet changes our brains as we use it. I also agree that reading online promotes the skimming behavior that the book describes. I observe the same from my own experience. What I find hard to believe is that the internet will result in the utter destruction of mankind as we know it (* unless saved by The Doctor).

It is just a personal experience but, despite my addiction to the internet, I haven't stopped reading "deeply". Not only is it a job requirement, I enjoy it. One of my favorite ways to spend a Saturday morning is to get something to read and have a long breakfast outside. At work I skim through articles and feeds to find what I need, and when I do, I print it out to read deeply. That is why I have piles of articles on my desk. This is just to say that I found a way around my personal difficulty with deep reading on a computer screen. In other words, if it is required, we will find a way to do it. The internet habits that might be less conducive to deep thought are no worse than many other addictions of our society, and we have learned to cope with those.

I cannot imagine going back to a time when I would need to go to a library and painfully hunt down every single scientific article I wanted. Not to mention the impossibility of easily re-using other people's data and code. So even if a small but significant number of people can't find a way to cope with the lure of the snippets, the advantages still overwhelmingly outweigh the disadvantages.

This topic and book have been covered extensively online. The fact that such a wealth of interesting and diverse opinions has shown up on the very technological platform Carr criticizes in the book is almost evidence in itself that he is wrong (granted, some of these are also newspapers :). Examples:

Mind Over Mass Media (by Steven Pinker)
Carr's reply

Interview with Nick Carr and New York Times blogger Nick Bilton

And for a different take on the topic, here is an interview with Clay Shirky

Thursday, May 27, 2010

Genetic interactions in powers of ten

During my PhD at EMBL I attended a talk by Peer Bork in which he said that computational biologists have the luxury of being able to work at any level of biological organization (atoms, cellular interactions, organisms, ecosystems, etc.). At the time his lab was starting to work with metagenomics and his talks would cover the whole range of topics from protein domains to ecosystems. This idea of studying biology across these different scales reminded me of a very inspiring short movie entitled "Powers of Ten" (Wikipedia entry). This 1977 short film was commissioned by IBM and written and directed by Ray Eames and Charles Eames. It takes the viewer on a journey in space, from the very small at atomic resolution to the outer reaches of the universe, in incremental steps of powers of ten. It is only about 10 minutes long and, if you haven't already seen it, the embedded version is below (while it lasts):


With all the different applications of genetic interaction screening going on here in the Krogan lab, we thought it would be interesting to write an essay that would, in the same spirit as this short movie, take the reader on a journey across different scales of biology. The essay has just been made available online and I hope you enjoy the ride :).

We hope it serves as a tutorial for people interested in using genetic interaction data. There is more and more of this kind of data being deposited in databases and only a small fraction of it is being used to its fullest potential. We tried to show several examples of concrete findings that were first hinted at by genetic interaction data.
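As a small illustration of how easy it is to start playing with this kind of data, the sketch below filters a BioGRID-style bulk download for yeast genetic interactions using pandas. The file name, the column names and the taxonomy identifier are assumptions based on the BioGRID "tab2" format, so check them against the actual download.

```python
import pandas as pd

# Hypothetical local copy of a BioGRID bulk download (tab2 format assumed)
interactions = pd.read_csv("BIOGRID-ALL.tab2.txt", sep="\t", low_memory=False)

# Keep genetic (as opposed to physical) interactions reported in S. cerevisiae
genetic = interactions[
    (interactions["Experimental System Type"] == "genetic")
    & (interactions["Organism Interactor A"] == 559292)  # assumed S. cerevisiae taxon id
]

# How often does each experimental system (Negative Genetic, Positive Genetic, ...) appear?
print(genetic["Experimental System"].value_counts())

# Negative (aggravating) interactions for one gene of interest
cdc28 = genetic[
    (genetic["Official Symbol Interactor A"] == "CDC28")
    & (genetic["Experimental System"] == "Negative Genetic")
]
print(cdc28[["Official Symbol Interactor A", "Official Symbol Interactor B"]].head())
```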

Additionally, we were trying to make the point that developments in high-throughput methods are reducing the limits on what can be observed in biological systems with the same methodologies. This is interesting because it challenges us to build models that can explain biological systems across different layers of biological organization. How does a change in DNA propagate across these layers ? Can it change the meaning of a codon, impact a protein's stability or interactions, affect the action potentials in a neuronal cell, and change how species interact ? As we increase our capacity to monitor biological systems we should not only be able to tackle specific layers (e.g. understand protein folding) but will eventually be concerned with coupling these different models to each other.

Friday, April 30, 2010

Kaggle - a home for data mining challenges

I got a promotional email today about a new project called Kaggle. Somewhat related to Innocentive, this project aims to connect challenging problems with people who have the right set of skills to solve them. Kaggle is more specifically aiming to host prediction challenges and should appeal more to the data mining community. For example, the site is currently hosting a challenge about HIV progression in which problem solvers are given a training dataset and asked to predict the improvement in a patient's viral load.
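Out of curiosity, here is what a first, naive baseline for this kind of challenge might look like: a plain logistic regression predicting whether a patient's viral load improved. The file name and column names are made up for illustration; the real competition data will of course look different.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical training table: one row per patient, a binary "improved" label
# and a few numeric covariates (column names are illustrative only)
train = pd.read_csv("hiv_training.csv")
features = ["baseline_viral_load", "cd4_count", "weeks_on_treatment"]
X = train[features]
y = train["improved"]

# The simplest possible baseline model
model = LogisticRegression(max_iter=1000)

# Cross-validated accuracy gives a rough sense of how hard the problem is
scores = cross_val_score(model, X, y, cv=5)
print("baseline accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```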

I sent a few questions to Anthony Goldbloom (who works for Kaggle) to get a better idea of what the site is about:

Could you just tell me a bit about the company ? 
The project was inspired by an internship I did as a journalist in London in 2008, when I wrote about the use of data by organizations. I am an econometrician by training and I was excited to see the principles we use to forecast economic growth, inflation etc, being applied by organizations. I returned to Australia and resolved to get involved in the broader analytics community. That's how Kaggle was born.  

It looks like a young startup, is this right?
The project is only two weeks old and we've been thrilled with the response - we've attracted over 6,000 unique visitors. 

We launched the Eurovision contest to get things going. In the last few days we released the HIV Progression Prediction competition. This was my introduction to bioinformatics, which seems like a fascinating area - we're hoping to attract more such competitions. Perhaps your readers have ideas or data.

Does the name mean anything ?
The name doesn't mean anything. I got tired of coming up with great names and finding they were taken (and that the owner would only sell for $xx,xxx). As a young project,  our funds could be better spent elsewhere, so I built a program that iterated over different combinations of letters and printed a list of available and phonetic domain names. (I put this program on the web for others in a similar situation.) 

How do you hope to be different from what Innocentive is doing ?
The project is solely focused on data competitions. This enables us to offer services - e.g. to help our clients frame their problems, anonymize their data,  etc. 

The platform is also easily extensible, so we can modify it to suit the specific needs of different data competitions. 

We will host a rating system/league table, so that statisticians can use strong performances to market themselves. The rating system also allows us to host forecasting competitions, since the competition host will know who has a track record of forecasting well (and therefore who to pay attention to).

In the medium term, we plan to also offer a tender system, so that consultants can bid for work from organizations and researchers all over the world. From the organization's perspective, the rating system means they know what they're paying for. From the consultant's perspective, they don't have to waste time touting for work and they get access to interesting clients and datasets.
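As an aside, the domain-name program Anthony mentions above is a fun little exercise. Here is a rough sketch of how one might approach it in Python; the alternating consonant-vowel pattern and the DNS-lookup check are my own assumptions, not his actual code (a name that fails to resolve is only a weak hint that it might be free, so a proper check would still go through a registrar or whois service).

```python
import itertools
import socket

CONSONANTS = "bdfgklmnprstvz"
VOWELS = "aeiou"

def candidate_names(length=6):
    """Yield pronounceable names by alternating consonants and vowels (e.g. 'kagele')."""
    pattern = [CONSONANTS if i % 2 == 0 else VOWELS for i in range(length)]
    for letters in itertools.product(*pattern):
        yield "".join(letters)

def probably_unregistered(domain):
    """Very rough availability check: a domain with no DNS record *might* be free."""
    try:
        socket.gethostbyname(domain)
        return False  # resolves, so it is definitely taken
    except socket.gaierror:
        return True   # no DNS record; still needs a real whois check

# Print the first few candidates that do not resolve
for name in itertools.islice(candidate_names(), 50):
    domain = name + ".com"
    if probably_unregistered(domain):
        print(domain)
```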