Tuesday, April 27, 2010

<rant>
Life isn't fair; science is part of life; therefore science isn't fair. That would be the short version of what I am thinking, but this is a rant so I will stretch it out a bit more.
We learn early on that in our line of work there is almost no correlation between the amount of work we do and the results we get. You need luck, and I am not turning mystical on you here; I mean the low-likelihood kind of luck. Even if you do everything right, being successful in science depends mostly on factors outside your control. A somewhat random pool of people end up being in the right place at the right time to go on with their academic work. Almost like in a game of musical chairs, those with enough passion and perseverance to sustain the blows of lady luck get to play in the final rounds. Granted, I have been at this for only a few years, but I have seen my share of hard-working people getting scooped or hitting the wall with impossible projects. Try explaining scooping to non-scientists to see how ridiculous it sounds. I have also seen people (myself included) get authorships for things I would not consider worthy of such.
So … science isn't fair. These were exactly the sort of observations that made me start thinking about open science a few years ago. We could help even out the playing field if we were all a bit more open about what we are working on. Too many financial and personal resources are eaten away by the duplication of research agendas.
</rant>
Tuesday, April 13, 2010
Nature Communications serves its first papers
The new Nature brand journal (Nature Communications) published its first set of papers this week. It is an interesting development in scientific publishing for several reasons. This is the first Nature brand journal that is online only and offers an (expensive) $5000 open access option. They are also positioning it specifically as a lower-tier journal relative to previous Nature journals. According to the scope section:
"papers published in Nature Communications will be of high quality, without necessarily having the scientific reach of papers published in Nature and the Nature research journals."So why is Nature dipping its toes in higher volume open access versus its typical market of highly selective closed access papers ? A bit of context might be required and some of the discussions from 2008 about the PLoS business model are worth revisiting. A few years ago, Declan Butler, a reporter from Nature, wrote an overly negative news piece about PLoS ONE which generated a huge online discussion (see Bora's link fest). Timo Hannay's reaction to this discussion was a much more balanced point of view from Nature's side of things. Essentially, Timo Hanny was pointing out that PLoS had failed to make a profit with their more selective journals and that it was showing that a lower tier of less selective journals are required to subsidize the higher tiers. Timo also said that PLoS was creating barriers to market entry for other OA publishers because they were using philanthropic grants to sustain their business.
So with this in mind, Nature Communications could be seen as bet hedging. Open access might be here to stay due to mandates from funding agencies. If that is the case, the example from PLoS shows us that the only way to sustain highly selective journals is to also publish lower-tier, less selective journals. This way the publishing house can also pass papers directly down its chain of journals and possibly even pass around the referee reports to expedite publishing.
If most publishers try to cover the whole range of journal selectivity, how many publishers will there be a market for?
While PLoS and Nature are expanding down this perceived pyramid of journal selectivity, BMC has been trying to expand up. This week, BMC Biology and Journal of Biology announced that the two journals are merging to become the new flagship journal of BMC. I wish the best to the reborn BMC Biology, but expanding up the ladder of "perceived impact" is much harder than expanding down.
Through all of this we have still not managed to do away with the idea of journal prestige or impact. PLoS ONE promised to provide us with ways to filter and sort papers on their individual value, but we are not there yet. Ironically, these "editorial" services might end up coming from third-party programs like Mendeley, CiteULike or Papers.
Sunday, February 21, 2010
The stream
[Image: http://www.flickr.com/photos/hamed/ CC BY 2.0]
One interesting thing about all this proliferation of social networks and feed aggregators is seeing their evolution over time. Over the past couple of years some of their features have become somewhat standard. You could say this is just because some websites keep stealing ideas from others, but it also tells us which features seem to be useful and which implementations are intuitive to their users.
One idea that is central and common to all of these social websites is the concept of the stream: a list of updates from your contacts in the network, typically ordered by time, that you can interact with either by commenting or, more simply, by stating that you find an item interesting. These actions are in turn propagated to your own contacts, and so on.
It is impressive to see how this simple idea became so widespread in so little time. Facebook estimates that it has over 400 million active users; if Facebook were a country it would be the third most populous after China and India. We had plenty of ways to interact with friends and colleagues online before these social networks arrived (email and instant messaging among others), so why did they become so popular? The first few iterations of the stream reminded me a lot of the mass emails and chain emails from a few years back. It is also somewhat similar to how people were using their status in instant messaging tools to broadcast news about themselves. These two examples show that, when given the tools, people enjoy telling their contacts what they are up to.
Status messages in instant messaging tools have no history, and broadcasting jokes by email is impolite since most people use email for work. Broadcasting to your social network in a non-intrusive way fills a need that previous tools could not meet.
It's clear that the stream is here to stay, but where is it heading?
The stream localized
It is easy to imagine how interesting it would be to get tips on what to eat when "checking in" to a restaurant, or to find out that a friend is just around the corner in a cafe you like. Still, you don't have to be too paranoid to start thinking about the implications of telling the world where you are. "Please Rob Me" is the name of a website that, as the name implies, was created exactly to raise awareness of these privacy concerns.
Most likely these tools will iterate through changes in their privacy settings. For example, Google Latitude lets you share your location with only a select group of people or applications, and lets you set the level of detail shared (e.g. exact position versus area/city). Given the many business opportunities around location-based advertising, companies will certainly try to make location sharing a standard property of the stream. The advertisement system in the movie Minority Report comes to mind.
Social Searching
After releasing Google Buzz, Google also announced that it had acquired the company Aardvark. If you use sites like Twitter or many of the other social networks, you have probably tried to broadcast a question. If you are not sure who exactly knows the answer, there is no harm in casting a wide (and non-intrusive) net to try to find one. The term "lazyweb" describes this sort of question broadcasting, and on Twitter there are even simple services organized around these "lazyweb" questions (see Lazytweet as an example).
Aardvark tries to take this concept a bit further by targeting your questions to the people most likely to know the answer instead of simply broadcasting to your whole network. When you sign up to the service you tell it what subjects you might be able to answer and how often you don't mind getting questions. In return you can ask Aardvark any question you want and it will try to route it to an "expert". This sort of social search is a useful complement to current search engines: you're not supposed to ask questions that are easy to answer with Google, and it will take longer to get a reply, but you can ask more subjective questions and hopefully get very knowledgeable answers.
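In case the routing idea sounds abstract, here is a toy sketch of it in Python (my own illustration, not Aardvark's actual algorithm; the names and data are made up):

```python
def route_question(question_topics, experts):
    """Pick the contact whose declared topics best overlap the
    question's topics (a toy version of expertise-based routing)."""
    def overlap(expert):
        return len(set(question_topics) & set(expert["topics"]))
    best = max(experts, key=overlap)
    return best["name"] if overlap(best) > 0 else None

experts = [
    {"name": "alice", "topics": ["wine", "cooking"]},
    {"name": "bob", "topics": ["bioinformatics", "python"]},
]
print(route_question(["python", "statistics"], experts))  # -> bob
```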
I have tried asking questions in different social networks and a few times in Aardvark. Predictably, the quantity and quality of the replies depend mostly on how specific the question is. Broad and subjective questions get many useful replies, while questions on very specialized topics will probably go unanswered.
The success of such an approach depends on many different factors but it looks like an interesting direction for search.
What do you think?
In what other ways will we be using the stream?
Friday, February 05, 2010
Review - You are not a gadget
I just finished reading "You are not a gadget" by Jaron Lanier. The book is very much in the same tone as an article he recently wrote for Edge called "Digital Maoism: The Hazards of the New Online Collectivism". Very few books have made me want to say "No!" out loud so many times while reading them. I enjoy reading opinions that run contrary to my own because I think it is important to challenge our ideas; this is why I like reading Rough Type. This book, however, was extremely confusing to me. It reads mostly as a collection of essays and often deviates from its path. I still think it was worth reading because of the importance of the topic.
If you read the essay linked above you will get the general feeling conveyed in the book. As Lanier writes at the end of the first chapter:
"So, in this book, I have spun a long tale of belief in the opposites of computationalism, the noosphere, the Singularity, web 2.0, the long tail, and all the rest. I hope the volume of my contrarianism will foster an alternative mental environment, where the exciting opportunity to start creating a new digital humanism can begin".
I think these sentences summarize well what he set out to do in this book: to counter the rising open culture / web 2.0 movement and create some "alternative mental environment" for the future of web culture. Some of the things he talks about I fully subscribe to. If you believe that the singularity is near and that we are about to merge with the machines in the next couple of years, you are about as bonkers as the rapture people. The wisdom of the crowds can do a great job at annotating images, but it will not cure cancer. Also, the rise of the open culture (free content, mash-ups, etc.) is hurting content producers, and we can't just say that they are the dinosaurs and let them figure it out while we pirate their goods. Journalism is fundamental to democracy and we need to figure out a way to make it work.
What I dislike about the book is the overly negative tone. How many people really believe that the "wisdom of the crowds" can solve the world's problems? How many people have even heard of the term? I would risk saying that Lanier spends too much time around Silicon Valley geeks. Sure, there is an open culture on the web, but I pay more for content today than I ever did before (The Economist, Nature Reviews Genetics, Netflix, iTunes, Amazon on Demand, Pandora One, etc.). The web 2.0 mash-up craze peaked when Time named "You" the person of the year (twitter is not content ;). Also, I like YouTube clips as much as anyone, and some of them can be just amazing (e.g. Kutiman's mash-ups), but I still want to pay to see Avatar again in glorious 3D IMAX.
One idea he mentions often is that of technological lock-in. Just as media formats can get locked in once a majority uses them, Lanier argues that concepts and ideas can be equally locked in. An example he gives is the concept of files on computers: we are no longer free to experiment with the way information is stored in a computer system because this choice has been locked in.
What I guess Lanier was trying to say with this warning about technological lock-ins is that we run the risk of getting trapped in a set of ideas about the web that devalue humanity and the content we produce, and give too much value to the cloud of computers that underlies the net. Even if I were to agree that current web culture tends to devalue content and humanity, I don't think these lock-ins can be that powerful. We see net culture changing every day before us, and so far we have gained much more than we have lost.
In summary, I would say that the problems he talks about are important, but the book is overly pessimistic about our current web culture.
Predicting and explaining drug-drug interactions
I am generally interested in chemogenomic and drug interaction studies as a complement to what we work on in the Krogan lab (genetic interactions). Much like genetic interaction screening, where the fitness of double mutant strains is compared with that of the individual single mutants, chemogenomics tries to identify drug-gene interactions, while drug-drug interaction screening attempts to find cases where the combined effect of two compounds on fitness differs from that expected from the combination of their individual effects.
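As a concrete illustration of the scoring logic (my own minimal sketch, not taken from either paper, and assuming a multiplicative model for the neutral expectation, which is one common choice):

```python
def interaction_score(f_a, f_b, f_ab):
    """Deviation of a double perturbation from a multiplicative
    neutral expectation.

    f_a, f_b : relative fitness under each single perturbation
    f_ab     : relative fitness under the combined perturbation
    Returns epsilon: < 0 suggests synergy (aggravating),
    > 0 suggests antagonism (alleviating), ~ 0 is neutral.
    """
    expected = f_a * f_b
    return f_ab - expected

# Example: two drugs that each reduce fitness to 70%.
# The neutral expectation for the combination is 0.7 * 0.7 = 0.49.
print(interaction_score(0.7, 0.7, 0.60))  # > 0: antagonistic
print(interaction_score(0.7, 0.7, 0.35))  # < 0: synergistic
```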
I read two recent papers on drug-drug interactions that I found interesting. One was by Bollenbach and colleagues from the Kishony lab (published in Cell) and the other was by Jansen and colleagues (published in MSB). In the first, the authors present an explanation for a previously observed drug-drug interaction. It had been shown that the combination of DNA and protein synthesis inhibitors results in a smaller reduction of fitness than expected from a neutral combination model (termed an antagonistic interaction). The authors show in this paper that, in the presence of DNA synthesis inhibitors, ribosomal genes are not optimally expressed. This imbalance between ribosome production and cell growth is detrimental to the cell and can be, at least in part, corrected by protein synthesis inhibitors, explaining why these can suppress the effects of the DNA synthesis inhibitors.
Although it is a relatively simple idea (once described), I think it shows how complex these drug-drug interactions can be and, to some extent, how much they can tell us about a cell.
In the second paper, Jansen and colleagues try to develop an approach for predicting drug-drug interactions based on chemogenomic data. There are many obvious reasons why this would be useful and I find this line of research extremely interesting. What surprised me was the simplicity of the approach and the disappointing benchmarks.
The end result of a chemogenomic screen is a vector of drug-gene interaction scores that tells us how the combination of the drug with each mutant (normally KO strains) affects growth compared with the neutral expectation from the combined effect of the individual perturbations. It had been previously shown that drugs with similar drug-gene vectors tend to have similar mechanisms of action (Parsons et al. 2006 Cell). What Jansen and colleagues now claim is that the similarity of drug-gene vectors is predictive not only of a similar mode of action but also of drug-drug interactions. Specifically, they try to show that drugs with similar profiles are more likely to be synergistic, such that the combined effect of both drugs is more detrimental to the cell than the expected neutral combination.
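In outline, the prediction step can be this simple (a sketch under my own assumptions: Pearson correlation as the similarity measure, which is one of several possible choices — the authors tried three — and made-up profiles):

```python
import numpy as np

def profile_similarity(v1, v2):
    """Pearson correlation between two drug-gene interaction
    profiles (vectors of interaction scores over the same
    collection of mutants)."""
    return float(np.corrcoef(v1, v2)[0, 1])

# Hypothetical profiles over five KO strains.
drug_x = np.array([-1.2, 0.1, -0.8, 0.0, -2.0])
drug_y = np.array([-1.0, 0.2, -0.9, 0.1, -1.7])

# Under the paper's hypothesis, a high similarity would flag the
# pair (x, y) as a candidate synergistic combination to test.
print(profile_similarity(drug_x, drug_y))
```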
Although the authors report experimental validation of their predictions with an accuracy of 56%, they also benchmark the predictions against drug pairs previously known to be synergistic. This benchmark is somewhat disappointing, since a significant enrichment of these true-positive pairs is only seen for a narrow range of cut-offs and with 2 out of 3 ways of calculating drug-profile similarity. I wish the authors had commented on this difference between the relatively poor performance on the benchmark and the very high accuracy observed in their experimental tests. They also show that these predicted synergistic pairs are well conserved from S. cerevisiae to C. albicans, which contradicts a previous Nature Biotech paper that I mentioned in a previous post.
Are drug synergies this easy to predict and this well conserved across species? I am personally not convinced based on the data from this paper alone, so I am holding out for further validation by other groups or additional, larger datasets/benchmarks.
Friday, January 22, 2010
Recently read - Jan 2010
Blueprint for antimicrobial hit discovery targeting metabolic networks
Y. Shen, J. Liu, G. Estiu, B. Isin, Y-Y. Ahn, D-S. Lee, A-L. Barabási, V. Kapatral, O. Wiest, and Z. N. Oltvai
The authors use flux balance analysis to identify reactions that are essential for S. aureus growth. The enzymes required for the identified pathways were selected for in silico drug screening using both known structures and homology models. Inhibitors identified computationally were then tested experimentally. I particularly liked the breadth of methods used in this study (FBA, homology modelling, ligand docking and experimental verification). It shows the usefulness of the growing knowledge across these different areas (networks and structures).
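For readers unfamiliar with FBA, here is a toy version of the essentiality test (my own sketch with a made-up two-metabolite network, not the model from the paper): growth is maximized subject to steady-state constraints Sv = 0 and flux bounds, and a reaction is flagged essential if forcing its flux to zero drops the maximal growth to ~0.

```python
import numpy as np
from scipy.optimize import linprog

# Toy stoichiometric matrix S (metabolites x reactions) for the
# linear pathway: uptake -> A, A -> B, B -> biomass.
S = np.array([
    [1, -1,  0],   # metabolite A: made by uptake, used by r2
    [0,  1, -1],   # metabolite B: made by r2, used by biomass
])
c = [0, 0, -1]                 # maximize biomass flux (linprog minimizes)
bounds = [(0, 10)] * 3         # flux bounds for the three reactions

def max_growth(knockout=None):
    b = list(bounds)
    if knockout is not None:
        b[knockout] = (0, 0)   # simulate deleting this reaction
    res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=b)
    return -res.fun            # maximal biomass flux at steady state

print(max_growth())            # wild type: 10.0
print(max_growth(knockout=1))  # A -> B deleted: 0.0, i.e. essential
```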
Quantitative Phosphoproteomics Reveals Widespread Full Phosphorylation Site Occupancy During Mitosis
Jesper V. Olsen, Michiel Vermeulen, Anna Santamaria, Chanchal Kumar, Martin L. Miller, Lars J. Jensen, Florian Gnad, Jürgen Cox, Thomas S. Jensen, Erich A. Nigg, Søren Brunak, and Matthias Mann
In this study HeLa cells were synchronized in different stages of the cell cycle and their proteins and phosphorylation sites were quantified relative to asynchronous cells using a SILAC mass-spec approach. Changes in protein abundance and phosphorylation were combined with transcriptional changes and these were used to identify previously known and potentially novel complexes and kinase-substrate interactions important for cell-cycle progression. In addition, I thought it was pretty cool that the authors found a way to directly quantify the phosphorylation site occupancy from the mass-spec results. I was only slightly disappointed that the authors did not attempt to do a cross-species analysis given the available data from Liam Holt et al. on Cdk1 phosphorylation in S. cerevisiae. (Bonus- spot the blogger in the author list)
The Genetic Landscape of a Cell
Michael Costanzo et al.
This paper reports a large-scale effort to quantify genetic interactions for approximately 1700 x 3800 gene pairs in S. cerevisiae. As is typically the case for these sorts of "resource" papers describing a large dataset, there is no way a single paper can do full justice to the work. The authors mostly show different ways to use this information: 1) predicting gene function, 2) mapping functional interactions between complexes and functional groups and 3) predicting drug targets. Hopefully cell biology labs will pick up on this information to search for their genes of interest, and bioinformatics groups will continue to find ways to make these resources easier to navigate (see STRING for a good example of this).
Thursday, January 14, 2010
The joys of print
For the past two months I have been enjoying my first ever print subscription to a scientific journal. The good folks over at Nature Reviews Genetics offered me a small discount that nudged me into it (thank you!). I had thought that if I ever tried a print subscription it would surely be for a review journal, and I can't say I regret the decision. Having the print issue to read on my commute makes me read articles that I would not normally print out, and the front section (research highlights) is a good way to catch up on science news.
Maybe this explosion of e-readers will make it easier to emulate the browsing experience of a bound print copy. As Nicholas Carr (and others) have pointed out, changes in internet technology shape the way we think and use information. In a recent post, Carr gives his answer to this year's Edge annual question "How is the Internet changing the way you think?".
He writes:
"My own reading and thinking habits have shifted dramatically since I first logged onto the Web fifteen or so years ago. I now do the bulk of my reading and researching online. And my brain has changed as a result. Even as I’ve become more adept at navigating the rapids of the Net, I have experienced a steady decay in my ability to sustain my attention."
I think that Carr is, as usual, exaggerating in his pessimistic view of web culture. In my work I scan online and print to read; the only difference between that workflow and browsing a bound print copy is the way I filter what I read. Still, it is worth a thought. Maybe we should be working on technologies that help us by forcing us to focus our attention better. For now, a printout will do :).
Monday, January 11, 2010
In science, data without purpose is sometimes required
The title is probably flamebait, but it might get you to read my little rant about data production in science. It's something I have been meaning to write about for a while, but Deepak's post provided the extra incentive.
I think Deepak's post was a reminder that science is nothing without hypotheses, and I certainly agree with that. To put this into context, it may be worth pointing again to the Wired article "The End of Theory", where Chris Anderson painfully tries to make the point that, with the deluge of data we are seeing, we don't need models or hypotheses; we just need to crunch the data looking for correlations.
I strongly disagree with this viewpoint. What would we learn about reality this way? At most we would see correlations and have some predictive power about future events, but we would not know the mechanisms, and that's the interesting part.
So why is data without purpose sometimes justified? What I mean is that the capacity to produce data and the capacity to analyse it do not have to be centralized in the same place. Mine is the perspective of a bioinformatician, someone who has benefited a lot from the data deluge in biology and from the fact that data is (mostly) made available to others. It has allowed many studies that reuse pre-existing results to answer new questions.
I also work in a lab that develops genetic interaction screening methods, and I end up having some discussions about this topic. Many people dislike this sort of research, finding creative names like "fishing expedition" to describe it. The truth is that there are many types of data we need to collect (genomes, gene expression, protein-protein interactions, etc.) that we know will be useful for understanding how cells work. We just need more accurate and cheaper methods to get them, and there is no other way but to make the data production itself the focus of the research.
Sunday, January 03, 2010
Stitching different web tools to organize a project
A little over a year ago I mentioned a project I was working on about the prediction and evolution of E3 ligase targets (aka P1). As I said back then, I am free to risk as much as I want in sharing ongoing results, and Nir London just asked me via the comments of that blog post how the project is going, so I decided to give a bit of an update.
Essentially, the project quickly deviated from its course once I realized that predicting E3 specificity and experimentally determining ubiquitylation sites in fungal species (without resorting to strain manipulation) were not going to be easy tasks.
Since the goal was to use these data to study the co-evolution of phosphorylation switches (phosphorylation regulating ubiquitylation), it makes little sense to restrict the analysis to one form of post-translational modification (PTM). After a failed attempt to purify ubiquitylated substrates, the goal has been to come up with ways to predict the functional consequences of phosphorylation. We will still need to take ubiquitylation into account, but that will be just one part of the whole picture.
With this goal in mind we have been collecting, for multiple species, data on phosphorylation as well as other forms of PTMs from databases and the literature, and we have been trying to come up with ways to predict the function of these phosphorylation events. These predictions can be broken down mostly into three types:
- phosphorylation regulating domain activity
- phosphorylation regulating domain-domain interactions (globular domain interfaces)
- phosphorylation regulating linear motif interactions (phosphorylation switches in disordered regions)
We have set up a notebook where we will be putting some of the results and ways to access the datasets. Any new experimental data and analysis results will be posted with a significant delay, both to give us some protection against scooping and to try to guarantee that we don't push out things that are obviously wrong. This brings us to a disclaimer: all data and analysis in that notebook is to be considered preliminary and not peer reviewed; it probably contains mistakes and can change quickly.
I am currently collaborating with Raik Gruenberg on this project and we are open to collaborators who bring new skills to it. We are particularly interested in experimentalists working in cell biology and cell signalling who might be interested in testing some of the predictions coming out of this study.
I won't talk much (yet) about the results we have so far, but I will mention some of the tools we are using or planning to use:
- The notebook of the project hosted in openwetware
- The datasets/files are shared via Dropbox
- If need arises code will be shared via Google Code (currently empty)
- Literature will be shared via a Zotero group library
- The papers and other items can be discussed in a Friendfeed group
That will be all for now. I think we are getting interesting results from this analysis of the evolution of the functional consequences of phosphorylation events, but we will update the notebook once we are a bit more confident that we have ruled out most of the potential artifacts. I think the hardest part about exposing ongoing projects is having to explain to potential collaborators that we intend to do so. This still scares people away.
I'll end with a pretty picture: an image of a homology model of the Tup1-Hhf1 interaction. Highlighted are two residues that are predicted by the model to be at the interface and that are phosphorylated in two different fungal species. This exemplifies how the functional consequence of a phosphorylation event can be conserved even when the individual phosphorylation sites (apparently) are not.
Thursday, December 17, 2009
Name that lab ...
In the latest editorial in Nature, the need for an author ID is introduced with the simple notion that each one of us has a specific set of skills:
"In his classic book Management Teams, UK psychologist Meredith Belbin used extensive empirical evidence to argue that effective teams require members who can cover nine key roles. These roles range from the creative 'plants' who generate novel ideas, to the disciplined 'implementers' who turn plans into action and the big-picture 'coordinators' who keep everyone working together."

From this perspective the author ID is a tool that might help us get appropriate credit for skill sets that are currently undervalued. This sort of argument reminds me of a discussion I have had several times in the past about the management structure of academic labs. Why is it that each lab has a single leader who has to handle all sorts of different management tasks? Is it ego? That we all need to have our own lab, named after ourselves?
It does not take long to notice that all supervisors have their strengths and weaknesses, and we talk about this openly. Some are better at grant writing, some have good people skills and keep the lab well balanced, and a few (rare ones :) still know what they are talking about when they help you troubleshoot your method/protocol. If it were possible to have the same person do all of these things, companies would not have come up with their more complicated management structures.
So why is it that we name labs after ourselves and do a poor management job, instead of having multiple PIs handle different aspects of a lab named after what it actually studies?
Friday, August 21, 2009
PLoS Currents - rapid dissemination of knowledge
PLoS recently unveiled an initiative called PLoS Currents, an experiment in rapid dissemination of research built on top of Google Knol. Essentially, a community of people dedicated to a specific topic can use PLoS Currents to describe their ongoing work before it is submitted to a peer-reviewed journal. The initial effort is focused on Influenza research, where the speed of dissemination of information might be crucial.
The content of PLoS Currents: Influenza is not peer reviewed but is moderated by a panel of scientists who will strive to keep the content on topic. There is a FAQ explaining the initiative in more detail. These articles are archived and citable, they can be revised, and they should not be considered peer-reviewed publications. For this reason, PLoS encourages authors to eventually submit these works to a peer-reviewed journal. It remains to be seen how other publishers will react to submissions that are already available in these rapid-dissemination portals.
PLoS Currents vs Nature Precedings
This initiative is somewhat related to preprint archives like Nature Precedings and arXiv. The main differences seem to be a stronger emphasis on community moderators and the use of third-party technology (Google Knol). The community moderators, who I assume are researchers working on Influenza, could be a decisive factor in ensuring that other researchers in the field at least know about the project. Using Google Knol lets PLoS focus on the community and hopefully get technical support from Google to develop new tools as they are needed. However, the website currently looks a little bit like a hack, which is the downside of using third-party technology. For example, you can click the edit button and see options to change the main website, although obviously the permissions do not allow you to save these changes.
I think it is an interesting experiment, and hopefully more bio-related researchers will get comfortable with sharing and discussing ongoing research before publication. I still believe this would reduce wasteful overlaps. As usual, I only fear that the multiplication of these experiments tends to fragment the critical mass required for such a community site to work.
Tuesday, August 11, 2009
Translationally optimal codons do not appear to significantly associate with phosphorylation sites
I recently read an interesting paper about codon bias at structurally important sites that sent me on a small detour from my usual activities. Tong Zhou, Mason Weems and Claus Wilke described how translationally optimal codons are associated with structurally important sites in proteins, such as the protein core (Zhou et al. MBE 2009). This work is a continuation of the work from this same lab on what constrains protein evolution. I have written here before a short review of the literature on the subject. As a reminder, it has been observed that expression level is the strongest constraint on a protein's rate of change, with highly expressed genes coding for proteins that diverge more slowly than lowly expressed ones (Drummond et al. MBE 2006). It is currently believed that selection against translation errors is the main driving force behind this constraint (Drummond et al. PNAS 2005, Drummond et al. Cell 2008). It has also been shown that translation errors are introduced, on average, at a rate of about 1 to 5 per 10,000 codons and that error rates can differ between codons by 4- to 9-fold, influenced by translational properties like the availability of their tRNAs (Kramer et al. RNA 2007).
Given this background, what Zhou and colleagues set out to do was to test whether codons that are associated with highly expressed genes tend to be over-represented at structurally important sites. The idea is that such codons, defined as "optimal codons", are less error-prone and should therefore be preferred at positions where a mistranslation could destabilize the protein. In this work they defined a measure of codon optimality as the odds ratio of codon usage between highly and lowly expressed genes. Without going into detail, they showed, in different ways and for different species, that codon optimality is indeed correlated with the odds of being at a structurally important site.
I decided to test whether I could also see a significant association between codon optimality and sites of post-translational modification. I defined a window of plus or minus 2 amino acids around a phosphorylation site (in S. cerevisiae) as associated with post-translational modification. The rationale is that selection for translational robustness could constrain codon usage near a phosphorylation site when compared with other serine or threonine sites. For simplicity I mostly ignored tyrosine phosphorylation, which in S. cerevisiae is a very small fraction of the total phosphorylation observed to date.
For each codon I calculated its over-representation at these phosphorylation windows compared with similar windows around all other S/T sites, and plotted this value against the log of the codon optimality score calculated by Zhou and colleagues.
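For concreteness, here is a minimal sketch of this kind of per-codon enrichment calculation (my own illustration with made-up toy windows, not the actual analysis code):

```python
import math
from collections import Counter

def codon_log_odds(phospho_windows, other_windows):
    """Log2 odds ratio of each codon's frequency in windows around
    phosphosites versus windows around all other S/T sites.
    Each argument is a list of codon lists (one list per window)."""
    phospho = Counter(c for w in phospho_windows for c in w)
    other = Counter(c for w in other_windows for c in w)
    n_p, n_o = sum(phospho.values()), sum(other.values())
    return {
        codon: math.log2((phospho[codon] / n_p) / (other[codon] / n_o))
        for codon in phospho
        if other[codon] > 0
    }

# Toy example with a handful of codons.
enrich = codon_log_odds(
    phospho_windows=[["TCT", "GCT"], ["TCT", "AAA"]],
    other_windows=[["TCC", "GCT"], ["TCT", "AAA"]],
)
print(enrich)  # positive values: over-represented near phosphosites
```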
Figure 1 - Over-representation of optimal codons at phosphosites

At first impression it would appear that there is a significant correlation between codon optimality and phosphorylation sites. However, as I will try to describe below, this is mostly due to differences in gene expression. Given the relatively small number of phosphorylation sites per protein, it is hard to test this association for each protein independently, as Zhou and colleagues did for structurally important sites. The alternative is therefore to try to take the differences in gene expression into account. I first checked whether phosphorylated proteins tend to be coded by highly expressed genes.

Figure 2 - Distribution of gene expression of phosphorylated proteins

In figure 2 I plot the distribution of gene expression for phosphorylated and non-phosphorylated proteins. There is only a very small difference, with phosphoproteins having a marginally higher median gene expression than other proteins. However, this difference is small and a KS test does not rule out that the two sets are drawn from the same distribution.
The next possible expression-related explanation for the observed correlation would be that highly expressed genes tend to have more phosphorylation sites. Although there is no significant correlation between expression level and the absolute number of phosphorylation sites, I observed that highly expressed proteins tend to be smaller. This means there is a significant positive correlation between the fraction of phosphorylated serine and threonine sites and gene expression.
Figure 3 - Expression level correlates with fraction of phosphorylated ST sites

Unfortunately, I believe this correlation explains the result observed in figure 1. To properly control for it, I recalculated the correlation from figure 1 after randomizing the phosphorylation sites within each phosphoprotein. For comparison, I also randomized the phosphorylation sites keeping the total number of sites fixed but without restricting the number of sites within each specific phosphoprotein.
Figure 4 - Distribution of R-squared for randomized phosphorylation sites

When randomizing the phosphorylation sites within each phosphoprotein, keeping the number of sites in each specific phosphoprotein constant, the average R-squared is higher than that observed with the experimentally determined phosphorylation sites (pink curve). This suggests that the correlation observed in figure 1 is not due to functional constraints acting on the phosphorylation sites but is instead probably due to the correlation observed in figure 3 between expression level and the fraction of phosphorylated S/T residues.
The observed correlation only appears significantly higher than random if we allow the random phosphorylation sites to be drawn from any phosphoprotein, without constraining the number of sites per protein (blue curve). I added this because I thought it was a striking example of how a relatively subtle change in assumptions can change the significance of a score.
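To make the difference between the two null models explicit, here is a minimal sketch (my own, with illustrative data structures, not the code used for the analysis):

```python
import random

def shuffle_within_proteins(sites_by_protein, st_positions_by_protein):
    """Null model 1 (pink curve): for each phosphoprotein, redraw the
    same number of phosphosites from that protein's own S/T positions."""
    return {
        prot: random.sample(st_positions_by_protein[prot], len(sites))
        for prot, sites in sites_by_protein.items()
    }

def shuffle_across_proteins(sites_by_protein, st_positions_by_protein):
    """Null model 2 (blue curve): keep only the total number of
    phosphosites fixed and redraw them from the pooled S/T positions
    of all phosphoproteins."""
    pool = [(prot, pos)
            for prot, positions in st_positions_by_protein.items()
            for pos in positions]
    n_total = sum(len(sites) for sites in sites_by_protein.values())
    return random.sample(pool, n_total)
```

Only the first scheme preserves each protein's number of phosphosites, and with it the confounding link between expression level and the fraction of phosphorylated S/T sites, which is why the two schemes give such different significance estimates.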
I also tested whether conserved phosphorylation sites tend to be coded by optimal codons when compared with non-conserved phosphorylation sites. For each phosphorylation site I summed the codon optimality over a window around the site and compared the distribution of this sum for phosphorylation sites conserved in zero, one or more than one species. Conservation was defined based on an alignment window of +/- 10 amino acids of S. cerevisiae proteins against orthologs in C. albicans, S. pombe, D. melanogaster and H. sapiens.
I observe a higher sum of codon optimality for conserved phosphorylation sites (fig 5A), but this difference is not maintained if the codon optimality score of each peptide is normalized by the expression level of the source protein (fig 5B).
In summary, when gene expression levels are taken into account, there does not appear to be an association between translationally optimal codons and the regions around phosphorylation sites. This is consistent with the weak functional constraints observed in the analysis performed by Landry and colleagues.
Given this background of information what Zhou and colleagues set out to do, was test if codons that are associated with highly expressed genes tend to be over-represented at structurally important sites. The idea being that such codons, defined as "optimal codons" are less error prone and therefore should be avoided at positions that, when miss-translated, could destabilize proteins. In this work they defined a measure of codon optimality as the odds ratio of codon usage between highly and lowly expressed genes. Without going into many details they showed, in different ways and for different species, that indeed, codon optimality is correlated with the odds of being at a structurally important site.
I decided to test whether I could also see a significant association between codon optimality and sites of post-translational modification. I defined a window of plus or minus 2 amino acids surrounding a phosphorylation site (of S. cerevisiae) as associated with post-translational modification. The rationale is that selection for translational robustness could constrain codon usage near a phosphorylation site when compared with other Serine or Threonine sites. For simplicity I mostly ignored tyrosine phosphorylation, which in S. cerevisiae is a very small fraction of the total phosphorylation observed to date.
For each codon I calculated its over-representation at these phosphorylation windows compared to similar windows around all other S/T sites, and plotted this value against the log of the codon optimality score calculated by Zhou and colleagues.
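Roughly, the calculation looks like this (a minimal sketch; the log-odds form, pseudocounts and input format are my own choices):

    # A sketch of the per-codon over-representation score: compare a codon's
    # frequency in +/-2 codon windows around phosphosites with windows around
    # all other S/T sites. Input format and pseudocounts are assumptions.
    from collections import Counter
    import math

    def window_counts(codons, centers, w=2):
        """codons: list of codons for one protein; centers: 0-based site indices."""
        counts = Counter()
        for c in centers:
            counts.update(codons[max(0, c - w):c + w + 1])
        return counts

    def over_representation(codons, phospho_sites, other_st_sites, codon, pseudo=0.5):
        p = window_counts(codons, phospho_sites)
        b = window_counts(codons, other_st_sites)
        p_freq = (p[codon] + pseudo) / (sum(p.values()) + pseudo)
        b_freq = (b[codon] + pseudo) / (sum(b.values()) + pseudo)
        return math.log(p_freq / b_freq)  # > 0: enriched near phosphosites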
Figure 1 - Over-representation of optimal codons at phosphosites
At first impression it would appear that there is a significant correlation between codon optimality and phosphorylation sites. However, as I will try to describe below, this is mostly due to differences in gene expression. Given the relatively small number of phosphorylation sites per protein, it is hard to test this association for each protein independently, as was done by Zhou and colleagues for the structurally important sites. The alternative is therefore to try to take the differences in gene expression into account. I first checked if phosphorylated proteins tend to be coded by highly expressed genes.
Figure 2 - Distribution of gene expression of phosphorylated proteins
In figure 2 I plot the distribution of gene expression for phosphorylated and non-phosphorylated proteins. There is only a very small difference, with phosphoproteins having a marginally higher median gene expression than other proteins. This difference is small, and a KS test does not rule out that the two samples are drawn from the same distribution.
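For reference, the comparison behind figure 2 boils down to something like this (with placeholder data standing in for the measured expression levels):

    # A sketch of the figure 2 comparison; the lognormal draws are placeholders
    # for the real expression measurements.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(1)
    expr_phospho = rng.lognormal(mean=1.05, sigma=1.0, size=2000)
    expr_other = rng.lognormal(mean=1.00, sigma=1.0, size=3000)

    stat, p_value = ks_2samp(expr_phospho, expr_other)
    # A large p_value means we cannot rule out a common underlying distribution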
The next possible expression-related explanation for the observed correlation would be that highly expressed genes tend to have more phosphorylation sites. Although there is no significant correlation between expression level and the absolute number of phosphorylation sites, I observed that highly expressed proteins tend to be smaller in size. This means that there is a significant positive correlation between the fraction of phosphorylated Serine and Threonine sites and gene expression.
Figure 3 - Expression level correlates with fraction of phosphorylated ST sites
Unfortunately, I believe this correlation explains the result observed in figure 1. To properly control for it, I recalculated the correlation in figure 1 after randomizing the phosphorylation sites within each phosphoprotein. As a comparison I also randomized the phosphorylation sites keeping the total number of sites fixed, but without restricting the number of sites within each specific phosphoprotein.
Figure 4 - Distribution of R-squared for randomized phosphorylation sites
When randomizing the phosphorylation sites within each phosphoprotein, keeping the number of sites in each specific phosphoprotein constant, the average R-squared is higher than that observed with the experimentally determined phosphorylation sites (pink curve). This would mean that the correlation observed in figure 1 is not due to functional constraints acting on the phosphorylation sites, but is instead probably due to the correlation observed in figure 3 between expression level and the fraction of phosphorylated S/T residues.
The observed correlation would appear significantly higher than random if we allowed the random phosphorylation sites to be drawn from any phosphoprotein, without constraining the number of sites in each specific protein (blue curve). I added this because I thought it was a striking example of how a relatively subtle change in assumptions can change the significance of a score.
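To make the two schemes explicit, here is a minimal sketch (the data structures are hypothetical): the first function keeps each protein's phosphosite count fixed, the second only fixes the proteome-wide total:

    # A sketch of the two randomization schemes behind figure 4; the data
    # structures are hypothetical.
    import random

    def randomize_within(st_sites, n_psites):
        """Redraw each protein's sites from its own S/T positions, keeping the
        per-protein phosphosite count fixed (pink curve)."""
        return {p: random.sample(st_sites[p], n_psites[p]) for p in n_psites}

    def randomize_across(st_sites, total_psites):
        """Redraw the same total number of sites from the pooled S/T positions
        of all phosphoproteins, with no per-protein constraint (blue curve)."""
        pooled = [(p, s) for p, sites in st_sites.items() for s in sites]
        return random.sample(pooled, total_psites)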
I also tested whether conserved phosphorylation sites tend to be coded by optimal codons when compared with non-conserved phosphorylation sites. For each phosphorylation site I summed the codon optimality over a window around the site and compared the distribution of this sum for phosphorylation sites that are conserved in zero, one, or more than one species. Conservation was defined based on an alignment window of +/- 10 AAs of S. cerevisiae proteins against orthologs in C. albicans, S. pombe, D. melanogaster and H. sapiens.
Figure 5 - Distribution of codon optimality scores versus phospho-site conservation
I observe a higher sum of codon optimality for conserved phosphorylation sites (fig 5A) but this difference is not maintained if the codon optimality score of each peptide is normalized by the expression level of the source protein (fig 5B).
In summary, when gene expression levels are taken into account, there does not appear to be an association between translationally optimal codons and the regions around phosphorylation sites. This is consistent with the weak functional constraints observed in the analysis performed by Landry and colleagues.
Saturday, August 01, 2009
Drug synergies tend to be context specific
A little over a year ago I mentioned a paper published in MSB on how drug combinations could be used to study pathways. Recently, some of the same authors published a study in Nature Biotech analyzing drug combinations in different contexts (i.e. different tissues, different species, different outputs, etc).
The underlying methodology of the study is essentially the same as in the above-mentioned paper. The authors study the effect of combining drugs on specific phenotypes; one example of a phenotype could be the inhibition of growth of a pathogenic strain. Different concentrations of two drugs are combined in a matrix, as described in figure 1a (reproduced below), and the phenotype is measured for each combination. Two drugs are said to be synergistic if the measured impact of the combined drugs on the phenotype is greater than expected from a neutral model.
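As an illustration of what such a neutral model can look like, here is a minimal sketch using Bliss independence; the paper considers more than one reference model, so this should not be read as the authors' exact method:

    # A sketch of scoring a dose matrix against the Bliss independence neutral
    # model. Effects are fractional inhibitions in [0, 1]; all numbers are
    # made up for illustration.
    import numpy as np

    def bliss_excess(effect_a, effect_b, effect_ab):
        expected = effect_a + effect_b - effect_a * effect_b
        return effect_ab - expected  # positive values suggest synergy

    effect_a = np.array([0.0, 0.2, 0.5])        # single-agent effects of drug A
    effect_b = np.array([0.0, 0.1, 0.4])        # single-agent effects of drug B
    observed = np.array([[0.00, 0.10, 0.40],
                         [0.20, 0.35, 0.70],
                         [0.50, 0.65, 0.90]])   # measured combination matrix
    excess = bliss_excess(effect_a[:, None], effect_b[None, :], observed)
    print(excess.sum())  # a crude "volume of synergy" style summary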
The authors ask whether drug synergy is context dependent. This is an important question for combinatorial therapeutics, since we would like treatments that are context dependent (i.e. specific). The most straightforward example would be drug treatments against pathogens: ideally, combinations of drugs would act synergistically against the pathogen but not against the host. Another example would be drug combinations targeting the expression of a particular gene (ex. TNF-alpha) without showing synergy against general cell viability.
In order to test this, the authors performed simulations of E. coli metabolism growing under different conditions along with an astonishing panel of ~94,000 experimental dose matrices covering several different types of therapeutic conditions. In each experiment, two drugs are tested against a control and a test phenotype and the synergy is measured and compared. The results are summarized as the synergy of the two drugs in the test case and the selectivity of this synergy towards the test phenotype. In other words, for each experiment the authors tested if drug pairs that are synergistic for the test phenotype (ex. inhibition of growth of the pathogen) also act in synergy on the control phenotype (ex. inhibition of growth of host cells).
I reproduce above fig 2b with the results from the flux balance simulations of E. coli metabolism. In these simulations "drugs" were implemented as ideal enzyme inhibitors that reduce the flux through their targets. Each cross on the figure represents a "drug" pair targeting two enzymes of E. coli metabolism. The test and control phenotypes are, in this case, fermentation versus aerobic conditions. In this plot the authors show that drug pairs that are synergistic under fermentation tend to have a high selectivity for that condition when compared to aerobic conditions.
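As a rough illustration of this kind of simulation, here is a minimal sketch using cobrapy; the model file, target reactions and dose scheme are my own assumptions, not the authors' setup:

    # A sketch of emulating an ideal enzyme inhibitor in flux balance analysis
    # by scaling the flux bounds of its target reaction. The model path and
    # reaction IDs are illustrative assumptions.
    import numpy as np
    import cobra

    model = cobra.io.read_sbml_model("e_coli_core.xml")  # hypothetical model file

    def inhibit(model, reaction_id, dose):
        """dose in [0, 1]; dose = 1 fully blocks flux through the target."""
        rxn = model.reactions.get_by_id(reaction_id)
        rxn.lower_bound *= (1.0 - dose)
        rxn.upper_bound *= (1.0 - dose)

    doses = np.linspace(0.0, 0.9, 4)
    growth = np.zeros((len(doses), len(doses)))
    for i, da in enumerate(doses):
        for j, db in enumerate(doses):
            with model:  # cobrapy reverts bound changes when the block exits
                inhibit(model, "PFK", da)  # hypothetical target pair
                inhibit(model, "CS", db)
                growth[i, j] = model.optimize().objective_value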
The authors then went on to show that this was also the case for most of the experimental conditions studied. Some of these included cell lines derived from different tissues, highlighting the complexity of drug interactions in multicellular organisms. These results are consistent with the observation that negative genetic interactions are poorly conserved across species (Tischler et al. Nat Genet. 2008, Roguev et al. Science 2008). Although these results are promising for the usefulness of combinatorial therapeutic strategies, they emphasize the degree of divergence of cellular interaction networks across species and perhaps even tissues. I am obviously biased, but I think that fundamental studies of chemogenomics across species will help us better understand the potential of combinatorial therapeutics.
There are several interesting examples of specific drug synergies in this paper, but most of the results are in the supplementary materials. Given that most of the authors are affiliated with a company, I expect that there is little real therapeutic value in the data. Still, it looks like an interesting set for anyone interested in studying drug-gene networks.
Lehár, J., Krueger, A., Avery, W., Heilbut, A., Johansen, L., Price, E., Rickles, R., Short III, G., Staunton, J., Jin, X., Lee, M., Zimmermann, G., & Borisy, A. (2009). Synergistic drug combinations tend to improve therapeutically relevant selectivity Nature Biotechnology, 27 (7), 659-666 DOI: 10.1038/nbt.1549
Friday, June 26, 2009
Reply: On the evolution of protein length and phosphorylation sites
Lars just pointed out in a blog post that the average protein length of a group of proteins is a strong predictor of the average number of phosphorylation sites. Although this is intuitive, it is something I honestly had not fully considered. As Lars mentions, it has potential implications for some of the calculations in our recently published study on the evolution of phosphorylation in yeast species.
One potential concern relates to figure 1a. We found that, although protein phosphorylation appears to diverge quickly, the relative number of phosphosites per protein for different GO groups is highly conserved. Lars suggests that, at least in part, this could be due to relative differences in the average protein size of these groups, which is itself highly conserved across species.
To test this hypothesis more directly I tried to correct for differences in the average protein size of different functional groups by calculating the average number of phosphorylation sites per amino acid, instead of psites per protein. These values were then normalized by the average number of phosphorylation sites per AA in the proteome.
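In sketch form the normalization looks something like this (all names and data structures here are hypothetical placeholders):

    # A sketch of the normalization: phosphosites per amino acid for each GO
    # group, divided by the proteome-wide rate. All inputs are hypothetical.
    def normalized_psites_per_aa(psite_counts, lengths, go_groups):
        """psite_counts, lengths: dicts keyed by protein id;
        go_groups: dict mapping GO group -> list of protein ids."""
        proteome_rate = sum(psite_counts.values()) / sum(lengths.values())
        rates = {}
        for group, proteins in go_groups.items():
            group_rate = (sum(psite_counts[p] for p in proteins) /
                          sum(lengths[p] for p in proteins))
            rates[group] = group_rate / proteome_rate
        return rates
    # Cross-species correlations are then Pearson R over shared GO groups.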

As before, there is still a high cross-species correlation of the average number of psites per amino acid for different GO groups. The correlations are only somewhat smaller than before. The individual correlation coefficients among the three species changed from: S. cerevisiae versus C. albicans – R~0.90 to 0.80; S. cerevisiae versus S. pombe – R~0.91 to 0.84; S. pombe versus C. albicans – R~0.88 to 0.83. It would seem that differences in protein length explain only a small part of the observed correlations. The results in figure 1b are also not qualitatively affected by this normalization, suggesting that the observed differences are not due to potential changes in the average size of proteins. In fact, the number of amino acids per GO group is almost perfectly correlated across species.
Another potential concern relates to the sequence-based prediction of phosphorylation. As explained in the methods, one of the two approaches used to predict if a protein was phosphorylated was to sum over multiple phosphorylation site predictors for the same sequence. Given the correlation shown by Lars, could it be that, at least for one of the methods, we are mostly predicting average protein size ? To test this I normalized the phosphorylation prediction for each S. cerevisiae protein by its length. I re-tested the predictive power of this normalized value using ROC curves, with the known phosphoproteins of S. cerevisiae as positives. The AROC values changed from 0.73 to 0.68. This shows that the phosphorylation propensity is not just predicting protein size although, as expected from Lars' blog post, size alone is actually a decent predictor of phosphorylation (AROC=0.66). The normalized phosphorylation propensity does not correlate with protein size (CC~0.05), suggesting that there might be ways to improve the predictors we used.
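For reference, the test itself boils down to a few AROC calls; here is a minimal sketch using scikit-learn, with toy placeholder data in place of the real scores and annotations:

    # A sketch of the AROC comparison, with toy placeholders in place of the
    # real predictor scores and phosphoprotein annotations.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=500)                    # 1 = known phosphoprotein
    lengths = rng.integers(100, 2000, size=500).astype(float)
    propensity = 0.001 * lengths + rng.normal(0, 0.5, 500)   # toy predictor

    print(roc_auc_score(labels, propensity))            # raw propensity
    print(roc_auc_score(labels, propensity / lengths))  # length-normalized
    print(roc_auc_score(labels, lengths))               # length alone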
Nature or method bias ?
Are larger proteins more likely to be phosphorylated in a cell, or are they just more likely to be detected in a mass-spec experiment ? What we are observing is likely a combination of both effects, but it would be nice to know how much of the correlation is due to potential MS bias. I am open to suggestions for potential tests.
This is also important for what I am planning to work on next. A while ago I noticed that predictions of phosphorylation propensity could also predict ubiquitination and vice-versa. It is possible that they are mostly related through protein size. I will try to look at this in future posts.
Tuesday, June 23, 2009
Comparative analysis of phosphoproteins in yeast species
My first postdoctoral project has just appeared online in PLoS Biology. It is about the evolution of phosphoregulation in yeast species. This analysis follows from previous work I did during my PhD on the evolution of protein-protein interactions after gene duplication (paper / blog post). One of the conclusions of that work was that interactions of lower specificity, such as those mediated by short peptides, would be more prone to change. In fact, one of the protein domains we found associated with high rates of change of protein-protein interactions was the kinase domain.
Given that the substrate specificity of a kinase is usually determined by a few key amino acids surrounding the target phosphosite, it is easy to imagine how kinase-substrate interactions can be created and destroyed with few mutations. It is also well known that these phosphorylation events can have important functional consequences. We therefore postulated that changes in phosphorylation are an important source of phenotypic diversity.
To test this, we collected in vivo phosphorylation sites by mass spectrometry for 3 yeast species (S. cerevisiae, C. albicans and S. pombe). These were compared in order to estimate the rate of change of kinase-substrate interactions. Since changes in gene expression are generally regarded as one of the main sources of phenotypic diversity, we compared these estimates with similar calculations for the rate of change of transcription factor (TF) interactions with promoters. Depending on how we define divergence of phosphorylation, we estimate that kinase-substrate interactions change either at similar rates or at most 2 orders of magnitude slower than TF-promoter interactions.
Although these changes in kinase-substrate interactions appear to be fast, groups of functionally related proteins tend to maintain the same levels of phosphorylation across broad time scales. We could identify a few functional groups and protein complexes with a significant divergence in phosphorylation and we tried to predict the most likely kinases responsible for these changes.
Finally, we compiled recently published genetic interaction data for S. pombe (from Assen Roguev's work) and for S. cerevisiae (from Dorothea Fiedler's work), in addition to some novel genetic data produced for this work. We used this information to study the relative conservation of genetic interactions for protein kinases and transcription factors. We observed that both protein kinases and TFs show a lower than average conservation of genetic interactions.
We think these observations strongly support the initial hypothesis that divergence in kinase-substrate interactions contributes significantly to phenotypic diversity.
Technology opening doors
For me personally, it really feels like I was in the right place at the right time. Many of the experimental methods we used are still under heavy development, but I was lucky to be quite literally next door to the right people. I had the chance to collaborate with Jonathan Trinidad, who works for the UCSF Mass Spectrometry Facility directed by Alma Burlingame. I also arrived at a time when the Krogan lab, more specifically Assen Roguev (twitter feed), had been working to develop genetic interaction assays for S. pombe (Roguev A 2007). As we describe in the introduction, these technological developments allow us to map out the functional and physical interactions of a cell at an incredible rate. What I am hoping is that they will soon be seen in much the same light as genome sequencing. We can and should be using these tools to study groups of species simultaneously, and not just the usual model organisms that diverged from each other more than 1 billion years ago.
Evolution of signalling
Many more protein interactions are determined by short linear peptide motifs (Neduva PLoS Bio 2005). A large fraction of these determine protein post-translational modifications and are crucial for signal transduction systems. For the next couple of years I will continue to study the evolution of signal transduction systems. There are certainly many experimental and computational challenges to address. I am particularly interested in the co-regulation by combinations of post-translational modifications and their co-evolution. I will do my best to share some of that work as it happens here on the blog.
Thursday, June 11, 2009
HFSP fellows meeting (Tokyo 2009)

This year marks the 20th anniversary of the program, which also coincides with a period of change in leadership. Ernst-Ludwig Winnacker, current Secretary General of the European Research Council, will take over the role of Secretary General of the HFSP organization from Torsten Wiesel. Also, Akito Arima will replace Masao Ito as president of HFSPO (press release). Probably because of this, the meeting had plenty of political moments and speeches. Thankfully, most of the people involved in the organization appear to be very lighthearted, so these moments were not a burden.
The curse of specialization ?
A core focus of HFSP is to fund interdisciplinary projects that involve people from different areas or that help researchers significantly change their field of research. There was some time for discussions about the future of the organization as well as the future of "systems biology". For me personally, these debates helped to crystallize many of my own doubts. I am a biochemist but spent 90% of my PhD doing computational work. At this point I feel very much like a jack of all trades and master of none. In my previous work I have mostly hit walls due to lack of data, so I plan to spend the next few years learning a lot more about experimental work. Still, it is hard to be sure of what is best for the future. How much should I sacrifice in productivity to learn new skills ? Is it better to work as a specialist in interdisciplinary teams or to be trained as an interdisciplinary person (Eddy SR, PLoS Comp Bio 2005) ?
The broad scope of HFSP was well reflected in the topics presented at the meeting (PDF of program). There were many interesting talks, like the keynote by Takao Hensch on "How experience shapes the brain", in particular during the very early stages of life. He showed amazing work about "windows of opportunity" in learning and how these can be manipulated genetically or pharmacologically. Still, when I was looking around the poster session I could not help but feel a lack of interest, since most of the topics were outside my previous work experience. This brings me back to the topic of specialization. Isn't it upsetting that we have to specialize so much ? I don't think I can read and enjoy more than a third of a typical issue of Nature. This, for me, is the curse of specialization: it narrows not only your skills but also your interests and curiosity.
Tokyo/Kyoto
Aside from the science, this was my first trip to Japan. I really liked it and hope to come back one day with more time to explore. I loved the temples, gardens, food, colors and all the differences.
Sunday, April 26, 2009
Guestimating PLoS ONE impact factor (Update)
Abhishek Tiwari did some analysis of the number of citations PLoS ONE has received so far, using the Scopus database. We had a small discussion about the numbers on FriendFeed, and I ended up looking at a different set of values, also from Scopus. From these I tried to predict the first Impact Factor for PLoS ONE, which might be out sometime this year.
Before showing the numbers, I will repeat that I think the IF of the journal where a paper is published is a very poor measure of a paper's importance. Although it is probably a good measure of the relative value of a journal (within a given field), we should be striving to pick what we read based on the value of the paper instead of the journal.
The Impact Factors to be published this year are calculated as the total number of citations in 2008 to papers published in 2006 and 2007, divided by the number of citable units (articles and reviews) published in 2006-2007. The data I am looking at is from Scopus, so it varies a bit from ISI's. The variability comes from the decision of what to count as "citable" articles and from the journals covered in Scopus versus ISI.
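The arithmetic itself is trivial; as a sketch, with completely made-up numbers:

    # The impact factor arithmetic, with made-up numbers purely for illustration.
    def impact_factor(citations_2008, citable_items_2006_2007):
        return citations_2008 / citable_items_2006_2007

    print(impact_factor(450, 150))  # -> 3.0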
One problem I found with Scopus data was that, for some journals, the database has multiple entries due to small variations in article titles. For PLoS Biology, PLoS Computational Biology and PLoS Genetics the number of articles published should be less than half of what is reported. This does not appear to be the case for PLoS ONE.
I downloaded the tables of published articles and tried to remove redundancies by looking at the titles and authors. I counted only articles and reviews as citable items, but used all articles published in 2006-2007 to get the number of citations in 2008. I also did the same calculations for the previous year's impact factor, to be able to compare with the data from ISI. The results were comparable but not identical.

In summary, PLoS ONE might get an impact factor of about half of that expected for PLoS Computational Biology. The usual disclaimers apply: I have no idea how complete the Scopus data is or how exactly it relates to ISI.
Update:
The official impact factor for PLoS ONE for 2008 is out and it is ~4.3. I underestimated it by 1.5. It is also amazing how many people search for this online; this post is the number one source of traffic to this blog. If you are reading this and typically sit on panels that decide on new faculty, please stop evaluating people by where they publish. That way, postdocs like me can focus on doing interesting science instead of trying to get into Nature/Cell/Science.
Sunday, March 22, 2009
Thank you Nature
A while ago, Euan Adie from Nature asked for help categorizing comments in PLoS ONE for analysis. A lot of people took some time to read some of the comments, and the final results of this crowdsourcing effort were made available here. They randomly selected two people from the users who contributed, to receive some Nature branded ... stuff. I was one of the two lucky recipients. It took a while, but it arrived today:

Thank you NPG for the kind gifts, next time .. white t-shirt ?! :)
Monday, November 17, 2008
Why do we blog?
Martin Fenner asked some questions of science bloggers on Nature Network that I think are interesting. Plus, the meme is going around my blogging neighbourhood, so I thought I would join in as well:
1. What is your blog about?
It is mostly about science and technology with a particular focus on evolution, bioinformatics and the use of the web in science.
2. What will you never write about?
I will never blog about blog memes like this one. I tend to stay away from religion and politics but never is a very strong word.
3. Have you ever considered leaving science?
Does this mean academic research, research in general or science in general ? In any case, no. I love problem solving and the freedom of academic research. The only thing I dislike about it is not being sure that I can keep doing this for as long as I wish.
4. What would you do instead?
If I could not do research I would probably try to work in scientific publishing. Doing research usually means that we have to focus on a very narrow field. Editors on the other hand are almost forced to broaden their scope and I think I would like this. I would also be interested in the use of new technologies in publishing.
5. What do you think will science blogging be like in 5 years?
Five years is a lot of time at the pace of technological development, but not a long time for cultural change. I could be wrong but, if anything, there will only be a small increase in the adoption of blogging as part of personal and group online presence, alongside the already existing web pages. I wish blogging (and other tools) would be used to further decentralize research agendas from physical location, but I don't think that will happen in 5 years.
6. What is the most extraordinary thing that happened to you because of blogging?
I have gained a lot from blogging. The most concrete example was an invitation to attend SciFoo, but there are many other things that are harder to evaluate. In some ways the benefits are similar to those of attending conferences: you get to interact with other scientists, exchange ideas, and are forced to think through different perspectives.
7. Did you write a blog post or comment you later regretted?
I probably did but I don't remember an example right now.
8. When did you first learn about science blogging?
Like many other bioinformatics bloggers, I started blogging at Nodalpoint, according to the archives in November 2001. I started this blog some two years after that.
9. What do your colleagues at work say about your blogging?
Not much really, I don't think many of them are aware of it. If anything, the responses have been generally positive, but I don't usually find many people interested in knowing more about blogging in science.
Wednesday, November 12, 2008
Open Science - just do it
My blog is 5 years old today, and to celebrate I am trying to actually do some blogging. There are a couple of reasons why I have blogged less in the past months. In part it was due to FriendFeed, and in part because I was trying to finish a project on the evolution of phospho-regulation in yeast species. Nearing the end of a project should actually provide some of the most interesting blogging material, but I had not asked everyone involved for permission to write about ongoing work.
I have to admit that although I have been discussing and evangelizing open science for over two years, I have done very little of it. I have used this blog at times to put up small analyses or mini-reviews, but never to describe ongoing projects. I tried to start a side-project online but over-estimated the amount of "spare cycles" I have for this. So, I have talked it over with my supervisor and I am now free to "risk" as much as I want in trying out Open Science. The first project I will be working on is E3 target prediction and evolution.
Prediction and evolution of E3 ubiquitin ligase targets
As I mentioned above, I have been working in the past months on the evolution of phosphorylation and kinase-substrate interactions in yeast species. I am interested in the evolution of regulatory interactions in general because I believe they are important for the evolution of novel phenotypes. This is why I will be trying to study the evolution of E3-target interactions. To get there, I will first try to develop methods to predict ubiquitination and E3 targets. Since many of the ideas and much of the methodology apply to other post-translational modifications and even localization signals, I will later try to generalize the findings to other types of interactions.
Some of the questions that I will try to address:
- How accurately can we predict E3 substrates ?
- How quickly in evolution do E3-targets change ?
- Is there co-regulation by kinases and E3s on the same targets (and how these evolve) ?
Once I have something substantial I will open a code repository on Google Code.