Tuesday, August 12, 2008

Freebase parallax

Freebase parallax is a new browsing interface for Freebase. It allows the user to drill in and connect sets of objects to other sets of objects within Freebase and draw maps and graphs with the information. This really shows the power of having well structured data available online. Here is a video describing how it works with great examples of data mining:

Sunday, August 10, 2008

Post-publication journals

With the increase in the number of journals and articles being published every year and the possibility of having an even larger set of "gray literature" available online we face the challenge of filtering out those bits of information that are relevant for us.

Let us define as "perceived impact" this subjective measure of importance that some bit of information holds for us as scientists. This information is typically an article but it could be applied later to pre-prints and database entries in general.

Every one of us creates some rules to select what to pay attention to from the constant stream of scientific output. We could picture this sorting process in the form of a triangle with a large base of very specific knowledge that is somewhat important to us and a small amount of more general but highly important content at the top. For the majority of scientists today, these sorting rules are based on journal topic (cell biology, physics, evolution, etc.) and journal impact factor. Below the base we could place the gray literature that today is mostly out of sight and is not peer-reviewed.

With the advent of the web, and in particular the social aspects of this new medium, we should expect something better than evaluating articles by the quality of the journal they were published in. In the words of Eugene Garfield, the inventor of the impact factor:
“In order to shortcut the work of looking up actual (real) citation counts for investigators the journal impact factor is used as a surrogate to estimate the count. I have always warned against this use”. Eugene Garfield (1998)
Scientific publishing is now digital, with every article having a universal digital identifier (DOI). However, as an author I can get (for free) much more information about how people are using the content of this blog than about the articles I have published. Information about the number of downloads and citations in other articles, in scientific blogs or in bookmarking services could help us sort through information in a better way than relying solely on journal editors (impact factors). We should be using the social web to re-sort articles after peer review to reflect our preferences:
How would we build such a personalized sorting system? In the words of the editor-in-chief of Nature:
(…) nobody wants to have to wade through a morass of papers of hugely mixed quality, so how will the more interesting papers in such an archive get noticed as such? Philip Campbell

It is obviously challenging to use some of the metrics mentioned above as signals to rank the importance of individual articles when they are so easy to game. On the other hand, some of them are already useful and working today. I already subscribe to RSS feeds from some users of Connotea who consistently bookmark articles that I find useful. Similarly, through FriendFeed I get recommendations of articles to read from people I trust. So, although I do not have a clear solution for how to build such a system, I think there is a need for it and there are clear ideas to try.
Here is something like a mind-map of what I think would work best, a mixture of the social recommendations of FriendFeed with the pure algorithmic ideas of Google News:


This idea of sorting based on measures of usage is already being tested by the new Frontiers journals. These are a series of open access journals published by an international not-for-profit foundation based in Switzerland. Like PLoS ONE, these journals aim to separate the peer review of quality and scientific soundness from the more subjective evaluation of impact. In practice they do this by publishing research in a tiered system, with articles submitted to a set of specialty journals. The articles are evaluated based on the reading activity of the users and the top 10% advance to the next-tier journal.
So far Frontiers has started with neuroscience specialty journals and a single top-tier journal (Frontiers in Neuroscience), but if this is successful they could easily add other disciplines and a third tier of very general content on top. In order to contribute to the evaluation procedure, readers must fill out their profile. This information is taken into consideration since users' usage metrics will be weighted differently according to their expertise.
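The tiered promotion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Read` record, the `reader_expertise` weight in [0, 1] and the rule of summing weighted reads are all hypothetical, and the source does not say how Frontiers actually computes its rankings.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Read:
    """One reading event (hypothetical record, not a Frontiers data format)."""
    article_id: str
    reader_expertise: float  # assumed weight in [0, 1] taken from the reader profile


def weighted_scores(reads):
    """Sum the expertise weights of all reads for each article."""
    scores = {}
    for read in reads:
        scores[read.article_id] = scores.get(read.article_id, 0.0) + read.reader_expertise
    return scores


def promote_top_fraction(scores, fraction=0.10):
    """Return the article ids in the top `fraction` by score (at least one)."""
    if not scores:
        return []
    n_promoted = max(1, ceil(len(scores) * fraction))
    # Sort by descending score; break ties by article id so the result is stable.
    ranked = sorted(scores, key=lambda article: (-scores[article], article))
    return ranked[:n_promoted]
```

With ten articles in a specialty journal, `promote_top_fraction(weighted_scores(reads))` would return the single article with the highest expertise-weighted readership. A real system would also need some defense against gaming, which this sketch ignores.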

Summary
No single individual wants to go through all the published literature to find the useful information, but together we effectively do this. The challenge is how to evaluate specific articles by a combination of metrics to promote them to wider audiences in a way that is not easy to exploit. Kevin Kelly said recently in a TED talk that "The price of total personalization is total transparency". Would this bother scientists? Let's say that a few science publishers get together with some of these scientific social sites (social networks, bookmarking sites) to mimic the Frontiers model on a larger scale. Users would install a browser plugin that would link their scientific profile and social contacts with their reading activity. The publishers could then use this information to create personal reading hubs for users.

Saturday, August 09, 2008

BioBarCamp wrapup

In the last two days I attended the first BioBarCamp here in the Bay Area, at the Institute for the Future. There is a lot of microblogging coverage of the event in FriendFeed and even some recorded video from Cameron Neylon (click on demand and pick BioBarCamp).

The meeting was fun because of the unstructured nature of the event and also because I got to meet a lot of people I knew only from blogs. Two highlights of the event were the talks by Aubrey de Grey (see notes and also Cameron's video above) and Jon Trowbridge from Google, who talked about this.

There were four parallel discussions going on but I kept mostly to the open science and web tools related talks. There are a couple of ideas that I take away from these discussions that I will mention below, but in general these overlap with what Shirley already mentioned in her post.

Pragmatic steps for Open Science and web tool adoption
Kaitlin Thaney and Cameron Neylon talked about open science and data commons. Cameron in particular is making the case that we need to demand open data the same way we demand open access to science articles. Although publishers will say that they already try to ask for the availability of everything required to reproduce the results, the truth is that this is not really well enforced. Funding agencies should provision funds to make raw results freely available for re-use once an article is accepted for publication.

On the side of web tools for science, Ricardo Vidal (OpenWetWare), Vivek Murthy (Epernicus), Jeremy England and Mark Kaganovich (Labmeeting) discussed user adoption. Adoption rates among scientists tend to be slow and there is a large generational gap. Here again pragmatic steps need to be taken to promote the usage of these tools in science. Some of the current problems include fragmentation of the user base, lack of focus in tool development and too few security restrictions.

These tools should try to focus on solving a few important problems really well. Examples of these problems include finding the person in my network that might have some expertise that I need, better ways to find articles that I find relevant, or better ways to manage my lab notebook and article library. To reduce the fragmentation of the user base it would be great if these websites found a way to share the social graph.

Finally, the question of online privacy was revisited. The idea of having open lab notebooks that anyone can see (as in OpenWetWare) might be a bit too radical and put off users who want to try the tools without the risks associated with exposing their research online. As has been discussed elsewhere, there are advantages in having electronic notebooks (easier to access, share with peers and back up) but very few people will risk having their lab notebooks freely available online. Therefore allowing for privacy should increase usage.

Sunday, July 27, 2008

Some backlash on Open Science

During ISMB, thanks to Shirley Wu (FF announcement), there was an improvised BoF (Birds of a Feather) session on web tools for scientists. Given that the meeting was not really announced we were not expecting a full room. I would say that we had around 20 to 30 people who stayed at least for a while. We talked in general about tools that are useful in science (things like online reference managers, pre-print archives, community wikis, FriendFeed, Second Life) and we also talked a bit about the culture of sharing and open science.

Curiously, the most interesting discussion I had about open science was not at this BoF session but after it. On the following day the subject came up again in a conversation between me and three other people (two PhD students and a PI from a different lab). I will not identify the people because I don't know if they would like that or not. The most striking thing for me about this conversation was the somewhat instinctive negative reaction against open science on the part of the two PhD students. After a long discussion they made a few interesting arguments that I will mention below, but what was strange for me was that this was the first time I had seen someone react instinctively in a negative way against the concepts of open science.

One of the students in particular was arguing that scientists sharing their results online (prior to peer review) is not only silly on their part (the scooping argument) but would also be detrimental to science as a whole. The most concrete argument he offered was that seeing someone "stake a claim" to a research problem might scare other people away from even trying to solve it. I would say that it would be better to have people collaborating on the same research problems instead of the current scenario where a lot of scientists waste years (of their time and resources) working in parallel without even knowing about it. He argued simply that some people might not want to collaborate at all and should be allowed to work in this way. I don't think scientists should be forced to put their work online before peer review; I just happen to think that this would improve collaborations and decrease the current waste of resources.

The second argument against sharing research ideas and results prior to peer review was more consensual. They all mentioned the problem of noise and how it is already difficult to find relevant results in the peer-reviewed literature. They suggested that this problem would get worse if more people were to share their ideas and results online. I fully agree that this is a problem, but it is not related to open science at all. This is a sorting/filtering problem that is already important today with the large increase in journals and published articles. We do need better recommendation and filtering tools, but sharing ideas and results in blogs/wikis/online project management tools is not going to seriously increase the noise since these are all very easily separated from peer-reviewed articles. No one is forced to track shared projects, but if they are available it would make it that much easier to start a collaboration when and if it makes sense to do so. Are open source repositories detrimental to the software industry?

It took around 3 years from when people started discussing the idea of open science and open notebooks for these concepts to get some attention. It is inevitable (and healthy) that as more people are exposed to a meme, more counter-arguments emerge. I guess that a backlash only means that the meme is spreading.




Thursday, July 17, 2008

ISMB 2008


I am leaving soon for Toronto to attend ISMB 2008. I usually stay away from big conferences since at small conferences it is typically easier to have time to talk to everyone. The nice thing about attending a big conference is that it looks like everyone is there. There is no shortage of science bloggers attending and it is going to be nice to get to know the people behind some of the blogs for the first time.

There is a room in FriendFeed where several people attending are gathered, and for those not going it will probably be a good place to check for coverage of the conference. Alternatively, here is a list of bloggers that are attending ISMB or some of the conferences before/after it:

Saturday, July 05, 2008

On the PLoS business model

Declan Butler wrote a news article about PLoS' business model that has generated a lot of discussion. A good summary of blog reactions is available from Bora's blog and there is a long thread of discussions at FriendFeed.

It is hard to read the piece as impartial reporting given its generally negative undertone: it describes PLoS ONE as a database and refers to PLoS ONE and other PLoS journals of lower impact as "bulk, cheap publishing of lower quality papers". I have nothing against the factual content in the news piece. From that perspective it is an interesting report on the PLoS business model. According to the news story PLoS is on track to become economically self-sustainable within two years. We learn that this is possible due to the expansion of PLoS as a publisher to cover a broader range of subjects and different degrees of perceived impact. This is hardly surprising. I wrote a year ago:
"On an author pays model, the most obvious way to limit the cost per paper and still provide a solid evaluation of perceived impact, is to have journals that cover the broad spectrum of perceived impact. In this way, for the publisher, the overall rejection rates decrease, the papers are evaluated and directed to the appropriate "level" of perceived impact."

Most people agree that in principle Open Access publishing would benefit science. Up until now publishers have been reluctant to admit that there is a viable business model based on author fees. Some open access publishers (including BioMedCentral) were already showing that this was a viable business model, but PLoS will be the first to have a viable business model with high impact factor journals within the set of journals they publish.

Two of the most interesting comments on this discussion so far have come from Timo Hannay at Nascent and from Lars Juhl Jensen.
Timo argues that PLoS has failed to show that it is possible to have a business model for a publisher that only has journals of high editorial input (high rejection rates and high perceived impact). He also argues that the existence of PLoS creates a barrier to entry for other science publishers interested in publishing with an open access (OA) model. There is no argument against the first statement: so far I have not seen any publisher that has managed to reduce the costs of maintaining such "high impact" journals to the point where author fees would be sufficient. I think this is possible and the PLoS Community journals are the closest form of this, but this is another discussion.
Where I disagree with Timo is on the idea that PLoS somehow creates barriers to entry for other OA publishers. PLoS did require (and still requires) philanthropic grants to establish itself, but pioneers typically have a harder time than creative followers. Anyone trying to follow PLoS has access to the records of successes and failures, detailed financial reports and (I think) even the publishing infrastructure that they have developed.

Most people know that the strongest barrier to entry in scientific publishing is the perception of quality. NPG has used this fact to their advantage many times. Journals with the Nature brand typically establish themselves quickly among the top of their topic. I am sure Nature invests a lot in excellent professional editors, but without the Nature brand supporting these journals there would be nothing to choose from to start with. NPG also publishes many more journals than the Nature-branded ones and, as Lars has pointed out, the majority of these have lower impact factors. I don't think there is financial information available, so it is hard to know what fraction of NPG's income comes from the high impact or the lower impact journals.

Going back to one of Timo's main points, I don't agree that PLoS creates barriers to market entry for other OA publishers, and certainly not because they used philanthropic grants until they reached the break-even point. If there are barriers in the market they are due to perception of quality and strong brand names. Here OA publishers have the added advantage that creating a strong brand is easier when most people perceive OA as something good. From the example of PLoS and to some extent BMC there are now clear paths for any publisher (especially one with a strong brand name) to set up a viable OA business model.

Tuesday, July 01, 2008

Bioinformatics around the globe

Did you ever want to have a global impression of the field of bioinformatics? What types of tools are used, or how different is the work in academia versus industry? Michael Barton from Bioinformatics Zen created a survey that will be running for the next month (until the 1st of August) that tries to address some of these questions. The more people complete the survey, the more informative the picture will be. The survey is anonymous and all information will be made available for those interested in analyzing it.
If you have a blog you can re-post it there (see instructions here) or send a link to any of these blog pages that host the survey to other bioinformatics/computational biology researchers.

Saturday, June 28, 2008

Capturing biology one model at a time

Mathematical and computational modeling is (I hope) a well accepted requirement in biology. These tools allow us to formalize and study systems of higher complexity that are hard to conceptualize with logic thinking. There have been great advances in our capacity to model different biological systems, from single components to cellular functions and tissues. Many of these efforts have been ongoing separately, each one dealing with a particular layer of abstraction (atoms, interactions, cells, etc) and some of them are now reaching a level of accuracy that rivals some experimental methods. I will try to summarize, in a series of blog posts, the main advances behind some of these models and examples of integration between them with particular emphasis on proteins and cellular networks. I invite others to post about models in their areas of interest to be collected for a review.

From sequence to fold
RNA and proteins, once produced, adopt structures that have different functional roles. In principle all the information required to determine the structure is in the DNA sequence that encodes the RNA/protein. Although there has been some success in the prediction of RNA structure from sequence, ab-initio protein folding remains a difficult challenge (see review by R. Das and D. Baker). A more pragmatic approach has been to use the increasing structural and sequence data made available in public databases to develop sequence-based models for protein domains. In this way, for well studied protein folds it is possible to ask the reverse question: what sequences are likely to fold this way?
(To be expanded in a future post, volunteers welcome)

Protein binding models

I am particularly interested in how proteins interact with other components (mainly other proteins and DNA) and in trying to model these interactions from sequence to function. I will leave protein-compound interactions and metabolic networks for more knowledgeable people.
As mentioned above, even without a complete ab-initio folding model it is possible to predict the structure of some sequences, or to determine to what protein/domain family a sequence belongs, from comparative genomics analysis. This by itself might not be very informative from a cellular perspective. We need to know how cellular components interact and how these interconnected components create useful functions in a cell.

Docking
Trying to understand and predict how two proteins interact in a complex has been the challenge of structural computational biology for more than two decades. The initial attempt to understand protein interactions from computational analysis of structural data (what is known today as docking) was published by Wodak and Janin in 1978. In this seminal study, the authors established a computational procedure to reconstitute a protein complex from simplified models of the two interacting proteins. In the decades that have followed, the complexity and accuracy of docking methods have steadily increased, but the methods still face difficult hurdles (see reviews Bonvin et al. 2006, Gray, 2006). Docking methods start from the knowledge that two proteins interact and aim at predicting the most likely binding interfaces and conformation of these proteins in a 3D model of the complex. Ultimately, docking approaches might one day also predict new interactions for a protein by exhaustively docking all other proteins in the proteome of the species, but at the moment this is still not feasible.

Interaction types
It should still be possible to use the 3D structures of protein complexes to understand at least particular interaction types. Aloy and Russell have shown that it is possible to transfer structural information on protein-protein interactions by homology to other proteins with similar sequences (Aloy and Russell 2002). In this approach the homologous proteins are aligned to the sequences of the proteins in the 3D complex structure. Mutations in the homologous sequences are evaluated with an empirical potential to determine the likelihood of binding. A similar approach was described soon after by Lu and colleagues and both have been applied in large scale genomic studies (Aloy and Russell 2003; Lu et al. 2003). Like any other functional annotation by homology, this method is limited by how much the target proteins have diverged from the templates. Aloy and Russell estimated that interaction modeling is reliable above 30% sequence identity (Aloy et al. 2003). Substitutions can also be evaluated with more sophisticated energy potentials after a homology model of the interface under study is created. Examples of tools that can be used to evaluate the impact of mutations on binding propensity include Rosetta and FoldX.
Although the methods described above were mostly developed for domain-domain protein interactions, similar approaches have been developed for protein-peptide interactions (see for example McLaughlin et al. 2006) and protein-DNA interactions (see for example Kaplan et al. 2005).
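As a small illustration of the homology-transfer idea, the sketch below checks whether two target proteins are each similar enough to the two chains of a template complex. It is a minimal sketch under assumptions: the sequences are taken to be already aligned (with `-` as the gap character), identity is computed over aligned non-gap columns (other definitions exist), and the function names are hypothetical. It applies only the 30% identity cutoff mentioned above and does not implement the empirical potentials of the cited methods.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip gapped columns
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def can_transfer_interaction(target_a, template_a, target_b, template_b,
                             min_identity=30.0):
    """True if both targets pass the identity cutoff against their template chains."""
    return (percent_identity(target_a, template_a) >= min_identity
            and percent_identity(target_b, template_b) >= min_identity)
```

Requiring both partners to pass the cutoff is a design choice of this sketch: the interface involves both chains, so divergence in either one could disrupt binding.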

In summary, the accumulation of protein-protein and protein-DNA interaction information, along with structures of complexes and the ever-increasing coverage of sequence space, allows us to develop models that describe binding for some domain families. In a future blog post I will try to review the different domain families that are well covered by these binding models.

Previous mini-reviews
Protein sequence evolution

Thursday, June 12, 2008

@World

(caution, fiction ahead)


I wake up in the middle of the night startled by some noise. Pulse racing I try to focus my attention outwards. Something breaking, glass shattering? Is someone out there? I reach out with my senses but an awkward feeling nags at me, bubbling up to my consciousness. I try hard to focus, it is coming from outside the room, someone is inside my house. I close my eyes but vertigo takes over and weightlessness empowers me. I am in the living room cleaning the floor, picking up a broken glass. The nagging feeling finally assaults me fully. I am moving but I am not in control. Panic rises quickly as I watch helpless the simple and quiet actions of someone else. I stop picking up glass and I feel curious, only it is not exactly me, the feeling is there beside me.
- Hi, who are you ?
The voice catches me by surprise and my fear goes beyond rational control. All I can think of is to escape, to go away from here. For a second time I find myself floating as if searching for a way out. When I open my eyes again I am by the beach and I breathe a sigh of relief. The constant sound of the waves calms me down for a few seconds until my eyes start drifting to the side. No, stay there I am in control! I look into the eyes of a total stranger that smiles back at me in recognition. Two voices ask me if I am enjoying the view and I can only scream back in confusion.

I wake up in the middle of the night startled by some noise. I immediately flex my hands in front of my eyes to make sure it was nothing but a nightmare trying hard to calm down. What a dream. I get up and check on the noise coming from the living room realizing that it was just the storm outside. Feeling better I fire up my laptop and grab a glass of water from the kitchen. I open twitter and type away:
- I had the strangest dream !(cursor blinking) Our senses were all connected(enter)
I get up to open the window drinking another sip of water. After a couple of steps I feel a jabbing headache forcing me to stop and bright spots of light blur my vision. I close my eyes in pain and the voices of some unseen crowd thunder in my ears:
- I had the same dream - they all say in unison
The sound of glass shattering on the floor is the last thing I remember before collapsing.

I wake up in the middle of the night startled by some noise (...)

(Twistori was the main motivation for this post)

Previous fiction:
The Fortune Cookie Genome

Tuesday, June 10, 2008

Why does FriendFeed work ?

I have been using FriendFeed for a while and I have to say that it works surprisingly well. It is hard to define what FriendFeed is so the only real way of understanding it is to try it for a while.

One common way to define FF would be as a life-stream aggregator. Each user defines a set of feeds (blog, Flickr, Twitter, bookmarks, comments, etc.) providing all other users with a single view of all the online activities of that user. Anyone can select how much to share (even nothing at all) and subscribe to a number of other users. Each item (photo, blog post, bookmark) can then serve as a spark for discussions. Users can mark items as interesting or comment on them, and this propagates to everyone who subscribes to them. In addition, you can hide sources if for some reason there is a particular part of a user's activities you don't enjoy. All of this creates a very personalized view of whoever you elect to interact with online.

I still find it striking that there are so many long threads of discussions around items that we share in FriendFeed, sometimes more than in the original site. A couple of examples:
Google code as a science repository (discussion in FF, blog post)
Into the Wonderful (discussion in FF, slideshare site)
Bursty work (discussion in FF, blog post)

Why does it work so well ? One possible reason could be that a group of early adopter scientists happened to get together around this website creating the required critical mass to start the discussions. Still, most of those commenting were already participating on blogs so that might not be it. There might be something about the interface, maybe it is the ease of adding comments and that these comments can be edited that increases the participation. Ongoing discussions get bumped higher in the view so every new comment brings the item back to your attention. In this way you know who saw the item and who is thinking about it. A bit like talking about a movie you saw or a book you read with a bunch of friends.

Anyone interested in the science aspects of it should check out the Life Scientists room with currently around 85 subscribers. Here is an introduction to some of these people, in particular on what they work on. Connecting to other scientists in this way lets you see what are the articles they find interesting and discuss current scientific news. Even maybe start a couple of side-projects for the fun of it.

Monday, June 09, 2008

Evaluation metrics and Pubmed Faceoff

I have been reading recently a lot about evaluation metrics for papers and authors. It started with a blog post in Action Potential (Nature Neuroscience's blog) showing a correlation between the number of downloads of a paper and its citations. From the comments in that blog post I found out about a forum in Nature Network about Citation in Science and also the recently published group of perspectives on "The use and misuse of bibliometric indices in evaluating scholarly performance".

It could have been a coincidence but Pierre sparked a long discussion in FriendFeed when he suggested it would be nice to be able to sort Pubmed queries by the impact factor of the journal. In reaction to this Euan set up a very creative interface to Pubmed that he named Pubmed Faceoff. He took several different factors into account (citations from Scopus, eigenfactor of the journal, the time the paper was published) and for each paper returned from a Pubmed query the interface creates a face that describes the paper. The idea for the visualization is based on Chernoff Faces. It is really a creative idea and I wish Pubmed could spend more resources on coming up with alternative interfaces like this, something like a "labs" section where they could play with ideas or allow others to create interfaces that they would host.

I won't go into the whole debate about evaluation metrics here since there is already a lot of discussion going on in some of the links I mentioned.

Wednesday, May 14, 2008

Prediction of phospho-proteins from sequence

I want to be able to predict what proteins in a proteome are more likely to be regulated by phosphorylation and hopefully use mostly sequence information. This post is a quick note to show what I have tried and maybe get some feedback from people that might have tried this before.

The most straightforward way to predict the phospho-proteins is to use existing phospho-site predictors in some way. I have used the GPS 2.0 predictor on the S. cerevisiae proteome with the medium cutoff, including only serine/threonine kinases. The fraction of tyrosine phosphosites in S. cerevisiae is very low, so I decided not to try to predict tyrosine phosphorylation for now.

This produces a ranked list of 4E6 putative phosphosites for the roughly 6000 proteins, scored according to the predictor (each site is scored for multiple kinases). My question is how to best make use of these predictions if I mostly want to know which proteins are phosphorylated and not the exact sites. Using a set of known phosphorylated proteins in S. cerevisiae (mostly taken from expasy) I computed different final scores as a function of all the phospho-site scores of a protein:
1) the sum
2) the highest value
3) the average
4) the sum of putative phosphosite scores above a threshold (4, 6 or 10)
5) the sum of putative phosphosite scores that were above a score threshold and outside ordered protein segments, as defined by a secondary structure predictor
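The five scoring schemes listed above can be written down compactly. The following is a minimal sketch, not the script used for the analysis: the function name, the dictionary keys and the `in_ordered` flags (one boolean per site, True when the site falls inside a predicted ordered segment) are hypothetical, and "above a threshold" is read here as strictly greater than.

```python
from statistics import mean


def protein_scores(site_scores, in_ordered=None, threshold=4.0):
    """Collapse the per-site predictor scores of one protein into protein-level scores.

    site_scores: predictor scores, one per putative site (and kinase).
    in_ordered:  optional booleans, True if the site lies in a predicted ordered segment.
    """
    if not site_scores:
        return {"sum": 0.0, "max": 0.0, "mean": 0.0, "sum_above_threshold": 0.0}
    result = {
        "sum": sum(site_scores),                                   # scheme 1
        "max": max(site_scores),                                   # scheme 2
        "mean": mean(site_scores),                                 # scheme 3
        "sum_above_threshold": sum(s for s in site_scores
                                   if s > threshold),              # scheme 4
    }
    if in_ordered is not None:
        # Scheme 5: keep only high-scoring sites outside ordered segments.
        result["sum_disordered_above_threshold"] = sum(
            s for s, ordered in zip(site_scores, in_ordered)
            if not ordered and s > threshold
        )
    return result
```

For example, `protein_scores([2.0, 5.0, 11.0], in_ordered=[False, True, False])` gives a sum of 18.0, a maximum of 11.0, a mean of 6.0, a thresholded sum of 16.0 and a disordered thresholded sum of 11.0.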

The results are summarized with the area under the ROC curve (known phosphoproteins were considered positives and all other negatives) :


In summary, the sum of all phospho-site scores is the best way that I found so far to predict what proteins are phospho-regulated. My interpretation is that phospho-regulated proteins tend to be multi-phosphorylated and/or regulated by multiple kinases so the maximum site score will not work as well as the sum. As a side note, although there are abundance biases in mass-spec data (the source of most of the phospho-data) protein abundance is a very poor predictor of phospho-regulation (AROC=0.55).
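For readers who want to reproduce this kind of comparison, the area under the ROC curve can be computed directly from protein-level scores and labels. This is a generic sketch that uses the pairwise (Mann-Whitney) definition of the AUC, with ties counted as half; the function name is hypothetical and it is not necessarily how the values above were obtained.

```python
def auroc(scores, labels):
    """Area under the ROC curve.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, with ties counted as one half.
    """
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Here `scores` would hold one value per protein (for instance the sum of site scores) and `labels` would mark the known phosphoproteins as positives. The double loop is quadratic, which is still manageable for a proteome of about 6000 proteins; a rank-based version would be faster.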

Restricting the sum to putative sites outside predicted secondary-structure segments did not improve the predictions as I would have expected, but I should try a few disorder predictors.

Ideas for improvements are welcome, in particular sequence-based methods. I would also like to avoid comparative genomics for now.

Wednesday, May 07, 2008

Drug-drug interactions and network connectivity

How does the effect of drug-drug combinations relate to the cellular interactions of their targets? Last year, Joseph Lehár and colleagues published a paper in MSB looking into this question.

One way to study the effect of drug combinations on the growth of a bacterium, for example, is to measure the inhibition of growth for all possible combinations of serially diluted doses of two drugs and to plot dose-matrices like the ones shown in figure 1 of the paper (shown here, adapted from the paper). In fig1A the authors show how the combined effect of increasing doses of two drugs inhibits the growth of a methicillin-resistant Staphylococcus aureus strain. Light colors correspond to strong inhibition of growth. One observation from this figure is that the two drugs can inhibit the growth of this strain in an additive fashion. The question the authors tried to address in this paper is how much this sort of dose-matrix informs us about the possible interactions of the targets. The drugs could be interacting with the same target, different targets in the same pathway/complex, targets in different pathways both required for growth, etc.
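To make the idea of a dose-matrix concrete, here is a small Python sketch that builds a serial dilution series and fills in the inhibition expected for every dose pair if the two drugs acted independently (Bliss independence, where the combined inhibition is Ia + Ib - Ia*Ib). The Hill-curve dose response, the IC50 values and all function names are my own assumptions for illustration, and I am not claiming that Bliss independence is one of the reference models used in the paper.

```python
def serial_dilution(top_dose, factor, steps):
    """Zero dose followed by `steps` doses, each `factor`-fold apart, up to top_dose."""
    doses = [top_dose / factor**i for i in range(steps)]
    return [0.0] + sorted(doses)

def hill_inhibition(dose, ic50, slope=1.0):
    """Fractional inhibition (0 to 1) of a single drug under a Hill curve (assumed shape)."""
    if dose <= 0:
        return 0.0
    return dose**slope / (dose**slope + ic50**slope)

def bliss_matrix(doses_a, doses_b, ic50_a, ic50_b):
    """Expected inhibition for every dose pair if the two drugs act independently."""
    matrix = []
    for da in doses_a:
        ia = hill_inhibition(da, ic50_a)
        row = []
        for db in doses_b:
            ib = hill_inhibition(db, ic50_b)
            row.append(ia + ib - ia * ib)  # Bliss independence
        matrix.append(row)
    return matrix

# Hypothetical doses and IC50 values, in arbitrary units
doses = serial_dilution(top_dose=8.0, factor=2.0, steps=4)
expected = bliss_matrix(doses, doses, ic50_a=2.0, ic50_b=2.0)
```

A measured dose-matrix that shows more inhibition than such a reference would suggest synergy between the two drugs, and less inhibition would suggest antagonism.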

In order to study this they first simulated an abstract metabolic network (using ODEs, see model file in Sup) with two different pathways required for growth, with branched and linear blocks and one negative feedback (see Fig3 in the paper). They simulated the effect of increasing drug doses in their models by decreasing the enzyme activities of the simulated targets. For each possible drug-drug combination they then calculated the predicted dose-matrix effect on growth (pathway output). They observed that by fitting the obtained dose-matrices to 4 types of classical dose-matrix models (described in Fig2) they could predict where in this network the two targets were more likely to be.
As an example, two sequential targets in an unbranched section of the network embedded in a negative feedback produce a dose-matrix that best fits a potentiation model (shown here, adapted from Fig3).

Having established by simulations that there is information in the dose-matrices that relates to the interaction of their targets, they then tested the effect of 10 known antifungal drugs on the sterol pathway (also well established) of Candida glabrata. For each drug-drug combination they tried to fit the experimental dose matrices to the same 4 models and compared the best model fit to the one expected from the position of the targets in the sterol pathway. For many cases (72%) the best model fit was the same as predicted from the sterol pathway model but only 54% of the best-fit models were unambiguous. There were some cases where drug-with-itself dose matrices (positive control) did not appear additive as expected. The authors mention that this is due to the "instability in the measured potency of a drug" but I am not sure why a drug-with-itself matrix would not be reproducible.

Finally, the authors further tested this relation between drug combinations and target interactions by experimentally measuring drug dose-matrices for 94 drugs/compounds in human (HCT116) tumor cells (see text for details).

In summary, even if the prediction accuracy is far from perfect, this work shows that it should be possible to either:
1 - use known pathway models plus drug dose-matrices to improve prediction of the most likely targets of the drugs
2 - use known drug-target relationships plus the drug dose-matrices to predict the network connectivity

One obvious complication is that a single compound can have multiple targets, which would reduce the usefulness of the predictions. Some interesting extensions could be to test drug-drug interactions in KO strains or in combinations with RNAi knock-downs or protein over-expressions.

Thursday, April 24, 2008

SciFoo and BioBarCamp

(Via Attila) The invitations for the 3rd SciFoo have apparently been sent. It will be held from the 8th to the 10th of August at the Googleplex. There is also an idea floating around to organize a BarCamp at the same time as SciFoo. Check out the BioBarCamp wiki and discussion group. There are already several suggestions for venues to organize it and several people interested in attending.

On a side note it's fun to see something like this getting thought of and set up from Twitter/FriendFeed conversations. I have been trying out FriendFeed for a while now and although I am not a big fan of micro blogging (yet?) I really like the conversations around the feed streams.

Wednesday, April 16, 2008

The shuffle project


Most of my work in the last few years was computational, either looking at the evolution of protein-protein interactions or at the prediction of domain-peptide interactions. The nice thing about working in a lab where a lot of people were doing wet lab experiments was that I had the opportunity to, once in a while, grab some pipettes and participate in some of the work that was going on. One project that worked out well was published today (not open access, sorry). My contribution to this project was small but it was a lot of fun and I am very interested in the topic that we worked on. We called it the shuffle project in the lab.

The main objective of this work was to study how the addition of gene regulatory interactions impacts a cell's fitness. We introduced different combinations of existing E. coli promoters and transcription/sigma factors either as plasmids or integrated in the genome. In effect, each construct mimics a duplication of one of E. coli's sigma factors or transcription factors with a change in its promoter. We then tested the impact on fitness by measuring growth curves under different conditions or performing competition assays.

There were a couple of interesting findings but the two that I found most interesting were:
- The vast majority of the constructs had no measurable impact on growth even by testing different experimental conditions.
- A few constructs could out-compete the control in competition assays (stationary phase survival or passaging experiments in rich medium).

Both of these suggest that the gene regulatory network of E. coli is very tolerant to the addition of novel regulatory interactions. This is important because it tells us that regulatory networks are free to explore new interactions given that there is a limited impact on fitness. From this we could also argue that if there are many equivalent (nearly neutral) ways of regulating gene expression we can't expect to see individual gene regulatory interactions conserved across different species. There are several recent studies, particularly in eukaryotic species, showing that there is in fact a fast divergence of transcription factor binding sites (see recent review by Brian B. Tuch and colleagues) and many other examples showing that although the selectable phenotype is found to be conserved, the underlying interactions or regulations have diverged in different species (see Tsong et al. and Lars Juhl Jensen et al.).

There are a couple of questions that come from these and other related works. What fraction of cellular interactions is simply biologically irrelevant? Is it possible to predict to what degree purifying selection restricts changes at different levels of cellular organization? What is the extent of change in protein-protein interactions?

Having previously worked on the evolution of protein-protein interactions this is the direction that most interests me. This is why I am currently looking at the evolution of phospho-regulation and signaling in eukaryotic species.

Monday, April 14, 2008

Life Sciences Virtual Conference and Expo

IBM Deep Computing will hold a 2 day virtual conference on Innovations in Drug Discovery and Development (16th and 17th of April 2008). The talks will be recorded and available for playback for those that register. The focus of the talks will be on the impact of High Performance Computing for life science research. The current list of talks:
  • Dr. Paul Matsudaira, Director Whitehead Institute Professor of Biology and Bioengineering, MIT : Advanced Imaging and Informatics Methods for Complex Life Sciences Problems
  • Professor Jan-Eric Litton, Director of Informatics, Karolinska Institute - Biobanking : The Challenge of Infrastructure for Large Scale Population Studies
  • Dr. Joel Saltz, Professor and Chair, Department of BioMedical Informatics, Ohio State University : The Cancer Biomedical Informatics Grid (caBIG™)
  • Professor Peter J. Hunter, University of Auckland, Bioengineering Institute : Innovation in biological system simulations
  • Dr. Ajay Royyuru, IBM Research, Computation Biology at IBM : Update on the IBM Genealogy Project co-sponsored with National Geographic
  • Dr. Michael Hehenberger, Solutions Executive, Global Life Sciences : IT Architectures and Solutions for Imaging Biomarkers

Tuesday, April 08, 2008

Structure based prediction of SH2 targets

One of the last few things I worked on during the PhD is now available in PLoS Comp Bio. It is about the structure based prediction of binding of SH2 domains to phospho-peptide targets.

The SH2 domain (src homology domain 2) is a small domain of around 100 amino acids that has a strong preference for binding peptides that have phosphorylated tyrosines. The selectivity of each domain is typically further restricted by variable surfaces near the phospho-tyrosine binding pocket. See figure below:

The binding preference of each domain can be experimentally determined using for example peptide library screening, phage display or protein arrays. Alternatively we should be able to analyze the increasing amount of structural information and predict the binding specificity of peptide binding domains.
We tried to show here that given a structure of an SH2 domain in complex with a peptide it is possible to predict the binding specificity of this domain. It is also possible, to some extent, to predict how mutations in these domains might affect their binding preferences. Finally, combining predictions of specificity with known human phospho sites allows for very reasonable predictions of in vivo SH2-target interactions.

The obvious limitation here is that we need to start with a structure of the domain, but we know from some unpublished work that for families with good structural coverage, homology models can produce specificity predictions that are as accurate as those from x-ray structures. The other limitation is that, given the lack of dynamics, only a single conformation of the interaction is modeled, even though the dynamics should in part help determine the binding specificity. One possible solution to this problem that we have used with some success is to model different peptide conformations for each binding domain.

I should make clear that although I think there is an improvement over previous works there is already a considerable amount of research on this topic that we tried to cite in the introduction and discussion. I would say that some of the best previous work on structure based predictions of domain-peptide interactions has come from Wei Wang's lab (see for example McLaughlin et al. or Hou et al.).

This manuscript was the first (and only so far) I collaborated on with Google Docs. It worked well and I recommend it to anyone that needs to co-write a manuscript with other people. It saves a lot of emails and annotations on top of annotations.

Bio::Blogs#20 - the very late edition

I said I would organize the 20th edition of Bio::Blogs here on the 1st of April but April fools and my current work load did not allow me to get Bio::Blogs up on time.

There were a couple of interesting discussions and blog posts in March worth noting. For example, Neil mentioned a post by Jennifer Rohn that initiated what could be one of the longest threads in Nature Network: "In which I utterly fail to conceptualize". It started off as a small anti-Excel rant but turned in the comments to 1st) a discussion of bioinformatic tools to use, 2nd) a discussion of wet versus dry mindset and how much one should devote to learning the other. Finally it ended up as an exchange about collaborations and how a social networking site like Nature Network could/should help scientists find collaborators. There was even a group started by Bob O'Hara to discuss this last issue further.

I commented on the thread already but can try to expand a bit on it here. Nature Network is positioned as a social networking site for scientists. So far the best that it has to offer has been the blog posts and forum discussions. This is not very different from a "typical" forum. It facilitates the exchange of ideas around scientific topics but NN could try to look at all the typical needs of scientists (lab books, grant managing, lab managing, collaborations, protocols, paper recommendations, etc) and decide on a couple that they could work into the social network site. Ways to search for collaborators and maybe paper recommendation engines that take advantage of your network (network+connotea) are the most obvious and the easiest to implement. Thinking long term, tools to help manage the lab could be an interesting addition.

Another interesting discussion started from a post by Cameron Neylon on a data model for electronic lab notebooks (part I, II, III). Read also Neil's post, and Gibson's reply to Cameron on FuGE.
How much of the day to day activities and results need to be structured? How heavy should this structure be to capture enough useful computer readable information? Although I find these questions and discussion interesting, I would guess that we are far from having this applied to any great extent. If most people are reluctant to try out new applications they will be even less willing to convey their day to day practices via a structured data model. I mentioned recently the experiment under way at the FEBS Letters journal to create structured abstracts during the publishing process. As part of the announcement the editors commissioned reviews on the topic. It is worth reading the review by Florian Leitner and Alfonso Valencia on computational annotation methods. They argue for the creation of semi-automated tools that take advantage of the automatic methods and the curators (authors or others). The problems and solutions for annotation of scientific papers are shared with digital lab notebooks. I hope that more interest in this problem will lead to easy to use tools that suggest annotations for users under some controlled vocabularies.

Several people blogged about the 15 year old bug found in the BLOSUM matrices and the uncertainty in multiple sequence alignments. See posts by Neil, Kay, Lars and Mailund.
Both cases remind us of the importance of using tools critically. The flip side of this is that it is impossible to constantly question every single tool we use since this would slow our work down to a crawl.

On the topic of Open Science, in March the Open Science proposal drafted by Shirley Wu and Cameron Neylon for the Pacific Symposium on Biocomputing was accepted. It was accepted as a 3 hour workshop consisting of invited talks, demos and discussions. The call for participation is here along with the important deadlines for submissions (talk proposals due June 1st and poster abstracts due the 12th of September).

On a related note, Michael Barton has set up a research stream (explained here). He is collecting updates on his work, tagged papers and graphs posted to Flickr into one feed that gives an immediate impression of what he is working on at the present time. This is really a great setup. Even for private use within a lab or across labs for collaboration this would give everyone involved the capacity to tap into the interesting feeds. I would probably not like to have everyone's feeds, and maybe a supervisor should have access to some filtered set of feeds or tags to get only the important updates, but this looks like a step in the right direction. In the same way, machines could also have research feeds that I could subscribe to in order to get updates on some data source.

Also in March, Deepak suggested we need more LEAP (Lightly Engineered Application Products) in science. He suggests that it is better to have a tool that does one job very well than one that does many somewhat well. I guess we have a few examples of this in science. Some of the most cited papers of all time are very well known cases of a tool that does one job well (ex: BLAST).


Finally, some meta-news on Bio::Blogs. I am currently way behind on many work commitments and I don't think I can keep up the (light) editorial work required for Bio::Blogs so I am considering stopping Bio::Blogs altogether. It has been almost two years and it has been fun and hopefully useful. The initial goal of trying to knit together the bioinformatics-related blogs and offering some form of highlighting service is still required but I am not sure this is the best way going forward.
Still, if anyone wants to take over from here let me know by email (bioblogs at gmail.com).

Tuesday, April 01, 2008

(April fools update) Leveling the playing field – NIH to ban brain enhancing practices

Update - This post was part of an April 1st news but I am sure everyone got it :). Still the pressure in science is real and worth thinking about.

There has been quite a buildup of discussion surrounding the idea of brain enhancing drugs in the last couple of days. It started in early March with a New York Times piece “Brain Enhancement Is Wrong, Right?” and it has culminated with the recent announcement of the World Anti Brain Doping Authority (WABDA), a joint effort from the NIH and EU to initiate studies on the reach of brain enhancing practices in science today.
There are many points of view already expressed on the web, see for example:
·Chris Patil
·Bora
·Anna Kushnir
·Genome Technology
·Egghead
·Eye on DNA
·Bob Ohara
·Martin Fenner
·Jennomics

My first reaction was of pure skepticism. This must be some kind of joke, I thought, so I tried to probe a little bit around the UCSF campus to see if anyone had heard of this as well. One of my supervisors mentioned that about a year ago he had to fill out an NIH survey addressing the current problem of very high rejection rates for NIH grants. It looks like within this survey there was a section regarding the problems of competition in science and some of these brushed around the topic of brain enhancing practices. It could be that at the time NIH was trying to measure how far people would go in an extremely competitive environment.
This really got me thinking about how we are engaged in an environment that is not that far removed from highly competitive sports. How many stories have we heard about data forgery and scandalous retractions in the last couple of years? To what extent will people go to secure their place in science? To be recognized?
So maybe NIH is right in being proactive. Even if the issue is not as serious in science as it is in sports, unless there is an amazing influx of money or a considerable decrease of working scientists this might become an important problem. If nothing else we will get to know the current extent of these practices and it highlights yet again how far we deviated from course. The money society puts into scientific research is being wasted on overlapping competitive projects. Research agendas should be open and free for anyone to participate in. Maybe NIH should regulate that as well.

Monday, March 31, 2008

call for Bio::Blogs #20

The 20th edition of Bio::Blogs will be posted here by the end of tomorrow. This is very short notice but if anyone would like to contribute please send a few links of the most interesting things of the past month and I will put everything together (email bioblogs at gmail).