
Wednesday, November 16, 2022

20 years of open science or how we haven't radically changed the way we do science online

Around 20 years ago I was a starting PhD student and it was an exciting time for the internet. It was the time of blogs, wikis and a large increase in public participation, with more user-generated content, in what is commonly known as the start of Web 2.0. These were the times of web-based online communities such as the now defunct Kuro5hin or the great survivor slashdot.org. I started this blog 19 years ago and I was also "hanging out" in an online community called Nodalpoint. Nodalpoint no longer exists but it was a discussion forum/wiki for bioinformatics, with some of these discussions still preserved thanks to the magic of the Wayback Machine.

Around 2002-2006 all of the excitement around Web 2.0 was also infecting academia, with many discussions around open science. I know that open science is a vague term that can mean many different things, including open access, citizen science, open source and many others. One specific aspect that I want to focus on is the idea of organizing research in a way that is not based on local group structures. In 2005 I wrote a Nodalpoint post on "Virtual collaborative research", which is similar in spirit to open source software development but with a focus on discovery rather than tool development. Part of this would mean surfacing more of our ongoing research and taking part in research projects that are not organized around traditional research group structures. The idea of being extremely open about ongoing research activities was advocated by others under the term "open notebook science".

Over the following years I made a few attempts at starting such open research projects, with blog posts where I tried to set up tools and ideas that others could take part in (see posts from 2007, 2008 and 2010). The last project idea I tried to propose in this way ended up being one of the major projects of my postdoc and basically one of the research lines I am still working on. In the end, none of these attempts really took off as open collaborative research projects. In hindsight, I am not surprised it didn't work. Even within the local structures of research institutes and university departments there is plenty of discussion about incentives for local collaborations. While I think the traditional structures for organizing research do work, as a PhD student and postdoc I was very frustrated by the apparent difficulty of making the most of everyone's expertise. As a group leader I have more capacity to establish collaborations but I still think we aren't using the internet to its full capacity.

So what happened in the decade from 2010 to 2020? Blogs and online communities mostly died out and Web 2.0 was swallowed by corporations. One major change was the rise of large social networks and the standardization of the stream as the way for people to share information and interact. Academia started participating in social networks around the time of Friendfeed (2007-2015) and such participation became mainstream with the popularization of Twitter. I honestly would never have predicted the rise of academic Twitter and it is truly a sign of how the geeks have inherited the earth.

The reason I am even thinking about open science these days is that over the past couple of years we have been involved in projects that have illustrated this potential of large collaborations empowered by the internet. I wanted to write this down also to have something to come back to in the future. The first project was a study of phosphorylation changes during SARS-CoV-2 infection. Like many others, when the pandemic sent our research group home, I thought about what we could do to help and sent emails to a few people that could be working on the topic. Nevan Krogan, my former postdoc supervisor, was very keen to involve us, which led to several projects including this study of protein phosphorylation. This was probably one of the most exciting projects I have been involved with and included a very spontaneous collaboration among a large international team coordinated by a few people through Slack. In this case the network of interactions was provided by Nevan and it was possible because everyone was pushing in the same direction, triggered by a catastrophe. I wish everyone could feel the sense of power that I think we felt during this project. There was so much scientific capacity at the disposal of this single project and we could iterate through experiments and data analysis at an incredible pace. It is hard to even express how it felt to be able to just get things done when you had the world experts for whatever was required at every step.

A second, even more interesting example was a community effort to study the value of AlphaFold2 in a series of applications. When AlphaFold2 was released, several scientists started sharing their early observations of how AlphaFold2 and predicted structures could be used for different applications. I thought all of these examples were really exciting and that we could structure this output into a manuscript. So I just contacted people that were doing this and also asked on social media if anyone else wanted to participate. In the end every contribution was quite modular and it was easy to integrate everything into a manuscript with a few meetings and a Google Doc to put things together. Perhaps the most unusual thing that happened was receiving actual results through Twitter chat.

I think both of these examples required a trigger - the pandemic and the release of AlphaFold2 - that led to many scientists moving in the same direction. In both cases I think we achieved in a few months what would take a single group potentially one to several years to do. Yet, these interactions remain difficult to establish. Perhaps simply because we are just too busy with our own research questions or, more likely, because of the importance of credit and evaluation systems in academia. These days I am actually less in favor of radical sharing of ongoing research in the spirit of open notebook science. I don't think we have the attention span for it. It would be too difficult to navigate and may lead to more "group think" instead of divergent thinking and ideas. Maybe the simple existence of social networks like Twitter is already a good step forward. I certainly get to know more people and what they may be up to via this. Let's see what the next 20 years bring.
  





Thursday, June 10, 2021

A not so bold proposal for the future of scientific publishing

Around 15 years ago I wrote a blog post about how we could open up more of the scientific process. The particular emphasis that I had in mind was to increase the modularity of the process in order to make it easier to change parts of it without needing a revolution. The idea would be that manuscripts would be posted to preprint servers where they could accumulate comments and be revised until they are considered suitable for accreditation as peer-reviewed publications. At the time I also thought we could be even more extreme and have all of the lab notebooks open to anyone, which I no longer consider to be necessarily useful.

Around 15 years have passed and, while I was on point with the direction of travel, I was very off the mark in terms of how long it would take us to get there. Quite a lot has happened in the last 15 years, with the biggest changes being the rise of open access, preprint servers and social media. PLoS One started as a journal that wanted us to do post-publication peer review. It started with peer review focused on accuracy, wanting then to leverage the magic of Web 2.0 to rank articles by how important they were through likes and active commenting by other scientists. The post-publication peer review aspect was a total failure but the journal was an economic success that led to the great PLoS One Clone Wars, with consequences that are still being felt today - just go and see how many new journals your favourite publisher opened this year.

The rise of preprint servers has been the real magic for me. We live in each other's scientific past by at least 2 years or so. If you sit down and have a science chat with me I can tell you about all of the work that we are doing which won't be public for some 2 years. If I didn't put our group's papers out as preprints you would be waiting at least 6-12 months to know about them. Preprint servers are a time machine: they move everyone forward in time by 12 months and speed up the exchange of ideas as they are being generated around the globe. If you don't post your manuscripts as preprints you are letting others live in the past and you are missing out on increased visibility for your own research.

Preprint servers also fill the crucial need to dissociate the act of making a manuscript public from the process of peer review, certification as a peer-reviewed paper and dissemination. This is important because it allows the whole scientific publishing system to innovate. This is needed because we waste too much money and time on a system that is currently not serving authors or readers efficiently.

So after nearly 15 years the updated version of the proposal is almost unchanged:

I no longer think it would be that useful to have lab notebooks freely available for anyone to read. There are parts of research that are too unclear and I suspect that the noise-to-information ratio would be too high for this to be of value. However, useful datasets that are not yet published could be more readily made available prior to publication. Along these lines, the ideas in the form of funded grant proposals should be disclosed after the funding period has lapsed. As for the flow from manuscript to publication, the main ideas remain and the systems already exist to make these more than just ideas. There are already independent peer review systems like Review Commons. Such systems could eventually be paid and could lead to the establishment of professional paid peer reviewers. Such costs would then be deducted from other publishing costs depending on how the accreditation was done. Eventually "traditional" publishing could be replaced by overlay journals, like preLights, whose job would be to identify peer-reviewed preprints that are of interest to a certain community.

Social media for me has been the most surprising change in scientific communication. I didn't expect so many scientists to join online discussions via social media. Then again, I didn't foresee the geekification of society. In many ways social media is already acting as a "publishing" system in the sense of distribution. Most of the articles I read today I find through Twitter or Google Scholar recommendations. As we are all limited by the attention we can give, I think one day, instead of complaining about how impact factors distort hiring decisions, we will be complaining about how social media biases distort what we think is high-value science.

So finally, what can you do to move things along if you feel it is important? If you think we have too many wasteful rounds of peer review across different journals, that the cost of open access publishing is too high, or even simply that publicly funded research should be free to read and openly available to mine, then the best single thing you can do today is make your manuscripts available via preprint servers.

 

Friday, November 27, 2015

Predicting PTM specificities from MS data and interaction networks

Around four years ago I wrote this blog post where I suggested that it might be possible to combine protein interaction data with phosphosites from mass-spectrometry (MS) data to infer the specificity of protein kinases. I did a very simple pilot test and invited others to contribute to the idea. Nobody really picked up on it until Omar Wagih, a PhD student in the group, decided to test the limits of the approach. To his credit I didn't even ask him to do it; his main project was supposed to be on individual genomics. I am glad that he deviated long enough to get some interesting results that have now been published.

As I described four years ago, the main inspiration for this project was the work of Neduva and colleagues. They showed that motif enrichment applied to the interaction partners of peptide binding domains can reveal the binding specificity of the domain. One step of their method was to filter out regions of proteins that were unlikely to be target sequences before doing motif identification. For PTM enzymes or binding domains we should be able to take advantage of MS-derived PTM data to select the peptides for motif identification by just taking the peptide sequences around the PTM sites. This was exactly what Omar set out to do, focusing on human kinases as a test case.
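To make the core idea concrete, here is a minimal sketch in Python, not the published pipeline: collect the phosphosite-centered peptides of a kinase's interaction partners and score position-specific amino-acid enrichment against all phosphosites. The input tables (`interactions`, `phosphopeptides`) are hypothetical stand-ins for the real interaction and MS datasets, and the actual analysis used motif-x statistics rather than this simple log-odds score.

```python
import math
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def position_enrichment(kinase, interactions, phosphopeptides, background):
    """Log2-odds of each amino acid at each position in the windows around
    phosphosites of the kinase's interaction partners, relative to the
    windows around all phosphosites (the background)."""
    # Foreground: phosphosite-centered peptides of the interaction partners.
    fg = [pep for partner in interactions[kinase]
          for pep in phosphopeptides.get(partner, [])]
    width = len(background[0])
    scores = {}
    for pos in range(width):
        fg_counts = Counter(pep[pos] for pep in fg)
        bg_counts = Counter(pep[pos] for pep in background)
        for aa in AA:
            f = (fg_counts[aa] + 1) / (len(fg) + len(AA))   # pseudocounts
            b = (bg_counts[aa] + 1) / (len(background) + len(AA))
            scores[(pos, aa)] = math.log2(f / b)
    return scores
```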

To summarize the outcome of this project: the method works, with some limitations. For around a third of the human kinases that could be benchmarked he got very good predictions (AUC>0.7). For some kinase families the predictions are better than others and we think it is due to how specific the kinase is for the residues around the target site. It is known that kinases find their targets via multiple mechanisms (e.g. docking sites, shared interactions, co-localization, etc). This specificity prediction approach will work better for kinases that find their targets mostly by recognizing amino-acids near the phosphosite. With the help of Naoyuki Sugiyama in Yasushi Ishihama's lab we validated the specificity predictions for 4 understudied human kinases. One advantage of this approach is that it could be very general. Omar tried it also on 14-3-3 domains, which bind phosphosites, and on a bromodomain-containing protein that is known to bind acetylated peptides. Finally, we also tried to use this to compare kinase specificity between human and mouse but, given the current limitations of the method, I don't think it is possible to use these predictions alone to find divergent cases of specificity.

The predictions for human kinase specificity can be found here and a tutorial on how to repeat these predictions is here. The motif enrichment was done using the motif-x algorithm. Given that we could not really use the web version, Omar implemented the algorithm in R and a package is available here.

There are many other ways to predict specificities for PTM enzymes and binding domains. If you have many known target sites the best way is to train a predictor such as Netphorest or GPS. There is also the possibility of using the known target sites in conjunction with structural data to infer rules about specificity and the specificity-determining residues. A great example of this is Predikin and more recently KINspect. Ongoing work in the group now aims to combine what Omar did with some aspects of Predikin to study the evolution of kinase specificity.

Going back to the beginning of the post, this idea was my second attempt at an open science project. The first attempt was a project on the evolution and function of protein phosphorylation (described here). That ended up being one of the main projects of my postdoc and is now the main focus of the group. I am still curious to know if distributed open science projects will ever take off. I don't mean big project consortia but smaller-scale research where several people could easily contribute their expertise almost as "spare cycles". Often, when you are an expert in some analysis or method, you could easily add a contribution with little effort. However, there was much more excitement about open science a few years ago, whereas now most of the discussions have shifted to preprints and doing away with the traditional publishing system. Maybe we just don't have time to pay attention or to contribute to such open projects.



Tuesday, March 10, 2015

A Borg moment and the end of Friendfeed

Apparently Facebook finally decided to shut down Friendfeed after several years of declining usage. I only found out because Neil, Deepak and Cameron wrote posts about this. Although I was a heavy user I ended up moving with the crowd after the Facebook acquisition. For those that never used it but are familiar with Twitter or Facebook it might be hard to understand why some people like myself are so disappointed with its decline. Friendfeed was simply leaps ahead of anything at the time as a mechanism for sharing information and organizing discussions around these shared items. In fact, although there has been no further development for 5 years, it is still much better than Twitter for these things. As Neil mentioned in his post, it is hard to understand why this is the case. Maybe because comments were attached to a shared item and not limited to 140 characters, so you could actually have meaningful discussions. Unlike forums, the shared items were a feed/river, so there was the same impression and emphasis on immediacy as Twitter. However, recently commented items would jump up in your feed, which tended to foster discussions. It is possible that it only worked because those that joined were the right people at the right time. Maybe it would not scale with the trolls. We will never know.

For those that never used it I want to write down the best experience I ever had on Friendfeed. I was attending the ISMB conference in Toronto in 2008. The number of geeks at this conference is understandably high and there were many Friendfeed users attending. At the time Friendfeed had already introduced the notion of a "room", which was a separate public feed that anyone could join, similar to tracking a hashtag on Twitter. A feed for the conference was set up and many people at the conference joined and started participating. In fact, the feed is still available here so you can go have a look for the time being. This was the first time I really had the impression of connecting to a hive-mind. In this backchannel tens of people were taking notes and giving comments about the several simultaneous talks. During keynotes you could even see, as the speaker was changing topics, different people taking up the slack of note-taking and commenting according to their own expertise. Unlike Twitter, it didn't feel like we were drowning in a sea of uncoordinated messages. You could always focus your attention on just one thread (i.e. a shared item) and its comments at any time. It worked so well that we ended up using the notes to write a conference report that got published in PLOS Comp Bio.

That community of scientists and other open science advocates moved on to Twitter after the Facebook acquisition. Twitter usage by scientists, and in particular by prominent established scientists, also really took off at around the same time. Although it serves a similar purpose, Twitter really is more of a broadcasting mechanism than a discussion forum. It is a pity that a lesser solution won out. Still, the amount of open scientific discussion that is going on online these days is just phenomenal and a drastic change from my PhD days.

Tuesday, April 27, 2010

Science isn’t fair

<rant>

Life isn’t fair, science is part of life therefore science isn’t fair. This would be a very short way to say what I am thinking but this is a rant so I will stretch it out a bit more.

We learn early on that in our line of work there is almost no correlation between the amount of work we do and the results we get. You need luck, and I am not turning mystical on you here. I mean the low-likelihood kind of luck. Even if you do everything right, being successful in science depends mostly on factors that are outside your control. A somewhat random pool of people end up being in the right place at the right time to go on with their academic work. Almost like a game of musical chairs, those with enough passion and perseverance to sustain the blows of lady luck get to play in the final rounds. Granted, I have been at this only for a few years but I have seen my share of hard-working people getting scooped or hitting the wall with impossible projects. Try explaining scooping to non-scientists to see how ridiculous it sounds. I have also seen people (myself included) getting authorships for things I would not consider worthy of such.

So … science isn’t fair. This was exactly the sort of observation that made me start thinking about open science a few years ago. We could help to even out the playing field if we were all a bit more open about what we are working on. Too many financial and personal resources are eaten away by the duplication of research agendas.

</rant>

Sunday, January 03, 2010

Stitching different web tools to organize a project

A little over a year ago I mentioned a project I was working on about the prediction and evolution of E3 ligase targets (aka P1). As I said back then, I am free to risk as much as I want in sharing ongoing results, and Nir London just asked me via the comments of that blog post how the project is going, so I decided to give a bit of an update.

Essentially, the project quickly deviated from course once I realized that predicting E3 specificity and experimentally determining ubiquitylation sites in fungal species (without having to resort to strain manipulation) were not going to be easy tasks.
So, since the goal was to use these data to study the co-evolution of phosphorylation switches (phosphorylation regulating ubiquitylation), it made little sense to restrict the analysis specifically to one form of post-translational modification (PTM). After a failed attempt to purify ubiquitylated substrates, the goal has been to come up with ways to predict the functional consequences of phosphorylation. We will still need to take ubiquitylation into account but that will be a part of the whole picture.

With this goal in mind we have been collecting data on phosphorylation, as well as other forms of PTM, for multiple species from databases and the literature, and we have been trying to come up with ways to predict the function of these phosphorylation events. These predictions can be broken down mostly into three types:
- phosphorylation regulating domain activity
- phosphorylation regulating domain-domain interactions (globular domain interfaces)
- phosphorylation regulating linear motif interactions (phosphorylation switches in disordered regions)

We have set up a notebook where we will be putting some of the results and ways to access the datasets. Any new experimental data and results from the analysis will be posted with a significant delay, both to give us some protection against scooping and also to try to guarantee that we don't push out things that are obviously wrong. This brings us to a disclaimer... all data and analysis in that notebook is to be considered preliminary and not peer reviewed; it probably contains mistakes and can change quickly.

I am currently collaborating with Raik Gruenberg on this project and we are open to collaborators that bring new skills to the project. We are particularly interested in experimentalists working in cell biology and cell signalling who might be interested in testing some of the predictions we are getting out of this study.

I won't talk much (yet) about the results we have so far but instead mention some of the tools we are using or planning to use:
- The notebook of the project hosted in openwetware
- The datasets/files are shared via Dropbox
- If the need arises, code will be shared via Google Code (currently empty)
- Literature will be shared via a Zotero group library
- The papers and other items can be discussed in a Friendfeed group

This will be all for now. I think we are getting interesting results from this analysis on the evolution of the functional consequences of phosphorylation events but we will update the notebook when we are a bit more confident that we ruled out most of the potential artifacts. I think the hardest part about exposing ongoing projects is having to explain to potential collaborators that we intend to do so. This still scares people away.

I'll end with a pretty picture. This is an image of a homology model of the Tup1-Hhf1 interaction. Highlighted are two residues that are predicted by the model to be in the interface and are phosphorylated in two different fungal species. This exemplifies how the functional consequence of a phosphorylation event can be conserved although the individual phosphorylation sites (apparently) are not.


Tuesday, August 11, 2009

Translationally optimal codons do not appear to significantly associate with phosphorylation sites

I recently read an interesting paper about codon bias at structurally important sites that sent me on a small detour from my usual activities. Tong Zhou, Mason Weems and Claus Wilke described how translationally optimal codons are associated with structurally important sites in proteins, such as the protein core (Zhou et al. MBE 2009). This work is a continuation of the work from this same lab on what constrains protein evolution. I have written here before a short review of the literature on the subject. As a reminder, it was observed that the expression level is the strongest constraint on a protein's rate of change, with highly expressed genes coding for proteins that diverge more slowly than lowly expressed ones (Drummond et al. MBE 2006). It is currently believed that selection against translation errors is the main driving force restricting this rate of change (Drummond et al. PNAS 2005, Drummond et al. Cell 2008). It has been previously shown that translation errors are introduced, on average, at a rate of about 1 to 5 per 10000 codons and that different codons can differ in their error rates by 4 to 9 fold, influenced by translational properties like the availability of their tRNAs (Kramer et al. RNA 2007).

Given this background, what Zhou and colleagues set out to do was test whether codons that are associated with highly expressed genes tend to be over-represented at structurally important sites. The idea is that such codons, defined as "optimal codons", are less error-prone and should therefore be preferred at positions that, when mistranslated, could destabilize the protein. In this work they defined a measure of codon optimality as the odds ratio of codon usage between highly and lowly expressed genes. Without going into many details, they showed, in different ways and for different species, that codon optimality is indeed correlated with the odds of being at a structurally important site.

I decided to test if I could also see a significant association between codon optimality and sites of post-translational modification. I defined a window of plus or minus 2 amino-acids surrounding a phosphorylation site (of S. cerevisiae) as associated with post-translational modification. The rationale would be that selection for translational robustness could constrain codon usage near a phosphorylation site when compared with other serine or threonine sites. For simplification I mostly ignored tyrosine phosphorylation, which in S. cerevisiae is a very small fraction of the total phosphorylation observed to date.
For each codon I calculated its over-representation at these phosphorylation windows compared to similar windows around all other S/T sites and plotted this value against the log of the codon optimality score calculated by Zhou and colleagues.
Figure 1 - Over-representation of optimal codons at phosphosites
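The computation behind figure 1 boils down to a per-codon frequency ratio between the two sets of windows. A minimal sketch, assuming a hypothetical input of codon sequences with annotated phosphorylated and non-phosphorylated S/T positions for each ORF:

```python
from collections import Counter

def window_codons(codons, sites, w=2):
    """Collect the codons in a +/-w window around each site."""
    out = []
    for s in sites:
        out.extend(codons[max(0, s - w): s + w + 1])
    return out

def codon_overrepresentation(orfs, w=2):
    """orfs: list of (codon_list, phospho_positions, other_st_positions).
    Returns, for each codon, its frequency in phosphosite windows divided
    by its frequency in windows around all other S/T sites."""
    phospho, other = Counter(), Counter()
    for codons, psites, st_sites in orfs:
        phospho.update(window_codons(codons, psites, w))
        other.update(window_codons(codons, st_sites, w))
    n_p, n_o = sum(phospho.values()), sum(other.values())
    return {c: (phospho[c] / n_p) / (other[c] / n_o)
            for c in other if phospho[c]}
```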
At first impression it would appear that there is a significant correlation between codon optimality and phosphorylation sites. However, as I will try to describe below, this is mostly due to differences in gene expression. Given the relatively small number of phosphorylation sites per protein, it is hard to test this association for each protein independently, as was done by Zhou and colleagues for the structurally important sites. The alternative is therefore to try to take into account the differences in gene expression. I first checked if phosphorylated proteins tend to be coded by highly expressed genes.
Figure 2 - Distribution of gene expression of phosphorylated proteins

In figure 2 I plot the distribution of gene expression for phosphorylated and non-phosphorylated proteins. There is only a very small difference, with phosphoproteins having a marginally higher median gene expression when compared to other proteins. However, this difference is small and a KS test does not rule out that they are drawn from the same distribution.

The next possible expression-related explanation for the observed correlation would be that highly expressed genes tend to have more phosphorylation sites. Although there is no significant correlation between the gene expression level and the absolute number of phosphorylation sites, what I observed was that highly expressed proteins tend to be smaller in size. This means that there is a significant positive correlation between the fraction of phosphorylated serine and threonine sites and gene expression.
Figure 3 - Expression level correlates with fraction of phosphorylated ST sites

Unfortunately, I believe this correlation explains the result observed in figure 1. In order to properly control for this observation I recalculated the correlation from figure 1 after randomizing the phosphorylation sites within each phosphoprotein. For comparison I also randomized the phosphorylation sites keeping the total number of phosphorylation sites fixed but not restricting the number of phosphorylation sites within each specific phosphoprotein.

Figure 4 - Distribution of R-squared for randomized phosphorylation sites

When randomizing the phosphorylation sites within each phosphoprotein, keeping the number of phosphorylation sites in each specific phosphoprotein constant, the average R-squared is higher than that observed with the experimentally determined phosphorylation sites (pink curve). This would mean that the correlation observed in figure 1 is not due to functional constraints acting on the phosphorylation sites but instead is probably due to the correlation observed in figure 3 between the expression level and the fraction of phosphorylated S/T residues.
The observed correlation would appear to be significantly higher than random if we allow the random phosphorylation sites to be drawn from any phosphoprotein without constraining the number of phosphorylation sites in each specific protein (blue curve). I added this because I thought it was a striking example of how a relatively subtle change in assumptions can change the significance of a score.
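A sketch of the two randomization schemes compared in figure 4 (the per-protein scheme behind the pink curve and the pooled scheme behind the blue curve), assuming hypothetical per-protein inputs of all S/T positions plus the observed phosphosites:

```python
import random

def randomize_within_protein(proteins):
    """proteins: {name: (st_sites, psites)}. Reassign each protein's
    phosphosites to a random subset of its own S/T sites, keeping the
    per-protein count fixed (the pink curve)."""
    return {name: set(random.sample(st_sites, len(psites)))
            for name, (st_sites, psites) in proteins.items()}

def randomize_across_proteins(proteins):
    """Redistribute the total number of phosphosites over the pooled
    S/T sites of all phosphoproteins, without per-protein constraints
    (the blue curve)."""
    pool = [(name, s) for name, (st_sites, _) in proteins.items()
            for s in st_sites]
    total = sum(len(psites) for _, psites in proteins.values())
    out = {name: set() for name in proteins}
    for name, s in random.sample(pool, total):
        out[name].add(s)
    return out
```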

I also tested if conserved phosphorylation sites tend to be coded by optimal codons when compared with non-conserved phosphorylation sites. For each phosphorylation site I summed over the codon optimality in a window around the site and compared the distribution of this sum for phosphorylation sites that are conserved in zero, one or more than one species. The conservation was defined based on an alignment window of +/- 10AAs of S. cerevisiae proteins against orthologs in C. albicans, S. pombe, D. melanogaster and H. sapiens.
Figure 5 - Distribution of codon optimality scores versus phospho-site conservation

I observe a higher sum of codon optimality for conserved phosphorylation sites (fig 5A) but this difference is not maintained if the codon optimality score of each peptide is normalized by the expression level of the source protein (fig 5B).

In summary, when gene expression levels are taken into account, there does not appear to be an association between translationally optimal codons and the region around phosphorylation sites. This is consistent with the weak functional constraints observed in the analysis performed by Landry and colleagues.

Friday, June 26, 2009

Reply: On the evolution of protein length and phosphorylation sites

Lars just pointed out in a blog post that the average protein length of a group of proteins is a strong predictor of the average number of phosphorylation sites. Although this is intuitive, it is something that I honestly had not fully considered. As Lars mentions, this has potential implications for some of the calculations in our recently published study on the evolution of phosphorylation in yeast species.

One potential concern relates to figure 1a. We found that, although protein phosphorylation appears to diverge quickly, there is a high conservation of the relative number of phosphosites per protein for different GO groups. Lars suggests that, at least in part, this could be due to relative differences in the average protein size of these different groups, which in turn is highly conserved across species.

To test this hypothesis more directly I tried to correct for differences in the average protein size of different functional groups by calculating the average number of phosphorylation sites per amino-acid, instead of psites per protein. These values were then corrected for the average number of phosphorylation sites per AA in the proteome.

As before, there is still a high cross-species correlation for the average number of psites per amino-acid for different GO groups. The correlations are only somewhat smaller than before. The individual correlation coefficients among the three species changed from: S. cerevisiae versus C. albicans – R~0.90 to 0.80; S. cerevisiae versus S. pombe – R~0.91 to 0.84; S. pombe versus C. albicans – R~0.88 to 0.83. It would seem that differences in protein length explain only a small part of the observed correlations. Results in figure 1b are also not qualitatively affected by this normalization, suggesting that the observed differences are not due to potential changes in the average size of proteins. In fact, the number of amino acids per GO group is almost perfectly correlated across species.
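The length correction itself is simple; a minimal sketch with hypothetical inputs (per-protein lengths and phosphosite counts, plus GO group memberships for one species) is below. The cross-species comparison would then correlate these normalized rates for GO terms shared between species.

```python
def psites_per_aa_by_group(proteins, go_groups):
    """proteins: {name: (length, n_psites)};
    go_groups: {go_term: [protein names]}.
    Returns each GO group's phosphosites-per-amino-acid rate divided
    by the proteome-wide rate."""
    total_aa = sum(length for length, _ in proteins.values())
    total_ps = sum(n for _, n in proteins.values())
    proteome_rate = total_ps / total_aa
    rates = {}
    for go, members in go_groups.items():
        aa = sum(proteins[m][0] for m in members)
        ps = sum(proteins[m][1] for m in members)
        rates[go] = (ps / aa) / proteome_rate
    return rates
```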

Another potential concern relates to the sequence-based prediction of phosphorylation. As explained in the methods, one of the two approaches used to predict if a protein was phosphorylated was a sum over multiple phosphorylation site predictors for the same sequence. Given the correlation shown by Lars, could it be that, at least for one of the methods, we are mostly predicting the average protein size? To test this I normalized the phosphorylation prediction for each S. cerevisiae protein by its length. I re-tested the predictive power of this normalized value using ROC curves and the known phosphoproteins of S. cerevisiae as positives. The AROC values changed from 0.73 to 0.68. This shows that the phosphorylation propensity is not just predicting protein size although, as expected from Lars' blog post, size alone is actually a decent predictor of phosphorylation (AROC=0.66). The normalized phosphorylation propensity does not correlate with protein size (CC~0.05), suggesting that there might be ways to improve the predictors we used.
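A sketch of this check, with hypothetical per-protein inputs; scikit-learn's `roc_auc_score` stands in here for whatever ROC implementation was actually used:

```python
from sklearn.metrics import roc_auc_score

def compare_predictors(propensity, lengths, is_phospho):
    """All arguments are equal-length sequences over S. cerevisiae
    proteins; is_phospho holds 1 for known phosphoproteins, 0 otherwise.
    Returns the AROC of each candidate predictor."""
    normalized = [p / l for p, l in zip(propensity, lengths)]
    return {
        "propensity": roc_auc_score(is_phospho, propensity),
        "length-normalized": roc_auc_score(is_phospho, normalized),
        "length alone": roc_auc_score(is_phospho, lengths),
    }
```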

Nature or method bias?
Are larger proteins more likely to be phosphorylated in a cell or are they more likely to be detected in a mass-spec experiment? It is likely that what we are observing is a combination of both effects but it would be nice to know how much of this observed correlation is due to potential MS bias. I am open to suggestions for potential tests.
This is also important for what I am planning to work on next. A while ago I had noticed that prediction of phosphorylation propensity could also predict ubiquitination and vice-versa. It is possible that they are mostly related by protein size. I will try to look at this in future posts.

Wednesday, November 12, 2008

Open Science - just do it

My blog is 5 years old today and to celebrate I am trying to actually do some blogging. There are a couple of reasons why I have blogged less in the past months. In part it was due to FriendFeed and also in part because I was trying to finish a project on the evolution of phospho-regulation in yeast species. Nearing the end of a project should actually provide some of the most interesting blogging material but I did not ask for permission from everyone involved to write about ongoing work.

I have to admit that although I have been discussing and evangelizing open science for over two years I have done very little of it. I have used this blog sometimes to put up small analysis or mini-reviews but never to describe ongoing projects. I have tried to start a side-project online but I over-estimated the amount of "spare cycles" I have for this. So, I have talked it over with my supervisor and I am now free to "risk" as much as I want in trying out Open Science. The first project I will be trying to work on will be on E3 target prediction and evolution.

Prediction and evolution of E3 ubiquitin ligase targets
As I have mentioned above, I have been working in the past months on the evolution of phosphorylation and kinase-substrate interactions in yeast species. I am interested in the evolution of regulatory interactions in general because I believe that they are important for the evolution of novel phenotypes. This is why I will be trying to study the evolution of E3 target interactions. In order to get there I will try first to develop some methods to predict ubiquitination and E3 targets. Since a lot of the ideas and methodology applies to other post-translational modifications and even localization signals I will in the future try to generalize the findings to other types of interactions.

Some of the questions that I will try to address:
- How accurately can we predict E3 substrates?
- How quickly in evolution do E3 targets change?
- Is there co-regulation by kinases and E3s on the same targets (and how does this evolve)?

Once I have something substantial I will open a code repository on Google Code.

Saturday, August 09, 2008

BioBarCamp wrapup

In the last two days I attended the first BioBarCamp, here in the bay area at the Institute for the Future. There is a lot of microblogging coverage of the event on FriendFeed and even some recorded video from Cameron Neylon (click "on demand" and pick BioBarCamp).

The meeting was fun due to the unstructured nature of the event and also because I got to meet a lot of people I knew only from blogs. Two highlights of the event were the talks by Aubrey de Grey (see notes and also Cameron's video above) and Jon Trowbridge from Google, who talked about this.

There were four parallel discussions going on but I kept mostly to the open science and web tools related talks. There are a couple of ideas that I took away from these discussions that I will mention below but in general these overlap with what Shirley already mentioned in her post.

Pragmatic steps for Open Science and web tool adoption
Kaitlin Thaney and Cameron Neylon talked about open science and data commons. Cameron in particular is making the case that we need to demand open data the same way we demand open access to science articles. Although publishers will say that they already ask for everything required to reproduce the results to be made available, the truth is that this is not really well enforced. Funding agencies should provision funds to make raw results freely available for re-use once an article is accepted for publication.

On the side of web tools for science, Ricardo Vidal (OpenWetWare), Vivek Murthy (Epernicus), Jeremy England and Mark Kaganovich (Labmeeting) discussed user adoption. Adoption rates among scientists tend to be slow and there is a large generational gap. Again, pragmatic steps need to be taken to promote the usage of these tools in science. Some of the current problems include fragmentation of the user base, lack of focus in tool development and too few security restrictions.

These tools should try to focus on solving a few important problems really well. Examples of such problems include finding the person in my network that might have some expertise that I need, better ways to find articles that I find relevant, or managing my lab notebook and article library. To reduce the fragmentation of the user base it would be great if these websites found a way to share the social graph.

Finally, the question of privacy online was again revisited. The idea of having open lab notebooks that anyone can see (as in OpenWetWare) might be a bit too radical and put off users that want to try the tools without the risks associated with exposing their research online. As has been discussed elsewhere, there are advantages to having electronic notebooks (easier to access, share with peers and back up) but very few people will risk having their lab notebooks freely available online. Therefore allowing for privacy should increase usage.

Sunday, July 27, 2008

Some backlash on Open Science

During ISMB, thanks to Shirley Wu (FF announcement), there was an improvised BoF (Birds of a Feather) session on web tools for scientists. Given that the meeting was not really announced we were not really expecting a full room. I would say that we had around 20 to 30 people that stayed at least for a while. We talked in general about tools that are useful in science (things like online reference managers, pre-print archives, community wikis, FriendFeed, Second Life) and we also talked a bit about the culture of sharing and open science.

Curiously, the most interesting discussion I had about open science was not at this BoF session but after it. The following day the subject came up again in a conversation between me and three other people (two PhD students and a PI from a different lab). I will not identify the people because I don't know if they would like that or not. The most striking thing for me about this conversation was the somewhat instinctive negative reaction against open science on the part of the two PhD students. After a long discussion they made a few interesting arguments that I will mention below, but what was strange for me was that this was the first time I saw someone react instinctively in a negative way against the concepts of open science.

One of the students in particular was arguing that scientists sharing their results online (prior to peer review) is not only silly on their part (the scooping argument) but would be detrimental to science as a whole. The most concrete argument he offered was that seeing someone "stake claim" to a research problem might scare other people away from even trying to solve it. I would say that it would be better to have people collaborating on the same research problems instead of the current scenario where a lot of scientists waste years (of their time and resources) working in parallel without even knowing about it. He argues simply that some people might not want to collaborate at all and should be allowed to work in this way. I don't think scientists should be forced to put their work online before peer review, I just happen to think that this would improve collaborations and decrease the current waste of resources.

The second argument against sharing research ideas and results prior to peer review was more consensual. They all mentioned the problem of noise and how it is already difficult to find relevant results in the peer-reviewed literature. They suggested that this problem would be further increased if more people were to share their ideas and results online. I fully agree that this is a problem but it is not one specific to open science. This is a sorting/filtering problem that is already important today with the large increase in journals and published articles. We do need better recommendation and filtering tools, but sharing ideas and results in blogs/wikis/online project management tools is not going to seriously increase the noise since these are all very easily separated from peer-reviewed articles. No one is forced to track shared projects, but if they are available it would make it that much easier to start a collaboration when and if it makes sense to do so. Are open source repositories detrimental to the software industry?

It took around 3 years from when people started discussing the ideas of open science and open notebooks for these concepts to get some attention. It is inevitable (and healthy) that as more people are exposed to a meme more counter-arguments emerge. I guess a backlash only means that the meme is spreading.




Tuesday, April 08, 2008

Bio::Blogs#20 - the very late edition

I said I would organize the 20th edition of Bio::Blogs here on the 1st of April but April Fools' and my current workload did not allow me to get Bio::Blogs up on time.

There were a couple of interesting discussions and blog posts in March worth noting. For example, Neil mentioned a post by Jennifer Rohn that initiated what could be one of the longest threads in Nature Network: "In which I utterly fail to conceptualize". It started off as a small anti-Excel rant but turned, in the comments, into first a discussion of which bioinformatics tools to use and then a discussion of the wet versus dry mindset and how much of one each should devote to learning the other. Finally it ended up as an exchange about collaborations and how a social networking site like Nature Network could/should help scientists find collaborators. There was even a group started by Bob O'Hara to discuss this last issue further.

I commented on the thread already but can try to expand a bit on it here. Nature Network is positioned as a social networking site for scientists. So far the best that it has had to offer has been the blog posts and forum discussions. This is not very different from a "typical" forum. It facilitates the exchange of ideas around scientific topics, but NN could try to look at all the typical needs of scientists (lab books, grant managing, lab managing, collaborations, protocols, paper recommendations, etc.) and decide on a couple that they could work into the social networking site. Ways to search for collaborators and maybe paper recommendation engines that take advantage of your network (network + Connotea) are the most obvious and easiest to implement. Thinking long term, tools to help manage the lab could be an interesting addition.

Another interesting discussion started from a post by Cameron Neylon on a data model for electronic lab notebooks (part I, II, III). Read also Neil's post, and Gibson's reply to Cameron on FuGE.
How much of the day-to-day activities and results needs to be structured? How heavy should this structure be to capture enough useful computer-readable information? Although I find these questions and this discussion interesting, I would guess that we are far from having this applied to any great extent. If most people are reluctant to try out new applications they will be even less willing to convey their day-to-day practices via a structured data model. I mentioned recently the experiment under way at the FEBS Letters journal to create structured abstracts during the publishing process. As part of the announcement the editors commissioned reviews on the topic. It is worth reading the review by Florian Leitner and Alfonso Valencia on computational annotation methods. They argue for the creation of semi-automated tools that take advantage of both the automatic methods and the curators (authors or others). The problems and solutions for annotation of scientific papers are shared with digital lab notebooks. I hope that more interest in this problem will lead to easy-to-use tools that suggest annotations for users under some controlled vocabularies.

Several people blogged about the 15-year-old bug found in the BLOSUM matrices and the uncertainty in multiple sequence alignments. See posts by Neil, Kay, Lars and Mailund.
Both cases remind us of the importance of using tools critically. The flip side of this is that it is impossible to constantly question every single tool we use since this would slow our work down to a crawl.

On the topic of Open Science, in March the Open Science proposal drafted by Shirley Wu and Cameron Neylon for the Pacific Symposium on Biocomputing was accepted as a 3-hour workshop consisting of invited talks, demos and discussions. The call for participation is here along with the important deadlines for submissions (talk proposals due June 1st and poster abstracts due the 12th of September).

On a related note, Michael Barton has set up a research stream (explained here). He is collecting updates on his work, tagged papers and graphs posted to Flickr into one feed that gives an immediate impression of what he is working on at the present time. This is really a great setup. Even for private use within a lab, or across labs for a collaboration, this would give everyone involved the capacity to tap into the interesting feeds. I would probably not like to have everyone's feeds, and maybe a supervisor should have access to some filtered set of feeds or tags to get only the important updates, but this looks like a step in the right direction. In the same way, machines could also have research feeds that I could subscribe to, to get updates on some data source.

Also in March, Deepak suggested we need more LEAP (Lightly Engineered Application Products) in science. He suggests that it is better to have one tool that does a job very well than one that does many jobs somewhat well. I guess we have a few examples of this in science. Some of the most cited papers of all time are very well-known cases of a tool that does one job well (e.g. BLAST).


Finally, some meta-news on Bio::Blogs. I am currently way behind on many work commitments and I don't think I can keep up the (light) editorial work required for Bio::Blogs, so I am considering stopping Bio::Blogs altogether. It has been almost two years and it has been fun and hopefully useful. The initial goal of trying to knit together the bioinformatics-related blogs and offering some form of highlighting service is still required but I am not sure this is the best way going forward.
Still, if anyone wants to take over from here let me know by email (bioblogs at gmail.com).

Wednesday, December 05, 2007

Open Science project on domain family expansion

Some domain families of similar function have expanded more than others during evolution. Different domain families might have significantly different constraints imposed by their fold that could explain these differences. This project aims to understand what properties determine these differences, focusing in particular on peptide-binding domains. Examples of constraints to explore include the average cost of production or the capacity to generate binding diversity for the domain family.

This project is also a test of using Google Code as a research project management system for open science (see here for the project home). Wiki pages will be used to collect previous research and milestone discoveries during the project's development and to write the final manuscript towards the end of the project. The issue tracking system can be used to organize the required project tasks and assign them to participants. The file repository can hold the datasets and code used to derive any result.



I plan to use the blog as a notebook for the project (tag: domainevolution) and the project home at Google Code as the repository and organization center. The next few posts regarding the project will be dedicated to explaining better why I am interested in the question and developing further some of my expectations. Anyone interested in contributing is more than welcome to join in along the way. I should say that I am not in any hurry and that this is something for my 20% time ;).

Monday, November 19, 2007

Linking Out - Open Science and a new blog

Cameron Neylon posted a request for collaboration in his blog:
...we are using the S. aureus Sortase enzyme to attach a range of molecules to proteins. We have found that this provides a clean, easy, and most importantly general method for attaching things to proteins.
(...)
We are confident that it is possible to get reasonable yields of these conjugates and that the method is robust and easy to apply. This is an exciting result with some potentially exciting applications. However to publish we need to generate some data on applications of these conjugates.


They are looking for collaborators interested in applying this method. Go check the blog posts if you are interested or know someone that works on something similar.

(via Open Access News) Liz Lyon, Associate Director of UK Digital Curation Centre posted an interesting presentation on Open Science: "Open Science and the Research Library: Roles, Challenges and Opportunities?".

(via Fungal Genomes) I found a new blog related to evolution called Thirst for Science with a lot of insightful posts.

Tuesday, November 13, 2007

Last call for Open Laboratory 2007

Bora has issued a last call for submissions to the Science Blogging anthology of 2007. Like last year, the objective is to collect some of the best science blog posts of the year and compile them into a print-on-demand book (deadline December 20th 2007). Submissions can be sent using an online form and they will be reviewed by a panel that will compile the final list.
Anyone interested in participating can send in links to their favorite blog posts of the year and also volunteer to be part of the reviewing process (see instructions here).

Tuesday, September 18, 2007

More on open science

I am still catching up with a backlog of feeds and e-tocs but I just noticed that Benjamin Good posted his manuscript on E.D. in Nature Precedings. I went back to his post where he first presented the manuscript to have a look at the comments and there is a nice discussion going on there. It is a good example of the usefulness of posting our work online. There might still be too few people knowledgeable about particular topics to get very good feedback in all areas but this will tend to grow with time.

Michael Barton from Bioinformatics Zen started a new blog to use as an open science notebook about his own research.

I have a mini project in mind about the evolution of domain families that I will start describing and working on here in the blog soon.

Saturday, July 14, 2007

Another Open lab book

(Via Open Reading Frame) Jeremiah Faith is giving open notebook science a try and compiling some tips. He joins Rosie Redfield (microbiology) and Jean-Claude Bradley (chemistry) in exposing most of their research online and leading the way in changing the mindset towards open science.

Jeremiah Faith also has an interesting idea about using conference money to pay for advertisement. He figures that well-targeted ads can get you more attention than a talk. I like the idea because it is thinking out of the box, but I think that the type of connection that one can create at a conference with other people is not so easy to recreate online. Also, there might not be any need to spend money on advertisement if the blog keeps on topic and is interesting enough to get incoming links. The blog can be a good personal marketing tool.

Monday, July 09, 2007

Metadata infrastructure

Deepak and Neil blogged today about tagging and adding more structured metadata to the science web. I started by commenting on Deepak's post but it grew a bit so I turned it into a blog post.

The most obvious start for me would be to find a standard to communicate information on the perceived impact of a paper (extending hReview, for example). A paper has a unique digital identifier and ways to resolve it, but there is no standard way to communicate the number of downloads at publisher site X, the number of incoming citations in other papers and blog posts, or a simple rating by users.
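Purely as an illustration of what such a standard could carry, here is a hypothetical record sketched as JSON; every field name below is made up, and an actual standard would be worked out by extending hReview or a similar microformat:

```python
import json

# Illustrative only: field names are invented, not part of any standard.
paper_impact = {
    "identifier": {"doi": "10.1000/example"},             # hypothetical DOI
    "downloads": [{"site": "publisher X", "count": 1042}],
    "citations": {"papers": 13, "blog_posts": 5},
    "ratings": [{"user": "someuser", "score": 4, "scale": 5}],
}

print(json.dumps(paper_impact, indent=2))
```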

On the user side, the blogging platforms, social network sites and wikis would need some way to add microformat support. See for example this plugin for WordPress (via F&L). If someone knows how to do the same for Blogger please tell me in the comments. It needs to be something like clicking a button to link to a paper and out comes a formatted hReview.

I think finding standards for manuscripts is a good start because a lot of people already tag and blog about papers. There is a lot of information to aggregate and a lot of interest in having a good measure of impact for individual papers. What we learn from putting this in place can later be used for other types of data communication (e-lab books). Another possible good start would be conferences and conference reports (related to hCalendar ?).

Of course, this would require the participation of science publishers. They are the ones best in place to set up the tools and expose some of the information in a structured way to help enforce a standard.

Saturday, July 07, 2007

Referee reports in Nature Precedings ?

I was having a look at some of the bioinformatics manuscripts available in Precedings and I came upon this paper on "The Reproducibility of Lists of Differentially Expressed Genes in Microarray Studies". After the figures there is a letter to the editor with the responses to the questions from the referees.

I could not find the paper published in a peer-reviewed journal and I wonder if this was intentional or maybe part of an (opt-in and maybe buggy) automatic procedure from Nature to have submitted papers appear in Precedings. If I were an editor of a bioinformatics/genomics journal I could now consider whether this paper, with these referee reports, would be interesting to the journal and send an email to the authors suggesting that, if by some chance their paper gets rejected, my journal would be willing to publish it.

Deepak was recently saying that it would be good to have access to this type of information. Why was a paper rejected from some journal and published in another? Most manuscripts go through several editorial and referee evaluations before getting published. Biology Direct and now PLoS ONE (to some extent) capture this information. I have found that it is often useful to read the referee comments in Biology Direct because they provide several independent criticisms that make it easier to home in on the good and bad parts of the work.

I am sure that this has come up before in the context of arXiv, but wouldn't it be more efficient to have journal editors somehow fish out from a common pool what is most interesting and high-impact for their community, instead of the current submission ladder that I assume a lot of people go through? We would submit to a preprint server and tag the paper according to the perceived audience (i.e. cell biology, bioinformatics, etc). Editors would flag their interest in the paper and the authors would select one of the journals. You can imagine some of the dynamics that this could create, with some journals only looking at manuscripts that have already been flagged by some other journals, etc.

The referee reports would be attached to the paper and the editor would make a decision. If rejected, the paper would be up again for editorial selection but with the previous information attached. Other journals could just decide to publish with those referee comments.

I think this is not far from what already happens within publishing houses. Referee reports can be passed around to other journals of the same publisher; this would make the practice more general. Although there are clear advantages to authors (fewer rounds of refereeing and quicker publishing), it would be hard to convince most publishers of such a scheme. For those publishing mostly journals with low rejection rates it would be beneficial, since most likely the papers have already been refereed, but for those with high rejection rates it could feel like they would be giving away their work for free. Since it is really the work of the referees, maybe it should be up to the referees to decide if the reports can be made public or not, period.

Wednesday, June 27, 2007

Call for Bio::Blogs #12

I am collecting submissions for the 12th edition of Bio::Blogs. Send in links to blog posts you want to share from your blog or that you enjoyed reading in other blogs to bioblogs at gmail until the end of the month. The next edition will be up at Nodalpoint on the 1st of July.

Maybe it could be cool to try out a section on papers of the month as voted by everyone (Neil used to do this once in a while). Anyone interested in participating just has to send a link to a paper, published last month and related to bioinformatics, with a short paragraph explaining what is nice about the paper.

Mike over at Bioinformatics Zen is asking how to continue the Tips and Tricks section of Bio::Blogs. He has put up a wiki page on open science in Nodalpoint to collect information for a possible future edition of the special section.