Monday, October 21, 2013

Project management (online) tools

I am currently looking for a tool to centralize project management across the group. I asked on Twitter for suggestions and received a number of useful tips. In case this is of use to others, here are a few notes I took when exploring a few of these options. The features I am particularly interested in are: low/no set-up or upkeep requirements, intuitive use, rich project notebooks with the possibility to add images, and back-up support. Nice features to have: the possibility to share with the public; integration with Dropbox and/or Google Drive.

Here are the notes in no particular order with my preferences at the end.

Basecamp
Simple, intuitive and well designed project management and collaboration tool. Each project can have: project updates (activity list), text documents (simple text only, cannot add images), to-do lists (linked to the calendar) and discussion items (text and embedded images that can stand alone or be linked to any other item, including other discussions). The group view can quickly show you updates across all projects you are involved in. The group and project views are great, but it would be nicer to have notebooks within each project as implemented in Evernote. Discussions can be used as notebooks but they get mixed in with comments on other items, such as to-do list items. All projects can be downloaded for back-up, but automation requires a 3rd-party service or coding via the API. iOS app available, and Android via a 3rd-party app. No free account (60-day trial); plans start at $20/month for 10 projects with a 3GB limit, up to $3000/year for unlimited projects with a 500GB limit. Basecamp can be extended with additional services (mostly 3rd party) that tend to cost additional fees.
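For the back-up automation mentioned above, something along the following lines should work against Basecamp's JSON API. A minimal sketch only: the account ID, credentials and output layout are placeholders, and the endpoint paths are my assumptions from a quick read of the API documentation, so treat them as such.

```python
# Sketch: periodically dump Basecamp projects and to-do lists to local JSON.
# ACCOUNT_ID, credentials and paths are placeholders; endpoints are assumed
# from the Basecamp API docs and may differ for your account/API version.
import json
import requests  # third-party: pip install requests

ACCOUNT_ID = "999999"  # hypothetical account ID
BASE = f"https://basecamp.com/{ACCOUNT_ID}/api/v1"
AUTH = ("user@example.com", "password")          # HTTP Basic auth
HEADERS = {"User-Agent": "BackupScript (user@example.com)"}

projects = requests.get(f"{BASE}/projects.json", auth=AUTH, headers=HEADERS).json()
for project in projects:
    todolists = requests.get(f"{BASE}/projects/{project['id']}/todolists.json",
                             auth=AUTH, headers=HEADERS).json()
    with open(f"backup_project_{project['id']}.json", "w") as fh:
        json.dump({"project": project, "todolists": todolists}, fh, indent=2)
```

Run from a cron job, something like this would give the automated off-site copy that the service itself does not provide.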

Freedcamp
Project views with to-do lists, discussions, milestones and file attachments. Dashboard view with group activity. Marketplace with additional group and project widgets (e.g. group chat and wikis). Free account with a 20MB limit, and paid accounts starting at $2.5/month for 1GB up to $40/month for unlimited storage. Fairly cheap, but the design is below average and the site somewhat sluggish.

Evernote
This tool is centred on the idea of notebooks (collections of notes). Notes can contain text, embedded images, to-do lists and voice clips. It has a stand-alone program that facilitates copy-paste into the notebooks (Mac and Windows, but it works well under Wine). Notebooks from free accounts cannot be edited by others. Premium accounts (£35 per year) can have notebooks edited by others, so one premium account could be used to centralise group notebooks. Business accounts (£8.00/user/month) are needed for group management features. Limited tools for group interaction (no comments, chat or activity dashboard) compared with the others.

Redmine
Free but requires local installation. Fully fledged project management tool: activity, roadmap, issue tracker, Gantt charts, calendar, news, documents, wiki, forum, files. Recommended by several people on Twitter. I only had a quick look since I would prefer an online tool without set-up.

Trello
Card concept – each card can have activities (which could be text descriptions of project entries), to-do lists, files, due dates and attachments (including Google Drive and Dropbox), and can be assigned to specific people. Cards can be stacked in groups, moved around, and tagged with color codes, stickers and the individuals responsible for them. It looks nice but I don't like the design for project management. Android and iOS apps. 10MB attachment limit on standard accounts, 250MB with Gold (plus additional customization features) at $5/month or $45/year.

Teambox
Dashboard concept; users can be assigned to projects. The dashboard view has the list of tasks and notifications for the day. Projects can have activities, conversations, tasks, notes, files and members. Notes would be where the project/sub-project/task notes could be added; they have version history, can be shared with the public and can include embedded images. Additional group tools: calendar, Gantt chart, time tracking and video conferencing (by Zoom). iOS and Android apps available. Free accounts allow 5 users/5 projects; Pro accounts are $5 per user per month (20% discount for annual billing, 30% for two years) with unlimited projects, Dropbox integration, workload views, group chat and priority support.

Labguru
Project management with a specific focus on science labs. Very large number of features including: dashboard with activity feed, projects (organized into past/present/future milestones, notes with embedded and resizable images, attachments, PubMed integration, automatic report generation) and lab equipment/reagent inventory. Organizing science into milestones makes more sense than organizing it into tasks, as it better fits the spirit of research versus engineering. Android and iOS apps are meant to be used to follow protocols, take pictures, check storage, etc. Overkill for a computational group. Not very smooth, as every action results in a full webpage refresh. Expensive ($12 per person/month, yearly billing).

Projecturf
Dashboard view and project view. Projects have: overview, calendar, tasks, tickets, time tracking (could be useful for contract work or grant reporting), files, conversations and notes. Files can be integrated with Google Drive and Dropbox. Notes can have embedded images. Pricing starts at 5 projects/5GB for $20/month, up to unlimited projects/100GB for $200/month (1 month free for annual billing). Very much directed towards engineering/code-based projects.

Summary
My favourites at this point are Basecamp, Teambox and Evernote. Evernote is clearly lacking as a group tool but has a nice focus on notebooks (as in lab notebooks). Basecamp is more polished and intuitive than Teambox but is missing a proper "notebook" within each project and is somewhat expensive. Teambox is not as well designed as Basecamp but should work well, is cheaper and has integration with Google Drive.

Saturday, October 19, 2013

Scientific Data - ultimate salami slicing publishing

Last week a new NPG journal called Scientific Data started accepting submissions. Although I have discussed this new journal with colleagues a few times, I realized that I never argued here why I think this is a very strange idea for a journal. So what is Scientific Data ? In short, it is a journal that publishes metadata for a dataset along with data quality metrics. From the homepage:
Scientific Data is a new open-access, online-only publication for descriptions of scientifically valuable datasets. It introduces a new type of content called the Data Descriptor designed to make your data more discoverable, interpretable and reusable.
So what does that mean ? Is this a journal for large-scale data analysis ? For the description of methods ? Not exactly. Reading the guide to authors we can see that an article "should not contain tests of new scientific hypotheses, extensive analyses aimed at providing new scientific insights, or descriptions of fundamentally new scientific methods". So instead one assumes that this journal is some sort of database where articles are descriptors of the data content and data quality. The added value of the journal would be to store the data and provide fancy ways to allow for re-analysis. That is also not the case, since the data is meant to be "stored in one or more public, community recognized repositories". Importantly, these publications are not meant to replace and do not preclude future research articles that make use of these data. Here is an example of what these articles would look like. This example most likely represents what the journal hopes to receive as submissions, so let's see how this shapes up in a year when people try to test the limits of this novel publication type.

In summary, articles published by this journal are mere descriptions of data with data quality metrics. This is the same information that any publication should already have, except that Scientific Data articles are devoid of any insight or interpretation of the data. One argument in favor of this journal would be that this is a step towards micro-publication and micro-attribution in science. Once the dataset is published anyone, not just the producers of the data, can make use of this information. A more cynical view would be that NPG wants to squeeze as much money as they can from scientists (and funding agencies) by promoting salami-slicing publishing.

Why should we pay $1000 for a service that does not even handle data storage ? That money is much better spent supporting data infrastructures (disclaimer: I work at EMBL-EBI). There is no added value from this journal that is not, or cannot be, provided by data repository infrastructures. Yet this journal is probably going to be a reasonable success, since authors can essentially publish their research twice for an added $1000. In fact, anyone doing a large-scale data-driven project can these days publish something like 4 different papers: the metadata, the main research article, the database article and the stand-alone analysis tool that does 2% better than others. I am not opposed to a more granular approach to scientific publication but we should make sure we don't waste money in this process. Right now I don't see any incentives to limit this waste, nor any real progress in updating the way we filter and consume this more granular scientific content.


Monday, September 23, 2013

Single-cell genomics: taking noise into account

Figure: technical variation versus average read counts. Reprinted by permission from Macmillan Publishers Ltd: Nat Methods, advance online (doi:10.1038/nmeth.2645).
Sequencing throughput and amplification strategies have improved to a point where single-cell sequencing has become feasible. There was a recent review in Nat Rev Genet covering the progress in single-cell genomics and some of its potential applications that is worth a read. However, the required amplification steps are likely to introduce significant variation for small amounts of starting material. A group of investigators from EMBL-Heidelberg, EMBL-EBI and the Sanger had a look at this problem and developed an approach to quantify and account for such technical variability. The method is described in a paper that is now in press and makes use of spike-ins to estimate technical variation across a range of different mean expression strengths (see figure). As with most of these short communications, a lot of work is included in the supplementary materials, including a detailed R workflow description that should allow anyone to recreate the main figures from the paper.
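The gist of the method is to learn, from the spike-ins, how much variation is expected from the technical side alone at each expression level; genes whose variability exceeds that expectation are candidates for real biological variability. A minimal sketch of that idea (the published method fits a gamma GLM; I use a plain least-squares fit here, and the counts are simulated placeholders):

```python
# Sketch: fit technical noise (CV^2 as a function of mean expression) from
# spike-ins. Simulated stand-in data; the real analysis uses normalized
# read counts and a gamma GLM rather than this least-squares shortcut.
import numpy as np

rng = np.random.default_rng(0)
# hypothetical normalized counts: 50 spike-ins x 96 cells
spike_counts = rng.poisson(lam=np.geomspace(1, 1000, 50)[:, None],
                           size=(50, 96)).astype(float)

mean = spike_counts.mean(axis=1)
cv2 = spike_counts.var(axis=1) / mean**2   # squared coefficient of variation

# Model: CV^2 = a1/mean + alpha0 (shot-noise term plus a constant)
a1, alpha0 = np.polyfit(1.0 / mean, cv2, deg=1)

def technical_cv2(mu):
    """Expected technical CV^2 for a gene with mean expression mu."""
    return a1 / mu + alpha0

# Endogenous genes whose observed CV^2 sits well above technical_cv2(mean)
# are candidates for genuine cell-to-cell variability.
```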

This paper is a starting point for more things to come. It is focused on the method and there are clearly a lot of biological findings to be made from those data. More broadly, the Sanger and the EMBL-EBI have recently set up a joint single-cell genomics centre to acquire and develop the required technology. From the EBI side this is headed by Sarah Teichmann (also affiliated with the Sanger) and John Marioni. Unfortunately for my interests in post-translational regulation, single-cell proteomics is still lagging way behind. CyTOF comes closest but still requires antibodies for detection.



Tuesday, July 02, 2013

Interdisciplinary EMBL postdoc fellowship in genome evolution and chemical-biology

The EMBL Interdisciplinary Postdocs (EIPOD) program is now accepting applications (deadline 12 September). This program funds interdisciplinary research projects between different units of the EMBL. Applicants are encouraged to discuss self-defined project ideas with EMBL scientists or to select up to two of the project ideas available at the EIPOD website.

One of the project ideas listed this year is for a joint project between our group (EMBL-EBI) and the group of Nassos Typas at the EMBL Genome Biology Unit in Heidelberg. Here is a short description of the project idea, entitled "Modeling genotype-to-phenotype relationships in a bacterial cell":
Understanding how phenotypic variability originates from mutations at the level of the DNA is one of the fundamental problems in biology. Sequencing of genomes for multiple individuals along with rich phenotypic profiling data allows us to pose the question of how the sum of mutations in each individual genome results in the observed phenotypic differences. The goal of this project is to develop computational methods to predict the consequences of mutations and gene-content variation on fitness in different conditions for different strains of E. coli.
The Typas group develops high-throughput approaches to study gene function via chemical-genetics and genetic-interaction screening. Previous publications and current research interests are listed on the group webpage. Our group is broadly interested in studying the evolution of cellular interaction networks and, in this context, in understanding how mutations and gene-content variation result in phenotypic consequences for different individuals.

Potential applicants are encouraged to get in touch to discuss a project proposal that relates to this topic. We are particularly keen on applicants with previous experience in any of the following: cheminformatics, chemical biology, protein and genome evolution, sequence/structure-based prediction of the effects of mutations, or bacterial pan-genome studies.

Saturday, June 08, 2013

Doing away with scientific journals

I got into a bit of an argument with Björn Brembs on Twitter last week because of a statement I made in support of professional editors. I was mostly saying that professional editors are no worse than academic editors, but our discussion drifted into the general usefulness of scientific journals. Björn was arguing his position that journal rankings, in the form of the well known impact factor, are absolutely useless. I was trying to argue that (unfortunately) we still need journals to act as filters. Having a discussion on Twitter is painful, so I am giving my arguments some space in this blog post.

Björn's arguments are based on this recently published review regarding the value of journal rankings (see the paper and his blog post). The one-line summary would be:
"Journal rank (as measured by impact factor, IF) is so weakly correlated with the available metrics for utility/quality/impact that it is practically useless as an evaluation signal (even if some of these measures become statistically significant)."
I covered some of my arguments regarding the need for journals as filters before, here and here. In essence, I think we need some way to filter the continuous stream of scientific literature, and the *only* filter we currently have available is the journal system. So let's break this argument into parts. Is it true that: we need filters; journals are working as filters; there are no working alternatives ?

We need filters

I hope that few people will try to argue that we have no need for filters in scientific publishing. On PubMed there are 87551 abstract entries for May, which is getting close to 2 papers per minute. It is easy to see that the rate of publishing is not going down any time soon. All current incentives on the author and publishing side will keep pushing this rate up. One single unfiltered feed of papers would not work, and it is clear we need some way to sort out what to read. The most immediate way to sort would be by topic. Assuming authors would play nice and not try to tag their papers as broadly as possible (yeah right), this would still not solve our problem. For the topics that are very close to what I work on I already have feeds with fairly broad PubMed queries that I go through myself. For topics that might be one or several steps removed from my area of work I still want to be updated on method developments and discoveries that could have an impact on what I am doing. I already spend an average of 1 to 2 hours a day scanning abstracts; I don't want to increase that.
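As an aside, this sort of topic feed is straightforward to build on top of NCBI's E-utilities; a minimal sketch, where the query string is just an example:

```python
# Sketch: fetch the PMIDs of the last week's papers matching a broad query,
# using NCBI's E-utilities (the query itself is a placeholder).
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",
    "term": "phosphorylation AND evolution",  # example query
    "reldate": 7,            # entries from the last 7 days
    "datetype": "edat",      # by Entrez date
    "retmax": 200,
    "retmode": "json",
}
result = requests.get(ESEARCH, params=params).json()
pmids = result["esearchresult"]["idlist"]
print(f"{len(pmids)} new abstracts to scan this week")
```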

Journals as filters

If you follow me this far then you might agree that we need filtering processes that go beyond simple topic tagging. Without even considering journal "ranking", journals already do more than topic tagging, since journals are also communities that form around areas of research. To give a concrete example, both Bioinformatics and PLOS Computational Biology publish bioinformatics papers, but while the first tends to publish more methods papers, the latter tends to publish more biological discoveries. Subjectively I tend to prefer the papers published in the PLOS journal due to its community, and that has nothing to do with perceived impact.

What about impact factors and journal ranking ? In reviewing the literature Björn concludes that there is almost no significant association between impact factors and future citations. This does not agree with my own subjective evaluation of the different journals I pay attention to. To give an example, the average paper in journals of the BMC series is not the same to me as the average paper published in Nature journals. Are there many of you that have a different opinion ? Obviously, this could just mean that my subjective perception is biased and incorrect. It would also mean that journal editors are doing a horrible job and that the time they spend evaluating papers is useless. I have worked as an editor for a few months and I can tell you that it is hard work; it is not easy to imagine that it is all useless. In his review Björn points to, for example, the work by Lozano and colleagues. In that work the authors correlated the impact factor of the journal with the future citations of each paper published in a given year. For biomedical journals the coefficient of determination has been around 0.25 since around 1970. Although the correlation between impact factor and future citations is not high (r ~ 0.5), it is certainly highly significant given that they looked at such large numbers (25,569,603 articles for biomed). Still, this also tells us that evaluating the impact/merit of an individual publication by the journal it is published in is prone to error. However, what I want to know is: given that I have to select what to read, do I improve my chances of finding potentially interesting papers by restricting my attention to subsets of papers based on the impact factor ?

I tried to get my hands on the data used by Lozano and colleagues but unfortunately they could not give me the dataset they used. Over email, Lozano said I would have to pay Thomson Reuters on the order of $250,000 for access (not so much reproducible research). I wanted to test the enrichment over random of highly versus lowly cited papers in relation to impact factors. After a few other emails Lozano pointed me to this other paper where they calculated this enrichment for a few journals in their Figure 4, which I am reproducing here under a Don't Sue Me licence. For these journals they calculated the fraction of each journal's papers that are among the top 1% most cited, divided by the fraction of top 1% cited papers across all papers. This gives you an enrichment over the random expectation, which for journals like Science/Cell/Nature turns out to be around 40 to 50. So there you go: high impact factor journals, on average, tend to be enriched in papers that will be highly cited in the future.
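To make the calculation explicit, here it is worked through in code; the numbers are invented for illustration:

```python
# Worked version of the Figure 4 calculation: the fraction of a journal's
# papers that land in the global top 1% most cited, divided by the 1%
# expected at random. Inputs below are invented placeholders.
def top1pct_enrichment(journal_citations, global_top1pct_cutoff):
    """Enrichment over random of top-1% papers in one journal."""
    in_top = sum(c >= global_top1pct_cutoff for c in journal_citations)
    return (in_top / len(journal_citations)) / 0.01

# e.g. if 40 of a journal's 100 papers clear the global top-1% citation
# cutoff, the enrichment is (40/100)/0.01 = 40 -- the range reported for
# journals like Science/Cell/Nature.
example = [120, 80, 15, 300, 9] * 20     # fake citation counts, 100 papers
print(top1pct_enrichment(example, global_top1pct_cutoff=100))  # -> 40.0
```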

As an author I hate being evaluated by the journals I publish in instead of the actual merit of my work. As a reader I admit to my limitations and I need some way to direct my attention to subsets of articles. Both the data and my own subjective evaluation tell me that journal impact factors can be used as a way to enrich for potentially interesting articles.

But there are better ways ... 

Absolutely ! The current publishing system is a waste of everyone's time as we try to submit papers down a ladder of perceived impact. Papers get reviewed multiple times in different journals, reviewers think that articles need to be improved with year-long experiments, and discoveries stay hidden in this reviewing limbo for too long. We can do better than this, but I would argue that the best way to do away with the current journal system is to replace it with something else. Instead of just shouting for the destruction of journal hierarchies and the death of the impact factor, talk about how you are replacing them. I try out every filtering approach I can find and I will pay for anything that works well and saves me time. Google Scholar has a reasonably good recommendation system and it is great to see people developing applications like the Recently app. PLOS is doing a great job of promoting the use of article-level metrics that might help others to build recommendation systems. There is work to do, but the information and technology for building such recommendation systems is all out there already. I might even start using some of my research budget to work on this problem just out of frustration. I have some ideas on how I would go about this but this blog post is already long. If anyone wants to chat about this drop me a line. At the very least we can all start using preprint servers and put our work out before we bury it for a year in the publishing limbo.
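For what it's worth, even a crude recommender of this sort is a short script these days. A sketch of the simplest content-based version, ranking new abstracts by similarity to ones I have already flagged as interesting (all the text here is placeholder):

```python
# Sketch: rank incoming abstracts by TF-IDF cosine similarity to a set of
# previously "liked" abstracts. Placeholder text; a real system would use
# full abstracts from a feed plus citation/usage signals.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

liked = ["evolution of protein phosphorylation networks in yeast",
         "benchmarking kinase-substrate interaction predictions"]
incoming = ["a new method for single-cell transcriptome amplification",
            "divergence of post-translational modifications across fungi"]

matrix = TfidfVectorizer().fit_transform(liked + incoming)
scores = cosine_similarity(matrix[len(liked):], matrix[:len(liked)]).max(axis=1)
for abstract, score in sorted(zip(incoming, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {abstract}")
```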



Monday, May 13, 2013

EBI-Sanger postdoctoral fellowship on Plasmodium kinase regulatory networks

I am happy to announce a call for applications for an EBI-Sanger postdoctoral fellowship to study kinase regulatory networks in Plasmodium. This is one of four currently open calls in the EBI-Sanger Postdoctoral (ESPOD) Programme and the call closes on the 26th of July. This interdisciplinary programme is meant to foster collaborations between the EBI and the Wellcome Trust Sanger Institute, both at the Genome Campus near Cambridge, UK. Our project is a collaboration between myself (EBI), Jyoti Choudhary (mass-spectrometry group leader at the Sanger) and Oliver Billker (group leader at the Sanger studying malaria parasites). The postdoctoral fellow will have the opportunity to work at the interface between bioinformatics, mass-spectrometry (MS) and Plasmodium biology. A description of the project can be found online (PDF), but briefly, the objective is to characterize the kinase regulatory network of the malaria parasite by combining quantitative phosphoproteomics with computational analysis. There will be a strong emphasis on the computational analysis of the MS data, and some prior computational experience is a plus. The ideal candidate would have prior experience in phosphoproteomics with a strong interest in learning the required computational aspects, or prior experience in the relevant computational skills and an interest in learning/performing some of the experimental work. Feel free to contact me if you require more information about the project or the ESPOD fellowship.

Sunday, April 07, 2013

The case for article submission fees

For scientific journal articles the cost of publishing is almost exclusively covered by the articles that are accepted for publication, either by the published authors or by the libraries. Advertisement and other items, like the organization of conferences, are probably not a very significant source of income. I don't want to argue here again the value of publishers and how we should be decoupling the costs of publishing (close to zero) from peer-review, accreditation and filtering. Instead I just want to explore an obvious form of income that is not used: submission fees. Why don't journals charge all potential authors a fixed cost per submission, even if the article ends up being rejected ? I am sure publishers have considered this option and reached the conclusion that it is not viable. I would like to know why, and maybe someone reading this can give a strong argument against, hopefully someone from the publishing side that has crunched the numbers.

The strongest reason against that I can imagine would be a reduction in submission rates. If only some publishers adopt this fee, authors will send their papers to journals that don't charge for submission. Would the impact be that significant ? For journals with high rejection rates this might even be useful, since it would preferentially deter authors that are less confident about the value of their work. For journals with lower rejection rates the impact of the fee would be small, since authors are less concerned with a rejection. Publishers might even benefit from implementing a submission charge in the form of a lock-in effect if they do not charge when transferring articles between their journals. Publishers already use this practice of transferring articles and peer-review comments between their journals. It already functions as a form of lock-in since authors, wishing to avoid another lengthy round of peer-review, will tend to accept. If the submission fee is only charged once, the authors are even more likely to keep their articles within the publisher. Given the current trend of publishers trying to own the full stack of high-to-low rejection rate journals, these lock-in effects are going to be increasingly valuable.
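For a sense of how the numbers could play out, here is a back-of-the-envelope sketch; every figure below is invented for illustration:

```python
# Toy comparison of APC-only pricing versus APC plus a submission fee,
# for a hypothetical journal. All numbers are made up for illustration.
submissions = 1000
acceptance_rate = 0.30
cost_per_submission = 100    # peer-review handling cost (hypothetical)
cost_per_published = 500     # production/hosting cost (hypothetical)
submission_fee = 50

accepted = submissions * acceptance_rate
total_cost = submissions * cost_per_submission + accepted * cost_per_published

apc_only = total_cost / accepted
apc_with_fee = (total_cost - submissions * submission_fee) / accepted

print(f"APC with no submission fee:    ${apc_only:.0f}")      # $833
print(f"APC with a $50 submission fee: ${apc_with_fee:.0f}")  # $667
# The fee shifts part of the review cost onto rejected manuscripts, and
# the effect grows with the journal's rejection rate.
```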

The overall benefit would be an increased viability of open access. A submission fee might also accelerate the decoupling of peer-review from the act of publishing: if we get used to paying separately for publishing and for submission/evaluation, we might get used to having these activities performed by different entities. Finally, if it also results in less slicing of work into ever smaller publishable units, we might all benefit.


Update: Anna Sharman sent me a link to one of her blog posts where she covers this topic in much more detail.

Photo adapted from: http://www.flickr.com/photos/drh/2188723772/

Tuesday, April 02, 2013

Benchmark the experimental data not just the integration

There was a paper out today in Molecular Systems Biology with a resource of kinase-substrate interactions obtained from in-vitro kinase assays using protein microarrays. It is clear that there is a significant difference between what a kinase regulates inside a cell and what it could phosphorylate in-vitro given appropriate conditions. In fact, reviewer number 1 in the attached comments (PDF) explains at length why these protein-array based kinase interactions may be problematic. The authors are aware of this and integrate the protein-array data with additional data sources to derive a higher-confidence dataset of kinase interactions. The authors then provide computational and experimental benchmarks of the integrated dataset. What I have an issue with is that the original protein-array data itself is not clearly benchmarked in the paper. How are we to know what that feature, and all of the hard experimental work behind it, contributes to the final integrated predictor ?

A very similar procedure was used in a recent Cell paper where co-complex membership was predicted based on the elution profiles of proteins detected by mass-spectrometry. Here again, the authors do not present benchmarks of the interactions predicted solely from the co-elution data. Instead they integrate it with around 15 other features before evaluating and studying the final result. In this case, the supplementary material gives some indirect indication of the value of the experimental data on its own, by providing the rank each feature has in the predictor.

I don't think the papers are incorrect. In both cases the authors provide an interesting final result with the integrated set of interactions benchmarked and analysed. However, in both cases, we are unsure of the value of the experimental data that is presented. I don't think it is an unreasonable request. There are many reasons why this information should be clearly presented before additional data integration steps are used. At the very least this is important for other groups thinking about setting up similar experimental approaches.
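To be concrete about what the request entails: once a gold-standard set is in hand, reporting the experimental feature alone next to the integrated predictor is little extra work. A sketch with synthetic placeholder data:

```python
# Sketch: benchmark the raw experimental score by itself, then the
# integrated predictor, on the same gold standard. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 2000
labels = rng.integers(0, 2, n)                        # gold-standard labels
array_score = labels + rng.normal(0, 1.5, n)          # noisy experimental feature
other_feats = labels[:, None] + rng.normal(0, 1.0, (n, 5))  # extra evidence

print("experimental data alone: AUC =",
      round(roc_auc_score(labels, array_score), 2))

X = np.column_stack([array_score, other_feats])
probs = LogisticRegression().fit(X, labels).predict_proba(X)[:, 1]
print("integrated predictor:    AUC =", round(roc_auc_score(labels, probs), 2))
# (For brevity the model is evaluated on its training data here; a real
# benchmark should use cross-validation or a held-out gold-standard set.)
```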



Thursday, March 28, 2013

The glacial pace of innovation in scientific publishing


Nature made available today a collection of articles about the future of publishing. One of these is a comment by Jason Priem on "Scholarship: Beyond the paper". It is beautifully written and inspirational. It is clear that Jason has a finger on the pulse of the scientific publishing world and is passionate about it. He sees a future of a "decoupled" journal, where modular distributed data streams can be built into stories openly and in real time. Where certification and filtering are not tied to the act of publishing and can happen on the fly by aggregating social peer review. While I was reading I could not contain a sigh of frustration. This is a future that several of us like Neil and Greg debated at Nodalpoint many years ago. Almost 7 years ago I wrote in a blog post:

"The data streams would be, as the name suggests, a public view of the data being produced by a group or individual researcher.(...) The manuscripts could be built in wikis by selection of relevant data bits from the streams that fit together to answer an interesting question. This is where I propose that the competition would come in. Only those relevant bits of data that better answer the question would be used. The authors of the manuscript would be all those that contributed data bits or in some other way contributed for the manuscript creation. (...) The rest of the process could go on in public view. Versions of the manuscript deemed stable could be deposited in a pre-print server and comments and peer review would commence."

I hope Jason won't look back some 10 years from now and feel the same sort of frustration I feel now with how little scientific publishing has changed. So what happened in the past 7 years ? Not much really. Nature had an open peer review trial with no success. Publishers were slow to allow comments on their websites and we have been even slower at making use of them. Euan had a fantastic science blog/news aggregator (Postgenomic) but it did not survive long after he went to Nature. Genome Biology and Nature both tried to create pre-print servers for biomed authors but ended up closing them for lack of users. We had a good run at an online discussion forum with FriendFeed (thank you Deepak) before Facebook took the steam out of that platform. For most publishers, we can't even know the total number of times an article we wrote has been seen, something that blog authors have taken for granted for many years. Even in the cases where progress has been made, it has taken (or is taking) way too long. The most obvious example is the unique author ID, where after many (oh so many) years there is a viable solution in sight. All that said, some progress was made in the past few years. Well, mainly two things - PLOS One and Twitter.

Money makes the world go round

PLOS One had a surprising and successful impact on the science publishing world. Its initial stated mission was to change the way peer review was conducted: the importance of a contribution would be judged by how readers rated or commented on the article. Only it turns out that few people take the time to rate or comment on papers. Nevertheless, thanks to some great management, first by Chris Surridge and then by Peter Binfield, PLOS One was a huge hit as a novel, fast, open access (at a fair price) journal. PLOS One's catch-all approach saw a steady increase in the number of articles published (and very healthy profits) and got the attention of all other publishers.

If open access is suitable as a business model then funding sources might feel that it is OK to mandate immediate open access. If that were to happen then only publishers with a structure similar to PLOS would survive. So, to make a profit and to hedge against a mandate for open access, all other publishers are creating (or buying) a PLOS One clone. This change is happening at an amazing pace. This is great for open access and it goes in the direction of a more streamlined and modular system of publishing. It is not so great for filtering and discoverability. I have said in the past that PLOS One should stop focusing on growth and go back to its initial focus on filtering and the related problem of credit attribution. To their credit, they are one of the few very actively advocating for the development of these tools. Jason, Heather, Euan and others are doing a great job of developing tools that report these metrics.

1% of the news scrolling by 

Of the different tools that scientists could have picked up to be more social, Twitter was the last one I would expect to see taking off. 140 characters ?! Seriously ? How geeky is that ? No threaded discussions, no groups, some weird hashtagsomethings. In what world is this picked up by established tenured university professors that don't have time to leave a formal comment on a journal website ? I have no clue how it happened but it did. Maybe it was the simple interface with a single use case; the asymmetric (i.e. flattering) network structure; the fact that updates don't accumulate like email. Whatever the reason, scientists are flocking to Twitter to share articles, discuss academia and science (within the 140 chars) and rebel against the Established System. It is not just the young naive students and disgruntled postdocs. Established group leaders are picking up this social media megaphone. Some of them are attracting audiences that might rival some journals, so this alone might make them care less about that official seal of approval from a "high-impact" journal.

The future of publishing ? 

So after several years of debates about what the web can do for science we have: 1) a growing trend for "bulk" publishing with no solid metrics in place to help us filter and provide credit to authors; and 2) a discussion forum (Twitter) that is clunky for actual discussions but is at least being picked up by a large fraction of scientists. Where are we going from here ? I still think that a more open and modular scientific process would be more productive and enjoyable (less scooping). I am just not convinced that scientists in general even care about these things. For my part I am going to continue sharing ideas on this blog and, now that I coordinate a research group, start posting articles to arXiv. I hope that Jason is right and we will all start to take better advantage of the web for science.


Clock image adapted from tinyurl.com/cmy9fn5

Wednesday, December 26, 2012

My Californian Life

Warning: No science ahead
I am in Portugal for the holidays, having just left San Francisco. It is a part of academic life that we have to keep moving around with each new job, and after Germany and the US (California) I am moving to the UK in a few days. It's not easy to keep rebuilding your roots in new places but it is certainly rewarding to experience new cultures. It has been great to spend almost 5 years in California and it was very (!) hard to leave. I decided to try to write down a few thoughts about life in the golden state. Maybe it will be useful for others considering moving there. I apologize in advance for the generalizations.


Geek heaven
It is impossible to live in Silicon Valley (I was in Menlo Park) without noticing the geekiness of the place. Just a few random examples: the Caltrain conductors frequently make geeky jokes ("warp speed to San Francisco"); the billboards on the 101 highway between San Francisco and San Jose often advertise products that only programmers would care about (e.g. types of databases); every time I went for a hot chocolate at the Coupa Cafe there was someone demoing or pitching an idea for a website or app. For someone that likes technology it is a great place to be. It is thrilling to find out that so many of the tech companies you read about are just around the corner. It is also very likely that the people you meet know about the latest tech trends, if they are not, in fact, actually developing them. Unfortunately, every nice thing has its downsides, and there is so much money around from tech salaries and companies that everything is horribly expensive if you don't work in the tech sector yourself.

The "can do" attitude and personal responsibility

It is nearly impossible to pin down what makes Silicon Valley such a breeding ground for successful companies, but one thing that impressed me was the winning attitude. It is more generally an American trait and not just found in California. People often believe their ideas can succeed to a point that borders on naivety. To paint a (somewhat) exaggerated picture: it is not enough to be good, you should strive to be number one. There are many positive and negative consequences of this attitude that are not easy to unwrap. There are obvious advantages that come with all that drive and positive thinking. This connects also to the notion of personal responsibility - it should be up to each one of us to make our own success. As a negative consequence, failure is then also our individual responsibility, even when in reality it isn't.


I don't want to go into politics but I will say that I have learned a lot also about the role of government. It was interesting to live in a place that emphasized personal responsibility so much more than Portugal. I think it served well to calibrate my own expectations of what the state should and should not be responsible for. As for many other things, I wish more people could have the experience of living in different countries and cultures.

Fitness freaks surrounded by amazing nature and obsessed with organic local food
Before going to the US I had many friends making jokes about how much weight I would gain, echoing the stereotypical view of overweight America. In fact, any generalization of S.F. or Silicon Valley would have to go in the opposite direction. I had friends waking up at crazy hours to exercise and I even learned to enjoy running (yikes !). It helps that California is sunny and filled with beautiful nature, like the state parks and coastline (do the California Route 1). Also, the food is great, although there is a slightly exaggerated obsession with locally grown organic food. The constant sunshine and great food are probably the two things I will miss most when I get to the UK. I might have to buy a sun lamp. The interest in the outdoors was not that different from Germany, but it is something I wish was more prevalent in Portugal. Portugal has such nice weather and outdoors that it is a waste that we don't take better advantage of them.

A thank-you note
Californians are amazingly friendly people. It is true that sometimes it feels superficial. In restaurants it can even be annoying when a waiter comes for the tenth time to ask if you are really enjoying your meal. Still, it was great to live there with the easy smiles and unexpected chit-chat or compliments. It was easy to feel at home and I never felt like a foreigner. As I have learned from these years of living in California, one should always send a polite thank-you note after an event. So thank you California for these wonderful years. It would be most appropriate to say that it was "awesome". 


Tuesday, November 06, 2012

Scholarly metrics with a heart


I attended the PLOS workshop on Article Level Metrics (ALM) last week. As a disclaimer, I am part of the PLOS ALM advisory Technical Working Group (not sure why :). Alternative article-level metrics refer to any set of indicators that might be used to judge the value of a scientific work (or researcher, or institution, etc). As a simple example, an article that is read more than average might correlate with scientific interest or popularity of the work. There are many interesting questions around ALMs, starting even with the simplest - do we need any metrics ? The only clear observation is that more of the scientific process is captured online and measured, so we should at least explore the uses of this information.

Do we need metrics ? What are ALMs good for

Like any researcher, I dislike the fact that I am often evaluated by the impact factor (IF) of the journals I publish in. When a position has hundreds of applicants it is not practical to read each candidate's research and carefully evaluate it. As a shortcut, the evaluators (wrongly) estimate the quality of a researcher's work by the IFs of the journals. I won't discuss the merit of this practice, since even the journal Nature has spoken out against the value of IFs. So one of the driving forces behind the development of ALMs is this frustration with the current metrics of evaluation. If we cannot have a careful peer evaluation of our work then the hope is that we can at least have better metrics that reflect the value/interest/quality of our work. This is really an open research question and, as part of the ALM meeting, PLOS announced a PLOS ONE collection of research articles on ALMs. The collection includes a very useful introduction to ALMs by Jason Priem, Paul Groth and Dario Taraborelli.

Beyond the need for evaluation metrics, ALMs should also be more broadly useful for developing filtering tools. A few years ago I noticed that articles that were being bookmarked or mentioned in blog posts had an above-average number of citations. This has now been studied in much more detail. Even if you are not persuaded by the value of quantitative metrics (number of mentions, PDF downloads, etc) you might be interested instead in referrals from trustworthy sources. ALMs might be useful for tracking the identity of those reading, downloading or bookmarking an article. There are several researchers I follow on social media sites because they mention articles that I consistently find interesting. In relation to identity, I also learned in the meeting that the ORCID author ID initiative finally has a (somewhat buggy) website that you can use to claim an ID. Also, ALMs might be useful for filtering if they can be used, along with natural language processing methods, to improve the automatic classification of an article's topic. This last point, on the importance of categorization, was brought up in the meeting by Jevin West, who had some very interesting ideas on the topic (e.g. clustering, automatic semantic labeling, tracking ideas over time). If the trend for the growth of mega-journals (PLOS ONE, Scientific Reports, etc) continues, we will need these filtering tools to find the content that matters to us.

Where are we now with ALMs ? 

In order to work with different metrics of impact we need to be able to measure them, and they need to be made available. From the publishers' side, PLOS has led the way in making several metrics available through an API, and there is some hope that other publishers will follow PLOS. Nature, for example, has recently made public a few of the same metrics for 20 of their journals although, as far as I know, they cannot be automatically queried. The availability of this information has allowed for research on the topic (see the PLOS ONE collection) and even the creation of several companies/non-profits that develop ALM products (Altmetric, ImpactStory, Plum Analytics, among others). Other established players have also been in the news recently. For example, the reference management tool Mendeley recently announced that they have reached 2 million users, whose actions can be tracked via their API, and Springer announced the acquisition of Mekentosj, the company behind the reference manager Papers. The interest surrounding ALMs is clearly on the rise as publishers, companies and funders try their best to gauge the usefulness of these metrics and position themselves to have an advantage in using them.
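As an illustration of what the PLOS API makes possible, here is a sketch of pulling the metrics for one article. I am writing the endpoint and response fields from memory of the v3 API, so treat them as assumptions; the API key is a placeholder:

```python
# Sketch: query the PLOS ALM API for one article's usage metrics.
# Endpoint, parameters and response fields are assumptions based on the
# v3 API as I recall it; "YOUR_KEY" is a placeholder.
import requests

resp = requests.get("http://alm.plos.org/api/v3/articles",
                    params={"api_key": "YOUR_KEY",
                            "ids": "10.1371/journal.pone.0036240"})
for article in resp.json():
    print(article.get("doi"), article.get("views"), article.get("citations"))
```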

The main topics at the PLOS meeting

It was in this context that we got together in San Francisco last week. I enjoyed the meeting format, with a mix of loose topics but strict requirements for deliverables. It was worth attending even just for that and the people I met. After some introductions we got together in groups and quickly jotted down in post-its the sort of questions/problems we thought were worth discussing. The post-its were clustered on the walls by commonality and a set of broad problem sets was defined (see the list here).

Problems for discussion included:

  • how do we increase awareness for ALMs ?
  • how to prevent the gaming (i.e. cheating to increase the metrics of my papers) ?
  • what can be and is worth measuring ?
  • how to exchange metrics across providers/users (standards) ?
  • how to give context/meaning/story to the metrics ?

We were then divided into parallel sessions where we further distilled these problems into more specific action lists and very concrete steps that can be taken right now.

Metrics with a heart

From my own subjective view of the meeting, it felt like we spent a considerable amount of time discussing how to give more meaning to the metrics. I think it was Ian Mulvany who wrote on the board in one of the sessions: "What does 14 mean ?". The idea of context came up several times and from different viewpoints. We have some understanding of what a citation means, and from our own experience we can make some sense of what 10 or 100 citations mean (for different fields, etc). We lack a similar sense for any other metric. As far as I know, ImpactStory is the only one trying to give context to the metrics shown, by comparing the metrics of your papers with those of random sets of papers from the same year. Much more can be done along these same lines. We arrived at a similar discussion from the point of view of how we present ourselves as researchers to the rest of the world. Ethan Perlstein talked about how engaging his audience through social media and giving feedback on how his papers were being read and mentioned by others was enough to tell a story that increased interest in his work. The context and story (e.g. who is talking about my work) are more important than the number of views. We reached the same sort of discussion again when we talked about tracking and using the semantic meaning or identity/source of the metrics. For most use cases of ALMs we can think of, we would benefit from, or downright need, more context, and this is likely to drive the next developments and research in this area.
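The simplest version of that context is a percentile against a same-year reference set, which is trivial to compute once the data are available; a sketch with invented numbers:

```python
# Sketch: answer "what does 14 mean ?" by placing a raw count in the
# distribution of the same metric for same-year papers. Invented data.
def percentile_rank(value, reference):
    """Percentage of reference papers with a metric below `value`."""
    below = sum(r < value for r in reference)
    return 100.0 * below / len(reference)

same_year_downloads = [3, 8, 9, 12, 25, 40, 7, 5, 120, 11]  # invented sample
print(f"14 downloads = {percentile_rank(14, same_year_downloads):.0f}th percentile")
```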

The devil we don't know

Heather Piwowar asked me at some point if I had any reservations about ALMs. In particular, from the point of view of evaluation (and to a lesser extent filtering), it might turn out that we are substituting a poor evaluation metric (journal impact factor) with an equally poor evaluation criterion - our capacity to project influence online. In this context it is interesting to follow some experiments that are being done in scientific crowdfunding. Ethan Perlstein has one running right now with a very catchy title: "Crowdfund my meth lab, yo". Success in crowdfunding should depend mostly on the capacity to project your influence or "brand" online - an exercise in personal marketing. Crowdfunding is an extreme scenario where researchers are trying to side-step the grant system and get funding directly from the wider public. However, I fear that evaluation by ALMs will tend to reward exactly the sort of skills that relate to online branding. Not to say that personal marketing is not important already - this is why researchers network in conferences and get to know editors - but ALMs might reward personal (online) branding to an even higher degree.

Thursday, July 19, 2012

I am starting a group at the EMBL-EBI


I signed the contract this week to start a research group at the European Bioinformatics Institute (EBI) in Cambridge, an outstation of the European Molecular Biology Laboratory (EMBL). After blogging my way through a PhD and postdoc, it is a pleasure to be able to write this blog post. In January I will be joining an amazing group of people working in bioinformatics services and basic research, where I plan to continue studying the evolution of cellular interaction networks. I am currently interested in the broad problem of understanding how genetic variability gets propagated through molecules and interaction networks to have phenotypic consequences. The two main research lines of the group will continue previous work on the evolution of protein interactions and post-translational modifications (see post) and the evolution of chemical genetics/personal genomics (see post 1 and post 2). I will continue to use this blog to discuss research ideas and ongoing work, and as always the content here reflects my own personal views and not those of my current/future employers.

I also take the opportunity to mention that I am looking for a postdoc to join the group in January to work on one of the two lines described above. If you know anyone that might be crazy/adventurous/interested, please send them a link to this post. Past experience (i.e. published work) in computational analysis of cellular interaction networks is required (e.g. post-translational modifications, mass-spectrometry, linear motif based interactions, structural analysis of interaction networks, genetic-interactions, chemical-genetics, drug-drug interactions, comparative genomics, etc). The work will be done in collaboration with experimental groups at EMBL-Heidelberg and the Sanger. Pending approval from EMBL, a formal application announcement will appear on the EMBL jobs page.

I also wanted to share a bit of my experience of trying to get a job after the postdoc “training” period. I have ranted in the past sufficiently about the many problems of the academic track system, but the current statistics are informative enough: about 15% of biology-related PhDs get an academic position within 5 years. The competition is intense, and in the past year and a half I applied to 15 to 20 positions before taking the EBI job. On a positive note, I had the impression that better established places actually cared less about impact factors. Nevertheless, it has been a super stressful period and even so I know I have been very lucky. I don’t mean this in any superstitious way but really in how statistically unlikely it is to have supportive supervisors and enough positive research outcomes (i.e. impactful publications) to land a job. I think we are training too many PhDs (or not taking advantage of the talent pool) and the incentives are not really changing. For my part I will try my best to contribute to changing the incentives behind this trend. We could have less funding for PhDs in bio-related areas, smaller groups and more careers within the lab besides the group leader. At the very least, PhDs should be aware of and train for alternative science-related paths.

Evolution and Function of Post-translational Modifications

A significant portion of my postdoctoral work is finally out in the latest issue of Cell (link to paper). In this study we have tried to assign function to post-translational modifications (PTMs) derived from mass-spectrometry (MS). This follows directly from previous work where we looked at the evolution of phosphorylation in three fungal species (paper, blog post). We (and other groups) have seen that phosphorylation sites diverge rapidly, but we don't really know if this divergence of phosphosites results in meaningful functional consequences. In order to address this we need to know the function of post-translational modifications (if they have any). Since these MS studies now routinely report several thousand PTMs per analysis, we have a severe bottleneck in the functional analysis of PTMs. These issues are the motivation for this last work. We collected previously published PTMs (close to 200,000) and obtained some novel ubiquitylation sites for S. cerevisiae (in collaboration with Judit Villen's lab). We revisited the evolutionary analysis and we set up a couple of methods to prioritize those modifications that we think are more likely to be functionally important.
As an example, we have tried to assign function to PTMs by annotating those that likely occur at interface residues. One approach that turned out to be useful was to look for conservation of the modification sites within PFAM domain families. For example, in the figure above, under "Regulation of domain activity", I am depicting a kinase domain. Over 50% of the phosphorylation sites that we find in the kinase domain family occur in the well known activation loop (arrow), suggesting that this is an important regulatory region. We already knew that the activation loop is an important regulatory region, but we think that this conservation approach will be useful to study the regulation of many other domains. In the article we give other examples and an experimental validation using the HSP70 domain family (in collaboration with the Frydman lab).
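The domain-level analysis boils down to mapping each site onto its column in the family alignment and looking for columns that accumulate sites across proteins. A sketch of the counting step, with invented inputs:

```python
# Sketch: find alignment columns of a domain family that concentrate
# phosphosites across member proteins. Proteins and columns are invented.
from collections import Counter

# (protein, alignment_column) for each phosphosite mapped into the family
phosphosites = [("KIN1", 168), ("KIN2", 168), ("KIN3", 168),
                ("KIN1", 42), ("KIN4", 168), ("KIN2", 97)]

column_counts = Counter(col for _, col in phosphosites)
total = len(phosphosites)
for col, count in column_counts.most_common(3):
    print(f"column {col}: {count}/{total} sites ({100 * count / total:.0f}%)")
# Columns holding a large fraction of the family's sites (the hypothetical
# column 168 here, analogous to the activation loop) are candidate
# regulatory regions; a proper analysis would test the enrichment against
# the residue composition of each column.
```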

I won't describe in detail the work as you can (hopefully) read the paper. Leave a comment or send me an email if you can't and/or if you have any questions regarding the paper or analysis. I also put up the predictions in a database (PTMfunc) for those who want to look at specific proteins. It is still very alpha, I apologize for the bugs and I will try to improve it as quickly as possible. If you want access to the underlying data just ask and I'll send the files. I am also very keen on collaborations with anyone collecting MS data or interested in the post-translational regulation of specific proteins, complexes or domain families.

Blogging and open science
Having a blog means I can also give you some of the thoughts that don't fit in a paper or press release. You can stop reading if you only came for the sciency bits. One of the cool things I realized was that I have discussed in this blog three papers in the same research line, running through my PhD and postdoc. It is fun to be able to go back not just to the papers but to the way I was thinking about these ideas at the time. Unfortunately, although I try to use this blog to promote open science, this project was yet another failed open science project. Failed in the sense that it started with a blog post and a lot of ambition but never gained any momentum as an online collaboration. Eventually I stopped trying to push it online, and as experimental collaborators joined the project I gave up on the open science side of it. I guess I will keep trying whenever it makes sense. This post closes project 1 (P1), but if you are interested in online collaborations have a look at project 2 (P2).

Publishable units and postdoc blues
This work took most of my attention during the past two years and it is probably the longest project I have worked on. Two years is not particularly long but it has certainly made me think about what is an acceptable publishable unit. As I described in the last blog post, this concept is very hard to define. While we probably all agree that a factoid in a tweet is not something I should put on my CV we allow and even cheer for publishing outlets that accept very incremental papers. The work I described above could have easily been sliced into smaller chunks but would it have the same value ? We would have put out the main ideas much faster but it could have been impossible to convince someone to test them. I feel that the combination of the different analysis and experiments has more value as a single story but an incremental approach would have been more transparent. Maybe the ideal situation would be to have the increments online in blogs, wikis and repositories and collect them in stories for publication. Maybe, just maybe, these thoughts are the consequence of postdoc blues. As I was trying to finish and publish this project I was also jumping through the academic track hoops but I will leave that for a separate post.

Wednesday, May 09, 2012

The Minimal Publishable Unit

What constitutes a minimal publishable unit in scientific publishing ? The transition to online publishing and the proliferation of journals is creating a setting where anything can be published. Every week spam emails almost beg us to submit our next research to some journal. Yes, I am looking at you Bentham and Hindawi. At the same time, the idea of a post-publication peer review system also promotes an increase in number of publications. With the success of PLoS ONE and its many clones we are in for another large increase in the rate of scientific publishing. Publish-then-sort as they say.

With all these outlets for publication and the pressure to build up your CV, it is normal that researchers try to slice their work into as many publishable units as possible. One very common trend in high-throughput research is to see two to three publications that relate to the same work: the main paper with the dataset and biological findings, and 1 or 2 off-shoots that might include a database paper and/or a data-analysis methods paper. Besides these quasi-duplicated papers there are the real small bites, especially in bioinformatics research. You know, those that you read and think to yourself that it must have taken no more than a few days to get done. So what is an acceptable publishable unit ?

I mapped phosphorylation sites to ModBase models of S. cerevisiae proteins and just sent this tweet with a small fact about protein phosphosites and surface accessibility:
Should I add that tweet to my CV ? This relationship is expected and probably already published with a smaller dataset but I could bet that it would not take much more to get a paper published. What is stopping us from adding trivial papers to the flood of publications ? I don't have an actual answer to these questions. There are many interesting and insightful "small-bite" research papers that start from a very creative question that can be quickly addressed. It is also obvious that the amount of time/work spent on a problem is not proportional to the interest and merit of a piece of research. At the same time, it is very clear that the incentives in academia and publishing are currently aligned to increase the rate of publication. This increase is only a problem if we can't cope with it, so maybe instead of fighting against these aligned incentives we should be investing heavily in filtering tools.
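For anyone curious about the mechanics behind that kind of quick analysis, here is a minimal sketch of how one might measure surface accessibility at phosphosites on a homology model. It assumes Biopython and a local DSSP installation, and the file name and site positions are placeholders; this is an illustration, not the code behind the tweet.

    # Sketch: relative solvent accessibility of phosphosites on a homology model.
    # Requires the `dssp` program on the PATH; model.pdb and the site numbers
    # below are hypothetical stand-ins for the real inputs.
    from Bio.PDB import PDBParser
    from Bio.PDB.DSSP import DSSP

    structure = PDBParser(QUIET=True).get_structure("model", "model.pdb")
    dssp = DSSP(structure[0], "model.pdb")  # runs DSSP on the model

    phosphosites = {15, 42, 113}  # hypothetical phosphosite residue numbers
    for chain_id, res_id in dssp.keys():
        if res_id[1] in phosphosites:
            rel_asa = dssp[(chain_id, res_id)][3]  # relative accessible surface area
            print(f"site {res_id[1]}: relative ASA = {rel_asa}")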


Wednesday, March 28, 2012

Individual genomics of yeast

Nature Genetics used to be one of my favorite science journals. It consistently had papers that I found exciting. That changed about 5 years ago or so when they had a very clear editorial shift into genome-wide association studies (GWAS). Don't get me wrong, I think GWAS are important and useful but I don't find it very exciting to have lists of regions of DNA that might be associated with a phenotype. I want to understand how variation at the level of DNA gets propagated through structures and interaction networks to cause these differences in phenotype. I mostly stayed out of GWAS since I was focusing on the evolution of post-translational networks using proteomics data but I always felt that this line of research was not making full use of what we already know about how a cell works.

In this context, I want to tell you about a paper that came out from Ben Lehner's lab that finally made me excited about individual variation, and why I think it is such a great study. I was playing around with a similar idea when the paper came out so I will start with the (very) preliminary work I did and continue with their paper. I hope it can serve as a small validation of their approach.

As I just mentioned, I think we can make use of what we know about cell biology to interpret the consequence of genetic variation. Instead of using association studies to map DNA regions that might be linked to a phenotype, we can take a full genome and try to guess what could be deleterious changes and their consequences. It is clear that full genome sequences for individuals are going to be the norm so how do we start to interpret the genetic variations that we see ? For human genetic variation, this is a highly complex and challenging task.

Understanding the consequences of human genetic variation from DNA to phenotype requires knowledge of how variation will impact a protein's stability, expression and kinetics; how this in turn changes interaction networks; how this variation is reflected in each tissue's function; and how it ultimately results in a fitness difference, disease phenotype or response to drugs. Eventually we would like to be able to do all of this, but we can start with something simpler. We can take unicellular species (like yeast) and start by understanding cellular phenotypes before we move to more complex species.

To start we need full genome sequences for many different individuals of the same species. For S. cerevisiae we have genome sequences for 38 different isolates by Liti et al. We then need phenotypic differences across these different individuals. For S. cerevisiae there was a great study published in June last year by Warringer and colleagues where they tested the growth rate of these isolates under ~200 conditions. Having these data together we can attempt to predict how the observed mutations might result in the differences in growth. As a first attempt we can look at the non-synonymous coding mutations. For these 38 isolates there are something like 350 thousand non-synonymous coding mutations. We can predict the impact of these mutations on a protein either by analyzing sequence alignments or by using structures and statistical potentials. There are advantages and disadvantages to both approaches but I think they end up being complementary. The sequence analysis requires large alignments while the structural methods require a decent structural model of the protein. I think we will need a mix of both to achieve good coverage of the proteome.

I started with the sequence approach as it was faster. I aligned 2329 S. cerevisiae proteins with more than 15 orthologs in other fungal species and used MAPP from the Sidow lab at Stanford to calculate how constrained each position is. I got about 50K non-synonymous mutations scored with MAPP, of which about 1 to 8 thousand could be called potentially deleterious depending on the cut-off. To these we can add mutations that introduce STOP codons, in particular if they occur early in the protein (~710 of these within the first 50 AAs of proteins).
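The filtering step itself boils down to a cut-off on the MAPP score plus a check for premature stops. A rough sketch, with a hypothetical threshold and a simplified view of the MAPP output (the real output format and cut-off differ):

    # Call a mutation deleterious if its MAPP score is above a cut-off, and
    # flag STOP codons introduced within the first 50 residues. The threshold
    # and data structures are placeholders, not the ones used in the analysis.
    MAPP_CUTOFF = 10.0  # hypothetical; 1-8K calls depending on where you set it

    def is_deleterious(protein, position, alt_aa, mapp_scores):
        """mapp_scores: {(protein, position, alt_aa): constraint score}."""
        if alt_aa == "*":          # premature STOP codon
            return position <= 50  # early stops are the clearest losses of function
        return mapp_scores.get((protein, position, alt_aa), 0.0) > MAPP_CUTOFF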

So up to here we have a way to predict if a mutation is likely to impact negatively on a protein's function and/or stability. How do we go from here to a phenotype like a decreased growth rate in the presence of stress X ? This is exactly the question that chemical-genetic studies try to address. Many labs, including our own, have used knock-out collections (of lab strains) to measure chemical-genetic interactions that give a quantitative measure of the relative importance of each protein in a given condition. So, we can make the *huge* simplification of taking all deleterious mutations and just summing up their effects, assuming a linear combination of the effects of the knock-outs, as sketched below.
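In code the simplification really is just a sum; something like the following, where the chemical-genetic scores and mutation calls are hypothetical stand-ins for the real chemical-genetic data and the MAPP calls above:

    # Additive model: the predicted growth defect of a strain in a condition is
    # the sum of the knock-out effects of every gene carrying a predicted
    # deleterious mutation in that strain. All inputs here are placeholders.
    def predicted_defect(mutated_genes, chemgen_scores, condition):
        """mutated_genes: genes with deleterious mutations in one strain;
        chemgen_scores: {(gene, condition): knock-out effect, higher = worse}."""
        return sum(chemgen_scores.get((gene, condition), 0.0)
                   for gene in mutated_genes)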

To test this idea I picked 4 conditions (out of the ~200 mentioned above) for which we have chemical-genetic information (from Parsons et al.) and where there is high growth rate variation across the 38 strains. With everything together I can test how well we can predict the measured growth rates under these conditions (relative to a lab strain):
Each entry in the plot represents one strain in a given condition. Higher values correspond to worse predicted/measured growth (relative to a lab strain). There is a highly significant correlation between measured and predicted growth defects (~0.57) overall, but cisplatin growth differences are not well predicted by these data. Given the many simplifications and the poor coverage of some of the methods used, I was surprised to see a correlation at all. This tells us that, at least for some conditions, we can use mutations found in coding regions and appropriately selected gene sets to predict growth differences.
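For completeness, the comparison itself is a one-liner once the two vectors are assembled. A sketch with placeholder numbers (the post does not record whether the ~0.57 is a Pearson or Spearman correlation; Pearson is assumed here):

    # Correlate predicted and measured growth defects across strain-condition
    # pairs. The two vectors below are placeholders for the real data.
    from scipy.stats import pearsonr

    predicted = [0.1, 0.8, 0.3, 1.2]  # hypothetical predicted defects
    measured = [0.2, 0.6, 0.4, 1.0]   # hypothetical measured defects
    r, p = pearsonr(predicted, measured)
    print(f"r = {r:.2f} (p = {p:.2g})")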

This is exactly the message of Rob Jelier's paper from Ben Lehner's lab. When they started their work, the phenotypic dataset from Warringer and colleagues was not yet published so they had to generate their own measurements for this study. In addition, their study is much more careful in several different ways. For example, they only used the sequences for 19 strains that they say have higher coverage and accuracy. They also tried to estimate the impact of indels and to increase the size of the alignments (a crucial step in this process) by searching for distant homologs. If you are interested in making use of "personal" genomes you should really read this paper.

Stepping back a bit, I think I was excited about this paper because it finally connects the work that has been done in high-throughput characterization of a model organism with the diversity across individuals of that species. It serves as a bridge for many people to come work in this area. There are a large number of immediate questions, like: how much do we really need to know to make good/better predictions ? What kind of interactions (transcriptional, genetic, conditional genetic) do we need to know to capture most of the variation ? Can we select gene sets and gene weights in other species without the conditional-genetics information (by homology) ?

As we are constantly told, the deluge of genome sequences will continue so there are plenty of opportunities and data to analyze (I wish I had more time ;). Some recent examples of interest include the sequencing of 162 D. melanogaster lines with associated phenotypic data and the (somewhat narcissistic) personal 'omics study of Michael Snyder. To start to make the jump to humans I think it would be great to have cellular phenotypic data (growth rate/survival under different conditions) for the same cells/tissue across a number of human individuals with sequenced genomes. Maybe in a couple of years I won't be as skeptical as I am now about our fortune cookie genomes.


Wednesday, February 29, 2012

Book Review - The Filter Bubble

Following my previous post I thought it was on topic to mention a book I read recently called “The Filter Bubble”. The book, authored by Eli Pariser, discusses the many applications of personalization filters in the digital world. Like several books I have read in the past couple of years, I found it via a TED talk where the author neatly summarizes the most important points. Even if you are not too interested in technology it is worth watching. I am usually very optimistic about the impact of technology on our lives but Pariser raises some interesting potential negative consequences of personalization filters.

The main premise of the book is that the digital world is increasingly being presented to us in a personalized way, a filter bubble. Examples include Facebook’s newsfeed and Google search, among many others. Because we want to avoid the flood of digital information we willingly give away commercially valuable personal information that can be used for filtering (and targeted advertisement). In turn, the fact that so many people are giving out this information has created data mining opportunities in the most diverse markets. The book goes into many examples of how these datasets have been used by different companies such as dating services and the intelligence community. The author also provides an interesting outlook on how these tracking methods might even find us in the offline world, a la Minority Report.

If sifting through the flood of information to find the most interesting content is the positive side of personalization, what might be the downside? Eli Pariser tries to argue that this filter “bubble”, which we increasingly find ourselves in, isolates us from other points of view. Since we are typically unaware that our view is being filtered we might get a narrow sense of reality. This would tend to reinforce our perceptions and personality. It is obvious that there are huge commercial interests in controlling our sense of reality, so keeping these filters in check is going to be increasingly important. This narrowing of reality may also stifle our creativity since novel ideas are so often found at the intersection between different ways of thinking. So, directing our attention to what might be of interest can inadvertently isolate us and make us less creative.

As much as I like content that resonates with my interests, I get a lot of satisfaction from finding out new ideas and getting exposed to different ways of thinking. This is why I like the TED talks so much. There are few things better than a novel concept well explained - a spark that triggers a re-evaluation of your sense of the world. Even if these are ideas that I strongly disagree with, as often happens with politics here in the USA, I want to know about them if a significant proportion of people think this way. So, even if the current filter systems are not effective to the point of isolating us, I think it is worth noting these trends and taking precautions.

The author offers immediate advice to those creating the filter bubble – let us see and tune your filters. One of the biggest issues he tries to bring up is that the filters are invisible. I know that Google personalizes my search but I have very little knowledge of how and why. The simple act of making these filters more visible should help us see the bubble. Also, if you are designing a filtering system, make it tunable. Sometimes I might want to get out of my comfort zone and see the world through a different lens.

Thursday, February 23, 2012

Academic value, jobs and PLoS ONE's mission

Becky Ward from the blog "It Takes 30" just posted a thoughtful comment regarding the Elsevier boycott. I like the fact that she adds some perspective as a former editor contributing to the ongoing discussion. This follows also from a recent blog post from Michael Eisen regarding academic jobs and impact factors. The title very much summarizes his position: "The widely held notion that high-impact publications determine who gets academic jobs, grants and tenure is wrong". Eisen is trying to play down the value of the "glamour" high impact factor magazines and fighting for the success of open access journals. It should be a no-brainer really. Scientific studies are mostly paid for by public money, they are evaluated by unpaid peers and published/read online. There is really no reason why scientific publishing should be behind pay-walls.

Obviously it is never as simple as it might appear at first glance. If putting science online was the only role publishers played I could just put all my work up on this blog. While I write up some results as blog posts I can guarantee you that I would soon be out of a job if I only did that. So there must be other roles that scientific publishing plays, and even if these roles might be outdated or performed poorly they are needed and must be replaced for us to have a real change in scientific publishing.

The value of scientific publishing

In my view there are 3 main roles that scientific journals are currently playing: filtering, publishing and providing credit. The act of publishing itself is very straightforward and these days could easily cost close to zero if the publishers have access to the appropriate software. If publishing itself has benefited greatly from the shift online, filtering and credit are becoming increasingly complex in the online world.

Filtering
Moving to the digital world created a great attention crash that we are still trying to solve. What great scientific advances happened last year in my field ? What about in unrelated fields that I cannot evaluate myself ? I often hear that we should be able to read the literature and come up with answers to these questions directly, without regard to where the papers were published. However, try to imagine for a second that there were no journals. If PLoS ONE and its clones get what they are aiming for, this might be on the way. A quick check on PubMed tells me that 87134 abstracts were made available in the past 30 days. That is something like 2900 abstracts per day ! Which of these are relevant for me ? The current filtering system of tiered journals with increasing rejection rates is flawed but I think it is clear that we cannot do away with it until we have another in place.

Credit attribution
The attribution of credit is also intimately linked to the filtering process. Instead of asking about individual articles or research ideas, credit is about giving value to researchers, departments or universities. The current system is flawed because it overvalues the impact/prestige of the journals where the research gets published. Michael Eisen claims that impact factors are not taken into account when researchers are picked for group leader positions but honestly this idea does not ring true to me. From my personal experience of applying for PI positions (more on that later), those that I see getting shortlisted for interviews tend to have papers in high-impact journals. On twitter Eisen replied to this comment by saying "you assume interview are because of papers, whereas i assume they got papers & interviews because work is excellent". So either high impact factor journals are being incorrectly used to evaluate candidates or they are working well to filter excellent work. In either case, if we are to replace the current credit attribution system we need some other system in place.

Article-level metrics
So how do we do away with the current focus on impact factors for both filtering and credit attribution? Both problems could be solved if we focused on evaluating articles instead of journals. The mission of PLoS ONE was exactly to develop article-level metrics that would allow for a post-publication evaluation system. As they claim on their webpage, they want "to provide new, meaningful and efficient mechanisms for research assessment". To their credit, PLoS has been promoting the idea and making some article-level indicators easily accessible, but I have yet to see a concrete plan to provide readers with a filtering/recommendation tool. As much as I love PLoS and try to publish in their journals as much as possible, in this regard PLoS ONE has so far been a failure. If PLoS and other open access publishers want to fight Elsevier and promote open access they have to invest heavily in filtering/recommendation engines. Partner with academic groups and private companies with similar goals (e.g. Mendeley ?) if need be. With PLoS ONE they are contributing to the attention crash and (finally) making a profit off of it. It is time to change your tune, stop saying how big PLoS ONE is going to be next year and start saying how you are going to get back on track with your mission of post-publication filtering.

Summary
Without replacing the current filtering and credit attribution roles of traditional journals we won't do away with the need for a tiered structure in scientific publishing. We could still have open access tiered systems but the current trend for open access journals appears to be the creation of large journals focused on the idea of post-publication peer review, since this is economically viable. However, without filtering systems, PLoS ONE and its many clones can only contribute to the attention crash problem and do not solve the issue of credit attribution. PLoS ONE's mission demands that they work on filtering/recommendation and I hope that, if nothing else, they can focus their message, marketing efforts and partnerships on this problem.