Cellular Consequences of Genetic variation: Scientific Data

Last week a new NPG journal called Scientific Data started accepting submissions. Although I have discussed this new journal with colleagues a few times, I realized that I never argued here why I think this is a very strange idea for a journal. So what is Scientific Data? In short, it is a journal that publishes metadata for a dataset along with data quality metrics. From the homepage:

Scientific Data is a new open-access, online-only publication for descriptions of scientifically valuable datasets. It introduces a new type of content called the Data Descriptor designed to make your data more discoverable, interpretable and reusable.

So what does that mean? Is this a journal for large-scale data analysis? For the description of methods? Not exactly. Reading the guide to authors, we can see that an article "should not contain tests of new scientific hypotheses, extensive analyses aimed at providing new scientific insights, or descriptions of fundamentally new scientific methods". So instead one might assume that this journal is some sort of database where articles are descriptors of the data content and data quality, and that its added value would be to store the data and provide fancy ways to allow for re-analysis. That is also not the case, since the data is meant to be "stored in one or more public, community recognized repositories". Importantly, these publications are not meant to replace, and do not preclude, future research articles that make use of these data. Here is an example of what these articles would look like. This example probably represents the kind of submission the journal hopes to receive, so let's see how this shapes up in a year, when people try to test the limits of this novel publication type.

In summary, articles published by this journal are mere descriptions of data with data quality metrics. This is the same information that any publication should already contain, except that Scientific Data articles are devoid of any insight or interpretation of the data. One argument in favor of this journal would be that it is a step towards micro-publication and micro-attribution in science. Once the dataset is published, anyone, not just the producers of the data, can make use of this information. A more cynical view would be that NPG wants to squeeze as much money as it can from scientists (and funding agencies) by promoting salami-slicing publishing.

Why should we pay $1000 for a service that does not even handle data storage? That money is much better spent supporting data infrastructures (disclaimer: I work at EMBL-EBI). There is no added value from this journal that is not, or cannot be, provided by data repository infrastructures. Yet this journal is probably going to be a reasonable success, since authors can essentially publish their research twice for an added $1000. In fact, anyone doing a large-scale, data-driven project can these days publish something like four different papers: the metadata, the main research article, the database article and the stand-alone analysis tool that does 2% better than others. I am not opposed to a more granular approach to scientific publication, but we should make sure we don't waste money in the process. Right now I see neither any incentive to limit this waste nor any real progress in updating the way we filter and consume this more granular scientific content.


Saturday, October 19, 2013

Scientific Data - ultimate salami slicing publishing