Wednesday, October 12, 2005

In support of text mining

A commentary in Nature Biotech uses text mining to look at how knowledge about molecular interactions grows over time. To do this, the authors used time-stamped statements about molecular interactions taken from full-text articles in 60 journals published from 1999 to 2002. They describe how knowledge mostly expands from known "old" interactions instead of "jumping" to areas of the interaction space that are totally unconnected from previous knowledge. Since this work is based on statements about interactions, I guess that the authors did not take into account the data coming from high-throughput methods that is not described in the papers but is deposited in databases. In fact, in a recent effort to map the human protein-protein interaction network, there was very little overlap between the known interactions and the new set of proposed interactions. What we might conclude from this is that although high-throughput methods are more error-prone than small-scale experiments, they help us jump to unexplored knowledge space.
The other two main conclusions of the commentary are that some facts are restricted to "knowledge pockets" and that only a small part of the network is growing at any given time. In general, the authors try to make a case for the use of text mining, but they do not go into the details of how it should be implemented. They do not talk about the possible roles of databases, tagging, journals, funding agencies, etc. in this process of knowledge growth. Databases should help to solve the problem of knowledge pockets that the authors mention. Tagging can eliminate the need for mining the data, and journals and funding agencies have the power to require authors to deposit their data in databases or to tag their research along with the paper.
Without wanting to attract the wrath of people working on text mining, my opinion is that at least an equal amount of effort should be dedicated to making the knowledge that is discovered in the future easier to retrieve.