Wednesday, May 02, 2007

Bio::Blogs #10

The 10th edition of Bio::Blogs, the bioinformatics blog journal, has been posted at Nodalpoint. The PDF can be downloaded from Box.net.

Sunday, April 29, 2007

Bio::Blogs #10 - reminder

The May 1st edition of Bio::Blogs will be hosted at Nodalpoint. Anyone can participate by sending in links to interesting blog posts from April (to bioblogs at gmail). If you send in links to your own blog posts, please also say whether or not you agree to have the post copied into the PDF version for offline reading.

Friday, April 27, 2007

Bio-science hype cycle

I found out about Gartner's Hype Cycle today from a story at Postgenomic (via Science Notes and HealthNex).

Gartner's Hype Cycle is meant to highlight the relative maturity of different IT technologies. The idea originates from the common pattern in how human perception of nascent technologies evolves: from an initial trigger, through exaggerated expectations and disillusionment, and finally to maturity and stability.

Just for the fun of it I tried to plot the same graph with some bio-science related technologies and/or ideas:


This is a very limited and biased view, but for me it was interesting to think about where to place the different technologies.

Thursday, April 26, 2007

The publisher's reaction

Sarah Cooney, Director of Publications for the Society of Chemical Industry, has issued an official reply to the gathering criticism:

"We apologise for any misunderstanding. In this situation the publisher would typically grant permission on request in order to ensure that figures and extracts are properly credited. We do not think there is any need to pursue this matter further."

The email was posted on Shelley Batts's blog and also in this blog post at Nature Network. Overall this is good news: it was an honest mistake and not a policy of the journal or the publisher. There is a hint in the official reply to some potentially abusive emails sent to the editor. Maybe, as Euan Adie suggested, this is also a lesson for science bloggers to guard against the mob mentality that can build up when handling these issues. I would guess that many editors might not even be aware of their own copyright/fair use policies, and this incident might at least raise the discussion.
How can a publisher be so dumb - Update

I posted a while ago about the copyright policies of different science publishers regarding images. I concluded by saying that in any case we should be safe blogging images, since no publisher would be likely to sue a blogger for using an image or two to promote one of their papers. Well ... apparently I was wrong. Shelley Batts from Retrospectacle got an email from an editor of the Journal of the Science of Food and Agriculture (published by Wiley Interscience):

"The above article contains copyrighted material in the form of a table and graphs taken from a recently published paper in the Journal of the Science of Food and Agriculture. If these figures are not removed immediately, lawyers from John Wiley & Sons will contact you with further action."

This is not legal action, but the threat is there. I cannot imagine what they were thinking. Are they really willing to sue a blogger over what is very likely fair use of their content? The content used is a small fraction of the whole, the blog post is educational, and it most likely increased the traffic to that paper. If anything, this email has just earned them a lot of indignation, and it will be a PR nightmare for the journal and the publisher.

Science bloggers are doing a great service by covering science news faster and in more depth than most traditional news services. Every time I look at the front page of Postgenomic these days, I see what will be the main science stories of the next day in the regular news outlets. Not only that, but I will likely find someone who actually works on the subject and can give a very good explanation of what the work is about. Publishers should be fostering this by crafting policies directed at this use of their material, not the other way around.

If you want to contact the editor that made the decision the email is on Shelley's post.

There is a large number of posts reacting to this in Postgenomic.

Update - Boing Boing is giving coverage to this too. If you also think this was a bad decision by the journal editor/publisher, consider writing about it or sending them an email. Even if it is within their legal rights, we can at least tell them that we don't find it appropriate or fair.

Tuesday, April 24, 2007

Cellular adaptation to unforeseeable events

How do cells react to changes in external conditions? It has been noted before that in many cases the immediate transcriptional response includes unspecific changes in the expression of a large group of genes (Gasch et al., 2000). Fong and colleagues have shown that in E. coli, 20 to 40 days after the initial change, most genes return to the expression levels they had before the environment was modified. The genes that remain differentially expressed at this stage are situation specific but not necessarily always the same. In the same paper, gene expression changes were followed in independent populations evolving under the same change in conditions. Out of ~1100 gene expression changes (on average) that were possibly adaptive to the new conditions, only 70 were common to all 7 parallel populations.

A new study published in MSB adds more information to these interesting findings. The authors challenged S. cerevisiae with a perturbation that these cells should not have encountered during their evolutionary history. They used a his3 deletion strain with a plasmid carrying HIS3 under the GAL1 promoter. In these cells the essential HIS3 gene should be efficiently turned off in glucose medium. They then tracked the gene expression changes over time when the medium was switched from galactose to glucose. The cells adapted to these conditions within around 10-20 generations. Again, the initial gene expression changes involved a large number of genes (~1000-1600 genes with a >2-fold change), with most of them (65-70%) returning to their original expression levels within 10-20 generations. And again, different populations had different sets of genes differentially expressed in response to the transition from galactose to glucose.

There is a detailed analysis in the paper of the functional classes of these genes, but for me the general trends were by themselves very interesting. How does the cell cope with unforeseeable events?

Maybe there is a general mechanism that senses discrepancies between metabolic requirements and the current cellular state and, in the absence of a programmed response, drives an almost chaotic search for plausible solutions. If such a sensing mechanism exists, it could provide the necessary feedback for the selection of cellular states on a physiological time scale. In an environment where frequent unpredictable changes occur, such a system could plausibly be selected for.

For further reading, have a look at the News and Views by Eugene Koonin.

Tuesday, April 17, 2007

The Seven Stones blog and more quick links

The Nature family of blogs has a new member - The Seven Stones - from the Molecular Systems Biology journal. I helped set it up during my 3-month stay with the journal. Go over there and say hello to the editors :).

(via Deepak) The TED.com site was relaunched. It has one of the most amazing collections of video talks available. The current main focus is The Rise of Collaboration.

(via Konrad and Richard Akerman) There was an interesting conference organized by Allen Press - Emerging Trends in Scholarly Publishing. Both Konrad and Richard Akerman describe on their blogs what the conference was about and what they talked about.

Wednesday, April 11, 2007

Sharing notes of science papers not just on PLoS ONE

Neil just posted some examples of web tools for taking notes on websites. I'll pick up on the theme to show you another example - Diigo. Diigo is a collaborative note-sharing web app that lets us annotate a website (highlight and add notes) and share this with anyone else using Diigo. I will go through an example. I registered for the service and dragged a bookmarklet to the browser.

Then I went to have a look at this paper, which showed up in the Systems Biology Connotea feed, and clicked the bookmarklet:

The paper discusses the usefulness of machine learning methods for studying the dynamics of cellular pathways, so I just added a link to the Wikipedia entry on fuzzy logic.

Now anyone using Diigo can browse the article and see the notes I added (I think you can choose to make them public or private). I do most of my reading on paper, so for me this would be useful mainly for dissecting papers with someone online. In a sense this is what PLoS ONE does, but extended to any website. For example, at the same time we blog about a paper, we could add a link to the blog post on the paper itself, or add some of the comments directly on the paper if that makes more sense. You can also use Diigo to gather a clip to add to a blog post and create groups to share bookmarks and annotations.

The biggest drawback is that I cannot tell whether there are any Diigo comments or annotations for a site without clicking the bookmarklet (I did not try the toolbar they provide for installation). Even after clicking the bookmarklet there is no way of knowing how many notes (if any) exist on the site without scrolling around to look for them.

As with most social applications, this becomes more useful as more people start using it. On the other hand, if this (or a similar service) grows too much it will be too attractive to spammers. There are more examples of this type of tool in today's post on TechCrunch, where I found out about Diigo. With so many ways to add annotations to a webpage, it would be nice to have some kind of abstract way of communicating annotations between web applications. Something like trackbacks, but able to convey more information than just that this content points to that content.

Thursday, April 05, 2007

The state of the Science Blogosphere - In reply to Bora

Bora wrote a post on science blogging where he argues that blog carnivals seem to be slowly morphing into blog journals, including some aspects of the editorial and review processes that go on in science journals. My response was growing a bit long, so I am posting it on the blog instead.

When we started Bio::Blogs, some 10 months ago, I thought that one day submissions might grow to a rate where it would be reasonable to impose a limit. Once a limit is established, selection kicks in and the carnival slowly morphs into a journal. Even without that selection, there is already some sort of review process, since people tend to send in links to things that were either already popular on their blog or that they found interesting on other blogs.

Unfortunately, the size of the bioinformatics blogosphere is not growing significantly. Several new blogs have appeared, but other bloggers have stopped posting. I am not sure if this is true for most science-related blogs or just the particular case of bioinformatics. Postgenomic keeps track of the number of active blogs and blog posts, and at least there it looks like we are holding steady at around 400 science blogs active per week since November last year (see picture below).

The blogs tracked by Postgenomic have to be submitted or picked up by Stew, so there are surely many more science blogs that are not being tracked by this service. In fact, Technorati lists around 20,000 blogs tagged as science. Given that it is in bloggers' best interest to tag their blogs very broadly to attract a wider audience, Technorati surely includes many blogs that are not really science related. This number is also inflated by duplicate blogs from people who moved from one blogging platform to another and by blogs that have been created but are not active. So the real number of active science blogs is somewhere between 400 and 20,000.
If you have a science blog (or know of a science blog) that is not tracked by Postgenomic, submit it by email (instructions on the site).

Also, even if there are some great quality posts, very few people (that I know of) are posting new data. There is almost no open science going on. There are great examples, but so far with limited impact. As Bora states:

"Scientists, as a whole, are very reluctant to write novel ideas, hypotheses or data on blogs, and are very slow to test the waters of Open, Source Publishing. Most of what one finds on science blogs is commentary on other peoples' ideas, hypotheses and data found in journals and mass media."

Nevertheless, judging by a recent story in The Scientist and an article in Nature Jobs, science blogs are now taken more seriously. Blogging is finally being perceived simply as a means of communication (for better and for worse) and no longer as something that the MySpace kids do.

What will it take for other people to join in blogging and publishing their science openly? I think examples will drive it. The success of OpenWetWare is a tremendously good example. Soon I hope to see some papers published by open science projects like Usefulchem. If community projects run on models similar to open source development are truly a more efficient way to produce knowledge, then examples of successful projects will be the best way to get other people to participate.

Going back to the blog carnivals, I would suggest two concrete changes: 1) stop calling them blog carnivals and call them blog journals instead; 2) have a PDF version for offline reading.
Both changes bring the carnival closer to the model of publishing we are used to seeing. There has been a PDF version of Bio::Blogs (inspired by the 1st Science Anthology) for the last two editions, and they are being downloaded in significant numbers. The PDF for the April edition has been downloaded around 40 times (in 5 days). Not bad for such a specific field. It is a bit of a hassle to get the permissions every month, but it is worthwhile.

I don't think blog journals should in any way replace traditional science publishing. They have a place in a less formal layer of science communication. They also bring some order and quality control to the very chaotic and fast-flowing nature of blogs.

Further reading by other bloggers:
Idea for Discussion: An Academic Blog Review
Science Blogging, Blog Carnivals and Secondary Publishing

Wednesday, April 04, 2007

OpenWetWare is hiring

(Via the SynBio discuss list) The open science community OpenWetWare is looking for people to take on full-time responsibilities in the development of OWW. They are looking for:

- Senior technology developer
"Define and carryout lead software development and technology integration in support of OpenWetWare. Cultivate and respond to community input. Leverage internal volunteer development resources and establish productive relationship with external open source projects. Oversee outsourced server operations. Help lead development of our long term technology development strategy."

- Senior knowledge developer
"Lead improvements to the OWW community and knowledge structure. Develop and implement knowledge management resources that improve the sharing of information via OWW. Lead conversations with OWW users and Technology Developers to ensure continuing relevance of ongoing knowledge management improvements. Help lead development of our long term community and knowledge development strategy."

The growth of OWW shows that open science is not a totally naive concept and that people are willing to collaborate with others openly online.

More details on the job posting

Sunday, April 01, 2007

Bio::Blogs #9 - small update

Welcome to the ninth edition of the bioinformatics blog journal Bio::Blogs posted online on the 1st of April of 2007. The archive from previous months can be found at bioblogs.wordpress.com.


Today is an exciting day for bioinformatics and open science in general. I am happy to report on an ongoing project at Nature that has been under wraps for quite a long time. It is called Nature Sherlock and it promises to turn the dream of a rich semantic web for scientists into a reality. The service is still in closed beta, but you can have a look at http://sherlock.nature.com/ to see that it does exist, and from the name you might get a sense of what it does. I have been allowed to use Sherlock for some time, and according to the FAQ on the main website it has been co-developed by Google and Nature and is one of the results of meetings that went on during the 1st Science Foo Camp (also co-organized by Google and Nature). Access to the main site requires a beta tester password, but I can say that Sherlock looks like a very promising tool. Sherlock is the code name for the main bot that crawls text and databases from willing providers (current partners include Nature, EBI, NCBI and PubMed Central) to produce semantic web objects that abide by well-established standards in biology. Some of the results, especially those from text mining, are of lower accuracy (details can be found on the help pages), but overall it looks like an amazing tool. I hope they get this out soon.

In this month's Bio::Blogs I have included many posts that were not submitted but that I thought were interesting and worth mentioning. This might make for a more biased selection, but it lets me make up for the current low number of submissions. As in the last edition, the blog posts mentioned were converted into a PDF for anyone interested in downloading and reading Bio::Blogs offline. There are many interesting comments on the online blog posts that I did not include in the PDF, so if you read this offline and find something interesting, go online for the discussion.


News and Views
This month saw the announcement of the main findings from the Global Ocean Sampling Expedition. Several articles published in PLoS Biology detail the main conclusions of Craig Venter's effort to sequence microbial diversity. Both Konrad and Roland Krause blogged comments on this metagenomics initiative.

Articles
I will start this section by highlighting Stew's post on software availability. Testing around 111 resources taken from the Application Notes published in March issues of Bioinformatics shows that between 11% and 17% (depending on the year) of these resources are no longer available. Even considering that bioinformatics research runs at a very fast pace and that some of these resources might be outdated by now, there is no reason why they should not be available (as was required for publication).
RPG from Evolgen submitted a post entitled "I Got Your Distribution Right Here" where he analyzes the variation of genome sizes among birds. He concludes by noting that the variability of genome sizes in Aves is smaller than in Squamata (lizards and snakes) and Testudines (turtles, tortoises, and terrapins). An interesting question might then be why birds have a narrower distribution of genome sizes. Is there a selection pressure?
Barry Mahfood submitted a blog post where he asks the question: "Is Death Really Necessary?". Looking at human life expectancy at different periods in time and thinking about what might define the self, Barry thinks that eternal life is achievable in the very near future.

Semantic web/Mash-up/web-services series
This month there were several blog posts regarding mash-ups, web services and the semantic web. All of these relate to the ease of accessing data online and of combining data and services together to produce useful and interesting outcomes.
Freebase has great potential to bring some of the semantic web concept closer to reality. Deepak sent in a link to his description of Freebase and of the potential usefulness of the site for scientists. I had the fortune of receiving an invitation to test the service, but I have not yet had time to fully explore it.
I hope you saw through my April Fools' introduction to Nature Sherlock. Even if Nature Sherlock does not really exist (it is a service to look for similar articles), it is clear that the Nature Publishing Group is the most active science publisher on the web. Tony Hammond at Nascent gave a brief description in a recent blog post of some of the tools Nature is working on.
While we are waiting for web services and data to become easier to work with, we can speed up the process by using web scraping technologies like openKapow (described by me) or Dapper (explained by Andrew Perry). These tools can help you create an interface to services that do not provide APIs.

Tips and Tricks

I will end this month's edition with a collection of tips for bioinformatics. Bosco wrote an interesting post - "Notes to a young computational biologist" - where he collects a series of useful tips for anyone working in bioinformatics. There is a long thread of comments with other people's ideas, making it a useful resource. On a similar note, Keith Robison wrote about error messages and the typical traps that might take a long time to debug if we are not familiar with them. (Update) In reply to a recent editorial in PLoS Computational Biology, Chris sent in some tips for collaborative work.
From Neil Saunders's blog comes a tutorial on setting up a reference manager system using LaTeX. I work mostly on a Windows machine and I am happy with Word plus EndNote, but I will keep this in mind if I ever change to a Linux setup.
Finally, I close this month's edition with a submission from Suresh Kumar on "Designing primer through computational approach". It is a nice summary of things to keep in mind for primer design, along with useful links to tools and websites that might come in handy.

Update - Just to be sure: Nature Sherlock is as real as the new Google TiSP wifi service.

Saturday, March 31, 2007

Bio::Blogs #9 call for submission

The 9th edition of Bio::Blogs will be posted here tomorrow. I will go around the usual blogs and look for interesting posts to make a round-up of what happened during the month. I will again try to put together an offline version including the blog posts authorized by their authors. Feel free to submit links to bioinformatics-related blog posts you find interesting, from your own blog or any other, during today and tomorrow. Submissions can be sent by email to bioblogs at gmail or in a comment to this post.

Wednesday, March 28, 2007

Open Access in a different way

Just for fun I thought I would try this new cartoon website. This is what I got out of it in a couple of minutes :). Take a wild guess which publishers I was thinking about.



View in the toon website
Usage-based measurements of journal quality

(Via Nautilus) The UK Serials Group (UKSG) and the online usage metrics organization COUNTER are exploring the possibility of using online statistics as a metric to determine the impact of a journal. There is a user survey for anyone interested in giving their opinion on the subject. The survey aims to:
* Discover what you think about the measures that are currently used to assess the value of scholarly journals (notably impact factors)
* Gauge the potential for usage-based measures
* Provide an opportunity for you to suggest possible different, additional measures.


As a blogger I am used to the idea of tracking readership statistics. I was curious about how these statistics are tracked by journals, so I had a closer look at the COUNTER initiative. According to the about page:
"Launched in March 2002, COUNTER (Counting Online Usage of Networked Electronic Resources) is an international initiative designed to serve librarians, publishers and intermediaries by facilitating the recording and exchange of online usage statistics."

If I understood the project correctly, COUNTER aims to define standards for the tracking of web statistics, to serve as a hub for gathering this information from publishers, and to provide statistical analyses to interested parties (libraries). The publishers are responsible for gathering the usage data and producing the reports (according to COUNTER standards). On the website there is a list of publishers that are providing this information to the project.

It is possible to certify a publisher as COUNTER compliant through a somewhat convoluted process in which a library checks the publisher's reports for compliance with the standards. I can't help thinking that there should be more efficient ways of doing this. For bloggers, it takes a second to include a few lines of code from one of many free online tracking services (e.g. SiteMeter, Google Analytics, FeedBurner) in a website and get instant, free user statistics. In that case, the service has the responsibility of tracking the users and I have little or no control over the reported data. If online usage statistics are to become a measure of journal impact (and I think they should), then I hope COUNTER licenses or creates technology similar to what powers these services. It should be an independent tracking service providing the results, not the publishers.

Tuesday, March 27, 2007

Google Base API

While we wait for Freebase to give us a chance to preview their service, we can go ahead and try something that is probably very similar in spirit. Google Base has been up for a long time, but only recently has it been opened up for programmatic access (see the Google Base API). There are some restrictions, but in the end we can think of it as a free online database that we can use remotely.

How easy is it to use? If you like Java, C# or PHP you are in luck, because they have client libraries to help you get started.

I also found this Google Base code on CPAN and decided to give it a try. After reading some of the information on the API website and having a look at the code, it comes down to 3 main tasks: 1) authentication; 2) insert/delete/update; 3) query.

Having installed the above-mentioned CPAN module, the authentication step is easy:

use WWW::Google::API::Base;

my $api_user = "username";   # Google user name
my $api_pass = "pass";       # Google password
my $api_key  = "API_KEY";    # any program using the API must get a key

my $gbase = WWW::Google::API::Base->new(
    { auth_type => 'ProgrammaticLogin',
      api_key   => $api_key,
      api_user  => $api_user,
      api_pass  => $api_pass },
    { } );

That's it: $gbase is now authorized to use that Google account in GBase.

Inserting something useful into the database requires a bit more effort. The CPAN module comes with an example of how to insert recipes. I am not that great a cook, so I created a new function, insertSequence, in the Base.pm file that comes with the module:

sub insertSequence {
    my $self       = shift;
    my $id         = shift;    # sequence ID
    my $seq_string = shift;    # the sequence itself
    my $seq_type   = shift;    # e.g. protein / dna
    my $spp        = shift;    # species
    my $l          = shift;    # sequence length

    $self->client->ua->default_header( 'content-type',
        'application/atom+xml' );

    my $xml = <<EOF;
<?xml version='1.0'?>
<entry xmlns='http://www.w3.org/2005/Atom'
       xmlns:g='http://base.google.com/ns/1.0'
       xmlns:c='http://base.google.com/cns/1.0'>
  <author>
    <name>API insert</name>
  </author>
  <category scheme='http://www.google.com/type' term='googlebase.item'/>
  <title type='text'>$id</title>
  <content type='text'>$seq_string</content>
  <g:item_type>sequence</g:item_type>
  <g:spp type='text'>$spp</g:spp>
  <g:length type='int'>$l</g:length>
  <g:sequence_type type='text'>$seq_type</g:sequence_type>
</entry>
EOF

    my $insert_request = HTTP::Request->new(
        POST => 'http://www.google.com/base/feeds/items',
        $self->client->ua->default_headers,
        $xml );

    my $response;
    eval { $response = $self->client->do($insert_request); };
    if ($@) {
        die $@;
    }

    my $atom  = $response->content;
    my $entry = XML::Atom::Entry->new( \$atom );
    return $entry;
}


The function takes in information about the sequence (the ID, the sequence string, type, species and length) and creates an XML entry to submit to Google Base, according to the specifications they provide on the website. In this case it will be an entry of type "sequence" (which is non-standard for GBase). The only catch was that I could not put the sequence string into an item attribute of type text, because there seems to be a size limit on these. This is why the sequence goes into the content (description) instead.

OK, with this new function, adding a sequence to the database is easy. After the authentication code above, and after populating the variables with data from somewhere (a FASTA file, for example), we just need to do:
$gbase->insertSequence( $simple, $seq_str, "protein", "s.cerevisiae", $l );

According to the Google API FAQ, there is a limit of 5 queries per second. In about 25 lines we can get a FASTA-to-GBase pipe. Here is an example of a protein sequence in GBase (it might get deleted in time).
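As a sketch of what such a FASTA-to-GBase pipe might look like (assuming BioPerl's Bio::SeqIO is installed and the insertSequence function above was added to Base.pm; the credentials and file name are placeholders):

#!/usr/bin/perl
# Minimal sketch of a FASTA-to-GBase pipe. Assumes BioPerl (Bio::SeqIO) is
# installed and insertSequence was added to Base.pm as described above.
# The credentials and the FASTA file name are placeholders.
use strict;
use warnings;
use Bio::SeqIO;
use WWW::Google::API::Base;

my $gbase = WWW::Google::API::Base->new(
    { auth_type => 'ProgrammaticLogin',
      api_key   => 'API_KEY',
      api_user  => 'username',
      api_pass  => 'pass' },
    { } );

my $in = Bio::SeqIO->new( -file => 'yeast_proteins.fasta', -format => 'fasta' );

while ( my $seq = $in->next_seq ) {
    $gbase->insertSequence(
        $seq->display_id,      # sequence ID
        $seq->seq,             # the sequence string itself
        'protein',             # sequence type
        's.cerevisiae',        # species
        $seq->length );        # sequence length
    sleep 1;                   # stay well below the 5 queries per second limit
}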

Now, I guess one of the interesting parts is that we can use Google to filter results using the Google Base query language. The CPAN module above already has a query tool. It is still very simple, but it gets the results of a search into an Atom object. Here is a query that returns items from S. cerevisiae with a length between 350 and 400:
my $new_id = 'http://www.google.com/base/feeds/items'
           . '?bq=[spp:s.cerevisiae][length(int) : 350..400]';

my $select_inserted_entry;
eval {
    $select_inserted_entry = $gbase->select($new_id);
    print $select_inserted_entry->as_xml;    # the output in XML format
};
if ($@) {
    my $e = $@;
    die $e->status_line;    # HTTP::Response
}
I am not sure yet whether these items are available for other users to query, nor what the code to do that would look like. I think this example only gets the items in my own account. This was as far as I got. The last step would be to have an XML parser turn the returned Atom object into something more useful.
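A minimal sketch of that last step, assuming the sequence ID was stored in the entry title and the sequence string in the content (as insertSequence does above), and reusing the XML::Atom modules the CPAN module already relies on:

use strict;
use warnings;
use XML::Atom::Feed;

# Turn the Atom XML returned by the query (e.g. $select_inserted_entry->as_xml)
# into FASTA printed to STDOUT. This is only a sketch of one possible last step.
sub atom_to_fasta {
    my ($atom_xml) = @_;
    my $feed = XML::Atom::Feed->new( \$atom_xml );
    foreach my $entry ( $feed->entries ) {
        my $id  = $entry->title;           # the sequence ID was stored in the title
        my $seq = $entry->content->body;   # the sequence string was stored in the content
        print ">$id\n$seq\n";
    }
}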

Tuesday, March 20, 2007

BIND bought by Thomson Scientific

(via email and Open Access News) Thomson Scientific has acquired Unleashed Informatics, the owner of, among other resources, the Biomolecular Interaction Network Database (BIND). The email announcing the transaction claims that the information will remain freely available:
"We will continue to provide you with open access to BOND — there will be
no change to the way you obtain and use the information you currently
have available — and we will work to ensure that the database remains knowledgeable, up-to-date and within the current high editorial standards."

Hopefully this will mean a better chance of survival for the database. BIND was created in 1998 under the Blueprint Initiative, led by Chris Hogue at Mount Sinai Hospital in Toronto. It is one of the best examples of the difficulties of financing scientific databases under current grant schemes. In 2005, having burned through something like $25 million and with a staff of 68 curators, the database started having trouble securing funding to keep up with the costs. Finally, in November 2005, the database stopped its curation activities and Unleashed Informatics (a spin-off also created by Chris Hogue) bought the rights to BIND and kept the data available.

In December 2006 Chris Hogue wrote in his blog:
"I am no longer employed by Mount Sinai Hospital and the Blueprint Initiative is now, obsolete. (...) For now I am self-employed as a writer and working part-time at Unleashed Informatics."

According to the public announcement posted on the Unleashed Informatics site, the management team of UI will now be part of Thomson Scientific, so it is possible that Chris Hogue is now heading Unleashed under the safer umbrella of Thomson.


Previous posts regarding BIND and database funding:
BIND database runs out of funding
BIND in the news
Stable scientific databases

Monday, March 19, 2007

Twitter - Reality streams

In the past couple of weeks there has been a lot of buzz around Twitter:
"A global community of friends and strangers answering one simple question: What are you doing?"

In Twitter, messages are limited to 140 characters and can be sent to everyone or to a restricted group of friends via phone, IM or the web. It is amazing to look at the landing page of Twitter and see all these messages flowing by about what random people are doing right now. Here is a random sample:
"Waaaahhhh... I want to go back to sleep, not go to work. Maybe the shower will help. less than 10 seconds"
"Just discovered Twitterholic (twitterholic.com), have to twitter more if I want to get on the list :P !! less than 20 seconds ago"
"was thinking of saying Hello World but has changed his mind less than 20 seconds ago"

I cannot find a good reason to even set up an account at Twitter. The only potentially interesting use would be to keep in touch with friends and family, but I have IM for that. I can use this blog to publish what I am thinking without the 140-character limitation. For once I agree with Nick Carr's view of this community application: it sounds a little bit narcissistic. As usual, he puts his points across in a provocative manner:
"The great paradox of "social networking" is that it uses narcissism as the glue for "community." Being online means being alone, and being in an online community means being alone together. (...) As I walk down the street with thin white cords hanging from my ears, as I look at the display of khakis in the window of the Gap, as I sit in a Starbucks sipping a chai served up by a barista, I can't quite bring myself to believe that I'm real. But if I send out to a theoretical audience of my peers 140 characters of text saying that I'm walking down the street, looking in a shop window, drinking tea, suddenly I become real."

It seems like every time there is a technology that enables more immediate communication between people, we jump on it (e.g. letters, email, SMS, IM).

Friday, March 16, 2007

Bioinformatic web scraping/mash-ups made easy with kapow

In bioinformatics we often need to use the same web service many times. Ideally, whoever built the web service provided a way to query the site automatically via an API. Unfortunately, Lincoln Stein's dream of a bioinformatics nation is still not a reality. When there is no programmable interface available and the underlying database is not available for download, it is usually necessary to write some code to scrape the content from the web service.

In comes openKapow, a free tool to (easily) build and publish robots that turn any website into a real web service. To illustrate how easy it is to use, I built a Kapow robot that returns, for any human gene ID, a list of orthologs (with species and IDs). I downloaded the robotmaker and tried it on the Ensembl database. To be fair, Ensembl is probably one of the best bioinformatics resources, with an available API and easy data-mining tools like BioMart; this was just to give an example.

You start the robot by defining the initial webpage and the service inputs and outputs. I decided to create a REST service that would take an Ensembl gene ID and output pairs of gene ID/species name. The robotmaker application is intuitive to use for anyone with moderate experience of HTML. The robot is created by setting up the steps that should occur to transform the input into the desired output. For example, we have to define where the input should be entered by clicking on the search box:
From here, there is a set of loops and conditional statements that you can include to get the list of orthologs:

We can run through the robot steps with a test input and debug it graphically. Once the robot is working, it is possible to host it on the openKapow web page, apparently also free of charge. Here is the link to this simple robot (the link might go down in the future). Of course, it is also possible to build new robots that use robots already published on openKapow. Also, this example uses a single webpage, but it would be more interesting to use this to mash up different services together.
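Once hosted, the robot can be called from a script like any other REST service. A minimal sketch in Perl (the robot URL and the 'geneid' parameter name are hypothetical; use whatever address and inputs openKapow assigns when the robot is published):

#!/usr/bin/perl
# Minimal sketch of calling a hosted openKapow REST robot from a script.
# The URL and the 'geneid' parameter are hypothetical placeholders.
use strict;
use warnings;
use LWP::UserAgent;

my $gene_id = 'ENSG00000139618';    # an example human Ensembl gene ID
my $url     = "http://service.openkapow.com/example/orthologs.rest?geneid=$gene_id";

my $ua       = LWP::UserAgent->new;
my $response = $ua->get($url);

if ( $response->is_success ) {
    print $response->decoded_content;    # the gene ID/species pairs returned by the robot
}
else {
    die $response->status_line;
}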
Systems and Synthetic biology RSS pipe

Here is the RSS feed for a Yahoo pipe combining and filtering papers mostly about synthetic and systems biology. Three systems biology journals are combined directly into the feed. Unfortunately I could not find an RSS feed for IET Systems Biology, so it is not included. On top of these are added selected papers from Nature titles, PLoS titles, Cell, PNAS, Science and Genome Biology. The filtering is done using some typical keywords that might be associated with systems and synthetic biology. Here is a simple illustration of how it works:
I still have to test the pipe for some time and tweak the filters, but it is enough to get an idea of the things that can be done with these pipes. Like the previous pipe, you can clone this one and change the filters and journals as you like.
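For anyone who prefers a script to the Pipes interface, a rough Perl equivalent of what the pipe does (fetch a few journal feeds and keep only the items matching a set of keywords) could look like this; the feed URLs are placeholders and the keyword list is just an example:

#!/usr/bin/perl
# Rough sketch of the pipe's logic: combine several journal RSS feeds and keep
# only items whose title or description match a list of keywords.
# The feed URLs are placeholders and the keyword list is just an example.
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::RSS;

my @feeds = (
    'http://www.example.org/journal1.rss',
    'http://www.example.org/journal2.rss',
);
my $keywords = qr/systems biology|synthetic biology|network|pathway/i;

foreach my $url (@feeds) {
    my $content = get($url) or next;               # skip feeds that fail to download
    my $rss = XML::RSS->new;
    eval { $rss->parse($content); 1 } or next;     # skip feeds that fail to parse
    foreach my $item ( @{ $rss->{items} } ) {
        my $text = join ' ', grep { defined } $item->{title}, $item->{description};
        print "$item->{title}\n$item->{link}\n\n" if $text =~ $keywords;
    }
}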
Community filtered journal RSS feeds


I was trying out Yahoo Pipes today to see how much we can actually program with it. It has some loop functions and regex filters, but otherwise it is currently a bit limited. One thing it is very good for is combining and filtering RSS feeds. Imagine that you want to get all the papers from a journal (or a group of journals), but only if someone else has for some reason found them interesting. This is what I tried to do with this pipe. I piped the RSS feed for MSB through a Yahoo query restricted to Connotea or CiteULike, and in return I get a feed of the papers that have been tagged by other people on these sites. The problem is that this relies on Yahoo search, so it has to wait for Yahoo to crawl those sites before it identifies a newly tagged paper, and it is also possible that a paper title is too ambiguous and therefore incorrectly matched.

To add/change the input journals, copy the pipe from here and edit the top fetch box.
(disclaimer: I am currently working for MSB)