Thursday, November 17, 2005

Google Base and Bioinformatics II

The Google Base service is officially open in beta (as usual). It is mostly disappointing because you cannot really do anything with it (read previous post). You can load tons of data very rapidly, although bulk uploads take a long time to process. Maybe this will speed up in the future. The problem is that once you have your structured data in Google Base, you cannot do anything with it apart from searching it and looking at it in the browser. I uploaded a couple of protein sequences just for fun. I called the item type "biological sequence" and gave it very simple attributes: sequence, id, and type. The upload failed because I did not have a title, so I added a title attribute and just copied the id field into it. Not very exciting, right?

I guess you can scrape the data off it automatically, but that is not very nice. This, for example, gets the object ids for the biological sequences I uploaded:

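use strict;
use warnings;
use LWP::UserAgent;

# Fetch the search results page for the uploaded items.
my $url = "http://base.google.com/base/search?q=biological+sequence";
my $ua  = LWP::UserAgent->new;
my $res = $ua->get($url);
die "Request failed: ", $res->status_line, "\n" unless $res->is_success;

# Keep a local copy of the page.
open(my $out, '>', 'google.base.temp') or die "outputfile didn't open $!";
print $out $res->content;
close $out;

# Map each object id to its link text, for every oid=NNN">text< match.
my %data;
open(my $in, '<', 'google.base.temp') or die "Error in input $!";
while (my $line = <$in>) {
    $data{$1} = $2 while $line =~ /oid=([0-9]+)">([^<\s]+)</g;
}
close $in;

# Print the object ids, one per line.
print "$_\n" for keys %data;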

With the object ids, you can then fetch each item page the same way to get the sequences.
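To reuse the matching step for more than one page, it helps to pull it into a small function that works on the page content in memory. This is only a minimal sketch: the name extract_oids and the sample HTML are made up for illustration, and it assumes the links look like oid=NUMBER">TEXT<, which is what the regular expression in the script expects. I have not checked what the item pages themselves look like, so the pattern would need adjusting there.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash of object id => link text
# for every oid=NNN">text< occurrence in an HTML string.
sub extract_oids {
    my ($html) = @_;
    my %data;
    while ($html =~ /oid=([0-9]+)">([^<\s]+)</g) {
        $data{$1} = $2;
    }
    return %data;
}

# Made-up sample standing in for a fetched results page.
my $sample = '<a href="/base/items?oid=111">P12345</a> '
           . '<a href="/base/items?oid=222">Q67890</a>';

my %found = extract_oids($sample);
print "$_\t$found{$_}\n" for sort keys %found;
```

In the script, you would pass $res->content to the function instead of the sample string, which also removes the need for the temporary file.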

Anyway, everybody is half expecting that one day Google will release an API to do this properly. So, coming back to scientific research, is this useful for anything? Even with a proper API this is just a database. It will make it easy for people to rapidly set up a database, and maybe Google can make a simple template webpage service to display the content of the structured database. It would be a nice add-on to Blogger, for example. You could get a tile to put in your blog with an easy way to display the content of your structured database.

For virtual online collaborative research (aka science 2.0 :)?) this is potentially useful because you get a free tool to set up a database for a given project. Apart from this I don't see potential applications but, as the name says, it is just the base for something.