Cellular Consequences of Genetic variation: databases

Friday, March 21, 2008

The structured abstract experiment at FEBS Letters

The journal "FEBS Letters" is starting a publishing experiment on structured abstracts. As described in the editorial, the experiment is aimed at:
"integrating each manuscript with a structured summary precisely reporting, with database identifiers and predefined controlled vocabularies, the protein interactions reported in the manuscript."

The experiment is a collaboration between FEBS Letters and the interaction database MINT. It started at the beginning of this year and will last 6 months. It will try to evaluate the necessary tools and the authors' "degree of interest (and competence) to invest" in this annotation process.

It will be very interesting to see whether authors are willing to do this extra bit of work, and how much it might facilitate the annotation efforts.

Tuesday, March 27, 2007

Google Base API

While we are waiting for Freebase to give us a chance to preview their service, we can go ahead and try something that is probably very similar in spirit. Google Base has been up for a long time, but only recently have they opened it up for automatic access (see Google Base API). There are some restrictions, but in the end we can think of it as a free online database that we can use remotely.

How easy is it to use? If you like Java, C# or PHP, you are in luck, because they have client libraries to help you get started.

I also found this Google Base code on CPAN and decided to give it a try. After reading some of the information on the API website and having a look at the code, it comes down to 3 main tasks: 1) authentication; 2) insert/delete/update; 3) query.

With the above-mentioned CPAN module installed, the authentication step is easy:


use WWW::Google::API::Base;
my $api_user = "username"; # Google user name
my $api_pass = "pass";     # Google password
my $api_key  = "API_KEY";  # any program using the API must get a key


my $gbase = WWW::Google::API::Base->new(
                             { auth_type => 'ProgrammaticLogin',
                               api_key   => $api_key,
                               api_user  => $api_user,
                               api_pass  => $api_pass  },
                                     { } );

That's it: $gbase is now authorized to use that Google account in GBase.

Inserting something useful into the database requires a bit more effort. The CPAN module comes with an example of how to insert recipes. I am not that great a cook, so I added a new function, insertSequence, to the Base.pm that comes with the module:


sub insertSequence {
  my $self= shift;
  my $id = shift;
  my $seq_string = shift;
  my $seq_type = shift;
  my $spp = shift;
  my $l=shift;
  $self->client->ua->default_header('content-type', 
                                    'application/atom+xml');
  # The XML declaration has to be the very first thing in the document,
  # so the heredoc body starts immediately, with no blank line before it.
  my $xml = <<EOF;
<?xml version='1.0'?>
<entry xmlns='http://www.w3.org/2005/Atom'
xmlns:g='http://base.google.com/ns/1.0'
xmlns:c='http://base.google.com/cns/1.0'>
<author>
<name>API insert</name>
</author>
<category scheme='http://www.google.com/type' term='googlebase.item'/>
<title type='text'>$id</title>
<content type='text'>$seq_string</content>
<g:item_type>sequence</g:item_type>
<g:spp type='text'>$spp</g:spp>
<g:length type='int'>$l</g:length>
<g:sequence_type type='text'>$seq_type</g:sequence_type>
</entry>
EOF
 
  my $insert_request = HTTP::Request->new( 
                        POST => 'http://www.google.com/base/feeds/items',
                                      $self->client->ua->default_headers,
                                      $xml);
  my $response;
  eval {
    $response = $self->client->do($insert_request);
  };
  if ($@) {
    my $error = $@;
    die $error;
  }
  my $atom = $response->content;
  my $entry = XML::Atom::Entry->new(\$atom);
  return $entry
}

The function takes information about the sequence (the ID, the sequence string, the type, the species and the length) and creates an XML entry to submit to Google Base, according to the specifications they provide on the website. In this case it will be an entry of type "sequence" (which is non-standard for GBase). The only catch was that I could not get the sequence string into an item attribute of type text, because there seems to be a size limit on these. This is why the sequence goes in the description (the content element).
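One thing insertSequence does not do is escape the values it interpolates into the XML, so an ID or species name containing a character such as & or < would produce a malformed entry. A minimal sketch of a helper that could be applied first; the name xml_escape is hypothetical and not part of the CPAN module:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Google::API::Base): replace the
# five characters that are special in XML with their predefined entities.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first, so later entities survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/'/&apos;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}
```

Inside insertSequence this would mean something like $id = xml_escape($id); before the heredoc is built, and the same for $spp and $seq_type.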

Ok, with this new function adding a sequence to the database is easy. After the authentication code as above we just need to do:

$gbase->insertSequence($simple ,$seq_str,
                           "protein","s.cerevisiae",$l);

The variables just need to be populated from somewhere first. According to the Google API FAQ, there is a limit of 5 queries per second. In about 25 lines we can get a FASTA to GBase pipe. Here is an example of a protein sequence in GBase (it might get deleted in time).

Now I guess one of the interesting parts is that we can use Google to filter results using the Google Base query language. The CPAN module above already has a query tool. It is still very simple, but it gets the results of a search into an Atom object. Here is a query that returns items from S. cerevisiae with a length between 350 and 400:
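A rough sketch of what such a FASTA to GBase pipe could look like. The names read_fasta and fasta_to_gbase are made up for this illustration, $gbase is assumed to be the authenticated object from above with insertSequence added to it, and the 0.25 second pause assumes the 5 per second limit also applies to inserts:

```perl
use strict;
use warnings;

# Read FASTA records from a filehandle and return a list of
# [id, sequence] pairs. The id is the first word after '>'.
sub read_fasta {
    my ($fh) = @_;
    my (@records, $id, $seq);
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                 # drop newline and trailing spaces
        next if $line eq '';                # skip blank lines
        if ($line =~ /^>(\S+)/) {
            push @records, [$id, $seq] if defined $id;
            ($id, $seq) = ($1, '');
        }
        elsif (defined $id) {
            $seq .= $line;                  # sequence may span several lines
        }
    }
    push @records, [$id, $seq] if defined $id;   # last record
    return @records;
}

# Hypothetical driver: insert every record through insertSequence and
# return the number of inserts attempted.
sub fasta_to_gbase {
    my ($gbase, $fh, $seq_type, $spp) = @_;
    my $count = 0;
    for my $record (read_fasta($fh)) {
        my ($id, $seq) = @$record;
        $gbase->insertSequence($id, $seq, $seq_type, $spp, length($seq));
        $count++;
        select(undef, undef, undef, 0.25);  # pause to stay under 5 requests/s
    }
    return $count;
}
```

With a file at hand this would be called as open my $fh, '<', 'proteins.fasta' or die $!; followed by fasta_to_gbase($gbase, $fh, "protein", "s.cerevisiae"); where the file name is only a placeholder.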

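# Build the URL by concatenation: a line break inside the quotes would
# end up inside the URL itself.
my $new_id = "http://www.google.com/base/feeds/items"
           . "?bq=[spp:s.cerevisiae][length(int) : 350..400]";
my $select_inserted_entry;
eval {
    $select_inserted_entry = $gbase->select($new_id);
    print $select_inserted_entry->as_xml;  # The output in XML format
};
if ($@) {
    my $e = $@;
    # The error is expected to be an HTTP::Response; fall back to the
    # plain message if it is an ordinary string.
    die ref($e) ? $e->status_line : $e;
}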

I am not sure yet whether these items are available for other users to query, nor what code would do it. I think this example only gets the items in my account. This was as far as I got. The last step would be to have an XML parser turn the returned Atom object into something more useful.
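That last step could be sketched roughly as follows, using XML::LibXML rather than anything shipped with the CPAN module. The function name atom_to_records is made up, and the sketch assumes the returned XML contains entry elements carrying the same title, content and g: attributes that insertSequence wrote:

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Turn an Atom document (a feed, or a single entry) into a list of
# hash references, one per entry. Assumes the g: attributes used by
# insertSequence above; fields missing from an entry come back empty.
sub atom_to_records {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(atom => 'http://www.w3.org/2005/Atom');
    $xpc->registerNs(g    => 'http://base.google.com/ns/1.0');

    my @records;
    for my $entry ($xpc->findnodes('//atom:entry')) {
        push @records, {
            id       => $xpc->findvalue('atom:title',      $entry),
            sequence => $xpc->findvalue('atom:content',    $entry),
            spp      => $xpc->findvalue('g:spp',           $entry),
            length   => $xpc->findvalue('g:length',        $entry),
            type     => $xpc->findvalue('g:sequence_type', $entry),
        };
    }
    return @records;
}
```

It would be fed the XML string from the query, for example atom_to_records($select_inserted_entry->as_xml), after which each record can be printed as FASTA or filtered further in plain Perl.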

Tuesday, March 20, 2007

BIND bought by Thomson Scientific

(via email and Open Access News) Thomson Scientific has acquired Unleashed Informatics, the owner of, among other resources, the Biomolecular Interaction Network Database (BIND). The email announcing the transaction claims that the information will remain freely available:
"We will continue to provide you with open access to BOND — there will be
no change to the way you obtain and use the information you currently
have available — and we will work to ensure that the database remains knowledgeable, up-to-date and within the current high editorial standards."

Hopefully this will mean a better chance of survival for the database. BIND was created in 1998 under the Blueprint Initiative, led by Chris Hogue at Mount Sinai Hospital in Toronto. It is one of the best examples of the difficulties of financing scientific databases with current grant schemes. In 2005, having burned through something like $25 million and with a staff of 68 curators, the database started having trouble securing funding to keep up with the costs. Finally, in November 2005, the database stopped its curation activities, and Unleashed Informatics (a spin-off also created by Chris Hogue) bought the rights to BIND and kept the data available.

In December 2006 Chris Hogue wrote in his blog:
"I am no longer employed by Mount Sinai Hospital and the Blueprint Initiative is now, obsolete. (...) For now I am self-employed as a writer and working part-time at Unleashed Informatics."

According to the public announcement posted on the Unleashed Informatics site, the management team of UI will now be part of Thomson Scientific, so it is possible that Chris Hogue is now heading Unleashed under the safer umbrella of Thomson.

Previous posts regarding BIND and database funding:
BIND database runs out of funding
BIND in the news
Stable scientific databases