Annotating a genome using the SEED servers (via the command line or Perl code)

Suppose that you have a newly sequenced bacterial or archaeal genome.
You have a number of options if you want to get a rapid, fairly
accurate annotation:

1. You can run the genome through RAST, which will give you
   an attempt at a comprehensive annotation that is
   browsable.

2. You can use our Mac-App or PC-App to get a very rapid
   estimate.

3. You can run a few command line scripts that call the
   RNA-encoding genes and protein-encoding genes, asserts
   functions for the gene products, and produces an initial
   metabolic reconstruction.

4. You can programmatically call the RNAs, PEGs, identify the
   functions, and generate an initial metabolic
   reconstruction.

This short tutorial will describe the last two options.



Calling RNA-Encoding Genes Using a Command Line Script

Suppose that you have the contigs of a newly-sequence genome in a file
called DNA.fasta.  You need to know the genus, species, and whether
the genome is bacterial or archaeal.  To be concrete, suppose that the
file contigs contains the genome of a Buchnera aphidicola (which is a
bacteria).  Then,

   svr_find_rnas Buchnera aphidicola bacteria < contigs > RNA.fasta 2> RNA.tbl

A fasta file of the DNA encoding the RNA genes is written to
RNA.fasta, and a 5-column tab-separated table describing the location
and function of each of the identified genes is written to RNA.tbl.
This tool was originally coded by Niels Larsen and we thank him for
making it available.

To identify the protein-encoding genes (PEGs) and get the translations
of the identified genes, one would use

   svr_call_genes < contigs > translations.fasta 2> PEG.tbl

The is a 4-column tab-separated table containing 

    1. an arbitary ID of the form "prot_nnnnn"

    2. the contig containing the gene

    3. the position of the first character of the gene

    4. the position of the last character of the gene

If the beginning position is less than the ending position, the gene
is on the '+' strand; otherwise, it is on the '-' strand.  The
coordinates include both the start and stop codons, unless the gene
runs off the end of a contig.  The genes are called by Glimmer3 [see
http://www.cbcb.umd.edu/software/glimmer/], and again we are thankful
for being able to use such a fine tool (yes, eventually there will be
better gene caller's, but Glimmer was certainly a major contribution
to the community).

Note that you do not have functions for the protein-encoding genes.
To get those, you might use

   svr_assign_using_figfams < translations.fasta > estimates.of.function 2> failed.to assign.function

This will write a 3-column tab-separated table.  The columns are as follows:

     1. A "score" that relates to reliability.  The higher, the more reliable.  
        Values usually range from 1 to several hundred.  We suggest that you run 
some benchmark genomes through (genomes for which you have the annotations),
and you can acquire some feel for the correspondence of reliability and score.

     2. An ID from the input file.

     3. The function we would assign to the protein.

This will fairly rapidly assign functions to PEGs that clearly match one of our
protein families, and it will tell you which translations could not be accurately
assigned a function using FIGfams.  If you want the server to try to assign a function
to all translations (assinging "hypothetical protein" in cases it could not make any
other assignment) add the -all option:

   svr_assign_using_figfams -all < translations.fasta > estimates.of.function

       
Finally, to place the PEGs into subsystems, you can use

   svr_metabolic_reconstruction < estimates.of.function > metabolic.reconstruction 2> pegs.not.used.in.reconstruction

This will write two files.  the file metabolic reconstruction will contain input lines
that can be placed in subsystems.  Each line will have two appended columns: the subsystem
name and the variant code.  The second file, pegs.not.used.in.reconstruction, will contain
the lines from the input file corresponding to PEGs that could not be placed into any
subsystem.


Accessing Annotation Services from a Perl Program

The actual Perl code to perform the steps illustrated above is fairly straightforward.
We will cover each of the steps using short snippets of Perl code.

First, we recommend starting with something like

use SeedEnv;
use ANNOserver;
my $annO = ANNOserver->new;
.
.
.

Then, the following code can be used to scan the contents of the file $contigsF.
It writes out the fasta sequences of the detected RNAs to the file $RNAseqs, and writes 
a description of where the genes are located in the file $RNAtbl.

open(CONTIGS,"<$contigsF") || die "could not open $contigsF";
my $results          = $annO->find_rnas( -input   => \*CONTIGS,
 -genus   => $genus,
 -species => $species,
 -domain  => $domain
                                       );

open(RNASEQS,">$RNAseqs") || die "could not open $RNAseqs";
open(RNATBL,">$RNAtbl")   || die "could not open $RNAtbl";
my($fasta,$genes) = @$results;
print RNASEQS $fasta;
close(RNASEQS);
foreach $_ (sort { $a->[0] cmp $b->[0] } @$genes)
{
    print RNATBL join("\t",@$_),"\n";
}
close(RNATBL);       
close(CONTIGS);

To locate the protein-encoding genes, write their translations to
the file $PEGtranslations, their locations to $PEGtbl, and the functions
that could be assigned to $functions.

open(TRANS,">$PEGtranslations") || die "could not open $PEGtranslations";
open(PEGTBL,">$PEGtbl")         || die "could not open $PEGtbl";
open(FUNCTIONS,">$functions")   || die "could not open $functions";
open(CONTIGS,"<$contigsF")      || die "could not open $contigsF";

$results = $annO->call_genes( -input => \*CONTIGS );
($fasta,$genes) = @$results;
print TRANS $fasta;
close(TRANS);

foreach $_ (sort { $a->[0] cmp $b->[0] } @$genes)
{
    print PEGTBL join("\t",@$_),"\n";
}
close(PEGTBL);       

open(TRANS,"<$PEGtranslations") || die "could not open $PEGtranslations";
my $handle = $annO->assign_function_to_prot( -input => \*TRANS, 
     -assignToAll => 1,
     -kmer => 8,
     -scoreThreshhold => 3,
     -hitThreshhold => 3,
     -normalizeScores => 0,
     -detailed => 0
                                           );
my @roles = ();
while (my $tuple = $handle->get_next())
{
    my($peg,$function,undef,$score) = @$tuple;
    foreach my $role (&SeedUtils::roles_of_function($function))
    {
push(@roles,[$role,$peg]);
    }
    print FUNCTIONS join("\t",($score,$peg,$function)),"\n";
}
close(TRANS);
close(FUNCTIONS);


Finally, the code to produce the metabolic reconstruction estimate from
the list of [$role,$peg] pairs would be as follows:

open(MR,">$metabolic_recon") || die "could not open $metabolic_recon";
my $recon = $annO->metabolic_reconstruction( -roles => \@roles );
foreach $_ (@$recon)
{
    print MR join("\t",@$_),"\n";
}
close(MR);

These three snippets of code are nontrivial, but when you realize that
they actually support 

     1. the calling of genes, 
     2. assignment of functions to the genes, and 
     3. the derivation of a metabolic reconstruction, 

we hope that you agree that they are actually quite minimal.