Annotating a genome using the SEED servers (via the command line or Perl code)

Suppose that you have a newly sequenced bacterial or archaeal genome.

You have a number of options if you want to get a rapid, fairly

accurate annotation:

1. You can run the genome through RAST, which will give you

an attempt at a comprehensive annotation that is

browsable.

2. You can use our Mac-App or PC-App to get a very rapid

estimate.

3. You can run a few command line scripts that call the

RNA-encoding genes and protein-encoding genes, asserts

functions for the gene products, and produces an initial

metabolic reconstruction.

4. You can programmatically call the RNAs, PEGs, identify the

functions, and generate an initial metabolic

reconstruction.

This short tutorial will describe the last two options.

Calling RNA-Encoding Genes Using a Command Line Script

Suppose that you have the contigs of a newly-sequence genome in a file

called DNA.fasta. You need to know the genus, species, and whether

the genome is bacterial or archaeal. To be concrete, suppose that the

file contigs contains the genome of a Buchnera aphidicola (which is a

bacteria). Then,

svr_find_rnas Buchnera aphidicola bacteria < contigs > RNA.fasta 2> RNA.tbl

A fasta file of the DNA encoding the RNA genes is written to

RNA.fasta, and a 5-column tab-separated table describing the location

and function of each of the identified genes is written to RNA.tbl.

This tool was originally coded by Niels Larsen and we thank him for

making it available.

To identify the protein-encoding genes (PEGs) and get the translations

of the identified genes, one would use

svr_call_genes < contigs > translations.fasta 2> PEG.tbl

The is a 4-column tab-separated table containing

1. an arbitary ID of the form "prot_nnnnn"

2. the contig containing the gene

3. the position of the first character of the gene

4. the position of the last character of the gene

If the beginning position is less than the ending position, the gene

is on the '+' strand; otherwise, it is on the '-' strand. The

coordinates include both the start and stop codons, unless the gene

runs off the end of a contig. The genes are called by Glimmer3 [see

http://www.cbcb.umd.edu/software/glimmer/], and again we are thankful

for being able to use such a fine tool (yes, eventually there will be

better gene caller's, but Glimmer was certainly a major contribution

to the community).

Note that you do not have functions for the protein-encoding genes.

To get those, you might use

svr_assign_using_figfams < translations.fasta > estimates.of.function 2> failed.to assign.function

This will write a 3-column tab-separated table. The columns are as follows:

1. A "score" that relates to reliability. The higher, the more reliable.

Values usually range from 1 to several hundred. We suggest that you run

some benchmark genomes through (genomes for which you have the annotations),

and you can acquire some feel for the correspondence of reliability and score.

2. An ID from the input file.

3. The function we would assign to the protein.

This will fairly rapidly assign functions to PEGs that clearly match one of our

protein families, and it will tell you which translations could not be accurately

assigned a function using FIGfams. If you want the server to try to assign a function

to all translations (assinging "hypothetical protein" in cases it could not make any

other assignment) add the -all option:

svr_assign_using_figfams -all < translations.fasta > estimates.of.function

Finally, to place the PEGs into subsystems, you can use

svr_metabolic_reconstruction < estimates.of.function > metabolic.reconstruction 2> pegs.not.used.in.reconstruction

This will write two files. the file metabolic reconstruction will contain input lines

that can be placed in subsystems. Each line will have two appended columns: the subsystem

name and the variant code. The second file, pegs.not.used.in.reconstruction, will contain

the lines from the input file corresponding to PEGs that could not be placed into any

subsystem.

Accessing Annotation Services from a Perl Program

The actual Perl code to perform the steps illustrated above is fairly straightforward.

We will cover each of the steps using short snippets of Perl code.

First, we recommend starting with something like

use SeedEnv;

use ANNOserver;

my $annO = ANNOserver->new;

Then, the following code can be used to scan the contents of the file $contigsF.

It writes out the fasta sequences of the detected RNAs to the file $RNAseqs, and writes

a description of where the genes are located in the file $RNAtbl.

open(CONTIGS,"<$contigsF") || die "could not open $contigsF";

my $results = $annO->find_rnas( -input => \*CONTIGS,

-genus => $genus,

-species => $species,

-domain => $domain

);

open(RNASEQS,">$RNAseqs") || die "could not open $RNAseqs";

open(RNATBL,">$RNAtbl") || die "could not open $RNAtbl";

my($fasta,$genes) = @$results;

print RNASEQS $fasta;

close(RNASEQS);

foreach $_ (sort { $a->[0] cmp $b->[0] } @$genes)

{

print RNATBL join("\t",@$_),"\n";

}

close(RNATBL);

close(CONTIGS);

To locate the protein-encoding genes, write their translations to

the file $PEGtranslations, their locations to $PEGtbl, and the functions

that could be assigned to $functions.

open(TRANS,">$PEGtranslations") || die "could not open $PEGtranslations";

open(PEGTBL,">$PEGtbl") || die "could not open $PEGtbl";

open(FUNCTIONS,">$functions") || die "could not open $functions";

open(CONTIGS,"<$contigsF") || die "could not open $contigsF";

$results = $annO->call_genes( -input => \*CONTIGS );

($fasta,$genes) = @$results;

print TRANS $fasta;

close(TRANS);

foreach $_ (sort { $a->[0] cmp $b->[0] } @$genes)

{

print PEGTBL join("\t",@$_),"\n";

}

close(PEGTBL);

open(TRANS,"<$PEGtranslations") || die "could not open $PEGtranslations";

my $handle = $annO->assign_function_to_prot( -input => \*TRANS,

-assignToAll => 1,

-kmer => 8,

-scoreThreshhold => 3,

-hitThreshhold => 3,

-normalizeScores => 0,

-detailed => 0

);

my @roles = ();

while (my $tuple = $handle->get_next())

{

my($peg,$function,undef,$score) = @$tuple;

foreach my $role (&SeedUtils::roles_of_function($function))

{

push(@roles,[$role,$peg]);

}

print FUNCTIONS join("\t",($score,$peg,$function)),"\n";

}

close(TRANS);

close(FUNCTIONS);

Finally, the code to produce the metabolic reconstruction estimate from

the list of [$role,$peg] pairs would be as follows:

open(MR,">$metabolic_recon") || die "could not open $metabolic_recon";

my $recon = $annO->metabolic_reconstruction( -roles => \@roles );

foreach $_ (@$recon)

{

print MR join("\t",@$_),"\n";

}

close(MR);

These three snippets of code are nontrivial, but when you realize that

they actually support

1. the calling of genes,

2. assignment of functions to the genes, and

3. the derivation of a metabolic reconstruction,

we hope that you agree that they are actually quite minimal.

Annotating a genome using the SEED servers (via the command line or Perl code)

Categories:

Search

API Listings

SEED Utilities

About this Entry

Categories

Pages

Presentations