Annotating a Genome

A Quick, Inexpensive Approach to Annotating a Prokaryotic Genome

As it becomes possible to quickly and cheaply acquire the genomes of

organisms, the need to produce accurate annotations quickly has become

more pressing. This short tutorial is designed to enable a user to

produce relatively accurate annotations quite quickly (under a week

for most prokaryotic genomes). The steps we will describe are as

follows:

1. First, you submit the contigs representing the sequence of

the organism to the RAST server (or any similar server),

which produces an initial annotation.

2. Then, we advocate "walking your genome" rapidly to gain a

sense of how closely it matches existing (previously

sequenced and annotated) genomes, to delete clearly

miscalled genes, and to gain an understanding of the number

of potential problems (e.g., frameshifts) that exist.

We suggest correcting any clearly improvable functions that

may have been assigned incorrectly in step 1 as you walk

through the genome.

3. You automatically place the genes into subsystems, giving

an overview of the cellular machinery that has been

successfully identified.

These three steps are just the start of extracting information from a

new genome, but they offer a procedure that gives you a reasonably

annotated genome that the research community can use effectively.

Running Your Genome Through RAST

--------------------------------

The first step involves acquiring an initial annotation. We suggest

that you use RAST or our MacApp for doing so, but there are other

services and approaches to getting an initial annotation.

A separate tutorial describes how to get a RAST account and submit

a genome for annotation.

"Walking" Your Genome

---------------------

However you decide to manually annotate your genome, we suggest using

an environment that supports efficiently "walking through the genome"

comparing regions against those in previously sequenced and annotated

genomes. This can be done quite rapidly if you use a suitable

framework. Here we are talking about visually inspecting all of the

genes in about 1 to 3 work days. This can be somewhat tedious, but

what emerges is a reasonably annotated genome for which you have a

pretty good overview of what is there.
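
The ordering part of a "walk" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gene record, its field names, and the walk_genome function are hypothetical and are not part of RAST or the SEED tools. The sketch visits genes contig by contig in positional order and hands back each gene with a few neighbors on either side. The comparison against previously annotated genomes is left to the annotation environment you use.

```python
from collections import defaultdict
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Gene(NamedTuple):
    """Hypothetical gene record; field names are assumptions for this sketch."""
    gene_id: str
    contig: str
    start: int      # start > end is allowed and means the minus strand
    end: int
    function: str


def walk_genome(genes: Iterable[Gene], flank: int = 2) -> Iterator[Tuple[Gene, List[Gene]]]:
    """Yield (focus gene, neighboring genes) in positional order, contig by contig."""
    by_contig = defaultdict(list)
    for gene in genes:
        by_contig[gene.contig].append(gene)

    for contig in sorted(by_contig):
        # Order by leftmost coordinate so strand does not affect the walk.
        ordered = sorted(by_contig[contig], key=lambda g: min(g.start, g.end))
        for i, focus in enumerate(ordered):
            left = ordered[max(0, i - flank):i]
            right = ordered[i + 1:i + 1 + flank]
            yield focus, left + right


if __name__ == "__main__":
    demo = [
        Gene("peg.3", "contig1", 300, 400, "hypothetical protein"),
        Gene("peg.1", "contig1", 10, 100, "Histidine ammonia-lyase"),
        Gene("peg.2", "contig1", 250, 120, "Urocanate hydratase"),
    ]
    for focus, neighbors in walk_genome(demo, flank=1):
        print(focus.gene_id, focus.function, [n.gene_id for n in neighbors])
```

Showing each gene together with its neighbors reflects the way a walk is meant to be used: a miscalled gene or a frameshift is usually easier to spot in the context of the surrounding region than in isolation.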

Building a Metabolic Reconstruction

-----------------------------------

It is useful to group the recognized genes into the recognized

pathways, complexes, and nonmetabolic molecular machines. Here is how

we view this process:

1. Our annotation team has constructed sets of functional roles

that are annotated simultaneously because the functional roles

are related. The roles may be distinct subunits of a complex

(e.g., the subunits of the ATP synthase or the ribosomal

proteins), a set of functional roles that constitute a pathway

(e.g., Histidine Degradation) or the genes may make up a

nonmetabolic molecular machine (e.g., a repair machine, a

transport cassette, or a 2-component regulatory system). We

call each of these sets of roles a "subsystem". Our annotators

have carefully assembled the functional roles that make up a

subsystem and for each one constructed a spreadsheet in which

each row is a genome and each column is a distinct functional

role. The cells of the spreadsheet contain the genes from the

specific genome that implement the specific functional role.

For example (SEE POWERPOINT PICTURES OF HISTIDINE DEGRADATION).

2. Using the examples contained in the manually curated set of

subsystems, we automatically try to locate the appropriate genes

within the newly sequenced genome and identify a new instance

(i.e., a new row in the spreadsheet) of the subsystem. When we

can identify all of the genes needed to implement an operational

version of the subsystem, it substantially increases the

confidence we have in the assigned functions, and it forms a

critical piece of information needed to support the generation

of metabolic models.

3. Where we recognize only a portion of a subsystem, we may have

failed to accurately identify some genes, we may have misannotated

genes, or we may have a new variant of the subsystem (e.g., a

new variant of a common pathway).

4. We consider a metabolic reconstruction to simply be the set of

recognized, operational instances of our subsystem collection.

This is distinct from an actual initial estimate of the

metabolic network (which we provide, as well). The metabolic

reconstruction includes information about the nonmetabolic

machinery supported by the genome. We are not completely happy

with the term "metabolic reconstruction", but that is the term

that has stuck and the one in common usage within our group.
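
The spreadsheet view described in the list above can be sketched as a small data structure. This is a minimal illustration, not the SEED implementation: the function names, the gene identifiers, and the simple rule that a subsystem is "operational" only when every role has at least one gene are all assumptions made here (real subsystems can have variants that need only a subset of roles). The four role names are the enzymes of a common histidine degradation route and are used for illustration; they are not claimed to be the curated subsystem's exact role list.

```python
from typing import Dict, List

# Hypothetical subsystem collection: subsystem name -> functional roles (the columns).
SUBSYSTEMS: Dict[str, List[str]] = {
    "Histidine Degradation": [
        "Histidine ammonia-lyase",
        "Urocanate hydratase",
        "Imidazolonepropionase",
        "Formiminoglutamase",
    ],
}


def build_row(annotations: Dict[str, str], roles: List[str]) -> Dict[str, List[str]]:
    """Build one spreadsheet row for a genome: role -> genes assigned that role."""
    row: Dict[str, List[str]] = {role: [] for role in roles}
    for gene_id, function in annotations.items():
        if function in row:
            row[function].append(gene_id)
    return row


def classify(row: Dict[str, List[str]]) -> str:
    """Assumed rule: all roles filled = operational, none = absent, else partial."""
    missing = [role for role, genes in row.items() if not genes]
    if not missing:
        return "operational"
    if len(missing) == len(row):
        return "absent"
    return "partial"


def metabolic_reconstruction(annotations: Dict[str, str],
                             subsystems: Dict[str, List[str]]) -> Dict[str, Dict[str, List[str]]]:
    """Collect the operational subsystem instances recognized in one genome."""
    reconstruction = {}
    for name, roles in subsystems.items():
        row = build_row(annotations, roles)
        if classify(row) == "operational":
            reconstruction[name] = row
    return reconstruction


# Hypothetical annotations for a new genome: gene id -> assigned function.
annotations = {
    "peg.1": "Histidine ammonia-lyase",
    "peg.2": "Urocanate hydratase",
    "peg.3": "Imidazolonepropionase",
    "peg.4": "Formiminoglutamase",
    "peg.5": "hypothetical protein",
}

if __name__ == "__main__":
    for name, row in metabolic_reconstruction(annotations, SUBSYSTEMS).items():
        print(name, row)
```

A row classified as "partial" corresponds to the situation in item 3: a gene may have been missed or misannotated, or the genome may carry a new variant of the subsystem, so these rows are the ones worth inspecting by hand.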

Summary

-------

The 3-step process we outline for acquiring reasonably good

annotations and an initial metabolic reconstruction for a prokaryotic

genome works well for genomes that are "close" to well-annotated

existing genomes.

For truly divergent genomes, it is a good starting point, but much

more effort is required to achieve what one might think of as an

"acceptable annotation". The virtue of our approach is that, in most

cases, you can acquire a usable annotation in 1-3 days. We have

invited groups that have spent man-years annotating specific genomes

to compare their results with ours, and for the most part our

annotations were very close to the carefully done manual efforts.
