Annotating a Genome
A Quick, Inexpensive Approach to Annotating a Prokaryotic Genome
As it becomes possible to quickly and cheaply acquire the genomes of
organisms, the need to produce accurate annotations quickly has become
more pressing. This short tutorial is designed to enable a user to
produce relatively accurate annotations quite quickly (under a week
for most prokaryotic genomes). The steps we will describe are as follows:
1. First, you submit the contigs representing the sequence of
the organism to the RAST server (or any similar server),
which produces an initial annotation.
2. Then, we advocate "walking your genome" rapidly to gain a
sense of how closely it matches existing (previously
sequenced and annotated) genomes, to delete clearly
miscalled genes, and to gain an understanding of the number
of potential problems (e.g., frameshifts) that exist.
We suggest correcting any clearly improvable functions that
may have been assigned incorrectly in step 1 as you walk
through the genome.
3. You automatically place the genes into subsystems, giving
an overview of the cellular machinery that has been identified.
These three steps are just the start of extracting information from a
new genome, but they do offer a technology that will give you a
reasonably annotated genome that can be used effectively.
Running Your Genome Through RAST
The first step involves acquiring an initial annotation. We suggest
that you use RAST or our MacApp for doing so, but there are other
services and approaches to getting an initial annotation.
A separate tutorial explains how to get a RAST account and submit
a genome for annotation.
"Walking" Your Genome
However you decide to manually annotate your genome, we suggest using
an environment that supports efficiently "walking through the genome"
comparing regions against those in previously sequenced and annotated
genomes. This can be done quite rapidly if you use a suitable
framework. Here we are talking about visually inspecting all of the
genes in about 1 to 3 work days. This can be somewhat tedious, but
what emerges is a reasonably annotated genome for which you have a
pretty good overview of what is there.
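To make the idea of "walking" concrete, here is a minimal sketch in Python. It assumes the initial annotation has been exported as a list of genes, each with a contig, a start position, and an assigned function. The names Gene and walk are hypothetical and do not belong to RAST or any other tool. The sketch only orders the genes and presents each one with its immediate neighbors on the same contig. A real walking environment would also display the matching regions from previously annotated genomes.

```python
from collections import namedtuple

# Hypothetical record for one called gene; the field names are assumptions.
Gene = namedtuple("Gene", "gene_id contig start function")


def walk(genes, flank=2):
    """Yield (focus_gene, window) pairs in contig order.

    The window holds the focus gene plus up to `flank` genes on either
    side of it, restricted to genes on the same contig.
    """
    ordered = sorted(genes, key=lambda g: (g.contig, g.start))
    for i, focus in enumerate(ordered):
        lo = max(0, i - flank)
        window = [g for g in ordered[lo:i + flank + 1]
                  if g.contig == focus.contig]
        yield focus, window


# Toy input, deliberately out of order (all identifiers are made up).
genes = [
    Gene("g3", "contig_1", 900, "hypothetical protein"),
    Gene("g1", "contig_1", 100, "Histidine ammonia-lyase"),
    Gene("g2", "contig_1", 500, "Urocanate hydratase"),
    Gene("g4", "contig_2", 50, "hypothetical protein"),
]

steps = [(focus.gene_id, [g.gene_id for g in window])
         for focus, window in walk(genes, flank=1)]
```

Each step of the loop corresponds to one stop on the walk. At each stop the annotator would decide whether to keep the gene, delete it as miscalled, correct its function, or flag a possible frameshift.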
Building a Metabolic Reconstruction
It is useful to group the recognized genes into the recognized
pathways, complexes, and nonmetabolic molecular machines. Here is how
we view this process:
1. Our annotation team has constructed sets of functional roles
that are annotated simultaneously because the functional roles
are related. The roles may be distinct subunits of a complex
(e.g., the subunits of the ATP synthase or the ribosomal
proteins), a set of functional roles that constitute a pathway
(e.g., Histidine Degradation) or the genes may make up a
nonmetabolic molecular machine (e.g., a repair machine, a
transport cassette, or a 2-component regulatory system). We
call each of these sets of roles a "subsystem". Our annotators
have carefully assembled the functional roles that make up a
subsystem and for each one constructed a spreadsheet in which
each row is a genome and each column is a distinct functional
role. The cells of the spreadsheet contain the genes from the
specific genome that implement the specific functional role.
For example (SEE POWERPOINT PICTURES OF HISTIDINE DEGRADATION).
2. Using the examples contained in the manually curated set of
subsystems, we automatically try to locate the appropriate genes
within the newly sequenced genome and identify a new instance
(i.e., a new row in the spreadsheet) of the subsystem. When we
can identify all of the genes needed to implement an operational
version of the subsystem, it substantially increases the
confidence we have in the assigned functions, and it forms a
critical piece of information needed to support the generation
of metabolic models.
3. Where we recognize only a portion of a subsystem, we may have
failed to accurately identify some genes, we may have misannotated
genes, or we may have a new variant of the subsystem (e.g., a
new variant of a common pathway).
4. We consider a metabolic reconstruction to simply be the set of
recognized, operational instances of our subsystem collection.
This is distinct from an actual initial estimate of the
metabolic network (which we provide, as well). The metabolic
reconstruction includes information about the nonmetabolic
machinery supported by the genome. We are not completely happy
with the term "metabolic reconstruction", but that is the term
that has stuck and the one in common usage within our group.
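The spreadsheet view of a subsystem and the definition of a metabolic reconstruction given above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual data model behind RAST. The class Subsystem, the function metabolic_reconstruction, the gene identifiers, and the reduced role lists are all hypothetical. The sketch also assumes that an instance is "operational" only when every role has at least one gene, which ignores the subsystem variants mentioned in item 3.

```python
from dataclasses import dataclass, field


@dataclass
class Subsystem:
    """One subsystem spreadsheet: rows are genomes, columns are roles."""
    name: str
    roles: list                                # column headers (functional roles)
    rows: dict = field(default_factory=dict)   # genome -> {role: [gene ids]}

    def add_genome(self, genome, assignments):
        """Add a row by sorting the genome's genes into role columns.

        `assignments` maps gene id -> assigned function (an assumed format).
        """
        row = {role: [] for role in self.roles}
        for gene_id, function in sorted(assignments.items()):
            if function in row:
                row[function].append(gene_id)
        self.rows[genome] = row
        return row

    def missing_roles(self, genome):
        """Roles (columns) with an empty cell for this genome."""
        return [role for role in self.roles if not self.rows[genome][role]]

    def status(self, genome):
        # Simplifying assumption: every role is required.
        missing = self.missing_roles(genome)
        if not missing:
            return "operational"
        if len(missing) < len(self.roles):
            return "partial"
        return "absent"


def metabolic_reconstruction(subsystems, genome):
    """Names of the subsystems with an operational instance in the genome."""
    return [s.name for s in subsystems if s.status(genome) == "operational"]


# Reduced, illustrative role lists (not the curated subsystem definitions).
histidine = Subsystem("Histidine Degradation", [
    "Histidine ammonia-lyase",
    "Urocanate hydratase",
    "Imidazolonepropionase",
    "Formiminoglutamase",
])
atp = Subsystem("ATP synthase (toy)", [
    "ATP synthase alpha chain",
    "ATP synthase beta chain",
])

# Made-up gene ids and function assignments for a newly annotated genome.
new_genome = {
    "gene_0101": "Histidine ammonia-lyase",
    "gene_0102": "Urocanate hydratase",
    "gene_0103": "Imidazolonepropionase",
    "gene_0104": "Formiminoglutamase",
    "gene_0950": "ATP synthase alpha chain",
    "gene_1200": "hypothetical protein",
}
for subsystem in (histidine, atp):
    subsystem.add_genome("new_genome", new_genome)

reconstruction = metabolic_reconstruction([histidine, atp], "new_genome")
```

In this toy case the histidine subsystem is fully populated and enters the reconstruction, while the ATP synthase example is only partial. Calling atp.missing_roles("new_genome") lists the role that still lacks a gene, which is the kind of gap that prompts a second look for a missed or misannotated gene.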
The 3-step process we outline for acquiring reasonably good
annotations and an initial metabolic reconstruction for a prokaryotic genome works
well for genomes that are "close" to well-annotated existing genomes.
For truly divergent genomes, it is a good starting point, but much
more effort is required to achieve what one might think of as an
"acceptable annotation". The virtue of our approach is that, in most
cases, you can acquire a usable annotation in 1-3 days. We have
invited groups that have spent man-years annotating specific genomes
to compare their annotations with ours, and for the most part our
annotations were very close to the carefully done manual efforts.