Annotating a Genome

A Quick, Inexpensive Approach to Annotating a Prokaryotic Genome

As it becomes possible to quickly and cheaply acquire the genomes of
organisms, the need to produce accurate annotations quickly has become
more pressing.  This short tutorial is designed to enable a user to
produce relatively accurate annotations quite quickly (under a week
for most prokaryotic genomes).  The steps we will describe are as
follows:

1. First, you submit the contigs representing the sequence of
  the organism to the RAST server (or any similar server),
  which produces an initial annotation.

2. Then, we advocate "walking your genome" rapidly to gain a
  sense of how closely it matches existing (previously
  sequenced and annotated) genomes, to delete clearly
  miscalled genes, and to gain an understanding of the number
  of potential problems (e.g., frameshifts) that exist.
  As you walk through the genome, we suggest correcting any
  function assigned in step 1 that is clearly wrong or can
  clearly be improved.

3. You automatically place the genes into subsystems, giving
  an overview of the cellular machinery that has been
  successfully identified.

These three steps are just the start of extracting information from a
new genome, but they give you a reasonably annotated genome that the
research community can use effectively.

Running Your Genome Through RAST
--------------------------------

The first step involves acquiring an initial annotation.  We suggest
that you use RAST or our MacApp for doing so, but there are other
services and approaches to getting an initial annotation.

A separate tutorial describes how to get a RAST account and submit
a genome for annotation.

"Walking" Your Genome
---------------------

However you decide to manually annotate your genome, we suggest using
an environment that lets you efficiently "walk through the genome",
comparing each region against the corresponding regions in previously
sequenced and annotated genomes.  With a suitable framework this goes
quite rapidly: visually inspecting all of the genes takes about 1 to 3
work days.  This can be somewhat tedious, but what emerges is a
reasonably annotated genome and a pretty good overview of what it
contains.
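
As a rough illustration of what "walking" means computationally, the
sketch below orders genes by contig and position and yields each gene
together with its immediate neighbors, which is the context an
annotator would inspect at each stop.  The `Gene` record, its field
names, and the `walk` function are hypothetical names chosen for this
example; they are not part of RAST or of any specific annotation
environment.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; the field names are assumptions."""
    gene_id: str
    contig: str
    start: int      # leftmost coordinate on the contig
    end: int        # rightmost coordinate on the contig
    strand: str     # "+" or "-"
    function: str   # function assigned by the initial annotation


def walk(genes: List[Gene], flank: int = 2) -> Iterator[Tuple[Gene, List[Gene]]]:
    """Yield each gene with up to `flank` neighbors on either side.

    Genes are visited contig by contig, in order of start coordinate,
    and neighbors are taken only from the same contig.
    """
    by_contig: Dict[str, List[Gene]] = {}
    for gene in genes:
        by_contig.setdefault(gene.contig, []).append(gene)

    for contig in sorted(by_contig):
        ordered = sorted(by_contig[contig], key=lambda g: g.start)
        for i, gene in enumerate(ordered):
            lo = max(0, i - flank)
            neighbors = ordered[lo:i] + ordered[i + 1:i + 1 + flank]
            yield gene, neighbors
```

In an interactive setting, each yielded gene and its neighbors would
be displayed next to the matching regions from related genomes so that
the annotator can accept, correct, or delete the call before moving on.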

Building a Metabolic Reconstruction
-----------------------------------

It is useful to group the recognized genes into known pathways,
complexes, and nonmetabolic molecular machines.  Here is how
we view this process:

   1. Our annotation team has constructed sets of functional roles
      that are annotated simultaneously because the functional roles
      are related.  The roles may be distinct subunits of a complex
      (e.g., the subunits of the ATP synthase or the ribosomal
      proteins), the roles that constitute a pathway (e.g.,
      Histidine Degradation), or the components of a nonmetabolic
      molecular machine (e.g., a repair machine, a transport
      cassette, or a 2-component regulatory system).  We
      call each of these sets of roles a "subsystem".  Our annotators
      have carefully assembled the functional roles that make up a
      subsystem and for each one constructed a spreadsheet in which
      each row is a genome and each column is a distinct functional
      role.  The cells of the spreadsheet contain the genes from the
      specific genome that implement the specific functional role.  
      For example (SEE POWERPOINT PICTURES OF HISTIDINE DEGRADATION).

   2. Using the examples contained in the manually curated set of
      subsystems, we automatically try to locate the appropriate
      genes within the newly sequenced genome and identify a new
      instance (i.e., a new row in the spreadsheet) of the subsystem.
      When we can identify all of the genes needed to implement an
      operational version of the subsystem, our confidence in the
      assigned functions increases substantially, and the instance
      becomes a critical piece of information for generating
      metabolic models.

   3. Where we recognize only a portion of a subsystem, we may have
      failed to accurately identify some genes, we may have
      misannotated genes, or we may have a new variant of the
      subsystem (e.g., a new variant of a common pathway).

   4. We consider a metabolic reconstruction to simply be the set of
      recognized, operational instances of our subsystem collection.
      This is distinct from an actual initial estimate of the
      metabolic network (which we provide, as well).  The metabolic
      reconstruction includes information about the nonmetabolic
      machinery supported by the genome.  We are not completely happy
      with the term "metabolic reconstruction", but that is the term
      that has stuck and the one in common usage within our group.
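
The spreadsheet view and the notion of a metabolic reconstruction
described above can be sketched in a few lines of Python.  This is a
minimal illustration under stated assumptions, not the implementation
used by RAST: the function names are hypothetical, genes are matched
to roles by exact function string, and an instance is treated as
"operational" only when every role has at least one gene (real
subsystems allow variants that need only some of the roles).

```python
from typing import Dict, List, Set

# One spreadsheet row: functional role -> genes in this genome
# that implement the role (an empty list is an empty cell).
Row = Dict[str, List[str]]


def build_row(roles: List[str], gene_functions: Dict[str, str]) -> Row:
    """Fill one spreadsheet row for a genome.

    `gene_functions` maps gene id -> assigned function; a gene fills
    a cell when its function matches the role name exactly
    (a simplifying assumption for this sketch).
    """
    row: Row = {role: [] for role in roles}
    for gene_id, function in gene_functions.items():
        if function in row:
            row[function].append(gene_id)
    return row


def missing_roles(row: Row) -> List[str]:
    """Roles with no gene assigned, i.e. the empty cells of the row."""
    return [role for role, genes in row.items() if not genes]


def is_operational(row: Row) -> bool:
    """Simplified test: every role in the subsystem has a gene."""
    return bool(row) and not missing_roles(row)


def metabolic_reconstruction(
    subsystems: Dict[str, List[str]],
    gene_functions: Dict[str, str],
) -> Set[str]:
    """Names of the subsystems with an operational instance in the genome."""
    return {
        name
        for name, roles in subsystems.items()
        if is_operational(build_row(roles, gene_functions))
    }
```

For example, with an illustrative (not curated) role list for
Histidine Degradation and made-up gene ids:

```python
roles = ["Histidine ammonia-lyase", "Urocanate hydratase", "Imidazolonepropionase"]
genes = {"peg.1": "Histidine ammonia-lyase", "peg.2": "Urocanate hydratase"}
row = build_row(roles, genes)
print(missing_roles(row))   # the role that still needs a gene
```

A partially filled row like this one corresponds to the case in
item 3: a missed gene, a misannotation, or a new variant.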

Summary
-------

The 3-step process we outline for acquiring reasonably good
annotations and an initial metabolic reconstruction for a prokaryotic
genome works well for genomes that are "close" to well-annotated
existing genomes.
For truly divergent genomes, it is a good starting point, but much
more effort is required to achieve what one might think of as an
"acceptable annotation".  The virtue of our approach is that, in most
cases, you can acquire a usable annotation in 1-3 days.  We have
invited groups that have spent man-years annotating specific genomes
to compare their results with ours, and for the most part our
annotations were very close to the carefully done manual efforts.