The Update Protocol for Maintenance of Annotations

This document describes the update protocol used to maintain annotations within the SEED Project.

There are now two SEEDs used as components of the maintenance process:

  1. The Annotators SEED (A-SEED) remains the SEED used to support manual curations, which are largely cast as creation and maintenance of subsystems.
  2. The PATRIC-SEED (P-SEED) contains all of the complete prokaryotic genomes from RefSeq, as well as all of the complete genomes from the A-SEED.

The need for two distinct SEEDs arises because the A-SEED is a much more productive framework for manually constructing subsystems.  There the annotators are not faced with hundreds of almost identical genomes or with genomes of very low quality.  The A-SEED is intended to contain representative genomes of high quality.  On the other hand, the construction of "pan-genomes" and the development of assertions of the form "gene X and gene Y perform identical functions" are best done in a more comprehensive collection of genomes.

The need to utilize two versions of the SEED necessarily leads to a somewhat more complex update protocol.  The steps in the current protocol are now discussed.  For purposes of this discussion, we will reference FIGfam collections used for a variety of purposes.  The version of the FIGfams corresponding to the current A-SEED will be called A-FIGfams.  A-FIGfams contain proteins from only genomes present in the A-SEED.  The current version of the FIGfams used in the P-SEED will be called P-FIGfams, and these will contain proteins from genomes in the P-SEED.

Each SEED maintains a set of representative complete genomes.  These representatives are thought of as members of Operational Taxonomic Units (OTUs).  An OTU contains genomes that are "very similar" (technically, we group genomes that have SSU rRNAs that are at least 97% identical).   The set of similar genomes constitutes the OTU, and the representative genome name is used as the "name" of the OTU.  It should be noted that OTUs must be maintained for each copy of the SEED.  The protocol implemented in the steps below assumes that the OTUs are current.  If currency is not ensured by the protocol for adding genomes, then it is important to update OTUs in both SEEDs before proceeding. 
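
The grouping rule can be illustrated with a short sketch.  This is not the SEED implementation: it assumes a greedy scheme in which each genome joins the first OTU whose representative it matches at 97% identity or better, and otherwise founds a new OTU.  The function names, the `identity` callback, and the toy identity table are hypothetical and used only for illustration.

```python
from typing import Callable, Dict, List


def build_otus(
    genomes: List[str],
    identity: Callable[[str, str], float],
    threshold: float = 0.97,
) -> Dict[str, List[str]]:
    """Greedily group genomes into OTUs keyed by their representative.

    `identity(a, b)` is assumed to return the SSU rRNA identity of two
    genomes as a fraction between 0 and 1 (hypothetical helper).
    """
    otus: Dict[str, List[str]] = {}
    for genome in genomes:
        for representative in otus:
            if identity(genome, representative) >= threshold:
                otus[representative].append(genome)
                break
        else:
            # No existing representative is similar enough, so this
            # genome becomes the representative of a new OTU.
            otus[genome] = [genome]
    return otus


# Toy pairwise identities (made-up values, not real data).
toy_pairs = {
    frozenset(("g1", "g2")): 0.99,
    frozenset(("g1", "g3")): 0.90,
    frozenset(("g2", "g3")): 0.91,
}


def toy_identity(a: str, b: str) -> float:
    return 1.0 if a == b else toy_pairs[frozenset((a, b))]


otus = build_otus(["g1", "g2", "g3"], toy_identity)
```

With the toy table, g1 and g2 fall into one OTU named after g1, and g3 forms its own OTU.  A greedy scheme depends on the order in which genomes are processed, which is one reason the OTUs must be brought up to date whenever genomes are added.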

Similarly, this protocol assumes that the flow of genomes into the P-SEED has been successful.  It is critical that the flow of genomes into the P-SEED include all complete prokaryotic genomes maintained in GenBank.

To understand the following protocol, you must be familiar with our use of Kmers to annotate genomes.  For the purposes of this discussion, the Kmers are 8-character sequences in the amino acid alphabet.  We maintain a collection of signature Kmers, which are Kmers that have been detected only in protein sequences that have a common function (i.e., the only known proteins containing the Kmer have all been assigned the same function).  Thus, when we detect an instance of one of the signature Kmers in a new protein sequence, it is evidence of function.  If we detect a large number of them (usually, more than 3), it becomes very likely that the protein shares the function of the protein sequences from which the Kmers were derived.  Currently, we derive signature Kmers for each of the FIGfam protein families.
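
The idea of signature Kmers can be sketched in a few lines of Python.  This is a minimal illustration under stated assumptions, not the code behind the Kmer server: the function names are hypothetical, the sequences are toy data, and the rule "assign the function when more than 3 distinct signature Kmers agree" is one simple reading of the threshold described above.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set, Tuple

K = 8  # Kmer length used in this discussion


def kmers(sequence: str, k: int = K) -> Set[str]:
    """Return the set of distinct k-character substrings of a protein."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_signature_kmers(annotated: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each Kmer to a function if it occurs only in proteins with
    that single function; Kmers seen with several functions are dropped."""
    functions_seen: Dict[str, Set[str]] = {}
    for sequence, function in annotated:
        for kmer in kmers(sequence):
            functions_seen.setdefault(kmer, set()).add(function)
    return {
        kmer: next(iter(functions))
        for kmer, functions in functions_seen.items()
        if len(functions) == 1
    }


def call_function(sequence: str, signatures: Dict[str, str],
                  threshold: int = 3) -> Optional[str]:
    """Return the best-supported function if more than `threshold`
    distinct signature Kmers support it, otherwise None."""
    hits = Counter(signatures[k] for k in kmers(sequence) if k in signatures)
    if not hits:
        return None
    function, count = hits.most_common(1)[0]
    return function if count > threshold else None


# Toy training data (made-up sequences and function names).
signatures = build_signature_kmers([
    ("MKTAYIAKQRQISFVK", "Function A"),
    ("MSLNDELLKRGWHPTE", "Function B"),
])

long_call = call_function("MKTAYIAKQRQI", signatures)   # 5 matching Kmers
short_call = call_function("MKTAYIAKQ", signatures)     # only 2 matching Kmers
```

Here the longer query shares five signature Kmers with the first training protein and is assigned "Function A", while the shorter query shares only two and is left uncalled.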

These are the steps required to update the annotations of the P-SEED genomes:

  1. Create A-FIGfams-1, the restriction of P-FIGfams to A-SEED Genomes

    The first step is to take the current P-FIGfams, which may have numerous assertions relating to the extended set of genomes maintained in P-SEED, and restrict those protein sets to genomes from A-SEED.  Protein sets that condense to singleton sets are removed in the process.
  2. Create A-FIGfams-2 by Updating the A-FIGfams-1 in the A-SEED

    This step (in effect) uses the current subsystems in the A-SEED, as well as any non-subsystem clustering done in the P-SEED, to create a reconciled set of FIGfams for the A-SEED.  A Kmer update is generated from the contents of A-FIGfams-2.
  3. Recall Protein Functions and Subsystems in P-SEED

    This, in effect, imports the manual subsystem maintenance into the annotations of P-SEED.
  4. Update P-FIGfams to P-FIGfams-1 in the Updated P-SEED

    Since this is an update of P-FIGfams using the updated subsystem annotations, it retains the clustering work done in P-SEED, while keeping the annotations imported from A-SEED.  Create an updated set of Kmers from P-FIGfams-1.
  5. Install the new Kmers into the Kmer Server

    This leaves one with the largest set of Kmers (suitable for recognition of pan-genomes, and offering a larger set of signature Kmers than is possible using just genomes from the A-SEED).
  6. Recall the Functions of the Proteins in P-SEED using the New Kmers

    The changes should be minor, but they make the annotations consistent with the new Kmers.  The subsystem assertions can then be finally updated.
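
Step 1 above, the restriction of P-FIGfams to A-SEED genomes, can be sketched as follows.  This is an illustration only: the data layout (a family as a set of `(genome_id, protein_id)` pairs), the function name, and the identifiers in the example are assumptions made for the sketch and do not describe the actual FIGfam file formats or tools.

```python
from typing import Dict, Set, Tuple

Protein = Tuple[str, str]  # (genome_id, protein_id), a hypothetical layout


def restrict_families(
    families: Dict[str, Set[Protein]],
    allowed_genomes: Set[str],
) -> Dict[str, Set[Protein]]:
    """Keep only proteins from allowed genomes and drop any family that
    is left with fewer than two members (singleton or empty)."""
    restricted: Dict[str, Set[Protein]] = {}
    for family_id, members in families.items():
        kept = {(genome, protein) for genome, protein in members
                if genome in allowed_genomes}
        if len(kept) >= 2:
            restricted[family_id] = kept
    return restricted


# Toy P-FIGfams and A-SEED genome list (made-up identifiers).
p_figfams = {
    "FAM1": {("gA", "p1"), ("gB", "p2"), ("gX", "p3")},
    "FAM2": {("gA", "p4"), ("gX", "p5")},   # becomes a singleton
    "FAM3": {("gX", "p6"), ("gY", "p7")},   # becomes empty
}
a_seed_genomes = {"gA", "gB"}

a_figfams_1 = restrict_families(p_figfams, a_seed_genomes)
```

In the toy data only FAM1 survives, reduced to its two A-SEED members; FAM2 shrinks to a singleton and FAM3 to an empty set, so both are removed.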

This completes the steps needed to update the A-FIGfams (to A-FIGfams-2), the annotations in the P-SEED, P-FIGfams (to P-FIGfams-1) and the Kmers.

It remains essential that we maintain the list of induced changes and do manual inspections to enforce quality control on the process.