What SEED Do I Use?

March 17, 2011

Ross Overbeek

A number of distinct SEEDs have emerged over the years.  Almost all users will find they wish to use just the following two:

 

1.     The PubSEED will rapidly become the central SEED for use by the research community.  It will support access to the largest collection of genomes, and new genomes will arrive in the PubSEED first.  The PubSEED will allow registered users to make annotations and build subsystems (unfortunately, that implies that they will be able to overwrite the work of others, too).  We will also allow registered users to install genomes from RAST directly into the PubSEED.  Controlling access and update capabilities by genome will require establishing a notion of ownership of genomes.  Users will be allowed to copy a genome, creating a version with ownership rights that they control.  We will establish a few basic rules, and then we will do our best to develop tools that support reconciliation of conflicts, backup and recovery to specific points in time, and so forth.  This SEED will be the center of much of our work.

 

2.     The UC-SEED (i.e., the University of Chicago SEED) will be rebuilt periodically as a copy of the PubSEED.  It is a place where classes can be held, students can do annotations and build subsystems, and so forth.  Subsystems built in the UC-SEED can be exported to the Clearinghouse, and then imported to the PubSEED.  This procedure allows users to save work and make it available, if they wish.  However, the whole system is rebuilt and the existing contents destroyed on a periodic basis (usually between semesters, and only after a notice has been posted on the front page for several weeks).

 

 

Ownership of Genomes

This section describes a basic position relating to genomes that is not yet fully implemented.  I believe that the implementation will reach the position I describe within 4-6 months.

 

There will soon be hundreds of thousands of genomes.  Many of these genomes will be either identical or almost identical.  In some cases the genomic sequence data will be identical, but distinct user groups will insist on the ability to annotate isolated copies that are protected from unauthorized updates.  A protocol that supports both sharing and isolated annotation will have to manage privileges and interactions in a way that minimally constrains experts attempting to contribute.

 

It will rapidly become critical that we be able to talk about genomes, contigs, genes, and proteins and to easily detect whether two references are to the "same" entity.  As we move into a world with hundreds of thousands of genomes, some with identical sequence, and others with sequences that differ by only a few characters, it will become critical that we support basic ID Correspondence Services in a consistent manner.

 

We suggest employing the following set of definitions for what it means to be "the same" for genomes, contigs, genes, and proteins.

 

1.     Two sequences are the same if the MD5 hashes of the uppercase versions of the sequences are identical.

 

2.     Two contigs are considered the same if their DNA sequences are the same. 

 

3.     Two genomes are considered the same iff

a.     They have the same number of contigs.

b.     The MD5s of their sorted and concatenated contig MD5s match.  We call the MD5 of the sorted and concatenated contig MD5s the MD5 of the genome.

 

4.     Two genes are considered identical if they are in genomes that are the same, and

a.     They occur in contigs that are the same,

b.     They have identical start and stop positions in the two contigs. 

 

5.     Two proteins are the same if their sequences are the same (note that this is not a notion that is equivalent to saying that they are the gene products of two genes that are the same).
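The definitions above can be sketched in a few lines of Python. This is only an illustration, not the SEED implementation: the function names are hypothetical, and the sketch assumes ASCII sequences, hexadecimal digests, and concatenation of the sorted contig MD5s with no separator (the text does not specify these details). A gene is represented here as a (contig sequence, start, stop) tuple.

```python
import hashlib


def sequence_md5(seq: str) -> str:
    """MD5 of the uppercase version of a DNA or protein sequence (hex digest assumed)."""
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()


def genome_md5(contig_seqs) -> str:
    """MD5 of the sorted, concatenated contig MD5s (no separator: an assumption)."""
    contig_md5s = sorted(sequence_md5(s) for s in contig_seqs)
    return hashlib.md5("".join(contig_md5s).encode("ascii")).hexdigest()


def same_genome(contigs_a, contigs_b) -> bool:
    """Same number of contigs and matching genome MD5s."""
    return len(contigs_a) == len(contigs_b) and genome_md5(contigs_a) == genome_md5(contigs_b)


def same_gene(contigs_a, gene_a, contigs_b, gene_b) -> bool:
    """Genes match if the genomes match, the contigs match, and start/stop are identical."""
    contig_a, start_a, stop_a = gene_a
    contig_b, start_b, stop_b = gene_b
    return (
        same_genome(contigs_a, contigs_b)
        and sequence_md5(contig_a) == sequence_md5(contig_b)
        and (start_a, stop_a) == (start_b, stop_b)
    )
```

Because the contig MD5s are sorted before hashing, two genomes compare as the same regardless of the order in which their contigs are listed or the case of the sequence characters.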

 

We will support the ability to rapidly determine which genomes, genes, and proteins are identical.  Further, we will support the capability of users defining sets of representative genomes and limiting displays to any selected set.

 

We are architecting the SEED environment as a framework that will be able to effectively integrate initially thousands, and within a short period millions, of distinct genomes.  Genomes will enter the collection from a growing number of sources.  Registering a genome will amount to claiming unique IDs for the genome and the features that occur within the genome.  This will inevitably lead to multiple registrations for identical genomes.  Further, while we will not support alteration of the sequence of a genome (i.e., such a change would lead to the creation and registration of a new genome), we will support addition and deletion of features on a genome.  A deletion will lead to recording a change in status (retaining a complete record of the deleted feature indefinitely).  The addition of a feature would require the acquisition of a new ID.  Changing the start location of a gene would cause deletion of the existing feature and addition of a new feature, which would inherit the appropriate attributes from the deleted feature.

 

The SEED environment will support the maintenance of genomes and features via a set of services that will include:

 

1.     acquire_a_genome_ID            returns a genome ID to a registered user

2.     acquire_a_feature_ID(Genome,Contig,Start,Stop)  returns a feature ID

3.     delete_a_feature(ID)               requires update privileges

4.     reactivate_a_deleted_genome(ID)    requires update privileges
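To make the intended semantics concrete, here is a toy in-memory sketch of the first three services. It is not the SEED API: the class, the ID formats, the record layout, and the change_start helper are all hypothetical, and privilege checks are omitted. It only illustrates that a deletion is a change of status rather than a removal, and that moving a gene's start amounts to deleting the old feature and registering a new one that inherits its attributes.

```python
from itertools import count


class FeatureRegistry:
    """Toy stand-in for the planned ID services (hypothetical, no privilege checks)."""

    def __init__(self):
        self._genome_ids = count(1)
        self._feature_ids = count(1)
        self.features = {}  # feature ID -> record; records are never removed

    def acquire_a_genome_ID(self):
        # The ID format is made up for this sketch.
        return f"genome-{next(self._genome_ids)}"

    def acquire_a_feature_ID(self, genome, contig, start, stop, attributes=None):
        fid = f"{genome}.feature-{next(self._feature_ids)}"
        self.features[fid] = {
            "genome": genome,
            "contig": contig,
            "start": start,
            "stop": stop,
            "status": "active",
            "attributes": dict(attributes or {}),
        }
        return fid

    def delete_a_feature(self, fid):
        # Deletion only records a change in status; the full record is retained.
        self.features[fid]["status"] = "deleted"

    def change_start(self, fid, new_start):
        # Hypothetical helper: delete the old feature, then add a new one
        # that inherits the old feature's attributes.
        old = self.features[fid]
        self.delete_a_feature(fid)
        return self.acquire_a_feature_ID(
            old["genome"], old["contig"], new_start, old["stop"], old["attributes"]
        )
```

For example, after registering a feature and calling change_start on it, the registry holds two records: the original with status "deleted" and a new one with a fresh ID and the same attributes.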

 

Registered users will be able to make any of these operations against genomes for which they have the required privileges.  Users owning genomes will have the ability to restrict access to a specified set of users.  That is, we will support private genomes that are not seen by everyone, and we will support the ability of owners to change the status of a genome (from private to public and vice versa).

 

Perhaps a short summary of the decision procedures on access/update rights would be as follows:

 

1.     We have registered users.  Users are either superusers or normal users.

2.     We have genomes.  Genomes are either private or public.

3.     Anyone attempting to access a genome or a feature of a genome will be given access if and only if

a.     the genome is public, or

b.     the user is a superuser, or

c.     the user either owns the genome or has been granted access to the genome.

4.     Anyone attempting to update a genome (which includes annotating features, deleting features, and adding features) will be allowed to make the update if and only if

a.     the genome is public, or

b.     the user is a superuser, or

c.     the user either owns the genome or has been granted write privileges to the genome.
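The decision procedure above is small enough to write down directly. The following is a minimal sketch, with hypothetical class and field names; it also assumes that a user who has been granted write privileges can read the genome, which the rules above do not state explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    is_superuser: bool = False


@dataclass
class Genome:
    owner: str
    is_public: bool = True
    readers: set = field(default_factory=set)  # users granted access
    writers: set = field(default_factory=set)  # users granted write privileges


def can_access(user: User, genome: Genome) -> bool:
    """Rule 3: public genome, superuser, owner, or granted access."""
    return (
        genome.is_public
        or user.is_superuser
        or user.name == genome.owner
        or user.name in genome.readers
        or user.name in genome.writers  # assumption: write privileges imply access
    )


def can_update(user: User, genome: Genome) -> bool:
    """Rule 4: public genome, superuser, owner, or granted write privileges."""
    return (
        genome.is_public
        or user.is_superuser
        or user.name == genome.owner
        or user.name in genome.writers
    )
```

Note that, as the rules are stated, any registered user may update a public genome; restricting updates requires making the genome private and granting write privileges to specific users.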

 

Setting Up an Annotation Group

If a group wishes to use the SEED Environment as a resource for supporting annotation and analysis of their genome, they would begin by registering each member of the group, and then establishing a group containing those members.

They would select the genomes they wish to annotate (probably by importing a newly-annotated genome from RAST into the PubSEED).  They would decide whether access and update privileges should be restricted to the group or not.

 

Then, they would use the framework we currently use to support our annotators to examine and edit annotations, construct metabolic models, or whatever.  The set of genomes that would be simultaneously edited could all be public or all be private.  If private, they would be imported from RAST or as copies of existing genomes.

 

Access of Data Via the Servers

Most users of the SEED will use a web browser.  However, a growing body of users will also start using our SEED Servers, which support a well-defined API to access and update data from a SEED.  We will run servers for the PubSEED.  See

 

http://servers.nmpdr.org/servers

 

for a discussion of the servers, the APIs used to access them, and the command-line services supported via the servers.  We believe that research groups may wish to use or help extend this growing confederation of servers.

 

Maintenance of Subsystems

The PubSEED and the UC-SEED both support development of subsystems.  From either of these platforms, subsystems can be exported to the Clearinghouse, and they can be imported into any other SEED (if you have the appropriate privileges).  We anticipate that students in classes would use the UC-SEED to avoid destroying the work of others.  Users wishing to make a more permanent contribution would use the PubSEED.

 

We will try to install a few basic rules to prevent bloodshed in instances in which incompatible annotations must be reconciled.  They would be something like

 

1.     You may overwrite any annotation that is not in a subsystem or is a duplicate in a subsystem (i.e., a case in which two genes currently have the same assigned functional role).

 

2.     Before overwriting a function in which a gene plays a unique role in someone else's subsystem, email them and ask for permission.  If they do not respond within a few days, proceed.

 

As the number of genomes grows rapidly, we believe that fewer and fewer annotators will actually construct and maintain comprehensive subsystems.  Rather, there will be a growing number of subsystems that contain only a subset of the actual genomes that have the machinery.  To handle this situation, we will periodically produce estimates, for each genome, of the subsystems that should contain the genome.  These will not impact any of the subsystems, but will allow users to have a reasonable estimate of the molecular machinery that can be identified.  This mimics what is now done in RAST, where a new genome contains estimates of which genes go into which subsystems, but these estimates do not actually impact the subsystems themselves.

 

The real point is that subsystems will no longer be thought of as comprehensive.  Up to this point the goal was to provide the tools needed to support manual curation of subsystems that were to contain as many genomes as possible from the existing collection.  The goal will shift.  We will think of subsystems as containing a diverse collection of instances needed to support accurate projection over the entire collection.  The PubSEED will house as complete a collection as possible and will support experimentation, which will inevitably lead to conflicts (that, hopefully, enrich the overall collection and get resolved peaceably).