Support for Exploring Pangenomes

As we find it possible to sequence and annotate many closely related
genomes, a number of users have begun exploring "pangenomes".  The
computational support we offer is the ability to form protein families
for a specified set of genomes, where the genomes are either in the
main collection within the SEED or are private (i.e., kept within RAST
or myRAST).

We construct protein families in three steps:

   1. First, we compute the raw data needed to try to determine which
      genes in a genome G1 correspond to specific genes in a second
      genome.  At this point, we do not try to determine which genes
      actually correspond, we simply create a save blast scores,
      evaluation of bidirectional best hits (BBHs), which genes seem
      to have very similar contexts (neighboring genes) in each
      genome, and so forth.  Computing this raw data is time
      consuming, so we do it once for each pair of genomes, and we
      save it.

   2. Second, we compute all binary correspondences between genomes.
      That is, we try to match up the genes within two genomes and
      retain a record that "gene X from G1 corresponds to gene Y from
      G2".  There are many options in forming such binary
      correspondences.  You may wish to only form correspondences
      between genes that you feel are clearly "the same gene", or you
      may wish to group many similar genes (consider what you might wish to
      do with 10 copies of a single prophage in a genome -- that is,
      with 10 genes that are 100% identical).

      There is no "right way" to form these correspondences (in our
      opinion), but we do try to offer you the tools for connecting
      genes using very loose or very strict criteria.  When we form
      our own "pangenome protein families", we tend to use very
      strict criteria.

   3. Once we have computed all of the correspondences, we place the
      genes into sets using single-linkage clustering (if X
      corresponds to Y, and Y corresponds to Z, then X corresponds to
      Z).  This leads to the sets that you can use to study the "pan
      genomes". 

To see how this works, suppose that we have about 25 E.coli genomes in
the P-SEED, that you have 5 of your own, and that you wish to compute
protein families for the genes in the 30 genomes, matching the genes
up fairly strictly.  

You would do this by preparing an input file describing the 30
genomes.  You could start by building a file describing the genomes
from the P-SEED using

  export SAS_SERVER=PSEED
  svr_all_genomes | grep 'Escherichia coli' | grep -v -i plasmid > tmp

You should manually inspect each of the lines to make sure that it
looks OK and then use

  cut -f2 tmp > genomes.to.use

to construct a file in which each line contains just a genome ID for a
genome from the P-SEED.

Now, to include your private RAST genomes in the analysis, you should
create a directory (called, say, "MyEcoli") and place the SEED
directories generated by RAST or myRAST.  You can export them from
RAST and save them in MyEcoli, or if you have built them locally using
myRAST, you can copy them from there.  Suppose that you have five such
genomes then MyEcoli might contain genome directories labeled
83333.518, 83333.519, 83333.520, 83333.521 and 83333.522 (or something
like that). 

Now you edit genomes to use to include five line, each containing the
full path to one of your genomes.

At this point the file genomes.to.use should include 30 lines -- 25
with the IDs of the P-SEED genomes and 5 describing the paths to your
private genomes.  

Once you have created the complete genomes.to.use, you could run

     svr_make_pan_genome_prot_families < genomes.to.use | svr_function_of > protein.families

to make a set of protein families.  Note that numerous messages are
logged to STDERR indicating why specific correspondences were rejected
(we have found these pretty useful).  If you wish to suppress them,
use

  svr_make_pan_genome_prot_families < genomes.to.use 2> /dev/null | svr_function_of > protein.families

Note that this can run for quite a while.  The command causes 

     (30 * 29) / 2

pairwise comparisons to be made between the genomes.  Each of these
can take 1-5 minutes or more (depending on the speed of your
computer).  Many of the comparisons may have already been computed and
added to the P-SEED, but then again, they may not.  As a rough
estimate, I would say, plan on being able to compute around 200-300
binary comparisons per day.  This means that the above command might
well run a few days.  If you are running on a machine with more than
one processor (say, an 8-processsor machine), you could use

 svr_make_pan_genome_prot_families -p=8 < genomes.to.use | svr_function_of > protein.families

and things would probably go up to eight times faster.

The set of protein families you get back will not contain any
singleton sets.  You should note that these singletons are often
significant genes; but they are also often miscalled genes, truncated
genes (due to incomplete genomes or errors in sequencing), or
frameshifted genes.

Once you have computed the initial protein families, you might begin
by getting some simple reports on both the families and the genomes using

   svr_summarize_protein_families SizeReport GenomeReport InCommon < protein.families

This command produces produces three simple tables.  The first has one
line for each "size of set" and a count of the number of sets having
that size (i.e., that number of members).  When I ran this against a
set of families that I had generated for 5 Brucella genomes, it
contained

===========================
Size of Set     Number of Sets
8       1
5       2300
4       453
3       479
2       711
===========================

That is, there were 711 sets with exactly 2 entries, 479 with 3,
and so forth.  The fact that there was a set containing 8 genes
appears suspicious.  This is especially troubling, since the
default comparison parameters require sequences to be 80% similar and
to be bidirectional-best-hits (BBHs).  So, if we required BBHs, how
did we end up with sets containing more than 5 members?

The answer is important and people working with these types of
comparisons should all think about it a bit.  There are at least two
common causes:

       1. If the genome compares multiple copies of a gene that are
          almost identical (consider the case where a genome contains
          4 insertions of a prophage, and the prophage genes all
          appear identical), then you can easily construct cases in
          which BBHs take you through a cycle from one copy through a
          set of other genomes and back to the other copy.  This case
          does not seem so pernicious, since the sequences are almost
          identical and the functions are almost certainly identical,
          as well.

       2. The second case is more disturbing.  Consider a case in
          which a set of genomes usually have two similar
          transporters, T1 and T2.  However, one of the genomes has
          just a T1, another only a T2.  In this case, BBHs may start
          with a T1, move through a sequence of genomes containing
          both genes,  reach the genome with just a T1 followed by the
          genome with just a T2, and that T1 in that genome may well
          be a BBH to the T2 in the other genome with no T1.  This,
          causes the train of BBHs to go through a set of T1s, flip to
          a T2, and then keep connecting to pick up the remaining T2s.

If you consider sets containing two or more genes from the same genome
as objectional, you can use

       svr_split_fams <  prot.families > one.genome.per.fam 2> multiple.per.fam

which writes out families that do not have multiple genes from a
single genome to STDOUT and those that do to STDERR.