Support for Exploring Pangenomes
As we find it possible to sequence and annotate many closely related
genomes, a number of users have begun exploring "pangenomes". The
computational support we offer is the ability to form protein families
for a specified set of genomes, where the genomes are either in the
main collection within the SEED or are private (i.e., kept within RAST
or myRAST).
We construct protein families in three steps:
1. First, we compute the raw data needed to try to determine which
genes in a genome G1 correspond to specific genes in a second
genome. At this point, we do not try to determine which genes
actually correspond, we simply create a save blast scores,
evaluation of bidirectional best hits (BBHs), which genes seem
to have very similar contexts (neighboring genes) in each
genome, and so forth. Computing this raw data is time
consuming, so we do it once for each pair of genomes, and we
save it.
2. Second, we compute all binary correspondences between genomes.
That is, we try to match up the genes within two genomes and
retain a record that "gene X from G1 corresponds to gene Y from
G2". There are many options in forming such binary
correspondences. You may wish to only form correspondences
between genes that you feel are clearly "the same gene", or you
may wish to group many similar genes (consider what you might wish to
do with 10 copies of a single prophage in a genome -- that is,
with 10 genes that are 100% identical).
There is no "right way" to form these correspondences (in our
opinion), but we do try to offer you the tools for connecting
genes using very loose or very strict criteria. When we form
our own "pangenome protein families", we tend to use very
strict criteria.
3. Once we have computed all of the correspondences, we place the
genes into sets using single-linkage clustering (if X
corresponds to Y, and Y corresponds to Z, then X corresponds to
Z). This leads to the sets that you can use to study the "pan
genomes".
To see how this works, suppose that we have about 25 E.coli genomes in
the P-SEED, that you have 5 of your own, and that you wish to compute
protein families for the genes in the 30 genomes, matching the genes
up fairly strictly.
You would do this by preparing an input file describing the 30
genomes. You could start by building a file describing the genomes
from the P-SEED using
export SAS_SERVER=PSEED
svr_all_genomes | grep 'Escherichia coli' | grep -v -i plasmid > tmp
You should manually inspect each of the lines to make sure that it
looks OK and then use
cut -f2 tmp > genomes.to.use
to construct a file in which each line contains just a genome ID for a
genome from the P-SEED.
Now, to include your private RAST genomes in the analysis, you should
create a directory (called, say, "MyEcoli") and place the SEED
directories generated by RAST or myRAST. You can export them from
RAST and save them in MyEcoli, or if you have built them locally using
myRAST, you can copy them from there. Suppose that you have five such
genomes then MyEcoli might contain genome directories labeled
83333.518, 83333.519, 83333.520, 83333.521 and 83333.522 (or something
like that).
Now you edit genomes to use to include five line, each containing the
full path to one of your genomes.
At this point the file genomes.to.use should include 30 lines -- 25
with the IDs of the P-SEED genomes and 5 describing the paths to your
private genomes.
Once you have created the complete genomes.to.use, you could run
svr_make_pan_genome_prot_families < genomes.to.use | svr_function_of > protein.families
to make a set of protein families. Note that numerous messages are
logged to STDERR indicating why specific correspondences were rejected
(we have found these pretty useful). If you wish to suppress them,
use
svr_make_pan_genome_prot_families < genomes.to.use 2> /dev/null | svr_function_of > protein.families
Note that this can run for quite a while. The command causes
(30 * 29) / 2
pairwise comparisons to be made between the genomes. Each of these
can take 1-5 minutes or more (depending on the speed of your
computer). Many of the comparisons may have already been computed and
added to the P-SEED, but then again, they may not. As a rough
estimate, I would say, plan on being able to compute around 200-300
binary comparisons per day. This means that the above command might
well run a few days. If you are running on a machine with more than
one processor (say, an 8-processsor machine), you could use
svr_make_pan_genome_prot_families -p=8 < genomes.to.use | svr_function_of > protein.families
and things would probably go up to eight times faster.
The set of protein families you get back will not contain any
singleton sets. You should note that these singletons are often
significant genes; but they are also often miscalled genes, truncated
genes (due to incomplete genomes or errors in sequencing), or
frameshifted genes.
Once you have computed the initial protein families, you might begin
by getting some simple reports on both the families and the genomes using
svr_summarize_protein_families SizeReport GenomeReport InCommon < protein.families
This command produces produces three simple tables. The first has one
line for each "size of set" and a count of the number of sets having
that size (i.e., that number of members). When I ran this against a
set of families that I had generated for 5 Brucella genomes, it
contained
===========================
Size of Set Number of Sets
8 1
5 2300
4 453
3 479
2 711
===========================
That is, there were 711 sets with exactly 2 entries, 479 with 3,
and so forth. The fact that there was a set containing 8 genes
appears suspicious. This is especially troubling, since the
default comparison parameters require sequences to be 80% similar and
to be bidirectional-best-hits (BBHs). So, if we required BBHs, how
did we end up with sets containing more than 5 members?
The answer is important and people working with these types of
comparisons should all think about it a bit. There are at least two
common causes:
1. If the genome compares multiple copies of a gene that are
almost identical (consider the case where a genome contains
4 insertions of a prophage, and the prophage genes all
appear identical), then you can easily construct cases in
which BBHs take you through a cycle from one copy through a
set of other genomes and back to the other copy. This case
does not seem so pernicious, since the sequences are almost
identical and the functions are almost certainly identical,
as well.
2. The second case is more disturbing. Consider a case in
which a set of genomes usually have two similar
transporters, T1 and T2. However, one of the genomes has
just a T1, another only a T2. In this case, BBHs may start
with a T1, move through a sequence of genomes containing
both genes, reach the genome with just a T1 followed by the
genome with just a T2, and that T1 in that genome may well
be a BBH to the T2 in the other genome with no T1. This,
causes the train of BBHs to go through a set of T1s, flip to
a T2, and then keep connecting to pick up the remaining T2s.
If you consider sets containing two or more genes from the same genome
as objectional, you can use
svr_split_fams < prot.families > one.genome.per.fam 2> multiple.per.fam
which writes out families that do not have multiple genes from a
single genome to STDOUT and those that do to STDERR.