Support for Exploring Pangenomes

As we find it possible to sequence and annotate many closely related

genomes, a number of users have begun exploring "pangenomes". The

computational support we offer is the ability to form protein families

for a specified set of genomes, where the genomes are either in the

main collection within the SEED or are private (i.e., kept within RAST

or myRAST).

We construct protein families in three steps:

1. First, we compute the raw data needed to try to determine which

genes in a genome G1 correspond to specific genes in a second

genome. At this point, we do not try to determine which genes

actually correspond, we simply create a save blast scores,

evaluation of bidirectional best hits (BBHs), which genes seem

to have very similar contexts (neighboring genes) in each

genome, and so forth. Computing this raw data is time

consuming, so we do it once for each pair of genomes, and we

save it.

2. Second, we compute all binary correspondences between genomes.

That is, we try to match up the genes within two genomes and

retain a record that "gene X from G1 corresponds to gene Y from

G2". There are many options in forming such binary

correspondences. You may wish to only form correspondences

between genes that you feel are clearly "the same gene", or you

may wish to group many similar genes (consider what you might wish to

do with 10 copies of a single prophage in a genome -- that is,

with 10 genes that are 100% identical).

There is no "right way" to form these correspondences (in our

opinion), but we do try to offer you the tools for connecting

genes using very loose or very strict criteria. When we form

our own "pangenome protein families", we tend to use very

strict criteria.

3. Once we have computed all of the correspondences, we place the

genes into sets using single-linkage clustering (if X

corresponds to Y, and Y corresponds to Z, then X corresponds to

Z). This leads to the sets that you can use to study the "pan

genomes".

To see how this works, suppose that we have about 25 E.coli genomes in

the P-SEED, that you have 5 of your own, and that you wish to compute

protein families for the genes in the 30 genomes, matching the genes

up fairly strictly.

You would do this by preparing an input file describing the 30

genomes. You could start by building a file describing the genomes

from the P-SEED using

export SAS_SERVER=PSEED

svr_all_genomes | grep 'Escherichia coli' | grep -v -i plasmid > tmp

You should manually inspect each of the lines to make sure that it

looks OK and then use

cut -f2 tmp > genomes.to.use

to construct a file in which each line contains just a genome ID for a

genome from the P-SEED.

Now, to include your private RAST genomes in the analysis, you should

create a directory (called, say, "MyEcoli") and place the SEED

directories generated by RAST or myRAST. You can export them from

RAST and save them in MyEcoli, or if you have built them locally using

myRAST, you can copy them from there. Suppose that you have five such

genomes then MyEcoli might contain genome directories labeled

83333.518, 83333.519, 83333.520, 83333.521 and 83333.522 (or something

like that).

Now you edit genomes to use to include five line, each containing the

full path to one of your genomes.

At this point the file genomes.to.use should include 30 lines -- 25

with the IDs of the P-SEED genomes and 5 describing the paths to your

private genomes.

Once you have created the complete genomes.to.use, you could run

svr_make_pan_genome_prot_families < genomes.to.use | svr_function_of > protein.families

to make a set of protein families. Note that numerous messages are

logged to STDERR indicating why specific correspondences were rejected

(we have found these pretty useful). If you wish to suppress them,

use

svr_make_pan_genome_prot_families < genomes.to.use 2> /dev/null | svr_function_of > protein.families

Note that this can run for quite a while. The command causes

(30 * 29) / 2

pairwise comparisons to be made between the genomes. Each of these

can take 1-5 minutes or more (depending on the speed of your

computer). Many of the comparisons may have already been computed and

added to the P-SEED, but then again, they may not. As a rough

estimate, I would say, plan on being able to compute around 200-300

binary comparisons per day. This means that the above command might

well run a few days. If you are running on a machine with more than

one processor (say, an 8-processsor machine), you could use

svr_make_pan_genome_prot_families -p=8 < genomes.to.use | svr_function_of > protein.families

and things would probably go up to eight times faster.

The set of protein families you get back will not contain any

singleton sets. You should note that these singletons are often

significant genes; but they are also often miscalled genes, truncated

genes (due to incomplete genomes or errors in sequencing), or

frameshifted genes.

Once you have computed the initial protein families, you might begin

by getting some simple reports on both the families and the genomes using

svr_summarize_protein_families SizeReport GenomeReport InCommon < protein.families

This command produces produces three simple tables. The first has one

line for each "size of set" and a count of the number of sets having

that size (i.e., that number of members). When I ran this against a

set of families that I had generated for 5 Brucella genomes, it

contained

===========================

Size of Set Number of Sets

8 1

5 2300

4 453

3 479

2 711

===========================

That is, there were 711 sets with exactly 2 entries, 479 with 3,

and so forth. The fact that there was a set containing 8 genes

appears suspicious. This is especially troubling, since the

default comparison parameters require sequences to be 80% similar and

to be bidirectional-best-hits (BBHs). So, if we required BBHs, how

did we end up with sets containing more than 5 members?

The answer is important and people working with these types of

comparisons should all think about it a bit. There are at least two

common causes:

1. If the genome compares multiple copies of a gene that are

almost identical (consider the case where a genome contains

4 insertions of a prophage, and the prophage genes all

appear identical), then you can easily construct cases in

which BBHs take you through a cycle from one copy through a

set of other genomes and back to the other copy. This case

does not seem so pernicious, since the sequences are almost

identical and the functions are almost certainly identical,

as well.

2. The second case is more disturbing. Consider a case in

which a set of genomes usually have two similar

transporters, T1 and T2. However, one of the genomes has

just a T1, another only a T2. In this case, BBHs may start

with a T1, move through a sequence of genomes containing

both genes, reach the genome with just a T1 followed by the

genome with just a T2, and that T1 in that genome may well

be a BBH to the T2 in the other genome with no T1. This,

causes the train of BBHs to go through a set of T1s, flip to

a T2, and then keep connecting to pick up the remaining T2s.

If you consider sets containing two or more genes from the same genome

as objectional, you can use

svr_split_fams < prot.families > one.genome.per.fam 2> multiple.per.fam

which writes out families that do not have multiple genes from a

single genome to STDOUT and those that do to STDERR.

Support for Exploring Pangenomes

Search

API Listings

SEED Utilities

About this Archive

Categories

Pages

Presentations