An Etude Relating to a Metagenomics Sample

In another short note, I suggested using

svr_assign_to_dna_using_figfams < MG.sample | svr_summarize_MG_output > function.summary 2> otu.summary

to get summaries of function and population from a sample in the file

MG.sample.

A participant in one of our tutorials took a sample and ran it through

both the two commands above and did a more thorough analysis using

MG-RAST. The estimates of the most highly represented phylogenetic

groups differed somewhat, and the participant asked me to look into

it.

I am going to use this request as a motivation for showing how one

might use the server scripts effectively.

To begin with, I suggest getting a feel for how the server scripts

might assign function and OTUs to data we think we understand. I

decicded to take the contigs of E.coli K12 and run them through

the svr_assign_to_dna_using_figfams tool to see what came out.

This requires that I get a fasta file of the E.coli contigs. To do

that, I used

svr_contigs_in_genome < e.coli.id | svr_dna_seq -fasta > ecoli.contigs

This constructs a file of the contigs that make up the E.coli genome.

Then I ran

svr_assign_to_dna_using_figfams -by_location < ecoli.contigs > kmer.dna.output.default

This produces the output of the tool on a piece of DNA that is well anotated.

I asked to have the proposed hits displayed in order by

location so that I could compare the proposed hits (and corresponding

OTUs) against the usual annotations of the E.coli K12 genome.

The first few lines of output were as follows:

Contig # hits location (region) proposed function representative of an OTU

NC_000913 14 NC_000913_190_253 Thr operon leader peptide Escherichia coli K12

NC_000913 815 NC_000913_248_2797 Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3) Escherichia coli K12

NC_000913 5 NC_000913_572_1625 hypothetical protein Anaeromyxobacter sp. K

NC_000913 305 NC_000913_2801_3728 Homoserine kinase (EC 2.7.1.39) Escherichia coli K12

NC_000913 4 NC_000913_3213_3180 Glycerol kinase (EC 2.7.1.30) Mycobacterium smegmatis str. MC2 155

NC_000913 420 NC_000913_3734_5018 Threonine synthase (EC 4.2.3.1) Escherichia coli K12

NC_000913 136 NC_000913_5088_6022 hypothetical protein Escherichia coli K12

NC_000913 5 NC_000913_5600_5564 Ribonucleotide reductase of class Ia (aerobic), beta subunit (EC 1.17.4.1) Escherichia coli K12

NC_000913 100 NC_000913_6245_2551 hypothetical protein Escherichia coli K12

NC_000913 249 NC_000913_6459_5685 UPF0246 protein YaaA Escherichia coli K12

NC_000913 5 NC_000913_6967_7223 hypothetical protein Azoarcus sp. EbN1

NC_000913 3 NC_000913_7893_8614 hypothetical protein Tsukamurella paurometabola DSM 20162

----------------------

To get the standard annotations for E.coli, I ran

svr_all_features 83333.1 peg | svr_function_of | svr_gene_data -c 1 location > ecoli.pegs

The first few lines of output were

fig|83333.1.peg.1 Thr operon leader peptide 83333.1:NC_000913_190+66

fig|83333.1.peg.2 Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3) 83333.1:NC_000913_337+2463

fig|83333.1.peg.3 Homoserine kinase (EC 2.7.1.39) 83333.1:NC_000913_2801+933

fig|83333.1.peg.4 Threonine synthase (EC 4.2.3.1) 83333.1:NC_000913_3734+1287

fig|83333.1.peg.5 hypothetical protein 83333.1:NC_000913_5234+297

fig|83333.1.peg.6 UPF0246 protein YaaA 83333.1:NC_000913_6459-777

fig|83333.1.peg.7 Putative alanine/glycine transport protein 83333.1:NC_000913_7959-1431

fig|83333.1.peg.8 Transaldolase (EC 2.2.1.2) 83333.1:NC_000913_8238+954

----------------------

If you compare the two outputs briefly, it becomes apparent that regions with hits of 5 or less

represent spurious hits. It is clear that the default parameters are producing matches that are not real.

You have several parameters that determine the behavior of the "kmer searches":

1. The most obvious is the size of the kmers. In protein sequences we use a default size of

8 amino acids. An 8-mer is a "signature", if it has only been seen in proteins with a

given function. In DNA searches, if we use kmers of length 8, we search for 24-character

sequences that translate to a "signature kmer". Looking at the short piece of output relating

to E.coli, it is clear that using kmers of length 8 is (in that case) leading to a situation in

which there is a low-level (say, 1%) of the hits that are "spurious" in the sense that the

matches are almost certainly not instances of the specified function (which is usually "hypothetical

protein".

2. A minimum number of hits (occurrences of a "signature kmer" within the region). In the short piece

of output displayed above, we were using a minimum number of hits of 3. It is clear from the

output that, with kmers of size 8, we were getting hits of as many as five due to random

occurrences within the DNA strings.

3. A "maxGap" parameter requires that the gap between occurrences of "signature kmers" be less than

or equal to the specified value for the hits to be grouped into a single region. We were using a

length of 600, which is probably too long.

4. Finally, we set a minimum size for a detected region. Note that you could recognize 5 occurrences

of 8-mers within a DNA sequence of length 28, which might or might not be desirable.

I reran the experiment using kmers of length 9, a minimum number of hits of 3, a maximum gap of 300,

and a minimum size of 48. I did this by running

svr_assign_to_dna_using_figfams -by_location -kmer 9 -maxGap 300 -minSize 48 -reliability 3 < ecoli.contigs > better

This produced the following lines for the corresponding section of the E.coli genome:

NC_000913 13 NC_000913_190_253 Thr operon leader peptide Escherichia coli K12

NC_000913 797 NC_000913_248_2797 Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3) Escherichia coli K12

NC_000913 290 NC_000913_2801_3731 Homoserine kinase (EC 2.7.1.39) Escherichia coli K12

NC_000913 407 NC_000913_3734_5018 Threonine synthase (EC 4.2.3.1) Escherichia coli K12

NC_000913 122 NC_000913_5088_5674 hypothetical protein Escherichia coli K12

NC_000913 76 NC_000913_5666_5312 hypothetical protein Escherichia coli K12

NC_000913 238 NC_000913_6459_5685 UPF0246 protein YaaA Escherichia coli K12

NC_000913 425 NC_000913_7959_6531 Putative alanine/glycine transport protein Escherichia coli K12

NC_000913 310 NC_000913_8178_9189 Transaldolase (EC 2.2.1.2) Escherichia coli K12

This is substantially better. There is one bad call, but I suspect that it reflects actual

annotations in other close E.coli genomes.

Now let us see whether the differences in parameters effects our ability to estimate populations in a real

metagenomic sample. To see I ran

svr_assign_to_dna_using_figfams -kmer 8 -maxGap 600 -minSize 30 -reliability 3 < MG.sample | svr_summarize_MG_output > function.summary.loose 2> otu.summary.loose

and

svr_assign_to_dna_using_figfams -kmer 9 -maxGap 300 -minSize 54 -reliability 3 < MG.sample | svr_summarize_MG_output > function.summary.tighter 2> otu.summary.tighter

The differences in estimates of population (i.e., a tabulation of the counts of hits against distinct OTUs) was

loose:

1714 0.110860 Staphylococcus aureus subsp. aureus COL

780 0.050450 Pseudomonas fluorescens SBW25

568 0.036738 Acinetobacter baumannii AB307-0294

510 0.032986 Acidovorax avenae subsp. citrulli AAC00-1

473 0.030593 Natranaerobius thermophilus JW/NM-WN-LF

443 0.028653 Streptococcus pneumoniae TIGR4

412 0.026648 Escherichia coli K12

351 0.022702 Stackebrandtia nassauensis DSM 44728

331 0.021409 Verminephrobacter eiseniae EF01-2

287 0.018563 Burkholderia cepacia R1808

269 0.017399 Xanthomonas campestris pv. campestris ATCC 33913

268 0.017334 Propionibacterium acnes KPA171202

234 0.015135 Delftia acidovorans SPH-1

231 0.014941 Synechococcus sp. WH 8102

227 0.014682 Serratia marcescens Db11

222 0.014359 Sphingomonas wittichii RW1

204 0.013194 Burkholderia fungorum

193 0.012483 Bradyrhizobium japonicum USDA 110

188 0.012160 Bacillus anthracis str. 'Ames Ancestor'

165 0.010672 Bordetella avium 197N

162 0.010478 Sulfurospirillum deleyianum DSM 6946

=================

tighter:

1251 0.242301 Staphylococcus aureus subsp. aureus COL

442 0.085609 Pseudomonas fluorescens SBW25

329 0.063723 Streptococcus pneumoniae TIGR4

325 0.062948 Acidovorax avenae subsp. citrulli AAC00-1

319 0.061786 Acinetobacter baumannii AB307-0294

232 0.044935 Propionibacterium acnes KPA171202

227 0.043967 Verminephrobacter eiseniae EF01-2

196 0.037962 Escherichia coli K12

158 0.030602 Serratia marcescens Db11

130 0.025179 Delftia acidovorans SPH-1

107 0.020724 Herminiimonas arsenicoxydans

69 0.013364 Polaromonas sp. JS666

=================

I have included counts of the OTUs that registered at least 1% of the total tabulated hits.

What does this mean?

--------------------

I suppose that the obvious lessons are as follows:

1. I am writing this short note largely to illustrate the power of the server scripts -- I

think that it should be clear that a user can learn a number of important lessons without writing custom Perl code.

2. If one wished to do a serious analysis of a metagenomic sample using our simple tools, I would

encourage them to "calibrate" the tools on data they have already characterized.

3. Using loose parameters to extract weak hits from lower-quality sequence should be viewed critically.

It seems to me that one might consider a 2-stage approach in which the population is first estimated

fairly conservatively (pulling out the clear matches), and then a second less stringent pass could be made

using the estimated population to "weight the results" (i.e., weak hits against OTUs that were shown to be

present by the first pass would be taken more seriously in the second pass).

In any event, I hope that these comments make the tools somewhat more useful.

An Etude Relating to a Metagenomics Sample

Categories:

Search

API Listings

SEED Utilities

About this Entry

Categories

Pages

Presentations