An Etude Relating to a Metagenomics Sample

In another short note, I suggested using

 svr_assign_to_dna_using_figfams < MG.sample | svr_summarize_MG_output > function.summary 2> otu.summary

to get summaries of function and population from a sample in the file
MG.sample.

A participant in one of our tutorials took a sample and ran it through
both the two commands above and did a more thorough analysis using
MG-RAST.  The estimates of the most highly represented phylogenetic
groups differed somewhat, and the participant asked me to look into
it.

I am going to use this request as a motivation for showing how one
might use the server scripts effectively.



To begin with, I suggest getting a feel for how the server scripts
might assign function and OTUs to data we think we understand.  I
decicded to take the contigs of E.coli K12 and run them through
the svr_assign_to_dna_using_figfams tool to see what came out.

This requires that I get a fasta file of the E.coli contigs.  To do
that, I used

  svr_contigs_in_genome < e.coli.id | svr_dna_seq -fasta > ecoli.contigs

This constructs a file of the contigs that make up the E.coli genome.

Then I ran

     svr_assign_to_dna_using_figfams -by_location < ecoli.contigs > kmer.dna.output.default

This produces the output of the tool on a piece of DNA that is well anotated.
I asked to have the proposed hits displayed in order by
location so that I could compare the proposed hits (and corresponding
OTUs) against the usual annotations of the E.coli K12 genome.

The first few lines of output were as follows:

Contig        # hits    location (region)       proposed function               representative of an OTU

NC_000913 14 NC_000913_190_253 Thr operon leader peptide Escherichia coli K12
NC_000913 815 NC_000913_248_2797 Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3) Escherichia coli K12
NC_000913 5 NC_000913_572_1625 hypothetical protein Anaeromyxobacter sp. K
NC_000913 305 NC_000913_2801_3728 Homoserine kinase (EC 2.7.1.39) Escherichia coli K12
NC_000913 4 NC_000913_3213_3180 Glycerol kinase (EC 2.7.1.30) Mycobacterium smegmatis str. MC2 155
NC_000913 420 NC_000913_3734_5018 Threonine synthase (EC 4.2.3.1) Escherichia coli K12
NC_000913 136 NC_000913_5088_6022 hypothetical protein Escherichia coli K12
NC_000913 5 NC_000913_5600_5564 Ribonucleotide reductase of class Ia (aerobic), beta subunit (EC 1.17.4.1) Escherichia coli K12
NC_000913 100 NC_000913_6245_2551 hypothetical protein Escherichia coli K12
NC_000913 249 NC_000913_6459_5685 UPF0246 protein YaaA Escherichia coli K12
NC_000913 5 NC_000913_6967_7223 hypothetical protein Azoarcus sp. EbN1
NC_000913 3 NC_000913_7893_8614 hypothetical protein Tsukamurella paurometabola DSM 20162

----------------------

To get the standard annotations for E.coli, I ran

    svr_all_features 83333.1 peg | svr_function_of | svr_gene_data -c 1 location > ecoli.pegs

The first few lines of output were

fig|83333.1.peg.1 Thr operon leader peptide 83333.1:NC_000913_190+66
fig|83333.1.peg.2 Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3) 83333.1:NC_000913_337+2463
fig|83333.1.peg.3 Homoserine kinase (EC 2.7.1.39) 83333.1:NC_000913_2801+933
fig|83333.1.peg.4 Threonine synthase (EC 4.2.3.1) 83333.1:NC_000913_3734+1287
fig|83333.1.peg.5 hypothetical protein 83333.1:NC_000913_5234+297
fig|83333.1.peg.6 UPF0246 protein YaaA 83333.1:NC_000913_6459-777
fig|83333.1.peg.7 Putative alanine/glycine transport protein 83333.1:NC_000913_7959-1431
fig|83333.1.peg.8 Transaldolase (EC 2.2.1.2) 83333.1:NC_000913_8238+954

----------------------

If you compare the two outputs briefly, it becomes apparent that regions with hits of 5 or less 
represent spurious hits.  It is clear that the default parameters are producing matches that are not real.

You have several parameters that determine the behavior of the "kmer searches":

    1. The most obvious is the size of the kmers.  In protein sequences we use a default size of
       8 amino acids.  An 8-mer is a "signature", if it has only been seen in proteins with a 
       given function.  In DNA searches, if we use kmers of length 8, we search for 24-character
       sequences that translate to a "signature kmer".  Looking at the short piece of output relating
       to E.coli, it is clear that using kmers of length 8 is (in that case) leading to a situation in
       which there is a low-level (say, 1%) of the hits that are "spurious" in the sense that the
       matches are almost certainly not instances of the specified function (which is usually "hypothetical
       protein".

    2. A minimum number of hits (occurrences of a "signature kmer" within the region).  In the short piece 
       of output displayed above, we were using a minimum number of hits of 3.  It is clear from the
       output that, with kmers of size 8, we were getting hits of as many as five due to random
       occurrences within the DNA strings.

    3. A "maxGap" parameter requires that the gap between occurrences of "signature kmers" be less than
       or equal to the specified value for the hits to be grouped into a single region.  We were using a
       length of 600, which is probably too long.

    4. Finally, we set a minimum size for a detected region.  Note that you could recognize 5 occurrences
       of 8-mers within a DNA sequence of length 28, which might or might not be desirable.

I reran the experiment using kmers of length 9, a minimum number of hits of 3, a maximum gap of 300, 
and a minimum size of 48.  I did this by running

    svr_assign_to_dna_using_figfams -by_location -kmer 9 -maxGap 300 -minSize 48 -reliability 3 < ecoli.contigs > better

This produced the following lines for the corresponding section of the E.coli genome:

NC_000913 13 NC_000913_190_253 Thr operon leader peptide Escherichia coli K12
NC_000913 797 NC_000913_248_2797 Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3) Escherichia coli K12
NC_000913 290 NC_000913_2801_3731 Homoserine kinase (EC 2.7.1.39) Escherichia coli K12
NC_000913 407 NC_000913_3734_5018 Threonine synthase (EC 4.2.3.1) Escherichia coli K12
NC_000913 122 NC_000913_5088_5674 hypothetical protein Escherichia coli K12
NC_000913 76 NC_000913_5666_5312 hypothetical protein Escherichia coli K12
NC_000913 238 NC_000913_6459_5685 UPF0246 protein YaaA Escherichia coli K12
NC_000913 425 NC_000913_7959_6531 Putative alanine/glycine transport protein Escherichia coli K12
NC_000913 310 NC_000913_8178_9189 Transaldolase (EC 2.2.1.2) Escherichia coli K12


This is substantially better.  There is one bad call, but I suspect that it reflects actual
annotations in other close E.coli genomes.  

Now let us see whether the differences in parameters effects our ability to estimate populations in a real
metagenomic sample.  To see I ran

   svr_assign_to_dna_using_figfams -kmer 8 -maxGap 600 -minSize 30 -reliability 3 < MG.sample | svr_summarize_MG_output > function.summary.loose  2> otu.summary.loose
and
   svr_assign_to_dna_using_figfams -kmer 9 -maxGap 300 -minSize 54 -reliability 3 < MG.sample | svr_summarize_MG_output > function.summary.tighter  2> otu.summary.tighter

The differences in estimates of population (i.e., a tabulation of the counts of hits against distinct OTUs) was

loose:

1714 0.110860 Staphylococcus aureus subsp. aureus COL
780 0.050450 Pseudomonas fluorescens SBW25
568 0.036738 Acinetobacter baumannii AB307-0294
510 0.032986 Acidovorax avenae subsp. citrulli AAC00-1
473 0.030593 Natranaerobius thermophilus JW/NM-WN-LF
443 0.028653 Streptococcus pneumoniae TIGR4
412 0.026648 Escherichia coli K12
351 0.022702 Stackebrandtia nassauensis DSM 44728
331 0.021409 Verminephrobacter eiseniae EF01-2
287 0.018563 Burkholderia cepacia R1808
269 0.017399 Xanthomonas campestris pv. campestris ATCC 33913
268 0.017334 Propionibacterium acnes KPA171202
234 0.015135 Delftia acidovorans SPH-1
231 0.014941 Synechococcus sp. WH 8102
227 0.014682 Serratia marcescens Db11
222 0.014359 Sphingomonas wittichii RW1
204 0.013194 Burkholderia fungorum
193 0.012483 Bradyrhizobium japonicum USDA 110
188 0.012160 Bacillus anthracis str. 'Ames Ancestor'
165 0.010672 Bordetella avium 197N
162 0.010478 Sulfurospirillum deleyianum DSM 6946

=================
tighter:

1251 0.242301 Staphylococcus aureus subsp. aureus COL
442 0.085609 Pseudomonas fluorescens SBW25
329 0.063723 Streptococcus pneumoniae TIGR4
325 0.062948 Acidovorax avenae subsp. citrulli AAC00-1
319 0.061786 Acinetobacter baumannii AB307-0294
232 0.044935 Propionibacterium acnes KPA171202
227 0.043967 Verminephrobacter eiseniae EF01-2
196 0.037962 Escherichia coli K12
158 0.030602 Serratia marcescens Db11
130 0.025179 Delftia acidovorans SPH-1
107 0.020724 Herminiimonas arsenicoxydans
69 0.013364 Polaromonas sp. JS666

=================

I have included counts of the OTUs that registered at least 1% of the total tabulated hits.


What does this mean?
--------------------

I suppose that the obvious lessons are as follows:

  1.  I am writing this short note largely to illustrate the power of the server scripts -- I
      think that it should be clear that a user can learn a number of important lessons without writing custom Perl code.
  
  2.  If one wished to do a serious analysis of a metagenomic sample using our simple tools, I would
      encourage them to "calibrate" the tools on data they have already characterized.

  3.  Using loose parameters to extract weak hits from lower-quality sequence should be viewed critically.
      It seems to me that one might consider a 2-stage approach in which the population is first estimated
      fairly conservatively (pulling out the clear matches), and then a second less stringent pass could be made
      using the estimated population to "weight the results" (i.e., weak hits against OTUs that were shown to be
      present by the first pass would be taken more seriously in the second pass).

In any event, I hope that these comments make the tools somewhat more useful.