Getting a List of Genomes and Their Taxonomies

The Sapling Server has several methods for retrieving information about genomes. In this tutorial, we'll discuss how to get a list of all the genomes and pull out basic data and metrics.

Listing All Genomes

The all_genomes method returns a reference to a hash that maps the ID of every genome in the database to its scientific name. The following code gets the hash and prints it.

   use SeedEnv;

   my $sapObject = SAPserver->new();
   my $genomeHash = $sapObject->all_genomes();
   for my $genomeID (sort keys %$genomeHash) {
      print "$genomeID: $genomeHash->{$genomeID}\n";
   }

The initial output from the above program looks like this:

100226.1: Streptomyces coelicolor A3(2)
100226.8: Streptomyces coelicolor A3(2) plasmid SCP1
100226.9: Streptomyces coelicolor A3(2) plasmid SCP2
100379.3: Onion yellows phytoplasma NIM plasmid extrachromosomal DNA
100379.4: Onion yellows phytoplasma plasmid EcOYW1
100379.5: Onion yellows phytoplasma plasmid pOYM
100379.6: Onion yellows phytoplasma plasmid pOYNIM
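The sample output shows several entries that share the same number before the dot, such as a chromosome and its plasmids. The following is a minimal sketch of grouping IDs on that prefix, assuming the prefix identifies the organism. It runs on a few entries copied from the output above rather than on a live all_genomes call, and group_by_prefix is a hypothetical helper, not a Sapling Server method.

```perl
use strict;
use warnings;

# Sample entries copied from the output above. A live script would use
# the hash reference returned by all_genomes instead.
my $genomeHash = {
    '100226.1' => 'Streptomyces coelicolor A3(2)',
    '100226.8' => 'Streptomyces coelicolor A3(2) plasmid SCP1',
    '100226.9' => 'Streptomyces coelicolor A3(2) plasmid SCP2',
    '100379.3' => 'Onion yellows phytoplasma NIM plasmid extrachromosomal DNA',
};

# Hypothetical helper (not part of the Sapling Server API): returns a
# reference to a hash mapping each ID prefix to a list of the genome IDs
# that start with it.
sub group_by_prefix {
    my ($hash) = @_;
    my %groups;
    for my $genomeID (sort keys %$hash) {
        # Keep only the part of the ID before the first dot.
        my ($prefix) = split /\./, $genomeID;
        push @{ $groups{$prefix} }, $genomeID;
    }
    return \%groups;
}

my $groups = group_by_prefix($genomeHash);
for my $prefix (sort keys %$groups) {
    print "$prefix: " . join(", ", @{ $groups->{$prefix} }) . "\n";
}
```

With the four sample entries, the three Streptomyces IDs land in one group and the phytoplasma ID in another.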

The -complete option can be used in the all_genomes call to return only complete genomes, as follows.

   use SeedEnv;

   my $sapObject = SAPserver->new();
   my $genomeHash = $sapObject->all_genomes(-complete => 1);
   for my $genomeID (sort keys %$genomeHash) {
      print "$genomeID: $genomeHash->{$genomeID}\n";
   }

This also eliminates the plasmids, as you can see from the output fragment below.

100226.1: Streptomyces coelicolor A3(2)
10090.3: Mus musculus (House mouse)
101031.3: Bacillus B-14905
10116.3: Rattus norvegicus (Norway rat)
101510.15: Rhodococcus jostii RHA1
103690.1: Nostoc sp. PCC 7120
106370.11: Frankia sp. Ccl3
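Because sort keys compares the IDs as strings, 100226.1 is listed before 10090.3 in the output above. If numeric order is preferred, a comparison along the following lines could be used. This is a sketch that assumes every ID consists of two integers separated by a dot. The helper sort_genome_ids is hypothetical, and the IDs are copied from the sample output.

```perl
use strict;
use warnings;

# IDs copied from the sample output above. A live script would pass
# keys %$genomeHash instead.
my @genomeIDs = ('100226.1', '10090.3', '101031.3', '10116.3', '101510.15');

# Hypothetical helper: orders genome IDs numerically, first by the number
# before the dot and then by the number after it.
sub sort_genome_ids {
    return sort {
        my ($aMain, $aSub) = split /\./, $a;
        my ($bMain, $bSub) = split /\./, $b;
        $aMain <=> $bMain or $aSub <=> $bSub;
    } @_;
}

print "$_\n" for sort_genome_ids(@genomeIDs);
```

With this ordering, 10090.3 and 10116.3 come ahead of 100226.1.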

Once you have genome IDs, there are numerous things you can do to get more information. The following program prints a full taxonomy for each complete genome in the system.

   use SeedEnv;

   my $sapObject = SAPserver->new();
   my $genomeHash = $sapObject->all_genomes(-complete => 1);
   my $taxHash = $sapObject->taxonomy_of(-ids => [keys %$genomeHash]);
   for my $genomeID (sort keys %$genomeHash) {
      print "$genomeID: " . join(", ", @{$taxHash->{$genomeID}}) . "\n";
   }

In the above fragment, the keys of the genome hash supply the list of genomes whose taxonomies are wanted. The taxonomy_of method returns a reference to a hash, stored here in $taxHash, that maps each genome ID to a reference to a list of taxonomic group names running from the domain down to the most specific group. The for loop joins each list with commas. The first few lines of output look like this.

100226.1: Bacteria, Actinobacteria, Actinobacteria (class), Actinobacteridae, Actinomycetales, Streptomycineae, Streptomycetaceae, Streptomyces, Streptomyces coelicolor, Streptomyces coelicolor A3(2)
10090.3: Eukaryota, Fungi/Metazoa group, Metazoa, Eumetazoa, Bilateria, Coelomata, Deuterostomia, Chordata, Craniata, Vertebrata, Gnathostomata, Teleostomi, Euteleostomi, Sarcopterygii, Tetrapoda, Amniota, Mammalia, Theria, Eutheria, Euarchontoglires, Glires, Rodentia, Sciurognathi, Muroidea, Muridae, Murinae, Mus, Mus musculus
101031.3: Bacteria, Firmicutes, Bacilli, Bacillales, Bacillaceae, Bacillus, Bacillus sp. B-14905
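Since each taxonomy is an ordinary Perl list, individual levels can be picked out by index. The sketch below reports the first and last entries and the number of levels. It uses one taxonomy copied from the output above plus a made-up ID with no taxonomy. Whether taxonomy_of can actually leave an entry undefined is an assumption here, and the guard is purely defensive. The helper summarize_taxonomy is hypothetical.

```perl
use strict;
use warnings;

# One taxonomy copied from the output above, plus a made-up ID (999999.1)
# standing in for a genome with no taxonomy. A live script would use the
# hash reference returned by taxonomy_of instead.
my $taxHash = {
    '101031.3' => ['Bacteria', 'Firmicutes', 'Bacilli', 'Bacillales',
                   'Bacillaceae', 'Bacillus', 'Bacillus sp. B-14905'],
    '999999.1' => undef,
};

# Hypothetical helper: describes a taxonomy list by its first entry, its
# last entry, and its length. Returns a placeholder string if the
# argument is not a non-empty array reference.
sub summarize_taxonomy {
    my ($taxList) = @_;
    return 'no taxonomy available'
        unless ref($taxList) eq 'ARRAY' && @$taxList;
    return sprintf('domain=%s; most specific=%s; levels=%d',
                   $taxList->[0], $taxList->[-1], scalar @$taxList);
}

for my $genomeID (sort keys %$taxHash) {
    print "$genomeID: " . summarize_taxonomy($taxHash->{$genomeID}) . "\n";
}
```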

If full taxonomy information is excessive, you can ask for just the domain using genome_domain.

   use SeedEnv;

   my $sapObject = SAPserver->new();
   my $genomeHash = $sapObject->all_genomes(-complete => 1);
   my $domHash = $sapObject->genome_domain(-ids => [keys %$genomeHash]);
   for my $genomeID (sort keys %$genomeHash) {
      print "$genomeID: $domHash->{$genomeID}\n";
   }

The output begins like this.

100226.1: Bacteria
10090.3: Eukaryota
101031.3: Bacteria
10116.3: Eukaryota
101510.15: Bacteria
103690.1: Bacteria
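
A hash that maps genome IDs to domain names is easy to tally. The following sketch counts genomes per domain using the six entries shown above in place of a live genome_domain call. The helper count_by_domain is hypothetical and not part of the Sapling Server.

```perl
use strict;
use warnings;

# Entries copied from the output above. A live script would use the hash
# reference returned by genome_domain instead.
my $domHash = {
    '100226.1'  => 'Bacteria',
    '10090.3'   => 'Eukaryota',
    '101031.3'  => 'Bacteria',
    '10116.3'   => 'Eukaryota',
    '101510.15' => 'Bacteria',
    '103690.1'  => 'Bacteria',
};

# Hypothetical helper: returns a reference to a hash mapping each domain
# name to the number of genomes assigned to it.
sub count_by_domain {
    my ($hash) = @_;
    my %counts;
    $counts{$_}++ for values %$hash;
    return \%counts;
}

my $counts = count_by_domain($domHash);
for my $domain (sort keys %$counts) {
    print "$domain: $counts->{$domain}\n";
}
```

For the six sample entries this reports four Bacteria and two Eukaryota.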