Retrieving Features and Functions for a Genome

Once you have a genome ID, there are numerous Sapling Server methods for processing the genes and features of the genome. In this article, we will show how to get all the protein sequences, features, and annotations of a single genome.

Given a single genome ID, the all_proteins method will return a reference to a hash that maps each gene ID from the genome to its protein sequence. The following program lists all the genes and protein sequences for the Campylobacter genome 360108.3.

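    use SeedEnv;

    my $sapObject = SAPserver->new();
    # Quote the genome ID so Perl keeps it as a string rather than a number.
    my $protHash = $sapObject->all_proteins(-id => '360108.3');
    # Print each gene ID with its protein sequence, in natural ID order.
    for my $geneID (sort { by_fig_id($a,$b) } keys %$protHash) {
        print "$geneID: $protHash->{$geneID}\n";
    }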

The by_fig_id function in the sort clause orders the gene IDs in natural sequence (peg.2 before peg.10) rather than alphabetically. The first few lines of the output look like this:

fig|360108.3.peg.1: MLEFVFIILILGIVFNLGSLYLKKDNLLEGAIQILNDIQYTQSLAMMQEGIRVDELAIAKREWFKSRWQIYFIKSAATGYDQTYTIFLDKNGDGNANLGKTEINIDREIAVDVINHNKLMNSGQSGVISKDDEKTTQRFNLTKRFGIEKVEFKGSCSGFTRLVFDEMGRVYSPLKNANYAYEKTLAKNNSDCIIRLLSKKHALCIVIDTLSGYAYIPDFKTLKSQFVNIKNKNYECSKI
fig|360108.3.peg.2: MEKIKNYKLIIILLSLDLLALLYGTSTLSISADEADIYFGEQGKSLIFSYSLLYYISHFGTFIFGQNDFGLRLPFLFFHFLSCLLLYLLALKYTKTKIDAFFSLLLFVLLPGTVASALLINAASLVIFLTLAILCAYEYEKKWLFYILLIMVLFVDKSFNILFLTFFFFGIYKRNAILFTLSLVLFGVSISFYGFDTGGRPRGYFLDTLGIFAACFSPLVFVYFFYTIYRLTFQKYKNLLWFLMSVTFVFCLLLSLRQKLFLDDFLPFCVICTPLLIKTLMQSYRVRLPVFRLRYKIFIECSIIFLIFCYFLIVANQLLYYFINNPNRHFANNYHFAKELALELKKQDVLELATAPSLQKRLRFYGIKNSNKFYLKALKQADKYDMDKKIVKVKLGKYEKVYQILNYD
fig|360108.3.peg.3: MQTIDQIFQTQIDIKKSTFLSFLCPFEDFKFLIETLKKEHPKAVHFVYAYRVLNDFNQIAEDKSDDGEPKGTSGMPTLNVLRGYDLINAALITVRYFGGIKLGTGGLVRAYSDAANAVINNSSLLSFELKKNITIAIDLKNLNRFEHFLKTYSFNFTKDFKDCKAILHIKLDEKEEQEFEIFCKNFAPFEIEKL
fig|360108.3.peg.4: MQVNYRTISSYEYDAISGQYKQVDKQIEDYSSSGDSDFMDMLNKADEKSSGDALNSSSSFQSNAQNSNSNLSNYAQMSNVYAYRFRQNEGELSMRAQSASVHNDLTQQGVNEQSKNNTLLNDLLNAI
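
The difference between natural and alphabetical ordering only becomes visible once a genome has ten or more features of one type. The following is a minimal, self-contained sketch of that idea. The compare_fig_ids function is a hypothetical stand-in written for illustration; it is not the SEED implementation of by_fig_id, and it assumes IDs of the form fig|genome.version.type.number.

```perl
use strict;
use warnings;

# Hypothetical stand-in for by_fig_id (an assumption, not the SEED code).
# Compares two feature IDs piece by piece, using numeric comparison for
# the numeric pieces so that peg.2 sorts before peg.10.
sub compare_fig_ids {
    my ($x, $y) = @_;
    my $pattern = qr/^fig\|(\d+)\.(\d+)\.([^.]+)\.(\d+)$/;
    my ($tax1, $ver1, $type1, $num1) = $x =~ $pattern;
    my ($tax2, $ver2, $type2, $num2) = $y =~ $pattern;
    # Fall back to plain string comparison if either ID does not parse.
    return $x cmp $y unless defined $num1 && defined $num2;
    return $tax1 <=> $tax2
        || $ver1 <=> $ver2
        || $type1 cmp $type2
        || $num1 <=> $num2;
}

my @ids = ('fig|360108.3.peg.10', 'fig|360108.3.peg.2', 'fig|360108.3.peg.1');

my @alphabetical = sort @ids;                              # peg.1, peg.10, peg.2
my @natural      = sort { compare_fig_ids($a, $b) } @ids;  # peg.1, peg.2, peg.10

print "alphabetical: @alphabetical\n";
print "natural:      @natural\n";
```

In the alphabetical result, peg.10 lands between peg.1 and peg.2 because the comparison is character by character; the numeric comparison on the final piece avoids that.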

If all you want is the list of gene IDs, you can use the all_features method. This method allows you to specify multiple genomes (the -ids parameter) and which type of feature you want (the -type parameter). For example, this program asks for the RNA features in the two Streptococcus pyogenes genomes 160490.1 and 198466.1.

    use SeedEnv;

    my $sapObject = SAPserver->new();
    my $rnaHash = $sapObject->all_features(-ids => ['160490.1', '198466.1'],
                                           -type => 'rna');
    for my $genome (keys %$rnaHash) {
        print "RNAs for $genome\n";
        for my $rna (@{$rnaHash->{$genome}}) {
            print "   $rna\n";
        }
    }
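
As the loop above implies, the result is keyed by genome ID, and each value is a reference to a list of feature IDs. The following sketch shows one way to walk that structure without contacting the server. The data is a hand-built mock with placeholder feature IDs, and count_features is a hypothetical helper, not part of the Sapling Server API.

```perl
use strict;
use warnings;

# Mock data (an assumption) shaped like the all_features result used above:
# genome ID => reference to a list of feature IDs. The IDs are placeholders.
my $rnaHash = {
    '160490.1' => ['fig|160490.1.rna.1', 'fig|160490.1.rna.2'],
    '198466.1' => ['fig|198466.1.rna.1', 'fig|198466.1.rna.2', 'fig|198466.1.rna.3'],
};

# Hypothetical helper: return a hash reference mapping each genome ID
# to the number of feature IDs listed for it.
sub count_features {
    my ($featureHash) = @_;
    my %counts;
    $counts{$_} = scalar @{ $featureHash->{$_} } for keys %$featureHash;
    return \%counts;
}

my $counts = count_features($rnaHash);
# Sort the genome IDs so the report order is stable from run to run.
for my $genome (sort keys %$counts) {
    print "$genome: $counts->{$genome} RNA features\n";
}
```

Sorting the keys matters here because Perl does not guarantee any particular order for hash keys.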

Given a list of gene IDs, the ids_to_functions method can be used to retrieve the functional assignments of the genes. For example, the following program displays the functional assignment for each of the RNAs in the two selected genomes.

    use SeedEnv;

    my $sapObject = SAPserver->new();
    my $rnaHash = $sapObject->all_features(-ids => ['160490.1', '198466.1'],
                                           -type => 'rna');
    for my $genome (keys %$rnaHash) {
        print "RNAs for $genome\n";
        my $rnaData = $sapObject->ids_to_functions(-ids => $rnaHash->{$genome});
        for my $rna (sort { by_fig_id($a,$b) } keys %$rnaData) {
            print "   $rna: $rnaData->{$rna}\n";
        }
    }

A fragment of the output for the above program is shown below.

RNAs for 198466.1
   fig|198466.1.rna.1: SSU rRNA
   fig|198466.1.rna.2: tRNA-Ala
   fig|198466.1.rna.3: LSU rRNA
   fig|198466.1.rna.4: tRNA-Val
   fig|198466.1.rna.5: tRNA-Asp
   fig|198466.1.rna.6: tRNA-Lys
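
Because the result maps each feature ID to its functional assignment, it can be inverted to see how many features share a function. The sketch below does this on a mock hash copied from the output fragment above, so it runs without the server. The count_by_function helper is hypothetical and not part of the Sapling Server API.

```perl
use strict;
use warnings;

# Mock data copied from the output fragment above, standing in for the
# hash reference returned by ids_to_functions.
my $rnaData = {
    'fig|198466.1.rna.1' => 'SSU rRNA',
    'fig|198466.1.rna.2' => 'tRNA-Ala',
    'fig|198466.1.rna.3' => 'LSU rRNA',
    'fig|198466.1.rna.4' => 'tRNA-Val',
    'fig|198466.1.rna.5' => 'tRNA-Asp',
    'fig|198466.1.rna.6' => 'tRNA-Lys',
};

# Hypothetical helper: return a hash reference mapping each functional
# assignment to the number of feature IDs that carry it.
sub count_by_function {
    my ($functionHash) = @_;
    my %counts;
    $counts{$_}++ for values %$functionHash;
    return \%counts;
}

my $functionCounts = count_by_function($rnaData);
for my $function (sort keys %$functionCounts) {
    print "$function: $functionCounts->{$function}\n";
}
```

With only the six features in the fragment, every function occurs once; on a full genome the same loop would show repeated assignments.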