Recently in Coding Examples Category

Some simple Sapling examples

In a brief session at the last workshop, we built several simple applications using the Sapling API.

The first is a simple replica of the svr_all_genomes script:


The second retrieves all the proteins in the Vibrio genomes:

The third retrieves the proteins in a given FASTA file and compares them to a file of annotations for the proteins in that genome:

    compare_functions fasta-file function-file
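The comparison step can be sketched as follows. This is a minimal Python sketch, not the workshop script: it assumes both sources have already been reduced to a mapping from protein ID to assigned function, and the function name and the IDs in the usage note are hypothetical.

```python
def compare_functions(server_functions, file_functions):
    """Compare two {protein_id: function} mappings (hypothetical inputs).

    Returns a list of (protein_id, server_function, file_function) tuples,
    sorted by ID, for every protein whose two annotations differ. A protein
    missing from one side is reported with None on that side.
    """
    differences = []
    for protein_id in sorted(set(server_functions) | set(file_functions)):
        server_fn = server_functions.get(protein_id)
        file_fn = file_functions.get(protein_id)
        if server_fn != file_fn:
            differences.append((protein_id, server_fn, file_fn))
    return differences
```

For example, calling it with `{"p1": "DNA polymerase I", "p2": "hypothetical protein"}` and `{"p1": "DNA polymerase I", "p2": "Thioredoxin"}` reports only `p2`, because `p1` carries the same function in both sources.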

The last performs a potentially dubious translation of the hits from a myRAST metagenomics run into proteins:

    hits-to-protein dna-fasta hits-file genetic-code
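The translation step itself can be illustrated with a short Python sketch. It is not the workshop script, and it makes several assumptions: a hit is given as 1-based inclusive `begin` and `end` coordinates on a contig, `begin > end` marks the reverse strand, and only genetic codes 1, 4, and 11 are handled. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard code, with codons in the order TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)


def codon_table(genetic_code=11):
    """Build a codon -> amino acid table for codes 1, 4, or 11."""
    codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
    table = dict(zip(codons, STANDARD_AMINO_ACIDS))
    if genetic_code == 4:
        table["TGA"] = "W"  # code 4 reads TGA as tryptophan
    elif genetic_code not in (1, 11):
        raise ValueError("this sketch only handles genetic codes 1, 4, and 11")
    return table


def reverse_complement(dna):
    """Reverse-complement an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_hit(contig, begin, end, genetic_code=11):
    """Translate the hit region; unknown codons become 'X'."""
    if begin <= end:
        dna = contig[begin - 1:end].upper()
    else:
        dna = reverse_complement(contig[end - 1:begin].upper())
    table = codon_table(genetic_code)
    return "".join(
        table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )
```

Translating a hit this way is only as reliable as its frame and boundaries, which is one reason the result should be treated with caution.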


Retrieving Features and Functions for a Genome

Once you have a genome ID, there are numerous Sapling Server methods for processing the genes and features of the genome. In this article, we will show how to get all the protein sequences, features, and annotations of a single genome.

Getting a List of Genomes and Their Taxonomies

The Sapling Server has several methods for retrieving information about genomes. In this tutorial, we'll discuss how to get a list of all the genomes and pull out basic data and metrics.

Services to Support Annotation of Genes

Identifying Genes

If one builds an annotation pipeline, one of the first steps is to identify the putative genes. Example 5 illustrates some basic functions that can be invoked via the servers to identify protein-encoding, rRNA-encoding, and tRNA-encoding genes. These services use tools made available by JCVI, Niels Larsen, Gary Olsen, and Sean Eddy. They offer reasonably accurate, easily invoked services for locating genes in prokaryotic genomes.

Access to Functional Coupling (Conserved Contiguity) Data

A great deal has been learned from studying genes that tend to occur close to one another in diverse genomes [PMID: 11471247, PMID: 9787636, PMID: 10077608, PMID: 11230160, PMID: 18712303].

Example 4 accesses the SEED server that provides the data we use to compute co-occurrence scores. The program illustrates the potential for constructing custom tools: it goes through all of the protein-encoding genes in all of the complete prokaryotic genomes maintained within the SEED, looking for "hypothetical proteins" that tend to co-occur with genes encoding functions that can be connected to subsystems. The program constructs a table showing

  1. the gene,
  2. the function of the gene,
  3. the genome id containing such a gene,
  4. the description of the genome,
  5. the non-hypothetical gene in a subsystem that appears to have the strongest co-occurrence score,
  6. the co-occurrence score, and
  7. the function assigned to the co-occurring gene contained in a subsystem.

We believe that many variations on this basic data-mining capability could be implemented on top of the co-occurrence data.
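The table construction described above can be sketched in Python once the data has been fetched. This is an illustration under stated assumptions, not Example 4 itself: `genes`, `coupling`, and `subsystem_genes` are hypothetical in-memory structures standing in for server results, and a gene is treated as hypothetical when its function contains the phrase "hypothetical protein".

```python
def is_hypothetical(function):
    """Treat any function mentioning 'hypothetical protein' as hypothetical."""
    return "hypothetical protein" in function.lower()


def cooccurrence_table(genes, coupling, subsystem_genes):
    """Build the seven-column table for hypothetical proteins.

    genes:           {gene_id: {"function", "genome_id", "genome_name"}}
    coupling:        {gene_id: [(partner_id, score), ...]}
    subsystem_genes: set of gene IDs that occur in some subsystem
    (All three structures are assumptions made for this sketch.)
    """
    rows = []
    for gene_id, info in genes.items():
        if not is_hypothetical(info["function"]):
            continue
        # Keep only non-hypothetical partners that sit in a subsystem.
        partners = [
            (partner, score)
            for partner, score in coupling.get(gene_id, [])
            if partner in subsystem_genes
            and partner in genes
            and not is_hypothetical(genes[partner]["function"])
        ]
        if not partners:
            continue
        best, score = max(partners, key=lambda pair: pair[1])
        rows.append((
            gene_id, info["function"], info["genome_id"], info["genome_name"],
            best, score, genes[best]["function"],
        ))
    return rows
```

Each row follows the column order of the list above, so the result can be written out directly as a tab-separated table.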

Creating Custom Interfaces

Suppose that you had substantial expertise in graphical interfaces, understood the power of comparative analysis, and wished to support the ability to graphically display the chromosomal regions around a set of genes (normally from distinct genomes). The SEED offers one alternative for doing this (see the region displayed here for an example), but suppose that you did not like forcing users to find appropriate SEED IDs and you thought that you could develop a superior display.

Example 3 illustrates the functions required to determine the location of a SEED gene encoding a specific protein and to acquire the genes from a given region centered on that location. If you were to create a program to accept arbitrary protein IDs, use the conversion capabilities demonstrated in example1, and display the regions graphically around these genes, you would have the core of a useful tool. If you shaded genes from the same subsystem (determined using the capabilities described in example2), you could enhance the supported functionality. Of course, you could also compute which genes could be connected to literature or structures and encode that data as well.
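The region-selection step can be sketched in Python as follows. This is not Example 3: it assumes gene locations are already available as a hypothetical mapping from gene ID to `(contig, begin, end)` with 1-based coordinates, where `begin > end` indicates the minus strand, and the function name is made up for illustration.

```python
def genes_in_region(genes, focus_id, half_width=5000):
    """Return the IDs of genes overlapping a window centered on the focus gene.

    genes: {gene_id: (contig, begin, end)}, a hypothetical structure.
    The window spans half_width bases on each side of the focus gene's
    midpoint; genes are returned in order of their leftmost coordinate.
    """
    contig, begin, end = genes[focus_id]
    center = (begin + end) // 2
    window_left, window_right = center - half_width, center + half_width

    hits = []
    for gene_id, (g_contig, g_begin, g_end) in genes.items():
        left, right = min(g_begin, g_end), max(g_begin, g_end)
        overlaps = left <= window_right and right >= window_left
        if g_contig == contig and overlaps:
            hits.append((left, gene_id))
    return [gene_id for _, gene_id in sorted(hits)]
```

A display tool would then draw the returned genes as arrows, using the sign of `end - begin` for direction.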

Given a set of functional roles, one often wishes to understand what subsystems can be inferred from the set. This example reads as input a set of functional roles and constructs a table of the subsystems that can be identified, along with their variation codes. The data displayed in this simple example could form the start of a research project to gather the functional roles not connected to subsystems, to determine whether they were not connected because a small set of functional roles was missing from the input, and to seek candidates for such "missing functional roles". The ability to easily map functional roles into subsystems will improve over time, as the SEED annotation effort improves its collection of encoded subsystems [PMID: 16214803].
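The role-to-subsystem mapping can be illustrated with a small Python sketch. It is deliberately simpler than the example described above: it assumes a hypothetical dictionary from subsystem name to the set of roles that subsystem contains, and it does not attempt to compute variation codes.

```python
def subsystems_for_roles(roles, subsystem_roles):
    """Map input roles onto subsystems.

    roles:           iterable of functional role names
    subsystem_roles: {subsystem_name: set of role names} (hypothetical)

    Returns (table, unconnected), where table maps each matched subsystem
    to its present and missing roles, and unconnected lists the input
    roles that matched no subsystem.
    """
    roles = set(roles)
    table = {}
    connected = set()
    for name, needed in subsystem_roles.items():
        present = roles & needed
        if present:
            table[name] = {
                "present": sorted(present),
                "missing": sorted(needed - roles),
            }
            connected |= present
    return table, sorted(roles - connected)
```

The `missing` lists are a natural starting point for a search for "missing functional roles", and the unconnected roles are the ones that would need new or extended subsystems.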

Conversion of Gene and Protein IDs

In example1 we illustrate some basic capabilities that relate to determining the set of IDs attached to specific protein sequences. The program accepts a protein ID as input. The ID may be one of several that are maintained by the SEED, UniProt, RefSeq, KEGG and other groups. The program first accesses all IDs attached to identical protein sequences. This can be a fairly large set in cases in which many very similar genomes have been sequenced.
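The idea of collecting every ID attached to an identical sequence can be sketched in Python by keying sequences on a digest. This is an illustration only, not example1 and not the server's implementation: the input is a hypothetical mapping from ID to protein sequence, and MD5 is used here simply as a convenient fixed-length key.

```python
import hashlib
from collections import defaultdict


def sequence_key(sequence):
    """Digest of the upper-cased sequence, used as a grouping key."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def group_ids_by_sequence(id_to_sequence):
    """Group IDs whose protein sequences are identical."""
    groups = defaultdict(list)
    for protein_id, sequence in id_to_sequence.items():
        groups[sequence_key(sequence)].append(protein_id)
    return groups


def equivalent_ids(protein_id, id_to_sequence):
    """All IDs (including the input) attached to the same sequence."""
    groups = group_ids_by_sequence(id_to_sequence)
    return sorted(groups[sequence_key(id_to_sequence[protein_id])])
```

With many closely related genomes, a single group can hold a large number of IDs from different sources, which matches the observation above.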