Recently in Tutorials Category

Some simple Sapling examples

At a brief session at the last workshop we built several simple applications using the Sapling API.

The first is a simple replica of the svr_all_genomes script:

     allgenomes

The second allows one to recall all the proteins in the vibrio genomes:


The third allows one to recall the proteins in a given fasta file, and compare to a file of annotations for the proteins in that genome:

    compare_functions fasta-file function-file

The last does a potentially dubious translation of the hits from a myRAST metagenomics output run into proteins.

    hits-to-protein dna-fasta hits-file genetic-code

Enhanced by Zemanta

An Exercise to Get a List of Homologs of Virulence Factors

An Exercise to Get a List of Homologs of Virulence Factors


Suppose that you have a file containing data relating to known virulence factors. For example, here are the first few lines of a file given to me by a participant in one of our tutorials (These columns are separated by tab characters):


VFG_id gi_id Gene_name Product Organism VF_id


VFG1293 21282773        hla     Alpha-Hemolysin precursor       Staphylococcus aureus MW2       VF0001

VFG1798 46588   hlb     beta-hemolysin  Staphylococcus aureus   VF0002


The second column contains GI numbers (without the usual "gi|" prefix).  The goal was to extract the set of FIGfams containing these genes, along with a list of the PEGs (and their functions).


We can do these easily in two steps. We'll use the "svr" command line tools in this example. If you are using Windows, be sure to remember to use the myRAST shell,  or, if you are using a Mac,  ensure that the svr commands are in your path with this command in your shell window:


export PATH=$PATH:/Applications/myRAST.app/bin


Our solution requires that the servers use the data from the PSEED, so be sure to run this command in your shell window:


Mac users:  export SAS_SERVER=PSEED

Windows Users: set SAS_SERVER=PSEED


 Now, to begin,  let's try


svr_ids_to_figfams -c=2 -source=NCBI < known.virulence.factors 2> no.matches | svr_figfams_to_ids > with.figfams


This produces two files.  The file "no.matches" contains a list of the lines from the input file ("known.virulence.factors") that have not yet been placed into FIGfams. The second file ("with.figfams") is a file in which two columns have been added: a FIGfam number and a PEG from that FIGfam.  Now, if we wished to add a function of the PEG, as well as a readable version of the genome it is from, we would use


svr_gene_data function genome-name < with.figfams > desired.table.txt


If the user needs to see the "aliases" for the PEGs that have been found, they might try


svr_aliases_of -c=9 < desired.table.txt > desired.table.aliases.txt


but be warned: we return all of the IDs of protein sequences that we know about that have the same protein sequence (from a potentially large collection of very similar genomes).


Finally, we should note that the initial virulence factors may have multiple entries leading to the same FIGfam (and, hence, to the same sets of PEGs).  This can lead to the situation  where multiple roles, while not identical, would have information on the same PEG.  To collapse the table to unique, you need to sort on the 9th field, assuming tab-delimited fields.

This can be done using


sort -k 9 -t $'\t' -u < desired.table.aliases.txt  > sorted.txt



Let us consider an somewhat more useful alternative.


We begin by just connecting the aliases in column 2 to PEGs in the SEED:


svr_aliases_to_pegs -c=2 -source=NCBI -protein=1 < known.virulence.factors 2> no.matches.1 > with.peg


This produces a file (no.matches.1) of lines that could not be connected to PEGs.


Now,


svr_ids_to_figfams < with.peg > with.ff 2> no.matches.to.ff


can be used to add a FIGfam function and FIGfam ID to the end of each line

for which the identified PEG is in a FIGfam.  Those lines that do not have a PEG

get written to a file (no.matches.to.ff).


Now, it is worth noting that we may have many lines pointing at the same FIGfam (i.e.,

many lines with the same FIGfam ID in the last column, which is column 9).

By using


   sort -u -k 9 -t$'\t' with.ff > with.ff.sorted


we sort the file on the FIGfam ID, deleting all but one of each set that share

a common FIGfam ID.


Now,


svr_figfams_to_ids < with.ff.sorted > with.ff.peg


can be used to expand each input line to a set of lines, each one containing a 

distinct PEG from the FIGfam.  Using 


svr_gene_data function genome-name < with.ff.peg > complete


we add the PEG function and the genome-name to the end of each row.


Finally, we run


svr_aliases_of -c 10 < complete > complete.with.aliases


to add known aliases for each PEG as the last column in the table.





An Etude Relating to a Metagenomics Sample

In another short note, I suggested using

 svr_assign_to_dna_using_figfams < MG.sample | svr_summarize_MG_output > function.summary 2> otu.summary

to get summaries of function and population from a sample in the file
MG.sample.

A participant in one of our tutorials took a sample and ran it through
both the two commands above and did a more thorough analysis using
MG-RAST.  The estimates of the most highly represented phylogenetic
groups differed somewhat, and the participant asked me to look into
it.

I am going to use this request as a motivation for showing how one
might use the server scripts effectively.


A Short Note on Mapping PEGs to Subsystems

A Short Note on Mapping PEGs to Subsystems

You can use

 svr_all_features 83333.1 peg | svr_ids_to_subsystems 2> /dev/null > E.coli.pegs.in.subsystems

to get a 2-column file [PEG,Subsystem] for the PEGs in the genome with ID 83333.1
(which happens to be E.coli).  If you also want the function of the PEGs, use

 svr_all_features 83333.1 peg | svr_ids_to_subsystems 2> /dev/null | svr_function_of -c 1 > E.coli.pegs.in.subsystems


The 

    svr_function_of -c 1

command takes as input a file in which each line is a tab-separated set of fields.  The

-c 1

says that the PEG ids for which functions are to be appended occur in the first column.

This produces output like

fig|83333.1.peg.2       CBSS-216591.1.peg.168   Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)
fig|83333.1.peg.2       Lysine Biosynthesis DAP Pathway Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)
fig|83333.1.peg.2       Threonine and Homoserine Biosynthesis   Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)
fig|83333.1.peg.2       Threonine and Homoserine Biosynthesis   Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)
fig|83333.1.peg.2       Methionine Biosynthesis Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)
fig|83333.1.peg.3       Threonine and Homoserine Biosynthesis   Homoserine kinase (EC 2.7.1.39)
fig|83333.1.peg.3       Methionine Biosynthesis Homoserine kinase (EC 2.7.1.39)
fig|83333.1.peg.3       CBSS-269482.1.peg.1294  Homoserine kinase (EC 2.7.1.39)
fig|83333.1.peg.4       Threonine and Homoserine Biosynthesis   Threonine synthase (EC 4.2.3.1)
fig|83333.1.peg.6       YaaA    UPF0246 protein YaaA

A Short Note on Use of Server Scripts to Access Functional Coupling Scores

A Short Note on Use of Server Scripts to Access Functional Coupling Scores

This short note will discuss the following command and what it accomplishes:


svr_all_features 83333.1 peg | svr_ids_to_figfams | svr_fc_figfams -MinSc 100 | svr_figfams_to_ids 83333.1 | svr_function_of > EC.data

It chains together 5 svr scripts, which I will now discuss.

RAST Tutorial Links

Getting Summaries of Functional Content and OTUs for an Metagenomic Sample

(Note: in order to use the svr commands, you must have installed the myRAST app and set your environment correctly, see this post for instructions)

It is worth mentioning that two of the svr functions provide a means
of getting quick summaries of content for a newly-sequenced metagenomic sample.

 svr_assign_to_dna_using_figfams < MG.sample

takes as input a set of DNA sequences in fasta format.  It outputs a 5-column, tab-separated table containing:

  1. The ID of one of the sequences
  2.  The number of Kmer hits against the sequence
  3.  The region identified as potentially supporting the function  (in the form of a contig, begin, and end coordinates separated by underscores),
  4.  The function associated with the region (which may just be "hypothetical protein"), 
  5.   A genome name that represents an "operational taxonomic unit" that appears to be the source of the hit.

This tab-separated table can be summarized using

  svr_summarize_MG_output < table > function.summary 2> otu.summary

Normally, these are just pipelined using

 svr_assign_to_dna_using_figfams < MG.sample | svr_summarize_MG_output > function.summary 2> otu.summary

The pipeline will usually process roughly 6-8 megabases of data per minute.

Finally, you can use

svr_metabolic_reconstruction < function.summary | cut -f 4,5 | sort -u

to get a quick metabolic reconstruction summarizing the active subsystems that could be determined (along with the appropriate variant code).

Video Tutorial: Creating a RAST Account

This tutorial contains two videos demonstrating how to create a new account on the RAST Annotation Service.

Downloading the FIGfams

In this tutorial we will show how to use the command-line server scripts to access the complete set of FIGfams.

Scan For Matches

scan_for_matches is a utility written in C for locating patterns in DNA or protein FASTA files.

You can download the source and installation instructions at http://www.theseed.org/servers/downloads/scan_for_matches.tgz.

The utility was written by Ross Overbeek; David Joerg and Morgan Price wrote sections of an earlier version. It is worth noting that it was strongly influenced by the elegant tools developed and distributed by David Searls.