The SEED Servers: Tutorials Archives

Some simple Sapling examples

By Robert Olson on November 16, 2010 4:14 PM

At a brief session at the last workshop we built several simple applications using the Sapling API.

The first is a simple replica of the svr_all_genomes script:

The second allows one to recall all the proteins in the vibrio genomes:

The third allows one to recall the proteins in a given fasta file, and compare to a file of annotations for the proteins in that genome:

compare_functions fasta-file function-file

The last does a potentially dubious translation of the hits from a myRAST metagenomics output run into proteins.

hits-to-protein dna-fasta hits-file genetic-code

An Exercise to Get a List of Homologs of Virulence Factors

By The SEED Team on October 5, 2010 9:05 AM


Suppose that you have a file containing data on known virulence factors. For example, here are the first few lines of a file given to me by a participant in one of our tutorials (the columns are separated by tab characters):

VFG_id gi_id Gene_name Product Organism VF_id

VFG1293 21282773 hla Alpha-Hemolysin precursor Staphylococcus aureus MW2 VF0001

VFG1798 46588 hlb beta-hemolysin Staphylococcus aureus VF0002

The second column contains GI numbers (without the usual "gi|" prefix). The goal was to extract the set of FIGfams containing these genes, along with a list of the PEGs (and their functions).
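Before running any svr command, it can help to confirm that the second column really holds bare GI numbers. The following is a minimal sketch using only standard shell tools. The rows are the sample rows quoted above, and the file name sample.tsv is a placeholder, not a file from the tutorial.

```shell
# Build a small tab-separated sample from the rows quoted above
# (sample.tsv is a placeholder name, not a file from the tutorial).
workdir=$(mktemp -d)
printf 'VFG_id\tgi_id\tGene_name\tProduct\tOrganism\tVF_id\n' > "$workdir/sample.tsv"
printf 'VFG1293\t21282773\thla\tAlpha-Hemolysin precursor\tStaphylococcus aureus MW2\tVF0001\n' >> "$workdir/sample.tsv"
printf 'VFG1798\t46588\thlb\tbeta-hemolysin\tStaphylococcus aureus\tVF0002\n' >> "$workdir/sample.tsv"

# Skip the header line, take column 2, and keep only purely numeric values.
gi_ids=$(tail -n +2 "$workdir/sample.tsv" | cut -f 2 | grep -E '^[0-9]+$')
echo "$gi_ids"
```

Any row whose second column is not purely numeric is dropped by the grep, so comparing the number of lines printed against the number of data rows is a quick check of the input.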

We can do this easily in two steps. We'll use the "svr" command line tools in this example. If you are using Windows, be sure to use the myRAST shell. If you are using a Mac, make sure the svr commands are in your path by running this command in your shell window:

export PATH=$PATH:/Applications/myRAST.app/bin

Our solution requires that the servers use the data from the PSEED, so be sure to run this command in your shell window:

Mac users: export SAS_SERVER=PSEED

Windows Users: set SAS_SERVER=PSEED
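On a Mac or other Unix-style shell, a quick check like the one below can confirm both settings before you start. This is only a sketch: check_svr_env is a hypothetical helper name, not something shipped with myRAST, and it does not apply to the Windows command prompt.

```shell
# Report whether the environment described above looks ready.
# check_svr_env is a hypothetical helper name, not part of myRAST.
check_svr_env() {
    if [ "${SAS_SERVER:-}" = "PSEED" ]; then
        echo "SAS_SERVER is set to PSEED"
    else
        echo "SAS_SERVER is '${SAS_SERVER:-unset}', expected PSEED"
    fi
    # command -v succeeds only if the shell can find the command on PATH.
    if command -v svr_ids_to_figfams >/dev/null 2>&1; then
        echo "svr commands found on PATH"
    else
        echo "svr commands not found on PATH"
    fi
}

check_svr_env
```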

Now, to begin, let's try

svr_ids_to_figfams -c=2 -source=NCBI < known.virulence.factors 2> no.matches | svr_figfams_to_ids > with.figfams

This produces two files. The file "no.matches" contains a list of the lines from the input file ("known.virulence.factors") that have not yet been placed into FIGfams. The second file ("with.figfams") is a file in which two columns have been added: a FIGfam number and a PEG from that FIGfam. Now, if we wished to add a function of the PEG, as well as a readable version of the genome it is from, we would use

svr_gene_data function genome-name < with.figfams > desired.table.txt

If the user needs to see the "aliases" for the PEGs that have been found, they might try

svr_aliases_of -c=9 < desired.table.txt > desired.table.aliases.txt

but be warned: we return all of the IDs of protein sequences that we know about that have the same protein sequence (from a potentially large collection of very similar genomes).

Finally, we should note that the initial virulence factors may have multiple entries leading to the same FIGfam (and, hence, to the same sets of PEGs). This can lead to a situation where multiple rows, while not identical, carry information on the same PEG. To collapse the table to one row per PEG, sort on the 9th field with the unique option, assuming tab-delimited fields.

This can be done using

sort -k 9 -t $'\t' -u < desired.table.aliases.txt > sorted.txt
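One detail of sort is worth seeing on a toy table: a key given as -k N runs from field N to the end of the line, while -k N,N restricts the comparison to field N alone. The sketch below uses a made-up three-column table in which column 2 plays the role of the PEG column (column 9 in the table above). The IDs and notes are invented for illustration.

```shell
# Toy table: column 2 stands in for the PEG column; all values are made up.
workdir=$(mktemp -d)
tab=$(printf '\t')
printf 'rowA\tfig|1.1.peg.1\tnote1\n'  > "$workdir/toy.txt"
printf 'rowB\tfig|1.1.peg.2\tnote2\n' >> "$workdir/toy.txt"
printf 'rowC\tfig|1.1.peg.1\tnote3\n' >> "$workdir/toy.txt"

# -k 2,2: compare field 2 only, so rowA and rowC collapse to one line.
LC_ALL=C sort -t "$tab" -k 2,2 -u "$workdir/toy.txt" > "$workdir/by_field.txt"

# -k 2: compare from field 2 to end of line, so the differing notes
# keep rowA and rowC apart.
LC_ALL=C sort -t "$tab" -k 2 -u "$workdir/toy.txt" > "$workdir/to_eol.txt"

wc -l < "$workdir/by_field.txt"
wc -l < "$workdir/to_eol.txt"
```

If rows that share a PEG can differ in any later column, -k 9,9 is the form that guarantees one row per PEG.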

Let us consider a somewhat more useful alternative.

We begin by just connecting the aliases in column 2 to PEGs in the SEED:

svr_aliases_to_pegs -c=2 -source=NCBI -protein=1 < known.virulence.factors 2> no.matches.1 > with.peg

This produces a file (no.matches.1) of lines that could not be connected to PEGs.

Now,

svr_ids_to_figfams < with.peg > with.ff 2> no.matches.to.ff

can be used to add a FIGfam function and FIGfam ID to the end of each line for which the identified PEG is in a FIGfam. Lines whose PEG is not in a FIGfam are written to a file (no.matches.to.ff).

Now, it is worth noting that we may have many lines pointing at the same FIGfam (i.e., many lines with the same FIGfam ID in the last column, which is column 9).

By using

sort -u -k 9 -t$'\t' with.ff > with.ff.sorted

we sort the file on the FIGfam ID, deleting all but one of each set that share a common FIGfam ID.
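If you would rather keep the first line seen for each FIGfam ID and leave the remaining lines in their original order, one alternative (not the method used in this tutorial) is a standard awk idiom. The sketch below uses a two-column toy file with made-up IDs in column 2. For the real file you would test column 9 instead.

```shell
# Toy input: column 2 stands in for the FIGfam ID column; IDs are made up.
workdir=$(mktemp -d)
printf 'first\tFIG001\nsecond\tFIG002\nthird\tFIG001\n' > "$workdir/toy.ff"

# Print a line only the first time its column-2 value is seen.
# seen[$2]++ evaluates to 0 (false) on first sight, so !seen[$2]++ is true.
awk -F '\t' '!seen[$2]++' "$workdir/toy.ff" > "$workdir/toy.ff.unique"
cat "$workdir/toy.ff.unique"
```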

Now,

svr_figfams_to_ids < with.ff.sorted > with.ff.peg

can be used to expand each input line to a set of lines, each one containing a distinct PEG from the FIGfam. Using

svr_gene_data function genome-name < with.ff.peg > complete

we add the PEG function and the genome-name to the end of each row.

Finally, we run

svr_aliases_of -c 10 < complete > complete.with.aliases

to add known aliases for each PEG as the last column in the table.
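The steps of this second approach can be collected into a single shell function, sketched below. It uses only the commands and options shown above. The function name virulence_homologs is a hypothetical choice, and stopping at the first failing step with && assumes the svr commands return a nonzero exit status on error, which is an assumption rather than documented behavior.

```shell
# Chain the steps above. virulence_homologs is a hypothetical name.
# Assumption: each svr command exits nonzero on failure, so && stops the chain.
virulence_homologs() {
    input=$1   # for example, known.virulence.factors
    svr_aliases_to_pegs -c=2 -source=NCBI -protein=1 < "$input" 2> no.matches.1 > with.peg &&
    svr_ids_to_figfams < with.peg > with.ff 2> no.matches.to.ff &&
    sort -u -k 9 -t "$(printf '\t')" with.ff > with.ff.sorted &&
    svr_figfams_to_ids < with.ff.sorted > with.ff.peg &&
    svr_gene_data function genome-name < with.ff.peg > complete &&
    svr_aliases_of -c 10 < complete > complete.with.aliases
}
```

Calling virulence_homologs known.virulence.factors in an empty working directory would leave the same intermediate and final files as running the steps by hand.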

An Etude Relating to a Metagenomics Sample

By The SEED Team on September 27, 2010 11:10 AM

In another short note, I suggested using

svr_assign_to_dna_using_figfams < MG.sample | svr_summarize_MG_output > function.summary 2> otu.summary

to get summaries of function and population from a sample in the file MG.sample.

A participant in one of our tutorials took a sample, ran it through the two-command pipeline above, and also did a more thorough analysis using MG-RAST. The estimates of the most highly represented phylogenetic groups differed somewhat, and the participant asked me to look into it.

I am going to use this request as a motivation for showing how one might use the server scripts effectively.

Continue reading An Etude Relating to a Metagenomics Sample.

A Short Note on Mapping PEGs to Subsystems

By The SEED Team on September 14, 2010 11:32 AM


You can use

svr_all_features 83333.1 peg | svr_ids_to_subsystems 2> /dev/null > E.coli.pegs.in.subsystems

to get a 2-column file [PEG,Subsystem] for the PEGs in the genome with ID 83333.1 (which happens to be E. coli). If you also want the function of the PEGs, use

svr_all_features 83333.1 peg | svr_ids_to_subsystems 2> /dev/null | svr_function_of -c 1 > E.coli.pegs.in.subsystems

The svr_function_of -c 1 command takes as input a file in which each line is a tab-separated set of fields. The -c 1 option says that the PEG IDs for which functions are to be appended occur in the first column.

This produces output like

fig|83333.1.peg.2 CBSS-216591.1.peg.168 Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)

fig|83333.1.peg.2 Lysine Biosynthesis DAP Pathway Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)

fig|83333.1.peg.2 Threonine and Homoserine Biosynthesis Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)

fig|83333.1.peg.2 Methionine Biosynthesis Aspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)

fig|83333.1.peg.3 Threonine and Homoserine Biosynthesis Homoserine kinase (EC 2.7.1.39)

fig|83333.1.peg.3 Methionine Biosynthesis Homoserine kinase (EC 2.7.1.39)

fig|83333.1.peg.3 CBSS-269482.1.peg.1294 Homoserine kinase (EC 2.7.1.39)

fig|83333.1.peg.4 Threonine and Homoserine Biosynthesis Threonine synthase (EC 4.2.3.1)

fig|83333.1.peg.6 YaaA UPF0246 protein YaaA
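Once you have a table like this, standard shell tools are enough for simple summaries. As a sketch, the following counts how many subsystem rows each PEG has, using three rows copied from the output above. The file name sample.subsystems is a placeholder.

```shell
workdir=$(mktemp -d)
# A few [PEG, Subsystem, Function] rows copied from the output above.
{
    printf 'fig|83333.1.peg.2\tLysine Biosynthesis DAP Pathway\tAspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)\n'
    printf 'fig|83333.1.peg.2\tMethionine Biosynthesis\tAspartokinase (EC 2.7.2.4) / Homoserine dehydrogenase (EC 1.1.1.3)\n'
    printf 'fig|83333.1.peg.3\tMethionine Biosynthesis\tHomoserine kinase (EC 2.7.1.39)\n'
} > "$workdir/sample.subsystems"

# Take the PEG column, count repeats, and print "PEG <tab> count".
counts=$(cut -f 1 "$workdir/sample.subsystems" | sort | uniq -c | awk '{print $2 "\t" $1}')
echo "$counts"
```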

A Short Note on Use of Server Scripts to Access Functional Coupling Scores

By The SEED Team on September 14, 2010 11:28 AM


This short note will discuss the following command and what it accomplishes:

svr_all_features 83333.1 peg | svr_ids_to_figfams | svr_fc_figfams -MinSc 100 | svr_figfams_to_ids 83333.1 | svr_function_of > EC.data

It chains together 5 svr scripts, which I will now discuss.

Continue reading A Short Note on Use of Server Scripts to Access Functional Coupling Scores.

RAST Tutorial Links

By The SEED Team on September 2, 2010 11:49 AM

These links open in new windows, allowing you to follow the presentations in sequence or explore topics of interest.

Getting Summaries of Functional Content and OTUs for a Metagenomic Sample

By The SEED Team on August 29, 2010 2:04 PM


(Note: in order to use the svr commands, you must have installed the myRAST app and set your environment correctly; see this post for instructions.)

It is worth mentioning that two of the svr functions provide a means of getting quick summaries of content for a newly-sequenced metagenomic sample.

svr_assign_to_dna_using_figfams < MG.sample

takes as input a set of DNA sequences in fasta format. It outputs a 5-column, tab-separated table containing:

The ID of one of the sequences,
The number of Kmer hits against the sequence,
The region identified as potentially supporting the function (in the form of a contig, begin, and end coordinates separated by underscores),
The function associated with the region (which may just be "hypothetical protein"),
A genome name that represents an "operational taxonomic unit" that appears to be the source of the hit.
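The region column packs three values into one string, so it usually needs to be split before further use. The sketch below shows one way to do that in the shell. parse_region is a hypothetical helper, the sample region string is made up, and the code assumes that the last two underscore-separated tokens are the begin and end coordinates, so that contig IDs containing underscores still parse.

```shell
# parse_region is a hypothetical helper, not an svr command.
# Assumption: the last two underscore-separated tokens are begin and end.
parse_region() {
    region=$1
    end=${region##*_}      # text after the last underscore
    rest=${region%_*}      # everything before the last underscore
    begin=${rest##*_}      # last token of what remains
    contig=${rest%_*}      # the contig ID, underscores and all
    printf '%s\t%s\t%s\n' "$contig" "$begin" "$end"
}

# Made-up region string for illustration.
parse_region 'contig_12_100_400'
```

Splitting from the right rather than the left is the design choice that matters here, since a contig ID may itself contain underscores.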

This tab-separated table can be summarized using

svr_summarize_MG_output < table > function.summary 2> otu.summary

Normally, these are just pipelined using

svr_assign_to_dna_using_figfams < MG.sample | svr_summarize_MG_output > function.summary 2> otu.summary

The pipeline will usually process roughly 6-8 megabases of data per minute.
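That throughput figure gives a rough way to estimate run time before starting. The sketch below counts the bases in a FASTA file and converts a base count into a range of minutes using the 6-8 megabases per minute quoted above. The helper names count_bases and estimate_minutes and the toy FASTA file are illustrative assumptions.

```shell
# count_bases and estimate_minutes are hypothetical helper names.
count_bases() {
    # Drop FASTA header lines, strip line endings, count remaining characters.
    grep -v '^>' "$1" | tr -d '\n\r' | wc -c | tr -d ' '
}

estimate_minutes() {
    # Uses the 6-8 megabases per minute figure quoted above:
    # the fast rate gives the lower bound, the slow rate the upper bound.
    awk -v b="$1" 'BEGIN { printf "%.1f-%.1f minutes\n", b / 8e6, b / 6e6 }'
}

# Toy FASTA file with made-up reads.
workdir=$(mktemp -d)
printf '>read1\nACGT\nACGT\n>read2\nACGT\n' > "$workdir/toy.fa"
count_bases "$workdir/toy.fa"
estimate_minutes 48000000
```

For a real sample, estimate_minutes "$(count_bases MG.sample)" would combine the two steps.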

Finally, you can use

svr_metabolic_reconstruction < function.summary | cut -f 4,5 | sort -u

to get a quick metabolic reconstruction summarizing the active subsystems that could be determined (along with the appropriate variant code).

Video Tutorial: Creating a RAST Account

By The SEED Team on August 19, 2010 2:22 PM

This tutorial contains two videos demonstrating how to create a new account on the RAST Annotation Service.

Continue reading Video Tutorial: Creating a RAST Account.

Downloading the FIGfams

By The SEED Team on August 10, 2010 3:17 PM

In this tutorial we will show how to use the command-line server scripts to access the complete set of FIGfams.

Continue reading Downloading the FIGfams.

Scan For Matches

By The SEED Team on July 16, 2010 2:07 PM

scan_for_matches is a utility written in C for locating patterns in DNA or protein FASTA files.

You can download the source and installation instructions at http://www.theseed.org/servers/downloads/scan_for_matches.tgz.

The utility was written by Ross Overbeek; David Joerg and Morgan Price wrote sections of an earlier version. It is worth noting that it was strongly influenced by the elegant tools developed and distributed by David Searls.

Continue reading Scan For Matches.
