Recently in Tutorials Category
An Exercise to Get a List of Homologs of Virulence Factors
Suppose that you have a file containing data relating to known virulence factors. For example, here are the first few lines of a file given to me by a participant in one of our tutorials (These columns are separated by tab characters):
VFG_id gi_id Gene_name Product Organism VF_id
VFG1293 21282773 hla Alpha-Hemolysin precursor Staphylococcus aureus MW2 VF0001
VFG1798 46588 hlb beta-hemolysin Staphylococcus aureus VF0002
The second column contains GI numbers (without the usual "gi|" prefix). The goal was to extract the set of FIGfams containing these genes, along with a list of the PEGs (and their functions).
We can do these easily in two steps. We'll use the "svr" command line tools in this example. If you are using Windows, be sure to remember to use the myRAST shell, or, if you are using a Mac, ensure that the svr commands are in your path with this command in your shell window:
export PATH=$PATH:/Applications/myRAST.app/bin
Our solution requires that the servers use the data from the PSEED, so be sure to run this command in your shell window:
Mac users: export SAS_SERVER=PSEED
Windows Users: set SAS_SERVER=PSEED
Now, to begin, let's try
svr_ids_to_figfams -c=2 -source=NCBI < known.virulence.factors 2> no.matches | svr_figfams_to_ids > with.figfams
This produces two files. The file "no.matches" contains a list of the lines from the input file ("known.virulence.factors") that have not yet been placed into FIGfams. The second file ("with.figfams") is a file in which two columns have been added: a FIGfam number and a PEG from that FIGfam. Now, if we wished to add a function of the PEG, as well as a readable version of the genome it is from, we would use
svr_gene_data function genome-name < with.figfams > desired.table.txt
If the user needs to see the "aliases" for the PEGs that have been found, they might try
svr_aliases_of -c=9 < desired.table.txt > desired.table.aliases.txt
but be warned: we return all of the IDs of protein sequences that we know about that have the same protein sequence (from a potentially large collection of very similar genomes).
Finally, we should note that the initial virulence factors may have multiple entries leading to the same FIGfam (and, hence, to the same sets of PEGs). This can lead to the situation where multiple roles, while not identical, would have information on the same PEG. To collapse the table to unique, you need to sort on the 9th field, assuming tab-delimited fields.
This can be done using
sort -k 9 -t $'\t' -u < desired.table.aliases.txt > sorted.txt
Let us consider an somewhat more useful alternative. We begin by just connecting the aliases in column 2 to PEGs in the SEED: svr_aliases_to_pegs -c=2 -source=NCBI -protein=1 < known.virulence.factors 2> no.matches.1 > with.peg This produces a file (no.matches.1) of lines that could not be connected to PEGs. Now, svr_ids_to_figfams < with.peg > with.ff 2> no.matches.to.ff can be used to add a FIGfam function and FIGfam ID to the end of each line for which the identified PEG is in a FIGfam. Those lines that do not have a PEG get written to a file (no.matches.to.ff). Now, it is worth noting that we may have many lines pointing at the same FIGfam (i.e., many lines with the same FIGfam ID in the last column, which is column 9). By using sort -u -k 9 -t$'\t' with.ff > with.ff.sorted we sort the file on the FIGfam ID, deleting all but one of each set that share a common FIGfam ID. Now, svr_figfams_to_ids < with.ff.sorted > with.ff.peg can be used to expand each input line to a set of lines, each one containing a distinct PEG from the FIGfam. Using svr_gene_data function genome-name < with.ff.peg > complete we add the PEG function and the genome-name to the end of each row. Finally, we run svr_aliases_of -c 10 < complete > complete.with.aliases to add known aliases for each PEG as the last column in the table.
- The ID of one of the sequences
- The number of Kmer hits against the sequence
- The region identified as potentially supporting the function (in the form of a contig, begin, and end coordinates separated by underscores),
- The function associated with the region (which may just be "hypothetical protein"),
- A genome name that represents an "operational taxonomic unit" that appears to be the source of the hit.
In this tutorial we will show how to use the command-line server scripts to access the complete set of FIGfams.
You can download the source and installation instructions at http://www.theseed.org/servers/downloads/scan_for_matches.tgz.
The utility was written by Ross Overbeek; David Joerg and Morgan Price wrote sections of an earlier version. It is worth noting that it was strongly influenced by the elegant tools developed and distributed by David Searls.