An Exercise to Get a List of Homologs of Virulence Factors

An Exercise to Get a List of Homologs of Virulence Factors

Suppose that you have a file containing data relating to known virulence factors. For example, here are the first few lines of a file given to me by a participant in one of our tutorials (These columns are separated by tab characters):

VFG_id gi_id Gene_name Product Organism VF_id

VFG1293 21282773        hla     Alpha-Hemolysin precursor       Staphylococcus aureus MW2       VF0001

VFG1798 46588   hlb     beta-hemolysin  Staphylococcus aureus   VF0002

The second column contains GI numbers (without the usual "gi|" prefix).  The goal was to extract the set of FIGfams containing these genes, along with a list of the PEGs (and their functions).

We can do these easily in two steps. We'll use the "svr" command line tools in this example. If you are using Windows, be sure to remember to use the myRAST shell,  or, if you are using a Mac,  ensure that the svr commands are in your path with this command in your shell window:

export PATH=$PATH:/Applications/

Our solution requires that the servers use the data from the PSEED, so be sure to run this command in your shell window:

Mac users:  export SAS_SERVER=PSEED

Windows Users: set SAS_SERVER=PSEED

 Now, to begin,  let's try

svr_ids_to_figfams -c=2 -source=NCBI < known.virulence.factors 2> no.matches | svr_figfams_to_ids > with.figfams

This produces two files.  The file "no.matches" contains a list of the lines from the input file ("known.virulence.factors") that have not yet been placed into FIGfams. The second file ("with.figfams") is a file in which two columns have been added: a FIGfam number and a PEG from that FIGfam.  Now, if we wished to add a function of the PEG, as well as a readable version of the genome it is from, we would use

svr_gene_data function genome-name < with.figfams > desired.table.txt

If the user needs to see the "aliases" for the PEGs that have been found, they might try

svr_aliases_of -c=9 < desired.table.txt > desired.table.aliases.txt

but be warned: we return all of the IDs of protein sequences that we know about that have the same protein sequence (from a potentially large collection of very similar genomes).

Finally, we should note that the initial virulence factors may have multiple entries leading to the same FIGfam (and, hence, to the same sets of PEGs).  This can lead to the situation  where multiple roles, while not identical, would have information on the same PEG.  To collapse the table to unique, you need to sort on the 9th field, assuming tab-delimited fields.

This can be done using

sort -k 9 -t $'\t' -u < desired.table.aliases.txt  > sorted.txt

Let us consider an somewhat more useful alternative.

We begin by just connecting the aliases in column 2 to PEGs in the SEED:

svr_aliases_to_pegs -c=2 -source=NCBI -protein=1 < known.virulence.factors 2> no.matches.1 > with.peg

This produces a file (no.matches.1) of lines that could not be connected to PEGs.


svr_ids_to_figfams < with.peg > with.ff 2>

can be used to add a FIGfam function and FIGfam ID to the end of each line

for which the identified PEG is in a FIGfam.  Those lines that do not have a PEG

get written to a file (

Now, it is worth noting that we may have many lines pointing at the same FIGfam (i.e.,

many lines with the same FIGfam ID in the last column, which is column 9).

By using

   sort -u -k 9 -t$'\t' with.ff > with.ff.sorted

we sort the file on the FIGfam ID, deleting all but one of each set that share

a common FIGfam ID.


svr_figfams_to_ids < with.ff.sorted > with.ff.peg

can be used to expand each input line to a set of lines, each one containing a 

distinct PEG from the FIGfam.  Using 

svr_gene_data function genome-name < with.ff.peg > complete

we add the PEG function and the genome-name to the end of each row.

Finally, we run

svr_aliases_of -c 10 < complete > complete.with.aliases

to add known aliases for each PEG as the last column in the table.