The program determines the set of IDs associated with proteins whose sequence is identical to that of the protein designated by the input ID. We call these sequence-equivalent proteins. The program displays the associated protein sequence and then computes a table describing these proteins. There will be at least one row for each of the computed IDs, and several rows for an ID if multiple groups have used that ID and attached functional assignments to it. The columns in the table are as follows:
- The first column is the ID of the protein.
- The second is the genus, species and strain of the associated genome (which we sometimes call "the scientific name of the genome").
- The third column indicates precise equivalence. If we believe that the ID corresponds precisely to the gene associated with the input ID, this field will contain a 1. In this case, we speak of the two IDs as "precisely equivalent". Otherwise, if we only know that this ID is for a protein sequence identical to the sequence designated by the input ID, it will contain 0. In this case, we say "the IDs are sequence-equivalent, but not necessarily precisely equivalent."
- The fourth column gives the functional assignment associated with the protein ID.
- The fifth column gives the source of the assignment.
- The sixth column indicates whether the assignment is an expert assertion. If we believe that this assertion was made by an expert, this field will contain 1. Otherwise, it will contain 0.
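To make the column layout concrete, the following minimal sketch splits one tab-separated row of the table into named fields. The key names (`id`, `genome`, `precise`, `function`, `source`, `expert`) and the helper `parse_row` are our own hypothetical labels for the six columns described above; they are not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical labels for the six columns described above.
my @COLUMNS = qw(id genome precise function source expert);

# Split one tab-separated output row into a hash keyed by column label.
sub parse_row {
    my ($line) = @_;
    chomp $line;
    # The -1 limit keeps empty fields, e.g. a missing genome name.
    my @fields = split /\t/, $line, -1;
    die "expected 6 columns, got " . scalar(@fields) . "\n"
        unless @fields == @COLUMNS;
    my %row;
    @row{@COLUMNS} = @fields;
    return \%row;
}

my $row = parse_row("cmr|NT03EC0181\t\t0\tDksA homolog\tCMR\t0\n");
print "$row->{id} is precisely equivalent to the input ID\n" if $row->{precise};
print "$row->{id}: $row->{function} ($row->{source})\n";
```

Because the third and sixth columns hold 0 or 1, they can be tested directly as Perl booleans, as the `if $row->{precise}` line does.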
The need to map IDs between groups and compare asserted functions of proteins is quite basic. It would be straightforward for any group to write a short CGI script using the capabilities illustrated in example1 that supported connecting protein IDs to literature (via PubMed), to structure data (when present), and so forth. Every annotation team needs this class of functionality.
Example 1 Discussion
With the server packages installed, the code in example1 can be run as follows:

    > perl server_paper_example1.pl "fig|83333.1.peg.145"

The work of this routine is in two parts. Here, we use the SAP server to build a hash of the identifiers of precisely equivalent genes, which is used later on to determine the contents of column 3.

    my %preciseHash;
    my $precise_assertions_list = $sapObject->equiv_precise(-ids => $id);
    $precise_assertions_list = $precise_assertions_list->{$id};
    if (@$precise_assertions_list > 0) {
        for my $precise_assertion (@$precise_assertions_list) {
            my ($newID, $function, $source, $expert) = @$precise_assertion;
            # Remember every ID that is precisely equivalent to the input ID.
            $preciseHash{$newID} = 1;
        }
    }

Here, we again use the SAP server to get all sequence-equivalent IDs and produce the output table shown below (truncated for this example).

    my $assertions = $sapObject->equiv_sequence(-ids => $id);
    $assertions = $assertions->{$id};
    if (@$assertions < 1) {
        print STDERR "No results found.\n";
    } else {
        for my $assertion (@$assertions) {
            my ($newID, $function, $source, $expert, $genomeName) = @$assertion;
            $genomeName = '' if ! defined $genomeName;
            # Column 3 is 1 only for IDs recorded as precisely equivalent.
            my $column3 = ($preciseHash{$newID} ? 1 : 0);
            print join("\t", $newID, $genomeName, $column3, $function, $source,
                       $expert) . "\n";
        }
    }
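The two-part logic can also be exercised without a server connection. The sketch below substitutes in-memory lists for the results of `equiv_precise` and `equiv_sequence`; it assumes only the tuple layouts used in the code above. The helper name `build_table` and all sample values are made-up placeholders, not real server output.

```perl
use strict;
use warnings;

# build_table is a hypothetical helper. $precise and $sequence stand in for
# the per-ID lists that the code above obtains from the SAP server.
sub build_table {
    my ($precise, $sequence) = @_;

    # Part 1: remember every precisely equivalent ID.
    my %preciseHash = map { $_->[0] => 1 } @$precise;

    # Part 2: one tab-separated row per sequence-equivalent assertion.
    my @rows;
    for my $assertion (@$sequence) {
        my ($newID, $function, $source, $expert, $genomeName) = @$assertion;
        $genomeName = '' if !defined $genomeName;
        my $column3 = $preciseHash{$newID} ? 1 : 0;
        push @rows, join("\t", $newID, $genomeName, $column3,
                         $function, $source, $expert);
    }
    return @rows;
}

# Placeholder data (assumed shapes, invented values).
my @precise  = (['idA', 'function X', 'sourceA', 1]);
my @sequence = (
    ['idA', 'function X', 'sourceA', 1, 'Genus species strain'],
    ['idB', 'function Y', 'sourceB', 0, undef],
);
print "$_\n" for build_table(\@precise, \@sequence);
```

Separating the table construction from the server calls in this way keeps the column-3 rule easy to check in isolation.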
Output Table
(Column labels are added here for readability; the genome column is empty in these rows.)

ID              Genome  Precise  Function                                   Source  Expert
cmr|NT01SF0150          0        dnaK suppressor protein VC0596 [imported]  CMR     0
cmr|NT02EC0154          0        dnaK suppressor protein VC0596 [imported]  CMR     0
cmr|NT02SB0136          0        RNA polymerase-binding protein DksA        CMR     0
cmr|NT02SD0166          0        RNA polymerase-binding protein DksA        CMR     0
cmr|NT02SF0144          0        dnaK suppressor protein VC0596 [imported]  CMR     0
cmr|NT03EC0181          0        DksA homolog                               CMR     0
cmr|NT04EC0178          0        dnaK suppressor protein VC0596 [imported]  CMR     0
cmr|NT04SF0143          0        RNA polymerase-binding protein DksA        CMR     0
cmr|NT04SS0162          0        RNA polymerase-binding protein DksA        CMR     0
cmr|NT10EC0149          0        RNA polymerase-binding protein DksA        CMR     0
.
.
.
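Since the purpose of the table is to compare the functions asserted for sequence-equivalent proteins, a natural next step is to tally the distinct functional assignments. The sketch below does this for the ten rows listed above; the helper `tally_functions` is a hypothetical name, and the input is reduced to ID and function pairs copied from those rows.

```perl
use strict;
use warnings;

# Count how many IDs carry each functional assignment.
# Each argument is an array reference of the form [id, function].
sub tally_functions {
    my (@rows) = @_;
    my %count;
    $count{ $_->[1] }++ for @rows;
    return \%count;
}

my @rows = (
    ['cmr|NT01SF0150', 'dnaK suppressor protein VC0596 [imported]'],
    ['cmr|NT02EC0154', 'dnaK suppressor protein VC0596 [imported]'],
    ['cmr|NT02SB0136', 'RNA polymerase-binding protein DksA'],
    ['cmr|NT02SD0166', 'RNA polymerase-binding protein DksA'],
    ['cmr|NT02SF0144', 'dnaK suppressor protein VC0596 [imported]'],
    ['cmr|NT03EC0181', 'DksA homolog'],
    ['cmr|NT04EC0178', 'dnaK suppressor protein VC0596 [imported]'],
    ['cmr|NT04SF0143', 'RNA polymerase-binding protein DksA'],
    ['cmr|NT04SS0162', 'RNA polymerase-binding protein DksA'],
    ['cmr|NT10EC0149', 'RNA polymerase-binding protein DksA'],
);

my $count = tally_functions(@rows);
# Most frequent assignment first; ties are broken alphabetically.
for my $function (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    printf "%d\t%s\n", $count->{$function}, $function;
}
```

For these ten rows the tally shows three distinct assignments for the same protein sequence, which illustrates why comparing asserted functions across groups is useful.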