Conversion of Gene and Protein IDs

In example1 we illustrate some basic capabilities that relate to determining the set of IDs attached to specific protein sequences. The program accepts a protein ID as input. The ID may be one of several that are maintained by the SEED, UniProt, RefSeq, KEGG and other groups. The program first accesses all IDs attached to identical protein sequences. This can be a fairly large set in cases in which many very similar genomes have been sequenced.



The program determines the set of IDs associated with proteins whose sequence is identical to that of the input protein. We call these sequence-equivalent proteins. The program displays the associated protein sequence and then computes a table describing these proteins. There will be at least one row for each of the computed IDs, and several rows for an ID if multiple groups have used that ID and attached functional assignments to it. The columns in the table are as follows:

  1. The first column is the ID of the protein.
  2. The second is the genus, species and strain of the associated genome (which we sometimes call "the scientific name of the genome").
  3. The third column contains 1 if we believe that the ID corresponds precisely to the gene associated with the input ID. In this case, we speak of the two IDs as "precisely equivalent".
     Otherwise, if we only know that this ID is for a protein sequence identical to the sequence designated by the input ID, it contains 0. In this case, we say "the IDs are sequence-equivalent, but not necessarily precisely equivalent."
  4. The fourth column gives the functional assignment associated with the protein ID.
  5. The fifth column gives the source of the assignment.
  6. The sixth column contains 1 if we believe that this assertion was made by an expert. Otherwise, it contains 0.
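
The column layout above can be sketched without a server connection. The following is a minimal illustration, not the example1 program itself: the two lists are made-up stand-ins for server responses, and the IDs, functions, sources and the helper name build_rows are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for server responses (all values are made up).
# Precise tuples:  [ID, function, source, expert]
# Sequence tuples: [ID, function, source, expert, genome name]
my @precise_equivalents = (
    ['ex|A1', 'example function', 'SourceA', 1],
);
my @sequence_equivalents = (
    ['ex|A1', 'example function', 'SourceA', 1, 'Genus species strain 1'],
    ['ex|B7', 'example function (putative)', 'SourceB', 0, undef],
);

# Turn the two lists into six-column, tab-separated rows.
sub build_rows {
    my ($precise, $sequence) = @_;

    # IDs that are precisely equivalent to the input ID.
    my %is_precise = map { $_->[0] => 1 } @$precise;

    my @rows;
    for my $tuple (@$sequence) {
        my ($id, $function, $source, $expert, $genome) = @$tuple;
        $genome = '' unless defined $genome;          # column 2 may be empty
        my $precise_flag = $is_precise{$id} ? 1 : 0;  # column 3
        push @rows, join("\t", $id, $genome, $precise_flag,
                         $function, $source, $expert);
    }
    return @rows;
}

print "$_\n" for build_rows(\@precise_equivalents, \@sequence_equivalents);
```

Each printed line carries the six columns in the order listed above. An ID that appears only in the sequence-equivalent list gets 0 in column 3.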

The need to map IDs between groups and compare asserted functions of proteins is quite basic. It would be straightforward for any group to write a short CGI script using the capabilities illustrated in example1 that supported connecting protein IDs to literature (via PubMed), to structure data (when present), and so forth. Every annotation team needs this class of functionality.

Example 1 Discussion

With the server packages installed, the code in example1 can be run as follows:
> perl server_paper_example1.pl "fig|83333.1.peg.145"
The routine does its work in two parts. First, we use the SAP server to build a hash of the identifiers of precisely equivalent genes. This hash is used later to determine the contents of column 3.
    # Build a lookup of the IDs that are precisely equivalent to the input ID.
    my %preciseHash;
    my $precise_assertions_list = $sapObject->equiv_precise(-ids => $id);
    # Fall back to an empty list if the server returned nothing for this ID.
    $precise_assertions_list = $precise_assertions_list->{$id} || [];
    for my $precise_assertion (@$precise_assertions_list) {
        my ($newID, $function, $source, $expert) = @$precise_assertion;
        $preciseHash{$newID} = 1;
    }
Second, we again use the SAP server to get all sequence-equivalent IDs and produce the output table shown below (truncated for this example).
    # Fetch every sequence-equivalent ID and print one table row per assertion.
    my $assertions = $sapObject->equiv_sequence(-ids => $id);
    # Fall back to an empty list if the server returned nothing for this ID.
    $assertions = $assertions->{$id} || [];
    if (@$assertions < 1) {
        print STDERR "No results found.\n";
    } else {
        for my $assertion (@$assertions) {
            my ($newID, $function, $source, $expert, $genomeName) = @$assertion;
            $genomeName = '' if ! defined $genomeName;
            # Column 3 is 1 only for IDs that equiv_precise also returned.
            my $column3 = ($preciseHash{$newID} ? 1 : 0);
            print join("\t", $newID, $genomeName, $column3, $function, $source,
                       $expert) . "\n";
        }
    }

Output Table

cmr|NT01SF0150		0	dnaK suppressor protein VC0596 [imported]	CMR	0
cmr|NT02EC0154		0	dnaK suppressor protein VC0596 [imported]	CMR	0
cmr|NT02SB0136		0	RNA polymerase-binding protein DksA	CMR	0
cmr|NT02SD0166		0	RNA polymerase-binding protein DksA	CMR	0
cmr|NT02SF0144		0	dnaK suppressor protein VC0596 [imported]	CMR	0
cmr|NT03EC0181		0	DksA homolog	CMR	0
cmr|NT04EC0178		0	dnaK suppressor protein VC0596 [imported]	CMR	0
cmr|NT04SF0143		0	RNA polymerase-binding protein DksA	CMR	0
cmr|NT04SS0162		0	RNA polymerase-binding protein DksA	CMR	0
cmr|NT10EC0149		0	RNA polymerase-binding protein DksA	CMR	0
.
.
.
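
Because the program joins its six columns with tabs, its output is easy to post-process, for example to compare the functions asserted for the same sequence. The sketch below is one possible approach and assumes the rows are tab-separated exactly as printed by the code above. The helper names parse_row and tally_functions are hypothetical, and the three sample rows are copied from the table.

```perl
use strict;
use warnings;

# Split one tab-separated row into named fields, in the column order
# described earlier. The -1 limit keeps empty fields, such as a missing
# genome name, instead of silently dropping trailing ones.
sub parse_row {
    my ($line) = @_;
    chomp $line;
    my %row;
    @row{qw(id genome precise function source expert)} = split /\t/, $line, -1;
    return \%row;
}

# Count how many rows carry each functional assignment.
sub tally_functions {
    my (@lines) = @_;
    my %count;
    $count{ parse_row($_)->{function} }++ for @lines;
    return \%count;
}

# Three rows taken from the output table (genome name column is empty).
my @lines = (
    "cmr|NT01SF0150\t\t0\tdnaK suppressor protein VC0596 [imported]\tCMR\t0",
    "cmr|NT02SB0136\t\t0\tRNA polymerase-binding protein DksA\tCMR\t0",
    "cmr|NT03EC0181\t\t0\tDksA homolog\tCMR\t0",
);

my $count = tally_functions(@lines);
printf "%d\t%s\n", $count->{$_}, $_ for sort keys %$count;
```

In practice the rows would be read from the program's standard output rather than from a literal list. A tally of this kind makes it easy to see where groups disagree on the function of one and the same sequence.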