Formulating Conjectures: Using the Browser and Atomic Regulons

 

By Ross Overbeek,  March 26, 2011

 

We think of an atomic regulon as a set of genes that are tightly regulated as a unit; that is, they are all expressed at the same points in time.  This is obviously a gross simplification, and maybe even an oversimplification.  Still, we think that it is a useful concept, and we employ it regularly in our attempts to make sense of the genomic data and its integration with expression data.
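
To make the simplification concrete, here is a minimal sketch in Python.  It assumes each gene has already been reduced to an on/off call per experiment and simply groups genes whose calls coincide everywhere.  The gene names and calls are invented for illustration, and this is not the procedure the SEED uses to compute atomic regulons.

```python
from collections import defaultdict

def group_by_expression_pattern(on_off_calls):
    """Group genes whose on/off calls are identical in every experiment."""
    groups = defaultdict(list)
    for gene, calls in on_off_calls.items():
        groups[tuple(calls)].append(gene)
    # A pattern seen in only one gene does not form a group.
    return [sorted(genes) for genes in groups.values() if len(genes) > 1]

# Hypothetical on/off calls (1 = expressed) across four experiments.
calls = {
    "geneA": [1, 0, 1, 1],
    "geneB": [1, 0, 1, 1],
    "geneC": [0, 1, 0, 0],
    "geneD": [1, 0, 1, 1],
    "geneE": [0, 1, 1, 0],
}

regulons = group_by_expression_pattern(calls)
print(regulons)
```

Real expression data is noisy, so requiring identical calls is far too strict in practice; the sketch only illustrates the idea of "expressed at the same points in time".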

 

If you are looking at a gene in the browser, you may see a little link indicating that the gene is in something we call an atomic regulon.  Look at fig|300852.3.peg.1726 as an example.  The line

 

            atomic regulon membership Atomic regulon 6 of size 13 in 300852.3

 

should appear, and if you click on the link, it should take you to the page for that atomic regulon.  This display has two basic tables.  The first shows Pearson correlation coefficients between genes in the atomic regulon (based on expression data that has been loaded into the SEED).  The second table describes the genes and their current function assignments.  At the time of this writing, there appear to be 10-11 genes for which the functional role is at least partially understood and 2-3 for which it is not yet clear.  We suggest looking at "The SoxYZ Complex Carries Sulfur Cycle Intermediates on a Peptide Swinging Arm" or "Structural Basis for the Oxidation of Protein-bound Sulfur by the Sulfur Cycle Molybdohemo-Enzyme Sulfane" to get some insight into what is known about the genes in this cluster.  It is a critical cluster that includes a set of genes catalyzing a key reaction in the global sulfur cycle.  The annotations could almost certainly be improved by a domain expert.
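
For readers who want to see what the first table contains, the following sketch computes pairwise Pearson correlation coefficients from expression profiles.  The gene names and expression values are hypothetical, and the function names are mine; how the SEED stores and normalizes its expression data is not covered here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_table(profiles):
    """Return {(gene1, gene2): r} for every ordered pair of genes."""
    genes = sorted(profiles)
    return {(g, h): pearson(profiles[g], profiles[h])
            for g in genes for h in genes}

# Hypothetical expression levels in four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

table = correlation_table(profiles)
for (g, h), r in sorted(table.items()):
    print(g, h, round(r, 3))
```

Genes that rise and fall together score near 1, and genes that move in opposite directions score near -1.  In an atomic regulon one expects the off-diagonal entries to be high.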

 

Now, let me point you at another interesting atomic regulon.  Go to the PEG page for fig|300852.3.peg.2046.  Note that this is a hypothetical protein.  If you go to the atomic regulon page containing it, the situation becomes much clearer.  The hypothetical protein probably relates to cobalamin (coenzyme B12) synthesis.

 

Or again, look at the PEG page for fig|300852.3.peg.1074.  The encoded protein is annotated as NADH-FMN oxidoreductase.  When you look at the atomic regulon, things begin to get clearer: this gene (and a second gene annotated as a hypothetical protein) both participate in an aromatic amino acid degradation pathway.  Again, more could almost certainly be said by a domain expert.
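
The identifiers used in these examples all have the same shape: the prefix fig|, a genome ID (a taxonomy number and a version), a feature type such as peg, and a feature number.  If you want to handle them in a script, a small parser along these lines may help.  The function name and the returned dictionary are my own choices, not part of any SEED API.

```python
import re

# fig|<taxon>.<version>.<feature type>.<number>
FIG_ID = re.compile(r"^fig\|(\d+\.\d+)\.([A-Za-z]+)\.(\d+)$")

def parse_fig_id(fig_id):
    """Split a FIG feature ID into genome, feature type, and number."""
    match = FIG_ID.match(fig_id)
    if match is None:
        raise ValueError("not a FIG feature ID: %r" % fig_id)
    genome, feature_type, number = match.groups()
    return {"genome": genome, "type": feature_type, "number": int(number)}

parsed = parse_fig_id("fig|300852.3.peg.1074")
print(parsed)
```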

 

The ability to peruse these atomic regulons looking for such interesting cases is provided by the Navigate option on the toolbar.  Pick atomic regulons, then a genome with expression data, and simply think about what is being displayed.

 

One of my earlier efforts to bring this wealth of data to the attention of some of my colleagues is displayed in http://www.theseed.org/TheBook/HTML/atomic_regulons.html

 

That document was produced in October, 2010, and the exact atomic regulons have changed (due to more available data).  It also reflects a tool that was never released within the SEED Project and existed only as a prototype.  However, my mindset (exuberance at this wealth of data) still exists, and you might find sections interesting.

 

Formulating Conjectures:

Using the Browser and Atomic Regulons

Part 2

 

by Ross Overbeek,  March 26, 2011

 

This should be thought of as a continuation of my previous short note giving examples of genes in which the notion of atomic regulon, along with expression data, supported the formulation of problems/hypotheses relating to the functions of gene products.

 

1.  fig|300852.3.peg.2216: an Example Relating to CRISPRs

 

Try looking at peg.2216 in the SEED browser.  If you expand the "compare regions" to include a few more genomes (you might make it a bit wider, too), you see a cluster of 5 genes that occurs in several distinct genomes (note that we have duplicate versions of several of these genomes).  If you look at the atomic regulon containing the gene, it looks a little bleak.  There is only a regulatory protein, what appears to be a CRISPR-related protein, and 3 hypotheticals.  What do you make of that?  To begin, note that some of the genes in Cyanothece sp. PCC 8802 are clear remnants of a CRISPR event.

 

Next, I recommend using psi-blast (see the link above) to check that peg.2218 is indeed likely to be CRISPR-related.  Once you have done that, work through the remaining genes using psi-blast.  What you will find is that peg.2217 is also CRISPR-associated.  After a few iterations, you will also see a very distant relationship between peg.2215 and a protein annotated as CRISPR-associated.

 

I think that it is likely that one could put together a coherent picture of the genes in this atomic regulon, and it would center on CRISPRs.  These types of examples - those in which the cluster of genes is very local to just a few genomes - are less interesting to most of our annotators.  However, it is clear that at least part of the story can be revealed from just the 3-4 different genomes that are relevant.

2. fig|224911.1.peg.1749 in Bradyrhizobium japonicum USDA 110

 

Consider fig|224911.1.peg.1749.  Now go to the atomic regulon containing it.  It seems completely clear that this atomic regulon relates to nitrogen fixation in Bradyrhizobium and that at least two hypothetical proteins can be directly related to that process.

 

3. fig|224911.1.peg.1443 in Bradyrhizobium japonicum USDA 110

 

Consider fig|224911.1.peg.1443, and go to the atomic regulon containing it.  It seems to me that the evidence supports the hypothesis that this gene relates to a secretion system that is spread over two loci.  When I showed it to one of our annotators, she commented:

 

This family is imbedded in// associated with // this adhesion cluster only in 2 genomes out of many that contain it - only in Bradyrhizobium - a plant symbiont (this can be seen by repositioning yourself on any other gene of the cluster and by expanding it to 35 genomes or so).  Hence, it is probably not the part of core machinery. Interestingly, annotation of this family in Pfam associates it with plant cell surface wax - exactly what they need to adhere to ...

 

"PF04116:  This superfamily includes fatty acid and carotene hydroxylases and sterol desaturases. Beta-carotene hydroxylase is involved in zeaxanthin synthesis by hydroxylating beta-carotene, but the enzyme may be involved in other pathways [1]. This family includes C-5 sterol desaturase and C-4 sterol methyl oxidase. Members of this family are involved in cholesterol biosynthesis and biosynthesis a plant cuticular wax..."

 

4. fig|211586.9.peg.2693 in Shewanella oneidensis MR-1

 

fig|211586.9.peg.2693 encodes what seems to be a completely uninterpretable protein, at least until you look at the associated atomic regulon.  Once you do, the clues point directly at a prophage.  In fact, it is very common to run into a sequence of hypotheticals that appear to be quite unlike most proteins in the existing databanks, and later find via detailed analysis that you have a prophage.  The speed with which phages evolve, and their abundance in the environment, lead to many, many hypotheticals that are difficult to pin down.

 

5. fig|211586.9.peg.2892 in Shewanella oneidensis MR-1

 

Look at fig|211586.9.peg.2892, and then look at the associated atomic regulon.  It seems clear that peg.2891 and peg.2892 both relate to chemotaxis and motility.  When I showed this to an annotator, the response was:

 

This family ONLY occurs in Shewanella - too narrow...   Not even represented in Pfams.   Wouldn't it be better to select widely distributed hypotheticals for examples? They are of more interest and also - have many more potential clues associated with them - for further digging into possible function

 

She is right in the sense that machinery that is very local is harder to pin down.  Yet, it does seem to me that we could reasonably annotate the gene now as "related to flagellar motility", and it might be worth pondering how one might substantiate that claim.

 

Note that you have a similar situation with fig|208964.1.peg.4298 in which you can conjecture a role in pilus assembly, even though the gene is fairly local (in this case to the Pseudomonads).

6. fig|211586.9.peg.4166 in Shewanella oneidensis MR-1

 

Look at fig|211586.9.peg.4166.  If you just do psi-blast, you find that the protein is very poorly characterized:

 

DUF1234: Alpha/Beta hydrolase family of unknown function (DUF1234).  The Crystal Structure Of The Yden Gene Product from B. Subtilis has been solved. The structure shows an alpha-beta hydrolase fold suggesting an enzymatic function for these proteins.

 

That seems like fairly minimal information.  However, when you look at the atomic regulon, the picture starts to fill in.  We still think that the clustering on the chromosome is probably the richest source of insight, and in this case, if you expand the compare regions to many genomes, it is worth looking at the rearrangement shown in Acinetobacter,  and a second (less compelling) rearrangement in one of the Burkholderia genomes.  The conjecture would be that the protein plays a role in inorganic sulfur assimilation.

 

7. fig|100226.1.peg.4182 in Streptomyces coelicolor A3(2)

 

The protein encoded by this gene is annotated as "Putative uncharacterized protein" by UniProt and as "2SCD46.40c, unknown, len: 82aa" by the SEED (really an awful estimate, but it does convey a certain lack of clues).  Anyway, looking at the psi-blast output suggests that it is a "guanylate cyclase protein".  Looking at the atomic regulon suggests a role in phosphate metabolism.

 

I do not claim that we can pin this down easily, but I do believe that the suggestions made (in this case) by psi-blast and atomic regulons are useful.  The atomic regulons were computed from a fairly small set of experiments, so you cannot take them too seriously.
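
One way to see why a small set of experiments calls for caution is to look at how uncertain a correlation coefficient is when few samples are available.  The sketch below uses the standard Fisher z-transform to put an approximate 95% confidence interval around an observed Pearson coefficient.  The value 0.8 and the sample sizes are invented for illustration, and this is not how the atomic regulons themselves were computed or evaluated.

```python
import math

def pearson_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r observed in n samples.

    Uses the Fisher z-transform; z_crit=1.96 gives roughly 95% coverage.
    Requires -1 < r < 1 and n > 3.
    """
    if n <= 3:
        raise ValueError("need more than 3 samples")
    z = math.atanh(r)
    standard_error = 1.0 / math.sqrt(n - 3)
    low = math.tanh(z - z_crit * standard_error)
    high = math.tanh(z + z_crit * standard_error)
    return low, high

# The same observed correlation under different (hypothetical) sample sizes.
for n in (6, 20, 100):
    low, high = pearson_confidence_interval(0.8, n)
    print(n, round(low, 2), round(high, 2))
```

With only a handful of experiments the interval is very wide, so two genes that appear tightly correlated may not be.  The interval narrows as more expression data is loaded.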