Curating a Model


Introduction
Step 1: Inspect the biomass reaction
Step 2: Inspect the auto-completion media and solution
Step 3: Examine and eliminate internal flux loops
Step 4: Inspect the pathway utilization
Step 5: Curation of electron transport chain reactions
Step 6: Inspect cofactor utilization in model reactions
Step 7: Curate reaction localization
Step 8: Inspect gene-protein-reaction associations
Step 9: Identify and fill in distinctive organism pathways
Step 10: Validate model against available experimental data
Final comments and suggestions

Introduction

The genome-scale metabolic models generated by the Model SEED are draft models that still require significant manual curation to approach the quality and accuracy of most published models. Here we describe in detail many of the manual curation steps required once a model has been generated by the Model SEED. We also explain how the tools built into the Model SEED may be used to facilitate some of these steps. First and foremost, we recommend a thorough reading of the published reconstruction protocol [1], as this protocol provides a very complete list of everything that must be done in the reconstruction of a genome-scale metabolic model. This curation tutorial is based in part on that protocol.

Step 1: Inspect the biomass reaction

One of the most critical components of a genome-scale metabolic model is the biomass reaction. This reaction determines which small molecules a model must synthesize in order to simulate growth, and as such it affects nearly every model prediction. We recommend starting with the biomass reaction when curating the SEED models. One of the first things to check in this reaction is whether the correct metabolites have been included as reactants. Of course DNA, RNA, and protein should always be present, as these components are universal in life. However, the essential lipids, cell wall components, and cofactors that should be included in the biomass reaction vary widely between organisms. Check first that the organism you’re modeling has been correctly classified as Gram-negative, Gram-positive, or neither. Then inspect each cofactor, lipid, and cell wall component to ensure that some evidence of a biosynthesis pathway or transport mechanism exists in your genome for each component. The KEGG maps and genomic data included in the Model SEED help a great deal in this process. Go to each compound in the compound tab of the Model SEED and check whether it has a KEGG map listed in the KEGG map column. If so, click on it and go to the KEGG map tab to see the pathways leading to the compound. Ensure that at least some portion of the pathway is colored and active (outlined in bold). These techniques are useful for removing incorrect biomass components.

To identify missing biomass components, one option is to compare the biomass reaction for your model with the biomass reactions for models of similar organisms, and see what the biomass of other organisms contains that yours doesn’t. Comparisons with the biomass reactions of published models are the most useful, as these are the most heavily curated and are often based on experimental data. To compare biomass reactions in the Model SEED, simply select your model and the model that includes the biomass reaction you want to compare against. Then go to the biomass tab in the model viewer, where both biomass reactions should appear side by side in a table of biomass components. Another option is to search the literature for information on essential cofactors, lipids, and cell wall components in your organism. Numerous experiments may also be performed to discover more about biomass composition. See the published reconstruction protocol to learn more about these techniques. One final method is to identify dead pathways in your model that culminate in a lipid, cell wall component, or cofactor. If an organism has a nearly complete biosynthesis pathway for one of these components, there’s a good chance the component belongs in the biomass reaction. Again, the pathway diagrams in the Model SEED are a great resource for this curation technique.
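The side-by-side comparison described above can also be reproduced offline once both biomass reactions are available as tables. The following is a minimal sketch, assuming each biomass reaction has been exported as a mapping from compound name to coefficient. The function name, compound names, and coefficient values are made up for illustration and do not come from the Model SEED.

```python
def compare_biomass(mine, reference):
    """Split biomass components into those unique to each model and those shared.

    Both arguments map compound names to biomass coefficients.
    """
    mine_only = sorted(set(mine) - set(reference))
    reference_only = sorted(set(reference) - set(mine))
    shared = sorted(set(mine) & set(reference))
    return mine_only, reference_only, shared


# Hypothetical compositions, for illustration only.
draft_biomass = {"DNA": 0.03, "RNA": 0.2, "protein": 0.55, "NAD": 0.002}
published_biomass = {
    "DNA": 0.03, "RNA": 0.2, "protein": 0.55, "NAD": 0.002,
    "heme": 0.0001, "peptidoglycan": 0.03,
}

mine_only, reference_only, shared = compare_biomass(draft_biomass, published_biomass)
print("Candidates to add to the draft biomass:", reference_only)
print("Draft components to re-check:", mine_only)
```

Components found only in the reference model are candidates to add, but each one still needs the pathway or transport evidence described in the previous paragraph before it is included.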

Once the list of reactants in the biomass reaction has been qualitatively curated as described above, it is time to quantitatively curate the coefficients for those reactants. These coefficients typically represent the number of moles of each component present in 1 gram of biomass; in the SEED models, they always do. Ideally, experiments can be performed to measure in detail the fraction of the biomass represented by each component. See the published protocol for details. Even without such data, however, coefficients can still be adjusted to dramatically improve the accuracy of predicted growth yields. Predicted growth yields depend largely on the coefficient for the energy portion of the biomass reaction, which is represented by ATP and H2O in the reactants and ADP, Pi, and H+ in the products. We recommend adjusting the coefficients on these terms while fixing the uptake fluxes for primary carbon and electron sources at measured values, until the predicted biomass production rate matches the measured value. Note that even this analysis requires measured doubling times and uptake rates. Until this curation has been done, these models cannot be used to accurately predict organism growth rates or yields.
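The coefficient adjustment described above can be framed as a one-dimensional search. The sketch below uses bisection and assumes that predicted growth never increases as the ATP coefficient increases. `toy_predict_growth` is a hypothetical stand-in for an FBA run with uptake fluxes fixed at measured values; its formula and all numbers are invented for illustration and are not taken from any SEED model.

```python
def fit_atp_coefficient(predict_growth, measured_growth, low, high, tol=1e-6):
    """Bisect on the ATP coefficient until predicted growth matches the measurement.

    Assumes predict_growth(coefficient) is non-increasing in the coefficient.
    """
    if not predict_growth(high) <= measured_growth <= predict_growth(low):
        raise ValueError("measured growth is not bracketed by [low, high]")
    while high - low > tol:
        mid = (low + high) / 2
        if predict_growth(mid) > measured_growth:
            low = mid   # model grows too fast: energy cost is too low
        else:
            high = mid  # model grows too slowly (or matches): lower the cost
    return (low + high) / 2


# Hypothetical stand-in for FBA: growth is limited either by carbon or by ATP.
ATP_SUPPLY = 40.0             # toy ATP production rate at the fixed uptake flux
CARBON_LIMITED_GROWTH = 1.0   # toy growth ceiling set by carbon uptake


def toy_predict_growth(atp_coefficient):
    return min(CARBON_LIMITED_GROWTH, ATP_SUPPLY / atp_coefficient)


fitted = fit_atp_coefficient(toy_predict_growth, measured_growth=0.8, low=1.0, high=200.0)
print(f"Fitted ATP coefficient: {fitted:.3f}")
```

In practice the stand-in would be replaced by a call to whichever FBA tool is in use, with the ATP, H2O, ADP, Pi, and H+ coefficients in the biomass reaction changed together.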

Step 2: Inspect the auto-completion media and solution

The reactions added by the auto-completion process depend a great deal on the media the organism is auto-completed in. By default, complete media is used where any compound that is transportable by the organism is allowed to be used as a nutrient. Note, for fastidious organisms (e.g. M. genitalium), this is not a bad assumption to make, and this approach is useful to determine the metabolites the organism likely needs to consume to survive. However, if you know one or more media conditions in which your organism grows, we recommend auto-completing your model in the simplest known culture condition instead. For example, E. coli is auto-completed in glucose minimal media. This ensures that as many essential biosynthesis pathways as possible are filled in during the auto-completion. Note, the auto-completion algorithm only works on a single media condition, but the gapfilling procedure described in the model optimization portion of the manuscript can be used for additional media conditions after the auto-completion process is complete. Thus it is useful to compile as large a list as possible of the culture conditions in which your organism grows.

If the model has been auto-completed in the desired media, the next step is to examine each reaction added by the auto-completion and to attempt to determine (using the KEGG maps and notes in the reaction tab) why each reaction was added. Often the reaction added may not be correct, but it will clearly point to the gap, error, or problem in the model that caused the reaction to be added. This is why inspection of the auto-completed reactions is the second step of the curation process. If a reaction added by the auto-completion is believed to be correct, then the SEED/RAST bioinformatics tools may be used to rapidly identify a gene candidate to associate with the reaction. Simply identify the functional roles associated with the reaction (listed in the functional roles column of the reaction table), identify a gene in another genome associated with the functional role, then BLAST that gene against your organism’s genome to identify a homolog. Finally, check for clustering with neighboring genes in the pathway. If a reaction added by the auto-completion appears to be incorrect, two options exist to fix the problem: (i) identify an alternative correct reaction that fulfills the same role played by the current reaction, or (ii) alter the model or auto-completion media such that the need for the reaction disappears and the reaction can be removed without disrupting growth. The correct action to take depends on understanding why the reaction was added and deciding whether the reason is genuine or an FBA artifact.

Step 3: Examine and eliminate internal flux loops

While we continually curate the database and algorithms in the Model SEED to prevent it from happening, models often utilize thermodynamically infeasible internal flux loops that enable them to produce ATP in unlimited quantities. An important curation step is to identify such loops and remove them by adjusting reaction directionalities or removing reactions. To find the loops, we recommend running a mixed-integer linear programming (MILP) optimization to identify the smallest set of reactions in the model that are required to produce ATP while allowing no mass to enter the cell. The model should be adjusted until no pathways can be found that satisfy this criterion. Next, we suggest repeating the study while allowing mass to enter the cell. If the energy pathway generated is reasonable and thermodynamically feasible, then this step is complete. We also suggest eliminating other internal mass-balanced loops in this curation step using a similar algorithm. The KEGG maps and reaction activity predictions available in the Model SEED are useful in this curation step. There is extensive discussion in the reconstruction protocol paper on the elimination of these loops.
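To make the idea of an ATP-producing loop concrete, the sketch below searches a toy network for the smallest set of internal reactions that can sustain flux through an ATP drain when no exchange reactions exist. It is a brute-force stand-in for the MILP described above, not the Model SEED algorithm: it only tries unit fluxes, it ignores water and protons, and every reaction name and stoichiometry is hypothetical.

```python
from itertools import combinations, product


def find_atp_loop(reactions, drain="ATPM", max_size=4):
    """Return the smallest set of reactions that balances one unit of ATP drain.

    reactions maps name -> (stoichiometry dict, reversible flag). No exchange
    reactions are included, so any balanced solution is an internal loop.
    Only fluxes of +1 (forward) or -1 (reverse) are tried.
    """
    others = sorted(name for name in reactions if name != drain)
    for size in range(1, max_size + 1):
        for subset in combinations(others, size):
            sign_choices = [(1, -1) if reactions[n][1] else (1,) for n in subset]
            for signs in product(*sign_choices):
                balance = dict(reactions[drain][0])  # drain fixed at flux +1
                for name, sign in zip(subset, signs):
                    for met, coeff in reactions[name][0].items():
                        balance[met] = balance.get(met, 0) + sign * coeff
                if all(value == 0 for value in balance.values()):
                    return dict(zip(subset, signs))
    return None


# Toy network (hypothetical): R1 and R2 together regenerate ATP from nothing.
model = {
    "ATPM": ({"atp": -1, "adp": 1, "pi": 1}, False),                    # ATP drain
    "R1": ({"a": -1, "adp": -1, "pi": -1, "b": 1, "atp": 1}, True),    # A + ADP + Pi <-> B + ATP
    "R2": ({"b": -1, "a": 1}, True),                                    # B <-> A
    "R3": ({"b": -1, "c": 1}, False),                                   # B -> C
}
loop = find_atp_loop(model)
print("Loop found:", loop)

# Candidate fix: restrict R2 to the direction A -> B only.
fixed = dict(model)
fixed["R2"] = ({"a": -1, "b": 1}, False)
loop_after_fix = find_atp_loop(fixed)
print("Loop after fix:", loop_after_fix)
```

The search is exponential in the number of reactions, so it is only suitable for small examples. A genome-scale model needs the MILP formulation or an equivalent solver-based approach.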

Step 4: Inspect the pathway utilization

Once the loops have been eliminated, the flux variability analysis (FVA) output from the Model SEED is extremely useful for checking that the metabolic pathways are being utilized by the model in a biologically relevant way. Simply examine the KEGG maps and reaction tables in the model viewer, focusing on the predicted reaction activity during growth in a variety of media conditions. High-priority targets are dead reactions that never carry any flux. If these reactions are located in isolated islands, disconnected from the main metabolic network by multiple reaction steps, it is likely that they should be removed. If they are part of a linear pathway culminating in the biosynthesis of a cofactor, lipid, or cell wall component, it’s likely that the end point of this pathway needs to be added to the biomass reaction for the model. Finally, if they are part of a relatively complete pathway with a small number of missing steps, this points to annotation gaps that can be filled either with the gapfilling optimization in the Model SEED or through the manual addition of the missing reaction steps. Another target is pathways that are functioning reversibly but should not. These pathways should be shut down by altering the reversibility constraints on key reactions. As the current reversibility constraints are based on thermodynamics, this should be done with caution so as not to shut down pathways that should be active. Finally, it’s important to examine pathways that include many gapfilled steps to ensure that there is sufficient biological evidence for the activity of these pathways. If not, the gapfilled steps should be removed and the auto-completion algorithm rerun with additional constraints.
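Some dead reactions can be found from network topology alone, without running FVA. The sketch below repeatedly removes reactions that touch a dead-end metabolite, meaning one that no remaining reaction can produce, or none can consume, or that only a single reaction touches. This is a simplified heuristic and finds only a subset of the reactions that FVA would report as blocked. The reaction names, the data layout, and the `boundary` set (metabolites that may enter or leave the system) are assumptions made for illustration.

```python
def find_blocked_reactions(reactions, boundary):
    """Return reactions that cannot carry flux because of dead-end metabolites.

    reactions maps name -> (stoichiometry dict, reversible flag).
    boundary is the set of metabolites allowed to enter or leave the system.
    """
    active = set(reactions)
    while True:
        producers, consumers = {}, {}
        for name in active:
            stoich, reversible = reactions[name]
            for met, coeff in stoich.items():
                if coeff > 0 or reversible:
                    producers.setdefault(met, set()).add(name)
                if coeff < 0 or reversible:
                    consumers.setdefault(met, set()).add(name)

        def is_dead_end(met):
            made = producers.get(met, set())
            used = consumers.get(met, set())
            # Needs a producer and a consumer that are different reactions.
            return not made or not used or len(made | used) < 2

        newly_blocked = {
            name for name in active
            if any(met not in boundary and is_dead_end(met)
                   for met in reactions[name][0])
        }
        if not newly_blocked:
            return set(reactions) - active
        active -= newly_blocked


# Toy network (hypothetical): R3 and R4 form a branch that ends at "y".
toy = {
    "R1": ({"glc_e": -1, "a": 1}, False),    # uptake-fed step
    "R2": ({"a": -1, "biomass": 1}, False),  # feeds biomass
    "R3": ({"a": -1, "x": 1}, False),        # branch toward a dead end
    "R4": ({"x": -1, "y": 1}, False),        # y is never consumed
}
blocked = find_blocked_reactions(toy, boundary={"glc_e", "biomass"})
print("Blocked reactions:", sorted(blocked))
```

In this toy case the branch ending at `y` is flagged. Whether such a branch should be deleted, gapfilled, or connected to the biomass reaction depends on the criteria given in the paragraph above.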

If data from gene expression or isotopomer tracing experiments is available for your organism, it can be useful in curating the pathway utilization of your model. In this case, ensure that the predicted reaction activities in your model match the activities observed in the gene expression or isotopomer experiments.

Step 5: Curation of electron transport chain reactions

Special attention should be paid to the reactions in the electron transport chain of the model. These pathways often differ markedly from one organism to another, and of all the aspects of the SEED models that require curation, this is probably the most critical. Use the literature to identify the specific compounds that can (and cannot) be used by your organism as an electron source, and make sure these compounds are being used as electron sources in your model. Also examine the pathways used in the model to derive electrons from your source compounds, and make sure they are consistent with literature reports and experimental data. Finally, examine the localization of the reactions participating in the electron transport pathways. In many cases, some steps of these pathways take place in the periplasm or extracellular compartments of the cell. Ensure that the compartments for the reactions in your model are correct. If you do move portions of a pathway to a new compartment, make sure the necessary transport reactions between compartments have been added to the model. Automated generation of electron transport chain pathways is still very much a challenge, so it’s likely these pathways will require at least some curation in your SEED models.

Step 6: Inspect cofactor utilization in model reactions

Organisms often utilize distinctive cofactors in their metabolic reactions. If this is not reflected in the annotation of the genes encoding the enzymes for these reactions, the cofactors used in the corresponding reactions in the SEED model are likely to be incorrect. When curating cofactor utilization, the annotation of each gene associated with a reaction should be carefully examined. If the annotation is generic with respect to the cofactor used, the literature should be mined to identify the correct cofactor utilized by the specific organism being modeled. If the correct cofactors can be identified, the reaction mapped to the gene in the SEED model should be switched (if necessary) to a form that utilizes them.

Step 7: Curate reaction localization

Currently models generated by the Model SEED include only two cellular compartments: intracellular ([c]) and extracellular ([e]). Additionally, no reactions are currently being placed entirely in the extracellular compartment. Only transport reactions involve the extracellular compartment as they specifically account for the movement of compounds between the intracellular and extracellular compartments. For this reason, SEED models must be curated to ensure that reactions have been localized to the correct cellular compartment. The reconstruction protocol publication includes recommendations on methods to identify the correct compartment for a metabolic reaction.

Step 8: Inspect gene-protein-reaction associations

While the gene-protein-reaction (GPR) associations generated by the Model SEED are based on well-curated annotations, errors still occasionally occur, particularly when the GPR gene complexes are very large. We recommend curating these associations with a focus on identifying the following types of errors: (i) too many genes assigned to a single reaction, indicating a possible over-annotation, (ii) too many reactions assigned to a single gene, indicating a possible overly generic annotation, (iii) genes improperly lumped into a complex, and (iv) genes missing from protein complexes. The reconstruction protocol provides more detail on methods for spotting and correcting these errors.
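Error types (i) and (ii) can be screened for with simple counts before any manual inspection. The sketch below assumes the associations are available as a mapping from reaction ID to a list of complexes, each complex being a list of gene IDs. The data layout, the IDs, and the thresholds are all hypothetical, and a flagged item is only a candidate for review, not a confirmed error.

```python
from collections import defaultdict


def flag_gpr_issues(gpr, max_genes_per_reaction=6, max_reactions_per_gene=4):
    """Flag reactions with many genes and genes mapped to many reactions.

    gpr maps reaction ID -> list of complexes, each a list of gene IDs.
    Thresholds are arbitrary defaults and should be tuned per organism.
    """
    reactions_per_gene = defaultdict(set)
    crowded_reactions = []
    for reaction, complexes in gpr.items():
        genes = {gene for cplx in complexes for gene in cplx}
        if len(genes) > max_genes_per_reaction:
            crowded_reactions.append(reaction)
        for gene in genes:
            reactions_per_gene[gene].add(reaction)
    promiscuous_genes = sorted(
        gene for gene, rxns in reactions_per_gene.items()
        if len(rxns) > max_reactions_per_gene
    )
    return sorted(crowded_reactions), promiscuous_genes


# Hypothetical associations, for illustration only.
toy_gpr = {
    "rxnA": [["g1", "g2"], ["g3"]],
    "rxnB": [["g1"]],
    "rxnC": [["g1"]],
    "rxnD": [["g4", "g5", "g6", "g7"]],
}
crowded, promiscuous = flag_gpr_issues(
    toy_gpr, max_genes_per_reaction=3, max_reactions_per_gene=2
)
print("Reactions to review for over-annotation:", crowded)
print("Genes to review for generic annotation:", promiscuous)
```

Error types (iii) and (iv) concern the composition of individual complexes and generally need literature or homology evidence rather than counting.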

Step 9: Identify and fill in distinctive organism pathways

In some cases, an organism will implement a biochemical pathway that is not currently handled by the SEED subsystems and reaction mappings used to build models in the Model SEED. Photosynthesis pathways in cyanobacteria are an excellent example of this. Such pathways must be manually added to the Model SEED models. They can typically be reconstructed from literature sources as described in the reconstruction protocol manuscript.

Step 10: Validate model against available experimental data

Experimental data on growth phenotypes and pathway utilization is absolutely essential to producing the most accurate and predictive model possible. While qualitative predictions of growth conditions can be made without data (at ~66-70% accuracy), highly accurate and quantitative predictions require the integration of experimental data. High-throughput phenotype studies like Biolog arrays and gene essentiality experiments are extremely useful for improving qualitative predictions during model development. Any known growth conditions for mutant and wildtype strains are also useful. As discussed in the Model SEED manuscript, the GrowMatch method may be used to fit models to all of these data types through the modification of the biomass reactions, the GPR associations, the reaction directionality constraints, and the specific reactions included in the metabolic model.
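Qualitative accuracy figures like the one quoted above come from comparing predicted and observed growth condition by condition. The following is a minimal sketch of that bookkeeping, assuming both data sets are available as mappings from condition name to a growth/no-growth flag. The conditions and outcomes are invented for illustration.

```python
def score_growth_predictions(predicted, observed):
    """Count agreement between predicted and observed growth phenotypes.

    Both arguments map a condition name to True (growth) or False (no growth).
    Only conditions present in the observed data are scored.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for condition, grew in observed.items():
        model_grew = predicted[condition]
        if model_grew and grew:
            counts["TP"] += 1
        elif not model_grew and not grew:
            counts["TN"] += 1
        elif model_grew and not grew:
            counts["FP"] += 1  # model grows where the organism does not
        else:
            counts["FN"] += 1  # model fails to grow where the organism does
    accuracy = (counts["TP"] + counts["TN"]) / len(observed)
    return counts, accuracy


# Hypothetical phenotype data, for illustration only.
observed = {"glucose": True, "acetate": True, "citrate": False,
            "lactose": False, "xylose": True}
predicted = {"glucose": True, "acetate": False, "citrate": False,
             "lactose": True, "xylose": True}

counts, accuracy = score_growth_predictions(predicted, observed)
print(counts, f"accuracy = {accuracy:.2f}")
```

Separating the two kinds of error is useful because they call for different fixes: false negatives usually point to gaps in the model, while false positives suggest reactions or directionalities that are too permissive.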

For quantitative prediction, some of the required data types are: (i) measured growth rates and nutrient uptake rates, needed to calculate ATP maintenance requirements and P/O ratios; (ii) detailed biomass composition measurements, needed to calculate exact values for coefficients in the biomass reaction; (iii) transcriptomics and isotopomer tracing experiments, needed to validate predicted intracellular pathway utilization; and (iv) defined culture conditions, needed to validate or correct predicted nutrient requirements and growth conditions.

Final comments and suggestions

The comparison and pathway visualization tools built into the Model SEED are very useful for curation and analysis of draft genome-scale metabolic models. We recommend making use of these resources to accelerate the curation of SEED models to the greatest extent possible prior to downloading them and using them offline. In the near future (by December 2010), the Model SEED site will include interfaces and mechanisms enabling users to modify and curate SEED models from within the website. Users will then be able to rerun all associated algorithms on the modified model, including auto-completion, FBA, FVA, gene essentiality predictions, and GrowMatch. Until these interfaces are available, however, users will need to download and analyze models offline.

Note that every curation step outlined here is covered in greater detail, with links to literature and available tools, in the published reconstruction protocol. If any step in this description seems unclear, we recommend referring to the equivalent steps in the published protocol.
