Updated tutorial text
alatham13 committed Nov 20, 2024
1 parent 200d142 commit 02e4e2a
Showing 7 changed files with 90 additions and 63 deletions.
Binary file modified doc/images/Overview.png
6 changes: 3 additions & 3 deletions doc/mainpage.md
@@ -10,7 +10,7 @@ Biomolecules are constantly in motion; therefore, a complete depiction of their
Our approach applies a composite workflow, consisting of three modeling problems to compute (i) heterogeneity models, (ii) snapshot models, and (iii) trajectory models.
Heterogeneity models describe the possible biomolecular compositions of the system at each time point. Optionally, other auxiliary variables can be considered, such as the coarse location in the final state when modeling an assembly process.
For each heterogeneity model, one snapshot model is produced. A snapshot model is a set of alternative standard static integrative structure models based on the information available for the corresponding time point.
- Then, trajectory models are created by connecting alternative snapshot models at adjacent time points. These trajectories can be scored based on both the scores of static structures and the transitions between them, allowing for the creation of trajectories that are in agreement with the input information by construction.
+ Then, trajectory models are created by connecting alternative snapshot models at adjacent time points. These trajectory models can be scored based on both the scores of static structures and the transitions between them, allowing for the creation of trajectories that are in agreement with the input information by construction.

If you use this tutorial or its accompanying method, please cite the corresponding publications:

@@ -19,7 +19,7 @@ If you use this tutorial or its accompanying method, please cite the correspondi

# Integrative spatiotemporal modeling workflow {#steps}

- In general, integrative modeling proceeds through three steps (i. gathering information; ii. choosing the model representation, scoring alternative models, and searching for good scoring models; and iii. assessing the models). In integrative spatiotemporal modeling, these three steps are repeated for each modeling problem in the composite workflow (i. modeling of heterogeneity, ii. modeling of snapshots, and iii. modeling of trajectories).
+ In general, integrative modeling proceeds through three steps (i. gathering information; ii. choosing the model representation, scoring alternative models, and searching for good scoring models; and iii. assessing the models). In integrative spatiotemporal modeling, these three steps are repeated for each modeling problem in the composite workflow (i. modeling of heterogeneity, ii. modeling of snapshots, and iii. modeling of a trajectory).

\image html Overview.png width=600px

@@ -29,7 +29,7 @@ This tutorial will walk you through the links below, which contain a breakdown o

- [modeling of snapshots] (@ref snapshots)

- - [modeling of trajectories] (@ref trajectories)
+ - [modeling of a trajectory] (@ref trajectories)

To work through this example, a variety of python packages will be necessary in addition to [IMP](https://integrativemodeling.org/). These packages are [numpy](https://numpy.org/), [os](https://docs.python.org/3/library/os.html), [warnings](https://docs.python.org/3/library/warnings.html), [sys](https://docs.python.org/3/library/sys.html), [itertools](https://docs.python.org/3/library/itertools.html), [pandas](https://pandas.pydata.org/), [matplotlib](https://matplotlib.org/), [pyRMSD](https://pypi.org/project/pyRMSD/), and [graphviz](https://graphviz.org/). Optionally, [UCSF Chimera](https://www.rbvi.ucsf.edu/chimera/), [UCSF ChimeraX](https://www.cgl.ucsf.edu/chimerax/), [gnuplot](http://www.gnuplot.info/), and [MATLAB](https://www.mathworks.com/products/matlab.html) can be used for visualizing structures or plotting data.

26 changes: 13 additions & 13 deletions doc/snapshot.md
@@ -13,7 +13,7 @@ The heterogeneity models inform protein copy numbers for the snapshot models. Th

# Snapshot modeling step 2: representation, scoring function, and search process {#snapshots2}

- Next, we represent, score, and search for snapshot models. To do so, navigate to the `Snapshots/Snapshots_Modeling/` folder. Here, you will find two python scripts. The first, `static_snapshot.py`, uses IMP to represent, score, and search for models of a single static snapshot. The second, `start_sim.py`, automates the creation of a snapshot model for each heterogeneity model.
+ Next, we represent, score, and search for snapshot models. To do so, navigate to the `Snapshots/Snapshots_Modeling/` folder. Here, you will find two python scripts. The first, `static_snapshot.py`, uses IMP to represent, score, and search for a single static snapshot model. The second, `start_sim.py`, automates the creation of a snapshot model for each heterogeneity model.

## Modeling one snapshot

@@ -27,7 +27,7 @@ For our model of a protein complex, we use a combination of two representations.

Beads and Gaussians in our model belong to either a *rigid body* or *flexible string*. The positions of all beads and Gaussians in a single rigid body are constrained during sampling and do not move relative to each other. Meanwhile, flexible beads can move freely during sampling, but are restrained by sequence connectivity.

- To begin, we built a topology file with the representation for the model of the complete system, `spatiotemporal_topology.txt`, located in `Heterogeneity/Heterogeneity_Modeling/`. This complete topology was used as a template to build topologies of each snapshot. Based on our observation of the structure of the complex, we chose to represent each protein with at least 2 separate rigid bodies, and left the first 28 residues of protein C as flexible beads. Rigid bodies were described with 1 bead for every residue, and 10 residues per Gaussian. Flexible beads were described with 1 bead for every residue and 1 residue per Gaussian. A more complete description of the options available in topology files is available in the [TopologyReader](@ref IMP::pmi::topology::TopologyReader) documentation.
+ To begin, we built a topology file with the representation for the model of the complete system, `spatiotemporal_topology.txt`, located in `Heterogeneity/Heterogeneity_Modeling/`. This complete topology was used as a template to build topologies of each heterogeneity model. Based on our observation of the structure of the complex, we chose to represent each protein with at least 2 separate rigid bodies, and left the first 28 residues of protein C as flexible beads. Rigid bodies were described with 1 bead for every residue, and 10 residues per Gaussian. Flexible beads were described with 1 bead for every residue and 1 residue per Gaussian. A more complete description of the options available in topology files is available in the [TopologyReader](@ref IMP::pmi::topology::TopologyReader) documentation.

\code{.txt}
|molecule_name | color | fasta_fn | fasta_id | pdb_fn | chain | residue_range | pdb_offset | bead_size | em_residues_per_gaussian | rigid_body | super_rigid_body | chain_of_super_rigid_bodies |
@@ -160,9 +160,9 @@ After performing sampling, a variety of outputs will be created. These outputs i

## Generalizing modeling to all snapshots {#snapshot_combine}

- Next, we will describe the process of modeling a multiple static snapshots, as performed by running `start_sim.py`.
+ Next, we will describe the process of computing multiple static snapshot models, as performed by running `start_sim.py`.

- From heterogeneity modeling, we see that there are 3 heterogeneity models at each time point (it is possible to have more snapshot models than copy numbers if multiple copies of the protein exist in the complex), each of which has a corresponding topology file in `Heterogeneity/Heterogeneity_Modeling/`. We wrote a function, `generate_all_snapshots`, which creates a directory for each snapshot, copies the python script and topology file into that directory, and submits a job script to run sampling. The job script will likely need to be customized for the user's computer or cluster.
+ From heterogeneity modeling, we see that there are 3 heterogeneity models at each time point (it is possible to have more snapshot models than copy numbers if multiple copies of the protein exist in the complex), each of which has a corresponding topology file in `Heterogeneity/Heterogeneity_Modeling/`. We wrote a function, `generate_all_snapshots`, which creates a directory for each snapshot model, copies the python script and topology file into that directory, and submits a job script to run sampling. The job script will likely need to be customized for the user's computer or cluster.

\code{.py}
# 1a - parameters for generate_all_snapshots
@@ -183,7 +183,7 @@ generate_all_snapshots(state_dict, main_dir, topol_dir, items_to_copy, job_templ

\endcode
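The parameter definitions for this call are collapsed in the diff above. As a rough illustration only, a helper of this kind could be structured as in the sketch below; the function body, the topology file naming (`topology_{state}_{time}.txt`), the `job.sh` file name, and the `sbatch` submission command are assumptions for illustration, not the tutorial's actual code, and `state_dict` is assumed to map each time point to the number of snapshot models at that time.

\code{.py}
# Illustrative sketch only -- not the tutorial's implementation.
# Assumes a Slurm cluster (sbatch); swap in your scheduler's submit command.
import os
import shutil
import subprocess


def generate_all_snapshots_sketch(state_dict, main_dir, topol_dir,
                                  items_to_copy, job_template):
    """For each {state}_{time} snapshot model: make a directory, copy the
    modeling script and topology file into it, and submit a job script."""
    for time, n_states in state_dict.items():
        for state in range(1, n_states + 1):
            snapshot = f"{state}_{time}"
            snapshot_dir = os.path.join(main_dir, snapshot)
            os.makedirs(snapshot_dir, exist_ok=True)
            # copy the sampling script (static_snapshot.py) and related files
            for item in items_to_copy:
                shutil.copy(os.path.join(main_dir, item), snapshot_dir)
            # copy the snapshot-specific topology file (hypothetical naming)
            shutil.copy(os.path.join(topol_dir, f"topology_{snapshot}.txt"),
                        snapshot_dir)
            # write the job script from a template and submit it
            job_file = os.path.join(snapshot_dir, "job.sh")
            with open(job_file, "w") as fh:
                fh.write(job_template.format(snapshot=snapshot))
            subprocess.run(["sbatch", "job.sh"], cwd=snapshot_dir, check=True)
\endcode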

- We note that sometimes errors such as the one below can arise during sampling. These errors are caused by issues generating forward GMM files, which is done stochastically. If such issues arise, remove all files in the `forward_densities` folder for that snapshot and resubmit the corresponding jobs.
+ We note that sometimes errors such as the one below can arise during sampling. These errors are caused by issues generating forward GMM files, which is done stochastically. If such issues arise, remove all files in the `forward_densities` folder for that snapshot model and resubmit the corresponding jobs.

\code{.py}
File "/imp/main/20240607-af6f9d6a95/lib/release8/IMP/isd/gmm_tools.py", line 35, in decorate_gmm_from_text
@@ -197,9 +197,9 @@ Now, we have a variety of alternative snapshot models. In general, we would like

## Filtering good scoring models {#snapshot_filter}

- Initially, we want to filter the various alternative models to select those that meet certain parameter thresholds. In this case, we filter the structural models in each snapshot by the median cross correlation with EM data. We note that this filtering criterion is subjective, and developing a Bayesian method to objectively weigh different restraints for filtering remains an interesting future development in integrative modeling.
+ Initially, we want to filter the various alternative models to select those that meet certain parameter thresholds. In this case, we filter the structural models comprising each snapshot model by the median cross correlation with EM data. We note that this filtering criterion is subjective, and developing a Bayesian method to objectively weigh different restraints for filtering remains an interesting future development in integrative modeling.

- The current filtering procedure involves three steps. In the first step, we look through the `stat.*.out` files to write out the cross correlation with EM data for each model, which, in this case, is labeled column `3`, `GaussianEMRestraint_None_CCC`. In other applications, the column that corresponds to each type of experimental data may change, depending on the scoring terms for each model. For each snapshot, a new file is written with this data (`{state}_{time}_stat.txt`).
+ The current filtering procedure involves three steps. In the first step, we look through the `stat.*.out` files to write out the cross correlation with EM data for each model, which, in this case, is labeled column `3`, `GaussianEMRestraint_None_CCC`. In other applications, the column that corresponds to each type of experimental data may change, depending on the scoring terms for each model. For each snapshot model, a new file is written with this data (`{state}_{time}_stat.txt`).

\code{.py}
# state_dict - universal parameter
@@ -220,7 +220,7 @@ print("")
print("")
\endcode
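The tutorial's own script for this step is collapsed in the diff above. As a rough sketch of the idea only, the cross-correlation values can be pulled out of the `stat.*.out` files with IMP's stat-file reader and written one value per line; the `output/` subdirectory layout and the helper name are assumptions, not the tutorial's actual code.

\code{.py}
# Illustrative sketch only -- not the tutorial's implementation.
# Assumes the EM cross correlation is stored under "GaussianEMRestraint_None_CCC"
# and that each snapshot directory holds PMI stat files under output/.
import glob
import os
import IMP.pmi.output


def write_ccc_values_sketch(snapshot_dir, out_file):
    """Write the EM cross correlation of every sampled model, one per line."""
    field = "GaussianEMRestraint_None_CCC"
    with open(out_file, "w") as out:
        for stat_file in sorted(glob.glob(
                os.path.join(snapshot_dir, "output", "stat.*.out"))):
            po = IMP.pmi.output.ProcessOutput(stat_file)
            for value in po.get_fields([field])[field]:
                out.write(str(value) + "\n")

# e.g. write_ccc_values_sketch("1_2min", "1_2min_stat.txt")
\endcode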

- In the second step, we want to determine the median value of EM cross correlation for each snapshot. We wrote `general_rule_calculation` to look through the `general_rule_column` for each `{state}_{time}_stat.txt` file and determine both the median value and the number of structures generated.
+ In the second step, we want to determine the median value of EM cross correlation for each snapshot model. We wrote `general_rule_calculation` to look through the `general_rule_column` for each `{state}_{time}_stat.txt` file and determine both the median value and the number of structures generated.

\code{.py}
# 2 calling general_rule_calculation and related parameters
@@ -233,7 +233,7 @@ print("")
print("")
\endcode
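For reference, the median itself is a one-line computation. This is only a minimal sketch, assuming `{state}_{time}_stat.txt` holds one cross-correlation value per line (as in the sketch above), not the tutorial's `general_rule_calculation`:

\code{.py}
# Illustrative sketch only -- assumes one cross-correlation value per line.
import numpy as np


def median_and_count_sketch(stat_txt):
    values = np.loadtxt(stat_txt)
    return float(np.median(values)), len(values)

# e.g. median_ccc, n_models = median_and_count_sketch("1_2min_stat.txt")
\endcode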

- In the third step, we use the `imp_sampcon select_good` tool to filter each snapshot, according to the median value determined in the previous step. For each snapshot, this function produces a file, `good_scoring_models/model_ids_scores.txt`, which contains the run, replicaID, scores, and sampleID for each model that passes filtering. It also saves RMF files with each model from two independent groups of sampling runs from each snapshot to `good_scoring_models/sample_A` and `good_scoring_models/sample_B`, writes the scores for the two independent groups of sampling runs to `good_scoring_models/scoresA.txt` and `good_scoring_models/scoresB.txt`, and writes `good_scoring_models/model_sample_ids.txt` to connect each model to its division of sampling run. More information on `imp_sampcon` is available in the analysis portion of the [actin tutorial](https://integrativemodeling.org/tutorials/actin/analysis.html).
+ In the third step, we use the `imp_sampcon select_good` tool to filter each snapshot model, according to the median value determined in the previous step. For each snapshot model, this function produces a file, `good_scoring_models/model_ids_scores.txt`, which contains the run, replicaID, scores, and sampleID for each model that passes filtering. It also saves RMF files with each model from two independent groups of sampling runs from each snapshot model to `good_scoring_models/sample_A` and `good_scoring_models/sample_B`, writes the scores for the two independent groups of sampling runs to `good_scoring_models/scoresA.txt` and `good_scoring_models/scoresB.txt`, and writes `good_scoring_models/model_sample_ids.txt` to connect each model to its division of sampling run. More information on `imp_sampcon` is available in the analysis portion of the [actin tutorial](https://integrativemodeling.org/tutorials/actin/analysis.html).

\code{.py}
# 3 calling general_rule_filter_independent_samples
@@ -283,7 +283,7 @@ print("")
print("")
\endcode

- Next, we run `imp_sampcon exhaust` on each snapshot. This code performs checks on the exhaustiveness of the sampling. Specifically, it analyzes the convergence of the model score, whether the two model sets were drawn from the same distribution, and whether each structural cluster includes models from each sample proportionally to its size. The output for each snapshot is written out to the `exhaust_{state}_{time}` folder.
+ Next, we run `imp_sampcon exhaust` on each snapshot model. This code performs checks on the exhaustiveness of the sampling. Specifically, it analyzes the convergence of the model score, whether the two model sets were drawn from the same distribution, and whether each structural cluster includes models from each sample proportionally to its size. The output for each snapshot model is written out to the `exhaust_{state}_{time}` folder.

\code{.py}
# 7 calling exhaust
@@ -293,13 +293,13 @@ print("")
print("")
\endcode

- Plots for determining the sampling precision are shown below for a single snapshot, 1_2min. (a) Tests the convergence of the lowest scoring model (`snapshot_{state}_{time}.Top_Score_Conv.pdf`). Error bars represent standard deviations of the best scores, estimated by selecting different subsets of models 10 times. The light-blue line indicates a lower bound reference on the total score. (b) Tests that the scores of two independently sampled models come from the same distribution (`snapshot_{state}_{time}.Score_Dist.pdf`). The difference between the two distributions, as measured by the KS test statistic (D) and KS test p-value (p), indicates that the difference is both statistically insignificant (p>0.05) and small in magnitude (D<0.3). (c) Determines the structural precision of a snapshot model (`snapshot_{state}_{time}.ChiSquare.pdf`). RMSD clustering is performed at 1 Å intervals until the clustered population (% clustered) is greater than 80%, and either the χ<sup>2</sup> p-value is greater than 0.05 or Cramer’s V is less than 0.1. The sampling precision is indicated by the dashed black line. (d) Populations from sample 1 and sample 2 are shown for each cluster (`snapshot_{state}_{time}.Cluster_Population.pdf`).
+ Plots for determining the sampling precision are shown below for a single snapshot model, 1_2min. (a) Tests the convergence of the lowest scoring model (`snapshot_{state}_{time}.Top_Score_Conv.pdf`). Error bars represent standard deviations of the best scores, estimated by selecting different subsets of models 10 times. The light-blue line indicates a lower bound reference on the total score. (b) Tests that the scores of two independently sampled models come from the same distribution (`snapshot_{state}_{time}.Score_Dist.pdf`). The difference between the two distributions, as measured by the KS test statistic (D) and KS test p-value (p), indicates that the difference is both statistically insignificant (p>0.05) and small in magnitude (D<0.3). (c) Determines the structural precision of a snapshot model (`snapshot_{state}_{time}.ChiSquare.pdf`). RMSD clustering is performed at 1 Å intervals until the clustered population (% clustered) is greater than 80%, and either the χ<sup>2</sup> p-value is greater than 0.05 or Cramer’s V is less than 0.1. The sampling precision is indicated by the dashed black line. (d) Populations from sample 1 and sample 2 are shown for each cluster (`snapshot_{state}_{time}.Cluster_Population.pdf`).

\image html Snapshot_Exhaust.png width=1200px
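The two-sample KS test in panel (b) is computed by `imp_sampcon exhaust` itself, but the same check can be reproduced directly from the filtered score files with SciPy. This is a hedged sketch only, assuming `scoresA.txt` and `scoresB.txt` contain one total score per line:

\code{.py}
# Illustrative sketch only -- imp_sampcon exhaust already performs this test.
import numpy as np
from scipy import stats

scores_a = np.loadtxt("good_scoring_models/scoresA.txt")
scores_b = np.loadtxt("good_scoring_models/scoresB.txt")

# Two-sample Kolmogorov-Smirnov test: D is the test statistic, p the p-value.
res = stats.ks_2samp(scores_a, scores_b)
print(f"KS statistic D = {res.statistic:.3f}, p-value = {res.pvalue:.3f}")
# The tutorial considers the two samples consistent when the difference is
# statistically insignificant (p > 0.05) and small in magnitude (D < 0.3).
\endcode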

Further structural analysis can be calculated by using the `cluster.*` files. The `cluster.*.{sample}.txt` files contain the model number for the models in that cluster, where `{sample}` indicates which round of sampling the models came from. The `cluster.*` folder contains an RMF for the centroid model of that cluster, along with the localization densities for each protein. The localization densities of each protein from each independent sampling can be compared to ensure independent samplings produce the same results.
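For example, the per-cluster populations from the two samples can be tallied by counting lines in the membership files. This is only a sketch and assumes file names of the form `cluster.0.sample_A.txt` with one model number per line; check the actual names produced in your output directory.

\code{.py}
# Illustrative sketch only -- assumes cluster.N.sample_A.txt / cluster.N.sample_B.txt
# membership files with one model number per line.
import glob

for file_a in sorted(glob.glob("cluster.*.sample_A.txt")):
    file_b = file_a.replace("sample_A", "sample_B")
    cluster_id = file_a.split(".")[1]
    with open(file_a) as fa, open(file_b) as fb:
        n_a = sum(1 for _ in fa)
        n_b = sum(1 for _ in fb)
    total = n_a + n_b
    print(f"cluster {cluster_id}: {n_a} models from sample A, {n_b} from sample B "
          f"({100 * n_a / total:.1f}% / {100 * n_b / total:.1f}%)")
\endcode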

- Ideally, each of these plots should be checked for each snapshot. As a way to summarize the output of these checks, we can gather the results of the KS test and the sampling precision test for all snapshots. This is done by running `extract_exhaust_data` and `save_exhaust_data_as_png`, which write `KS_sampling_precision_output.txt` and `KS_sampling_precision_output.png`, respectively.
+ Ideally, each of these plots should be checked for each snapshot model. As a way to summarize the output of these checks, we can gather the results of the KS test and the sampling precision test for all snapshot models. This is done by running `extract_exhaust_data` and `save_exhaust_data_as_png`, which write `KS_sampling_precision_output.txt` and `KS_sampling_precision_output.png`, respectively.

\code{.py}
# 8 calling extract_exhaust_data
@@ -327,4 +327,4 @@ Here, we plotted each centroid model (A - blue, B - orange, and C - purple) from

\image html static_snapshots_noCC.png width=600px

- Finally, now that snapshot models have been assessed, we can perform [Modeling of trajectories.] (@ref trajectories)
+ Finally, now that snapshot models have been assessed, we can perform [Modeling of a trajectory.] (@ref trajectories)