Updating tutorial

salilab · Nov 14, 2024 · e9fdfb7
1 parent d475d61
commit e9fdfb7

Showing 2 changed files with 26 additions and 37 deletions.
diff --git a/doc/heterogeneity.md b/doc/heterogeneity.md
@@ -1,25 +1,30 @@
-Step 1: gather information for both snapshot and trajectory modeling {#heterogeneity}
+Modeling of heterogeneity {#heterogeneity}
 ====================================
 
-# Gathering information {#gathering}
+# Heterogeneity modeling step 1: gather information {#heterogeneity1}
 
-In the first step, information about the process of interest is gathered. This information can theoretically be applied to either modeling static snapshots or modeling trajectories, and can be used for representing the model, scoring the model, sampling alternative models, filtering sampled models, or validating the models.
+# Heterogeneity modeling step 2: representation, scoring, and search process {#heterogeneity2}
 
-For this tutorial, we used the X-ray crystal structure of the complete Bmi1/Ring1b-UbcH5c complex (a), synthetically generated electron tomography (ET) density maps during the assembly process (b), synthetically generated protein copy numbers during the assembly process, which can be calculated from experiments such as fluorescence correlation spectroscopy (FCS) (c), and synthetically generated small-angle X-ray scattering (SAXS) profiles during the assembly process (d). The crystal structure of the complex informs the final state of our model as well as the structure of the individual proteins. The time-dependent ET and SAXS data give two inputs to inform the size and shape of the assembling complex. The protein copy number data informs the stoichiometry of the complex during assembly.
+We first must select which snapshots to model. Here, we choose only to model snapshots at 0 minutes, 1 minute, and 2 minutes because ET and SAXS data are only available at those time points. We know this complex has three protein chains (A, B, and C), and we choose to model these chains based on their protein copy number data. We then use `prepare_protein_library`, [documented here](https://integrativemodeling.org/nightly/doc/ref/namespaceIMP_1_1spatiotemporal_1_1prepare__protein__library.html), to calculate the protein copy numbers for each snapshot model and to use the topology file of the full complex (`spatiotemporal_topology.txt`) to generate a topology file for each of these snapshot models. Here, we choose to model 3 protein copy numbers at each time point, and restrict the final time point to have the same protein copy numbers as the PDB structure. 
 
-\image html Input.png width=600px
+\code{.py}
+# 1a - parameters for prepare_protein_library:
+times = ["0min", "1min", "2min"]
+exp_comp = {'A': '../../Input_Information/gen_FCS/exp_compA.csv',
+            'B': '../../Input_Information/gen_FCS/exp_compB.csv',
+            'C': '../../Input_Information/gen_FCS/exp_compC.csv'}
+expected_subcomplexes = ['A', 'B', 'C']
+template_topology = 'spatiotemporal_topology.txt'
+template_dict = {'A': ['Ubi-E2-D3'], 'B': ['BMI-1'], 'C': ['E3-ubi-RING2']}
+nmodels = 3
 
-These pieces of information are stored in the `Input_Information` folder. In addition to containing the raw data used for the tutorial, this folder contains the code necessary to generate the synthetic data. This code is described in `README` files in each directory, but is not the focus of our tutorial.
+# 1b - calling prepare_protein_library
+IMP.spatiotemporal.prepare_protein_library.prepare_protein_library(times, exp_comp, expected_subcomplexes, nmodels,
+                                                template_topology=template_topology, template_dict=template_dict)
+\endcode
 
-The `FASTA` folder contains `3rpg.fasta.txt`, which provides the sequence information for each protein in the Bmi1/Ring1b-UbcH5c complex. The `PDB` folder contains the PDB structure for the fully assembled Bmi1/Ring1b-UbcH5c complex, [3RPG](https://www.rcsb.org/structure/3rpg).
+From the output of `prepare_protein_library`, we see that there are 3 snapshot models at each time point (it is possible to have more snapshot models than copy numbers if multiple copies of the protein exist in the complex). We then wrote `generate_all_snapshots`, which creates a directory for each snapshot, copies the necessary files into that directory, and submits a job script to run sampling. The job script will likely need to be customized for the user's computer or cluster.
 
-The `gen_FCS` folder contains protein copy number data for each protein as a function of time. Our code will use the `exp_comp{prot}.csv` files, where {prot} is the protein corresponding to that copy number data. Each csv file has 3 rows, which correspond to the time at which the data was taken ("Time"), the mean protein copy number at that time ("mean"), and the standard deviation in protein copy number at that time ("std").
 
-The `ET_data` folder contains the time-dependent ET data. Briefly, at each time point, a subset of Bmi1/Ring1b-UbcH5c complex proteins were used to compute a density map at each time point, and then random noise was added to this true density profile. The results of this computation are stored as `add_noise/{time}_noisy.mrc` and `add_noise/{time}_noisy.gmm`, where {time} is the time point in which the time dependent ET data was calculated.
-
-The `gen_SAXS` folder contains the time-dependent SAXS data. Experimental SAXS profiles are forward profiles calculated from the true structure by [FoXS](https://modbase.compbio.ucsf.edu/foxs/), and are stored as `{time}_exp.dat`, where {time} is the time point in which the time dependent ET data was calculated.
-
-In addition to the four types of data used here, a variety of data could be useful for the spatiotemporal modeling of protein complexes. IMP currently features  restraints for a variety of experimental data or prior models, including chemical cross-links, Förster resonance energy transfer, comparative structural models, and deep-learning structural models, all of which could inform spatiotemporal modeling through a procedure similar to the one presented here.
-
-Next, we will demonstrate how to perform [Snapshot modeling steps 2-4: representation, scoring, and search process](@ref snapshot1).
+# Heterogeneity modeling step 3: assessment {#heterogeneity_assess}
 
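The directory-per-snapshot workflow described for `generate_all_snapshots` can be illustrated with a minimal sketch. This is not the tutorial's implementation: the function name `sketch_snapshot_layout`, the `Snapshot_{state}_{time}` directory naming, the `job.sh` file name, and numbering states from 1 are all assumptions made for illustration. The sketch only writes the job script and does not submit it, and it leaves out `number_of_runs`.

```python
import os
import shutil
import tempfile


def sketch_snapshot_layout(state_dict, main_dir, items_to_copy, job_template):
    """Hypothetical stand-in: create one directory per snapshot and write a job script."""
    created = []
    for time, n_states in state_dict.items():
        for state in range(1, n_states + 1):  # assumed: states are numbered from 1
            # Assumed naming scheme for the snapshot directory.
            snapshot_dir = os.path.join(main_dir, f"Snapshot_{state}_{time}")
            os.makedirs(snapshot_dir, exist_ok=True)
            # Copy the shared files into the snapshot directory.
            for item in items_to_copy:
                shutil.copy(os.path.join(main_dir, item), snapshot_dir)
            # Fill the {state} and {time} placeholders of the job template.
            with open(os.path.join(snapshot_dir, "job.sh"), "w") as handle:
                handle.write(job_template.format(state=state, time=time))
            created.append(snapshot_dir)
    return created


# Demonstration in a throwaway directory with a placeholder sampling script.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "static_snapshot.py"), "w") as handle:
        handle.write("# placeholder\n")
    created = sketch_snapshot_layout(
        {"0min": 3, "1min": 3, "2min": 1},
        tmp,
        ["static_snapshot.py"],
        "python3 static_snapshot.py {state} {time}",
    )
    snapshot_names = [os.path.basename(path) for path in created]
    with open(os.path.join(created[0], "job.sh")) as handle:
        first_job = handle.read()
```

With the `state_dict` used in the tutorial, this produces seven snapshot directories (3 + 3 + 1), each holding a copy of the sampling script and a job script with its own state and time filled in.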
diff --git a/doc/snapshot.md b/doc/snapshot.md
@@ -157,41 +157,25 @@ After performing sampling, a variety of outputs will be created. These outputs i
 
 Next, we will describe the process of modeling multiple static snapshots, as performed by running `start_sim.py`.
 
-We first must select which snapshots to model. Here, we choose only to model snapshots at 0 minutes, 1 minute, and 2 minutes because ET and SAXS data are only available at those time points. We know this complex has three protein chains (A, B, and C), and we choose to model these chains based on their protein copy number data. We then use `prepare_protein_library`, [documented here](https://integrativemodeling.org/nightly/doc/ref/namespaceIMP_1_1spatiotemporal_1_1prepare__protein__library.html), to calculate the protein copy numbers for each snapshot model and to use the topology file of the full complex (`spatiotemporal_topology.txt`) to generate a topology file for each of these snapshot models. Here, we choose to model 3 protein copy numbers at each time point, and restrict the final time point to have the same protein copy numbers as the PDB structure. 
+From heterogeneity modeling, we see that there are 3 heterogeneity models at each time point (it is possible to have more snapshot models than copy numbers if multiple copies of the protein exist in the complex), each of which has a corresponding topology file in `Heterogeneity/Heterogeneity_Modeling/`. We wrote a function, `generate_all_snapshots`, which creates a directory for each snapshot, copies the Python script and topology file into that directory, and submits a job script to run sampling. The job script will likely need to be customized for the user's computer or cluster.
 
 \code{.py}
-# 1a - parameters for prepare_protein_library:
-times = ["0min", "1min", "2min"]
-exp_comp = {'A': '../../Input_Information/gen_FCS/exp_compA.csv',
-            'B': '../../Input_Information/gen_FCS/exp_compB.csv',
-            'C': '../../Input_Information/gen_FCS/exp_compC.csv'}
-expected_subcomplexes = ['A', 'B', 'C']
-template_topology = 'spatiotemporal_topology.txt'
-template_dict = {'A': ['Ubi-E2-D3'], 'B': ['BMI-1'], 'C': ['E3-ubi-RING2']}
-nmodels = 3
-
-# 1b - calling prepare_protein_library
-IMP.spatiotemporal.prepare_protein_library.prepare_protein_library(times, exp_comp, expected_subcomplexes, nmodels,
-                                                template_topology=template_topology, template_dict=template_dict)
-\endcode
-
-From the output of `prepare_protein_library`, we see that there are 3 snapshot models at each time point (it is possible to have more snapshot models than copy numbers if multiple copies of the protein exist in the complex). We then wrote `generate_all_snapshots`, which creates a directory for each snapshot, copies the necessary files into that directory, and submits a job script to run sampling. The job script will likely need to be customized for the user's computer or cluster.
-
-\code{.py}
-# 2a - parameters for generate_all_snapshots
+# 1a - parameters for generate_all_snapshots
 # state_dict - universal parameter
 state_dict = {'0min': 3, '1min': 3, '2min': 1}
 
 main_dir = os.getcwd()
+topol_dir = os.path.join(os.getcwd(), '../../Heterogeneity/Heterogeneity_Modeling')
 items_to_copy = ['static_snapshot.py']  # additionally we need to copy only specific topology file
 # job script will likely depend on the user's cluster / configuration
 job_template = ("#!/bin/bash\n#$ -S /bin/bash\n#$ -cwd\n#$ -r n\n#$ -j y\n#$ -N Tutorial\n#$ -pe smp 16\n"
                 "#$ -l h_rt=48:00:00\n\nmodule load Sali\nmodule load imp\nmodule load mpi/openmpi-x86_64\n\n"
                 "mpirun -np $NSLOTS python3 static_snapshot.py {state} {time}")
 number_of_runs = 50
 
-# 2b - calling generate_all_snapshots
-generate_all_snapshots(state_dict, main_dir, items_to_copy, job_template, number_of_runs)
+# 1b - calling generate_all_snapshots
+generate_all_snapshots(state_dict, main_dir, topol_dir, items_to_copy, job_template, number_of_runs)
+
 \endcode
 
 We note that errors such as the one below can sometimes arise during sampling. These errors are caused by issues generating forward GMM files, which is done stochastically. If such issues arise, remove all files in the `forward_densities` folder for that snapshot and resubmit the corresponding jobs.
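The cleanup step before resubmission could be scripted along the following lines. This is a minimal sketch, assuming that `forward_densities` sits directly inside the snapshot directory; the helper name `clear_forward_densities` and the demo file names are hypothetical and not part of the tutorial code.

```python
import tempfile
from pathlib import Path


def clear_forward_densities(snapshot_dir):
    """Delete every file in <snapshot_dir>/forward_densities and return the removed names."""
    density_dir = Path(snapshot_dir) / "forward_densities"  # assumed location
    removed = []
    if density_dir.is_dir():
        for path in sorted(density_dir.iterdir()):
            if path.is_file():
                path.unlink()
                removed.append(path.name)
    return removed


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "forward_densities"
    demo_dir.mkdir()
    (demo_dir / "example_a.gmm").write_text("placeholder")
    (demo_dir / "example_b.gmm").write_text("placeholder")
    removed_files = clear_forward_densities(tmp)
    remaining_files = list(demo_dir.iterdir())
```

The folder itself is kept so that the resubmitted jobs can regenerate the forward GMM files in place.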