diff --git a/Jupyter/.ipynb_checkpoints/spatiotemporal-checkpoint.ipynb b/Jupyter/.ipynb_checkpoints/spatiotemporal-checkpoint.ipynb
index 412949b87..c1f58a8d5 100644
--- a/Jupyter/.ipynb_checkpoints/spatiotemporal-checkpoint.ipynb
+++ b/Jupyter/.ipynb_checkpoints/spatiotemporal-checkpoint.ipynb
@@ -1,1303 +1,1291 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "id": "4a3dfff8-9d60-418c-a5fd-a4adf7a121b1",
- "metadata": {},
- "source": [
- "IMP spatiotemporal tutorial\n",
- "========\n",
- "\n",
- "# Introduction\n",
- "\n",
- "Biomolecules are constantly in motion; therefore, a complete depiction of their function must include their dynamics instead of just static structures. We have developed an integrative spatiotemporal approach to model dynamic systems.\n",
- "\n",
- "Our approach applies a composite workflow, consisting of three modeling problems to compute (i) heterogeneity models, (ii) snapshot models, and (iii) trajectory models.\n",
- "Heterogeneity models describe the possible biomolecular compositions of the system at each time point. Optionally, other auxiliary variables can be considered, such as the coarse location in the final state when modeling an assembly process.\n",
- "For each heterogeneity model, one snapshot model is produced. A snapshot model is a set of alternative standard static integrative structure models based on the information available for the corresponding time point.\n",
- "Then, trajectory models are created by connecting alternative snapshot models at adjacent time points. These trajectory models can be scored based on both the scores of static structures and the transitions between them, allowing for the creation of trajectories that are in agreement with the input information by construction.\n",
- "\n",
- "If you use this tutorial or its accompanying method, please site the corresponding publications:\n",
- "\n",
- "- Latham, A.P.; Tempkin, J.O.B.; Otsuka, S.; Zhang, W.; Ellenberg, J.; Sali, A. bioRxiv, 2024, https://doi.org/10.1101/2024.08.06.606842.\n",
- "- Latham, A.P.; Ro\u017ei\u010d, M.; Webb, B.M., Sali, A. in preparation. (tutorial)\n",
- "\n",
- "# Integrative spatiotemporal modeling workflow\n",
- "\n",
- "In general, integrative modeling proceeds through three steps (i. gathering information; ii. choosing the model representation, scoring alternative models, and searching for good scoring models; and iii. assessing the models). In integrative spatiotemporal modeling, these three steps are repeated for each modeling problem in the composite workflow (i. modeling of heterogeneity, ii. modeling of snapshots, and iii. modeling of a trajectory).\n",
- "\n",
- "\n",
- "\n",
- "This tutorial will walk you through the creation of a spatiotemporal model for the hypothetical assembly mechanism of the Bmi1/Ring1b-UbcH5c complex. We note that all experimental data besides the static structure used in this study is purely hypothetical, and, thus, the model should not be interpreted to be meaningful about the actual assembly mechanism of the complex.\n",
- "\n",
- "Finally, this notebook is intended to present an abbreviated version of this protocol, with the computationally expensive steps excluded. A more complete version of this tutorial can be found as a series of python scripts at https://github.com/salilab/imp_spatiotemporal_tutorial.\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "dc9d93b4-3b97-4187-867c-977f8353f4aa",
- "metadata": {},
- "source": [
- "Modeling of heterogeneity\n",
- "====================================\n",
- "\n",
- "Here, we describe the first modeling problem in our composite workflow, how to build models of heterogeneity modeling using IMP. In this tutorial, heterogeneity modeling only includes protein copy number; however, in general, other types of information, such as the coarse location in the final state, could also be included in heterogeneity models.\n",
- "\n",
- "# Heterogeneity modeling step 1: gathering of information\n",
- "\n",
- "We begin heterogeneity modeling with the first step of integrative modeling, gathering information. Heterogeneity modeling will rely on copy number information about the complex. In this case, we utilize the X-ray crystal structure of the fully assembled Bmi1/Ring1b-UbcH5c complex from the protein data bank (PDB), and synthetically generated protein copy numbers during the assembly process, which could be generated from experiments such as flourescence correlation spectroscopy (FCS).\n",
- "\n",
- "\n",
- "\n",
- "The PDB structure of the complex informs the final state of our model and constrains the maximum copy number for each protein, while the protein copy number data gives time-dependent information about the protein copy number in the assembling complex.\n",
- "\n",
- "# Heterogeneity modeling step 2: representation, scoring function, and search process\n",
- "\n",
- "Next, we represent, score and search for heterogeneity models models. A single heterogeneity model is a set of protein copy numbers, scored according to its fit to experimental copy number data at that time point. As ET and SAXS data, are only available at 0 minutes, 1 minute, and 2 minutes, we choose to create heterogeneity models at these three time points. We then use `prepare_protein_library`, to calculate the protein copy numbers for each snapshot model and to use the topology file of the full complex (`spatiotemporal_topology.txt`) to generate a topology file for each of these snapshot models. The choices made in this topology file are important for the representation, scoring function, and search process for snapshot models, and are discussed later. For heterogeneity modeling, we choose to model 3 protein copy numbers at each time point, and restrict the final time point to have the same protein copy numbers as the PDB structure.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "1ff4d3e5-04de-4092-8cfc-2ad3018675da",
- "metadata": {},
- "outputs": [],
- "source": [
- "# General imports for the tutorial\n",
- "import sys, os, glob, shutil\n",
- "from IMP.spatiotemporal import prepare_protein_library\n",
- "import IMP.spatiotemporal as spatiotemporal\n",
- "from IMP.spatiotemporal import analysis\n",
- "import numpy as np\n",
- "import matplotlib.pyplot as plt"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6b78a015-32e5-4701-8c6f-df14863ba9ce",
- "metadata": {},
- "outputs": [],
- "source": [
- "# parameters for prepare_protein_library:\n",
- "times = [\"0min\", \"1min\", \"2min\"]\n",
- "exp_comp = {'A': '../../../modeling/Input_Information/gen_FCS/exp_compA.csv',\n",
- " 'B': '../../../modeling/Input_Information/gen_FCS/exp_compB.csv',\n",
- " 'C': '../../../modeling/Input_Information/gen_FCS/exp_compC.csv'}\n",
- "expected_subcomplexes = ['A', 'B', 'C']\n",
- "template_topology = 'spatiotemporal_topology.txt'\n",
- "template_dict = {'A': ['Ubi-E2-D3'], 'B': ['BMI-1'], 'C': ['E3-ubi-RING2']}\n",
- "nmodels = 3\n",
- "\n",
- "# calling prepare_protein_library\n",
- "prepare_protein_library.prepare_protein_library(times, exp_comp, expected_subcomplexes, nmodels,\n",
- " template_topology=template_topology, template_dict=template_dict)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b56fbe48-da12-412e-ac2e-dca673e04a43",
- "metadata": {},
- "source": [
- "From the output of `prepare_protein_library`, we see that there are 3 heterogeneity models at each time point (it is possible to have more snapshot models than copy numbers if multiple copies of the protein exist in the complex). For each heterogeneity model, we see 2 files:\n",
- "- *.config, a file with a list of proteins represented in the heterogeneity model\n",
- "- *_topol.txt, a topology file for snapshot modeling corresponding to this heterogeneity model.\n",
- "\n",
- "# Heterogeneity modeling step 3: assessment\n",
- "\n",
- "Now, we have a variety of heterogeneity models. In general, there are four ways to assess a model: estimate the sampling precision, compare the model to data used to construct it, validate the model against data not used to construct it, and quantify the precision of the model. Here, we will focus specifically on comparing the model to experimental data, as other assessments will be performed later, when the trajectory models are assessed.\n",
- "\n",
- "Next, we must plot the modeled and experimental copy numbers simultaneously for each protein."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d3fdd381-6af6-44fd-8f36-dbfc5d7e10b4",
- "metadata": {},
- "outputs": [],
- "source": [
- "#TODO: plotting script"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9cf8ee82-4d59-4279-99dd-1d87b896fa15",
- "metadata": {},
- "source": [
- "From these plots, we observe that the range of possible experimental copy numbers are well sampled by the heterogeneity models, indicating that we are prepared for snapshot modeling."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "496f2687-bb05-4f51-8eab-a1170f6ac5fa",
- "metadata": {},
- "source": [
- "Modeling of snapshots\n",
- "====================================\n",
- "\n",
- "Here, we describe the second modeling problem in our composite workflow, how to build models of static snapshot models using IMP. We note that this process is similar to previous tutorials of [actin](https://integrativemodeling.org/tutorials/actin/) and [RNA PolII](https://integrativemodeling.org/tutorials/rnapolii_stalk/).\n",
- "\n",
- "# Snapshot modeling step 1: gathering of information\n",
- "\n",
- "We begin snapshot modeling with the first step of integrative modeling, gathering information. Snapshot modeling utilizes structural information about the complex. In this case, we utilize heterogeneity models, the X-ray crystal structure of the fully assembled Bmi1/Ring1b-UbcH5c complex from the protein data bank (PDB), synthetically generated electron tomography (ET) density maps during the assembly process, and physical principles.\n",
- "\n",
- "\n",
- "\n",
- "The heterogeneity models inform protein copy numbers for the snapshot models. The PDB structure of the complex informs the structure of the individual proteins. The time-dependent ET data informs the size and shape of the assembling complex. physical principles inform connectivity and excluded volume.\n",
- "\n",
- "# Snapshot modeling step 2: representation, scoring function, and search process\n",
- "\n",
- "Next, we represent, score and search for snapshot models. This step is quite computationally expensive. Therefore, we will not run the modeling protocol in this notebook, though the scripts are available in `modeling/Snapshots/Snapshots_Modeling/`. Here, we will simply describe the important steps made by two scripts. The first, `static_snapshot.py`, uses IMP to represent, score, and search for a single static snapshot model. The second, `start_sim.py`, automates the creation of a snapshot model for each heterogeneity model.\n",
- "\n",
- "## Modeling one snapshot",
- "\n",
- "Here, we will describe the process of modeling a single snapshot model, as performed by running `static_snapshot.py`.\n",
- "\n",
- "### Representing the model\n",
- "\n",
- "We begin by representing the data and the model. In general, the *representation* of a system is defined by all the variables that need to be determined.\n",
- "\n",
- "For our model of a protein complex, we use a combination of two representations. The first is a series of *spherical beads*, which can correspond to portions of the biomolecules of interest, such as atoms or groups of atoms. The second is a series of *3D Gaussians*, which help calculate the overlap between our model and the density from ET data.\n",
- "\n",
- "Beads and Gaussians in our model belong to either a *rigid body* or *flexible string*. The positions of all beads and Gaussians in a single rigid body are constrained during sampling and do not move relative to each other. Meanwhile, flexible beads can move freely during sampling, but are restrained by sequence connectivity.\n",
- "\n",
- "To begin, we built a topology file with the representation for the model of the complete system, `spatiotemporal_topology.txt`, located in `Heterogeneity/Heterogeneity_Modeling/`. This complete topology was used as a template to build topologies of each heterogeneity model. Based on our observation of the structure of the complex, we chose to represent each protein with at least 2 separate rigid bodies, and left the first 28 residues of protein C as flexible beads. Rigid bodies were described with 1 bead for every residue, and 10 residues per Gaussian. Flexible beads were described with 1 bead for every residue and 1 residue per Gaussian. A more complete description of the options available in topology files is available in the the [TopologyReader](https://integrativemodeling.org/2.21.0/doc/ref/classIMP_1_1pmi_1_1topology_1_1TopologyReader.html) documentation.\n",
- "\n",
- "\\code{.txt}\n",
- "|molecule_name | color | fasta_fn | fasta_id | pdb_fn | chain | residue_range | pdb_offset | bead_size | em_residues_per_gaussian | rigid_body | super_rigid_body | chain_of_super_rigid_bodies | \n",
- "\n",
- "|Ubi-E2-D3|blue|3rpg.fasta.txt|Ubi-E2-D3|3rpg.pdb|A|-1,18|2|1|10|1|1||\n",
- "|Ubi-E2-D3|blue|3rpg.fasta.txt|Ubi-E2-D3|3rpg.pdb|A|19,147|2|1|10|2|1||\n",
- "|BMI-1|red|3rpg.fasta.txt|BMI-1|3rpg.pdb|B|3,83|-2|1|10|3|2||\n",
- "|BMI-1|red|3rpg.fasta.txt|BMI-1|3rpg.pdb|B|84,101|-2|1|10|4|2||\n",
- "|E3-ubi-RING2|green|3rpg.fasta.txt|E3-ubi-RING2|BEADS|C|16,44|-15|1|1|5|3||\n",
- "|E3-ubi-RING2|green|3rpg.fasta.txt|E3-ubi-RING2|3rpg.pdb|C|45,116|-15|1|10|6|3||\n",
- "\\endcode\n",
- "\n",
- "Next, we must prepare `static_snapshot.py` to read in this topology file. We begin by defining the input variables, `state` and `time`, which define which topology to use, as well as the paths to other pieces of input information.\n",
- "\n",
- "\\code{.py}\n",
- "### Running parameters to access correct path of ET_data for EM restraint",
- "### and topology file for certain {state}_{time}_topol.txt",
- "state = sys.argv[1]\n",
- "time = sys.argv[2]\n",
- "\n",
- "### Topology file",
- "topology_file = f\"../{state}_{time}_topol.txt\"\n",
- "### Paths to input data for topology file",
- "pdb_dir = \"../../../../Input_Information/PDB\"\n",
- "fasta_dir = \"../../../../Input_Information/FASTA\"\n",
- "### Path where forward gmms are created with BuildSystem (based ont topology file)",
- "### If gmms exist, they will be used from this folder",
- "forward_gmm_dir = \"../forward_densities/\"\n",
- "### Path to experimental gmms",
- "exp_gmm_dir= '../../../../Input_Information/ET_data/add_noise'\n",
- "\\endcode\n",
- "\n",
- "Next, we build the system, using the topology tile, described above.\n",
- "\\code{.py}\n",
- "### Create a system from a topology file. Resolution is set on 1.",
- "bs = IMP.pmi.macros.BuildSystem(mdl, resolutions= 1, name= f'Static_snapshots_{state}_{time}')\n",
- "bs.add_state(t)\n",
- "\\endcode\n",
- "\n",
- "Then, we prepare for later sampling steps by setting which Monte Carlo moves will be performed. Rotation (`rot`) and translation (`trans`) parameters are separately set for super rigid bodies (`srb`), rigid bodies (`rb`), and beads (`bead`).\n",
- "\\code{.py}\n",
- "### Macro execution: It gives hierarchy and degrees of freedom (dof).",
- "### In dof we define how much can each (super) rigid body translate and rotate between two adjacent Monte Carlo steps",
- "root_hier, dof = bs.execute_macro(max_rb_trans=1.0,\n",
- " max_rb_rot=0.5, max_bead_trans=2.0,\n",
- " max_srb_trans=1.0, max_srb_rot=0.5)\n",
- "\\endcode\n",
- "\n",
- "### Scoring the model\n",
- "\n",
- "After building the model representation, we choose a scoring function to score the model based on input information. This scoring function is represented as a series of restraints that serve as priors.\n",
- "\n",
- "#### Connectivity",
- "\n",
- "We begin with a connectivity restraint, which restrains beads adjacent in sequence to be close in 3D space.\n",
- "\n",
- "\\code{.py}\n",
- "#### Adding Restraints",
- "#### Empty list where the data from restraints should be collected",
- "output_objects=[]\n",
- "\n",
- "#### Two common restraints: ConnectivityRestraint and ExcludedVolumeSphere",
- "#### ConnectivityRestraint is added for each \"molecule\" separately",
- "for m in root_hier.get_children()[0].get_children():\n",
- " cr = IMP.pmi.restraints.stereochemistry.ConnectivityRestraint(m)\n",
- " cr.add_to_model()\n",
- " output_objects.append(cr)\n",
- "\\endcode\n",
- "\n",
- "#### Excluded volume",
- "\n",
- "Next is an excluded volume restraint, which restrains beads to minimize their spatial overlap.\n",
- "\n",
- "\\code{.py}\n",
- "#### Add excluded volume",
- "evr = IMP.pmi.restraints.stereochemistry.ExcludedVolumeSphere(\n",
- " included_objects=[root_hier],\n",
- " resolution=1000)\n",
- "output_objects.append(evr)\n",
- "evr.add_to_model()\n",
- "\\endcode\n",
- "\n",
- "#### Electron tomography",
- "\n",
- "Finally, we restrain our models based on their fit to ET density maps. Both the experimental map and the forward protein density are represented as Gaussian mixture models (GMMs) to speed up scoring. The score is based on the log of the correlation coefficient between the experimental density and the forward protein density.\n",
- "\n",
- "\\code{.py}\n",
- "#### Applying time-dependent EM restraint. Point to correct gmm / mrc file at each time point",
- "#### Path to corresponding .gmm file (and .mrc file)",
- "em_map = exp_gmm_dir + f\"/{time}_noisy.gmm\"\n",
- "\n",
- "#### Create artificial densities from hierarchy",
- "densities = IMP.atom.Selection(root_hier,\n",
- " representation_type=IMP.atom.DENSITIES).get_selected_particles()\n",
- "\n",
- "#### Create EM restraint based on these densities",
- "emr = IMP.pmi.restraints.em.GaussianEMRestraint(\n",
- " densities,\n",
- " target_fn=em_map,\n",
- " slope=0.000001,\n",
- " scale_target_to_mass=True,\n",
- " weight=1000)\n",
- "output_objects.append(emr)\n",
- "emr.add_to_model()\n",
- "\\endcode\n",
- "\n",
- "### Searching for good scoring models\n",
- "\n",
- "After building a scoring function that scores alternative models based on their fit to the input information, we aim to search for good scoring models. For complicated systems, stochastic sampling techniques such as Monte Carlo (MC) sampling are often the most efficient way to compute good scoring models. Here, we generate a random initial configuration and then perform temperature replica exchange MC sampling with 16 temperatures from different initial configurations. By performing multiple runs of replica exchange MC from different initial configurations, we can later ensure that our sampling is sufficiently converged.\n",
- "\n",
- "\\code{.py}\n",
- "### Generate random configuration",
- "IMP.pmi.tools.shuffle_configuration(root_hier,\n",
- " max_translation=50)\n",
- "\n",
- "### Perform replica exchange sampling",
- "rex=IMP.pmi.macros.ReplicaExchange(mdl,\n",
- " root_hier=root_hier,\n",
- " monte_carlo_sample_objects=dof.get_movers(),\n",
- " global_output_directory='output', # name 'output' is the best for imp sampcon select_good\n",
- " output_objects=output_objects,\n",
- " monte_carlo_steps=200, # Number of MC steps between writing frames.\n",
- " number_of_best_scoring_models=0,\n",
- " number_of_frames=500) # number of frames to be saved\n",
- "### In our case, for each snapshot we generated 25000 frames altogether (50*500)",
- "rex.execute_macro()\n",
- "\\endcode\n",
- "\n",
- "After performing sampling, a variety of outputs will be created. These outputs include `.rmf` files, which contain multi-resolution models output by IMP, and `.out` files which contains a variety of information about the run such as the value of the restraints and the MC acceptance rate.\n",
- "\n",
- "## Generalizing modeling to all snapshots\n",
- "\n",
- "Next, we will describe the process of computing multiple static snapshot models, as performed by running `start_sim.py`.\n",
- "\n",
- "From heterogeneity modeling, we see that there are 3 heterogeneity models at each time point (it is possible to have more snapshot models than copy numbers if multiple copies of the protein exist in the complex), each of which has a corresponding topology file in `Heterogeneity/Heterogeneity_Modeling/`. We wrote a function, `generate_all_snapshots`, which creates a directory for each snapshot model, copies the python script and topology file into that directory, and submits a job script to run sampling. The job script will likely need to be customized for the user's computer or cluster.\n",
- "\n",
- "\\code{.py}\n",
- "## 1a - parameters for generate_all_snapshots",
- "## state_dict - universal parameter",
- "state_dict = {'0min': 3, '1min': 3, '2min': 1}\n",
- "\n",
- "main_dir = os.getcwd()\n",
- "topol_dir = os.path.join(os.getcwd(), '../../Heterogeneity/Heterogeneity_Modeling')\n",
- "items_to_copy = ['static_snapshot.py'] # additionally we need to copy only specific topology file\n",
- "## jobs script will likely depend on the user's cluster / configuration",
- "job_template = (\"#!/bin/bash\\n#$ -S /bin/bash\\n#$ -cwd\\n#$ -r n\\n#$ -j y\\n#$ -N Tutorial\\n#$ -pe smp 16\\n\"\n",
- " \"#$ -l h_rt=48:00:00\\n\\nmodule load Sali\\nmodule load imp\\nmodule load mpi/openmpi-x86_64\\n\\n\"\n",
- " \"mpirun -np $NSLOTS python3 static_snapshot.py {state} {time}\")\n",
- "number_of_runs = 50\n",
- "\n",
- "## 1b - calling generate_all_snapshots",
- "generate_all_snapshots(state_dict, main_dir, topol_dir, items_to_copy, job_template, number_of_runs)\n",
- "\n",
- "\\endcode\n",
- "\n",
- "# Snapshot modeling step 3: assessment\n",
- "\n",
- "The above code would variety of alternative snapshot models. In general, we would like to assess these models in at least 4 ways: estimate the sampling precision, compare the model to data used to construct it, validate the model against data not used to construct it, and quantify the precision of the model. In this portion of the tutorial, we focus specifically on estimating the sampling precision of the model, while quantitative comparisons between the model and experimental data will be reserved for the final step, when we assess trajectories. Again, this assessment process is quite computationally intensive, so, instead of running the script explicitly, we will walk you through the `snapshot_assessment.py` script, which is located in the `modeling/Snapshots/Snapshots_Assessment` folder.\n",
- "\n",
- "## Filtering good scoring models\n",
- "\n",
- "Initially, we want to filter the various alternative structural models to only select those that meet certain parameter thresholds. In this case, we filter the structural models comprising each snapshot model by the median cross correlation with EM data. We note that this filtering criteria is subjective, and developing a Bayesian method to objectively weigh different restraints for filtering remains an interesting future development in integrative modeling.\n",
- "\n",
- "The current filtering procedure involves three steps. In the first step, we look through the `stat.*.out` files to write out the cross correlation with EM data for each model, which, in this case, is labeled column `3`, `GaussianEMRestraint_None_CCC`. In other applications, the column that corresponds to each type of experimental data may change, depending on the scoring terms for each model. For each snapshot model, a new file is written with this data (`{state}_{time}_stat.txt`).\n",
- "\n",
- "\\code{.py}\n",
- "## state_dict - universal parameter",
- "state_dict = {'0min': 3, '1min': 3, '2min': 1}\n",
- "## current directory",
- "main_dir = os.getcwd()\n",
- "\n",
- "## 1 calling extracting_stat_files function and related parameters",
- "keys_to_extract = [3]\n",
- "runs_nr = 50\n",
- "replica_nr = 16\n",
- "replica_output_name = 'output'\n",
- "decimals_nr = 16\n",
- "\n",
- "extracting_stat_files(state_dict, runs_nr, replica_nr, replica_output_name, keys_to_extract, decimals_nr)\n",
- "print(\"extracting_stat_files is DONE\")\n",
- "print(\"\")\n",
- "print(\"\")\n",
- "\\endcode\n",
- "\n",
- "In the second step, we want to determine the median value of EM cross correlation for each snapshot model. We wrote `general_rule_calculation` to look through the `general_rule_column` for each `{state}_{time}_stat.txt` file and determine both the median value and the number of structures generated.\n",
- "\n",
- "\\code{.py}\n",
- "## 2 calling general_rule_calculation and related parameters",
- "general_rule_column = '3'\n",
- "\n",
- "general_rule_calculation(state_dict, general_rule_column)\n",
- "\n",
- "print(\"general_rule_calculation is DONE\")\n",
- "print(\"\")\n",
- "print(\"\")\n",
- "\\endcode\n",
- "\n",
- "In the third step, we use the `imp_sampcon select_good` tool to filter each snapshot model, according to the median value determined in the previous step. For each snapshot model, this function produces a file, `good_scoring_models/model_ids_scores.txt`, which contains the run, replicaID, scores, and sampleID for each model that passes filtering. It also saves RMF files with each model from two independent groups of sampling runs from each snapshot model to `good_scoring_models/sample_A` and `good_scoring_models/sample_B`, writes the scores for the two independent groups of sampling runs to `good_scoring_models/scoresA.txt` and `good_scoring_models/scoresB.txt`, and writes `good_scoring_models/model_sample_ids.txt` to connect each model to its division of sampling run. More information on `imp_sampcon` is available in the analysis portion of the [actin tutorial](https://integrativemodeling.org/tutorials/actin/analysis.html).\n",
- "\n",
- "\\code{.py}\n",
- "## 3 calling general_rule_filter_independent_samples",
- "general_rule_filter_independent_samples(state_dict, main_dir)\n",
- "print(\"general_rule_filter_independent_samples is DONE\")\n",
- "print(\"\")\n",
- "print(\"\")\n",
- "\\endcode\n",
- "\n",
- "## Plotting data, clustering models, and determining sampling precision\n",
- "\n",
- "Next, scores can be plotted for analysis. Here, we wrote the `create_histograms` function to run `imp_sampcon plot_score` so that it plots distributions for various scores of interest. Each of these plots are saved to `histograms{state}_{time}/{score}.png`, where score is an object listed in the `score_list`. These plots are useful for debugging the modeling protocol, and should appear roughly Gaussian.\n",
- "\n",
- "\\code{.py}\n",
- "## 4 calling create_histograms and related parameters",
- "score_list = [\n",
- " 'Total_Score',\n",
- " 'ConnectivityRestraint_Score',\n",
- " 'ExcludedVolumeSphere_Score',\n",
- " 'GaussianEMRestraint_None',\n",
- " 'GaussianEMRestraint_None_CCC'\n",
- "] # list of histograms we want to create in each histograms{state}_{time} directory\n",
- "\n",
- "create_histograms(state_dict, main_dir, score_list)\n",
- "print(\"create_histograms is DONE\")\n",
- "print(\"\")\n",
- "print(\"\")\n",
- "\\endcode\n",
- "\n",
- "We then check the number of models in each sampling run though our function, `count_rows_and_generate_report`, which writes the `independent_samples_stat.txt` file. Empirically, we have found that ensuring the overall number of models in each independent sample after filtering is roughly equal serves a good first check on sampling convergence.\n",
- "\n",
- "\\code{.py}\n",
- "## 5 calling count_rows_and_generate_report",
- "count_rows_and_generate_report(state_dict)\n",
- "print(\"count_rows_and_generate_report is DONE\")\n",
- "print(\"\")\n",
- "print(\"\")\n",
- "\\endcode\n",
- "\n",
- "Next, we write the density range dictionaries, which are output as `{state}_{time}_density_ranges.txt`. These dictionaries label each protein in each snapshot model, which will be passed into `imp_sampcon` to calculate the localization density of each protein.\n",
- "\n",
- "\\code{.py}\n",
- "## 6 calling create_density_dictionary:",
- "create_density_dictionary_files(state_dict, main_dir)\n",
- "print(\"create_density_dictionary is DONE\")\n",
- "print(\"\")\n",
- "print(\"\")\n",
- "\\endcode\n",
- "\n",
- "Next, we run `imp_sampcon exhaust` on each snapshot model. This code performs checks on the exhaustiveness of the sampling. Specifically it analyzes the convergence of the model score, whether the two model sets were drawn from the same distribution, and whether each structural cluster includes models from each sample proportionally to its size. The output for each snapshot model is written out to the `exhaust_{state}_{time}` folder.\n",
- "\n",
- "\\code{.py}\n",
- "## 7 calling exhaust",
- "exhaust(state_dict, main_dir)\n",
- "print(\"exhaust is DONE\")\n",
- "print(\"\")\n",
- "print(\"\")\n",
- "\\endcode\n",
- "\n",
- "Plots for determining the sampling precision are shown below for a single snapshot model, 1_2min. (a) Tests the convergence of the lowest scoring model (`snapshot_{state}_{time}.Top_Score_Conv.pdf`). Error bars represent standard deviations of the best scores, estimated by selecting different subsets of models 10 times. The light-blue line indicates a lower bound reference on the total score. (b) Tests that the scores of two independently sampled models come from the same distribution (`snapshot_{state}_{time}.Score_Dist.pdf`). The difference between the two distributions, as measured by the KS test statistic (D) and KS test p-value (p) indicates that the difference is both statistically insignificant (p>0.05) and small in magnitude (D<0.3). (c) Determines the structural precision of a snapshot model (`snapshot_{state}_{time}.ChiSquare.pdf`). RMSD clustering is performed at 1 \u00c5 intervals until the clustered population (% clustered) is greater than 80%, and either the \u03c72 p-value is greater than 0.05 or Cramer\u2019s V is less than 0.1. The sampling precision is indicated by the dashed black line. (d) Populations from sample 1 and sample 2 are shown for each cluster (`snapshot_{state}_{time}.Cluster_Population.pdf`).\n",
- "\n",
- "\n",
- "\n",
- "Further structural analysis can be calculated by using the `cluster.*` files. The `cluster.*.{sample}.txt` files contain the model number for the models in that cluster, where `{sample}` indicates which round of sampling the models came from. The `cluster.*` folder contains an RMF for centroid model of that cluster, along with the localization densities for each protein. The localization densities of each protein from each independent sampling can be compared to ensure independent samplings produce the same results.\n",
- "\n",
- "Ideally, each of these plots should be checked for each snapshot model. As a way to summarize the output of these checks, we can gather the results of the KS test and the sampling precision test for all snapshot models. This is done by running `extract_exhaust_data` and `save_exhaust_data_as_png`, which write `KS_sampling_precision_output.txt` and `KS_sampling_precision_output.png`, respectively.\n",
- "\n",
- "\\code{.py}\n",
- "## 8 calling extract_exhaust_data",
- "extract_exhaust_data(state_dict)\n",
- "print(\"extract_exhaust_data is DONE\")\n",
- "print(\"\")\n",
- "print(\"\")\n",
- "\n",
- "## 9 calling save_exhaust_data_as_png",
- "save_exhaust_data_as_png()\n",
- "print(\"save_exhaust_data_as_png is DONE\")\n",
- "print(\"\")\n",
- "print(\"\")\n",
- "\\endcode\n",
- "\n",
- "These codes write a table that include the KS two sample test statistic (D), the KS test p-value, and the sampling precision for each snapshot model, which is replotted below.\n",
- "\n",
- "\n",
- "\n",
- "## Visualizing models\n",
- "\n",
- "The resulting RMF files and localization densities from this analysis can be viewed in [UCSF Chimera](https://www.rbvi.ucsf.edu/chimera/) (version>=1.13) or [UCSF ChimeraX](https://www.cgl.ucsf.edu/chimerax/).\n",
- "\n",
- "Here, we plotted each centroid model (A - blue, B - orange, and C - purple) from the most populated cluster for each snapshot model and compared that model to the experimental EM profile (gray).\n",
- "\n",
- "\n",
- "\n",
- "Finally, now that snapshot models were assessed, we can perform modeling of a trajectory."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6fc2546f-7c6f-4146-8c9b-ee143fcead6e",
- "metadata": {},
- "source": [
- "Modeling of a Trajectory\n",
- "====================================\n",
- "\n",
- "Here, we describe the final modeling problem in our composite workflow, how to build models of trajectory models using IMP.\n",
- "\n",
- "# Trajectory modeling step 1: gathering of information\n",
- "\n",
- "We begin trajectory modeling with the first step of integrative modeling, gathering information. Trajectory modeling utilizes dynamic information about the bimolecular process. In this case, we utilize heterogeneity models, snapshot models, physical theories, and synthetically generated small-angle X-ray scattering (SAXS) profiles.\n",
- "\n",
- "\n",
- "\n",
- "Heterogeneity models inform the possible compositional states at each time point and measure how well a compositional state agrees with input information. Snapshot models provide structural models for each heterogeneity model and measure how well those structural models agree with input information about their structure. Physical theories of macromolecular dynamics inform transitions between states. SAXS data informs the size and shape of the assembling complex and is left for validation.\n",
- "\n",
- "# Trajectory modeling step 2: representation, scoring function, and search process\n",
- "\n",
- "Trajectory modeling connects alternative snapshot models at adjacent time points, followed by scoring the trajectory models based on their fit to the input information, as described in full [here](https://www.biorxiv.org/content/10.1101/2024.08.06.606842v1.abstract).\n",
- "\n",
- "## Background behind integrative spatiotemporal modeling",
- "\n",
- "### Representing the model\n",
- "\n",
- "We choose to represent dynamic processes as a trajectory of snapshot models, with one snapshot model at each time point. In this case, we computed snapshot models at 3 time points (0, 1, and 2 minutes), so a single trajectory model will consist of 3 snapshot models, one at each 0, 1, and 2 minutes. The modeling procedure described here will produce a set of scored trajectory models, which can be displayed as a directed acyclic graph, where nodes in the graph represent the snapshot model and edges represent connections between snapshot models at neighboring time points.\n",
- "\n",
- "### Scoring the model\n",
- "\n",
- "To score trajectory models, we incorporate both the scores of individual snapshot models, as well as the scores of transitions between them. Under the assumption that the process is Markovian (*i.e.* memoryless), the weight of a trajectory model takes the form:\n",
- "\n",
- "$$\n",
- "W(\\chi) \\propto \\displaystyle\\prod^{T}_{t=0} P( X_{N,t}, N_{t} | D_{t}) \\cdot \\displaystyle\\prod^{T-1}_{t=0} W(X_{N,t+1},N_{t+1} | X_{N,t},N_{t}, D_{t,t+1}),\n",
- "$$\n",
- "\n",
- "where $t$ indexes times from 0 until the final modeled snapshot ($T$); $P(X_{N,t}, N_{t} | D_{t})$ is the snapshot model score; and \\f$W(X_{N,t+1},N_{t+1} | X_{N,t},N_{t}, D_{t,t+1})\\f$ is the transition score. Trajectory model weights ($W(\\chi)$) are normalized so that the sum of all trajectory models' weights is 1.0. Transition scores are currently based on a simple metric that either allows or disallows a transition. Transitions are only allowed if all proteins in the first snapshot model are included in the second snapshot model. In the future, we hope to include more detailed transition scoring terms, which may take into account experimental information or physical models of macromolecular dynamics.\n",
- "\n",
- "### Searching for good scoring models\n",
- "\n",
- "Trajectory models are constructed by enumerating all connections between adjacent snapshot models and scoring these trajectory models according to the equation above. This procedure results in a set of weighted trajectory models.\n",
- "\n",
- "## Computing trajectory models",
- "\n",
- "To compute trajectory models, we first copy all necessary files to a new directory, `data`. These files are (i) `{state}_{time}.config` files, which include the subcomplexes that are in each state, (ii) `{state}_{time}_scores.log`, which is a list of all scores of all structural models in that snapshot model, and (iii) `exp_comp{prot}.csv`, which is the experimental copy number for each protein (`{prot}`) as a function of time. Here, we copy files related to the snapshots (`*.log` files) from the `modeling` directory, as we skipped computing snapshots due to the computational expense.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "bb887efe-0630-47b6-9bf1-85fa216a6816",
- "metadata": {},
- "outputs": [],
- "source": [
- "def merge_scores(fileA, fileB, outputFile):\n",
- " \"\"\"\n",
- " For each function merges scoresA.txt and scoresB.txt into {state}_{time}_scores.log\n",
- "\n",
- " :param fileA: path to scoresA.txt\n",
- " :param fileB: path to scoresB.txt\n",
- " :param outputFile: path to output merged .log file named {state}_{time}_scores.log for each snapshot.\n",
- " This type of .log file is used in crete_DAG to generate trajectory model.\n",
- " \"\"\"\n",
- " # open both files, so data can be extracted\n",
- " with open(fileA, 'r') as file_a:\n",
- " data_a = file_a.readlines()\n",
- "\n",
- " with open(fileB, 'r') as file_b:\n",
- " data_b = file_b.readlines()\n",
- "\n",
- " # Merge the content of both files\n",
- " merged_data = data_a + data_b\n",
- "\n",
- " # Write the merged content into the output file\n",
- " with open(outputFile, 'w') as output:\n",
- " output.writelines(merged_data)\n",
- "\n",
- "def create_data_and_copy_files(state_dict, custom_source_dir1 = None, custom_source_dir2 = None, custom_source_dir3 = None):\n",
- " \"\"\"\n",
- " Copies three types of files important to generate trajectory models:\n",
- " -.config files created with start_sim.py in Snapshot_Modeling (source_dir1)\n",
- " -time-dependent stoichiometry data for each timepoint. Data should be presented in .csv file. With this function all\n",
- " csv file in source_dir2 will be copied. These .csv files will be used in the exp_comp dictionary in create_DAG\n",
- " function\n",
- " -scoresA and scoresB for each snapshot created with imp sampcon exhaust\n",
- " (source_dir1 + snapshot + good_scoring_models) are merged into total score .txt using merge_scores helper function.\n",
- " All copied files are gathered in newly created './data/' directory, where everything is prepared for create_DAG\n",
- " function.\n",
- "\n",
- "\n",
- " :param state_dict (dict): state_dict: dictionary that defines the spatiotemporal model.\n",
- " The keys are strings that correspond to each time point in the\n",
- " stepwise temporal process. Keys should be ordered according to the\n",
- " steps in the spatiotemporal process. The values are integers that\n",
- " correspond to the number of possible states at that timepoint.\n",
- " :param custom_source_dir1 (optional - str): Custom path to heterogeneity modeling dir (heterogeneity_modeling.py),\n",
- " to copy .config files\n",
- " :param custom_source_dir2 (optional - str): Custom path to stoichiometry data dir\n",
- " :param custom_source_dir3 (optional - str): Custom path to snapshot modeling dir (start_sim.py), to copy .config\n",
- " files and to access scoresA/scoresB (custom_source_dir3 + snapshot{state}_{time} + 'good_scoring_models')\n",
- " \"\"\"\n",
- "\n",
- " # Create the destination directory if it does not exist (./data/). Here all the\n",
- " destination_dir = './data/'\n",
- " os.makedirs(destination_dir, exist_ok=True)\n",
- "\n",
- " # Path to heterogeneity modeling dir\n",
- " if custom_source_dir1:\n",
- " source_dir1 = custom_source_dir1\n",
- " else:\n",
- " source_dir1 = '../../Heterogeneity/Heterogeneity_Modeling/'\n",
- "\n",
- " # Path to stoichiometry data dir\n",
- " if custom_source_dir2:\n",
- " source_dir2 = custom_source_dir2\n",
- " else:\n",
- " source_dir2 = '../../Input_Information/gen_FCS/'\n",
- "\n",
- " # Path to snapshot modeling dir\n",
- " if custom_source_dir3:\n",
- " source_dir3 = custom_source_dir3\n",
- " else:\n",
- " source_dir3 = '../../Snapshots/Snapshots_Modeling/'\n",
- "\n",
- " # Copy all .config files from the first source directory to the destination directory\n",
- " try:\n",
- " for file_name in os.listdir(source_dir1):\n",
- " if file_name.endswith('.config'):\n",
- " full_file_name = os.path.join(source_dir1, file_name)\n",
- " if os.path.isfile(full_file_name):\n",
- " shutil.copy(full_file_name, destination_dir)\n",
- " print(\".config files are copied\")\n",
- " except Exception as e:\n",
- " print(f\".config files cannot be copied. Try do do it manually. Reason for Error: {e}\")\n",
- "\n",
- " # Copy all .csv stoichiometry files from the second source directory to the destination directory\n",
- " try:\n",
- " for file_name in os.listdir(source_dir2):\n",
- " if file_name.endswith('.csv'):\n",
- " full_file_name = os.path.join(source_dir2, file_name)\n",
- " if os.path.isfile(full_file_name):\n",
- " shutil.copy(full_file_name, destination_dir)\n",
- " print(\".csv stoichiometry files are copied\")\n",
- " except Exception as e:\n",
- " print(f\".csv stoichiometry files cannot be copied. Try do do it manually. Reason for Error: {e}\")\n",
- "\n",
- " # Copy scoresA and scoresB from the snapshot_{state}_{time} directories and first source directory path\n",
- " for time in state_dict.keys():\n",
- " for state in range(1, state_dict[time] + 1):\n",
- " dir_name = f\"snapshot{state}_{time}\"\n",
- " good_scoring_path = \"good_scoring_models\"\n",
- " file_a = os.path.join(source_dir3, dir_name, good_scoring_path, \"scoresA.txt\")\n",
- " file_b = os.path.join(source_dir3, dir_name, good_scoring_path, \"scoresB.txt\")\n",
- " output_file = os.path.join(destination_dir, f\"{state}_{time}_scores.log\") # name of the output file\n",
- "\n",
- " try:\n",
- " # Ensure the directory exists before try to read/write files\n",
- " if os.path.exists(file_a) and os.path.exists(file_b):\n",
- " merge_scores(file_a, file_b, output_file) # call helper function to merge files\n",
- " print(f\"Scores for snapshot{state}_{time} have been merged and saved\")\n",
- " else: # many things can go wrong here, so it is good to know where is the problem\n",
- " print(f\"Path doesn't exist: {source_dir3}\")\n",
- " print(f\"Files not found in directory: {dir_name}\")\n",
- " print(f\"Files not found in directory: {file_a}\")\n",
- " print(f\"Files not found in directory: {file_b}\")\n",
- " print(f\"Output directory doesn't exist: {destination_dir}\")\n",
- " except Exception as e:\n",
- " print(f\"total scores files cannot be copied of merged. Reason for Error: {e}\")\n",
- "\n",
- "# copy all the relevant files for create_DAG\n",
- "# it is important that everything starts from main dir\n",
- "main_dir = os.getcwd()\n",
- "os.chdir(main_dir)\n",
- "state_dict = {'0min': 3, '1min': 3, '2min': 1}\n",
- "create_data_and_copy_files(state_dict, custom_source_dir1=main_dir, custom_source_dir2='../../../modeling/Input_Information/gen_FCS/', custom_source_dir3='../../../modeling/Snapshots/Snapshots_Modeling/')\n",
- "\n",
- "# then trajectory model is created based on the all copied data\n",
- "expected_subcomplexes = ['A', 'B', 'C']\n",
- "exp_comp = {'A': 'exp_compA.csv', 'B': 'exp_compB.csv', 'C': 'exp_compC.csv'}\n",
- "input = './data/'\n",
- "output = \"../output/\""
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7a4053cf-f2a3-4836-89f3-a8b2098109f2",
- "metadata": {},
- "source": [
- "Next, we compute the spatiotemporal model. The inputs we included are:\n",
- "- state_dict (dict): a dictionary that defines the spatiotemporal model. Keys are strings for each time point in the spatiotemporal process and values are integers corresponding to the number of snapshot models computed at that time point\n",
- "- out_pdf (bool): whether to write the probability distribution function (pdf).\n",
- "- npaths (int): Number of states two write to a file (path*.txt).\n",
- "- input_dir (str): directory with the input information.\n",
- "- scorestr (str): final characters at the end of the score files.\n",
- "- output_dir (str): directory to which model will be written. Will be created if it does not exist.\n",
- "- spatio_temporal_rule (bool): whether to include our transition scoring term, which enforces that all proteins in the first snapshot model are included in the second snapshot model.\n",
- "- expected_subcomplexes (list): list of string objects, which is the subcomplexes to look when enforcing the spatiotemporal rule. Strings should be substrings of those in `{state}_{time}.config` files.\n",
- "- score_comp (bool): whether to score the composition of each snapshot model.\n",
- "- exp_comp_map (dictionary): key is a string with the name of each protein that will undergo composition scoring, value is the `.csv` file with the copy number data for that protein.\n",
- "- draw_dag (bool): whether to write out an image of the directed acyclic graph."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "819bb205-fa52-42f9-a4ab-d2f7c3cff0ad",
- "metadata": {},
- "outputs": [],
- "source": [
- "nodes, graph, graph_prob, graph_scores = spatiotemporal.create_DAG(state_dict, out_pdf=True, npaths=3,\n",
- " input_dir=input, scorestr='_scores.log',\n",
- " output_dir=output, spatio_temporal_rule=True,\n",
- " expected_subcomplexes=expected_subcomplexes,\n",
- " score_comp=True, exp_comp_map=exp_comp,\n",
- " draw_dag=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7184fcec-05b9-4348-a9b8-d3673db00fe5",
- "metadata": {},
- "source": [
- "After running `spatiotemporal.create_DAG`, a variety of outputs are written:\n",
- "- `cdf.txt`: the cumulative distribution function for the set of trajectory models.\n",
- "- `pdf.txt`: the probability distribution function for the set of trajectory models.\n",
- "- `labeled_pdf.txt`: Each row has 2 columns and represents a different trajectory model. The first column labels a single trajectory model as a series of snapshot models, where each snapshot model is written as `{state}_{time}|` in sequential order. The second column is the probability distribution function corresponding to that trajectory model.\n",
- "- `dag_heatmap.eps` and `dag_heatmap`: image of the directed acyclic graph from the set of models.\n",
- "- `path*.txt`: files where each row includes a `{state}_{time}` string, so that rows correspond to the states visited over that trajectory model. Files are numbered from the most likely path to the least likely path.\n",
- "\n",
- "Now that we have a trajectory model, we can plot the directed acyclic graph (left) and the series of centroid models from each snapshot model along the most likely trajectory model (right). Each row corresponds to a different time point in the assembly process (0 min, 1 min, and 2 min). Each node is shaded according to its weight in the final model ($W(X_{N,t}N_{t})$). Proteins are colored as A - blue, B - orange, and C - purple.\n",
- "\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fb4c016e-e5ba-4e3b-a76e-499750abb782",
- "metadata": {},
- "source": [
- "\\image html Spatiotemporal_Model.png width=600px\n",
- "\n",
- "# Trajectory modeling step 3: assessment\n",
- "\n",
- "Now that the set of spatiotemporal models has been constructed, we must evaluate these models. We can evaluate these models in at least 4 ways: estimate the sampling precision, compare the model to data used to construct it, validate the model against data not used to construct it, and quantify the precision of the model.\n",
- "\n",
- "## Sampling precision\n",
- "\n",
- "To begin, we calculate the sampling precision of the models. The sampling precision is calculated by using `spatiotemporal.create_DAG` to reconstruct the set of trajectory models using 2 independent sets of samplings for snapshot models. Then, the overlap between these snapshot models is evaluated using `analysis.temporal_precision`, which takes in two `labeled_pdf` files.\n",
- "\n",
- "The temporal precision can take values between 1.0 and 0.0, and indicates the overlap between the two models in trajectory space. Hence, values close to 1.0 indicate a high sampling precision, while values close to 0.0 indicate a low sampling precision. Here, the value close to 1.0 indicates that sampling does not affect the weights of the trajectory models.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "f1436e33-81b6-400c-bf9c-7b667455265a",
- "metadata": {},
- "outputs": [],
- "source": [
- "## 1 - calculation of temporal precision\n",
- "\n",
- "# 1 - copy_files_for_data (copy all relevant files into 'data' directory)\n",
- "def copy_files_for_data(state_dict, custom_source_dir1 = None, custom_source_dir2 = None, custom_source_dir3 = None):\n",
- " \"\"\"\n",
- " Copies three types of files important to generate trajectory models:\n",
- " -.config files created with start_sim.py in Snapshot_Modeling (source_dir1)\n",
- " -time-dependent stoichiometry data for each timepoint. Data should be presented in .csv file. With this function all\n",
- " csv file in source_dir2 will be copied. These .csv files will be used in the exp_comp dictionary in create_DAG\n",
- " function\n",
- " -scoresA and scoresB for each snapshot created with imp sampcon exhaust\n",
- " (source_dir1 + snapshot + good_scoring_models) are merged into total score .txt using merge_scores helper function.\n",
- " All copied files are gathered in newly created './data/' directory, where everything is prepared for create_DAG\n",
- " function.\n",
- "\n",
- "\n",
- " :param state_dict (dict): state_dict: dictionary that defines the spatiotemporal model.\n",
- " The keys are strings that correspond to each time point in the\n",
- " stepwise temporal process. Keys should be ordered according to the\n",
- " steps in the spatiotemporal process. The values are integers that\n",
- " correspond to the number of possible states at that timepoint.\n",
- " :param custom_source_dir1 (optional - str): Custom path to heterogeneity modeling dir (heterogeneity_modeling.py),\n",
- " to copy .config files\n",
- " :param custom_source_dir2 (optional - str): Custom path to stoichiometry data dir\n",
- " :param custom_source_dir3 (optional - str): Custom path to snapshot modeling dir (start_sim.py), to copy .config\n",
- " files and to access scoresA/scoresB (custom_source_dir3 + snapshot{state}_{time} + 'good_scoring_models')\n",
- " \"\"\"\n",
- " # Create the destination directory for all the data copied in this function\n",
- " destination_dir = './data/'\n",
- " os.makedirs(destination_dir, exist_ok=True)\n",
- "\n",
- " # path to snapshot modeling dir\n",
- " if custom_source_dir1:\n",
- " source_dir1 = custom_source_dir1\n",
- " else:\n",
- " source_dir1 = '../../Heterogeneity/Heterogeneity_Modeling/'\n",
- "\n",
- " # path to stoichiometry data dir\n",
- " if custom_source_dir2:\n",
- " source_dir2 = custom_source_dir1\n",
- " else:\n",
- " source_dir2 = '../../Input_Information/gen_FCS/'\n",
- "\n",
- " # path to snapshot modeling dir\n",
- " if custom_source_dir3:\n",
- " source_dir3 = custom_source_dir3\n",
- " else:\n",
- " source_dir3 = '../../Snapshots/Snapshots_Modeling/'\n",
- "\n",
- " # Copy all .config files from the first source directory to the destination directory\n",
- " try:\n",
- " for file_name in os.listdir(source_dir1):\n",
- " if file_name.endswith('.config'):\n",
- " full_file_name = os.path.join(source_dir1, file_name)\n",
- " if os.path.isfile(full_file_name):\n",
- " shutil.copy(full_file_name, destination_dir)\n",
- " print(\".config files are copied\")\n",
- " except Exception as e:\n",
- " print(f\".config files cannot be copied. Try do do it manually. Reason for Error: {e}\")\n",
- "\n",
- " # Copy all .csv stoichiometry files from the second source directory to the destination directory\n",
- " try:\n",
- " for file_name in os.listdir(source_dir2):\n",
- " if file_name.endswith('.csv'):\n",
- " full_file_name = os.path.join(source_dir2, file_name)\n",
- " if os.path.isfile(full_file_name):\n",
- " shutil.copy(full_file_name, destination_dir)\n",
- " print(\".csv stoichiometry files are copied\")\n",
- " except Exception as e:\n",
- " print(f\".csv stoichiometry files cannot be copied. Try do do it manually. Reason for Error: {e}\")\n",
- "\n",
- " # Copy scoresA and scoresB from the snapshot_{state}_{time} directories and first source directory path\n",
- " try:\n",
- " for time in state_dict.keys():\n",
- " for state in range(1, state_dict[time] + 1):\n",
- " snapshot_dir = os.path.join(source_dir3, f'snapshot{state}_{time}')\n",
- " good_scoring_models_dir = os.path.join(snapshot_dir, 'good_scoring_models')\n",
- " if os.path.isdir(good_scoring_models_dir):\n",
- " for score_file in ['scoresA.txt', 'scoresB.txt']:\n",
- " full_file_name = os.path.join(good_scoring_models_dir, score_file)\n",
- " if os.path.isfile(full_file_name):\n",
- " new_file_name = f'{state}_{time}_{os.path.splitext(score_file)[0]}.log'\n",
- " shutil.copy(full_file_name, os.path.join(destination_dir, new_file_name))\n",
- " print(f\"Copied {full_file_name} to {os.path.join(destination_dir, new_file_name)}\")\n",
- " except Exception as e:\n",
- " print(f\"scoresA.txt and scoresB.txt cannot be copied. Try do do it manually. Reason for Error: {e}\")\n",
- "\n",
- "# copy all the relevant files\n",
- "copy_files_for_data(state_dict, custom_source_dir1='../../../modeling/Heterogeneity/Heterogeneity_Modeling/',\n",
- " custom_source_dir2='../../../modeling/Input_Information/gen_FCS/',\n",
- " custom_source_dir3='../../../modeling/Snapshots/Snapshots_Modeling/')\n",
- "\n",
- "# create two independent DAGs\n",
- "expected_subcomplexes = ['A', 'B', 'C']\n",
- "exp_comp = {'A': 'exp_compA.csv', 'B': 'exp_compB.csv', 'C': 'exp_compC.csv'}\n",
- "input = \"./data/\"\n",
- "outputA = \"../output_modelA/\"\n",
- "outputB = \"../output_modelB/\"\n",
- "\n",
- "# Output from sampling precision and model precision to be saved in united dir: analysis_output_precision\n",
- "analysis_output = \"./analysis_output_precision/\"\n",
- "os.makedirs(analysis_output, exist_ok=True)\n",
- "\n",
- "nodesA, graphA, graph_probA, graph_scoresA = spatiotemporal.create_DAG(state_dict, out_pdf=True, npaths=3,\n",
- " input_dir=input, scorestr='_scoresA.log',\n",
- " output_dir=outputA,\n",
- " spatio_temporal_rule=False,\n",
- " expected_subcomplexes=expected_subcomplexes,\n",
- " score_comp=True, exp_comp_map=exp_comp,\n",
- " draw_dag=False)\n",
- "\n",
- "os.chdir(main_dir)\n",
- "nodesB, graphB, graph_probB, graph_scoresB = spatiotemporal.create_DAG(state_dict, out_pdf=True, npaths=3,\n",
- " input_dir=input, scorestr='_scoresB.log',\n",
- " output_dir=outputB,\n",
- " spatio_temporal_rule=False,\n",
- " expected_subcomplexes=expected_subcomplexes,\n",
- " score_comp=True, exp_comp_map=exp_comp,\n",
- " draw_dag=False)\n",
- "\n",
- "## 1 - analysis\n",
- "analysis.temporal_precision(outputA + 'labeled_pdf.txt', outputB + 'labeled_pdf.txt',\n",
- " output_fn='.' + analysis_output + 'temporal_precision.txt')\n",
- "os.chdir(main_dir) # it is crucial that after each step, directory is changed back to main\n",
- "print(\"Step 1: calculation of temporal precision IS COMPLETED\")\n",
- "print(\"\")\n",
- "print(\"\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6ce48a52-b06c-4e09-80a5-debe0fd2cba2",
- "metadata": {},
- "source": [
- "## Model precision\n",
- "\n",
- "Next, we calculate the precision of the model, using `analysis.precision`. Here, the model precision calculates the number of trajectory models with high weights. The precision ranges from 1.0 to 1/d, where d is the number of trajectory models. Values approaching 1.0 indicate the model set can be described by a single trajectory model, while values close to 1/d indicate that all trajectory models have similar weights.\n",
- "\n",
- "The `analysis.precision` function reads in the `labeled_pdf` of the complete model, and calculates the precision of the model. The value close to 1.0 indicates that the set of models can be sufficiently represented by a single trajectory model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d3669407-4ddc-4e1f-b8ba-c8638024cd56",
- "metadata": {},
- "outputs": [],
- "source": [
- "## 2 - calculation of precision of the model\n",
- "\n",
- "# precision is calculated from .labeled_pdf.txt in Trajectories_Modeling dir\n",
- "trajectories_modeling_input_dir = \"./output/\"\n",
- "\n",
- "analysis.precision(trajectories_modeling_input_dir + 'labeled_pdf.txt', output_fn=analysis_output + 'precision.txt')\n",
- "\n",
- "os.chdir(main_dir) # it is crucial that after each step, directory is changed back to main\n",
- "print(\"Step 2: calculation of precision of the model IS COMPLETED\")\n",
- "print(\"\")\n",
- "print(\"\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8a44c8b5-93d8-4ab8-b8cc-d6a70168c0eb",
- "metadata": {},
- "source": [
- "## Comparison against data used in model construction\n",
- "\n",
- "We then evaluate the model against data used in model construction. First, we can calculate the cross-correlation between the original EM map and the forward density projected from each snapshot model. This calculation is too computationally expensive for this notebook, but can be found in `modeling/Trajectories/Trajectories_Assessment`, where we wrote the `ccEM` function to perform this comparison for all snapshot models.\n",
- "\n",
- "\\code{.py}\n",
- "## 3a - comparison of the model to data used in modeling (EM)",
- "exp_mrc_base_path = \"../../Input_Information/ET_data/add_noise\"\n",
- "ccEM(exp_mrc_base_path)\n",
- "print(\"Step 3a: ET validation IS COMPLETED\")\n",
- "print(\"\")\n",
- "print(\"\")\n",
- "\\endcode\n",
- "\n",
- "The results of this comparison are shown below."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "70a6e0a0-98d5-4b63-a839-221a4fdc493b",
- "metadata": {},
- "source": [
- "After comparing the model to EM data, we aimed to compare the model to copy number data, and wrote the `forward_model_copy_number` function to evaluate the copy numbers from our set of trajectory models. The output of `forward_model_copy_number` is written in `forward_model_copy_number/`. The folder contains `CN_prot_{prot}.txt` files for each protein, which have the mean and standard deviation of protein copy number at each time point. We can then plot these copy numbers from the forward models against those from the experiment, as shown below."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "f3f7c451-2b04-4e8b-b39c-16c493e8a1a6",
- "metadata": {},
- "outputs": [],
- "source": [
- "def read_labeled_pdf(pdf_file):\n",
- " \"\"\"\n",
- " Function to read in a labeled probability distribution file output by spatiotemporal.create_DAG.\n",
- " Used to determine protein copy numbers by forward_model_copy_number.\n",
- " :param pdf_file (str): sting for the path of the labeled probability distribution file output by\n",
- " spatiotemporal.create_DAG.\n",
- " :return prob_dict (dict): dictionary defining the spatiotemporal model. Each key is a state, and each value is the\n",
- " probability of that state.\n",
- " \"\"\"\n",
- " # create blank dictonary to store the results\n",
- " prob_dict = {}\n",
- " # read in labeled pdf file\n",
- " old = open(pdf_file, 'r')\n",
- " line = old.readline()\n",
- " # store the path through various nodes, as well as the probability of that path\n",
- " while line:\n",
- " line_split = line.split()\n",
- " # assumes the first string is the trajectory string, the second string is the probability\n",
- " if len(line_split) > 1:\n",
- " # use # for comments\n",
- " if line_split[0]=='#':\n",
- " pass\n",
- " else:\n",
- " trj = line_split[0]\n",
- " prob = float(line_split[1])\n",
- " # store in dictionary\n",
- " prob_dict[trj] = prob\n",
- " line = old.readline()\n",
- " old.close()\n",
- " return prob_dict\n",
- "\n",
- "def copy_number_from_state(prot_list,trj,custom_data_folder = None):\n",
- " \"\"\"\n",
- " For a trajectory, returns an array of protein copy numbers as a function of time. Used by\n",
- " forward_model_copy_number().\n",
- " :param prot_list (list): list of proteins in the model. These proteins are searched for in each config file.\n",
- " :param trj (str): string defining a single trajectory.\n",
- " :param custom_data_folder (str, optional): path to custom data folder. Defaults to None, which points to '../data/'\n",
- " :return _prots (array): 2D array of protein copy numbers. The first index loops over the time,\n",
- " while the second index value loops over the protein (ordered as A, B, C).\n",
- " :return N (int): Number of time points in each trajectory.\n",
- " \"\"\"\n",
- " # find folder with config files\n",
- " if custom_data_folder:\n",
- " data_folder = custom_data_folder\n",
- " else:\n",
- " data_folder = 'data/'\n",
- "\n",
- " # split the trajectory into a list of individual states\n",
- " state_list=trj.split('|')\n",
- " state_list=state_list[:-1]\n",
- "\n",
- " N = len(state_list)\n",
- " # Map from index to protein: 0 - A, 1- B, 2- C\n",
- " _prots = np.zeros((N, len(prot_list)))\n",
- "\n",
- " # Grab _prots from .config file\n",
- " for i in range(0, N):\n",
- " prot_file = data_folder + state_list[i] + '.config'\n",
- " to_read = open(prot_file, 'r')\n",
- " line = to_read.readline()\n",
- " while line:\n",
- " # for each line, check if the protein is in that line\n",
- " for prot_index in range(len(prot_list)):\n",
- " if prot_list[prot_index] in line:\n",
- " _prots[i, prot_index] += 1\n",
- " line = to_read.readline()\n",
- "\n",
- " return _prots,N\n",
- "\n",
- "def forward_model_copy_number(prot_list,custom_labeled_pdf=None):\n",
- " \"\"\"\n",
- " Code to perform copy number analysis on each protein in the model. Writes output files where each row is ordered\n",
- " according to the time point in the model and the first column is the mean copy number, while the second column is\n",
- " the standard deviation in copy number.\n",
- " :param prot_list (list): list of proteins in the model. These proteins are searched for in each config file.\n",
- " :param custom_labeled_pdf (str, optional): path to custom labeled probability distribution file output by\n",
- " spatiotemporal.create_DAG.\n",
- " \"\"\"\n",
- " # find folder with config files\n",
- " if custom_labeled_pdf:\n",
- " _labeled_pdf = custom_data_folder\n",
- " else:\n",
- " _labeled_pdf = '../Trajectories_Modeling/output/labeled_pdf.txt'\n",
- "\n",
- " # Read in labeled_pdf file into a dictionary. Each trajectory is listed as a dictionary,\n",
- " # with keys as the trajectory and the values as the probability of that trajectory\n",
- " prob_dict = read_labeled_pdf(_labeled_pdf)\n",
- "\n",
- " # Loop over the full dictionary. Create a list with 2 values:\n",
- " # 1) the probability of the state, 2) the protein copy number of that state.\n",
- " key_list = prob_dict.keys()\n",
- " prot_prob = []\n",
- " for key in key_list:\n",
- " CN,N_times = copy_number_from_state(prot_list,key)\n",
- " prot_prob.append([prob_dict[key], CN])\n",
- "\n",
- " # Construct the full path to the output directory\n",
- " dir_name = \"forward_model_copy_number\"\n",
- " full_path = os.path.join(main_dir, dir_name)\n",
- " os.makedirs(full_path, exist_ok=True)\n",
- " os.chdir(full_path)\n",
- "\n",
- " # Determine copy number from the prot_prob\n",
- " for index in range(len(prot_prob[0][1][0])):\n",
- " copy_number = np.zeros((N_times, 2))\n",
- " # calculate mean\n",
- " for state in prot_prob:\n",
- " for i in range(N_times):\n",
- " copy_number[i, 0] += state[0] * state[1][i][index]\n",
- " # calculate std deviation\n",
- " for state in prot_prob:\n",
- " for i in range(N_times):\n",
- " # Calculate variance\n",
- " copy_number[i, 1] += state[0] * ((state[1][i][index] - copy_number[i, 0]) ** 2)\n",
- " # Take square root to get the standard deviation\n",
- " copy_number[:, 1] = np.sqrt(copy_number[:, 1])\n",
- " # save to file\n",
- " np.savetxt('CN_prot_'+prot_list[index]+'.txt', copy_number, header='mean CN\\tstd CN')\n",
- "\n",
- "# 3b - comparison of the model to data used in modeling (copy number)\n",
- "os.chdir(main_dir) # it is crucial that after each step, directory is changed back to main\n",
- "forward_model_copy_number(expected_subcomplexes)\n",
- "print(\"Step 3b: copy number validation IS COMPLETED\")\n",
- "print(\"\")\n",
- "print(\"\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "de81dea5-9d98-4424-ba64-d7a2def893b9",
- "metadata": {},
- "source": [
- "Here, we plot the comparison between the experimental data used in model construction and the set of trajectory models. This analysis includes the cross-correlation coefficient between the experimental EM density and the forward density of the set of sufficiently good scoring modeled structures in the highest weighted trajectory model (a), as well as comparisons between experimental and modeled protein copy numbers for proteins A (b), B (c), and C (d). Here, we see the model is in good agreement with the data used to construct it.\n",
- "\n",
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "id": "305b6e15-1147-4706-bb26-44125961f9fb",
- "metadata": {},
- "source": [
- "## Validation against data not used in model construction\n",
- "\n",
- "Finally, we aim to compare the model to data not used in model construction. Specifically, we reserved SAXS data for model validation. We aimed to compare the forward scattering profile from the centroid structural model of each snapshot model to the experimental profile. To make this comparison, we wrote functions that converted each centroid RMF to a PDB (`convert_rmfs`), copied the experimental SAXS profiles to the appropriate folder (`copy_SAXS_dat_files`), and ran [FoXS](https://integrativemodeling.org/tutorials/foxs/foxs.html) on each PDB to evaluate its agreement to the experimental profile (`process_foxs`)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d5cc728a-6159-4317-94eb-525dccc780aa",
- "metadata": {},
- "outputs": [],
- "source": [
- "# 4a - SAXS\n",
- "\"\"\"\n",
- "Comparing center models of the most dominant cluster for each snapshot (rmfs) to the SAXS data for each time point\n",
- " can be done in two steps:\n",
- "-converting rmfs to pdb files\n",
- "-comparing pdbs of each snapshot to experimental SAXS profile using FoXS\n",
- "\"\"\"\n",
- "\n",
- "def convert_rmfs(state_dict, model, custom_path=None):\n",
- " \"\"\"\n",
- " The purpose of this function is to automate the conversion of RMF files into PDB files for all the states from\n",
- " state_dict. Created PDBs are further used in comparison of SAXS profiles using FoXS. Additionally, they can be\n",
- " used for comparison to native PDB if available.\n",
- "\n",
- " :param state_dict (dict): dictionary that defines the spatiotemporal model.\n",
- " The keys are strings that correspond to each time point in the\n",
- " stepwise temporal process. Keys should be ordered according to the\n",
- " steps in the spatiotemporal process. The values are integers that\n",
- " correspond to the number of possible states at that timepoint.\n",
- " :param model (str): An IMP (Integrative Modeling Platform) model object.\n",
- " :param custom_path (optional - str): A custom path for the RMF file, allowing for flexibility in file location\n",
- " (should be compliant with stat_dict).\n",
- " \"\"\"\n",
- "\n",
- " for time in state_dict.keys():\n",
- " for state in range(1, state_dict[time] + 1):\n",
- " if custom_path:\n",
- " sim_rmf = custom_path # option for custom path\n",
- " else:\n",
- " sim_rmf = f\"../../../../modeling/Snapshots/Snapshots_Assessment/exhaust_{state}_{time}/cluster.0/cluster_center_model.rmf3\"\n",
- "\n",
- " pdb_output = f\"snapshot{state}_{time}.pdb\" # define the output of converted .pdb file\n",
- "\n",
- " if os.path.exists(sim_rmf):\n",
- " try:\n",
- " rmf_fh = RMF.open_rmf_file_read_only(sim_rmf) # open rmf file for reading\n",
- " rmf_hierarchy = IMP.rmf.create_hierarchies(rmf_fh, model)[0] # extract 1st hierarchy\n",
- " IMP.atom.write_pdb_of_c_alphas(rmf_hierarchy, pdb_output) # write coordinates of CA to .pdb\n",
- " print(f\"Finishing: snapshot{state}_{time}.pdb\")\n",
- " except Exception as e:\n",
- " print(f\"{sim_rmf} is empty or there is another problem: {e}\")\n",
- "\n",
- "\n",
- "def copy_SAXS_dat_files(custom_src_dir = None):\n",
- " \"\"\"\n",
- " Copies all files ending with .dat from the specified directory to the current directory.\n",
- "\n",
- " :param custom_src_dir (optional - str): Path to the source directory\n",
- " \"\"\"\n",
- " if custom_src_dir:\n",
- " src_dir = custom_src_dir\n",
- " else:\n",
- " src_dir = '../../../Input_Information/gen_SAXS'\n",
- " try:\n",
- " files = os.listdir(src_dir) # Get the list of all files in the src_dir directory\n",
- " dat_files = [f for f in files if f.endswith('.dat')] # Filter out files that end with .dat\n",
- "\n",
- " # Copy each .dat file to the current directory, so FoXS can be used\n",
- " for file_name in dat_files:\n",
- " full_file_name = os.path.join(src_dir, file_name)\n",
- " if os.path.isfile(full_file_name):\n",
- " shutil.copy(full_file_name, os.getcwd())\n",
- " # print(f\"Copied: {full_file_name} to {main_dir}\")\n",
- "\n",
- " print(\"All .dat files have been copied successfully...\")\n",
- "\n",
- " except Exception as e:\n",
- " print(f\"An error occurred: {e}\")\n",
- "\n",
- "\n",
- "def process_foxs(state_dict, custom_dat_file = None):\n",
- " \"\"\"\n",
- " This function automates the FoXS analysis for all specified time points in a single execution. It processes PDB\n",
- " files generated by the convert_rmfs function and uses SAXS data copied with the copy_SAXS function. All of this\n",
- " data should be present in the current running directory.\n",
- " FoXS tutorial is available here: https://integrativemodeling.org/tutorials/foxs/foxs.html\n",
- "\n",
- " :param state_dict (dict): dictionary that defines the spatiotemporal model.\n",
- " The keys are strings that correspond to each time point in the\n",
- " stepwise temporal process. Keys should be ordered according to the\n",
- " steps in the spatiotemporal process. The values are integers that\n",
- " correspond to the number of possible states at that timepoint.\n",
- " :param custom_dat_file (optional - str)): A custom name of SAXS files for each time point (should be\n",
- " compliant with stat_dict)\n",
- " \"\"\"\n",
- "\n",
- "\n",
- " print(\"...lets proceed to FoXS\")\n",
- "\n",
- " for time in state_dict.keys():\n",
- " try:\n",
- " if state_dict[time] > 1:\n",
- " # if there is more than one state in timepoint, FoXS creates fit.plt and it should be renamed\n",
- " if custom_dat_file:\n",
- " dat_file = custom_dat_file\n",
- " else:\n",
- " dat_file = f\"{time}_exp.dat\"\n",
- "\n",
- " pdb_files = \" \".join([f\"snapshot{state}_{time}.pdb\" for state in range(1, state_dict[time] + 1)])\n",
- "\n",
- " command1 = f\"foxs -r -g {pdb_files} {dat_file}\"\n",
- " # example how FoXS command should look like: foxs -r -g snapshot1_0min.pdb snapshot2_0min.pdb snapshot3_0min.pdb 0min_exp.dat\n",
- " os.system(command1)\n",
- " print(f\"FoXS for {time} is calculated and ready to create a plot. Nr of states is: {state_dict[time]}\")\n",
- "\n",
- " command2 = f\"gnuplot fit.plt\" # create plot from .plt code\n",
- " os.system(command2)\n",
- "\n",
- " command3 = f\"mv fit.plt {time}_FoXS.plt\" # rename .plt to avoid to be overwritten\n",
- " os.system(command3)\n",
- "\n",
- " command4 = f\"mv fit.png {time}_FoXS.png\" # rename plot to avoid to be overwritten\n",
- " os.system(command4)\n",
- "\n",
- " print(f\"Plot {time}_FoXS.png is created\")\n",
- "\n",
- " elif state_dict[time] == 1:\n",
- " print(f\"There is only one state in {time}\")\n",
- " dat_file1 = f\"{time}_exp.dat\"\n",
- " pdb_file1 = f\"snapshot1_{time}.pdb\"\n",
- "\n",
- " command5 = f\"foxs -r -g {pdb_file1} {dat_file1}\"\n",
- " os.system(command5)\n",
- " print(f\"FoXS for {time} is calculated and ready to create a plot. Nr of states is: {state_dict[time]}\")\n",
- "\n",
- " command6 = f\"gnuplot snapshot1_{time}_{time}_exp.plt\"\n",
- " os.system(command6)\n",
- "\n",
- " command7 = f\"mv snapshot1_{time}_{time}_exp.plt {time}_FoXS.plt\"\n",
- " os.system(command7)\n",
- "\n",
- " command8 = f\"mv snapshot1_{time}_{time}_exp.png {time}_FoXS.png\"\n",
- " os.system(command8)\n",
- " else:\n",
- " print(f\"There is no states in this timepoint. Check stat_dict.\")\n",
- "\n",
- " except Exception as e:\n",
- " print(f\"FoXS can not be executed properly due to following problem: {e}\")\n",
- "\n",
- "\n",
- "# 4a - SAXS\n",
- "os.chdir(main_dir) # it is crucial that after each step, directory is changed back to main\n",
- "SAXS_output = \"./SAXS_comparison/\"\n",
- "os.makedirs(SAXS_output, exist_ok=True)\n",
- "os.chdir(SAXS_output)\n",
- "convert_rmfs(state_dict, model)\n",
- "copy_SAXS_dat_files(custom_src_dir='../../../../modeling/Input_Information/gen_SAXS')\n",
- "process_foxs(state_dict)\n",
- "print(\"Step 4a: SAXS validation IS COMPLETED\")\n",
- "print(\"\")\n",
- "print(\"\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "bef8310b-99a2-4e71-b4b7-88c0dbb06def",
- "metadata": {},
- "source": [
- "The output of this analysis is written to `SAXS_comparison`. Standard FoXS outputs are available for each snapshot model (`snapshot{state}_{time}.*`). In particular, the `.fit` files include the forward and experimental profiles side by side, with the $\\chi^2$ for the fit. Further, the `{time}_FoXS.*` files include the information for all snapshot models at that time point, including plots of each profile in comparison to the experimental profile (`{time}_FoXS.png`)."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d4ebc14c-1eba-45ff-b98e-2938e65057a4",
- "metadata": {},
- "source": [
- "As our model was generated from synthetic data, the ground truth structure is known at each time point. In addition to validating the model by assessing its comparison to SAXS data, we could approximate the model accuracy by comparing the snapshot model to the PDB structure, although this comparison is not perfect as the PDB structure was used to inform the structure of *rigid bodies* in the snapshot model. To do so, we wrote a function (`RMSD`) that calculates the RMSD between each structural model and the orignal PDB. The function is too computationally expensive to run in this notebook, but is found in the `Trajectories/Trajectories_Assessment/` folder and is demonstrated below.\n",
- "\n",
- "\\code{.py}\n",
- "## 4b - RMSD",
- "os.chdir(main_dir) # it is crucial that after each step, directory is changed back to main\n",
- "pdb_path = \"../../Input_Information/PDB/3rpg.pdb\"\n",
- "RMSD(pdb_path=pdb_path, custom_n_plot=20)\n",
- "print(\"Step 4b: RMSD validation IS COMPLETED\")\n",
- "print(\"\")\n",
- "print(\"\")\n",
- "\\endcode\n",
- "\n",
- "The output of this function is written in `RMSD_calculation_output`. The function outputs `rmsd_{state}_{time}.png` files, which plots the RMSD for each structural model within each snapshot model. This data is then summarized in `RMSD_analysis.txt`, which includes the minimum RMSD, average RMSD, and number of structural models in each snapshot model.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "64e1a2f6-7258-462d-bde3-7734743b5aa1",
- "metadata": {},
- "source": [
- "Finally, we plot the results for assessing the spatiotemporal model with data not used to construct it. Comparisons are made between the centroid structure of the most populated cluster in each snapshot model at each time point and the experimental SAXS profile for 0 (a), 1 (b), and 2 (c) minutes. Further, we plot both the sampling precision (dark red) and the RMSD to the PDB structure (light red) for each snapshot model in the highest trajectory model (d).\n",
- "\n",
- "\n",
- "\n",
- "To quantitatively compare the model to SAXS data, we used the $\\chi^2$ to compare each snapshot model to the experimental profile. We note that the $\\chi^2$ are substantially lower for the models along the highest trajectory model (1_0min, 1_1min, and 1_2min) than for other models, indicating that the highest weighted trajectory model is in better agreement with the experimental SAXS data than other possible trajectory models.\n",
- "\n",
- "\n",
- "\n",
- "Next, we can evaluate the accuracy of the model by comparing the RMSD to the PDB to the sampling precision of each snapshot model. We note that this is generally not possible, because in most biological applications the ground truth is not known. In this case, if the average RMSD to the PDB structure is smaller than the sampling precision, the PDB structure lies within the precision of the model. We find that the RMSD is within 1.5 \u00c5 of the sampling precision at all time points, indicating that the model lies within 1.5 \u00c5 of the ground truth.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "98f7a12c-d7e9-49ed-9a16-b713c3b4762a",
- "metadata": {},
- "source": [
- "# Next steps\n",
- "\n",
- "After assessing our model, we can must decide if the model is sufficient to answer biological questions of interest. If the model does not have sufficient precision for the desired application, assessment of the current model can be used to inform which new experiments may help improve the next iteration of the model. The [integrative spatiotemporal modeling procedure](https://integrativemodeling.org/tutorials/spatiotemporal/index.html#steps) can then be repeated iteratively, analogous to [integrative modeling of static structures](https://integrativemodeling.org/2.21.0/doc/manual/intro.html#procedure).\n",
- "\n",
- "If the model is sufficient to provide insight into the biological process of interest, the user may decide that it is ready for publication. In this case, the user should create an [mmCIF file](https://mmcif.wwpdb.org/) to deposit the model in the [PDB-dev database](https://pdb-dev.wwpdb.org/). This procedure is explained in the [deposition tutorial](https://integrativemodeling.org/tutorials/deposition/develop/).\n"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.11.7"
- }
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "4a3dfff8-9d60-418c-a5fd-a4adf7a121b1",
+ "metadata": {
+ "jp-MarkdownHeadingCollapsed": true
+ },
+ "source": [
+ "IMP spatiotemporal tutorial\n",
+ "========\n",
+ "\n",
+ "# Introduction\n",
+ "\n",
+ "Biomolecules are constantly in motion; therefore, a complete depiction of their function must include their dynamics instead of just static structures. We have developed an integrative spatiotemporal approach to model dynamic systems.\n",
+ "\n",
+ "Our approach applies a composite workflow, consisting of three modeling problems to compute (i) heterogeneity models, (ii) snapshot models, and (iii) a trajectory model.\n",
+ "Heterogeneity models describe the possible biomolecular compositions of the system at each time point. Optionally, other auxiliary variables can be considered, such as the coarse location in the final state when modeling an assembly process.\n",
+ "For each heterogeneity model, one snapshot model is produced. A snapshot model is a set of alternative standard static integrative structure models based on the information available for the corresponding time point.\n",
+ "Then, a set of trajectories ranked by their agreement with input information is computed by connecting alternative snapshot models at adjacent time points (*i.e.*, the “trajectory model”). This trajectory model can be scored based on both the scores of static structures and the transitions between them, allowing for the creation of trajectories that are in agreement with the input information by construction.\n",
+ "\n",
+ "If you use this tutorial or its accompanying method, please site the corresponding publications:\n",
+ "\n",
+ "- Latham, A.P.; Tempkin, J.O.B.; Otsuka, S.; Zhang, W.; Ellenberg, J.; Sali, A. bioRxiv, 2024, https://doi.org/10.1101/2024.08.06.606842.\n",
+ "- Latham, A.P.; Rožič, M.; Webb, B.M., Sali, A. in preparation. (tutorial)\n",
+ "\n",
+ "# Integrative spatiotemporal modeling workflow\n",
+ "\n",
+ "In general, integrative modeling proceeds through three steps (i. gathering information; ii. choosing the model representation, scoring alternative models, and searching for good scoring models; and iii. assessing the models). In integrative spatiotemporal modeling, these three steps are repeated for each modeling problem in the composite workflow (i. modeling of heterogeneity, ii. modeling of snapshots, and iii. modeling of a trajectory).\n",
+ "\n",
+ "\n",
+ "\n",
+ "This tutorial will walk you through the creation of a spatiotemporal model for the hypothetical assembly mechanism of the Bmi1/Ring1b-UbcH5c complex. We note that all experimental data besides the static structure used in this study is purely hypothetical, and, thus, the model should not be interpreted to be meaningful about the actual assembly mechanism of the complex.\n",
+ "\n",
+ "Finally, this notebook is intended to present an abbreviated version of this protocol, with the computationally expensive steps excluded. A more complete version of this tutorial can be found as a series of python scripts at https://github.com/salilab/imp_spatiotemporal_tutorial.\n",
+ "\n"
+ ]
},
- "nbformat": 4,
- "nbformat_minor": 5
-}
\ No newline at end of file
+ {
+ "cell_type": "markdown",
+ "id": "dc9d93b4-3b97-4187-867c-977f8353f4aa",
+ "metadata": {},
+ "source": [
+ "Modeling of heterogeneity\n",
+ "====================================\n",
+ "\n",
+ "Here, we describe the first modeling problem in our composite workflow, how to build models of heterogeneity using IMP. In this tutorial, heterogeneity modeling only includes protein copy number; however, in general, other types of information, such as the coarse location in the final state, could also be included in heterogeneity models.\n",
+ "\n",
+ "# Heterogeneity modeling step 1: gathering of information\n",
+ "\n",
+ "We begin heterogeneity modeling with the first step of integrative modeling, gathering information. Heterogeneity modeling will rely on copy number information about the complex. In this case, we utilize the X-ray crystal structure of the fully assembled Bmi1/Ring1b-UbcH5c complex from the protein data bank (PDB), and synthetically generated protein copy numbers during the assembly process, which could be generated from experiments such as flourescence correlation spectroscopy (FCS).\n",
+ "\n",
+ "\n",
+ "\n",
+ "The PDB structure of the complex informs the final state of our model and constrains the maximum copy number for each protein, while the protein copy number data gives time-dependent information about the protein copy number in the assembling complex.\n",
+ "\n",
+ "# Heterogeneity modeling step 2: representation, scoring function, and search process\n",
+ "\n",
+ "Next, we represent, score and search for heterogeneity models models. A single heterogeneity model is a set of protein copy numbers, scored according to its fit to experimental copy number data at that time point. As ET and SAXS data, are only available at 0 minutes, 1 minute, and 2 minutes, we choose to create heterogeneity models at these three time points. We then use `prepare_protein_library`, to calculate the protein copy numbers for each heterogeneity model and to use the topology file of the full complex (`spatiotemporal_topology.txt`) to generate a topology file for each corresponding snapshot model. The choices made in this topology file are important for the representation, scoring function, and search process for snapshot models, and are discussed later. For heterogeneity modeling, we choose to model 3 protein copy numbers at each time point, and restrict the final time point to have the same protein copy numbers as the PDB structure.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "id": "1ff4d3e5-04de-4092-8cfc-2ad3018675da",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# General imports for the tutorial\n",
+ "import sys, os, glob, shutil\n",
+ "import IMP\n",
+ "import IMP.atom\n",
+ "import RMF\n",
+ "import IMP.rmf\n",
+ "from IMP.spatiotemporal import prepare_protein_library\n",
+ "import IMP.spatiotemporal as spatiotemporal\n",
+ "from IMP.spatiotemporal import analysis\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6b78a015-32e5-4701-8c6f-df14863ba9ce",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "main_dir = os.getcwd()\n",
+ "os.chdir(main_dir)\n",
+ "# parameters for prepare_protein_library:\n",
+ "times = [\"0min\", \"1min\", \"2min\"]\n",
+ "exp_comp = {'A': '../modeling/Input_Information/gen_FCS/exp_compA.csv',\n",
+ " 'B': '../modeling/Input_Information/gen_FCS/exp_compB.csv',\n",
+ " 'C': '../modeling/Input_Information/gen_FCS/exp_compC.csv'}\n",
+ "expected_subcomplexes = ['A', 'B', 'C']\n",
+ "template_topology = '../modeling/Heterogeneity/Heterogeneity_Modeling/spatiotemporal_topology.txt'\n",
+ "template_dict = {'A': ['Ubi-E2-D3'], 'B': ['BMI-1'], 'C': ['E3-ubi-RING2']}\n",
+ "nmodels = 3\n",
+ "\n",
+ "# calling prepare_protein_library\n",
+ "prepare_protein_library.prepare_protein_library(times, exp_comp, expected_subcomplexes, nmodels,\n",
+ " template_topology=template_topology, template_dict=template_dict)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b56fbe48-da12-412e-ac2e-dca673e04a43",
+ "metadata": {},
+ "source": [
+ "From the output of `prepare_protein_library`, we see that there are 3 heterogeneity models at each time point (it is possible to have more heterogeneity models than copy numbers if multiple copies of the protein exist in the complex). For each heterogeneity model, we see 2 files:\n",
+ "- *.config, a file with a list of proteins represented in the heterogeneity model.\n",
+ "- *_topol.txt, a topology file for snapshot modeling corresponding to this heterogeneity model.\n",
+ "\n",
+ "# Heterogeneity modeling step 3: assessment\n",
+ "\n",
+ "Now, we have a variety of heterogeneity models. In general, there are four ways to assess a model: estimate the sampling precision, compare the model to data used to construct it, validate the model against data not used to construct it, and quantify the precision of the model. Here, we will focus specifically on comparing the model to experimental data, as other assessments will be performed later, when the trajectory model is assessed.\n",
+ "\n",
+ "Next, we can plot the modeled and experimental copy numbers simultaneously for each protein, as shown below for proteins A (a), B (b), and C (c).\n",
+ "\n",
+ "\n",
+ "\n",
+ "From these plots, we observe that the range of possible experimental copy numbers are well sampled by the heterogeneity models, indicating that we are prepared for snapshot modeling."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "496f2687-bb05-4f51-8eab-a1170f6ac5fa",
+ "metadata": {},
+ "source": [
+ "Modeling of snapshots\n",
+ "====================================\n",
+ "\n",
+ "Here, we describe the second modeling problem in our composite workflow, how to build static snapshot models using IMP. We note that this process is similar to previous tutorials of [actin](https://integrativemodeling.org/tutorials/actin/) and [RNA PolII](https://integrativemodeling.org/tutorials/rnapolii_stalk/).\n",
+ "\n",
+ "# Snapshot modeling step 1: gathering of information\n",
+ "\n",
+ "We begin snapshot modeling with the first step of integrative modeling, gathering information. Snapshot modeling utilizes structural information about the complex. In this case, we utilize heterogeneity models, the X-ray crystal structure of the fully assembled Bmi1/Ring1b-UbcH5c complex from the protein data bank (PDB), synthetically generated electron tomography (ET) density maps during the assembly process, and physical principles.\n",
+ "\n",
+ "\n",
+ "\n",
+ "The heterogeneity models inform protein copy numbers for the snapshot models. The PDB structure of the complex informs the structure of the individual proteins. The time-dependent ET data informs the size and shape of the assembling complex. Physical principles inform connectivity and excluded volume.\n",
+ "\n",
+ "# Snapshot modeling step 2: representation, scoring function, and search process\n",
+ "\n",
+ "Next, we represent, score and search for snapshot models. This step is quite computationally expensive. Therefore, we will not run the modeling protocol in this notebook, though the scripts are available in `modeling/Snapshots/Snapshots_Modeling/`. Here, we will simply describe the important steps made by two scripts. The first, `static_snapshot.py`, uses IMP to represent, score, and search for a single static snapshot model. The second, `start_sim.py`, automates the creation of a snapshot model for each heterogeneity model.\n",
+ "\n",
+ "## Modeling one snapshot\n",
+ "Here, we will describe the process of modeling a single snapshot model, as performed by running `static_snapshot.py`.\n",
+ "\n",
+ "### Representing the model\n",
+ "\n",
+ "We begin by representing the data and the model. In general, the *representation* of a system is defined by all the variables that need to be determined.\n",
+ "\n",
+ "For our model of a protein complex, we use a combination of two representations. The first is a series of *spherical beads*, which can correspond to portions of the biomolecules of interest, such as atoms or groups of atoms. The second is a series of *3D Gaussians*, which help calculate the overlap between our model and the density from ET data.\n",
+ "\n",
+ "Beads and Gaussians in our model belong to either a *rigid body* or *flexible string*. The positions of all beads and Gaussians in a single rigid body are constrained during sampling and do not move relative to each other. Meanwhile, flexible beads can move freely during sampling, but are restrained by sequence connectivity.\n",
+ "\n",
+ "To begin, we built a topology file with the representation for the model of the complete system, `spatiotemporal_topology.txt`, located in `Heterogeneity/Heterogeneity_Modeling/`. This complete topology was used as a template to build topologies for each snapshot model. Based on our observation of the structure of the complex, we chose to represent each protein with at least 2 separate rigid bodies, and left the first 28 residues of protein C as flexible beads. Rigid bodies were described with 1 bead for every residue, and 10 residues per Gaussian. Flexible beads were described with 1 bead for every residue and 1 residue per Gaussian. A more complete description of the options available in topology files is available in the the [TopologyReader](https://integrativemodeling.org/2.21.0/doc/ref/classIMP_1_1pmi_1_1topology_1_1TopologyReader.html) documentation.\n",
+ "\n",
+ "```\n",
+ "|molecule_name | color | fasta_fn | fasta_id | pdb_fn | chain | residue_range | pdb_offset | bead_size | em_residues_per_gaussian | rigid_body | super_rigid_body | chain_of_super_rigid_bodies | \n",
+ "\n",
+ "|Ubi-E2-D3|blue|3rpg.fasta.txt|Ubi-E2-D3|3rpg.pdb|A|-1,18|2|1|10|1|1||\n",
+ "|Ubi-E2-D3|blue|3rpg.fasta.txt|Ubi-E2-D3|3rpg.pdb|A|19,147|2|1|10|2|1||\n",
+ "|BMI-1|red|3rpg.fasta.txt|BMI-1|3rpg.pdb|B|3,83|-2|1|10|3|2||\n",
+ "|BMI-1|red|3rpg.fasta.txt|BMI-1|3rpg.pdb|B|84,101|-2|1|10|4|2||\n",
+ "|E3-ubi-RING2|green|3rpg.fasta.txt|E3-ubi-RING2|BEADS|C|16,44|-15|1|1|5|3||\n",
+ "|E3-ubi-RING2|green|3rpg.fasta.txt|E3-ubi-RING2|3rpg.pdb|C|45,116|-15|1|10|6|3||\n",
+ "```\n",
+ "\n",
+ "Next, we must prepare `static_snapshot.py` to read in this topology file. We begin by defining the input variables, `state` and `time`, which define which heterogeneity model to use, as well as the paths to other pieces of input information.\n",
+ "\n",
+ "```python\n",
+ "# Running parameters to access correct path of ET_data for EM restraint\n",
+ "# and topology file for certain {state}_{time}_topol.txt\n",
+ "state = sys.argv[1]\n",
+ "time = sys.argv[2]\n",
+ "\n",
+ "# Topology file\n",
+ "topology_file = f\"../{state}_{time}_topol.txt\"\n",
+ "# Paths to input data for topology file\n",
+ "pdb_dir = \"../../../../Input_Information/PDB\"\n",
+ "fasta_dir = \"../../../../Input_Information/FASTA\"\n",
+ "# Path where forward gmms are created with BuildSystem (based ont topology file)\n",
+ "# If gmms exist, they will be used from this folder\n",
+ "forward_gmm_dir = \"../forward_densities/\"\n",
+ "# Path to experimental gmms\n",
+ "exp_gmm_dir= '../../../../Input_Information/ET_data/add_noise'\n",
+ "```\n",
+ "\n",
+ "Next, we build the system, using the topology tile, described above.\n",
+ "```python\n",
+ "# Create a system from a topology file. Resolution is set on 1.\n",
+ "bs = IMP.pmi.macros.BuildSystem(mdl, resolutions= 1, name= f'Static_snapshots_{state}_{time}')\n",
+ "bs.add_state(t)\n",
+ "```\n",
+ "\n",
+ "Then, we prepare for later sampling steps by setting which Monte Carlo moves will be performed. Rotation (`rot`) and translation (`trans`) parameters are separately set for super rigid bodies (`srb`), rigid bodies (`rb`), and beads (`bead`).\n",
+ "```python\n",
+ "# Macro execution: It gives hierarchy and degrees of freedom (dof).\n",
+ "# In dof we define how much can each (super) rigid body translate and rotate between two adjacent Monte Carlo steps\n",
+ "root_hier, dof = bs.execute_macro(max_rb_trans=1.0,\n",
+ " max_rb_rot=0.5, max_bead_trans=2.0,\n",
+ " max_srb_trans=1.0, max_srb_rot=0.5)\n",
+ "```\n",
+ "\n",
+ "### Scoring the model\n",
+ "\n",
+ "After building the model representation, we choose a scoring function to score the model based on input information. This scoring function is represented as a series of restraints that serve as priors.\n",
+ "\n",
+ "#### Connectivity\n",
+ "We begin with a connectivity restraint, which restrains beads adjacent in sequence to be close in 3D space.\n",
+ "\n",
+ "```python\n",
+ "# Adding Restraints\n",
+ "# Empty list where the data from restraints should be collected\n",
+ "output_objects=[]\n",
+ "\n",
+ "# Two common restraints: ConnectivityRestraint and ExcludedVolumeSphere\n",
+ "# ConnectivityRestraint is added for each \"molecule\" separately\n",
+ "for m in root_hier.get_children()[0].get_children():\n",
+ " cr = IMP.pmi.restraints.stereochemistry.ConnectivityRestraint(m)\n",
+ " cr.add_to_model()\n",
+ " output_objects.append(cr)\n",
+ "```\n",
+ "\n",
+ "#### Excluded volume\n",
+ "Next is an excluded volume restraint, which restrains beads to minimize their spatial overlap.\n",
+ "\n",
+ "```python\n",
+ "# Add excluded volume\n",
+ "evr = IMP.pmi.restraints.stereochemistry.ExcludedVolumeSphere(\n",
+ " included_objects=[root_hier],\n",
+ " resolution=1000)\n",
+ "output_objects.append(evr)\n",
+ "evr.add_to_model()\n",
+ "```\n",
+ "\n",
+ "#### Electron tomography\n",
+ "Finally, we restrain our models based on their fit to ET density maps. Both the experimental map and the forward protein density are represented as Gaussian mixture models (GMMs) to speed up scoring. The score is based on the log of the correlation coefficient between the experimental density and the forward protein density.\n",
+ "\n",
+ "```python\n",
+ "# Applying time-dependent EM restraint. Point to correct gmm / mrc file at each time point\n",
+ "# Path to corresponding .gmm file (and .mrc file)\n",
+ "em_map = exp_gmm_dir + f\"/{time}_noisy.gmm\"\n",
+ "\n",
+ "# Create artificial densities from hierarchy\n",
+ "densities = IMP.atom.Selection(root_hier,\n",
+ " representation_type=IMP.atom.DENSITIES).get_selected_particles()\n",
+ "\n",
+ "# Create EM restraint based on these densities\n",
+ "emr = IMP.pmi.restraints.em.GaussianEMRestraint(\n",
+ " densities,\n",
+ " target_fn=em_map,\n",
+ " slope=0.000001,\n",
+ " scale_target_to_mass=True,\n",
+ " weight=1000)\n",
+ "output_objects.append(emr)\n",
+ "emr.add_to_model()\n",
+ "```\n",
+ "\n",
+ "### Searching for good scoring models\n",
+ "\n",
+ "After building a scoring function that scores alternative models based on their fit to the input information, we aim to search for good scoring models. For complicated systems, stochastic sampling techniques such as Monte Carlo (MC) sampling are often the most efficient way to compute good scoring models. Here, we generate a random initial configuration and then perform temperature replica exchange MC sampling with 16 temperatures from different initial configurations. By performing multiple runs of replica exchange MC from different initial configurations, we can later ensure that our sampling is sufficiently converged.\n",
+ "\n",
+ "```python\n",
+ "# Generate random configuration\n",
+ "IMP.pmi.tools.shuffle_configuration(root_hier,\n",
+ " max_translation=50)\n",
+ "\n",
+ "# Perform replica exchange sampling\n",
+ "rex=IMP.pmi.macros.ReplicaExchange(mdl,\n",
+ " root_hier=root_hier,\n",
+ " monte_carlo_sample_objects=dof.get_movers(),\n",
+ " global_output_directory='output', # name 'output' is the best for imp sampcon select_good\n",
+ " output_objects=output_objects,\n",
+ " monte_carlo_steps=200, # Number of MC steps between writing frames.\n",
+ " number_of_best_scoring_models=0,\n",
+ " number_of_frames=500) # number of frames to be saved\n",
+ "rex.execute_macro()\n",
+ "```\n",
+ "\n",
+ "After performing sampling, a variety of outputs will be created. These outputs include `.rmf` files, which contain multi-resolution models output by IMP, and `.out` files which contains a variety of information about the run such as the value of the restraints and the MC acceptance rate.\n",
+ "\n",
+ "## Generalizing modeling to all snapshots\n",
+ "\n",
+ "Next, we will describe the process of computing multiple static snapshot models, as performed by running `start_sim.py`.\n",
+ "\n",
+ "From heterogeneity modeling, we see that there are 3 heterogeneity models at each time point (it is possible to have more snapshot models than copy numbers if multiple copies of the protein exist in the complex), each of which has a corresponding topology file in `Heterogeneity/Heterogeneity_Modeling/`. We wrote a function, `generate_all_snapshots`, which creates a directory for each snapshot model, copies the python script and topology file into that directory, and submits a job script to run sampling. The job script will likely need to be customized for the user's computer or cluster.\n",
+ "\n",
+ "```python\n",
+ "# 1a - parameters for generate_all_snapshots\n",
+ "# state_dict - universal parameter\n",
+ "state_dict = {'0min': 3, '1min': 3, '2min': 1}\n",
+ "\n",
+ "main_dir = os.getcwd()\n",
+ "topol_dir = os.path.join(os.getcwd(), '../../Heterogeneity/Heterogeneity_Modeling')\n",
+ "items_to_copy = ['static_snapshot.py'] # additionally we need to copy only specific topology file\n",
+ "# jobs script will likely depend on the user's cluster / configuration\n",
+ "job_template = (\"#!/bin/bash\\n#$ -S /bin/bash\\n#$ -cwd\\n#$ -r n\\n#$ -j y\\n#$ -N Tutorial\\n#$ -pe smp 16\\n\"\n",
+ " \"#$ -l h_rt=48:00:00\\n\\nmodule load Sali\\nmodule load imp\\nmodule load mpi/openmpi-x86_64\\n\\n\"\n",
+ " \"mpirun -np $NSLOTS python3 static_snapshot.py {state} {time}\")\n",
+ "number_of_runs = 50\n",
+ "\n",
+ "# 1b - calling generate_all_snapshots\n",
+ "generate_all_snapshots(state_dict, main_dir, topol_dir, items_to_copy, job_template, number_of_runs)\n",
+ "\n",
+ "```\n",
+ "\n",
+ "# Snapshot modeling step 3: assessment\n",
+ "\n",
+ "The above code would create a variety of alternative snapshot models. In general, we would like to assess these models in at least 4 ways: estimate the sampling precision, compare the model to data used to construct it, validate the model against data not used to construct it, and quantify the precision of the model. In this portion of the tutorial, we focus specifically on estimating the sampling precision of the model, while quantitative comparisons between the model and experimental data will be reserved for the final step, when we assess trajectories. Again, this assessment process is quite computationally intensive, so, instead of running the script explicitly, we will walk you through the `snapshot_assessment.py` script, which is located in the `modeling/Snapshots/Snapshots_Assessment` folder.\n",
+ "\n",
+ "## Filtering good scoring models\n",
+ "\n",
+ "Initially, we want to filter the various alternative structural models to only select those that meet certain parameter thresholds. In this case, we filter the structural models comprising each snapshot model by the median cross correlation with EM data. We note that this filtering criteria is subjective, and developing a Bayesian method to objectively weigh different restraints for filtering remains an interesting future development in integrative modeling.\n",
+ "\n",
+ "The current filtering procedure involves three steps. In the first step, we look through the `stat.*.out` files to write out the cross correlation with EM data for each model, which, in this case, is labeled column `3`, `GaussianEMRestraint_None_CCC`. In other applications, the column that corresponds to each type of experimental data may change, depending on the scoring terms for each model. For each snapshot model, a new file is written with this data (`{state}_{time}_stat.txt`).\n",
+ "\n",
+ "```python\n",
+ "# state_dict - universal parameter\n",
+ "state_dict = {'0min': 3, '1min': 3, '2min': 1}\n",
+ "# current directory\n",
+ "main_dir = os.getcwd()\n",
+ "\n",
+ "# 1 calling extracting_stat_files function and related parameters\n",
+ "keys_to_extract = [3]\n",
+ "runs_nr = 50\n",
+ "replica_nr = 16\n",
+ "replica_output_name = 'output'\n",
+ "decimals_nr = 16\n",
+ "\n",
+ "extracting_stat_files(state_dict, runs_nr, replica_nr, replica_output_name, keys_to_extract, decimals_nr)\n",
+ "print(\"extracting_stat_files is DONE\")\n",
+ "print(\"\")\n",
+ "print(\"\")\n",
+ "```\n",
+ "\n",
+ "In the second step, we want to determine the median value of EM cross correlation for each snapshot model. We wrote `general_rule_calculation` to look through the `general_rule_column` for each `{state}_{time}_stat.txt` file and determine both the median value and the number of structures generated.\n",
+ "\n",
+ "```python\n",
+ "# 2 calling general_rule_calculation and related parameters\n",
+ "general_rule_column = '3'\n",
+ "\n",
+ "general_rule_calculation(state_dict, general_rule_column)\n",
+ "\n",
+ "print(\"general_rule_calculation is DONE\")\n",
+ "print(\"\")\n",
+ "print(\"\")\n",
+ "```\n",
+ "\n",
+ "In the third step, we use the `imp_sampcon select_good` tool to filter each snapshot model, according to the median value determined in the previous step. For each snapshot model, this function produces a file, `good_scoring_models/model_ids_scores.txt`, which contains the run, replicaID, scores, and sampleID for each model that passes filtering. It also saves RMF files with each model from two independent groups of sampling runs from each snapshot model to `good_scoring_models/sample_A` and `good_scoring_models/sample_B`, writes the scores for the two independent groups of sampling runs to `good_scoring_models/scoresA.txt` and `good_scoring_models/scoresB.txt`, and writes `good_scoring_models/model_sample_ids.txt` to connect each model to its division of sampling run. More information on `imp_sampcon` is available in the analysis portion of the [actin tutorial](https://integrativemodeling.org/tutorials/actin/analysis.html).\n",
+ "\n",
+ "```python\n",
+ "# 3 calling general_rule_filter_independent_samples\n",
+ "general_rule_filter_independent_samples(state_dict, main_dir)\n",
+ "print(\"general_rule_filter_independent_samples is DONE\")\n",
+ "print(\"\")\n",
+ "print(\"\")\n",
+ "```\n",
+ "\n",
+ "## Plotting data, clustering models, and determining sampling precision\n",
+ "\n",
+ "Next, scores can be plotted for analysis. Here, we wrote the `create_histograms` function to run `imp_sampcon plot_score` so that it plots distributions for various scores of interest. Each of these plots are saved to `histograms{state}_{time}/{score}.png`, where score is an object listed in the `score_list`. These plots are useful for debugging the modeling protocol, and should appear roughly Gaussian.\n",
+ "\n",
+ "```python\n",
+ "# 4 calling create_histograms and related parameters\n",
+ "score_list = [\n",
+ " 'Total_Score',\n",
+ " 'ConnectivityRestraint_Score',\n",
+ " 'ExcludedVolumeSphere_Score',\n",
+ " 'GaussianEMRestraint_None',\n",
+ " 'GaussianEMRestraint_None_CCC'\n",
+ "] # list of histograms we want to create in each histograms{state}_{time} directory\n",
+ "\n",
+ "create_histograms(state_dict, main_dir, score_list)\n",
+ "print(\"create_histograms is DONE\")\n",
+ "print(\"\")\n",
+ "print(\"\")\n",
+ "```\n",
+ "\n",
+ "We then check the number of models in each sampling run though our function, `count_rows_and_generate_report`, which writes the `independent_samples_stat.txt` file. Empirically, we have found that ensuring the overall number of models in each independent sample after filtering is roughly equal serves a good first check on sampling convergence.\n",
+ "\n",
+ "```python\n",
+ "# 5 calling count_rows_and_generate_report\n",
+ "count_rows_and_generate_report(state_dict)\n",
+ "print(\"count_rows_and_generate_report is DONE\")\n",
+ "print(\"\")\n",
+ "print(\"\")\n",
+ "```\n",
+ "\n",
+ "Next, we write the density range dictionaries, which are output as `{state}_{time}_density_ranges.txt`. These dictionaries label each protein in each snapshot model, which will be passed into `imp_sampcon` to calculate the localization density of each protein.\n",
+ "\n",
+ "```python\n",
+ "# 6 calling create_density_dictionary:\n",
+ "create_density_dictionary_files(state_dict, main_dir)\n",
+ "print(\"create_density_dictionary is DONE\")\n",
+ "print(\"\")\n",
+ "print(\"\")\n",
+ "```\n",
+ "\n",
+ "Next, we run `imp_sampcon exhaust` on each snapshot model. This code performs checks on the exhaustiveness of the sampling. Specifically it analyzes the convergence of the model score, whether the two model sets were drawn from the same distribution, and whether each structural cluster includes models from each sample proportionally to its size. The output for each snapshot model is written out to the `exhaust_{state}_{time}` folder.\n",
+ "\n",
+ "```python\n",
+ "# 7 calling exhaust\n",
+ "exhaust(state_dict, main_dir)\n",
+ "print(\"exhaust is DONE\")\n",
+ "print(\"\")\n",
+ "print(\"\")\n",
+ "```\n",
+ "\n",
+ "Plots for determining the sampling precision are shown below for a single snapshot model, 1_2min. (a) Tests the convergence of the lowest scoring model (`snapshot_{state}_{time}.Top_Score_Conv.pdf`). Error bars represent standard deviations of the best scores, estimated by selecting different subsets of models 10 times. The light-blue line indicates a lower bound reference on the total score. (b) Tests that the scores of two independently sampled models come from the same distribution (`snapshot_{state}_{time}.Score_Dist.pdf`). The difference between the two distributions, as measured by the KS test statistic (D) and KS test p-value (p) indicates that the difference is both statistically insignificant (p>0.05) and small in magnitude (D<0.3). (c) Determines the structural precision of a snapshot model (`snapshot_{state}_{time}.ChiSquare.pdf`). RMSD clustering is performed at 1 Å intervals until the clustered population (% clustered) is greater than 80%, and either the χ2 p-value is greater than 0.05 or Cramer’s V is less than 0.1. The sampling precision is indicated by the dashed black line. (d) Populations from sample 1 and sample 2 are shown for each cluster (`snapshot_{state}_{time}.Cluster_Population.pdf`).\n",
+ "\n",
+ "\n",
+ "\n",
+ "Further structural analysis can be calculated by using the `cluster.*` files. The `cluster.*.{sample}.txt` files contain the model number for the models in that cluster, where `{sample}` indicates which round of sampling the models came from. The `cluster.*` folder contains an RMF for centroid model of that cluster, along with the localization densities for each protein. The localization densities of each protein from each independent sampling can be compared to ensure independent samplings produce the same results.\n",
+ "\n",
+ "Ideally, each of these plots should be checked for each snapshot model. As a way to summarize the output of these checks, we can gather the results of the KS test and the sampling precision test for all snapshot models. This is done by running `extract_exhaust_data` and `save_exhaust_data_as_png`, which write `KS_sampling_precision_output.txt` and `KS_sampling_precision_output.png`, respectively.\n",
+ "\n",
+ "```python\n",
+ "# 8 calling extract_exhaust_data\n",
+ "extract_exhaust_data(state_dict)\n",
+ "print(\"extract_exhaust_data is DONE\")\n",
+ "print(\"\")\n",
+ "print(\"\")\n",
+ "\n",
+ "# 9 calling save_exhaust_data_as_png\n",
+ "save_exhaust_data_as_png()\n",
+ "print(\"save_exhaust_data_as_png is DONE\")\n",
+ "print(\"\")\n",
+ "print(\"\")\n",
+ "```\n",
+ "\n",
+ "These codes write a table that include the KS two sample test statistic (D), the KS test p-value, and the sampling precision for each snapshot model, which is replotted below.\n",
+ "\n",
+ "\n",
+ "\n",
+ "## Visualizing models\n",
+ "\n",
+ "The resulting RMF files and localization densities from this analysis can be viewed in [UCSF Chimera](https://www.rbvi.ucsf.edu/chimera/) (version>=1.13) or [UCSF ChimeraX](https://www.cgl.ucsf.edu/chimerax/).\n",
+ "\n",
+ "Here, we plotted each centroid model (A - blue, B - orange, and C - purple) from the most populated cluster for each snapshot model and compared that model to the experimental EM profile (gray).\n",
+ "\n",
+ "\n",
+ "\n",
+ "Finally, now that snapshot models were assessed, we can perform modeling of a trajectory."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6fc2546f-7c6f-4146-8c9b-ee143fcead6e",
+ "metadata": {},
+ "source": [
+ "Modeling of a Trajectory\n",
+ "====================================\n",
+ "\n",
+ "Here, we describe the final modeling problem in our composite workflow, how to build a trajectory model using IMP.\n",
+ "\n",
+ "# Trajectory modeling step 1: gathering of information\n",
+ "\n",
+ "We begin trajectory modeling with the first step of integrative modeling, gathering information. Trajectory modeling utilizes dynamic information about the bimolecular process. In this case, we utilize snapshot models, physical theories, and synthetically generated small-angle X-ray scattering (SAXS) profiles.\n",
+ "\n",
+ "\n",
+ "\n",
+ "Snapshot models inform transitions between sampled time points, and their scores inform trajectory scores. Physical theories of macromolecular dynamics inform transitions between sampled time points. SAXS data informs the size and shape of the assembling complex and is left for validation.\n",
+ "\n",
+ "# Trajectory modeling step 2: representation, scoring function, and search process\n",
+ "\n",
+ "Trajectory modeling connects alternative snapshot models at adjacent time points, followed by scoring the trajectories based on their fit to the input information, as described in full [here](https://www.biorxiv.org/content/10.1101/2024.08.06.606842v1.abstract).\n",
+ "\n",
+ "## Background behind integrative spatiotemporal modeling\n",
+ "### Representing the model\n",
+ "\n",
+ "We choose to represent dynamic processes as a trajectory of snapshot models, with one snapshot model at each time point. In this case, we computed snapshot models at 3 time points (0, 1, and 2 minutes), so a single trajectory will consist of 3 snapshot models, one at each 0, 1, and 2 minutes. The modeling procedure described here will produce a set of scored trajectories, which can be displayed as a directed acyclic graph, where nodes in the graph represent the snapshot model and edges represent connections between snapshot models at neighboring time points.\n",
+ "\n",
+ "### Scoring the model\n",
+ "\n",
+ "To score trajectories, we incorporate both the scores of individual snapshot models, as well as the scores of transitions between them. Under the assumption that the process is Markovian (*i.e.* memoryless), the weight of a trajectory takes the form:\n",
+ "\n",
+ "$$\n",
+ "W(\\chi) \\propto \\displaystyle\\prod^{T}_{t=0} P( X_{t} | D_{t}) \\cdot \\displaystyle\\prod^{T-1}_{t=0} W(X_{t+1} | X_{t},D_{t,t+1}),\n",
+ "$$\n",
+ "\n",
+ "where $t$ indexes times from 0 until the final modeled snapshot ($T$); $P(X_{t} | D_{t})$ is the snapshot model score; and $W(X_{t+1} | X_{t},D_{t,t+1})$ is the transition score. Trajectory weights ($W(\\chi)$) are normalized so that the sum of all trajectory weights is 1.0. Transition scores are currently based on a simple metric that either allows or disallows a transition. Transitions are only allowed if all proteins in the first snapshot model are included in the second snapshot model. In the future, we hope to include more detailed transition scoring terms, which may take into account experimental information or physical models of macromolecular dynamics.\n",
+ "\n",
+ "### Searching for good scoring models\n",
+ "\n",
+ "Trajectories are constructed by enumerating all connections between adjacent snapshot models and scoring these trajectories according to the equation above. This procedure results in a set of weighted trajectories.\n",
+ "\n",
+ "## Computing trajectory models\n",
+ "To compute trajectories, we first copy all necessary files to a new directory, `data`. These files are (i) `{state}_{time}.config` files, which include the subcomplexes that are in each state, (ii) `{state}_{time}_scores.log`, which is a list of all scores of all structural models in that snapshot model, and (iii) `exp_comp{prot}.csv`, which is the experimental copy number for each protein (`{prot}`) as a function of time. Here, we copy files related to the snapshot models (`*.log` files) from the `modeling` directory, as we skipped computing snapshot models due to the computational expense.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bb887efe-0630-47b6-9bf1-85fa216a6816",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def merge_scores(fileA, fileB, outputFile):\n",
+ " \"\"\"\n",
+ " For each function merges scoresA.txt and scoresB.txt into {state}_{time}_scores.log\n",
+ "\n",
+ " :param fileA: path to scoresA.txt\n",
+ " :param fileB: path to scoresB.txt\n",
+ " :param outputFile: path to output merged .log file named {state}_{time}_scores.log for each snapshot.\n",
+ " This type of .log file is used in crete_DAG to generate trajectory model.\n",
+ " \"\"\"\n",
+ " # open both files, so data can be extracted\n",
+ " with open(fileA, 'r') as file_a:\n",
+ " data_a = file_a.readlines()\n",
+ "\n",
+ " with open(fileB, 'r') as file_b:\n",
+ " data_b = file_b.readlines()\n",
+ "\n",
+ " # Merge the content of both files\n",
+ " merged_data = data_a + data_b\n",
+ "\n",
+ " # Write the merged content into the output file\n",
+ " with open(outputFile, 'w') as output:\n",
+ " output.writelines(merged_data)\n",
+ "\n",
+ "def create_data_and_copy_files(state_dict, custom_source_dir1 = None, custom_source_dir2 = None, custom_source_dir3 = None):\n",
+ " \"\"\"\n",
+ " Copies three types of files important to generate trajectory models:\n",
+ " -.config files created with start_sim.py in Snapshot_Modeling (source_dir1)\n",
+ " -time-dependent stoichiometry data for each timepoint. Data should be presented in .csv file. With this function all\n",
+ " csv file in source_dir2 will be copied. These .csv files will be used in the exp_comp dictionary in create_DAG\n",
+ " function\n",
+ " -scoresA and scoresB for each snapshot created with imp sampcon exhaust\n",
+ " (source_dir1 + snapshot + good_scoring_models) are merged into total score .txt using merge_scores helper function.\n",
+ " All copied files are gathered in newly created './data/' directory, where everything is prepared for create_DAG\n",
+ " function.\n",
+ "\n",
+ "\n",
+ " :param state_dict (dict): state_dict: dictionary that defines the spatiotemporal model.\n",
+ " The keys are strings that correspond to each time point in the\n",
+ " stepwise temporal process. Keys should be ordered according to the\n",
+ " steps in the spatiotemporal process. The values are integers that\n",
+ " correspond to the number of possible states at that timepoint.\n",
+ " :param custom_source_dir1 (optional - str): Custom path to heterogeneity modeling dir (heterogeneity_modeling.py),\n",
+ " to copy .config files\n",
+ " :param custom_source_dir2 (optional - str): Custom path to stoichiometry data dir\n",
+ " :param custom_source_dir3 (optional - str): Custom path to snapshot modeling dir (start_sim.py), to copy .config\n",
+ " files and to access scoresA/scoresB (custom_source_dir3 + snapshot{state}_{time} + 'good_scoring_models')\n",
+ " \"\"\"\n",
+ "\n",
+ " # Create the destination directory if it does not exist (./data/). Here all the\n",
+ " destination_dir = './data/'\n",
+ " os.makedirs(destination_dir, exist_ok=True)\n",
+ "\n",
+ " # Path to heterogeneity modeling dir\n",
+ " if custom_source_dir1:\n",
+ " source_dir1 = custom_source_dir1\n",
+ " else:\n",
+ " source_dir1 = '../../Heterogeneity/Heterogeneity_Modeling/'\n",
+ "\n",
+ " # Path to stoichiometry data dir\n",
+ " if custom_source_dir2:\n",
+ " source_dir2 = custom_source_dir2\n",
+ " else:\n",
+ " source_dir2 = '../../Input_Information/gen_FCS/'\n",
+ "\n",
+ " # Path to snapshot modeling dir\n",
+ " if custom_source_dir3:\n",
+ " source_dir3 = custom_source_dir3\n",
+ " else:\n",
+ " source_dir3 = '../../Snapshots/Snapshots_Modeling/'\n",
+ "\n",
+ " # Copy all .config files from the first source directory to the destination directory\n",
+ " try:\n",
+ " for file_name in os.listdir(source_dir1):\n",
+ " if file_name.endswith('.config'):\n",
+ " full_file_name = os.path.join(source_dir1, file_name)\n",
+ " if os.path.isfile(full_file_name):\n",
+ " shutil.copy(full_file_name, destination_dir)\n",
+ " print(\".config files are copied\")\n",
+ " except Exception as e:\n",
+ " print(f\".config files cannot be copied. Try do do it manually. Reason for Error: {e}\")\n",
+ "\n",
+ " # Copy all .csv stoichiometry files from the second source directory to the destination directory\n",
+ " try:\n",
+ " for file_name in os.listdir(source_dir2):\n",
+ " if file_name.endswith('.csv'):\n",
+ " full_file_name = os.path.join(source_dir2, file_name)\n",
+ " if os.path.isfile(full_file_name):\n",
+ " shutil.copy(full_file_name, destination_dir)\n",
+ " print(\".csv stoichiometry files are copied\")\n",
+ " except Exception as e:\n",
+ " print(f\".csv stoichiometry files cannot be copied. Try do do it manually. Reason for Error: {e}\")\n",
+ "\n",
+ " # Copy scoresA and scoresB from the snapshot_{state}_{time} directories and first source directory path\n",
+ " for time in state_dict.keys():\n",
+ " for state in range(1, state_dict[time] + 1):\n",
+ " dir_name = f\"snapshot{state}_{time}\"\n",
+ " good_scoring_path = \"good_scoring_models\"\n",
+ " file_a = os.path.join(source_dir3, dir_name, good_scoring_path, \"scoresA.txt\")\n",
+ " file_b = os.path.join(source_dir3, dir_name, good_scoring_path, \"scoresB.txt\")\n",
+ " output_file = os.path.join(destination_dir, f\"{state}_{time}_scores.log\") # name of the output file\n",
+ "\n",
+ " try:\n",
+ " # Ensure the directory exists before try to read/write files\n",
+ " if os.path.exists(file_a) and os.path.exists(file_b):\n",
+ " merge_scores(file_a, file_b, output_file) # call helper function to merge files\n",
+ " print(f\"Scores for snapshot{state}_{time} have been merged and saved\")\n",
+ " else: # many things can go wrong here, so it is good to know where is the problem\n",
+ " print(f\"Path doesn't exist: {source_dir3}\")\n",
+ " print(f\"Files not found in directory: {dir_name}\")\n",
+ " print(f\"Files not found in directory: {file_a}\")\n",
+ " print(f\"Files not found in directory: {file_b}\")\n",
+ " print(f\"Output directory doesn't exist: {destination_dir}\")\n",
+ " except Exception as e:\n",
+ " print(f\"total scores files cannot be copied of merged. Reason for Error: {e}\")\n",
+ "\n",
+ "# copy all the relevant files for create_DAG\n",
+ "# it is important that everything starts from main dir\n",
+ "os.chdir(main_dir)\n",
+ "state_dict = {'0min': 3, '1min': 3, '2min': 1}\n",
+ "create_data_and_copy_files(state_dict, custom_source_dir1=main_dir, custom_source_dir2='../modeling/Input_Information/gen_FCS/', custom_source_dir3='../modeling/Snapshots/Snapshots_Modeling/')\n",
+ "\n",
+ "# then trajectory model is created based on the all copied data\n",
+ "expected_subcomplexes = ['A', 'B', 'C']\n",
+ "exp_comp = {'A': 'exp_compA.csv', 'B': 'exp_compB.csv', 'C': 'exp_compC.csv'}\n",
+ "input = './data/'\n",
+ "output = \"../output/\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a4053cf-f2a3-4836-89f3-a8b2098109f2",
+ "metadata": {},
+ "source": [
+ "Next, we compute the spatiotemporal model. The inputs we included are:\n",
+ "- state_dict (dict): a dictionary that defines the spatiotemporal model. Keys are strings for each time point in the spatiotemporal process and values are integers corresponding to the number of snapshot models computed at that time point.\n",
+ "- out_pdf (bool): whether to write the probability distribution function (pdf).\n",
+ "- npaths (int): Number of states two write to a file (path*.txt).\n",
+ "- input_dir (str): directory with the input information.\n",
+ "- scorestr (str): final characters at the end of the score files.\n",
+ "- output_dir (str): directory to which model will be written. Will be created if it does not exist.\n",
+ "- spatio_temporal_rule (bool): whether to include our transition scoring term, which enforces that all proteins in the first snapshot model are included in the second snapshot model.\n",
+ "- expected_subcomplexes (list): list of string objects, which is the subcomplexes to look when enforcing the spatiotemporal rule. Strings should be substrings of those in `{state}_{time}.config` files.\n",
+ "- score_comp (bool): whether to score the composition of each snapshot model.\n",
+ "- exp_comp_map (dictionary): key is a string with the name of each protein that will undergo composition scoring, value is the `.csv` file with the copy number data for that protein.\n",
+ "- draw_dag (bool): whether to write out an image of the directed acyclic graph."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "819bb205-fa52-42f9-a4ab-d2f7c3cff0ad",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "nodes, graph, graph_prob, graph_scores = spatiotemporal.create_DAG(state_dict, out_pdf=True, npaths=3,\n",
+ " input_dir=input, scorestr='_scores.log',\n",
+ " output_dir=output, spatio_temporal_rule=True,\n",
+ " expected_subcomplexes=expected_subcomplexes,\n",
+ " score_comp=True, exp_comp_map=exp_comp,\n",
+ " draw_dag=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7184fcec-05b9-4348-a9b8-d3673db00fe5",
+ "metadata": {},
+ "source": [
+ "After running `spatiotemporal.create_DAG`, a variety of outputs are written:\n",
+ "- `cdf.txt`: the cumulative distribution function for the set of trajectories.\n",
+ "- `pdf.txt`: the probability distribution function for the set of trajectories.\n",
+ "- `labeled_pdf.txt`: Each row has 2 columns and represents a different trajectory. The first column labels a single trajectory as a series of snapshot models, where each snapshot model is written as `{state}_{time}|` in sequential order. The second column is the probability distribution function corresponding to that trajectory.\n",
+ "- `dag_heatmap.eps` and `dag_heatmap`: image of the directed acyclic graph from the set of models.\n",
+ "- `path*.txt`: files where each row includes a `{state}_{time}` string, so that rows correspond to the states visited over that trajectory. Files are numbered from the most likely path to the least likely path.\n",
+ "\n",
+ "Now that we have a trajectory model, we can plot the directed acyclic graph (left) and the series of centroid models from each snapshot model along the most likely trajectory (right). Each row corresponds to a different time point in the assembly process (0 min, 1 min, and 2 min). Each node is shaded according to its weight in the final model ($W(X_{t})$). Proteins are colored as A - blue, B - orange, and C - purple.\n",
+ "\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fb4c016e-e5ba-4e3b-a76e-499750abb782",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# Trajectory modeling step 3: assessment\n",
+ "\n",
+ "Now that the set of spatiotemporal models has been constructed, we must evaluate these models. We can evaluate these models in at least 4 ways: estimating the sampling precision, comparing the model to data used to construct it, validating the model against data not used to construct it, and quantifying the precision of the model.\n",
+ "\n",
+ "## Sampling precision\n",
+ "\n",
+ "To begin, we calculate the sampling precision of the models. The sampling precision is calculated by using `spatiotemporal.create_DAG` to reconstruct the set of trajectories using 2 independent sets of samplings for snapshot models. Then, the overlap between these snapshot models is evaluated using `analysis.temporal_precision`, which takes in two `labeled_pdf` files.\n",
+ "\n",
+ "The temporal precision can take values between 1.0 and 0.0, and indicates the overlap between the two models in trajectory space. Hence, values close to 1.0 indicate a high sampling precision, while values close to 0.0 indicate a low sampling precision. Here, the value close to 1.0 indicates that sampling does not affect the weights of the trajectories.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f1436e33-81b6-400c-bf9c-7b667455265a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "## 1 - calculation of temporal precision\n",
+ "\n",
+ "# 1 - copy_files_for_data (copy all relevant files into 'data' directory)\n",
+ "def copy_files_for_data(state_dict, custom_source_dir1 = None, custom_source_dir2 = None, custom_source_dir3 = None):\n",
+ " \"\"\"\n",
+ " Copies three types of files important to generate trajectory models:\n",
+ " -.config files created with start_sim.py in Snapshot_Modeling (source_dir1)\n",
+ " -time-dependent stoichiometry data for each timepoint. Data should be presented in .csv file. With this function all\n",
+ " csv file in source_dir2 will be copied. These .csv files will be used in the exp_comp dictionary in create_DAG\n",
+ " function\n",
+ " -scoresA and scoresB for each snapshot created with imp sampcon exhaust\n",
+ " (source_dir1 + snapshot + good_scoring_models) are merged into total score .txt using merge_scores helper function.\n",
+ " All copied files are gathered in newly created './data/' directory, where everything is prepared for create_DAG\n",
+ " function.\n",
+ "\n",
+ "\n",
+ " :param state_dict (dict): state_dict: dictionary that defines the spatiotemporal model.\n",
+ " The keys are strings that correspond to each time point in the\n",
+ " stepwise temporal process. Keys should be ordered according to the\n",
+ " steps in the spatiotemporal process. The values are integers that\n",
+ " correspond to the number of possible states at that timepoint.\n",
+ " :param custom_source_dir1 (optional - str): Custom path to heterogeneity modeling dir (heterogeneity_modeling.py),\n",
+ " to copy .config files\n",
+ " :param custom_source_dir2 (optional - str): Custom path to stoichiometry data dir\n",
+ " :param custom_source_dir3 (optional - str): Custom path to snapshot modeling dir (start_sim.py), to copy .config\n",
+ " files and to access scoresA/scoresB (custom_source_dir3 + snapshot{state}_{time} + 'good_scoring_models')\n",
+ " \"\"\"\n",
+ " # Create the destination directory for all the data copied in this function\n",
+ " destination_dir = './data/'\n",
+ " os.makedirs(destination_dir, exist_ok=True)\n",
+ "\n",
+ " # path to snapshot modeling dir\n",
+ " if custom_source_dir1:\n",
+ " source_dir1 = custom_source_dir1\n",
+ " else:\n",
+ " source_dir1 = '../../Heterogeneity/Heterogeneity_Modeling/'\n",
+ "\n",
+ " # path to stoichiometry data dir\n",
+ " if custom_source_dir2:\n",
+ " source_dir2 = custom_source_dir1\n",
+ " else:\n",
+ " source_dir2 = '../../Input_Information/gen_FCS/'\n",
+ "\n",
+ " # path to snapshot modeling dir\n",
+ " if custom_source_dir3:\n",
+ " source_dir3 = custom_source_dir3\n",
+ " else:\n",
+ " source_dir3 = '../../Snapshots/Snapshots_Modeling/'\n",
+ "\n",
+ " # Copy all .config files from the first source directory to the destination directory\n",
+ " try:\n",
+ " for file_name in os.listdir(source_dir1):\n",
+ " if file_name.endswith('.config'):\n",
+ " full_file_name = os.path.join(source_dir1, file_name)\n",
+ " if os.path.isfile(full_file_name):\n",
+ " shutil.copy(full_file_name, destination_dir)\n",
+ " print(\".config files are copied\")\n",
+ " except Exception as e:\n",
+ " print(f\".config files cannot be copied. Try do do it manually. Reason for Error: {e}\")\n",
+ "\n",
+ " # Copy all .csv stoichiometry files from the second source directory to the destination directory\n",
+ " try:\n",
+ " for file_name in os.listdir(source_dir2):\n",
+ " if file_name.endswith('.csv'):\n",
+ " full_file_name = os.path.join(source_dir2, file_name)\n",
+ " if os.path.isfile(full_file_name):\n",
+ " shutil.copy(full_file_name, destination_dir)\n",
+ " print(\".csv stoichiometry files are copied\")\n",
+ " except Exception as e:\n",
+ " print(f\".csv stoichiometry files cannot be copied. Try do do it manually. Reason for Error: {e}\")\n",
+ "\n",
+ " # Copy scoresA and scoresB from the snapshot_{state}_{time} directories and first source directory path\n",
+ " try:\n",
+ " for time in state_dict.keys():\n",
+ " for state in range(1, state_dict[time] + 1):\n",
+ " snapshot_dir = os.path.join(source_dir3, f'snapshot{state}_{time}')\n",
+ " good_scoring_models_dir = os.path.join(snapshot_dir, 'good_scoring_models')\n",
+ " if os.path.isdir(good_scoring_models_dir):\n",
+ " for score_file in ['scoresA.txt', 'scoresB.txt']:\n",
+ " full_file_name = os.path.join(good_scoring_models_dir, score_file)\n",
+ " if os.path.isfile(full_file_name):\n",
+ " new_file_name = f'{state}_{time}_{os.path.splitext(score_file)[0]}.log'\n",
+ " shutil.copy(full_file_name, os.path.join(destination_dir, new_file_name))\n",
+ " print(f\"Copied {full_file_name} to {os.path.join(destination_dir, new_file_name)}\")\n",
+ " except Exception as e:\n",
+ " print(f\"scoresA.txt and scoresB.txt cannot be copied. Try do do it manually. Reason for Error: {e}\")\n",
+ "\n",
+ "os.chdir(main_dir)\n",
+ "# copy all the relevant files\n",
+ "copy_files_for_data(state_dict, custom_source_dir1='../modeling/Heterogeneity/Heterogeneity_Modeling/',\n",
+ " custom_source_dir2='../modeling/Input_Information/gen_FCS/',\n",
+ " custom_source_dir3='../modeling/Snapshots/Snapshots_Modeling/')\n",
+ "\n",
+ "# create two independent DAGs\n",
+ "expected_subcomplexes = ['A', 'B', 'C']\n",
+ "exp_comp = {'A': 'exp_compA.csv', 'B': 'exp_compB.csv', 'C': 'exp_compC.csv'}\n",
+ "input = \"./data/\"\n",
+ "outputA = \"../output_modelA/\"\n",
+ "outputB = \"../output_modelB/\"\n",
+ "\n",
+ "# Output from sampling precision and model precision to be saved in united dir: analysis_output_precision\n",
+ "analysis_output = \"./analysis_output_precision/\"\n",
+ "os.makedirs(analysis_output, exist_ok=True)\n",
+ "\n",
+ "nodesA, graphA, graph_probA, graph_scoresA = spatiotemporal.create_DAG(state_dict, out_pdf=True, npaths=3,\n",
+ " input_dir=input, scorestr='_scoresA.log',\n",
+ " output_dir=outputA,\n",
+ " spatio_temporal_rule=True,\n",
+ " expected_subcomplexes=expected_subcomplexes,\n",
+ " score_comp=True, exp_comp_map=exp_comp,\n",
+ " draw_dag=False)\n",
+ "\n",
+ "os.chdir(main_dir)\n",
+ "nodesB, graphB, graph_probB, graph_scoresB = spatiotemporal.create_DAG(state_dict, out_pdf=True, npaths=3,\n",
+ " input_dir=input, scorestr='_scoresB.log',\n",
+ " output_dir=outputB,\n",
+ " spatio_temporal_rule=True,\n",
+ " expected_subcomplexes=expected_subcomplexes,\n",
+ " score_comp=True, exp_comp_map=exp_comp,\n",
+ " draw_dag=False)\n",
+ "\n",
+ "## 1 - analysis\n",
+ "analysis.temporal_precision(outputA + 'labeled_pdf.txt', outputB + 'labeled_pdf.txt',\n",
+ " output_fn='.' + analysis_output + 'temporal_precision.txt')\n",
+ "os.chdir(main_dir) # it is crucial that after each step, directory is changed back to main\n",
+ "print(\"Step 1: calculation of temporal precision IS COMPLETED\")\n",
+ "print(\"\")\n",
+ "print(\"\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6ce48a52-b06c-4e09-80a5-debe0fd2cba2",
+ "metadata": {},
+ "source": [
+ "## Model precision\n",
+ "\n",
+ "Next, we calculate the precision of the model, using `analysis.precision`. Here, the model precision calculates the number of trajectories with high weights. The precision ranges from 1.0 to 1/d, where d is the number of trajectories. Values approaching 1.0 indicate the model set can be described by a single trajectory, while values close to 1/d indicate that all trajectories have similar weights.\n",
+ "\n",
+ "The `analysis.precision` function reads in the `labeled_pdf` of the complete model, and calculates the precision of the model. The value close to 1.0 indicates that the set of models can be sufficiently represented by a single trajectory."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d3669407-4ddc-4e1f-b8ba-c8638024cd56",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "## 2 - calculation of precision of the model\n",
+ "\n",
+ "# precision is calculated from .labeled_pdf.txt in Trajectories_Modeling dir\n",
+ "trajectories_modeling_input_dir = \"./output/\"\n",
+ "\n",
+ "analysis.precision(trajectories_modeling_input_dir + 'labeled_pdf.txt', output_fn=analysis_output + 'precision.txt')\n",
+ "\n",
+ "os.chdir(main_dir) # it is crucial that after each step, directory is changed back to main\n",
+ "print(\"Step 2: calculation of precision of the model IS COMPLETED\")\n",
+ "print(\"\")\n",
+ "print(\"\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8a44c8b5-93d8-4ab8-b8cc-d6a70168c0eb",
+ "metadata": {},
+ "source": [
+ "## Comparison against data used in model construction\n",
+ "\n",
+ "We then evaluate the model against data used in model construction. First, we can calculate the cross-correlation between the original EM map and the forward density projected from each snapshot model. This calculation is too computationally expensive for this notebook, but can be found in `modeling/Trajectories/Trajectories_Assessment`, where we wrote the `ccEM` function to perform this comparison for all snapshot models.\n",
+ "\n",
+ "```python\n",
+ "# 3a - comparison of the model to data used in modeling (EM)\n",
+ "exp_mrc_base_path = \"../../Input_Information/ET_data/add_noise\"\n",
+ "ccEM(exp_mrc_base_path)\n",
+ "print(\"Step 3a: ET validation IS COMPLETED\")\n",
+ "print(\"\")\n",
+ "print(\"\")\n",
+ "```\n",
+ "\n",
+ "The results of this comparison are shown below."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "70a6e0a0-98d5-4b63-a839-221a4fdc493b",
+ "metadata": {},
+ "source": [
+ "After comparing the model to EM data, we aimed to compare the model to copy number data, and wrote the `forward_model_copy_number` function to evaluate the copy numbers from our set of trajectories. The output of `forward_model_copy_number` is written in `forward_model_copy_number/`. The folder contains `CN_prot_{prot}.txt` files for each protein, which have the mean and standard deviation of protein copy number at each time point. We can then plot these copy numbers from the forward models against those from the experiment, as shown below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f3f7c451-2b04-4e8b-b39c-16c493e8a1a6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def read_labeled_pdf(pdf_file):\n",
+ " \"\"\"\n",
+ " Function to read in a labeled probability distribution file output by spatiotemporal.create_DAG.\n",
+ " Used to determine protein copy numbers by forward_model_copy_number.\n",
+ " :param pdf_file (str): sting for the path of the labeled probability distribution file output by\n",
+ " spatiotemporal.create_DAG.\n",
+ " :return prob_dict (dict): dictionary defining the spatiotemporal model. Each key is a state, and each value is the\n",
+ " probability of that state.\n",
+ " \"\"\"\n",
+ " # create blank dictonary to store the results\n",
+ " prob_dict = {}\n",
+ " # read in labeled pdf file\n",
+ " old = open(pdf_file, 'r')\n",
+ " line = old.readline()\n",
+ " # store the path through various nodes, as well as the probability of that path\n",
+ " while line:\n",
+ " line_split = line.split()\n",
+ " # assumes the first string is the trajectory string, the second string is the probability\n",
+ " if len(line_split) > 1:\n",
+ " # use # for comments\n",
+ " if line_split[0]=='#':\n",
+ " pass\n",
+ " else:\n",
+ " trj = line_split[0]\n",
+ " prob = float(line_split[1])\n",
+ " # store in dictionary\n",
+ " prob_dict[trj] = prob\n",
+ " line = old.readline()\n",
+ " old.close()\n",
+ " return prob_dict\n",
+ "\n",
+ "def copy_number_from_state(prot_list,trj,custom_data_folder = None):\n",
+ " \"\"\"\n",
+ " For a trajectory, returns an array of protein copy numbers as a function of time. Used by\n",
+ " forward_model_copy_number().\n",
+ " :param prot_list (list): list of proteins in the model. These proteins are searched for in each config file.\n",
+ " :param trj (str): string defining a single trajectory.\n",
+ " :param custom_data_folder (str, optional): path to custom data folder. Defaults to None, which points to '../data/'\n",
+ " :return _prots (array): 2D array of protein copy numbers. The first index loops over the time,\n",
+ " while the second index value loops over the protein (ordered as A, B, C).\n",
+ " :return N (int): Number of time points in each trajectory.\n",
+ " \"\"\"\n",
+ " # find folder with config files\n",
+ " if custom_data_folder:\n",
+ " data_folder = custom_data_folder\n",
+ " else:\n",
+ " data_folder = 'data/'\n",
+ "\n",
+ " # split the trajectory into a list of individual states\n",
+ " state_list=trj.split('|')\n",
+ " state_list=state_list[:-1]\n",
+ "\n",
+ " N = len(state_list)\n",
+ " # Map from index to protein: 0 - A, 1- B, 2- C\n",
+ " _prots = np.zeros((N, len(prot_list)))\n",
+ "\n",
+ " # Grab _prots from .config file\n",
+ " for i in range(0, N):\n",
+ " prot_file = data_folder + state_list[i] + '.config'\n",
+ " to_read = open(prot_file, 'r')\n",
+ " line = to_read.readline()\n",
+ " while line:\n",
+ " # for each line, check if the protein is in that line\n",
+ " for prot_index in range(len(prot_list)):\n",
+ " if prot_list[prot_index] in line:\n",
+ " _prots[i, prot_index] += 1\n",
+ " line = to_read.readline()\n",
+ "\n",
+ " return _prots,N\n",
+ "\n",
+ "def forward_model_copy_number(prot_list,custom_labeled_pdf=None):\n",
+ " \"\"\"\n",
+ " Code to perform copy number analysis on each protein in the model. Writes output files where each row is ordered\n",
+ " according to the time point in the model and the first column is the mean copy number, while the second column is\n",
+ " the standard deviation in copy number.\n",
+ " :param prot_list (list): list of proteins in the model. These proteins are searched for in each config file.\n",
+ " :param custom_labeled_pdf (str, optional): path to custom labeled probability distribution file output by\n",
+ " spatiotemporal.create_DAG.\n",
+ " \"\"\"\n",
+ " # find folder with config files\n",
+ " if custom_labeled_pdf:\n",
+ " _labeled_pdf = custom_labeled_pdf\n",
+ " else:\n",
+ " _labeled_pdf = '../Trajectories_Modeling/output/labeled_pdf.txt'\n",
+ "\n",
+ " # Read in labeled_pdf file into a dictionary. Each trajectory is listed as a dictionary,\n",
+ " # with keys as the trajectory and the values as the probability of that trajectory\n",
+ " prob_dict = read_labeled_pdf(_labeled_pdf)\n",
+ "\n",
+ " # Loop over the full dictionary. Create a list with 2 values:\n",
+ " # 1) the probability of the state, 2) the protein copy number of that state.\n",
+ " key_list = prob_dict.keys()\n",
+ " prot_prob = []\n",
+ " for key in key_list:\n",
+ " CN,N_times = copy_number_from_state(prot_list,key)\n",
+ " prot_prob.append([prob_dict[key], CN])\n",
+ "\n",
+ " # Construct the full path to the output directory\n",
+ " dir_name = \"forward_model_copy_number\"\n",
+ " full_path = os.path.join(main_dir, dir_name)\n",
+ " os.makedirs(full_path, exist_ok=True)\n",
+ " os.chdir(full_path)\n",
+ "\n",
+ " # Determine copy number from the prot_prob\n",
+ " for index in range(len(prot_prob[0][1][0])):\n",
+ " copy_number = np.zeros((N_times, 2))\n",
+ " # calculate mean\n",
+ " for state in prot_prob:\n",
+ " for i in range(N_times):\n",
+ " copy_number[i, 0] += state[0] * state[1][i][index]\n",
+ " # calculate std deviation\n",
+ " for state in prot_prob:\n",
+ " for i in range(N_times):\n",
+ " # Calculate variance\n",
+ " copy_number[i, 1] += state[0] * ((state[1][i][index] - copy_number[i, 0]) ** 2)\n",
+ " # Take square root to get the standard deviation\n",
+ " copy_number[:, 1] = np.sqrt(copy_number[:, 1])\n",
+ " # save to file\n",
+ " np.savetxt('CN_prot_'+prot_list[index]+'.txt', copy_number, header='mean CN\\tstd CN')\n",
+ "\n",
+ "# 3b - comparison of the model to data used in modeling (copy number)\n",
+ "os.chdir(main_dir) # it is crucial that after each step, directory is changed back to main\n",
+ "forward_model_copy_number(expected_subcomplexes,custom_labeled_pdf='output/labeled_pdf.txt')\n",
+ "print(\"Step 3b: copy number validation IS COMPLETED\")\n",
+ "print(\"\")\n",
+ "print(\"\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "de81dea5-9d98-4424-ba64-d7a2def893b9",
+ "metadata": {},
+ "source": [
+ "Here, we plot the comparison between the experimental data used in model construction and the set of trajectories. This analysis includes the cross-correlation coefficient between the experimental EM density and the forward density of the set of sufficiently good scoring modeled structures in the highest weighted trajectory (a), as well as comparisons between experimental and modeled protein copy numbers for proteins A (b), B (c), and C (d). Here, we see the model is in good agreement with the data used to construct it.\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "305b6e15-1147-4706-bb26-44125961f9fb",
+ "metadata": {},
+ "source": [
+ "## Validation against data not used in model construction\n",
+ "\n",
+ "Finally, we aim to compare the model to data not used in model construction. Specifically, we reserved SAXS data for model validation. We aimed to compare the forward scattering profile from the centroid structural model of each snapshot model to the experimental profile. To make this comparison, we wrote functions that converted each centroid RMF to a PDB (`convert_rmfs`), copied the experimental SAXS profiles to the appropriate folder (`copy_SAXS_dat_files`), and ran [FoXS](https://integrativemodeling.org/tutorials/foxs/foxs.html) on each PDB to evaluate its agreement to the experimental profile (`process_foxs`)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d5cc728a-6159-4317-94eb-525dccc780aa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 4a - SAXS\n",
+ "\"\"\"\n",
+ "Comparing center models of the most dominant cluster for each snapshot (rmfs) to the SAXS data for each time point\n",
+ " can be done in two steps:\n",
+ "-converting rmfs to pdb files\n",
+ "-comparing pdbs of each snapshot to experimental SAXS profile using FoXS\n",
+ "\"\"\"\n",
+ "\n",
+ "def convert_rmfs(state_dict, model, custom_path=None):\n",
+ " \"\"\"\n",
+ " The purpose of this function is to automate the conversion of RMF files into PDB files for all the states from\n",
+ " state_dict. Created PDBs are further used in comparison of SAXS profiles using FoXS. Additionally, they can be\n",
+ " used for comparison to native PDB if available.\n",
+ "\n",
+ " :param state_dict (dict): dictionary that defines the spatiotemporal model.\n",
+ " The keys are strings that correspond to each time point in the\n",
+ " stepwise temporal process. Keys should be ordered according to the\n",
+ " steps in the spatiotemporal process. The values are integers that\n",
+ " correspond to the number of possible states at that timepoint.\n",
+ " :param model (str): An IMP (Integrative Modeling Platform) model object.\n",
+ " :param custom_path (optional - str): A custom path for the RMF file, allowing for flexibility in file location\n",
+ " (should be compliant with stat_dict).\n",
+ " \"\"\"\n",
+ "\n",
+ " for time in state_dict.keys():\n",
+ " for state in range(1, state_dict[time] + 1):\n",
+ " if custom_path:\n",
+ " sim_rmf = custom_path # option for custom path\n",
+ " else:\n",
+ " sim_rmf = f\"../../modeling/Snapshots/Snapshots_Assessment/exhaust_{state}_{time}/cluster.0/cluster_center_model.rmf3\"\n",
+ "\n",
+ " pdb_output = f\"snapshot{state}_{time}.pdb\" # define the output of converted .pdb file\n",
+ "\n",
+ " if os.path.exists(sim_rmf):\n",
+ " try:\n",
+ " rmf_fh = RMF.open_rmf_file_read_only(sim_rmf) # open rmf file for reading\n",
+ " rmf_hierarchy = IMP.rmf.create_hierarchies(rmf_fh, model)[0] # extract 1st hierarchy\n",
+ " IMP.atom.write_pdb_of_c_alphas(rmf_hierarchy, pdb_output) # write coordinates of CA to .pdb\n",
+ " print(f\"Finishing: snapshot{state}_{time}.pdb\")\n",
+ " except Exception as e:\n",
+ " print(f\"{sim_rmf} is empty or there is another problem: {e}\")\n",
+ "\n",
+ "\n",
+ "def copy_SAXS_dat_files(custom_src_dir = None):\n",
+ " \"\"\"\n",
+ " Copies all files ending with .dat from the specified directory to the current directory.\n",
+ "\n",
+ " :param custom_src_dir (optional - str): Path to the source directory\n",
+ " \"\"\"\n",
+ " if custom_src_dir:\n",
+ " src_dir = custom_src_dir\n",
+ " else:\n",
+ " src_dir = '../../../Input_Information/gen_SAXS'\n",
+ " try:\n",
+ " files = os.listdir(src_dir) # Get the list of all files in the src_dir directory\n",
+ " dat_files = [f for f in files if f.endswith('.dat')] # Filter out files that end with .dat\n",
+ "\n",
+ " # Copy each .dat file to the current directory, so FoXS can be used\n",
+ " for file_name in dat_files:\n",
+ " full_file_name = os.path.join(src_dir, file_name)\n",
+ " if os.path.isfile(full_file_name):\n",
+ " shutil.copy(full_file_name, os.getcwd())\n",
+ " # print(f\"Copied: {full_file_name} to {main_dir}\")\n",
+ "\n",
+ " print(\"All .dat files have been copied successfully...\")\n",
+ "\n",
+ " except Exception as e:\n",
+ " print(f\"An error occurred: {e}\")\n",
+ "\n",
+ "\n",
+ "def process_foxs(state_dict, custom_dat_file = None):\n",
+ " \"\"\"\n",
+ " This function automates the FoXS analysis for all specified time points in a single execution. It processes PDB\n",
+ " files generated by the convert_rmfs function and uses SAXS data copied with the copy_SAXS function. All of this\n",
+ " data should be present in the current running directory.\n",
+ " FoXS tutorial is available here: https://integrativemodeling.org/tutorials/foxs/foxs.html\n",
+ "\n",
+ " :param state_dict (dict): dictionary that defines the spatiotemporal model.\n",
+ " The keys are strings that correspond to each time point in the\n",
+ " stepwise temporal process. Keys should be ordered according to the\n",
+ " steps in the spatiotemporal process. The values are integers that\n",
+ " correspond to the number of possible states at that timepoint.\n",
+ " :param custom_dat_file (optional - str)): A custom name of SAXS files for each time point (should be\n",
+ " compliant with stat_dict)\n",
+ " \"\"\"\n",
+ "\n",
+ "\n",
+ " print(\"...lets proceed to FoXS\")\n",
+ "\n",
+ " for time in state_dict.keys():\n",
+ " try:\n",
+ " if state_dict[time] > 1:\n",
+ " # if there is more than one state in timepoint, FoXS creates fit.plt and it should be renamed\n",
+ " if custom_dat_file:\n",
+ " dat_file = custom_dat_file\n",
+ " else:\n",
+ " dat_file = f\"{time}_exp.dat\"\n",
+ "\n",
+ " pdb_files = \" \".join([f\"snapshot{state}_{time}.pdb\" for state in range(1, state_dict[time] + 1)])\n",
+ "\n",
+ " command1 = f\"foxs -r -g {pdb_files} {dat_file}\"\n",
+ " # example how FoXS command should look like: foxs -r -g snapshot1_0min.pdb snapshot2_0min.pdb snapshot3_0min.pdb 0min_exp.dat\n",
+ " os.system(command1)\n",
+ " print(f\"FoXS for {time} is calculated and ready to create a plot. Nr of states is: {state_dict[time]}\")\n",
+ "\n",
+ " command2 = f\"gnuplot fit.plt\" # create plot from .plt code\n",
+ " os.system(command2)\n",
+ "\n",
+ " command3 = f\"mv fit.plt {time}_FoXS.plt\" # rename .plt to avoid to be overwritten\n",
+ " os.system(command3)\n",
+ "\n",
+ " command4 = f\"mv fit.png {time}_FoXS.png\" # rename plot to avoid to be overwritten\n",
+ " os.system(command4)\n",
+ "\n",
+ " print(f\"Plot {time}_FoXS.png is created\")\n",
+ "\n",
+ " elif state_dict[time] == 1:\n",
+ " print(f\"There is only one state in {time}\")\n",
+ " dat_file1 = f\"{time}_exp.dat\"\n",
+ " pdb_file1 = f\"snapshot1_{time}.pdb\"\n",
+ "\n",
+ " command5 = f\"foxs -r -g {pdb_file1} {dat_file1}\"\n",
+ " os.system(command5)\n",
+ " print(f\"FoXS for {time} is calculated and ready to create a plot. Nr of states is: {state_dict[time]}\")\n",
+ "\n",
+ " command6 = f\"gnuplot snapshot1_{time}_{time}_exp.plt\"\n",
+ " os.system(command6)\n",
+ "\n",
+ " command7 = f\"mv snapshot1_{time}_{time}_exp.plt {time}_FoXS.plt\"\n",
+ " os.system(command7)\n",
+ "\n",
+ " command8 = f\"mv snapshot1_{time}_{time}_exp.png {time}_FoXS.png\"\n",
+ " os.system(command8)\n",
+ " else:\n",
+ " print(f\"There is no states in this timepoint. Check stat_dict.\")\n",
+ "\n",
+ " except Exception as e:\n",
+ " print(f\"FoXS can not be executed properly due to following problem: {e}\")\n",
+ "\n",
+ "\n",
+ "# 4a - SAXS\n",
+ "os.chdir(main_dir) # it is crucial that after each step, directory is changed back to main\n",
+ "SAXS_output = \"./SAXS_comparison/\"\n",
+ "os.makedirs(SAXS_output, exist_ok=True)\n",
+ "os.chdir(SAXS_output)\n",
+ "model = IMP.Model()\n",
+ "convert_rmfs(state_dict, model)\n",
+ "copy_SAXS_dat_files(custom_src_dir='../../modeling/Input_Information/gen_SAXS')\n",
+ "process_foxs(state_dict)\n",
+ "print(\"Step 4a: SAXS validation IS COMPLETED\")\n",
+ "print(\"\")\n",
+ "print(\"\")\n",
+ "os.chdir(main_dir)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bef8310b-99a2-4e71-b4b7-88c0dbb06def",
+ "metadata": {},
+ "source": [
+ "The output of this analysis is written to `SAXS_comparison`. Standard FoXS outputs are available for each snapshot model (`snapshot{state}_{time}.*`). In particular, the `.fit` files include the forward and experimental profiles side by side, with the $\\chi^2$ for the fit. Further, the `{time}_FoXS.*` files include the information for all snapshot models at that time point, including plots of each profile in comparison to the experimental profile (`{time}_FoXS.png`)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d4ebc14c-1eba-45ff-b98e-2938e65057a4",
+ "metadata": {},
+ "source": [
+ "As our model was generated from synthetic data, the ground truth structure is known at each time point. In addition to validating the model by assessing its comparison to SAXS data, we could approximate the model accuracy by comparing the snapshot model to the PDB structure, although this comparison is not perfect as the PDB structure was used to inform the structure of *rigid bodies* in the snapshot model. To do so, we wrote a function (`RMSD`) that calculates the RMSD between each structural model and the orignal PDB. The function is too computationally expensive to run in this notebook, but is found in the `Trajectories/Trajectories_Assessment/` folder and is demonstrated below.\n",
+ "\n",
+ "```python\n",
+ "# 4b - RMSD\n",
+ "os.chdir(main_dir) # it is crucial that after each step, directory is changed back to main\n",
+ "pdb_path = \"../../Input_Information/PDB/3rpg.pdb\"\n",
+ "RMSD(pdb_path=pdb_path, custom_n_plot=20)\n",
+ "print(\"Step 4b: RMSD validation IS COMPLETED\")\n",
+ "print(\"\")\n",
+ "print(\"\")\n",
+ "```\n",
+ "\n",
+ "The output of this function is written in `RMSD_calculation_output`. The function outputs `rmsd_{state}_{time}.png` files, which plots the RMSD for each structural model within each snapshot model. This data is then summarized in `RMSD_analysis.txt`, which includes the minimum RMSD, average RMSD, and number of structural models in each snapshot model.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "64e1a2f6-7258-462d-bde3-7734743b5aa1",
+ "metadata": {},
+ "source": [
+ "Finally, we plot the results for assessing the spatiotemporal model with data not used to construct it. Comparisons are made between the centroid structure of the most populated cluster in each snapshot model at each time point and the experimental SAXS profile for 0 (a), 1 (b), and 2 (c) minutes. Further, we plot both the sampling precision (dark red) and the RMSD to the PDB structure (light red) for each snapshot model in the highest weighted trajectory (d).\n",
+ "\n",
+ "\n",
+ "\n",
+ "To quantitatively compare the model to SAXS data, we used the $\\chi^2$ to compare each snapshot model to the experimental profile. We note that the $\\chi^2$ are substantially lower for the models along the highest weighted trajectory (1_0min, 1_1min, and 1_2min) than for other models, indicating that the highest weighted trajectory is in better agreement with the experimental SAXS data than other possible trajectories.\n",
+ "\n",
+ "\n",
+ "\n",
+ "Next, we can evaluate the accuracy of the model by comparing the RMSD to the PDB to the sampling precision of each snapshot model. We note that this is generally not possible, because in most biological applications the ground truth is not known. In this case, if the average RMSD to the PDB structure is smaller than the sampling precision, the PDB structure lies within the precision of the model. We find that the RMSD is within 1.5 Å of the sampling precision at all time points, indicating that the model lies within 1.5 Å of the ground truth.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "98f7a12c-d7e9-49ed-9a16-b713c3b4762a",
+ "metadata": {},
+ "source": [
+ "# Next steps\n",
+ "\n",
+ "After assessing our model, we can must decide if the model is sufficient to answer biological questions of interest. If the model does not have sufficient precision for the desired application, assessment of the current model can be used to inform which new experiments may help improve the next iteration of the model. The [integrative spatiotemporal modeling procedure](https://integrativemodeling.org/tutorials/spatiotemporal/index.html#steps) can then be repeated iteratively, analogous to [integrative modeling of static structures](https://integrativemodeling.org/2.21.0/doc/manual/intro.html#procedure).\n",
+ "\n",
+ "If the model is sufficient to provide insight into the biological process of interest, the user may decide that it is ready for publication. In this case, the user should create an [mmCIF file](https://mmcif.wwpdb.org/) to deposit the model in the [PDB-dev database](https://pdb-dev.wwpdb.org/). This procedure is explained in the [deposition tutorial](https://integrativemodeling.org/tutorials/deposition/develop/).\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/Jupyter/.ipynb_checkpoints/spatiotemporal-colab-checkpoint.ipynb b/Jupyter/.ipynb_checkpoints/spatiotemporal-colab-checkpoint.ipynb
index c55a16c91..8e9b5c444 100644
--- a/Jupyter/.ipynb_checkpoints/spatiotemporal-colab-checkpoint.ipynb
+++ b/Jupyter/.ipynb_checkpoints/spatiotemporal-colab-checkpoint.ipynb
@@ -134,7 +134,7 @@
"metadata": {},
"source": [
"From the output of `prepare_protein_library`, we see that there are 3 heterogeneity models at each time point (it is possible to have more heterogeneity models than copy numbers if multiple copies of the protein exist in the complex). For each heterogeneity model, we see 2 files:\n",
- "- *.config, a file with a list of proteins represented in the heterogeneity model\n",
+ "- *.config, a file with a list of proteins represented in the heterogeneity model.\n",
"- *_topol.txt, a topology file for snapshot modeling corresponding to this heterogeneity model.\n",
"\n",
"# Heterogeneity modeling step 3: assessment\n",
@@ -194,7 +194,7 @@
"|E3-ubi-RING2|green|3rpg.fasta.txt|E3-ubi-RING2|3rpg.pdb|C|45,116|-15|1|10|6|3||\n",
"```\n",
"\n",
- "Next, we must prepare `static_snapshot.py` to read in this topology file. We begin by defining the input variables, `state` and `time`, which define which topology to use, as well as the paths to other pieces of input information.\n",
+ "Next, we must prepare `static_snapshot.py` to read in this topology file. We begin by defining the input variables, `state` and `time`, which define which heterogeneity model to use, as well as the paths to other pieces of input information.\n",
"\n",
"```python\n",
"# Running parameters to access correct path of ET_data for EM restraint\n",
@@ -698,7 +698,7 @@
"- `dag_heatmap.eps` and `dag_heatmap`: image of the directed acyclic graph from the set of models.\n",
"- `path*.txt`: files where each row includes a `{state}_{time}` string, so that rows correspond to the states visited over that trajectory. Files are numbered from the most likely path to the least likely path.\n",
"\n",
- "Now that we have a trajectory model, we can plot the directed acyclic graph (left) and the series of centroid models from each snapshot model along the most likely trajectory (right). Each row corresponds to a different time point in the assembly process (0 min, 1 min, and 2 min). Each node is shaded according to its weight in the final model ($W(X_{N,t}N_{t})$). Proteins are colored as A - blue, B - orange, and C - purple.\n",
+ "Now that we have a trajectory model, we can plot the directed acyclic graph (left) and the series of centroid models from each snapshot model along the most likely trajectory (right). Each row corresponds to a different time point in the assembly process (0 min, 1 min, and 2 min). Each node is shaded according to its weight in the final model ($W(X_{t})$). Proteins are colored as A - blue, B - orange, and C - purple.\n",
"\n",
"\n"
]
diff --git a/Jupyter/spatiotemporal-colab.ipynb b/Jupyter/spatiotemporal-colab.ipynb
index e9e37f135..8e9b5c444 100644
--- a/Jupyter/spatiotemporal-colab.ipynb
+++ b/Jupyter/spatiotemporal-colab.ipynb
@@ -134,7 +134,7 @@
"metadata": {},
"source": [
"From the output of `prepare_protein_library`, we see that there are 3 heterogeneity models at each time point (it is possible to have more heterogeneity models than copy numbers if multiple copies of the protein exist in the complex). For each heterogeneity model, we see 2 files:\n",
- "- *.config, a file with a list of proteins represented in the heterogeneity model\n",
+ "- *.config, a file with a list of proteins represented in the heterogeneity model.\n",
"- *_topol.txt, a topology file for snapshot modeling corresponding to this heterogeneity model.\n",
"\n",
"# Heterogeneity modeling step 3: assessment\n",
@@ -194,7 +194,7 @@
"|E3-ubi-RING2|green|3rpg.fasta.txt|E3-ubi-RING2|3rpg.pdb|C|45,116|-15|1|10|6|3||\n",
"```\n",
"\n",
- "Next, we must prepare `static_snapshot.py` to read in this topology file. We begin by defining the input variables, `state` and `time`, which define which topology to use, as well as the paths to other pieces of input information.\n",
+ "Next, we must prepare `static_snapshot.py` to read in this topology file. We begin by defining the input variables, `state` and `time`, which define which heterogeneity model to use, as well as the paths to other pieces of input information.\n",
"\n",
"```python\n",
"# Running parameters to access correct path of ET_data for EM restraint\n",
diff --git a/Jupyter/spatiotemporal.ipynb b/Jupyter/spatiotemporal.ipynb
index e63880465..c1f58a8d5 100644
--- a/Jupyter/spatiotemporal.ipynb
+++ b/Jupyter/spatiotemporal.ipynb
@@ -14,10 +14,10 @@
"\n",
"Biomolecules are constantly in motion; therefore, a complete depiction of their function must include their dynamics instead of just static structures. We have developed an integrative spatiotemporal approach to model dynamic systems.\n",
"\n",
- "Our approach applies a composite workflow, consisting of three modeling problems to compute (i) heterogeneity models, (ii) snapshot models, and (iii) trajectory models.\n",
+ "Our approach applies a composite workflow, consisting of three modeling problems to compute (i) heterogeneity models, (ii) snapshot models, and (iii) a trajectory model.\n",
"Heterogeneity models describe the possible biomolecular compositions of the system at each time point. Optionally, other auxiliary variables can be considered, such as the coarse location in the final state when modeling an assembly process.\n",
"For each heterogeneity model, one snapshot model is produced. A snapshot model is a set of alternative standard static integrative structure models based on the information available for the corresponding time point.\n",
- "Then, trajectory models are created by connecting alternative snapshot models at adjacent time points. These trajectory models can be scored based on both the scores of static structures and the transitions between them, allowing for the creation of trajectories that are in agreement with the input information by construction.\n",
+ "Then, a set of trajectories ranked by their agreement with input information is computed by connecting alternative snapshot models at adjacent time points (*i.e.*, the “trajectory model”). This trajectory model can be scored based on both the scores of static structures and the transitions between them, allowing for the creation of trajectories that are in agreement with the input information by construction.\n",
"\n",
"If you use this tutorial or its accompanying method, please site the corresponding publications:\n",
"\n",
@@ -44,7 +44,7 @@
"Modeling of heterogeneity\n",
"====================================\n",
"\n",
- "Here, we describe the first modeling problem in our composite workflow, how to build models of heterogeneity modeling using IMP. In this tutorial, heterogeneity modeling only includes protein copy number; however, in general, other types of information, such as the coarse location in the final state, could also be included in heterogeneity models.\n",
+ "Here, we describe the first modeling problem in our composite workflow, how to build models of heterogeneity using IMP. In this tutorial, heterogeneity modeling only includes protein copy number; however, in general, other types of information, such as the coarse location in the final state, could also be included in heterogeneity models.\n",
"\n",
"# Heterogeneity modeling step 1: gathering of information\n",
"\n",
@@ -56,12 +56,12 @@
"\n",
"# Heterogeneity modeling step 2: representation, scoring function, and search process\n",
"\n",
- "Next, we represent, score and search for heterogeneity models models. A single heterogeneity model is a set of protein copy numbers, scored according to its fit to experimental copy number data at that time point. As ET and SAXS data, are only available at 0 minutes, 1 minute, and 2 minutes, we choose to create heterogeneity models at these three time points. We then use `prepare_protein_library`, to calculate the protein copy numbers for each snapshot model and to use the topology file of the full complex (`spatiotemporal_topology.txt`) to generate a topology file for each of these snapshot models. The choices made in this topology file are important for the representation, scoring function, and search process for snapshot models, and are discussed later. For heterogeneity modeling, we choose to model 3 protein copy numbers at each time point, and restrict the final time point to have the same protein copy numbers as the PDB structure.\n"
+ "Next, we represent, score and search for heterogeneity models models. A single heterogeneity model is a set of protein copy numbers, scored according to its fit to experimental copy number data at that time point. As ET and SAXS data, are only available at 0 minutes, 1 minute, and 2 minutes, we choose to create heterogeneity models at these three time points. We then use `prepare_protein_library`, to calculate the protein copy numbers for each heterogeneity model and to use the topology file of the full complex (`spatiotemporal_topology.txt`) to generate a topology file for each corresponding snapshot model. The choices made in this topology file are important for the representation, scoring function, and search process for snapshot models, and are discussed later. For heterogeneity modeling, we choose to model 3 protein copy numbers at each time point, and restrict the final time point to have the same protein copy numbers as the PDB structure.\n"
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 49,
"id": "1ff4d3e5-04de-4092-8cfc-2ad3018675da",
"metadata": {},
"outputs": [],
@@ -108,13 +108,13 @@
"id": "b56fbe48-da12-412e-ac2e-dca673e04a43",
"metadata": {},
"source": [
- "From the output of `prepare_protein_library`, we see that there are 3 heterogeneity models at each time point (it is possible to have more snapshot models than copy numbers if multiple copies of the protein exist in the complex). For each heterogeneity model, we see 2 files:\n",
- "- *.config, a file with a list of proteins represented in the heterogeneity model\n",
+ "From the output of `prepare_protein_library`, we see that there are 3 heterogeneity models at each time point (it is possible to have more heterogeneity models than copy numbers if multiple copies of the protein exist in the complex). For each heterogeneity model, we see 2 files:\n",
+ "- *.config, a file with a list of proteins represented in the heterogeneity model.\n",
"- *_topol.txt, a topology file for snapshot modeling corresponding to this heterogeneity model.\n",
"\n",
"# Heterogeneity modeling step 3: assessment\n",
"\n",
- "Now, we have a variety of heterogeneity models. In general, there are four ways to assess a model: estimate the sampling precision, compare the model to data used to construct it, validate the model against data not used to construct it, and quantify the precision of the model. Here, we will focus specifically on comparing the model to experimental data, as other assessments will be performed later, when the trajectory models are assessed.\n",
+ "Now, we have a variety of heterogeneity models. In general, there are four ways to assess a model: estimate the sampling precision, compare the model to data used to construct it, validate the model against data not used to construct it, and quantify the precision of the model. Here, we will focus specifically on comparing the model to experimental data, as other assessments will be performed later, when the trajectory model is assessed.\n",
"\n",
"Next, we can plot the modeled and experimental copy numbers simultaneously for each protein, as shown below for proteins A (a), B (b), and C (c).\n",
"\n",
@@ -131,7 +131,7 @@
"Modeling of snapshots\n",
"====================================\n",
"\n",
- "Here, we describe the second modeling problem in our composite workflow, how to build models of static snapshot models using IMP. We note that this process is similar to previous tutorials of [actin](https://integrativemodeling.org/tutorials/actin/) and [RNA PolII](https://integrativemodeling.org/tutorials/rnapolii_stalk/).\n",
+ "Here, we describe the second modeling problem in our composite workflow, how to build static snapshot models using IMP. We note that this process is similar to previous tutorials of [actin](https://integrativemodeling.org/tutorials/actin/) and [RNA PolII](https://integrativemodeling.org/tutorials/rnapolii_stalk/).\n",
"\n",
"# Snapshot modeling step 1: gathering of information\n",
"\n",
@@ -139,7 +139,7 @@
"\n",
"\n",
"\n",
- "The heterogeneity models inform protein copy numbers for the snapshot models. The PDB structure of the complex informs the structure of the individual proteins. The time-dependent ET data informs the size and shape of the assembling complex. physical principles inform connectivity and excluded volume.\n",
+ "The heterogeneity models inform protein copy numbers for the snapshot models. The PDB structure of the complex informs the structure of the individual proteins. The time-dependent ET data informs the size and shape of the assembling complex. Physical principles inform connectivity and excluded volume.\n",
"\n",
"# Snapshot modeling step 2: representation, scoring function, and search process\n",
"\n",
@@ -156,7 +156,7 @@
"\n",
"Beads and Gaussians in our model belong to either a *rigid body* or *flexible string*. The positions of all beads and Gaussians in a single rigid body are constrained during sampling and do not move relative to each other. Meanwhile, flexible beads can move freely during sampling, but are restrained by sequence connectivity.\n",
"\n",
- "To begin, we built a topology file with the representation for the model of the complete system, `spatiotemporal_topology.txt`, located in `Heterogeneity/Heterogeneity_Modeling/`. This complete topology was used as a template to build topologies of each heterogeneity model. Based on our observation of the structure of the complex, we chose to represent each protein with at least 2 separate rigid bodies, and left the first 28 residues of protein C as flexible beads. Rigid bodies were described with 1 bead for every residue, and 10 residues per Gaussian. Flexible beads were described with 1 bead for every residue and 1 residue per Gaussian. A more complete description of the options available in topology files is available in the the [TopologyReader](https://integrativemodeling.org/2.21.0/doc/ref/classIMP_1_1pmi_1_1topology_1_1TopologyReader.html) documentation.\n",
+ "To begin, we built a topology file with the representation for the model of the complete system, `spatiotemporal_topology.txt`, located in `Heterogeneity/Heterogeneity_Modeling/`. This complete topology was used as a template to build topologies for each snapshot model. Based on our observation of the structure of the complex, we chose to represent each protein with at least 2 separate rigid bodies, and left the first 28 residues of protein C as flexible beads. Rigid bodies were described with 1 bead for every residue, and 10 residues per Gaussian. Flexible beads were described with 1 bead for every residue and 1 residue per Gaussian. A more complete description of the options available in topology files is available in the the [TopologyReader](https://integrativemodeling.org/2.21.0/doc/ref/classIMP_1_1pmi_1_1topology_1_1TopologyReader.html) documentation.\n",
"\n",
"```\n",
"|molecule_name | color | fasta_fn | fasta_id | pdb_fn | chain | residue_range | pdb_offset | bead_size | em_residues_per_gaussian | rigid_body | super_rigid_body | chain_of_super_rigid_bodies | \n",
@@ -169,7 +169,7 @@
"|E3-ubi-RING2|green|3rpg.fasta.txt|E3-ubi-RING2|3rpg.pdb|C|45,116|-15|1|10|6|3||\n",
"```\n",
"\n",
- "Next, we must prepare `static_snapshot.py` to read in this topology file. We begin by defining the input variables, `state` and `time`, which define which topology to use, as well as the paths to other pieces of input information.\n",
+ "Next, we must prepare `static_snapshot.py` to read in this topology file. We begin by defining the input variables, `state` and `time`, which define which heterogeneity model to use, as well as the paths to other pieces of input information.\n",
"\n",
"```python\n",
"# Running parameters to access correct path of ET_data for EM restraint\n",
@@ -278,7 +278,6 @@
" monte_carlo_steps=200, # Number of MC steps between writing frames.\n",
" number_of_best_scoring_models=0,\n",
" number_of_frames=500) # number of frames to be saved\n",
- "# In our case, for each snapshot we generated 25000 frames altogether (50*500)\n",
"rex.execute_macro()\n",
"```\n",
"\n",
@@ -311,7 +310,7 @@
"\n",
"# Snapshot modeling step 3: assessment\n",
"\n",
- "The above code would variety of alternative snapshot models. In general, we would like to assess these models in at least 4 ways: estimate the sampling precision, compare the model to data used to construct it, validate the model against data not used to construct it, and quantify the precision of the model. In this portion of the tutorial, we focus specifically on estimating the sampling precision of the model, while quantitative comparisons between the model and experimental data will be reserved for the final step, when we assess trajectories. Again, this assessment process is quite computationally intensive, so, instead of running the script explicitly, we will walk you through the `snapshot_assessment.py` script, which is located in the `modeling/Snapshots/Snapshots_Assessment` folder.\n",
+ "The above code would create a variety of alternative snapshot models. In general, we would like to assess these models in at least 4 ways: estimate the sampling precision, compare the model to data used to construct it, validate the model against data not used to construct it, and quantify the precision of the model. In this portion of the tutorial, we focus specifically on estimating the sampling precision of the model, while quantitative comparisons between the model and experimental data will be reserved for the final step, when we assess trajectories. Again, this assessment process is quite computationally intensive, so, instead of running the script explicitly, we will walk you through the `snapshot_assessment.py` script, which is located in the `modeling/Snapshots/Snapshots_Assessment` folder.\n",
"\n",
"## Filtering good scoring models\n",
"\n",
@@ -456,7 +455,7 @@
"Modeling of a Trajectory\n",
"====================================\n",
"\n",
- "Here, we describe the final modeling problem in our composite workflow, how to build models of trajectory models using IMP.\n",
+ "Here, we describe the final modeling problem in our composite workflow, how to build a trajectory model using IMP.\n",
"\n",
"# Trajectory modeling step 1: gathering of information\n",
"\n",
@@ -468,29 +467,29 @@
"\n",
"# Trajectory modeling step 2: representation, scoring function, and search process\n",
"\n",
- "Trajectory modeling connects alternative snapshot models at adjacent time points, followed by scoring the trajectory models based on their fit to the input information, as described in full [here](https://www.biorxiv.org/content/10.1101/2024.08.06.606842v1.abstract).\n",
+ "Trajectory modeling connects alternative snapshot models at adjacent time points, followed by scoring the trajectories based on their fit to the input information, as described in full [here](https://www.biorxiv.org/content/10.1101/2024.08.06.606842v1.abstract).\n",
"\n",
"## Background behind integrative spatiotemporal modeling\n",
"### Representing the model\n",
"\n",
- "We choose to represent dynamic processes as a trajectory of snapshot models, with one snapshot model at each time point. In this case, we computed snapshot models at 3 time points (0, 1, and 2 minutes), so a single trajectory model will consist of 3 snapshot models, one at each 0, 1, and 2 minutes. The modeling procedure described here will produce a set of scored trajectory models, which can be displayed as a directed acyclic graph, where nodes in the graph represent the snapshot model and edges represent connections between snapshot models at neighboring time points.\n",
+ "We choose to represent dynamic processes as a trajectory of snapshot models, with one snapshot model at each time point. In this case, we computed snapshot models at 3 time points (0, 1, and 2 minutes), so a single trajectory will consist of 3 snapshot models, one at each 0, 1, and 2 minutes. The modeling procedure described here will produce a set of scored trajectories, which can be displayed as a directed acyclic graph, where nodes in the graph represent the snapshot model and edges represent connections between snapshot models at neighboring time points.\n",
"\n",
"### Scoring the model\n",
"\n",
- "To score trajectory models, we incorporate both the scores of individual snapshot models, as well as the scores of transitions between them. Under the assumption that the process is Markovian (*i.e.* memoryless), the weight of a trajectory model takes the form:\n",
+ "To score trajectories, we incorporate both the scores of individual snapshot models, as well as the scores of transitions between them. Under the assumption that the process is Markovian (*i.e.* memoryless), the weight of a trajectory takes the form:\n",
"\n",
"$$\n",
"W(\\chi) \\propto \\displaystyle\\prod^{T}_{t=0} P( X_{t} | D_{t}) \\cdot \\displaystyle\\prod^{T-1}_{t=0} W(X_{t+1} | X_{t},D_{t,t+1}),\n",
"$$\n",
"\n",
- "where $t$ indexes times from 0 until the final modeled snapshot ($T$); $P(X_{t} | D_{t})$ is the snapshot model score; and $W(X_{t+1} | X_{t},D_{t,t+1})$ is the transition score. Trajectory model weights ($W(\\chi)$) are normalized so that the sum of all trajectory models' weights is 1.0. Transition scores are currently based on a simple metric that either allows or disallows a transition. Transitions are only allowed if all proteins in the first snapshot model are included in the second snapshot model. In the future, we hope to include more detailed transition scoring terms, which may take into account experimental information or physical models of macromolecular dynamics.\n",
+ "where $t$ indexes times from 0 until the final modeled snapshot ($T$); $P(X_{t} | D_{t})$ is the snapshot model score; and $W(X_{t+1} | X_{t},D_{t,t+1})$ is the transition score. Trajectory weights ($W(\\chi)$) are normalized so that the sum of all trajectory weights is 1.0. Transition scores are currently based on a simple metric that either allows or disallows a transition. Transitions are only allowed if all proteins in the first snapshot model are included in the second snapshot model. In the future, we hope to include more detailed transition scoring terms, which may take into account experimental information or physical models of macromolecular dynamics.\n",
"\n",
"### Searching for good scoring models\n",
"\n",
- "Trajectory models are constructed by enumerating all connections between adjacent snapshot models and scoring these trajectory models according to the equation above. This procedure results in a set of weighted trajectory models.\n",
+ "Trajectories are constructed by enumerating all connections between adjacent snapshot models and scoring these trajectories according to the equation above. This procedure results in a set of weighted trajectories.\n",
"\n",
"## Computing trajectory models\n",
- "To compute trajectory models, we first copy all necessary files to a new directory, `data`. These files are (i) `{state}_{time}.config` files, which include the subcomplexes that are in each state, (ii) `{state}_{time}_scores.log`, which is a list of all scores of all structural models in that snapshot model, and (iii) `exp_comp{prot}.csv`, which is the experimental copy number for each protein (`{prot}`) as a function of time. Here, we copy files related to the snapshots (`*.log` files) from the `modeling` directory, as we skipped computing snapshots due to the computational expense.\n"
+ "To compute trajectories, we first copy all necessary files to a new directory, `data`. These files are (i) `{state}_{time}.config` files, which include the subcomplexes that are in each state, (ii) `{state}_{time}_scores.log`, which is a list of all scores of all structural models in that snapshot model, and (iii) `exp_comp{prot}.csv`, which is the experimental copy number for each protein (`{prot}`) as a function of time. Here, we copy files related to the snapshot models (`*.log` files) from the `modeling` directory, as we skipped computing snapshot models due to the computational expense.\n"
]
},
{
@@ -634,7 +633,7 @@
"metadata": {},
"source": [
"Next, we compute the spatiotemporal model. The inputs we included are:\n",
- "- state_dict (dict): a dictionary that defines the spatiotemporal model. Keys are strings for each time point in the spatiotemporal process and values are integers corresponding to the number of snapshot models computed at that time point\n",
+ "- state_dict (dict): a dictionary that defines the spatiotemporal model. Keys are strings for each time point in the spatiotemporal process and values are integers corresponding to the number of snapshot models computed at that time point.\n",
"- out_pdf (bool): whether to write the probability distribution function (pdf).\n",
"- npaths (int): Number of states two write to a file (path*.txt).\n",
"- input_dir (str): directory with the input information.\n",
@@ -668,13 +667,13 @@
"metadata": {},
"source": [
"After running `spatiotemporal.create_DAG`, a variety of outputs are written:\n",
- "- `cdf.txt`: the cumulative distribution function for the set of trajectory models.\n",
- "- `pdf.txt`: the probability distribution function for the set of trajectory models.\n",
- "- `labeled_pdf.txt`: Each row has 2 columns and represents a different trajectory model. The first column labels a single trajectory model as a series of snapshot models, where each snapshot model is written as `{state}_{time}|` in sequential order. The second column is the probability distribution function corresponding to that trajectory model.\n",
+ "- `cdf.txt`: the cumulative distribution function for the set of trajectories.\n",
+ "- `pdf.txt`: the probability distribution function for the set of trajectories.\n",
+ "- `labeled_pdf.txt`: Each row has 2 columns and represents a different trajectory. The first column labels a single trajectory as a series of snapshot models, where each snapshot model is written as `{state}_{time}|` in sequential order. The second column is the probability distribution function corresponding to that trajectory.\n",
"- `dag_heatmap.eps` and `dag_heatmap`: image of the directed acyclic graph from the set of models.\n",
- "- `path*.txt`: files where each row includes a `{state}_{time}` string, so that rows correspond to the states visited over that trajectory model. Files are numbered from the most likely path to the least likely path.\n",
+ "- `path*.txt`: files where each row includes a `{state}_{time}` string, so that rows correspond to the states visited over that trajectory. Files are numbered from the most likely path to the least likely path.\n",
"\n",
- "Now that we have a trajectory model, we can plot the directed acyclic graph (left) and the series of centroid models from each snapshot model along the most likely trajectory model (right). Each row corresponds to a different time point in the assembly process (0 min, 1 min, and 2 min). Each node is shaded according to its weight in the final model ($W(X_{N,t}N_{t})$). Proteins are colored as A - blue, B - orange, and C - purple.\n",
+ "Now that we have a trajectory model, we can plot the directed acyclic graph (left) and the series of centroid models from each snapshot model along the most likely trajectory (right). Each row corresponds to a different time point in the assembly process (0 min, 1 min, and 2 min). Each node is shaded according to its weight in the final model ($W(X_{t})$). Proteins are colored as A - blue, B - orange, and C - purple.\n",
"\n",
"\n"
]
@@ -687,13 +686,13 @@
"\n",
"# Trajectory modeling step 3: assessment\n",
"\n",
- "Now that the set of spatiotemporal models has been constructed, we must evaluate these models. We can evaluate these models in at least 4 ways: estimate the sampling precision, compare the model to data used to construct it, validate the model against data not used to construct it, and quantify the precision of the model.\n",
+ "Now that the set of spatiotemporal models has been constructed, we must evaluate these models. We can evaluate these models in at least 4 ways: estimating the sampling precision, comparing the model to data used to construct it, validating the model against data not used to construct it, and quantifying the precision of the model.\n",
"\n",
"## Sampling precision\n",
"\n",
- "To begin, we calculate the sampling precision of the models. The sampling precision is calculated by using `spatiotemporal.create_DAG` to reconstruct the set of trajectory models using 2 independent sets of samplings for snapshot models. Then, the overlap between these snapshot models is evaluated using `analysis.temporal_precision`, which takes in two `labeled_pdf` files.\n",
+ "To begin, we calculate the sampling precision of the models. The sampling precision is calculated by using `spatiotemporal.create_DAG` to reconstruct the set of trajectories using 2 independent sets of samplings for snapshot models. Then, the overlap between these snapshot models is evaluated using `analysis.temporal_precision`, which takes in two `labeled_pdf` files.\n",
"\n",
- "The temporal precision can take values between 1.0 and 0.0, and indicates the overlap between the two models in trajectory space. Hence, values close to 1.0 indicate a high sampling precision, while values close to 0.0 indicate a low sampling precision. Here, the value close to 1.0 indicates that sampling does not affect the weights of the trajectory models.\n"
+ "The temporal precision can take values between 1.0 and 0.0, and indicates the overlap between the two models in trajectory space. Hence, values close to 1.0 indicate a high sampling precision, while values close to 0.0 indicate a low sampling precision. Here, the value close to 1.0 indicates that sampling does not affect the weights of the trajectories.\n"
]
},
{
@@ -840,9 +839,9 @@
"source": [
"## Model precision\n",
"\n",
- "Next, we calculate the precision of the model, using `analysis.precision`. Here, the model precision calculates the number of trajectory models with high weights. The precision ranges from 1.0 to 1/d, where d is the number of trajectory models. Values approaching 1.0 indicate the model set can be described by a single trajectory model, while values close to 1/d indicate that all trajectory models have similar weights.\n",
+ "Next, we calculate the precision of the model, using `analysis.precision`. Here, the model precision calculates the number of trajectories with high weights. The precision ranges from 1.0 to 1/d, where d is the number of trajectories. Values approaching 1.0 indicate the model set can be described by a single trajectory, while values close to 1/d indicate that all trajectories have similar weights.\n",
"\n",
- "The `analysis.precision` function reads in the `labeled_pdf` of the complete model, and calculates the precision of the model. The value close to 1.0 indicates that the set of models can be sufficiently represented by a single trajectory model."
+ "The `analysis.precision` function reads in the `labeled_pdf` of the complete model, and calculates the precision of the model. The value close to 1.0 indicates that the set of models can be sufficiently represented by a single trajectory."
]
},
{
@@ -891,7 +890,7 @@
"id": "70a6e0a0-98d5-4b63-a839-221a4fdc493b",
"metadata": {},
"source": [
- "After comparing the model to EM data, we aimed to compare the model to copy number data, and wrote the `forward_model_copy_number` function to evaluate the copy numbers from our set of trajectory models. The output of `forward_model_copy_number` is written in `forward_model_copy_number/`. The folder contains `CN_prot_{prot}.txt` files for each protein, which have the mean and standard deviation of protein copy number at each time point. We can then plot these copy numbers from the forward models against those from the experiment, as shown below."
+ "After comparing the model to EM data, we aimed to compare the model to copy number data, and wrote the `forward_model_copy_number` function to evaluate the copy numbers from our set of trajectories. The output of `forward_model_copy_number` is written in `forward_model_copy_number/`. The folder contains `CN_prot_{prot}.txt` files for each protein, which have the mean and standard deviation of protein copy number at each time point. We can then plot these copy numbers from the forward models against those from the experiment, as shown below."
]
},
{
@@ -1034,7 +1033,7 @@
"id": "de81dea5-9d98-4424-ba64-d7a2def893b9",
"metadata": {},
"source": [
- "Here, we plot the comparison between the experimental data used in model construction and the set of trajectory models. This analysis includes the cross-correlation coefficient between the experimental EM density and the forward density of the set of sufficiently good scoring modeled structures in the highest weighted trajectory model (a), as well as comparisons between experimental and modeled protein copy numbers for proteins A (b), B (c), and C (d). Here, we see the model is in good agreement with the data used to construct it.\n",
+ "Here, we plot the comparison between the experimental data used in model construction and the set of trajectories. This analysis includes the cross-correlation coefficient between the experimental EM density and the forward density of the set of sufficiently good scoring modeled structures in the highest weighted trajectory (a), as well as comparisons between experimental and modeled protein copy numbers for proteins A (b), B (c), and C (d). Here, we see the model is in good agreement with the data used to construct it.\n",
"\n",
""
]
@@ -1244,11 +1243,11 @@
"id": "64e1a2f6-7258-462d-bde3-7734743b5aa1",
"metadata": {},
"source": [
- "Finally, we plot the results for assessing the spatiotemporal model with data not used to construct it. Comparisons are made between the centroid structure of the most populated cluster in each snapshot model at each time point and the experimental SAXS profile for 0 (a), 1 (b), and 2 (c) minutes. Further, we plot both the sampling precision (dark red) and the RMSD to the PDB structure (light red) for each snapshot model in the highest trajectory model (d).\n",
+ "Finally, we plot the results for assessing the spatiotemporal model with data not used to construct it. Comparisons are made between the centroid structure of the most populated cluster in each snapshot model at each time point and the experimental SAXS profile for 0 (a), 1 (b), and 2 (c) minutes. Further, we plot both the sampling precision (dark red) and the RMSD to the PDB structure (light red) for each snapshot model in the highest weighted trajectory (d).\n",
"\n",
"\n",
"\n",
- "To quantitatively compare the model to SAXS data, we used the $\\chi^2$ to compare each snapshot model to the experimental profile. We note that the $\\chi^2$ are substantially lower for the models along the highest trajectory model (1_0min, 1_1min, and 1_2min) than for other models, indicating that the highest weighted trajectory model is in better agreement with the experimental SAXS data than other possible trajectory models.\n",
+ "To quantitatively compare the model to SAXS data, we used the $\\chi^2$ to compare each snapshot model to the experimental profile. We note that the $\\chi^2$ are substantially lower for the models along the highest weighted trajectory (1_0min, 1_1min, and 1_2min) than for other models, indicating that the highest weighted trajectory is in better agreement with the experimental SAXS data than other possible trajectories.\n",
"\n",
"\n",
"\n",
diff --git a/doc/snapshot.md b/doc/snapshot.md
index 58aa215ce..536c29c53 100644
--- a/doc/snapshot.md
+++ b/doc/snapshot.md
@@ -40,7 +40,7 @@ To begin, we built a topology file with the representation for the model of the
|E3-ubi-RING2|green|3rpg.fasta.txt|E3-ubi-RING2|3rpg.pdb|C|45,116|-15|1|10|6|3||
\endcode
-Next, we must prepare `static_snapshot.py` to read in this topology file. We begin by defining the input variables, `state` and `time`, which define which topology to use, as well as the paths to other pieces of input information.
+Next, we must prepare `static_snapshot.py` to read in this topology file. We begin by defining the input variables, `state` and `time`, which define which heterogeneity model to use, as well as the paths to other pieces of input information.
\code{.py}
# Running parameters to access correct path of ET_data for EM restraint
diff --git a/doc/trajectory.md b/doc/trajectory.md
index 23ff97db3..25e3409b9 100644
--- a/doc/trajectory.md
+++ b/doc/trajectory.md
@@ -70,7 +70,7 @@ nodes, graph, graph_prob, graph_scores = IMP.spatiotemporal.create_DAG(state_dic
The inputs we included are:
- state_dict (dict): a dictionary that defines the spatiotemporal model. Keys are strings for each time point in the spatiotemporal process and values are integers corresponding to the number of snapshot models computed at that time point.
- out_pdf (bool): whether to write the probability distribution function (pdf).
-- npaths (int): Number of states two write to a file (path*.txt).
+- npaths (int): Number of states to write to a file (path*.txt).
- input_dir (str): directory with the input information.
- scorestr (str): final characters at the end of the score files.
- output_dir (str): directory to which model will be written. Will be created if it does not exist.