ChemicalToolBoX

What's the ChemicalToolBoX?

The ChemicalToolBoX is a set of tools integrated into the Galaxy workflow management system. It gives researchers easy-to-use, reproducible, and transparent access to cheminformatics libraries and drug discovery tools. It includes standard applications for similarity and substructure searches, clustering of compounds, prediction of properties and descriptors, filtering, and many other tools that range from drug-likeness classification to fragmentation and fragment merging. By combining the various tools, many more powerful applications can be designed.

ChemicalToolBoX is based on open-source software, web-accessible, freely available, and easily expandable. It can be downloaded and easily deployed locally or on a large-scale cluster.

CTB is available as a public test instance at http://ctb.pharmaceutical-bioinformatics.org.

Galaxy

Galaxy is an open, web-based platform for data-intensive research. All tools can be combined in workflows without any programming skills. Furthermore, the platform can be extended with more tools at any time. Each tool comes with its own description of what it does and what the input is supposed to look like. You can make data available to Galaxy by uploading local files or downloading online content. Input files, workflow steps and results are stored in a history where you can view them or access them again later. It is possible to share workflows and histories with other users or to make them publicly available. Saved workflows can be used with new input files or simply to rerun an analysis, which ensures repeatability.

Parallelisation

ChemicalToolBoX can reduce the computation time of resource-intensive processes. Large molecule files are automatically split into smaller chunks. Each chunk is then processed on a separate core, and the results are merged afterwards. Nothing changes for the user, but computation time decreases.

The parallelisation is scalable: a job runs on a predefined number of cores, and the more cores are used, the faster the processing.
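
The split, process and merge idea can be sketched in a few lines of Python 3. This is only an illustration and not the code Galaxy runs: the function names are made up for this example, and the per-chunk job (reading the atom count from the counts line of each record) is a hypothetical stand-in for a real tool.

```python
from multiprocessing import Pool

SDF_DELIMITER = "$$$$"


def split_sdf(text):
    """Split SDF text into single-molecule records (delimiter line kept)."""
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == SDF_DELIMITER:
            records.append("\n".join(current) + "\n")
            current = []
    return records


def chunked(records, n_chunks):
    """Group records into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_atoms(chunk):
    """Hypothetical per-chunk job: (title, atom count) for every record."""
    results = []
    for record in chunk:
        lines = record.splitlines()
        # Line 1 holds the title, line 4 the counts line (atoms in columns 1-3).
        results.append((lines[0].strip(), int(lines[3][:3])))
    return results


def process_chunks(chunks, map_func=map):
    """Run the job on every chunk and merge the partial results in order."""
    merged = []
    for partial in map_func(count_atoms, chunks):
        merged.extend(partial)
    return merged


def process_parallel(text, n_cores=2):
    """Same as process_chunks, but one worker process per core."""
    chunks = chunked(split_sdf(text), n_cores)
    with Pool(processes=n_cores) as pool:
        return process_chunks(chunks, map_func=pool.map)
```

With the default `map_func` everything runs serially, which is handy for testing. `process_parallel` hands the same chunks to a process pool, so the merged result is identical and only the wall-clock time changes.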

Supported Filetypes

  • InChI
    International Chemical Identifier, developed by IUPAC. Represents a chemical molecule as a string which can include information about bonds, tautomerism, isotopes, charge and stereochemistry. Strings are generated following the InChI algorithm.
  • MOL2
    A Tripos Mol2 file can store a complete representation of a SYBYL molecule.
  • MOL & SDF
    Structure-data file that can contain many molecules. Molecules are separated by four dollar signs ($$$$). Allows storing meta-information such as the molecular mass or a unique identifier. Developed by MDL Information Systems (Accelrys).
  • SMILES
    A line notation using ASCII strings to represent chemical molecules. Information about charge, isotopes or radicals can be included in addition to the stereo information (CIP convention) and the normal bonds. The Simplified Molecular Input Line Entry Specification was developed by Daylight Chemical Information Systems, Inc.
  • and others:
    Special file types like the Open Babel FastSearch index or the pharmacophore type from silicos-it are also supported.
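
To illustrate how an SDF record carries meta-information, here is a minimal Python 3 sketch that reads the data fields (the `> <NAME>` blocks that follow `M  END`) of a single record. The function name and the sample record are assumptions made up for this example; the ChemicalToolBoX tools rely on Open Babel and RDKit for real parsing.

```python
import re

# A data header line starts with '>' and carries the field name in <...>.
FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")


def read_sdf_fields(record):
    """Return the data fields of one SDF record as a dict of strings."""
    fields, name, value_lines = {}, None, []
    for line in record.splitlines():
        header = FIELD_HEADER.match(line)
        if header:
            name, value_lines = header.group(1), []
        elif name is not None:
            # A blank line or the record delimiter ends the current field.
            if line.strip() in ("", "$$$$"):
                fields[name] = "\n".join(value_lines)
                name = None
            else:
                value_lines.append(line)
    if name is not None:
        fields[name] = "\n".join(value_lines)
    return fields


# Made-up record without atoms, just to show the field layout.
EXAMPLE_RECORD = (
    "aspirin\n\n\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "> <ID>\nCTB-1\n\n"
    "> <MOLECULAR_MASS>\n180.16\n\n"
    "$$$$\n"
)

print(read_sdf_fields(EXAMPLE_RECORD))
```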

All file types are interchangeable thanks to three easy-to-use conversion options:

  • the built-in conversion via the pencil icon

    https://github.com/bgruening/galaxytools/raw/master/chemicaltoolbox/convert_pencil.jpg
  • the Compound Converter tool described in the Tools section

  • the automatic conversion each tool offers

    https://github.com/bgruening/galaxytools/raw/master/chemicaltoolbox/internal_conversion.png

Getting Started

ChemicalToolBoX can be installed on all common Unix systems. However, it is developed on Linux and I don't have access to OS X. You are welcome to help improve this documentation; just contact me.

For any additional information, especially cluster configuration or general Galaxy questions, please have a look at the Galaxy Wiki.

Prerequisites:

  • Python 2.6 or 2.7
  • C, C++ and Fortran compilers
  • Autotools
  • CMake
  • cairo development files (used for PNG depictions)
  • Python development files
  • libblas and liblapack development files
  • Java Runtime Environment (JRE, used by OPSIN and NPLS)

To install all of the prerequisites you can run the following command, depending on your OS:

  • Debian based systems: apt-get install build-essential gfortran cmake mercurial libcairo2-dev python-dev
  • Fedora: yum install make automake gcc gcc-c++ gcc-gfortran cmake mercurial libcairo2-devel python-devel
  • OS X (MacPorts): port install gcc cmake automake mercurial cairo-devel

Galaxy installation

  1. Create a sand-boxed Python using virtualenv (not necessary but recommended):

    wget https://raw.github.com/pypa/virtualenv/master/virtualenv.py
    python ./virtualenv.py --no-site-packages galaxy_env
    . ./galaxy_env/bin/activate
    
  2. Clone the latest Galaxy platform:

    git clone https://github.com/galaxyproject/galaxy.git
    
  3. Navigate to the galaxy folder and update it:

    cd ~/galaxy
    git pull
    

    This step is not necessary if you have a fresh checkout. Anyway, it is good to know ;)

  4. Create folders for toolshed and dependencies:

    mkdir ~/shed_tools
    mkdir ~/galaxy/tool_deps
    
  5. Create the configuration file:

    cp ~/galaxy/config/galaxy.ini.sample ~/galaxy/config/galaxy.ini
    
  6. Open config/galaxy.ini to change the dependencies directory:

    LINUX: gedit ~/galaxy/config/galaxy.ini
    OS X: open -a TextEdit ~/galaxy/config/galaxy.ini
    
  7. Search for tool_dependency_dir = None and change it to tool_dependency_dir = ./tool_deps; remove the leading # if present

  8. Remove the # in front of tool_config_file and tool_path

  9. (Re-)Start the galaxy daemon:

    sh run.sh --reload
    

    In daemon mode all logs will be written to main.log in your Galaxy home directory. You can also use:

    run.sh
    

    During the first startup Galaxy will prepare your database. That can take some time. Have a look at the log file if you want to know what happens.

After launching, Galaxy is accessible via the browser at http://localhost:8080/.
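
If you prefer to script the galaxy.ini changes instead of editing the file by hand, the following Python 3 sketch shows one way to do it. The helper names are made up for this example, and the sample text only assumes that the options appear as commented `key = value` lines; check your own galaxy.ini before relying on it.

```python
import re


def set_option(config_text, key, value):
    """Set `key = value`, removing a leading '#' if the line is commented out."""
    pattern = re.compile(
        r"^[ \t]*(?:#[ \t]*)?%s[ \t]*=.*$" % re.escape(key), re.MULTILINE
    )
    replacement = "%s = %s" % (key, value)
    new_text, count = pattern.subn(lambda match: replacement, config_text, count=1)
    if count == 0:
        # Do not guess where a missing option belongs.
        raise KeyError(key)
    return new_text


def uncomment_option(config_text, key):
    """Remove the leading '#' of `key`, keeping its current value."""
    pattern = re.compile(
        r"^[ \t]*#[ \t]*(%s[ \t]*=.*)$" % re.escape(key), re.MULTILINE
    )
    return pattern.sub(lambda match: match.group(1), config_text, count=1)


# Assumed excerpt of a freshly copied galaxy.ini (values are placeholders).
SAMPLE = (
    "[app:main]\n"
    "#tool_dependency_dir = None\n"
    "#tool_config_file = config/tool_conf.xml\n"
    "#tool_path = tools\n"
)

updated = set_option(SAMPLE, "tool_dependency_dir", "./tool_deps")
updated = uncomment_option(updated, "tool_config_file")
updated = uncomment_option(updated, "tool_path")
print(updated)
```

To apply it to a real file, read `config/galaxy.ini` into a string, pass it through the helpers and write the result back. Keep a backup copy, because the sketch does not understand sections or repeated keys.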

Tool Shed configuration

  • Register a new user account in your Galaxy instance: Top Panel → User → Register
  • Become an admin
    • open config/galaxy.ini in your favourite text editor (gedit config/galaxy.ini)
    • search for admin_users = None and change it to admin_users = EMAIL_ADDRESS (your Galaxy username)
    • remove the leading # if present
  • restart Galaxy:

    sh run.sh --reload

ChemicalToolBoX installation

ChemicalToolBoX will automatically download and compile all requirements, like Open Babel, RDKit, chemfp, numpy and so on. This can take 2-3 hours.

Installation via Galaxy API (recommended)

  • Generate an API Key

  • Run the installation script:

    python ./scripts/api/install_tool_shed_repositories.py --api YOUR_API_KEY -l http://localhost:8080 --url http://toolshed.g2.bx.psu.edu/ -o bgruening --name chemicaltoolbox --tool-deps --repository-deps --panel-section-name ChemicalToolBoX
    

You can watch the installation status under: Top Panel → Admin → Manage installed tool shed repositories
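
The installation command is long, and the repair and workflow commands further down this README differ only in the script name, repository name and revision. A small Python 3 helper can assemble it. This is just a convenience sketch: the function is made up for this example and only uses the options that appear in the commands of this README.

```python
import shlex


def build_install_command(api_key, name, revision=None,
                          owner="bgruening",
                          galaxy_url="http://localhost:8080",
                          tool_shed_url="http://toolshed.g2.bx.psu.edu/",
                          panel_section="ChemicalToolBoX"):
    """Return the install command as an argument list for subprocess."""
    command = [
        "python", "./scripts/api/install_tool_shed_repositories.py",
        "--api", api_key,
        "-l", galaxy_url,
        "--url", tool_shed_url,
        "-o", owner,
    ]
    if revision is not None:
        command += ["-r", revision]
    command += [
        "--name", name,
        "--tool-deps", "--repository-deps",
        "--panel-section-name", panel_section,
    ]
    return command


def as_shell_line(command):
    """Quote the argument list so it can be pasted into a shell."""
    return " ".join(shlex.quote(argument) for argument in command)


print(as_shell_line(build_install_command("YOUR_API_KEY", "chemicaltoolbox")))
```

The argument list can be passed to `subprocess.call(command, cwd=...)` with the Galaxy root directory as `cwd`, since the script path is relative to it.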

Installation via web browser

  • go to the admin page
  • select Search and browse tool sheds
  • Galaxy tool shed > Computational chemistry > chemicaltoolbox
  • install chemicaltoolbox

Additional Notes

You can also configure CTB to use system-installed binaries, but you will lose some degree of reproducibility. Nevertheless, if you want to do this, the recommended dependency versions are specified in a file called tool_dependencies.xml, located in each subfolder.

Troubleshooting

If you have any trouble or the installation did not finish properly, do not hesitate to contact me. However, if the installation fails during the Galaxy installation, you can have a look at the Galaxy wiki. If the ChemicalToolBoX installation fails, you can try to run:

python ./scripts/api/repair_tool_shed_repository.py --api YOUR_API_KEY -l http://localhost:8080 --url http://toolshed.g2.bx.psu.edu/ -o bgruening -r 30ae0e5218b4 --name chemicaltoolbox

That will rerun all failed installation routines. Alternatively, you can navigate to the ChemicalToolBoX repository in your browser and repair manually: Top Panel → Admin → Manage installed tool shed repositories → chemicaltoolbox → Repository Actions → Repair repository


On slow computers and during the compilation of large software libraries, like openbabel or boost, the Tool Shed can run into a timeout and kill the installation. That problem is known and should be fixed in the near future.

If you encounter a timeout or a hang during the installation, you can increase the threadpool_kill_thread_limit in your config/galaxy.ini file.


Database locking errors

Please note that Galaxy uses a SQLite database by default. SQLite is not intended for production use. With multiple users or complex components, such as workflows, you will see database locking errors. We highly recommend using PostgreSQL for any kind of production system.
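
The locking behaviour is easy to reproduce outside Galaxy. The following Python 3 sketch opens two connections to the same SQLite file and lets the second one write while the first one is still inside a write transaction. The table and function names are made up for this example and have nothing to do with the Galaxy schema.

```python
import os
import sqlite3
import tempfile


def demonstrate_lock():
    """Return the error a second writer gets while a first writer holds the lock."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.sqlite")
        # isolation_level=None: transactions are controlled explicitly below.
        first = sqlite3.connect(path, isolation_level=None)
        # timeout=0: fail immediately instead of waiting for the lock.
        second = sqlite3.connect(path, timeout=0, isolation_level=None)
        try:
            first.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT)")
            first.execute("BEGIN IMMEDIATE")  # take the write lock and keep it
            first.execute("INSERT INTO job (state) VALUES ('running')")
            try:
                second.execute("INSERT INTO job (state) VALUES ('queued')")
            except sqlite3.OperationalError as error:
                return str(error)
            return None
        finally:
            first.close()
            second.close()


print(demonstrate_lock())
```

SQLite allows only one writer per database file at a time, so every additional user or workflow step competes for the same lock. A server database such as PostgreSQL locks at a much finer granularity, which is why it copes better with concurrent jobs.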

Jmol Editor Installation

Jmol Editor needs to run on a separate webserver. This is how to set up the server:

  • download Jmol Editor from:

    wget https://github.com/bgruening/download_store/raw/master/jmoleditor.tar.gz
    
  • copy the directory jmoleditor into your Galaxy Root directory

    cp -a ~/galaxytools/chemicaltoolbox/data_source/jmoleditor/ ~/galaxy/
    
  • launch the webserver from your galaxy root directory

    python -m SimpleHTTPServer &
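
The SimpleHTTPServer module only exists in Python 2. If the webserver has to run under Python 3 (an assumption, since Galaxy itself is documented here for Python 2.6 or 2.7), the standard library offers `http.server` instead. The sketch below serves a directory in a background thread. The function name is made up for this example, and it binds to 127.0.0.1 by default, which is more restrictive than the command above.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=8000, host="127.0.0.1"):
    """Serve `directory` over HTTP in a background thread and return the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Call `serve_directory("/path/to/galaxy")` with your Galaxy root directory and keep the returned object around. `server.shutdown()` stops the server again. On the command line, `python3 -m http.server` run from the Galaxy root directory is the closest equivalent to the command above.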
    

Tools

  • Get Chemical Data
    • Jmol Editor
      Jmol Editor can be used to draw structures or to alter the atoms or identities of single molecules.
  • Chemical Converters
    • Compound converter
      Compound converter combines several Open Babel command-line converters in one easy-to-use tool. It converts various chemistry and molecular modeling data files. The output format can be specified as well as several parameters. Some parameters are available for all output formats (e.g. protonation state & pH), others are specific to a given output format (e.g. exclude isotopes for conversion to canonical SMILES).
    • Molecule recognition
      OSRA (Optical Structure Recognition Application) is a utility designed to convert graphical representations of chemical structures into SMILES or SDF. It generates the SMILES or SDF representation of any molecular structure image within a document that can be parsed by GraphicsMagick.
    • IUPAC name-to-structure
      OPSIN is an IUPAC name-to-structure conversion tool offering high recall and precision on organic chemical nomenclature.
  • Filter / Sort
    • (Multi) Compound search
      Uses Open Babel obgrep to search for molecules inside multi-molecule files (e.g. SMI, SDF, etc.).
    • Remove counterions and fragments
      Parses a multi-molecule file and deletes any counterions or fragments that are present.
    • Remove duplicated molecules
      Filters a library of compounds and removes duplicated molecules by comparing either their InChI or their SMILES (SMI) representation.
    • Filter
      Filters a library of compounds based on user-defined physico-chemical parameters or predefined options (e.g. Ro5, lead-like properties, etc.). Multiple parameters can be selected for more specific queries.
    • Remove small molecules
      Filters a library of compounds and removes small molecules with fewer atoms than a user-defined threshold.
  • Search
    • Spectrophores™ search

      Spectrophores™ is a screening technology by Silicos which converts three-dimensional molecular property data into one-dimensional spectra. Typical characteristics that can be converted include electrostatic potentials, molecular shape, lipophilicity, hardness and softness potentials. The computation is independent of the position and orientation of a molecule and allows an easy comparison of Spectrophores™ of different molecules.

      Molecules with similar three-dimensional properties and shape, and therefore also similar biological activities, always have similar Spectrophores™. As a result this technique is a very powerful tool to investigate the similarity of molecules and can be applied as a screening tool for molecular databases, virtual screening, and database characterisations.

    • Similarity search

      Performs similarity searches with a variety of different fingerprints, using either the chemfp FPS type or the Open Babel FastSearch index.

    • Substructure search

      Substructure search is based on Open Babel FastSearch. FastSearch uses molecular fingerprints to prepare and search an index of a multi-molecule datafile.

  • Calculate / Modify
    • Compute physico-chemical properties

      Computes several physico-chemical properties (e.g. logP, PSA, MW, etc.) for a set of molecules. Accepts SDF or MOL2 input files, because the 3D coordinates of the molecules have to be provided.

    • Add hydrogen atoms

      Parses a molecular file and adds hydrogen atoms at a user-defined pH value.

    • Remove protonation state

      Parses a molecular file and removes the protonation state of every atom.

    • Change title

      Sets the title of a molecule to the metadata value stored under a given ID in the same molecule file.

    • Confab

      Confab is a conformation generator. The algorithm starts with an input 3D structure which, after some initialisation steps, is used to generate multiple conformers which are filtered on-the-fly to identify diverse low energy conformers.

    • Molecules to fingerprints

      Ten different fingerprints can be calculated from all common file formats using chemfp. chemfp supports the FPS fingerprint file format and makes use of Open Babel, OpenEye and RDKit.

    • SDF to fingerprint

      Reads an input SD file (PubChem), extracts the fingerprints and stores them in an FPS file.

    • Drug-likeness

      Estimates the drug-likeness of molecules and reports a score. Three variants are available (QEDw,mo, QEDw,max, QEDw,u).

    • Descriptors by RDKit

      This tool calculates all available descriptors from RDKit.

    • Natural Product likeness

      Calculates the Natural Product(NP)-likeness of a molecule, i.e. the similarity of the molecule to the structure space covered by known natural products.

    • Shape-it™

      Shape-it™ is a silicos-it tool that aligns a reference molecule against a set of database molecules using the shape of the molecules as the alignment criterion. It is based on the use of Gaussian volumes as a descriptor for molecular shape, as introduced by Grant and Pickup.

      Shape-it™ is controlled by command line options. The program expects a single reference molecule (with three-dimensional coordinates) and a database file containing one or more molecules (with three-dimensional coordinates) that need to be shape-aligned onto the reference molecule. The tool returns all aligned database molecules and their respective shape overlap scores, or only the top-scoring molecules.

    • Strip-it™

      Strip-it™ is a program by silicos-it that identifies and extracts predefined scaffolds from organic small molecules. The program is linked against the open source C++ library of Open Babel.

      The program comes with a number of predefined molecular scaffolds for extraction. These scaffolds include, among others, molecular frameworks as originally described by Bemis and Murcko, molecular frameworks and reduced molecular frameworks as described by Ansgar Schuffenhauer and coworkers, and scaffold topologies as described by Sara Pollock and coworkers.

  • Chemical Clustering
    • NxN clustering
      Generates hierarchical clusters and visualises clusters with dendrograms. Powered by chemfp.
    • Taylor-Butina clustering
      Taylor-Butina clustering is an unsupervised non-hierarchical clustering method which guarantees that every cluster contains molecules which are within a distance cutoff of the central molecule. Powered by chemfp.
  • Fragmentation
    • Fragmenter
      Splits a molecule at predefined positions, e.g. according to the RECAP rules.
    • Merging
      Merges small molecules into larger compounds using predefined reactions. The options iteration depth and number of repeats can be used to adjust the number of compounds created and the actual computation time.
  • Visualisation
    • Depiction
      Creates an .svg or .png image of a small set of molecules (a few hundred). Based on Open Babel PNG/SVG 2D depiction.
    • More to come ...
      We are working on several ideas for improving the visualisation of small and large libraries in Galaxy. If you are interested and want to discuss it further, please contact me (e-mail).
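
To give an impression of what the similarity and clustering tools compute, here is a small Python 3 sketch of the Tanimoto coefficient and of Taylor-Butina clustering. It is a plain re-implementation for illustration and not the chemfp code used by the tools: fingerprints are given as sets of on-bit indices, and the function names, the threshold and the example data are made up.

```python
def tanimoto(first, second):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not first and not second:
        return 0.0
    shared = len(first & second)
    return shared / (len(first) + len(second) - shared)


def butina_clusters(fingerprints, threshold=0.8):
    """Taylor-Butina clustering; returns a list of (centroid, members) tuples.

    `fingerprints` maps a molecule id to its fingerprint. Two molecules are
    neighbours if their Tanimoto similarity is at least `threshold`.
    """
    ids = list(fingerprints)
    neighbours = {molecule: set() for molecule in ids}
    for position, first in enumerate(ids):
        for second in ids[position + 1:]:
            if tanimoto(fingerprints[first], fingerprints[second]) >= threshold:
                neighbours[first].add(second)
                neighbours[second].add(first)

    # Molecules with many neighbours are tried first as cluster centroids.
    # sorted() is stable, so ties keep the input order.
    order = sorted(ids, key=lambda molecule: len(neighbours[molecule]), reverse=True)
    assigned, clusters = set(), []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [m for m in ids if m in neighbours[centroid] and m not in assigned]
        assigned.add(centroid)
        assigned.update(members)
        clusters.append((centroid, members))
    return clusters


EXAMPLE = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {1, 2, 3, 4, 5},
    "d": {10, 11},
    "e": {10, 11, 12},
    "f": {20},
}

for centroid, members in butina_clusters(EXAMPLE, threshold=0.5):
    print(centroid, members)
```

Every member of a cluster is a neighbour of the centroid, which is the guarantee mentioned for Taylor-Butina clustering above. A molecule without any unassigned neighbours ends up as a singleton cluster.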

Workflows

An example workflow is located in the Tool Shed:

http://toolshed.g2.bx.psu.edu/view/bgruening/chemicaltoolbox_merging_chemical_databases_workflow

You can install the workflow with the API:

python ./scripts/api/install_tool_shed_repositories.py --api YOUR_API_KEY -l http://localhost:8080 --url http://toolshed.g2.bx.psu.edu/ -o bgruening -r e1bc8415f875 --name chemicaltoolbox_merging_chemical_databases_workflow --tool-deps --repository-deps --panel-section-name ChemicalToolBoX

or as described above via the web browser. Once the workflow is installed, go to the admin panel, choose the workflow and import it to make it available to all your users. For more information, have a look at the Galaxy wiki:

http://wiki.galaxyproject.org/ToolShedWorkflowSharing#Finding_workflows_in_tool_shed_repositories

Please note that Galaxy uses a SQLite database by default. SQLite is not intended for production use. With multiple users or complex components, such as this workflow, you will see database locking errors. We highly recommend using PostgreSQL for any kind of production system.

Publications using CTB

Bug Tracker

Have a bug or a feature request? Please write a new card. Before writing a new card, please search for existing issues.

Contributing

We encourage you to contribute to ChemicalToolBoX! Check out our Trello board or contact us via e-mail.