G-Huang edited this page Oct 4, 2018 · 22 revisions

Instructions for LillyMol

Glossary

SMILES: A simple ASCII string-based method for representing molecules and reactions (see http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html). Note that a single molecule can be represented by multiple SMILES strings.

SMARTS: A simple ASCII string-based method for representing molecular substructures; an extension of SMILES (see http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html).

SMILES file: A text file containing SMILES strings; each SMILES is the first element of each line/row. Traditionally, the second column is the identifier (ID) of the molecule. Additional columns may exist. Columns are separated by space or tab, with space being the standard for LillyMol. Typical file extension: ‘.smi’
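
As a small illustration of this layout, the sketch below writes a two-molecule SMILES file and reads each column back with awk. The file name example.smi and the molecule names are made up for the example; nothing here depends on LillyMol itself.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# Hypothetical SMILES file: column 1 is the SMILES, column 2 the ID,
# separated by a single space.
cat > "$workdir/example.smi" <<'EOF'
CCO ethanol
c1ccccc1 benzene
EOF

# Column 1: the SMILES strings.
smiles=$(awk '{print $1}' "$workdir/example.smi")
# Column 2: the molecule identifiers.
ids=$(awk '{print $2}' "$workdir/example.smi")

printf '%s\n' "$ids"
```

The same column convention applies to the SMARTS and reaction SMILES files described in this glossary.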

SMARTS file: A text file containing SMARTS strings; each SMARTS is the first element of each line/row. Traditionally, the second column is the identifier (ID) of the substructure. Additional columns may exist. Columns are separated by space or tab, with space being the standard for LillyMol. Typical file extension: ‘.smt’

SDF or SD file: A simple, ASCII connection table-based method for representing molecules and substructures (see https://en.wikipedia.org/wiki/Chemical_table_file). Typical file extension: ‘.sdf’

RDF or RD file: A simple, ASCII connection table-based method for representing chemical reactions (see http://c4.cabrillo.edu/404/ctfile.pdf). Typical file extension: ‘.rdf’

Chemical substructure: A contiguous chemical fragment; may not be a valid molecule.

Substructure search: The process of searching for the presence of a chemical substructure in molecules.

Canonical SMILES: A special, unique SMILES representation for a specific chemical structure.

Structure clean-up: A common chemoinformatics process that ‘cleans’ a structure representation of salts, fragments, etc., and checks it for simple errors, e.g. syntax, valence, etc.

Reaction SMILES: A simple ASCII string-based method for representing chemical reactions using SMILES strings.

Reaction SMILES file: A text file containing reaction SMILES strings; each reaction SMILES is the first element of each line/row. Traditionally, the second column is the identifier (ID) of the reaction. Additional columns may exist. Columns are separated by space or tab, with space being the standard for LillyMol. Typical file extension: ‘.rsmi’

Reaction signature: The unique SMILES-like string representing the extended reaction core of a chemical reaction.

References/Resources

https://en.wikipedia.org/wiki/Chemical_file_format

http://c4.cabrillo.edu/404/ctfile.pdf

Tool Documentation

1. common_names

Description:

Merge identical chemical structures to one common name in a SMILES file (also see unique_molecules tool). Useful for identifying unique chemical structures in a SMILES file.

Author/owner: C3/Eli Lilly and Co

Sample 1:

common_names -S ./output -s 10000 -r 10000 -D + -v input1.smi input2.smi

Explanation:

Find all compounds in input1.smi and input2.smi that share a structure but have different names; write them to output.smi (-S ./output) under a new name made of the old names joined by a "+" symbol (-D +). The maximum number of molecules to process is 10000 (-s 10000); progress is reported every 10000 rows (-r 10000).

Shell output:

… output to './output'

File output: (output.smi)

Combined compound list

Help command:

common_names
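
To picture the name merging in Sample 1 of common_names, the sketch below does a rough equivalent with awk. It is not how common_names works internally: it assumes the SMILES in both files are already canonical, so identical structures appear as identical strings, and it assumes every distinct structure is written once. The file contents and compound names are hypothetical.

```shell
workdir=$(mktemp -d)

# Hypothetical inputs. The SMILES are assumed to be canonical already,
# so the same structure always appears as the same string.
cat > "$workdir/input1.smi" <<'EOF'
CCO cmpd_A
c1ccccc1 cmpd_B
EOF
cat > "$workdir/input2.smi" <<'EOF'
CCO cmpd_C
CCN cmpd_D
EOF

# Group names by SMILES string, joining them with "+",
# and keep the SMILES in order of first appearance.
merged=$(awk '
  !($1 in names) { order[++n] = $1; names[$1] = $2; next }
  { names[$1] = names[$1] "+" $2 }
  END { for (i = 1; i <= n; i++) print order[i], names[order[i]] }
' "$workdir/input1.smi" "$workdir/input2.smi")

printf '%s\n' "$merged"
```

The real tool compares structures rather than strings, which is why it also handles inputs whose SMILES are written differently for the same molecule.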

2. fetch_smiles_quick

Description:

Fetches records from one file based on identifiers in another file

Author/Owner: C3/Eli Lilly and Co

Sample 1:

fetch_smiles_quick -j -c 1 -C 2 -X notInRecord -Y notInIdentifier record.w structures.smi

Explanation:

Matches records in record.w against structures.smi on common identifiers (column 1 in record.w, set by -c 1, and column 2 in structures.smi, set by -C 2). The matched records will be displayed in the shell window. The list of unmatched identifiers will be saved in the notInRecord file. The list of unmatched records will be saved in the notInIdentifier file. The identifier file is treated as a descriptor file, so its header record is skipped (-j).

Shell output: (matched record)

O=C(C)C=CC=C(C)CCC=C(C)C PBCHM1756999 24 3 6 349.4 5 0.2083 10

File output: (notInIdentifier)

Unmatched records

File output: (notInRecord)

Unmatched identifiers

Help command:

fetch_smiles_quick
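
The identifier matching in Sample 1 of fetch_smiles_quick can be pictured as a two-file join. The awk sketch below is an approximation under stated assumptions, not the tool's implementation: the two small input files are hypothetical, record.w here has no header line, and only the matched records are produced (the -X and -Y outputs are not reproduced).

```shell
workdir=$(mktemp -d)

# Hypothetical identifier file: ID in column 1, then extra data.
cat > "$workdir/record.w" <<'EOF'
ID1 24 3
ID3 11 0
EOF
# Hypothetical SMILES file: ID in column 2.
cat > "$workdir/structures.smi" <<'EOF'
CCO ID1
CCN ID2
EOF

# First pass (NR == FNR) loads record.w keyed on column 1.
# Second pass prints each SMILES record whose column 2 matches,
# followed by the remaining columns of the matching record.
matched=$(awk '
  NR == FNR { key = $1; $1 = ""; sub(/^ /, ""); extra[key] = $0; next }
  $2 in extra { print $0, extra[$2] }
' "$workdir/record.w" "$workdir/structures.smi")

printf '%s\n' "$matched"
```

Only ID1 occurs in both files, so a single line is printed: the SMILES record with the extra columns from record.w appended, which mirrors the shape of the shell output shown in the sample.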

3. unique_molecules

Description:

Filters out duplicate chemical structures based on unique smiles

Author/owner: C3/Eli Lilly and Co

Sample 1:

unique_molecules -S unique -D duplicate -v -l input.smi

Explanation:

Traverse the structures in input.smi and identify duplicates; write duplicates to duplicate.smi (-D duplicate) and unique structures to unique.smi (-S unique); only consider the largest fragment of each smiles (-l).

Shell output:

Execution summary

File output: (duplicate.smi)

Duplicate molecule list

File output: (unique.smi)

Unique molecule list

Help command:

unique_molecules
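
The unique/duplicate split in Sample 1 of unique_molecules can be sketched with awk as follows. This is a simplification, not the tool's method: it assumes the input SMILES are already canonical so that string equality stands in for structure equality, it treats the first occurrence as the unique one, and it does not attempt the largest-fragment selection done by -l. The input lines are hypothetical.

```shell
workdir=$(mktemp -d)

# Hypothetical input with one repeated structure (CCO).
cat > "$workdir/input.smi" <<'EOF'
CCO m1
CCN m2
CCO m3
EOF

# The first time a SMILES is seen it goes to unique.smi;
# any later occurrence goes to duplicate.smi.
awk -v uniq="$workdir/unique.smi" -v dup="$workdir/duplicate.smi" '
  seen[$1]++ { print > dup; next }
  { print > uniq }
' "$workdir/input.smi"

cat "$workdir/unique.smi"
```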

4. unique_rows

Description:

Identifies the unique rows in a file

Author/Owner: C3/Eli Lilly and Co

Sample 1:

unique_rows -c 1 -c 2 input.dat

Explanation:

Check input.dat for unique rows based on the values in columns 1 and 2 (-c 1 -c 2). The unique rows will be displayed in the shell window.

Shell output:

Unique row list

Help command:

unique_rows
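
For comparison, the column-based uniqueness test in Sample 1 of unique_rows resembles a common awk idiom. The sketch below assumes that the first row for each (column 1, column 2) pair is the one kept; that choice, and the contents of input.dat, are assumptions made for the example.

```shell
workdir=$(mktemp -d)

# Hypothetical data file; rows 1 and 2 share the same first two columns.
cat > "$workdir/input.dat" <<'EOF'
a 1 x
a 1 y
b 1 z
a 2 w
EOF

# Keep a row only the first time its (column 1, column 2) pair is seen.
unique=$(awk '!seen[$1 SUBSEP $2]++' "$workdir/input.dat")

printf '%s\n' "$unique"
```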

5. iwcut

Description:

Extract columns from a text file

Author/Owner: C3/Eli Lilly and Co

Sample 1:

iwcut -f 5,3 input.txt

Explanation:

Extract column 5 and column 3 from input.txt (-f 5,3). The extracted column data will be displayed in the shell window.

Shell output:

Data from column 5 and column 3

Help command:

iwcut
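
A rough awk counterpart to Sample 1 of iwcut is shown below, assuming columns are emitted in the requested order (5, then 3), as the sample explanation suggests. The input file contents are hypothetical.

```shell
workdir=$(mktemp -d)

# Hypothetical space-separated input with five columns.
cat > "$workdir/input.txt" <<'EOF'
a1 b1 c1 d1 e1
a2 b2 c2 d2 e2
EOF

# Print column 5 followed by column 3 for every row.
picked=$(awk '{print $5, $3}' "$workdir/input.txt")

printf '%s\n' "$picked"
```

Standard cut is not a drop-in comparison here, because cut -f 5,3 prints the selected fields in file order rather than in the order listed.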

6. fileconv

Description:

Structure file utility to clean up SMILES files and filter on specific criteria. It can also convert between chemical file formats, e.g. from SDF to SMILES.

Author/owner: C3/Eli Lilly and Co

Sample 1:

fileconv -Y dbg -B 100 -S -a input.smi

Explanation:

Debug/print each molecule structure in input.smi; ignore as many as 100 fatal input errors

Shell output:

Molecule information

Sample 2:

fileconv -F 6 -c 4 -C 14 -v -i smi -S selection list.smi

Explanation:

Select the molecules in list.smi that have 4 to 14 atoms (-c 4 -C 14) and fewer than 6 fragments (-F 6); store the results in selection.smi (-S selection).

Shell output:

Execution summary

File output: (selection.smi)

List of molecules meeting the search criteria

Sample 3:

fileconv -o sdf -i smi -S single single.smi

Explanation:

Convert single.smi file to the sdf format single.sdf

File output: (single.sdf)

Converted sdf file

Help command:

fileconv

7. rxn_signature

Description:

Generates reaction signatures for input reactions.

Author/Owner: C3/Eli Lilly and Co

Sample 1:

rxn_signature -v -r 0,1,2 -C Cfile -F Ffile all.rsmi >all.sig 2>all.log

Explanation:

Extract the reaction signatures of all reactions in all.rsmi. The program prints signatures to stdout, which is redirected here to all.sig; stderr is redirected to all.log. Signatures are generated at radii 0, 1 and 2 from the reaction core, i.e. the changing atoms (-r 0,1,2). The list of changed atoms is written to Cfile (-C). Failed reactions are written to Ffile (-F).

Notes:

Reaction signatures capture the extended core of a reaction around the atoms that change in a reaction. A signature is based on the unique smiles of the reaction core. The smiles includes atoms colored by their environment in the original reaction smiles. In addition, information about the ring bond status in the original reaction smiles is appended to the reaction signature produced.

Help command:

rxn_signature

8. rxn_standardize

Description:

Checks and standardizes input chemical reactions; converts to a reaction smiles file format

Author/Owner: C3/Eli Lilly and Co

Sample 1:

rxn_standardize -s -c -D x -X igbad -v -C 60 -K -E autocreate -e -o -b -f gsub input.rsmi > output.rsmi

Explanation:

Check and standardize reactions in an input reactions smiles (.rsmi) file. Discard chirality on input (-c). Discard reactions containing duplicate atom map numbers (-D x). Ignore bad reactions (-X igbad). Discard any reaction where the largest reactant has more than 60 atoms (-C 60). Kekule fix (-K). Automatically create new elements when encountered (-E autocreate). Move small fragments that show up on products to orphan status (-e). Create reagent fragments that are orphans (-o). Remove duplicate reactants, even if atom maps scrambled (-b). Replace unusual characters in reaction names with _ (-f gsub).

Notes:

Input file can be in RDF or rsmi format. Output is in rsmi format

Help command:

rxn_standardize

9. tsubstructure

Description:

Perform 2D substructure searches with SMILES/SMARTS against SMILES files

Author/owner: C3/Eli Lilly and Co

Sample 1:

tsubstructure -s 'C(C)(=O)C' -m hits.smi -n nonhits.smi list.smi

Explanation:

Search for molecules in list.smi containing the defined smarts (-s); write hits to hits.smi (-m) and nonhits to nonhits.smi (-n).

Sample 2:

tsubstructure.sh -f -b -A D -o smi -m hits.smi -s 'C(C)(=O)C' list.smi

Explanation:

Search for molecules containing defined smarts (-s); only find one embedding of the query (-f); for each molecule, break after finding a query which matches (-b); use daylight aromaticity (-A D); write hits in hits.smi

Note:

Use -X to successfully skip structures with unconventional symbols, e.g. X, R, ...

Sample 3:

tsubstructure -s '[ND1H2]-[C@H]1CCN2CCCCC2C1' -A D -o usmi -m match.out list.smi

Explanation:

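Search for molecules in list.smi containing the defined smarts (-s); use Daylight aromaticity (-A D); write matches as unique smiles (-o usmi) to match.out (-m).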

Sample 4:

tsubstructure -A D -q carboxylic_acids.qry -u -M imp2exp -m match.smi list.smi

Explanation:

Find all matches to specific query file (-q) and place in match.smi (-m); use Daylight aromaticity; convert implicit hydrogen in target molecules to explicit before matching attempt (-M imp2exp); find unique matches only (-u)

Help command:

tsubstructure

10. retrosynthesis

Description:

Defines synthetic routes for input chemical structures by deconstructing input molecules into reactants using a set of known reaction templates. Conceptually, the inverse of the reaction enumeration performed by the trxn tool.

Author/Owner: C3/Eli Lilly and Co

Sample 1:

retrosynthesis -Y all -X kg -X kekule -X ersfrm -a 2 -q f -v -R 1 -I CentroidRxnSmi_1 -P UST:AZUCORS -M ncon -M ring -M unsat -M arom 10Cmpds.smi

Explanation:

Looks for synthesis paths for the molecules in 10Cmpds.smi using the reaction signatures in CentroidRxnSmi_1. Various standardization flags (-Y, -X, -q, -P, -M options). Require at least 2 heavy atoms in fragments (-a), verbose (-v), centroid radius 1 (-R).

Help command:

retrosynthesis

11. trxn

Description:

Performs reactions between reactant molecules to enumerate product structures. Uses a control reaction file, a scaffold SMILES file and zero or more reactant SMILES files. Conceptually inverse of retrosynthesis process as implemented by tool retrosynthesis.

Author/owner: C3/Eli Lilly and Co

Sample 1:

trxn -v -r 1.2.1_Aldehyde_reductive_amination_FROM_amines_AND_aldehydes.rxn -Z -z i -M RMX -m RMX -S 1.2.1_run 20180412_amines.smi 20180412_aldehydes.smi

Explanation:

Perform the reaction in 1.2.1_Aldehyde_reductive_amination_FROM_amines_AND_aldehydes.rxn, ignoring sidechains (-Z) and molecules (-z i) not reacting, ignoring sidechains with multiple substructure matches (-M RMX), and ignoring scaffolds that generate multiple substructure hits (-m RMX). Reactants are read from 20180412_amines.smi and 20180412_aldehydes.smi; output is written to the file stem 1.2.1_run (-S).

Sample 2:

trxn -v -r 2.1.2_Carboxylic_acid_+_amine_condensation_FROM_amines_AND_carboxylic_acids.rxn -Z -z i -M RMX -m RMX -S 2.1.2_run 20180412_amines.smi 20180412_carboxylic_acids.smi

Explanation:

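Perform the reaction in 2.1.2_Carboxylic_acid_+_amine_condensation_FROM_amines_AND_carboxylic_acids.rxn with the same options as Sample 1, reading reactants from 20180412_amines.smi and 20180412_carboxylic_acids.smi and writing output to the file stem 2.1.2_run (-S).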

Help command:

trxn

12. iwdemerit

Description:

Computes demerit of a molecule. In this context demerits refers to non-desirable molecular structure characteristics/features.

Author/owner: C3/Eli Lilly and Co

Sample 1:

iwdemerit -A D -A I -S foo -G - -f 99999 -t -W imp2exp -W maxe=1 -E autocreate -q F:PAINS/queries_latest -O hard -W dnv=0 -W slist -i smi pubchem_example.smi

Explanation:

Compute the demerits for the molecules in pubchem_example.smi (-i smi pubchem_example.smi) using the queries_latest query file (-q F:PAINS/queries_latest). The good, non-rejected (-G) structures will be written into foo.demerit (-S foo). Use Daylight aromaticity definitions (-A D) and enable input of aromatic structures (-A I). Molecules are rejected when they have 99999 or more demerits (-f 99999). Append demerit text to molecule names (-t). Make implicit hydrogens explicit (-W imp2exp); set the maximum number of substructure queries to identify to 1 (-W maxe=1); use value 0 in the query file as the demerit score (-W dnv=0); and write a sorted list of demerit values and reasons (-W slist). Skip all the hard-coded substructure queries (-O hard).

Help command:

iwdemerit
