Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atom index of the ligand in covalent bond restraint. And explicit hydrogen #217

Open
andrew0901 opened this issue Dec 5, 2024 · 4 comments

Comments

@andrew0901
Copy link

Hi Chai team,

Thanks for the update for the covalent bond restraint. I'm wondering to specify a linkage between ligand and protein, how do I find the atom name/index to put in the restraint file for the ligand atom of interest? Or is the function for now just for glycan and other modification with standard names?

My second question: I realize Chai removes all hydrogens in the output ligand, which is, I think, due to rdkit's procedure reading the smiles string. But sometimes there are some important hydrogens, like metal-hydride, that I wish to be explicitly predicted. Is there any way to do that in the current version or near future?

@wukevin
Copy link
Contributor

wukevin commented Dec 6, 2024

Regarding the linkage between ligand and protein, we do support arbitrary SMILES string in addition to standard CCD codes, and you can find the atom index by doing something like this, opening the resulting file in PyMOL (or something similar), clicking through, and finding the name of the atom that you wish to bind to:

def write_reference_conformer(query: str, query_type: str, output_file="ligand.pdb"):
    """Writes reference conformer of the given query to the given pdb file.

    NOTE: query_type must either be "ccd" or "smiles" and determine how we parse query.

    Meant as a research utility for quickly determining atom names and checking the
    conformer generated for a smiles string.
    """
    conformer_generator = RefConformerGenerator()

    if query_type == "ccd":
        conf_data = conformer_generator.get(query)
        assert conf_data is not None
    elif query_type == "smiles":
        conf_data = conformer_generator.generate(query)
    else:
        raise ValueError(f"Unrecognized {query_type=}")

    # log the atom names
    for i, n in enumerate(conf_data.atom_names):
        logging.info(f"Atom {i}:\t{n}")

    with open(output_file, "w") as f:
        # PDB format uses 1-based indexing for atom numbers
        for atom_index, (atom, coord) in enumerate(
            zip(conf_data.atom_names, conf_data.position), 1
        ):
            # Format each line according to PDB specifications
            line = (
                f"HETATM{atom_index:5d}  {atom:<3s} LIG A   1    "
                f"{coord[0]:8.3f}{coord[1]:8.3f}{coord[2]:8.3f}"
                f"  1.00  0.00          {atom:>2s}\n"
            )
            f.write(line)

        bonds_grouped = defaultdict(list)
        for i, j in sorted(conf_data.bonds):
            bonds_grouped[i + 1].append(j + 1)  # Converts 0 to 1 indexed
        for center_atom, connections in bonds_grouped.items():
            bonded_atoms = sorted(connections)
            while bonded_atoms:
                # Take up to 4 bonded atoms at a time
                current_bonds = bonded_atoms[:4]
                bonded_atoms = bonded_atoms[4:]

                # Format the CONECT record
                # The format requires each atom number to take exactly 5 columns
                line = f"CONECT{center_atom:5d}"
                for bonded_atom in current_bonds:
                    line += f"{bonded_atom:5d}"
                line += "\n"
                f.write(line)

        # Write the END record to properly terminate the PDB file
        f.write("END\n")

Regarding hydrogens - we remove all hydrogens during training, so while it might be possible to hack this in by changing the conformer generation logic, I wouldn't expect particularly good results when doing so. If you wish to hack it in, I would start by seeing what happens if you remove the call to remove hydrogens here:

def _load_ref_conformer_from_rdkit(self, mol: Chem.Mol) -> ConformerData:
mol = Chem.RemoveAllHs(mol)

@wukevin
Copy link
Contributor

wukevin commented Dec 12, 2024

We've also added an example of specifying covalently bound ligands via SMILES string here: https://github.com/chaidiscovery/chai-lab/tree/main/examples/covalent_bonds#non-glycan-ligands

@andrew0901
Copy link
Author

Thanks for providing the script. I ran it and here is the PDB I ended up with. In your example, you specified the sulfur atom as @s1. Does the "1" here means the first S atom in the atom order of the generated PDB? Because the index is not 1 for the desired atom in the output file.

HETATM 1 C LIG A 1 0.867 -0.164 -1.047 1.00 0.00 C
HETATM 2 C LIG A 1 0.155 -1.005 -1.898 1.00 0.00 C
HETATM 3 C LIG A 1 -1.020 -0.593 -2.489 1.00 0.00 C
HETATM 4 C LIG A 1 -1.470 0.679 -2.211 1.00 0.00 C
HETATM 5 C LIG A 1 -0.754 1.506 -1.363 1.00 0.00 C
HETATM 6 C LIG A 1 0.422 1.108 -0.765 1.00 0.00 C
HETATM 7 O LIG A 1 1.081 1.989 0.071 1.00 0.00 O
HETATM 8 C LIG A 1 2.271 1.682 0.745 1.00 0.00 C
HETATM 9 C LIG A 1 2.181 0.616 1.751 1.00 0.00 C
HETATM 10 O LIG A 1 3.235 0.274 2.391 1.00 0.00 O
HETATM 11 N LIG A 1 0.962 -0.037 2.033 1.00 0.00 N
HETATM 12 C LIG A 1 0.932 -1.083 3.030 1.00 0.00 C
HETATM 13 C LIG A 1 1.340 -0.610 4.406 1.00 0.00 C
HETATM 14 S LIG A 1 1.246 -2.029 5.541 1.00 0.00 S
HETATM 15 CL LIG A 1 -2.960 1.237 -2.944 1.00 0.00 CL
HETATM 16 CL LIG A 1 -1.907 -1.671 -3.564 1.00 0.00 CL
CONECT 1 2
CONECT 2 3
CONECT 3 4 16
CONECT 4 5 15
CONECT 5 6
CONECT 6 1 7
CONECT 7 8
CONECT 8 9
CONECT 9 10 11
CONECT 11 12
CONECT 12 13
CONECT 13 14
END

@wukevin
Copy link
Contributor

wukevin commented Dec 14, 2024

Thanks for providing the script. I ran it and here is the PDB I ended up with. In your example, you specified the sulfur atom as @s1. Does the "1" here means the first S atom in the atom order of the generated PDB? Because the index is not 1 for the desired atom in the output file.

Atom names are assigned in order of occurrence in the reference conformer, so @S1 indicates the first sulfur in the reference conformer.

In the general case, you can hover over each atom in something like PyMOL to find the atom name we assign to it, and change which atom you specify in your restraint accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants