Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor CovalentBuilder, Enable Custom Tether Atoms within Receptor, and Export Modified Polymer to PDB #277

Closed

Conversation

rwxayheee
Copy link
Contributor

@rwxayheee rwxayheee commented Dec 10, 2024

This was discussed a while back among several of us. The major objective of this PR is to bring covalent ligands into the polymer world as monomers as suggested by @mattholc with improvements on how covalent docking can be set up and generalized in Meeko, with and without ProDy.

Three major tasks were included in this PR:

  • In mk_prepare_ligand.py: Improved Receptor Parsing
  • Class CovalentBuilder and the dependency of ProDy in covalentbuilder.py were removed (refactored). Reusable functions (find_smarts and transform) were isolated.

  • Supports parsing of [json, pdb] (and [mmcif] with ProDy) to define receptor tether atoms. Some options (--add_templates, --wanted_altloc, --default_altloc) were exposed the same way as mk_prepare_receptor.py because they're potentially important options to determine the coordinates of the receptor tether atoms.

  • Sets receptor tether atoms by --rec_residue, --rec_tether_names or alternatively --rec_tether_smarts and --rec_smarts_indices.

d706720 is a working checkpoint for covalent ligand preparation. No more CovalentBuilder and ProDy is optional.

  • Via mk_prepare_ligand.py -> JSON -> mk_prepare_receptor.py: A Specialized Receptor Preparation Routine and Custom Setup of Receptor Tether Atoms
  • Previously, covalent docking used the same receptor preparation routine but that lacks the flexibility to customize receptor tether points. Now, one is expected to begin the covalent docking preparation with ligand preparation, from which one can get the updated receptor JSON files. The JSON files have overlapping atoms marked as "is_ignored" and can directly be used for receptor preparation. However, the "is_ignored" tags are not required for the export of poses.

393372b is a working checkpoint to write covalent receptor JSON from mk_prepare_ligand.py at the same time as the ligands are prepared. One can re-use this JSON to write the rigid part to the receptor PDBQT for covalent docking.

63747dd is a working checkpoint that has the optimized mk_export.py. Previously, I implemented deepcopy of Polymer which is very inefficient. Now, the Polymer is modified outside of the pose iteration. The additional PDB block is simply written by Chem.MolToPDBBlock. It is also possible, although not implemented, to export the updated Polymer JSON with covalent ligand as an extra Monomer.

Additional changes:

  • Report (raise warning or error for) lone hydrogens from bad input PDB  #231
    Fix: Log (logger.warning) lone hydrogens that are ignored in template matching
    This can be quite common when hydrogens present in PDB have an elongated bond length. Previously, these lone hydrogens were ignored silently. This PR includes a warning without directly disrupting the execution of template matching:
WARNING: Lone hydrogen is ignored: 
  PDBAtomInfo(name='HXX', resName='RG', resNum=1, icode='', chain='A') 
<mol_name>_<ligand_connect_pattern>_<monomer_label>_<residue_connect_pattern>.pdbqt

Examples:

C30H33NO6_10_A_HIS_396_2-6.pdbqt
C30H33NO6_9_A_HIS_396_0-2.pdbqt
C30H33NO6_8_A_HIS_375_2-6.pdbqt
C30H33NO6_7_A_HIS_375_0-2.pdbqt
C30H33NO6_6_A_HIS_361_2-6.pdbqt
C30H33NO6_5_A_HIS_361_0-2.pdbqt
C30H33NO6_4_A_HIS_295_2-6.pdbqt
C30H33NO6_3_A_HIS_295_0-2.pdbqt
C30H33NO6_2_A_HIS_246_2-6.pdbqt
C30H33NO6_1_A_HIS_246_0-2.pdbqt
  • Relocated parse_cmdline_res and parse_cmdline_res_assign to meeko.utils.utils
    to be re-used in mk_prepare_ligand.py

  • Added option to write updated receptor JSON files from mk_prepare_ligand.py
    to be re-used by mk_prepare_receptor.py

  --write_receptor_json
                        write receptor JSON files with consistent residue connect pattern tags. Output receptor filename will be:
                        <input_receptor_file_basename>_<residue_connect_pattern>.json
  • Added option to read receptor JSON file in mk_prepare_receptor.py
  -ij MACROMOL_JSON_FILENAME, --read_json MACROMOL_JSON_FILENAME
                        reads receptor JSON file

In summary, this PR extends covalent docking capabilities, making receptor tether atom setups more flexible and improving both the efficiency and continuity of the ligand and receptor preparation workflows. Together with optional alternate charge models for receptor and better support for metals, this work would be useful for nucleic acid docking.

in CLI script, when there are multiple connect_patterns or target res
@rwxayheee rwxayheee changed the title Polymer with covalent ligand as monomer Refactor CovalentBuilder, Enable Custom Attractors, and Export Modified Polymer to PDB Dec 12, 2024
@rwxayheee rwxayheee marked this pull request as ready for review December 12, 2024 09:10
@rwxayheee
Copy link
Contributor Author

rwxayheee commented Dec 12, 2024

This is mostly done, but since the receptor tether atoms are no longer restricted to CA,CB, I would like to take some time and test some strange setup. There're new options available, but the majority of the syntax didn't change. I’ll be adding a new usage note soon.

@rwxayheee rwxayheee marked this pull request as draft December 13, 2024 01:21
@rwxayheee
Copy link
Contributor Author

rwxayheee commented Dec 13, 2024

Usage Note 1: Custom receptor tether atoms (C-O) by Smarts: Docking Penicillin G (a bad enantiomer?)

  1. Pre-process PDB Structure (This may be optional, depending on the starting structure)

Input:
1FQG.pdb.txt

grep -v HOH 1FQG.pdb > 1FQG_dry.pdb

mk_prepare_receptor.py -i 1FQG_dry.pdb --delete_residue A:523 \
-o 1FQG_unbound -j \
--write_pdb 1FQG_unbound.pdb
  1. Prepare Ligand

Input:
Penicillin_g_methyl_ester.sdf.txt

mk_prepare_ligand.py -i Penicillin_g_methyl_ester.sdf --tether_smarts "[CH3]OC(=O)" \
--receptor 1FQG_unbound.json --rec_residue A:SER:70 --rec_tether_smarts "C[OH1]" \
--write_receptor
  1. Prepare Receptor
receptor_json="1FQG_unbound_A_SER_70_6-9.json"
ligand_pdbqt="Penicillin_g_methyl_ester_1_A_SER_70_6-9.pdbqt"
grep -v _RES $ligand_pdbqt > box_envelop.pdbqt

mk_prepare_receptor.py -ij $receptor_json \
-o 1FQG_receptor -p -g \
--box_envelop box_envelop.pdbqt --padding 4
  1. Export Docked Complex PDB

Input:
Penicillin_g_methyl_ester_1_A_SER_70_6-9.dlg.txt

adgpu_output="Penicillin_g_methyl_ester_1_A_SER_70_6-9.dlg"

mk_export.py $adgpu_output -j 1FQG_unbound.json -p dock_complexes.pdb

Output:
dock_complexes.pdb.txt

Usage Note 2: Custom receptor tether atoms by input atom names: In RNA & Targeting a Nonstandard Residue

This example needs 9391c01 in #259

  1. Pre-process PDB Structure

Input:
3ZD5.pdb.txt

grep -v HOH 3ZD5.pdb > 3ZD5_dry.pdb

mk_prepare_receptor.py -i 3ZD5_dry.pdb \
-o 3ZD5_unbound -j \
--write_pdb 3ZD5_unbound.pdb
  1. Prepare Ligand

Input:
iminomethyl_aniline.sdf.txt

mk_prepare_ligand.py -i iminomethyl_aniline.sdf --tether_smarts "CN=C" \
--receptor 3ZD5_unbound.pdb --rec_residue B:5BU:11 --rec_tether_names "C5" "BR" \
--write_receptor
  1. Prepare Receptor
receptor_json="3ZD5_unbound_B_5BU_11_13-14.json"
ligand_pdbqt="iminomethyl_aniline_1_B_5BU_11_13-14.pdbqt"
grep -v _RES $ligand_pdbqt > box_envelop.pdbqt

mk_prepare_receptor.py -ij $receptor_json -o 3ZD5_receptor -p -g --box_envelop box_envelop.pdbqt --padding 4
  1. Export Docked Complex PDB

Input:
iminomethyl_aniline_1_B_5BU_11_13-14.dlg.txt

adgpu_output="iminomethyl_aniline_1_B_5BU_11_13-14.dlg"

mk_export.py $adgpu_output -j 3ZD5_unbound.json -p dock_complexes.pdb 

Output:
dock_complexes.pdb.txt

@rwxayheee rwxayheee marked this pull request as ready for review December 13, 2024 04:18
@rwxayheee rwxayheee requested a review from diogomart December 13, 2024 04:18
@rwxayheee rwxayheee mentioned this pull request Dec 15, 2024
@rwxayheee
Copy link
Contributor Author

rwxayheee commented Dec 15, 2024

Just one more comment: We could modularize the prepare covalent ligand function similar to the original CovalentBuilder. However, it’s not necessary for the ligand’s anchored atoms to be linked to actual atoms. Similarly, the receptor tether atoms may not originate from the same monomer (especially if metals are involved) or share the identical atom types with the ligand tether atoms. Given the wide variety of possible configurations, it seems much more convenient to just set up the tether atoms by their indices and coordinates. This can be supplied from Python and via the transform function.

@rwxayheee rwxayheee changed the title Refactor CovalentBuilder, Enable Custom Attractors, and Export Modified Polymer to PDB Refactor CovalentBuilder, Enable Custom Tether Atoms within Receptor, and Export Modified Polymer to PDB Dec 17, 2024
@diogomart
Copy link
Contributor

Thank you for the great work!

I think we want the covalent ligand to be a monomer because it is covalently bonded to the receptor and thus it is the same molecule as the receptor. Future features that require this design are:

  1. parameterize the entire polymer including covalent attachment for MD simulation
  2. enable flexibilization of the entire sidechain, e.g. from the CA-CB bond if it's an amino acid, not just from the attachment point. Will require temporarily merging monomers as implemented in create single rdkit mol from entire chorizo #188

We discussed creating a new command line script for preparing covalent modification. Do you have more thoughts on that? I'm not sure it's a good or a bad idea, I'm bringing it up because it feels strange to have mk_prepare_ligand.py being able to fully prepare a receptor and running mk_prepare_receptor.py twice, one before and once after preparing the ligand. Although I understand why you did that and honestly can't really think of a better alternative, but maybe it's something we should discuss more to try to find a more linear workflow.

@rwxayheee
Copy link
Contributor Author

rwxayheee commented Dec 22, 2024

Hi @diogomart
Thanks for your kind response. I understand your points. Here's some tentative plan for moving forward -

parameterize the entire polymer including covalent attachment for MD simulation

This makes sense and can be very useful. But to avoid overhead I want to defer it, and not include the "parameterization of the modified polymer" in docking preparation or basic export. In this PR, I put very primitive codes into mk_export.py (but eventually I commented it out):
https://github.com/forlilab/Meeko/pull/277/files#diff-de93e8e364ffdcbdc7c1630eea2b3730eb6253e416c1a0d6ee640d2b7c9c2cfdR196-R204

This part of the code uses the two-monomer representation of target residue and covalent ligand, which is more flexible. This is useful for processing a large library of covalent ligands.
But to parameterize the modified polymer, one-monomer (combining of target residue and covalent ligand) might be easier for padding and parameterization purposes (unless the residue is too big).

Therefore, for post-processing of covalent docking and parameterization, a "monomer-fusion" function could be helpful. It needs more work, but I can open an issue or new PR for this. Does it sound like a good plan?

With the two-monomer representation or even more fragmentation, we need to think about how to reconstruct padded_mol for individual fragments. This can also be a very useful thing to consider in future (for residue that is too big). But I will not include it in this PR.

enable flexibilization of the entire sidechain, e.g. from the CA-CB bond if it's an amino acid, not just from the attachment point

This is OK before and after this PR. This PR gives the freedom to define the attachment points, in case the residue has nonstandard backbone or is a nucleic acid or for any reason the user does not want to flexibilize the entire sidechain. But again, it's always possible to flexibilize the sidechain from CA-CB.
However, I don't want to force the attachment points to be CA-CB. Will it break other related functions if the attachment points are not CA-CB?

We discussed creating a new command line script for preparing covalent modification

I can draft a command-line script. I think it might be a good idea so I can revert the changes to the current command-line scripts. Let me prepare a draft, update this PR and see if you like it.

find a more linear workflow.

I understand this too. But I also like the current way that we can re-use mk_prepare_receptor.py for multiple purposes, given the wide range of options.

Thanks again for your kind response. Let me know what you think!

@diogomart
Copy link
Contributor

I wouldn't draft a new command line script just yet. The change I'm thinking about, storing the covalent ligand inside the polymer from the start, not just after the docking, might affect how we want the scripts to look like. In a way, preparing a covalent docking includes inserting a new monomer into an existing polymer.

However, I don't want to force the attachment points to be CA-CB. Will it break other related functions if the attachment points are not CA-CB?

I was just giving an example. I was not suggesting to use atom names. What I'm thinking is to enable keeping the original residue, e.g. cysteine, as a monomer, the covalent ligand, e.g. an acrylamide, as a second monomer, and still be able to make both the cysteine and acrylamide flexible, or just the acrylamide flexible as in the examples you provided.

we need to think about how to reconstruct padded_mol for individual fragments

Great point, this is the crux of it. It is too crazy to have the user specify a padding reaction?

@diogomart
Copy link
Contributor

diogomart commented Dec 22, 2024

To clarify, I think we should stick to the two-monomer representation of target residue and covalent ligand that you already implemented. I think that's the best design. The only change is to add the covalent ligand to the polymer during preparation (before docking). The positions will clash and will be reasonable only after docking.

@rwxayheee
Copy link
Contributor Author

rwxayheee commented Dec 22, 2024

Hi @diogomart
Thanks for the clarification. We can meet and talk more about this when it's convenient.

In a way, preparing a covalent docking includes inserting a new monomer into an existing polymer... storing the covalent ligand inside the polymer from the start, not just after the docking

This is the overhead I was trying to avoid for docking. I don't want to store the covalent ligand in polymer, if we're only at the stage of docking. The modified polymer can be constructed after processing of covalent docking results (not for all covalent ligands). unless it's required by the docking engine, I don't know

What I'm thinking is to enable keeping the original residue, e.g. cysteine, as a monomer, the covalent ligand, e.g. an acrylamide, as a second monomer, and still be able to make both the cysteine and acrylamide flexible, or just the acrylamide flexible as in the examples you provided.

In this PR, the original residue (target residue) is kept as is. It just has some atoms ignored so that they won't appear in the (rigid) receptor PDBQT file. The ignored atoms can be up to CA-CB or even backbone atoms, which is how this PR makes the target residue flexible.

still be able to make both the cysteine and acrylamide flexible, or just the acrylamide flexible as in the examples you provided.

Both are possible with this PR (it's exactly what this PR is for it's supported, but not implemented the same way a standard residue is flexiblized). Maybe we were just talking about different ways to implement it... It's always possible to make both the residue and the covalent ligand flexible, but the current AutoDock flexibility model needs the ROOT to be on the one that's closer to the receptor core.

It is too crazy to have the user specify a padding reaction?

This is ok (currently possible?) but requires user knowledge of RDKit reaction language.

I really appreciate the time and your thoughts. I will think more about this!

@rwxayheee
Copy link
Contributor Author

The only change is to add the covalent ligand to the polymer during preparation (before docking). The positions will clash and will be reasonable only after docking.

I can draft that too and see if we like it! ^^ I really don't have a strong preference.

@rwxayheee
Copy link
Contributor Author

rwxayheee commented Dec 22, 2024

Just one more clarification for this PR, on making both the cysteine (original residue) and acrylamide (covalent ligand) flexible:

This is supported in this PR, but the flexible branches taken from the target residue need to be part of the covalent ligand. This way, the flexibility model is constructed over the covalent ligand Monomer.

Currently there's no quick way to construct and store one flexibility model over two monomers (??). This is why I was thinking about making a "monomer-fusion" function.

@diogomart
Copy link
Contributor

Can you elaborate on what you mean by overhead and on the disadvantages of including the covalent ligand in the polymer?

@rwxayheee
Copy link
Contributor Author

rwxayheee commented Dec 23, 2024

Can you elaborate on what you mean by overhead and on the disadvantages of including the covalent ligand in the polymer?

I do like the idea of including the covalent ligand in the Polymer (I already wrote the implementation in the code block, but commented it out because it's not essential for exporting to PDB). I just wish to defer (delay) this process until the post-processing of docking, perhaps after some filtering.

This is because, I wanted to avoid modification of polymer for individual covalent ligand in the docking preparation stage. I feel like Polymer is a very heavy data structure, thus I want to avoid the process of copying (making an editable copy) or modifying it. If we do want to store a polymer-like structure for the target residue and the covalent ligand, I might want to implement an "di-mer" editable copy so we're not readinf/writing the whole Polymer over and over again.

The implementation (in this PR) does not modify the polymer for each covalent ligand. So that the receptor only needs to be processed once in the docking preparation, like in basic docking. The biggest advantage of current is that, the receptor is only prepared once and it's good for any covalent ligands (just like the way before this PR).

The modified polymer can be exported from mk_export.py if we uncomment the commented code block. (but after reviewing the JSON-bound properties, I think it's still missing other properties like bonds and padded_mol to be actually valid polymer)

I will think about this >.< I can draft a version that modifies Polymer in the preparation. Actually I'm not sure how much "overhead" I'm talking about, maybe I'm just afraid of modifying Polymer...

We can talk tomorrow or sometime in the near future about this. I get your points of including the covalent ligand in the Polymer and the important of it. Thanks again for your time and kind response.

@diogomart
Copy link
Contributor

not modify the polymer for each covalent ligand. [...] the receptor is only prepared once and it's good for any covalent ligands

This makes perfect sense! And it's far superior for large screening. Thanks for the explanation. I was too focused on the single ligand case. Let's keep this as you implemented: ligand not inserted in polymer before docking.

One other concern I have is that the is_ignore attribute is True only for atoms that are excluded from energetic interactions, but not as a means to decide if they are part of a flexible sidechain or not. I know that no atoms are lost in this process but that depends on the modified polymer being used with the associated covalent ligands. If someone tries to use the polymer without a covelent ligand it will be "missing" one or more atoms. I'll think about this a little more.

Let's discuss tomorrow and then decide if we want to modify this PR or not.

@rwxayheee
Copy link
Contributor Author

Thanks! Let's meet tomorrow and discuss.

One other concern I have is that the is_ignore attribute is True only for atoms that are excluded from energetic interactions, but not as a means to decide if they are part of a flexible sidechain or not.

Yes, that's a great point I didn't consider. Maybe I'm not doing the right thing.. is_ignore is the hacky way I use to suppress the writing to receptor PDBQT.
Thanks again for the review, your insights and kind support, as always.

@rwxayheee
Copy link
Contributor Author

Commenting for record purposes - recap of today's discussion:

  1. The major drawback with this PR is that no updates are made to rdkit_mol or padded_mol of the target residue, which could potentially impact the parameterization. While this should work with the current docking programs, the receptor may not be fully accurate for further parameterization

  2. We'll put this PR on hold to evaluate whether further progress on create single rdkit mol from entire chorizo #188 can address the challenges related to monomer manipulation in this PR.

  3. We might want to separate covalent docking preparation functionalities from the basic docking preparation command-line scripts. Covalent docking preparation could have its own dedicated script to generate both the receptor (polymer with corrected target residue characteristics) and the ligands.

With that, we will convert it to a draft for now.

@rwxayheee rwxayheee marked this pull request as draft December 24, 2024 01:41
@rwxayheee
Copy link
Contributor Author

I will close this PR as it has become too complex to move forward without significant changes to other parts of meeko.

Smaller but more realistic objects like being able to parse covalently modified polymers, handling of generalized padding, cross-linking / multiple bonds between pairs of residues, template fusion, etc, are required to have the blunt-headed and properly padded target residue for covalent docking. But at the moment, I feel it's very difficult to continue this PR without making these changes to many other parts. It might cause confusion to include all changes in every aspect into this PR.

Even with the drawback we discussed above, I still believe the current state of this PR is an efficient solution with a reasonable procedure for docking preparation. Therefore, I would like to leave this PR as a workaround for those who need to perform covalent docking with custom tether atoms, preferably with the current docking engines like AutoDock GPU.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment