
Troubleshooting

This page collects common failure modes and practical fixes. Search it for the error message you encounter. If you want a symptom-first entry point, start with Common Error Recipes and then return here for details.


Preflight checklist

Before a long run, verify:

  • You can run mlmm -h and see the CLI help.
  • MLIP model weights can be downloaded (for the default UMA backend, a Hugging Face login/token must be available on the machine; other backends may download from different sources).
  • For enzyme workflows: your input PDB(s) contain hydrogens and element symbols.
  • When you provide multiple PDBs: they have the same atoms in the same order (only coordinates differ).
  • AmberTools is correctly installed via conda channels (or built from source) and tleap is available (required for mm-parm).
  • The hessian_ff C++ native extension is correctly built (if automatic build fails, run cd hessian_ff/native && make).

Input / extraction problems

"Element symbols are missing... please run add-elem-info"

Typical message:

Element symbols are missing in '...'.
Please run `mlmm add-elem-info -i...` to populate element columns before running extract.

Fix:

  • Run:

    mlmm add-elem-info -i input.pdb -o input_with_elem.pdb
  • Then re-run extract / all using the updated PDB.

Why it happens:

  • Sometimes, PDBs do not populate the element column consistently. extract requires element symbols for reliable atom typing.
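As a quick sanity check before re-running extract, you can scan the element field yourself: in the PDB format it occupies columns 77-78 of ATOM/HETATM records. This is a standalone sketch, not part of the mlmm CLI:

```python
def missing_element_records(pdb_lines):
    """Return ATOM/HETATM lines whose element field (PDB columns 77-78) is blank."""
    missing = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            if not line[76:78].strip():  # 0-indexed slice for columns 77-78
                missing.append(line.rstrip())
    return missing

# Illustrative records: the first is padded out to column 78 with an element,
# the second stops before the element field.
with_elem = "ATOM      1  N   ALA A   1".ljust(76) + " N"
without_elem = "ATOM      2  CA  ALA A   1"
print(missing_element_records([with_elem, without_elem]))  # flags only the CA record
```

If this reports any records, run add-elem-info (or re-export the PDB from a tool that writes the element column) before extract.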

"[multi] Atom count mismatch..." or "[multi] Atom order mismatch..."

Typical messages:

[multi] Atom count mismatch between input #1 and input #2:...
[multi] Atom order mismatch between input #1 and input #2.

Fix:

  • Regenerate all structures with the same preparation workflow (same protonation tool, same settings).
  • If you add hydrogens, do it in a way that produces consistent ordering across all frames.

Tip:

  • For ensembles generated by MD, prefer extracting frames from the same trajectory/topology rather than mixing PDBs produced by different tools.

Alternative:

  • If you cannot prepare matching multi-structure inputs, use the single-structure scan workflow instead: provide one PDB with --scan-lists to generate endpoints via distance scans.
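To locate the first offending atom before regenerating structures, a small standalone check (using fixed PDB columns; not part of mlmm) can mirror the kind of comparison the [multi] validation performs:

```python
def atom_signature(pdb_lines):
    """(chain, resSeq, resName, atomName) for each ATOM/HETATM record, in file order."""
    return [
        (line[21], line[22:26].strip(), line[17:20].strip(), line[12:16].strip())
        for line in pdb_lines
        if line.startswith(("ATOM", "HETATM"))
    ]

def check_consistency(pdb_a_lines, pdb_b_lines):
    """Report the first count or order mismatch between two PDBs, or 'OK'."""
    a, b = atom_signature(pdb_a_lines), atom_signature(pdb_b_lines)
    if len(a) != len(b):
        return f"Atom count mismatch: {len(a)} vs {len(b)}"
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return f"Atom order mismatch at index {i}: {x} vs {y}"
    return "OK"
```

Run it on the line lists of any two inputs; the reported index tells you which residue to inspect in your preparation workflow.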

ML region is missing important residues

Symptoms:

  • The extracted pocket is unexpectedly small.
  • Key catalytic residues are missing.

Fixes to try:

  • Increase --radius (e.g., 2.6 -> 3.5 Angstrom).
  • Use --selected-resn to force-include residues (e.g., --selected-resn 'A:123,B:456').
  • Alternatively, you can manually create the ML region PDB using a molecular viewer (e.g., PyMOL) by selecting the active-site atoms and exporting them. Supply this PDB via --model-pdb.

Unreliable energies / barriers

Symptoms:

  • Calculated energies or reaction barriers seem unreasonable.
  • Results change significantly when the model size is increased.

Fix:

  • If the extracted pocket is too small, calculated energies and barriers may be unreliable. Increase the extraction radius (e.g., -r 4.0 or higher) to include more of the protein environment:

    mlmm extract -i complex.pdb -c 'SUB' -o pocket.pdb -r 4.0

Non-standard residues not truncated correctly

If the extracted pocket contains modified amino acid residues (e.g., phosphoserine, methylated lysine, D-amino acids) with non-standard three-letter codes, backbone truncation and link-hydrogen placement will not be applied to them by default. Use --modified-residue to register them:

mlmm extract -i complex.pdb -c PRE --modified-residue "SEP,TPO,MLY" -o pocket.pdb

The same flag is available on the all command and is forwarded to the extraction stage.

If --modified-residue is insufficient (e.g., the residue has an unusual backbone topology), construct the pocket model manually with appropriate link hydrogens, and pass the pocket PDB and parm7 files directly to downstream commands (opt, tsopt, path-opt, etc.) via --parm and --model-pdb.


Charge / spin problems

Charge resolution issues

Calculation subcommands require explicit -q/--charge and -m/--multiplicity. In all, charge is resolved in order: -q/--charge override -> extraction summary -> --ligand-charge fallback when extraction is skipped.

Fix:

  • Provide charge and multiplicity explicitly:

    mlmm path-search -i R.pdb P.pdb --parm real.parm7 --model-pdb model.pdb -q 0 -m 1
  • Or, when using extraction, provide a residue-name mapping and run through all:

    mlmm -i R.pdb P.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3'
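The resolution order above can be sketched in plain Python (a hypothetical helper for illustration only; the real logic lives inside all):

```python
def resolve_charge(cli_charge=None, extraction_summary=None, ligand_charge_fallback=None):
    """Documented precedence: explicit -q/--charge > extraction summary > --ligand-charge fallback."""
    if cli_charge is not None:          # -q/--charge always wins, even if it is 0
        return cli_charge
    if extraction_summary is not None and "charge" in extraction_summary:
        return extraction_summary["charge"]
    if ligand_charge_fallback is not None:
        return ligand_charge_fallback
    raise ValueError("Charge could not be resolved; pass -q/--charge explicitly.")
```

Note that a charge of 0 passed via -q is still an explicit value and takes precedence over the extraction summary.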

AmberTools / mm-parm problems

tleap not found

Typical message:

FileNotFoundError: tleap not found on PATH

or

mm-parm requires AmberTools (tleap, antechamber, parmchk2).

Fix:

  • Install AmberTools via conda:

    conda install -c conda-forge ambertools -y
  • Or build from source (https://ambermd.org/AmberTools.php), or load the appropriate module on HPC:

    module load ambertools
  • Verify availability:

    which tleap
    which antechamber
    which parmchk2
  • Note: without AmberTools, you can still run opt, tsopt, path-search, etc. if you supply --parm manually.


antechamber fails for a ligand

Symptoms:

  • mm-parm fails during ligand parameterization.
  • Errors about atom type assignment or charge calculation.

Fixes to try:

  • Check that the ligand has correct element symbols and bond connectivity in the PDB.

  • Ensure --ligand-charge is specified correctly: -l 'GPP:-3,SAM:1'.

  • Use --keep-temp to preserve intermediate files and inspect <resname>.antechamber.log:

    mlmm mm-parm -i input.pdb -l 'LIG:-1' --keep-temp
  • Check that hydrogen atoms are correctly added and TER records are appropriate.

  • Ensure --ligand-mult is specified for non-singlet ligands (e.g., --ligand-mult 'HEM:1,NO:2'). The default spin multiplicity is 1 (singlet).

  • Try running antechamber manually on the extracted ligand PDB to diagnose the issue:

    antechamber -i ligand.pdb -fi pdb -o ligand.mol2 -fo mol2 -c bcc -nc -3 -at gaff2
  • For higher-accuracy partial charges, consider computing RESP charges from an HF/6-31G* calculation and providing custom frcmod/lib files instead of relying on AM1-BCC.


parm7/rst7 mismatch errors

Typical messages:

Atom count in parm7 (...) does not match input PDB (...)

or

RuntimeError: parm7 topology does not match the input structure

or

Coordinate shape mismatch for... got (N, 3), expected (M, 3)

Fix:

  • The parm7 file must correspond to exactly the same atoms (in the same order) as the input PDB.
  • Re-run mm-parm to regenerate the parm7 from the current PDB.
  • Do not edit or reorder PDB atoms after running mm-parm.
  • When re-running mm-parm, use the output PDB (<prefix>.pdb) as the input for subsequent calculations, since tleap may add or remove hydrogens.
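To confirm the counts yourself: NATOM is the first integer of the parm7 POINTERS section, so a standalone cross-check (not part of mlmm) can be as small as:

```python
def parm7_natom(parm7_text):
    """NATOM is the first 8-column integer after %FLAG POINTERS / %FORMAT(10I8)."""
    lines = parm7_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("%FLAG POINTERS"):
            return int(lines[i + 2][:8])  # i+1 is the %FORMAT line; data starts at i+2
    raise ValueError("POINTERS section not found")

def pdb_natom(pdb_lines):
    """Count ATOM/HETATM records in a PDB."""
    return sum(1 for line in pdb_lines if line.startswith(("ATOM", "HETATM")))
```

If the two counts differ, regenerate the parm7 with mm-parm rather than editing either file by hand.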

parm7 element order does not match PDB

Symptoms:

  • oniom-export reports "Element sequence mismatch at atom index..."

Fix:

  • The correct fix is to use the same PDB for -i that was used when generating the parm7.
  • As a workaround, --no-element-check disables the element check (verify the results manually).

hessian_ff build problems

Build fails ("make" errors)

Typical symptoms:

  • make in hessian_ff/native/ produces compilation errors.
  • ImportError: cannot import name 'ForceFieldTorch' from 'hessian_ff'.
  • RuntimeError: hessian_ff build attempts failed: ...

Fixes to try:

  • Ensure you have a C++ compiler (g++ >= 9) installed:

    g++ --version
  • Ensure PyTorch headers are available:

    python -c "import torch; print(torch.utils.cmake_prefix_path)"
  • On HPC, load a compiler module:

    module load gcc/11
  • Clean and rebuild:

    conda install -c conda-forge ninja -y
    cd hessian_ff/native && make clean && make

hessian_ff import errors

Typical message:

ImportError: cannot import name 'ForceFieldTorch' from 'hessian_ff'

or:

RuntimeError: hessian_ff build attempts failed: ...
To rebuild hessian_ff native extensions in this environment:
  conda install -c conda-forge ninja -y
  cd $(python -c "import hessian_ff; print(hessian_ff.__path__[0])")/native && make clean && make

Fix:

  • The C++ native extension needs to be built first:

    cd hessian_ff/native && make
  • Ensure the hessian_ff package is in your Python path (it should be if you installed mlmm-toolkit with pip install -e .).


B-factor layer assignment problems

Wrong layer assignments

Symptoms:

  • Atoms are assigned to unexpected layers.
  • ML region is too small or too large.

Fixes to try:

  • B-factor encoding: ML = 0.0, Movable-MM = 10.0, Frozen-MM = 20.0.
  • Inspect the layer-assigned PDB visually (color by B-factor in your molecular viewer).
  • Check that --model-pdb correctly defines the ML region atoms.
  • Adjust the distance cutoffs in define-layer:
      • --radius-freeze (default 8.0 Angstrom): controls the Movable-MM/Frozen-MM boundary.
      • If needed, control the Hessian-target MM region separately in calc options (hess_cutoff, hess_mm_atoms).
  • If using use_bfactor_layers: true in YAML, verify that B-factor values match the expected encoding (0.0, 10.0, 20.0 with tolerance 1.0).

B-factor values are not recognized

Typical symptoms:

  • Calculator treats all atoms as frozen or all as ML.
  • B-factor values are not one of {0.0, 10.0, 20.0}.

Fix:

  • Re-run define-layer to ensure correct B-factor encoding.
  • A tolerance of 1.0 is applied: B-factors near 0/10/20 map to ML/Movable/Frozen.
  • Do not manually edit B-factors to arbitrary values.
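The mapping, including its tolerance, is simple enough to check by hand. A sketch of the documented encoding (not mlmm's own code):

```python
LAYER_BFACTORS = {"ML": 0.0, "Movable-MM": 10.0, "Frozen-MM": 20.0}
TOLERANCE = 1.0  # documented tolerance around each reference value

def layer_from_bfactor(b):
    """Map a B-factor to a layer name using the 0/10/20 encoding with tolerance 1.0."""
    for name, ref in LAYER_BFACTORS.items():
        if abs(b - ref) <= TOLERANCE:
            return name
    raise ValueError(f"B-factor {b} is not within {TOLERANCE} of 0/10/20")
```

Any B-factor outside all three tolerance windows (e.g., 5.0) has no layer, which is why arbitrary manual edits break the assignment.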

--detect-layer does not work as expected

Symptoms:

  • Automatic layer detection from B-factors produces unexpected ML/Movable/Frozen splits.
  • Running with --detect-layer without --model-pdb fails.

Fixes to try:

  • Ensure the input is a PDB (or an XYZ with --ref-pdb).
  • Re-run define-layer to explicitly assign B-factors, then use the generated PDB.
  • For distance-based control, specify hess_cutoff / movable_cutoff and switch to --no-detect-layer if needed.
  • Note that supplying --movable-cutoff disables --detect-layer.

Installation / environment problems

MLIP model download errors

Symptoms:

  • Errors about being unable to download model weights or missing authentication. For the default UMA backend, this typically means a missing Hugging Face login/token.

Fix:

  • Log in once per environment/machine:

    huggingface-cli login
  • On HPC, ensure your home directory (or HF cache directory) is writable from compute nodes.


CUDA / PyTorch mismatch

Symptoms:

  • torch.cuda.is_available() is false even though you have a GPU.
  • CUDA runtime errors at import time.

Fixes:

  • Install a PyTorch build matching your cluster CUDA runtime.

  • Confirm GPU visibility:

    nvidia-smi
    python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"

DMF mode fails (cyipopt missing)

If you use DMF (--mep-mode dmf) and see errors importing IPOPT/cyipopt:

Fix:

  • Install cyipopt from conda-forge (recommended) before installing mlmm:

    conda install -c conda-forge cyipopt

Plot export fails (Chrome missing)

If figure export fails and you see Plotly/Chrome-related errors:

Fix:

  • Install a headless Chrome once:

    plotly_get_chrome -y

Calculation / convergence problems

CUDA out of memory (VRAM)

Symptoms:

  • torch.cuda.OutOfMemoryError: CUDA out of memory
  • System hangs or crashes during Hessian calculation.

ML/MM systems are typically larger than pure MLIP calculations, so VRAM pressure is higher.

Fixes to try (in order of likelihood):

  • Verify Frozen-MM layer: check that define-layer has correctly assigned Frozen-MM atoms (B=20.0). If the Frozen-MM region is too small, the Movable-MM region (and thus the Hessian) becomes unnecessarily large. Decrease --radius-freeze to expand the Frozen region.
  • Reduce ML region size: use a smaller extraction radius (--radius in extract) or manually define a smaller ML region PDB via --model-pdb.
  • Use FiniteDifference ML Hessian: set --hessian-calc-mode FiniteDifference (uses less VRAM but is slower).
  • Pre-define layers with define-layer and use use_bfactor_layers: true.
  • Use a GPU with more VRAM: 24 GB+ recommended for systems with 500+ ML atoms; 48 GB+ for 1000+ ML atoms.

TS optimization fails to converge

Symptoms:

  • TS optimization runs for many cycles without converging.
  • Multiple imaginary frequencies remain after optimization.

Fixes to try:

  • Switch optimizer modes: --opt-mode grad (Dimer) or --opt-mode hess (RS-I-RFO).
  • Enable flattening of extra imaginary modes: --flatten.
  • Increase max cycles: --max-cycles 20000.
  • Use tighter convergence: --thresh baker or --thresh gau_tight.
  • Adjust hess_cutoff to expand the range of atoms included in the Hessian calculation.

Optimizer stalls but the energy is no longer changing (MLIP force noise floor)

Symptoms:

  • opt/tsopt keeps running but the reported energy has been flat for many cycles (every few dozen steps shows |dE| < 1e-4 au).
  • Max/RMS forces sit just above the gau/baker thresholds and never drop further, even after thousands of cycles.
  • Summary log eventually reports convergence via the energy plateau fallback rather than the gradient preset.

Why it happens:

  • MLIPs have a finite numerical precision (force "noise floor"). For large ML/MM systems, that noise floor can exceed the standard gradient-based convergence thresholds (gau, baker, …), so the forces never drop below the preset even though the geometry is effectively stationary.

What to do:

  • Since v0.2.8, this is handled automatically: the shared opt block enables energy_plateau: true by default. When the energy range over the last 50 steps falls below 1.0e-4 au (~0.06 kcal/mol), the optimizer declares convergence and exits cleanly. No action is needed in the common case.
  • If you see the plateau fallback triggering too early on a system that is still clearly moving, tighten the thresholds in YAML:
    opt:
      energy_plateau_thresh: 1.0e-05  # stricter plateau tolerance (au)
      energy_plateau_window: 100      # require a longer flat stretch
  • To disable the fallback entirely (e.g., for benchmarking convergence behavior), set opt.energy_plateau: false — the optimizer will then rely solely on the thresh preset.
  • The plateau check is automatically skipped for chain-of-states (COS) optimizers (GS/DMF string optimizers), so path-opt / path-search are unaffected.
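The fallback criterion itself is easy to reason about. A sketch of the documented behavior (energy range over the trailing window below a threshold), not the toolkit's exact implementation:

```python
def energy_plateau(energies, window=50, thresh=1.0e-4):
    """True when the energy range (max - min, in au) over the last `window` steps is below `thresh`."""
    if len(energies) < window:
        return False  # not enough history yet
    tail = energies[-window:]
    return max(tail) - min(tail) < thresh
```

With the defaults this matches the documented trigger: a flat stretch of 50 steps whose total energy range is under 1.0e-4 au (~0.06 kcal/mol).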

IRC does not terminate properly

Symptoms:

  • IRC stops before reaching a clear minimum.
  • Energy oscillates or gradient remains high.

Fixes to try:

  • Reduce step size: --step-size 0.05 (default is 0.10).
  • Increase max cycles: --max-cycles 200.
  • Check if the TS candidate has only one imaginary frequency before running IRC.

MEP search (GSM/DMF) fails or gives unexpected results

Symptoms:

  • Path search terminates with no valid MEP.
  • Bond changes are not detected correctly.

Fixes to try:

  • Increase --max-nodes (e.g., 15 or 20) for complex reactions.
  • Enable endpoint pre-optimization: --preopt.
  • Try the alternative MEP method: --mep-mode dmf (if GSM fails) or vice versa.
  • Adjust bond detection parameters in YAML (bond.bond_factor, bond.delta_fraction).

Performance / stability tips

  • Out of memory (VRAM): reduce ML region size, reduce Hessian-target MM region, reduce nodes (--max-nodes), or use lighter optimizer settings (--opt-mode grad).
  • Analytical ML Hessian is slow or OOM: use --hessian-calc-mode FiniteDifference for the ML region. Only use Analytical if you have ample VRAM (24 GB+ recommended for 300+ ML atoms).
  • MM Hessian: mm_fd: true (default) uses finite-difference for MM Hessian. Analytical MM Hessian (mm_fd: false) is faster for small systems but may require more memory.
  • MM Hessian is slow: set hess_cutoff to limit the number of Hessian-target MM atoms.
  • Large systems (2000+ atoms): ensure frozen atoms are properly set (Frozen layer, B=20) to reduce the movable DOF count. Use define-layer with appropriate cutoffs.
  • Multi-GPU: place ML on one GPU (ml_cuda_idx: 0) and MM on another (mm_device: cuda, mm_cuda_idx: 1) if available.
  • ML and MM parallel execution: by default, ML (GPU) and MM (CPU) run in parallel. Tune CPU thread count with mm_threads.
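A YAML sketch collecting the keys mentioned above (key names as used on this page; values are illustrative and the exact schema may differ between versions):

```yaml
# calc options (illustrative values)
ml_cuda_idx: 0            # ML backend on GPU 0
mm_device: cuda
mm_cuda_idx: 1            # MM on a second GPU, if available
mm_threads: 8             # CPU thread count when MM runs on CPU
mm_fd: true               # finite-difference MM Hessian (default)
hess_cutoff: 6.0          # limit Hessian-target MM atoms (Angstrom; illustrative)
use_bfactor_layers: true  # honor B-factor layer encoding from define-layer
```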

Backend-specific issues

ImportError when using --backend orb/mace/aimnet2

Symptom: ImportError: orb-models is required for the ORB backend

Fix: Install the optional dependency for the chosen backend:

pip install "mlmm-toolkit[orb]"      # ORB backend
pip install "mlmm-toolkit[aimnet]"  # AIMNet2 backend
pip install --no-deps mace-torch      # MACE backend

CUDA out of memory with non-UMA backends

Symptom: RuntimeError: CUDA out of memory when using ORB, MACE, or AIMNet2.

Fix: Non-UMA backends use finite-difference Hessians, which require more VRAM. Options:

  • Use --hessian-calc-mode FiniteDifference explicitly with a smaller hess_cutoff
  • Use ml_device: cpu in YAML (slower but avoids VRAM limits)

xTB not found when using --embedcharge

Symptom: XTBEmbedError: xTB command not found

Fix: Install xTB and ensure it's on $PATH:

conda install -c conda-forge xtb

How to report an issue

When asking for help, include:

  • The exact command line you ran
  • summary.log (or console output)
  • The smallest input files that reproduce the problem (if possible)
  • Your environment: OS, Python, CUDA, PyTorch versions
  • Whether AmberTools and hessian_ff are properly installed