[BUG] Runtime TRITON error with RF3 on Colab T4 GPU #240

@BJHardy

I am trying to run RF3 on a Google Colab T4 GPU, following the example in the IPD Design Pipeline Colab, but I consistently get this error when I run prediction on a batch of sequences:

INFO:rf3.inference_engines.rf3:[rank: 0] Loading checkpoint from /root/.foundry/checkpoints/rf3_foundry_01_24_latest_remapped.ckpt...
WARNING:atomworks.ml:Using element type for atom names of atomized tokens.
INFO: Using bfloat16 Automatic Mixed Precision (AMP)
INFO:lightning.pytorch.utilities.rank_zero:Using bfloat16 Automatic Mixed Precision (AMP)
INFO:rf3.inference_engines.rf3:[rank: 0] Outputs will be written to /content/Predictions.
INFO:rf3.inference_engines.rf3:[rank: 0] Found 2 structures to predict!
INFO:rf3.inference_engines.rf3:[rank: 0] Predicting structure 1/2: seq0
WARNING:atomworks.ml:Cached data not found for ALA at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/A/ALA/ALA.pt
WARNING:atomworks.ml:Cached data not found for ARG at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/A/ARG/ARG.pt
WARNING:atomworks.ml:Cached data not found for ASN at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/A/ASN/ASN.pt
WARNING:atomworks.ml:Cached data not found for ASP at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/A/ASP/ASP.pt
WARNING:atomworks.ml:Cached data not found for ATP at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/A/ATP/ATP.pt
WARNING:atomworks.ml:Cached data not found for GLN at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/G/GLN/GLN.pt
WARNING:atomworks.ml:Cached data not found for GLU at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/G/GLU/GLU.pt
WARNING:atomworks.ml:Cached data not found for GLY at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/G/GLY/GLY.pt
WARNING:atomworks.ml:Cached data not found for HIS at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/H/HIS/HIS.pt
WARNING:atomworks.ml:Cached data not found for ILE at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/I/ILE/ILE.pt
WARNING:atomworks.ml:Cached data not found for LEU at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/L/LEU/LEU.pt
WARNING:atomworks.ml:Cached data not found for LYS at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/L/LYS/LYS.pt
WARNING:atomworks.ml:Cached data not found for MET at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/M/MET/MET.pt
WARNING:atomworks.ml:Cached data not found for PHE at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/P/PHE/PHE.pt
WARNING:atomworks.ml:Cached data not found for PRO at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/P/PRO/PRO.pt
WARNING:atomworks.ml:Cached data not found for SER at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/S/SER/SER.pt
WARNING:atomworks.ml:Cached data not found for THR at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/T/THR/THR.pt
WARNING:atomworks.ml:Cached data not found for TRP at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/T/TRP/TRP.pt
WARNING:atomworks.ml:Cached data not found for TYR at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/T/TYR/TYR.pt
WARNING:atomworks.ml:Cached data not found for VAL at /net/tukwila/lschaaf/datahub/MACE-OMOL-Jul2025/mace_embeddings/V/VAL/VAL.pt
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_391/4213100572.py in <cell line: 0>()
      1 # Run RF3 prediction for all designed sequences
      2 # Set an output directory to save predicted strucutres into local file system
----> 3 rf3_all_outputs = inference_engine.run(inputs=rf3_inputs, out_dir="Predictions", annotate_b_factor_with_plddt=True)

50 frames
/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/compiler.py in make_llir(self, src, metadata, options, capability)
    339         if os.environ.get("TRITON_DISABLE_LINE_INFO", "0") == "0":
    340             passes.llvmir.add_di_scope(pm)
--> 341         pm.run(mod)
    342         # LLVM-IR (MLIR) -> LLVM-IR (LLVM)
    343         llvm.init_targets()

RuntimeError: PassManager::run failed 

After some googling, it sounds like this is an internal GPU issue rather than a problem with the code, but I find it odd that it happens every time I try to run RF3.
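One thing that may be worth ruling out, as an assumption rather than a confirmed diagnosis: the log shows bfloat16 AMP being enabled, and the T4 is a Turing GPU with compute capability 7.5, whereas native bfloat16 support starts with Ampere (compute capability 8.0). The sketch below only reports what the runtime sees; `has_native_bf16` and `describe_gpu` are hypothetical helper names, not part of RF3 or Triton, and the check says nothing definitive about why `PassManager::run` fails.

```python
# Minimal diagnostic sketch: report the GPU's compute capability and whether
# it implies native bfloat16 support.
# Assumption: native bf16 starts at compute capability 8.0 (Ampere);
# a T4 (Turing) reports 7.5.


def has_native_bf16(major: int, minor: int) -> bool:
    """Return True if the compute capability is at least 8.0."""
    return (major, minor) >= (8, 0)


def describe_gpu() -> str:
    """Summarize the first visible CUDA device, or explain why none is found."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    verdict = "native" if has_native_bf16(major, minor) else "no native"
    return f"{name}: compute capability {major}.{minor}, {verdict} bfloat16"


if __name__ == "__main__":
    print(describe_gpu())
```

If this reports no native bfloat16 support, the precision setting could be one avenue to investigate, though whether RF3 exposes a way to change it is not something this report establishes.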

Is this a known bug, and do you have any advice for running RF3 on Colab?

Thanks, Ben
