- 2025-07-23: HelixFold3.2 brings significant advancements in protein-related tasks on FoldBench, along with a marked reduction in atomic clashes.
The AlphaFold series has transformed protein structure prediction with remarkable accuracy, often matching experimental methods. While AlphaFold2 and AlphaFold-Multimer are open-sourced, facilitating rapid and reliable predictions, AlphaFold3 remains partially accessible, restricting further development.
The PaddleHelix team is working on HelixFold3 to replicate the advanced capabilities of AlphaFold3. Insights from the AlphaFold3 paper inform our approach and build on our prior work with HelixFold, HelixFold-Single, HelixFold-Multimer, and HelixDock. Currently, HelixFold3's accuracy in predicting the structures of small molecule ligands, nucleic acids (including DNA and RNA), and proteins is comparable to that of AlphaFold3. We are committed to continuously enhancing the model's performance and rigorously evaluating it across a broader range of biological molecules. Please refer to our HelixFold3 technical report for more details.
The HelixFold3 server is available on the PaddleHelix website and supports two interaction modes:
1) Visualized interactive interface: Designed for user-friendly operations, allowing researchers to explore structural predictions intuitively.
2) API-based access: Facilitates high-throughput predictions, suitable for large-scale screening and design workflows.
The free version of the HelixFold3 server is restricted to non-commercial use, while the paid version offers unrestricted usage, enabling commercial applications. This flexibility ensures accessibility for academic research while supporting industrial needs with commercial-grade output options.
Specific environment settings are required to reproduce the results reported in this repo:
- Python: 3.10
- CUDA: 12.0
- CuDNN: 8.4.0
- NCCL: 2.14.3
- Paddle: 3.1.0
These settings are recommended because they match the configuration we used on our A100 machines for all inference experiments.
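As a quick sanity check of the recommended versions, the following minimal sketch compares the running Python interpreter and the installed PaddlePaddle package against the values listed above. The function name `check_environment` is hypothetical, and the distribution name `paddlepaddle-gpu` is an assumption about how PaddlePaddle was installed. CUDA, CuDNN and NCCL are not checked here.

```python
import sys
from importlib import metadata

# Recommended versions copied from the list above.
RECOMMENDED_PYTHON = (3, 10)
RECOMMENDED_PADDLE = "3.1.0"


def check_environment(package="paddlepaddle-gpu"):
    """Return a list of human-readable mismatches (empty if everything matches).

    `package` is the pip distribution name to look up; "paddlepaddle-gpu"
    is an assumption and may differ in your installation.
    """
    problems = []
    if sys.version_info[:2] != RECOMMENDED_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:2])
        problems.append(f"Python {found} found, 3.10 recommended")
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        problems.append(f"{package} is not installed")
    else:
        if installed != RECOMMENDED_PADDLE:
            problems.append(
                f"{package} {installed} found, {RECOMMENDED_PADDLE} recommended"
            )
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

A mismatch does not necessarily mean inference will fail; it only means the setup differs from the one used for the reported results.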
HelixFold3 depends on PaddlePaddle. Python dependencies available through pip
are listed in requirements.txt. kalign, the HH-suite and jackhmmer are
also needed to produce multiple sequence alignments. The download scripts require aria2c.
Navigate to the helixfold3 directory, then run:
# install msa env
conda create -n msa_env -c conda-forge python=3.9
conda install -c bioconda aria2 hmmer==3.3.2 kalign2==2.04 hhsuite==3.3.0 -n msa_env -y
# install paddlepaddle and other requirements
conda create -n helixfold -c conda-forge python=3.10
conda activate helixfold
python3 -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
python3 -m pip install -r requirements.txt

Note: If you have a different version of python3 and cuda, please refer to here for the compatible PaddlePaddle dev package.
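After installing, it can help to confirm that the external tools are actually reachable. The sketch below looks up each executable with `shutil.which`; the helper name `find_missing_tools` is hypothetical, and the executable names are the ones provided by the hmmer, kalign2, hhsuite and aria2 packages installed above.

```python
import shutil

# Executables provided by the conda packages installed above.
REQUIRED_TOOLS = [
    "jackhmmer", "hmmsearch", "hmmbuild", "nhmmer",  # hmmer
    "hhblits", "hhsearch",                           # hhsuite
    "kalign",                                        # kalign2
    "aria2c",                                        # aria2
]


def find_missing_tools(tools=REQUIRED_TOOLS, path=None):
    """Return the tools that cannot be found.

    `path` is an optional PATH-style string, e.g. the bin directory of
    msa_env; when None, the current PATH is searched.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    print("missing tools:", ", ".join(missing) if missing else "none")
```

Because the MSA tools live in `msa_env`, either activate that environment first or pass its bin directory as `path`.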
In order to run HelixFold3, the genetic databases and model parameters are required.
The parameters of HelixFold3 can be downloaded here,
please place the downloaded checkpoint in ./init_models/ directory.
The script scripts/download_all_data.sh can be used to download and set up all genetic databases with the following configs:
- With reduced_dbs: running scripts/download_all_data.sh ./data reduced_dbs will download a reduced version of the databases to be used with the reduced_dbs preset. The total download size for the reduced databases is around 190 GB, and the total unzipped size is around 530 GB.
- With full_dbs: NOTE: Support for full_dbs is not available yet and will be introduced in a future update.
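Given the sizes quoted above, it may be worth checking free disk space before starting the download. This is a minimal sketch: the helper names are hypothetical, and the 530 GB threshold is only a lower bound, since this README does not say whether the compressed archives and the unzipped files coexist on disk during setup.

```python
import shutil


def free_space_gb(path):
    """Free space on the filesystem containing `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_reduced_dbs(path, required_gb=530.0):
    """True if `path` has at least `required_gb` GB free.

    530 GB is the unzipped size quoted above; treat it as a lower bound.
    """
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    target = "."  # replace with the intended download directory, e.g. ./data
    print(f"free: {free_space_gb(target):.0f} GB,",
          "enough" if has_room_for_reduced_dbs(target) else "not enough")
```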
Some demo inputs are provided under ./data/ for testing and reference. The input is a JSON file containing one or more entities of type protein, ligand, dna, rna, or ion. Protein and nucleic acid entities are specified by their sequences.
HelixFold3 accepts ligands as either a SMILES string or a CCD id; see ./data/demo_6zcy_smiles.json and demo_output/demo_6zcy_smiles/
for more details about SMILES input. More flexible input formats are coming soon.
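For generating many inputs programmatically, a small builder can be sketched as follows. It uses only the fields that appear in the examples in this README (`type`, `sequence`, `ccd`, `count`); the helper names are hypothetical, and the field used for SMILES ligands is deliberately not reproduced here, so check the SMILES demo file for it.

```python
import json

POLYMER_TYPES = {"protein", "dna", "rna"}


def polymer(kind, sequence, count=1):
    """Build a protein, dna, or rna entity from its sequence."""
    if kind not in POLYMER_TYPES:
        raise ValueError(f"unsupported polymer type: {kind!r}")
    return {"type": kind, "sequence": sequence, "count": count}


def ligand_from_ccd(ccd, count=1):
    """Build a ligand entity identified by its CCD id."""
    return {"type": "ligand", "ccd": ccd, "count": count}


def build_input(entities):
    """Wrap a sequence of entities in the top-level input object."""
    return {"entities": list(entities)}


if __name__ == "__main__":
    # Placeholder sequence for illustration only.
    data = build_input([polymer("protein", "MDTEVYESPY"), ligand_from_ccd("QF8")])
    print(json.dumps(data, indent=2))
```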
An example of input data is as follows:
{
"entities": [
{
"type": "protein",
"sequence": "MDTEVYESPYADPEEIRPKEVYLDRKLLTLEDKELGSGNFGTVKKGYYQMKKVVKTVAVKILKNEANDPALKDELLAEANVMQQLDNPYIVRMIGICEAESWMLVMEMAELGPLNKYLQQNRHVKDKNIIELVHQVSMGMKYLEESNFVHRDLAARNVLLVTQHYAKISDFGLSKALRADENYYKAQTHGKWPVKWYAPECINYYKFSSKSDVWSFGVLMWEAFSYGQKPYRGMKGSEVTAMLEKGERMGCPAGCPREMYDLMNLCWTYDVENRPGFAAVELRLRNYYYDVVNHHHHHH",
"count": 1
},
{
"type": "ligand",
"ccd": "QF8",
"count": 1
}
]
}

The modification field is an optional parameter that specifies modified residues in a polymer sequence (protein, DNA, or RNA). It includes the following attributes:
- index – The 1-based position of the residue to be modified.
- ccd – The Chemical Component Dictionary (CCD) code of the modified residue. (Currently, only modifications defined in the CCD database are supported.)
- type – The modification type. At present, only "residue_replace" is supported, but additional types will be introduced in future updates.
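The attribute rules above can be checked before submitting a job. The following is a minimal sketch, not part of HelixFold3: the function name `validate_modifications` is hypothetical, and it only checks the three documented attributes (it does not verify that the CCD code actually exists in the CCD database).

```python
SUPPORTED_MODIFICATION_TYPES = {"residue_replace"}


def validate_modifications(entity):
    """Return a list of problems in entity["modification"] (empty if none)."""
    problems = []
    sequence = entity.get("sequence", "")
    for pos, mod in enumerate(entity.get("modification", [])):
        if mod.get("type") not in SUPPORTED_MODIFICATION_TYPES:
            problems.append(f"modification {pos}: unsupported type {mod.get('type')!r}")
        index = mod.get("index")
        # index is 1-based, so valid values run from 1 to len(sequence).
        if not isinstance(index, int) or not 1 <= index <= len(sequence):
            problems.append(f"modification {pos}: index {index!r} is out of range")
        if not mod.get("ccd"):
            problems.append(f"modification {pos}: missing ccd code")
    return problems
```

An entity without a `modification` field yields an empty list, matching the fact that the field is optional.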
Here is an example modification input:
{
"entities": [
{
"type": "dna",
"sequence": "CCATTATAGC",
"count": 1,
"modification": [
{"type": "residue_replace", "ccd": "5CM", "index": 2},
{"type": "residue_replace", "ccd": "5CM", "index": 5}
]
},
{
"type": "dna",
"sequence": "GCTATAATGG",
"count": 1
}
]
}

To run inference on a sequence or multiple sequences using HelixFold3's pretrained parameters, run e.g.:
- Inference on single GPU (change the settings in script BEFORE you run it)
sh run_infer.sh
The script is as follows:
#!/bin/bash
PYTHON_BIN="PATH/TO/YOUR/PYTHON"
ENV_BIN="PATH/TO/YOUR/ENV"
DATA_DIR="PATH/TO/DATA"
CUDA_VISIBLE_DEVICES=0 "$PYTHON_BIN" inference.py \
--jackhmmer_binary_path "$ENV_BIN/jackhmmer" \
--hhblits_binary_path "$ENV_BIN/hhblits" \
--hhsearch_binary_path "$ENV_BIN/hhsearch" \
--kalign_binary_path "$ENV_BIN/kalign" \
--hmmsearch_binary_path "$ENV_BIN/hmmsearch" \
--hmmbuild_binary_path "$ENV_BIN/hmmbuild" \
--nhmmer_binary_path "$ENV_BIN/nhmmer" \
--preset='reduced_dbs' \
--reduced_bfd_database_path "$DATA_DIR/small_bfd/bfd-first_non_consensus_sequences.fasta" \
--uniprot_database_path "$DATA_DIR/uniprot/uniprot.fasta" \
--pdb_seqres_database_path "$DATA_DIR/pdb_seqres/pdb_seqres.txt" \
--uniref90_database_path "$DATA_DIR/uniref90/uniref90.fasta" \
--mgnify_database_path "$DATA_DIR/mgnify/mgy_clusters_2018_12.fa" \
--template_mmcif_dir "$DATA_DIR/pdb_mmcif/mmcif_files" \
--obsolete_pdbs_path "$DATA_DIR/pdb_mmcif/obsolete.dat" \
--ccd_preprocessed_path "$DATA_DIR/ccd_preprocessed_etkdg.pkl.gz" \
--rfam_database_path "$DATA_DIR/Rfam-14.9_rep_seq.fasta" \
--max_template_date=2021-09-30 \
--input_json data/demo_6zcy.json \
--output_dir ./output \
--model_name allatom_demo \
--init_model <PATH_TO_CHECKPOINTS_PDPARAMS> \
--infer_times 3 \
--precision "fp32"

The descriptions of the above script are as follows:
- Replace DATA_DIR with your downloaded data path.
- Replace ENV_BIN with the binary directory of your conda virtual environment, or of any environment where hhblits, hmmsearch and other dependencies have been installed (the script calls e.g. $ENV_BIN/jackhmmer).
- Replace PYTHON_BIN with your python binary where paddlepaddle-gpu has been installed.
- --preset - Set 'reduced_dbs' to use small bfd or 'full_dbs' to use full bfd.
- --*_database_path - Path to the datasets you have downloaded.
- --input_json - Input data in the form of JSON. See ./data/demo_*.json for reference input patterns.
- --output_dir - Model output path. The output will be in a folder named the same as your --input_json under this path.
- --model_name - Model name in ./helixfold/model/config.py. Different model names specify different configurations. Minor modifications to the configuration can be specified in CONFIG_DIFFS in config.py without changing the full configuration in CONFIG_ALLATOM.
- --infer_times - The number of inferences executed by the model for a single input. In each inference, the model infers 5 times (diff_batch_size) for the same input by default. This hyperparameter can be changed via model.head.diffusion_module.test_diff_batch_size within ./helixfold/model/config.py.
- --precision - Either bf16 or fp32. Please check whether your machine supports bf16 before changing it. For example, bf16 is supported by A100, H100, and newer GPUs, while V100 only supports fp32.
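Since the script passes many database paths, a missing file is an easy mistake to make. The sketch below checks the paths the script above expects under DATA_DIR before launching inference; the helper name `missing_data_paths` is hypothetical, and the relative paths are copied from the script, so adjust them if your layout differs.

```python
from pathlib import Path

# Paths relative to DATA_DIR, copied from the inference script above.
REDUCED_DB_PATHS = [
    "small_bfd/bfd-first_non_consensus_sequences.fasta",
    "uniprot/uniprot.fasta",
    "pdb_seqres/pdb_seqres.txt",
    "uniref90/uniref90.fasta",
    "mgnify/mgy_clusters_2018_12.fa",
    "pdb_mmcif/mmcif_files",
    "pdb_mmcif/obsolete.dat",
    "ccd_preprocessed_etkdg.pkl.gz",
    "Rfam-14.9_rep_seq.fasta",
]


def missing_data_paths(data_dir, relative_paths=REDUCED_DB_PATHS):
    """Return the relative paths that do not exist under `data_dir`."""
    root = Path(data_dir)
    return [rel for rel in relative_paths if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_data_paths("./data"):
        print("missing:", rel)
```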
The outputs will be in a subfolder of output_dir, including the computed MSAs, predicted structures,
ranked structures, and evaluation metrics. For example, when running two inferences with a diffusion batch size of 3
on an input JSON named demo_data.json, the output_dir directory will have the following structure:
<output_dir>/
└── demo_data/
├── demo_data-pred-1-1/
│ ├── all_results.json
│ └── predicted_structure.cif
├── demo_data-pred-1-2/
├── demo_data-pred-1-3/
├── demo_data-pred-2-1/
├── demo_data-pred-2-2/
├── demo_data-pred-2-3/
│
├── demo_data-rank[1-6]/
│ ├── all_results.json
│ └── predicted_structure.cif
│
└── msas/
├── ...
└── ...
The contents of each output file are as follows:
- msas/ - A directory containing the files describing the various genetic tool hits that were used to construct the input MSA.
- demo_data-pred-X-Y - Prediction results of demo_data.json in X-th inference and Y-th diffusion batch, including predicted structures in cif and a JSON file containing all metrics' results.
- demo_data-rank* - Ranked results of a series of predictions according to metrics.
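The naming scheme above implies that a run produces infer_times × diff_batch_size predictions and the same number of ranked folders. A minimal sketch that enumerates the expected folder names, assuming rank folders are numbered from 1 to the total number of predictions as the `rank[1-6]` entry in the tree suggests (the function name is hypothetical):

```python
from pathlib import Path


def expected_output_dirs(input_json, infer_times, diff_batch_size):
    """Return (prediction_dirs, rank_dirs) expected for one input JSON."""
    stem = Path(input_json).stem  # "demo_data.json" -> "demo_data"
    preds = [
        f"{stem}-pred-{x}-{y}"
        for x in range(1, infer_times + 1)       # X-th inference
        for y in range(1, diff_batch_size + 1)   # Y-th diffusion batch
    ]
    ranks = [f"{stem}-rank{r}" for r in range(1, len(preds) + 1)]
    return preds, ranks


if __name__ == "__main__":
    preds, ranks = expected_output_dirs("demo_data.json", 2, 3)
    print(len(preds))  # 6
    print(preds[0], preds[-1])  # demo_data-pred-1-1 demo_data-pred-2-3
```

Comparing this list against the actual contents of the output folder is a simple way to detect an interrupted run.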
We suggest that a single GPU used for inference have at least 32G of available memory. The maximum number of tokens is around
1200 for inference on a single A100-40G GPU with precision bf16. On a single V100-32G with precision fp32, inputs of
up to 1000 tokens are supported. Inferring longer inputs, or entities with more atoms per token than normal
protein residues (such as nucleic acids), may cost more GPU memory.
For samples with more tokens, you can reduce model.global_config.subbatch_size in CONFIG_DIFFS in helixfold/model/config.py to save GPU memory at the cost of slower inference. model.global_config.subbatch_size is set to 96 by default. You can also
reduce the number of additional recycles by changing model.num_recycle in the same place.
We are keen to support inference on longer token sequences; this is coming soon.
HelixFold3's code and model parameters are available under the LICENSE for non-commercial use by individuals or non-commercial organizations only. Please check the usage restrictions before using HelixFold3.
[1] Abramson, J et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500. 10.1038/s41586-024-07487-w
[2] Jumper J, Evans R, Pritzel A, et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596 (7873), 583–589. 10.1038/s41586-021-03819-2.
[3] Evans, R. et al. (2022). Protein complex prediction with AlphaFold-Multimer. Preprint at bioRxiv https://doi.org/10.1101/2021.10.04.463034
[4] Guoxia Wang, Xiaomin Fang, Zhihua Wu, Yiqun Liu, Yang Xue, Yingfei Xiang, Dianhai Yu, Fan Wang, and Yanjun Ma. Helixfold: An efficient implementation of alphafold2 using paddlepaddle. arXiv preprint arXiv:2207.05477, 2022
[5] Xiaomin Fang, Fan Wang, Lihang Liu, Jingzhou He, Dayong Lin, Yingfei Xiang, Kunrui Zhu, Xiaonan Zhang, Hua Wu, Hui Li, et al. A method for multiple-sequence-alignment-free protein structure prediction using a protein language model. Nature Machine Intelligence, 5(10):1087–1096, 2023
[6] Xiaomin Fang, Jie Gao, Jing Hu, Lihang Liu, Yang Xue, Xiaonan Zhang, and Kunrui Zhu. Helixfold-multimer: Elevating protein complex structure prediction to new heights. arXiv preprint arXiv:2404.10260, 2024.
[7] Lihang Liu, Donglong He, Xianbin Ye, Shanzhuo Zhang, Xiaonan Zhang, Jingbo Zhou, Jun Li, Hua Chai, Fan Wang, Jingzhou He, et al. Pre-training on large-scale generated docking conformations with helixdock to unlock the potential of protein-ligand structure prediction models. arXiv preprint arXiv:2310.13913, 2023.
If you use the code, data, or checkpoints in this repo, please cite the following:
@article{helixfold3,
title={Technical Report of HelixFold3 for Biomolecular Structure Prediction},
author={PaddleHelix Team},
journal = {arXiv},
doi = {10.48550/arXiv.2408.16975},
year={2024}
}



