Update 2025-1-4: We have updated the pretrained fusion model on Zenodo at https://zenodo.org/records/14599105 to resolve the issue with the incorrect module import path during inference.
Update 2024-10-29: The pretrained models of MMSite are available at https://zenodo.org/records/14004698.
You can manage the environment with Anaconda. We provide the environment configuration file environment.yml for reference. Create the environment with the following command:
conda env create -f environment.yml
Follow the instructions in dataprocess/README.md to prepare the data. That file describes how to split the data with a clustering threshold of 10%. You can change the threshold when you run the mmseqs2 command.
MMSite uses pre-trained PLM and BLM models to initialize the features. To reproduce the main results in our paper, download the pre-trained models from Hugging Face and put them all in the pretrained_weights folder.
Specify the configuration in config.yaml, including the paths of the pre-trained models and the data, the training parameters, etc.
Train the model with the following command (training takes about 7 hours on a single NVIDIA GeForce RTX 4090 GPU):
python train.py --config /path/to/config.yaml
After training, the final model is saved as best_model_fuse_xxx.pt in the runs/timestamp folder.
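If you have several training runs, a small helper can locate the most recent checkpoint for you. The following is a minimal sketch, not part of the MMSite code: the function name find_latest_checkpoint is hypothetical, and it assumes the checkpoint sits directly inside a run folder under runs/ and matches the pattern best_model_fuse_*.pt.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(runs_dir: str = "runs") -> Optional[Path]:
    """Return the most recently modified best_model_fuse_*.pt, or None.

    Assumption: checkpoints are stored as runs/<run folder>/best_model_fuse_*.pt.
    """
    # Look one level below runs_dir, i.e. inside each run folder.
    candidates = list(Path(runs_dir).glob("*/best_model_fuse_*.pt"))
    if not candidates:
        return None
    # Pick the file with the newest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Calling find_latest_checkpoint() from the repository root returns the path you can then use for inference, or None if no checkpoint has been written yet.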
Put your data in dataset/infer.tsv, following the format of dataset/infer_samples.tsv. Then specify the path of best_model_fuse_xxx.pt in inference.py. Additionally, generate the textual descriptions with Prot2Text and replace the corresponding entry in config/config.yaml with the path of the generated generated_desc.json. Finally, run the following command to get the prediction results:
python inference.py
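Inference depends on several files being in place, so it can help to check them before running the command above. The following is a minimal sketch under stated assumptions: the helper missing_inputs is hypothetical and not part of the repository, and it only checks that each file exists, not that its contents are in the expected format.

```python
from pathlib import Path
from typing import Iterable, List


def missing_inputs(paths: Iterable[str]) -> List[str]:
    """Return the subset of paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]
```

For example, pass it dataset/infer.tsv together with your own checkpoint and generated_desc.json paths. An empty list means all files were found.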