RepliDecPlus integrate tools for predict phage replication cycle.
Current support: RepliDec, PhaBOX/phaTYP, BACPHLIP, DeePhage.
RepliDecPlus has 3 steps:
-
Running individual tools.
- RepliDec used for complete genomes and metagenomic assemblies;
- PhaBOX/phaTYP and DeePhage for metagenomic assemblies;
- BACPHLIP for complete genomes;
-
Collect resultes and scores from these tools.
- After running each software, we used a custom script to calculate the replication cycle of each input sequences in the same bin in PhaBOX/phaTYP and DeePhage. Becasue they will treat each sequence as a seperate query, which will cause sequences from same bin have multiple replication cycle.
-
Use the an in-house scoring system to re-calculate the confidence for the final prediction.
- Following the evaluation results, we have formulated a comprehensive scoring system. This system is instrumental in assigning appropriate weights to the confidence levels associated with each result, thereby facilitating the derivation of a refined final prediction.
We prepare the environment use Conda/Docker. You can choose one of them to use it.
## linux
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
Other platform please follow this download url: https://docs.conda.io/projects/miniconda/en/latest/
PS: Because some software can run only on Linux, so we recommand use linux based system.
1.2 Clone RepliDecPlus Git repository and set up a Conda environment and all necssary dependent packages
git clone https://github.com/pengSherryYel/ReplidecPlus.git
cd ReplidecPlus
sh ./prepare_env.sh
prepare_env.sh not only prepare the environment but also install all the related packages.
docker pull pengsherry/replidec_plus
After success prepare the environment and packages. There will be five conda environment genererted. All five enviroment will startswith "RP".
- RP_base: main environment
- RP_bacphlip:environment for BACPHIP
- RP_deephage: environment for DeePhage
- RP_phabox: environment for PhaTYP/PhaBOX
- RP_replidec: environment for RepliDec
To check the installation, you can use following commands to check
## for conda
conda info -e
## for docker
docker run pengsherry/replidec_plus conda info -e
current support: RepliDec, PhaBOX/phaTYP, BACPHLIP, DeePhage.
If you use conda, you can use these commands to run Please check below input section to prepare the input data.
## for Conda
conda activate RP_base
python ReplidecPlus/ReplidecPlus.py -i input.txt -r -p -b -d -t 10
If you set your work directory using (-o), and the output folder is not empty OR you want to rerun one of the software.
You can use the force parameter to force run part of software.
-rf: force rerun replidec; -pf: force rerun phaTYP;
-bf: force rerun bacphlip; -df: force rerun deephage
## for Conda (Folder not empty OR rerun)
conda activate RP_base
python ReplidecPlus/ReplidecPlus.py -i input.txt -r -rf -p -pf -b -bf -d -df -t 10
If you use Docker, you can use the command below.
Please pay attention for "-v `pwd`:/data". This will mount your local folder to folder /data within Docker image. You can
change to any local folder, but please keep /data if you use utility/fasta2list_forDocker.sh, for example: -v your_folder_include_input_sequence:/data. Because
utility/fasta2list_forDocker.sh will help you prepare the input file (TEXT format), and the default path of input sequences are
under /data.
You can also mount everything by yourself, for more detail, please check https://docs.docker.com/engine/storage/volumes/.
And please DO NOT FORGET to put your input in your mounted local folder, so that the software can access them.
## for Docker
docker run -v `pwd`:/data pengsherry/replidec_plus conda run -n RP_base python ReplidecPlus/ReplidecPlus.py -i
/data/input.txt -o /data/ReplidecPlus -r -p -b -d -t 10
-
TEXT
To support the binning results. we use text file as input (
-i). This file is a two columns tab seperated file.- first column: sampleID which will used as identifier in the output file.
- second column: sequence path(Nucleic Acids Sequences).
### NC_001447.1 $path/NC_001447.1.fasta NC_023556.1 $path/NC_023556.1.fasta -
FASTA
RepliDecPlus can not direct use fasta file. We prepare a scirpt to transform fasta file into text format
## for Conda sh ReplidecPlus/utility/fasta2list.sh your_query_seq.fasta sequence.list split_dir ## for Docker (this will add /data prefix for each sequence) docker run -v `pwd`:/data pengsherry/replidec_plus python ReplidecPlus/utility/fasta2list_forDocker.sh /data/your_query_seq.fasta /data/sequence.list /data/split_dir
There will be four folders generate under the path set by -o, default is current workdir. Moreover two important files that combine all predictions from these tools.
FOLDER: store the results from each tools
- bacphlip
- deephage
- phabox
- replidec
File: main outputs
-
ReplidecPlus.summary.detail.txt
Merged results of prediction detail from each tools.
-
ReplidecPlus.summary.final.txt
Final prediction of merged weighted results from each tools.
Replication cycle prediction for phage. current support: replidec, bacphlip,
deephage, phabox.
Usage: python ReplidecPlus.py -i input.txt -r -p -b -d
options:
-h, --help show this help message and exit
--version show program's version number and exit
-i I input file, two cloumn. sample seqence_path. tab
sepearte.
-o O path to deposit output folder and temporary files,
will create if doesn't exist [default= working
directory]
-t T thread number used in each software
-r, --replidec run replidec
-rp REPLIDEC_PARA, --replidec_parameter REPLIDEC_PARA
define replidec parameter
-rf, --replidecF force rerun replidec
-d, --deephage run deephage
-df, --deephageF force rerun deephage
-b, --bacphlip run bacphlip
-bf, --bacphlipF force rerun bacphlip
-p, --phabox run phaTYP from PhaBOX
-pp PHABOX_PARA, --phabox_parameter PHABOX_PARA
define phabox parameter
-pf, --phaboxF force rerun phaTYP
cd example
## for conda
conda activate RP_base
sh ../utility/fasta2list.sh sequences.fasta sequence.list sequence_split
python ../ReplidecPlus.py -i sequence.list -o example_repliplus -t 4 -r -b -p -d
## for docker
sh ../utility/fasta2list_forDocker.sh sequences.fasta sequence.list sequence_split
docker run -v `pwd`:/data pengsherry/replidec_plus conda run -n RP_base python ReplidecPlus/ReplidecPlus.py -i /data/sequence.list -o /data/ReplidecPlus -t 4 -r -b -p -d
- the minimum length of input sequence is 3k bp. If the length is too short, it will significantly infulece the prediction accuracy.
- RepliDec Plus will take long time to predict very large dataset. if possible, you can seperate the input query sequences into small ones. Then run them parallel. This will save a lot of time.
