
pengSherryYel/Replidec


Replidec: Replication Cycle Decipher for Phages


Aim

Replidec combines a Bayesian classifier with a homology search to predict the replication cycle of a virus.

Install

Method 1: using Conda (recommended: install the latest version from bioconda)

conda create -n replidec
conda activate replidec
conda install -c conda-forge -c bioconda replidec
or
conda install -c denglab -c conda-forge -c bioconda replidec

Method 2: using Docker

docker pull quay.io/biocontainers/replidec:0.3.5--pyhdfd78af_0
docker run quay.io/biocontainers/replidec:0.3.5--pyhdfd78af_0 Replidec -h
## Example
docker run -v /your/host/data:/data/ quay.io/biocontainers/replidec:0.3.5--pyhdfd78af_0 Replidec -i data/your_inputfile -p choose_mode_based_on_your_input_type -w data

Method 3: using pip

If you install with pip, make sure that mmseqs, hmmsearch, and blastp are available on your $PATH. Their versions should be equal to or higher than those listed below.

  • MMseqs2 Version: 13.45111

  • HMMER 3.3.2 (Nov 2020)

  • Protein-Protein BLAST 2.5.0+

pip3 install Replidec
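
Before running a pip-based install, it can help to confirm that the three external tools are visible on $PATH. The following is a minimal sketch using only the Python standard library; the helper name `find_missing_tools` is hypothetical, and the script checks only for presence, not for the minimum versions listed above.

```python
import shutil

# Executable names taken from the requirement stated above.
REQUIRED_TOOLS = ("mmseqs", "hmmsearch", "blastp")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on $PATH:", ", ".join(missing))
    else:
        print("All required tools were found on $PATH.")
```

If a tool is reported as missing, install it or add its directory to $PATH before running Replidec.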

Usage: Overview

Replidec, Replication cycle prediction tool for prokaryotic viruses

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -p , --program        { multi_fasta | genome_table | protein_table }
                        
                        multi_fasta mode:
                        input is a fasta file and treat each sequence as one virus
                        
                        genome_table mode:
                        input is a tab separated file with two columns
                        ___1st column: sample name
                        ___2nd column: path to the genome sequence file of the virus
                        
                        protein_table mode:
                        input is a tab separated file with two columns
                        ___1st column: sample name
                        ___2nd column: path to the protein file of the virus
                        
  -i , --input_file     The input file, which can be a sequence file or an index table
  -w , --work_dir       Directory to store intermediate and final results (default = ./Replidec_results)
  -n , --file_name      Name of final summary file (default = prediction_summary.tsv)
  -t , --threads        Number of parallel threads (default = 10)
  -e , --hmmer_Eval     E-value threshold to filter hmmer result (default = 1e-5)
  -E , --hmmer_parameters 
                        Parameters used for hmmer (default = --noali --cpu 3)
  -m , --mmseq_Eval     E-value threshold to filter mmseqs2 result (default = 1e-5)
  -M , --mmseq_parameters 
                        Parameter used for mmseqs
                        (default = -s 7 --max-seqs 1 --alignment-mode 3 --alignment-output-mode 0 --min-aln-len 40 --cov-mode 0 --greedy-best-hits 1 --threads 3)
  -b , --blastp_Eval    E-value threshold to filter blast result (default =1e-5)
  -B , --blastp_parameter 
                        Parameters used for blastp (default = -num_threads 3)
  -d, --db_redownload   Remove and re-download database
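
When Replidec is called from a script, the options above can be assembled into an argument list. The sketch below is one possible way to do this and is not part of Replidec itself: the function name `build_replidec_command` and the `executable` parameter are hypothetical, and only flags shown in the help text (-p, -i, -w, -n, -t) are used, with the documented defaults.

```python
# The three program modes listed in the help text.
PROGRAMS = ("multi_fasta", "genome_table", "protein_table")


def build_replidec_command(program, input_file,
                           work_dir="./Replidec_results",
                           file_name="prediction_summary.tsv",
                           threads=10,
                           executable="Replidec"):
    """Build an argument list for a Replidec run (hypothetical helper)."""
    if program not in PROGRAMS:
        raise ValueError(f"program must be one of {PROGRAMS}, got {program!r}")
    return [
        executable,
        "-p", program,
        "-i", str(input_file),
        "-w", str(work_dir),
        "-n", file_name,
        "-t", str(threads),
    ]


if __name__ == "__main__":
    cmd = build_replidec_command("multi_fasta", "viral_contigs.fasta", threads=4)
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Validating the mode up front avoids launching a run with a mistyped -p value.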

Usage: Download database (-d)

Replidec downloads the database it needs automatically.

Location: the database is stored in the directory where Replidec is installed.

To re-download the database, use the -d parameter. The old database is moved to "discarded_db" in the work directory (-w); you can remove this directory manually.

Usage: Input (-i) and Program (-p)

The expected input file depends on the chosen program.

Replidec offers 3 different programs:

  1. 'multi_fasta'
  2. 'genome_table'
  3. 'protein_table'

multi_fasta mode:

  • Input is a FASTA file; each sequence is treated as one virus.
    • Example: <your_path>/viral_contigs.fasta

      >contig_1
      TATCGATCGATCGATCGATCGATCGTACGTACGTACGTACG...
      >contig_2
      CATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG...
      ...
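
Because each sequence in multi_fasta mode is treated as one virus, listing the record identifiers in the input shows how many predictions to expect. This is a small sketch, not Replidec code: the helper `fasta_record_ids` is hypothetical, and taking the first whitespace-delimited token of each header as the identifier is an assumption about how sample names are derived.

```python
def fasta_record_ids(lines):
    """Collect one identifier per FASTA header line (lines starting with '>')."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the ID is the first whitespace-delimited token.
            ids.append(header.split()[0] if header else "")
    return ids


if __name__ == "__main__":
    example = [
        ">contig_1\n",
        "TATCGATCGATCG\n",
        ">contig_2\n",
        "CATCGATCGATCG\n",
    ]
    ids = fasta_record_ids(example)
    print(len(ids), "sequences:", ids)
```

For a real file, pass an open file handle instead of a list, for example `fasta_record_ids(open("viral_contigs.fasta"))`.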
      

genome_table mode:

  • Input is a tab-separated file with two columns.

    • 1st column: sample name
    • 2nd column: path to the genome sequence file of the virus
    • Example: <your_path>/example_genomes.tsv
    contig_1    your/file/path/contig_1.fasta
    contig_2    your/file/path/contig_2.fasta
    contig_3    your/file/path/contig_3.fasta
    ...
    

protein_table mode:

  • Input is a tab-separated file with two columns.

    • 1st column: sample name
    • 2nd column: path to the protein file of the virus
    • Example: <your_path>/example_proteins.tsv
    contig_1_prot	your/file/path/contig_1.fasta
    contig_2_prot	your/file/path/contig_2.fasta
    contig_3_prot	your/file/path/contig_3.fasta
    ...
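
The two-column tables used by genome_table and protein_table mode can be generated from a directory of per-virus files. The sketch below makes several assumptions that the text above does not confirm: the file name without its extension is used as the sample name, absolute paths are written, no header row is added (the examples show none), and the list of accepted extensions is illustrative. The function names are hypothetical.

```python
from pathlib import Path


def build_index_rows(directory, suffixes=(".fasta", ".fa", ".fna", ".faa")):
    """Return (sample_name, absolute_path) pairs for sequence files in a directory."""
    rows = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix in suffixes:
            # Assumption: the file stem serves as the sample name.
            rows.append((path.stem, str(path.resolve())))
    return rows


def write_index_table(rows, out_path):
    """Write the rows as a tab-separated file with two columns and no header."""
    with open(out_path, "w", newline="") as handle:
        for name, path in rows:
            handle.write(f"{name}\t{path}\n")


if __name__ == "__main__":
    rows = build_index_rows(".")
    write_index_table(rows, "example_genomes.tsv")
    print(f"Wrote {len(rows)} rows to example_genomes.tsv")
```

The resulting file can then be passed with -i together with -p genome_table or -p protein_table, depending on whether the listed files hold genomes or proteins.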
    

Usage: Output (-w and -n)

The output directory is set with -w, --work_dir; the intermediate files and the final prediction results are stored there. The name of the final summary file is set with -n, --file_name.

At the end of the analysis, the output directory contains the following:

  • BC_Inno: This directory contains the result files for detecting inoviruses.
  • BC_mmseqs: This directory contains the results of mapping to our custom database.
  • BC_pfam: This directory contains the result files for detecting integrase and excisionase.
  • BC_prodigal: This directory contains the results of CDS prediction from genome or contig sequences. (If {-p protein_table} is used, this directory will not be created.)
  • prediction_summary.tsv: This file summarizes the prediction results. It contains the following columns.
    • sample_name: the sample identifier, either a sequence ID or the first column of the plain-text input table.

    • integrase_number: the number of genes mapped to integrase that meet the criteria (set by -c).

    • excisionase_number: the number of genes mapped to excisionase that meet the criteria (set by -c).

    • pfam_label: "Temperate" if the sample contains an integrase or excisionase; otherwise "Virulent".

    • bc_temperate: conditional probability of temperate|genes.

    • bc_virulent: conditional probability of virulent|genes.

    • bc_label: "Temperate" if bc_temperate is greater than bc_virulent; otherwise "Virulent".

    • final_label: "Temperate" if pfam_label and bc_label are both "Temperate"; "Chronic" if an inovirus marker gene exists; otherwise "Virulent".

    • match_gene_number: the number of genes mapped to our custom database.

    • path: path to the input protein (.faa) file.
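
The labelling rules described in the column list can be written out as three small functions. This is an illustrative sketch, not Replidec's actual implementation. The function names and the `has_inovirus_marker` argument are hypothetical (the summary columns above do not include an inovirus field), and the description does not say whether "Temperate" or "Chronic" takes precedence when both conditions hold; the sketch checks them in the order the description lists them, which is an assumption.

```python
def pfam_label(integrase_number, excisionase_number):
    """'Temperate' if any integrase or excisionase hit was counted, else 'Virulent'."""
    if integrase_number > 0 or excisionase_number > 0:
        return "Temperate"
    return "Virulent"


def bc_label(bc_temperate, bc_virulent):
    """'Temperate' only if bc_temperate is strictly greater than bc_virulent."""
    return "Temperate" if bc_temperate > bc_virulent else "Virulent"


def final_label(pfam, bc, has_inovirus_marker):
    """Combine both labels; the order of the checks is an assumption (see lead-in)."""
    if pfam == "Temperate" and bc == "Temperate":
        return "Temperate"
    if has_inovirus_marker:
        return "Chronic"
    return "Virulent"


if __name__ == "__main__":
    pfam = pfam_label(integrase_number=1, excisionase_number=0)
    bc = bc_label(bc_temperate=0.8, bc_virulent=0.2)
    print(final_label(pfam, bc, has_inovirus_marker=False))
```

Note that under these rules a sample with a "Temperate" pfam_label but a "Virulent" bc_label ends up "Virulent" unless an inovirus marker gene is present.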

Example (the data are in the test folder; navigate to the test folder first)

cd test

## Conda
## test passed - genome_table
replidec -p genome_table -i example/genome_test.small.index -w opt_folder_genome_table

## test passed - multi_fasta
replidec -p multi_fasta -i example/test.contig.small.fa -w opt_folder_multi_fasta

## test passed - protein_table
replidec -p protein_table -i example/example.small.list -w opt_folder_protein_table


## Docker
docker run -v /Your_path_clone_replidec/Replidec/test:/data/ quay.io/biocontainers/replidec:0.3.5--pyhdfd78af_0 Replidec -p multi_fasta -i /data/example/test.contig.small.new.fa -w /data/opt_folder_docker_multi_fasta
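
After one of the runs above finishes, the summary file can be tallied by final_label. The sketch below assumes that prediction_summary.tsv starts with a header row using the column names listed in the Output section; that is not confirmed here, so check the file before relying on it. The function name `count_final_labels` is hypothetical.

```python
import csv
from collections import Counter


def count_final_labels(handle):
    """Count the values of the final_label column in a tab-separated summary."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["final_label"] for row in reader)


if __name__ == "__main__":
    # Path built from the -w value of the multi_fasta example and the default -n.
    summary_path = "opt_folder_multi_fasta/prediction_summary.tsv"
    with open(summary_path, newline="") as handle:
        for label, count in count_final_labels(handle).items():
            print(label, count)
```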

Issues

Database cannot be downloaded automatically

If the database cannot be downloaded automatically from Zenodo due to regional access restrictions, you can add it manually instead. The same database has also been uploaded to OSF as an alternative source.

  1. Locate your Replidec installation path
    After installing Replidec via Conda or Docker, locate the installed directory. Typically, it can be found at: your_conda_path/envs/env_name/lib/python*/site-packages/Replidec

  2. Navigate to the Replidec folder
    Use the terminal to move into the directory:
    cd your_conda_path/envs/env_name/lib/python*/site-packages/Replidec

  3. Download the database manually from OSF (Project name: Replidec)
    Access the alternative download link here: 👉 https://osf.io/thpkb/files/osfstorage

  4. Extract the database
    After downloading, extract the contents of the archive into the Replidec directory; this creates a folder named "db":
    tar -zxvf db_v0.3.2.tar.gz
    ✅ Note: Make sure the extracted folder can be found at this path: your_conda_path/envs/env_name/lib/python*/site-packages/Replidec/db.
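
Steps 1 and 4 can also be checked from Python instead of searching the file system by hand. The sketch below assumes the package is importable under the name `Replidec`, which is inferred from the site-packages path above and not otherwise confirmed; the helper names are hypothetical. Run it with the Python interpreter of the environment where Replidec is installed.

```python
import importlib.util
from pathlib import Path


def package_dir(package_name):
    """Return the directory of an installed package, or None if it is not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def db_status(package_name="Replidec"):
    """Return (expected db path, whether that directory exists)."""
    base = package_dir(package_name)
    if base is None:
        return None, False
    db_path = base / "db"
    return db_path, db_path.is_dir()


if __name__ == "__main__":
    db_path, present = db_status()
    if db_path is None:
        print("Replidec was not found in this Python environment.")
    else:
        print(db_path, "exists" if present else "is missing")
```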

Everything should now be set up. Enjoy using Replidec!
