R. irregularis phased assembly
Here we describe how to obtain a phased assembly of Rhizophagus irregularis with colora.
We are going to use the data from BioProject PRJNA922099, strain G1. Here is the related paper, and here are the SRA accessions.
To download the data, we will use SRA Tools (`sra-tools`), which can be installed with conda/mamba from the bioconda channel (e.g. `mamba create -n sra-tools -c bioconda sra-tools`).
mkdir colora_paper
cd colora_paper/
mkdir rhizo
cd rhizo
# hifi reads
mkdir hifi
cd hifi
screen -S rhizo
mamba activate sra-tools
prefetch SRR23080413 --max-size 50GB
fasterq-dump SRR23080413
# hic reads
cd ..
mkdir hic
cd hic
prefetch SRR23080414 --max-size 50GB
fasterq-dump SRR23080414
cd ~/colora_paper/rhizo
git clone https://github.com/LiaOb21/colora.git
cd ~/colora_paper/rhizo/colora
mkdir resources
cd ~/colora_paper/rhizo/hifi
gzip SRR23080413.fastq
cd ~/colora_paper/rhizo/hic
gzip SRR23080414_1.fastq
gzip SRR23080414_2.fastq
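fasterq-dump writes uncompressed FASTQ, which is why we gzip the files above. Before symlinking them, an optional sanity check can confirm the archives are intact:

```shell
# Optional sanity check: gzip -t exits non-zero if an archive is corrupted.
gzip -t ~/colora_paper/rhizo/hifi/SRR23080413.fastq.gz && echo "hifi reads OK"
gzip -t ~/colora_paper/rhizo/hic/SRR23080414_1.fastq.gz \
        ~/colora_paper/rhizo/hic/SRR23080414_2.fastq.gz && echo "hic reads OK"
```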
cd ~/colora_paper/rhizo/colora/resources
mkdir raw_hifi
cd raw_hifi
ln -s ~/colora_paper/rhizo/hifi/SRR23080413.fastq.gz
cd ~/colora_paper/rhizo/colora/resources
mkdir raw_hic
cd raw_hic
ln -s ~/colora_paper/rhizo/hic/SRR23080414_1.fastq.gz
ln -s ~/colora_paper/rhizo/hic/SRR23080414_2.fastq.gz
cd ~/software
git clone https://github.com/c-zhou/OatkDB.git
cd ~/colora_paper/rhizo/colora/resources
mkdir oatkDB
cd oatkDB
ln -s ~/software/OatkDB/v20230921/dikarya_mito.fam
ln -s ~/software/OatkDB/v20230921/dikarya_mito.fam.h3f
ln -s ~/software/OatkDB/v20230921/dikarya_mito.fam.h3i
ln -s ~/software/OatkDB/v20230921/dikarya_mito.fam.h3m
ln -s ~/software/OatkDB/v20230921/dikarya_mito.fam.h3p
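The five ln -s commands above can also be written as a single loop over the profile files (same effect; the path assumes OatkDB was cloned into ~/software as above):

```shell
# Link every dikarya_mito.fam* profile file from the cloned OatkDB in one go.
db_dir=~/software/OatkDB/v20230921
for f in "$db_dir"/dikarya_mito.fam*; do
    ln -s "$f" .
done
```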
We will then set the following in our config.yaml (N.B. according to the data provided in the paper, the coverage should be around 161×; we should therefore set c to ~161 × 5 = 805):
oatk:
  k: 1001 # kmer size [1001]
  c: 805 # minimum kmer coverage [3]
  m: "resources/oatkDB/dikarya_mito.fam" # mitochondria gene annotation HMM profile database [NULL]
  optional_params:
    "-p": "" # to use for species that have a plastid db
From the paper:
"PacBio sequencing produced an average of 24,647,320,191 bp of HiFi reads with a read median quality of Q33 and median read length of 13,138."
The coverage is calculated as 24,647,320,191 bp / 153,000,000 bp (the estimated genome size) ≈ 161×.
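The arithmetic behind the c value used for oatk above can be checked quickly (integer division is close enough here):

```shell
# HiFi yield from the paper divided by the ~153 Mb genome size gives the coverage;
# oatk's minimum kmer coverage c is then set to roughly 5x that value.
coverage=$((24647320191 / 153000000))
echo "coverage: ${coverage}x"        # prints: coverage: 161x
echo "oatk c: $((coverage * 5))"     # prints: oatk c: 805
```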
For R. irregularis, the closest BUSCO lineage database is Mucoromycota.
Go to https://busco-data.ezlab.org/v5/data/lineages/ and download the database of interest.
cd ~/colora_paper/rhizo/colora/resources
mkdir busco_db
cd busco_db
wget https://busco-data.ezlab.org/v5/data/lineages/mucoromycota_odb10.2024-01-08.tar.gz
tar -xzf mucoromycota_odb10.2024-01-08.tar.gz
We will then set the following in our config.yaml:
busco:
  lineage: "resources/busco_db/mucoromycota_odb10" # lineage to be used for busco analysis
The fcs-gx database (gxdb) needs around 500 GB of disk space, so download it once and symlink it when needed. If you don't have enough space, you can skip the decontamination step and avoid downloading the database.
Here is how you can easily obtain the database:
mamba create -n ncbi_fcsgx ncbi-fcs-gx
mamba activate ncbi_fcsgx
cd colora/resources
mkdir gxdb
cd gxdb
sync_files.py get --mft https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/latest/all.manifest --dir .
Another thing we need for fcs-gx is the NCBI taxonomy ID of our species. For R. irregularis the NCBI tax ID is 588596.
We will then set the following in our config.yaml:
fcsgx:
  ncbi_tax_id: 588596
  path_to_gx_db: "resources/gxdb"
The reference genome input is optional. We are not going to use it in this example.
This is how our config.yaml should look at the end (see colora/config/README.md for further details):
# config.yaml for real data

# Set memory and threads for high demanding rules
high:
  mem_mb: 204800 # memory in MB
  t: 50 # number of threads

# Set memory and threads for medium demanding rules
medium:
  mem_mb: 20480 # memory in MB
  t: 20 # number of threads

# Set memory and threads for low demanding rules
low:
  mem_mb: 10240 # memory in MB
  t: 8 # number of threads

# Path to hifi reads
hifi_path: "resources/raw_hifi/"

# Path to hic reads
hic_path: "resources/raw_hic/"

# Customisable parameters for kmc
kmc:
  k: 19 # kmer size, it will be the same used for genomescope2
  ci: 1 # exclude k-mers occurring less than <value> times (default: 2)
  cs: 100000 # maximal value of a counter (default: 255)

# Customisable parameters for kmc_tools transform
kmc_tools:
  cx: 100000 # exclude k-mers occurring more than <value> times

# Customisable parameters for genomescope2
genomescope2:
  optional_params:
    "-p": "1"
    "-l": ""

# Customisable parameters for oatk
oatk:
  k: 1001 # kmer size [1001]
  c: 805 # minimum kmer coverage [3]
  m: "resources/oatkDB/dikarya_mito.fam" # mitochondria gene annotation HMM profile database [NULL]
  optional_params:
    "-p": "" # to use for species that have a plastid db

# Customisable parameters for fastp
fastp:
  optional_params:
    "--cut_front": False # to use only with Arima Hi-C library prep kit generated data
    "--cut_front_window_size": "" # to use only with Arima Hi-C library prep kit generated data

# Customisable parameters for hifiasm
hifiasm:
  phased_assembly: True # set to True if you want to obtain a phased assembly
  optional_params:
    "--hom-cov": "181"
    "-f": "" # used for small datasets
    "-l": "" # purge level. 0: no purging; 1: light; 2/3: aggressive [0 for trio; 3 for unzip]
    "--ul": "" # use this if you also have ONT data you want to integrate in your assembly

# Set this to False if you want to skip the fcsgx step:
include_fcsgx: True # include this rule only if you have previously downloaded the database (recommended to run fcsgx only on an HPC: it requires around 500 GB of space on your disk and a large amount of RAM)

# Customisable parameters for fcsgx
fcsgx:
  ncbi_tax_id: 588596
  path_to_gx_db: "resources/gxdb"

# Set this to False if you want to skip purge_dups steps:
include_purge_dups: False

# Customisable parameters for arima mapping pipeline:
arima:
  MAPQ_FILTER: 10

# Customisable parameters for yahs
yahs:
  optional_params:
    "-e": "" # you can specify the restriction enzyme(s) used by the Hi-C experiment

# Customisable parameters for quast
quast:
  optional_params:
    "--fragmented": ""
    "--large": ""
    # "-r": "resources/reference_genomes/yeast/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa"
    # "-g": "resources/reference_genomes/yeast/Saccharomyces_cerevisiae.R64-1-1.101.gff3"

# Customisable parameters for busco
busco:
  lineage: "resources/busco_db/mucoromycota_odb10" # lineage to be used for busco analysis
At this point, we should be able to run the pipeline with a single simple command:
cd ~/colora_paper/rhizo/colora/
screen -r rhizo # if you want to run it in the background
mamba activate snakemake
snakemake --software-deployment-method conda
N.B. if you are using a server where jobs are normally submitted through SLURM or other schedulers, you might consider setting up a snakemake profile in your system to handle job submission.
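As a sketch of what such a profile could look like, a file like ~/.config/snakemake/slurm/config.yaml might contain the following (hypothetical values using pre-v8 Snakemake syntax; Snakemake ≥8 uses the slurm executor plugin instead, so adapt this to your Snakemake version and cluster):

```yaml
# Hypothetical SLURM profile sketch (~/.config/snakemake/slurm/config.yaml).
# Adjust memory/CPU mapping and the job cap to your cluster's limits.
cluster: "sbatch --mem={resources.mem_mb} --cpus-per-task={threads}"
jobs: 50          # maximum number of jobs submitted at once
use-conda: true   # older equivalent of --software-deployment-method conda
```

With such a profile in place, the pipeline would be launched with something like `snakemake --profile slurm`.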