-
Notifications
You must be signed in to change notification settings - Fork 3
A. thaliana primary assembly
Here we will describe how to obtain a primary assembly of Arabidopsis thaliana with colora.
We are going to use the data from the BioProject PRJCA005809. Here is the relative paper, and here the repository of the project from where we can download the raw reads.
cd ~/colora_paper/ara_thal
git clone https://github.com/LiaOb21/colora.git
cd ~/colora_paper/ara_thal/colora
mkdir resources
cd resources
We can download the data of Arabidopsis using the wget
command.
Hi-C data:
mkdir raw_hic
cd raw_hic
wget https://download.cncb.ac.cn/gsa/CRA004538/CRR302669/CRR302669_f1.fastq.gz
wget https://download.cncb.ac.cn/gsa/CRA004538/CRR302669/CRR302669_r2.fastq.gz
N.B.: In this case we have to modify the names of the files to match the requirements of Colora (standard names for Hi-C reads are *_1.fastq.gz
and *_2.fastq.gz
):
mv CRR302669_f1.fastq.gz CRR302669_1.fastq.gz
mv CRR302669_r2.fastq.gz CRR302669_2.fastq.gz
Hifi data:
mkdir raw_hifi
cd raw_hifi
wget https://download.cncb.ac.cn/gsa/CRA004538/CRR302668/CRR302668.fastq.gz
ONT data:
mkdir raw_ont
cd raw_ont
wget https://download.cncb.ac.cn/gsa/CRA004538/CRR302667/CRR302667.fastq.gz
This step is not automatically performed by colora. Therefore, we suggest to quality inspect ONT reads and to trim them if needed before starting the workflow, if you want to use them.
We quality inspected the ONT reads with NanoPlot
using this command:
NanoPlot -t 8 --fastq CRR302667.fastq.gz --loglength -o nanoplot_reports --plots dot
And decided to trim read shorter than 500 bp using NanoFilt:
gunzip -c CRR302667.fastq.gz | NanoFilt -l 500 | gzip > CRR302667_trimmed.fastq.gz
cd ~/software
git clone https://github.com/c-zhou/OatkDB.git
cd ~/colora_paper/rhizo/colora/resources
mkdir oatkDB
cd oatkDB
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam.h3f
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam.h3i
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam.h3m
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam.h3p
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam.h3f
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam.h3i
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam.h3m
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam.h3p
For A. thaliana the closest busco database is Brassicales.
Go to https://busco-data.ezlab.org/v5/data/lineages/ and download the database of interest.
cd ~/colora_paper/ara_thal/colora/resources
mkdir busco_db
cd busco_db
wget https://busco-data.ezlab.org/v5/data/lineages/brassicales_odb10.2024-01-08.tar.gz
tar -xzf brassicales_odb10.2024-01-08.tar.gz
This database needs 500GB of memory, so just download it once and symlink it when needed. If you don't have enough memory, you can skip the decontamination step and avoid the download of the database.
Here is how you can easily obtain the database:
mamba create -n ncbi_fcsgx ncbi-fcs-gx
mamba activate ncbi_fcsgx
cd colora/resources
mkdir gxdb
cd gxdb
sync_files.py get --mft https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/latest/all.manifest --dir ./gxdb
Another thing we need for fcs-gx is the tax id of our species. For A. thaliana the ncbi tax id is 3702.
We will then set in our config.yaml
:
fcsgx:
ncbi_tax_id: 3702
path_to_gx_db: "resources/gxdb"
This input is optional, you need this only if you want to compare your assembly with the reference fot your species. Link to the reference genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001735.4/
cd ~/colora_paper/ara_thal/colora/resources
mkdir reference
cd reference
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/735/GCA_000001735.2_TAIR10.1/GCA_000001735.2_TAIR10.1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/735/GCA_000001735.2_TAIR10.1/GCA_000001735.2_TAIR10.1_genomic.gff.gz
This is how our config.yaml
should look like at the end (see colora/config/README.md for further details):
# config.yaml for real data
# Set memory and threads for high demanding rules
high:
mem_mb: 153600 # memory in MB
t: 50 # number of threads
# Set memory and threads for medium demanding rules
medium:
mem_mb: 20480 # memory in MB
t: 20 # number of threads
# Set memory and threads for low demanding rules
low:
mem_mb: 10240 # memory in MB
t: 8 # number of threads
# Path to hifi reads
hifi_path: "resources/raw_hifi/"
# Path to hic reads
hic_path: "resources/raw_hic/"
# Customisable parameters for kmc
kmc:
k: 21 # kmer size, it will be the same used for genomescope2
ci: 1 # exclude k-mers occurring less than <value> times (default: 2)
cs: 1000000 #maximal value of a counter (default: 255)
# Customisable parameters for kmc_tools transform
kmc_tools:
cx: 1000000 # exclude k-mers occurring more of than <value> times
# Customisable parameters for genomescope2
genomescope2:
optional_params:
"-p": "1"
"-l": ""
# Customisable parameters for oatk
oatk:
k: 1001 # kmer size [1001]
c: 150 # minimum kmer coverage [3]
m: "resources/oatkDB/embryophyta_mito.fam" # mitochondria gene annotation HMM profile database [NULL]
optional_params:
"-p": "resources/oatkDB/embryophyta_pltd.fam" # to use for species that have a plastid db
# Customisable parameters for fastp
fastp:
optional_params:
"--cut_front": False # to use only with Arima Hi-C library prep kit generated data
"--cut_front_window_size": "" # to use only with Arima Hi-C library prep kit generated data
# Customisable parameters for hifiasm
hifiasm:
phased_assembly: False # set to true if you want to obtain a phased assembly
optional_params:
"-f": "" # used for small datasets
"-l": "" # purge level. 0: no purging; 1: light; 2/3: aggressive [0 for trio; 3 for unzip]
"--ul": "/home/edg01/edg01/lia/colora_paper/ara_thal/colora/resources/raw_ont/CRR302667_trimmed.fastq.gz" # use this if you have also ont data you want to integrate in your assembly
#Set this to False if you want to skip the fcsgx step:
include_fcsgx: True #inlcude this rule only if you have previously downloaded the database (recommended to run fcsgx only on a HPC. It requires around 500 GB of space on your disk and a large RAM)
# Customisable parameters for fcsgx
fcsgx:
ncbi_tax_id: 3702
path_to_gx_db: "resources/gxdb"
# Set this to False if you want to skip purge_dups steps:
include_purge_dups: True
# Customisable parameters for arima mapping pipeline:
arima:
MAPQ_FILTER: 10
# Customisable parameters for yahs
yahs:
optional_params:
"-e": "" # you can specify the restriction enzyme(s) used by the Hi-C experiment
# Customisable parameters for quast
quast:
optional_params:
"--fragmented": ""
"--large": ""
"-r": "resources/reference/GCA_000001735.2_TAIR10.1_genomic.fna"
"-g": "resources/reference/GCA_000001735.2_TAIR10.1_genomic.gff"
# Customisable parameters for busco
busco:
lineage: "resources/busco_db/brassicales_odb10" # lineage to be used for busco analysis
At this point we should be able to run the pipeline with a really simple command:
cd ~/colora_paper/ara_thal/colora/
screen -r ara_thal # if you want to run it in the background
mamba activate snakemake
snakemake --software-deployment-method conda
N.B. if you are using a server where jobs are normally submitted through SLURM or other schedulers, you might consider setting up a snakemake profile in your system to handle job submission.