Skip to content

A. thaliana primary assembly

Lia Obinu edited this page Sep 10, 2024 · 1 revision

🐍 Arabidopsis thaliana example 🌱

Here we will describe how to obtain a primary assembly of Arabidopsis thaliana with colora.

We are going to use the data from the BioProject PRJCA005809. Here is the relative paper, and here the repository of the project from where we can download the raw reads.

1. Clone colora repo

cd ~/colora_paper/ara_thal
git clone https://github.com/LiaOb21/colora.git

2. Create the resource directory

cd ~/colora_paper/ara_thal/colora
mkdir resources 
cd resources

3. Download the data

We can download the data of Arabidopsis using the wget command.

Hi-C data:

mkdir raw_hic
cd raw_hic

wget https://download.cncb.ac.cn/gsa/CRA004538/CRR302669/CRR302669_f1.fastq.gz
wget https://download.cncb.ac.cn/gsa/CRA004538/CRR302669/CRR302669_r2.fastq.gz

N.B.: In this case we have to modify the names of the files to match the requirements of Colora (standard names for Hi-C reads are *_1.fastq.gz and *_2.fastq.gz):

mv CRR302669_f1.fastq.gz CRR302669_1.fastq.gz
mv CRR302669_r2.fastq.gz CRR302669_2.fastq.gz

Hifi data:

mkdir raw_hifi
cd raw_hifi

wget https://download.cncb.ac.cn/gsa/CRA004538/CRR302668/CRR302668.fastq.gz

ONT data:

mkdir raw_ont
cd raw_ont

wget https://download.cncb.ac.cn/gsa/CRA004538/CRR302667/CRR302667.fastq.gz

4. Quality inspection of ONT data

This step is not automatically performed by colora. Therefore, we suggest to quality inspect ONT reads and to trim them if needed before starting the workflow, if you want to use them.

We quality inspected the ONT reads with NanoPlot using this command:

NanoPlot -t 8 --fastq CRR302667.fastq.gz --loglength -o nanoplot_reports --plots dot  

And decided to trim read shorter than 500 bp using NanoFilt:

gunzip -c CRR302667.fastq.gz | NanoFilt -l 500 | gzip > CRR302667_trimmed.fastq.gz 

5. Obtain Oatk database

cd ~/software
git clone https://github.com/c-zhou/OatkDB.git

cd ~/colora_paper/rhizo/colora/resources
mkdir oatkDB
cd oatkDB
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam.h3f
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam.h3i
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam.h3m
ln -s ~/software/OatkDB/v20230921/embryophyta_mito.fam.h3p
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam.h3f
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam.h3i
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam.h3m
ln -s ~/software/OatkDB/v20230921/embryophyta_pltd.fam.h3p

6. Obtain busco database

For A. thaliana the closest busco database is Brassicales.

Go to https://busco-data.ezlab.org/v5/data/lineages/ and download the database of interest.

cd ~/colora_paper/ara_thal/colora/resources
mkdir busco_db
cd busco_db
wget https://busco-data.ezlab.org/v5/data/lineages/brassicales_odb10.2024-01-08.tar.gz
tar -xzf brassicales_odb10.2024-01-08.tar.gz

7. Obtain ncbi FCS database

This database needs 500GB of memory, so just download it once and symlink it when needed. If you don't have enough memory, you can skip the decontamination step and avoid the download of the database.

Here is how you can easily obtain the database:

mamba create -n ncbi_fcsgx ncbi-fcs-gx
mamba activate ncbi_fcsgx
cd colora/resources
mkdir gxdb
cd gxdb
sync_files.py get --mft https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/latest/all.manifest --dir ./gxdb

Another thing we need for fcs-gx is the tax id of our species. For A. thaliana the ncbi tax id is 3702.

We will then set in our config.yaml:

fcsgx:
  ncbi_tax_id: 3702
  path_to_gx_db: "resources/gxdb"

8. Download the reference genome

This input is optional, you need this only if you want to compare your assembly with the reference fot your species. Link to the reference genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001735.4/

cd ~/colora_paper/ara_thal/colora/resources
mkdir reference
cd reference
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/735/GCA_000001735.2_TAIR10.1/GCA_000001735.2_TAIR10.1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/735/GCA_000001735.2_TAIR10.1/GCA_000001735.2_TAIR10.1_genomic.gff.gz

9. Setting up our config.yaml

This is how our config.yaml should look like at the end (see colora/config/README.md for further details):

# config.yaml for real data

# Set memory and threads for high demanding rules
high:
  mem_mb: 153600 # memory in MB
  t: 50 # number of threads

# Set memory and threads for medium demanding rules
medium:
  mem_mb: 20480 # memory in MB
  t: 20 # number of threads

# Set memory and threads for low demanding rules
low:
  mem_mb: 10240 # memory in MB
  t: 8 # number of threads

# Path to hifi reads
hifi_path: "resources/raw_hifi/"

# Path to hic reads
hic_path: "resources/raw_hic/"

# Customisable parameters for kmc
kmc:
  k: 21 # kmer size, it will be the same used for genomescope2
  ci: 1 # exclude k-mers occurring less than <value> times (default: 2)
  cs: 1000000 #maximal value of a counter (default: 255)

# Customisable parameters for kmc_tools transform
kmc_tools:
  cx: 1000000 # exclude k-mers occurring more of than <value> times

# Customisable parameters for genomescope2
genomescope2:
  optional_params:
    "-p": "1"
    "-l": ""

# Customisable parameters for oatk
oatk:
  k: 1001 # kmer size [1001]
  c: 150 #  minimum kmer coverage [3]
  m: "resources/oatkDB/embryophyta_mito.fam" # mitochondria gene annotation HMM profile database [NULL]
  optional_params: 
    "-p": "resources/oatkDB/embryophyta_pltd.fam" # to use for species that have a plastid db

# Customisable parameters for fastp
fastp:
  optional_params:
    "--cut_front": False # to use only with Arima Hi-C library prep kit generated data
    "--cut_front_window_size": "" # to use only with Arima Hi-C library prep kit generated data

# Customisable parameters for hifiasm
hifiasm:
  phased_assembly: False # set to true if you want to obtain a phased assembly
  optional_params: 
    "-f": "" # used for small datasets
    "-l": "" # purge level. 0: no purging; 1: light; 2/3: aggressive [0 for trio; 3 for unzip]
    "--ul": "/home/edg01/edg01/lia/colora_paper/ara_thal/colora/resources/raw_ont/CRR302667_trimmed.fastq.gz" # use this if you have also ont data you want to integrate in your assembly

#Set this to False if you want to skip the fcsgx step:
include_fcsgx: True #inlcude this rule only if you have previously downloaded the database (recommended to run fcsgx only on a HPC. It requires around 500 GB of space on your disk and a large RAM)

# Customisable parameters for fcsgx
fcsgx:
  ncbi_tax_id: 3702
  path_to_gx_db: "resources/gxdb"

# Set this to False if you want to skip purge_dups steps:
include_purge_dups: True

# Customisable parameters for arima mapping pipeline:
arima:
  MAPQ_FILTER: 10

# Customisable parameters for yahs
yahs:
  optional_params: 
    "-e": "" # you can specify the restriction enzyme(s) used by the Hi-C experiment

# Customisable parameters for quast
quast:
  optional_params: 
    "--fragmented": ""
    "--large": ""
    "-r": "resources/reference/GCA_000001735.2_TAIR10.1_genomic.fna"
    "-g": "resources/reference/GCA_000001735.2_TAIR10.1_genomic.gff"

# Customisable parameters for busco
busco:
  lineage: "resources/busco_db/brassicales_odb10" # lineage to be used for busco analysis

10. Run colora

At this point we should be able to run the pipeline with a really simple command:

cd ~/colora_paper/ara_thal/colora/
screen -r ara_thal # if you want to run it in the background
mamba activate snakemake

snakemake --software-deployment-method conda

N.B. if you are using a server where jobs are normally submitted through SLURM or other schedulers, you might consider setting up a snakemake profile in your system to handle job submission.