R. irregularis phased assembly
Here we describe how to obtain a phased assembly of Rhizophagus irregularis with colora.
We are going to use the data from BioProject PRJNA922099, strain G1. Here is the related paper, and here are the SRA accessions.
To download the data, we will use SRA Tools (`sra-tools`), which can be installed with conda/mamba from the bioconda channel (e.g. `mamba create -n sra-tools -c bioconda sra-tools`).
mkdir colora_paper
cd colora_paper/
mkdir rhizo
cd rhizo
# hifi reads
mkdir hifi
cd hifi
screen -S rhizo
mamba activate sra-tools
prefetch SRR23080413 --max-size 50GB
fasterq-dump SRR23080413
# hic reads
cd ..
mkdir hic
cd hic
prefetch SRR23080414 --max-size 50GB
fasterq-dump SRR23080414
cd ~/colora_paper/rhizo
git clone https://github.com/LiaOb21/colora.git
cd ~/colora_paper/rhizo/colora
mkdir resources
cd ~/colora_paper/rhizo/hifi
gzip SRR23080413.fastq
cd ~/colora_paper/rhizo/hic
gzip SRR23080414_1.fastq
gzip SRR23080414_2.fastq
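fasterq-dump writes uncompressed FASTQ, which is why we gzip the files above. Before symlinking them, an optional sanity check can confirm the archives are intact:

```shell
# Optional sanity check: gzip -t exits non-zero if an archive is corrupted.
gzip -t ~/colora_paper/rhizo/hifi/SRR23080413.fastq.gz && echo "hifi reads OK"
gzip -t ~/colora_paper/rhizo/hic/SRR23080414_1.fastq.gz \
        ~/colora_paper/rhizo/hic/SRR23080414_2.fastq.gz && echo "hic reads OK"
```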
cd ~/colora_paper/rhizo/colora/resources
mkdir raw_hifi
cd raw_hifi
ln -s ~/colora_paper/rhizo/hifi/SRR23080413.fastq.gz
cd ~/colora_paper/rhizo/colora/resources
mkdir raw_hic
cd raw_hic
ln -s ~/colora_paper/rhizo/hic/SRR23080414_1.fastq.gz
ln -s ~/colora_paper/rhizo/hic/SRR23080414_2.fastq.gz
cd ~/software
git clone https://github.com/c-zhou/OatkDB.git
cd ~/colora_paper/rhizo/colora/resources
mkdir oatkDB
cd oatkDB
ln -s ~/software/OatkDB/v20230921/dikarya_mito.fam
ln -s ~/software/OatkDB/v20230921/dikarya_mito.fam.h3f
ln -s ~/software/OatkDB/v20230921/dikarya_mito.fam.h3i
ln -s ~/software/OatkDB/v20230921/dikarya_mito.fam.h3m
ln -s ~/software/OatkDB/v20230921/dikarya_mito.fam.h3p
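The five ln -s commands above can also be written as a single loop over the profile files (same effect; the path assumes OatkDB was cloned into ~/software as above):

```shell
# Link every dikarya_mito.fam* profile file from the cloned OatkDB in one go.
db_dir=~/software/OatkDB/v20230921
for f in "$db_dir"/dikarya_mito.fam*; do
    ln -s "$f" .
done
```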
We will then set the following in our config.yaml (N.B. according to the data provided in the paper, the coverage should be around 161×; we should therefore set c to ~161 × 5 = 805):
oatk:
  k: 1001 # kmer size [1001]
  c: 805 # minimum kmer coverage [3]
  m: "resources/oatkDB/dikarya_mito.fam" # mitochondria gene annotation HMM profile database [NULL]
  optional_params:
    "-p": "" # to use for species that have a plastid db
From the paper:
"PacBio sequencing produced an average of 24,647,320,191 bp of HiFi reads with a read median quality of Q33 and median read length of 13,138."
The coverage is calculated as 24,647,320,191 bp / 153,000,000 bp (the estimated genome size) ≈ 161×.
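The arithmetic behind the c value used for oatk above can be checked quickly (integer division is close enough here):

```shell
# HiFi yield from the paper divided by the ~153 Mb genome size gives the coverage;
# oatk's minimum kmer coverage c is then set to roughly 5x that value.
coverage=$((24647320191 / 153000000))
echo "coverage: ${coverage}x"        # prints: coverage: 161x
echo "oatk c: $((coverage * 5))"     # prints: oatk c: 805
```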
For R. irregularis, the closest BUSCO lineage database is Mucoromycota.
Go to https://busco-data.ezlab.org/v5/data/lineages/ and download the database of interest.
cd ~/colora_paper/rhizo/colora/resources
mkdir busco_db
cd busco_db
wget https://busco-data.ezlab.org/v5/data/lineages/mucoromycota_odb10.2024-01-08.tar.gz
tar -xzf mucoromycota_odb10.2024-01-08.tar.gz
We will then set the following in our config.yaml:
busco:
  lineage: "resources/busco_db/mucoromycota_odb10" # lineage to be used for busco analysis
The fcs-gx database (gxdb) needs around 500 GB of disk space, so download it once and symlink it when needed. If you don't have enough space, you can skip the decontamination step and avoid downloading the database.
Here is how you can easily obtain the database:
mamba create -n ncbi_fcsgx ncbi-fcs-gx
mamba activate ncbi_fcsgx
cd colora/resources
mkdir gxdb
cd gxdb
sync_files.py get --mft https://ftp.ncbi.nlm.nih.gov/genomes/TOOLS/FCS/database/latest/all.manifest --dir .
Another thing we need for fcs-gx is the NCBI taxonomy ID of our species. For R. irregularis the NCBI tax ID is 588596.
We will then set the following in our config.yaml:
fcsgx:
  ncbi_tax_id: 588596
  path_to_gx_db: "resources/gxdb"
The reference genome input is optional. We are not going to use it in this example.
This is how our config.yaml should look at the end (see colora/config/README.md for further details):
# config.yaml for real data

# Set memory and threads for high demanding rules
high:
  mem_mb: 204800 # memory in MB
  t: 50 # number of threads

# Set memory and threads for medium demanding rules
medium:
  mem_mb: 20480 # memory in MB
  t: 20 # number of threads

# Set memory and threads for low demanding rules
low:
  mem_mb: 10240 # memory in MB
  t: 8 # number of threads

# Path to hifi reads
hifi_path: "resources/raw_hifi/"

# Path to hic reads
hic_path: "resources/raw_hic/"

# Customisable parameters for kmc
kmc:
  k: 19 # kmer size, it will be the same used for genomescope2
  ci: 1 # exclude k-mers occurring less than <value> times (default: 2)
  cs: 100000 # maximal value of a counter (default: 255)

# Customisable parameters for kmc_tools transform
kmc_tools:
  cx: 100000 # exclude k-mers occurring more than <value> times

# Customisable parameters for genomescope2
genomescope2:
  optional_params:
    "-p": "1"
    "-l": ""

# Customisable parameters for oatk
oatk:
  k: 1001 # kmer size [1001]
  c: 805 # minimum kmer coverage [3]
  m: "resources/oatkDB/dikarya_mito.fam" # mitochondria gene annotation HMM profile database [NULL]
  optional_params:
    "-p": "" # to use for species that have a plastid db

# Customisable parameters for fastp
fastp:
  optional_params:
    "--cut_front": False # to use only with Arima Hi-C library prep kit generated data
    "--cut_front_window_size": "" # to use only with Arima Hi-C library prep kit generated data

# Customisable parameters for hifiasm
hifiasm:
  phased_assembly: True # set to True if you want to obtain a phased assembly
  optional_params:
    "--hom-cov": "181"
    "-f": "" # used for small datasets
    "-l": "" # purge level. 0: no purging; 1: light; 2/3: aggressive [0 for trio; 3 for unzip]
    "--ul": "" # use this if you also have ONT data you want to integrate in your assembly

# Set this to False if you want to skip the fcsgx step:
include_fcsgx: True # include this rule only if you have previously downloaded the database (recommended to run fcsgx only on an HPC: it requires around 500 GB of space on your disk and a large amount of RAM)

# Customisable parameters for fcsgx
fcsgx:
  ncbi_tax_id: 588596
  path_to_gx_db: "resources/gxdb"

# Set this to False if you want to skip purge_dups steps:
include_purge_dups: False

# Customisable parameters for arima mapping pipeline:
arima:
  MAPQ_FILTER: 10

# Customisable parameters for yahs
yahs:
  optional_params:
    "-e": "" # you can specify the restriction enzyme(s) used by the Hi-C experiment

# Customisable parameters for quast
quast:
  optional_params:
    "--fragmented": ""
    "--large": ""
    # "-r": "resources/reference_genomes/yeast/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa"
    # "-g": "resources/reference_genomes/yeast/Saccharomyces_cerevisiae.R64-1-1.101.gff3"

# Customisable parameters for busco
busco:
  lineage: "resources/busco_db/mucoromycota_odb10" # lineage to be used for busco analysis
At this point, we should be able to run the pipeline with a single simple command:
cd ~/colora_paper/rhizo/colora/
screen -r rhizo # if you want to run it in the background
mamba activate snakemake
snakemake --software-deployment-method conda
N.B. if you are using a server where jobs are normally submitted through SLURM or other schedulers, you might consider setting up a snakemake profile in your system to handle job submission.
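As a sketch of what such a profile could look like, a file like ~/.config/snakemake/slurm/config.yaml might contain the following (hypothetical values using pre-v8 Snakemake syntax; Snakemake ≥8 uses the slurm executor plugin instead, so adapt this to your Snakemake version and cluster):

```yaml
# Hypothetical SLURM profile sketch (~/.config/snakemake/slurm/config.yaml).
# Adjust memory/CPU mapping and the job cap to your cluster's limits.
cluster: "sbatch --mem={resources.mem_mb} --cpus-per-task={threads}"
jobs: 50          # maximum number of jobs submitted at once
use-conda: true   # older equivalent of --software-deployment-method conda
```

With such a profile in place, the pipeline would be launched with something like `snakemake --profile slurm`.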