SLR-superscaffolder

Introduction

This is a scaffold assembler designed for stLFR reads[1]. It uses the co-barcoding information from stLFR reads to assemble contigs to scaffolds.

Here is an illustration of this pipeline:

User's Guide

Installation

How to download the source codes.

Clone all codes in 1 step.

git clone --recursive  https://github.com/BGI-QingDao/SLR-superscaffolder.git YOUR-DOWNLOAD-DIR --depth 10

Clone step by step

git clone https://github.com/BGI-QingDao/SLR-superscaffolder.git YOUR-DOWNLOAD-DIR
cd YOUR-DOWNLOAD-DIR
git submodule init
git submodule update

How to compile source codes and install executable files.

cd YOUR-DOWNLOAD-DIR
./install.sh YOUR-INSTALL-DIR

Notice: the intall.sh will create a new folder named by YOUR-INSTALL-DIR.

Structure of files in YOUR-INSTALL-DIR

├── scaffold                    # The pipeline script folder.
│   ├── __common_function.sh    # The utils script.
│   ├── clean_prepare.m4        # The template clean script.
│   ├── conf.ini                # The default configuration file.
│   ├── prepare.sh              # The preparation script that generates 1-step executable bash script according to template script and configuration file.
│   ├── run.sh                  # The template 1-step executable script.
│   ├── step_1_prepare_info.m4  # The template step 1 script. Run bwa-mem and parse the sam format mapping information into a series of files
│   ├── step_2_order.m4         # The template step 2 script. Determine the order of contigs to generate basic scaffolds.
│   ├── step_3_orientation.m4   # The template step 3 script. Determine the orientation those contigs in above scaffolds.
│   ├── step_4_pe_fill.m4       # The template step 4 script. Fill short contigs into above scaffolds.
│   ├── step_5_gapsize.m4       # The template step 5 script. Estimate gap sizes between those contigs in above scaffolds.
│   └── step_6_gen_seq.m4       # The template step 6 script. Generate the scaffold sequences based on all above information.
└── bin                         # The bin folder. Below utils are all binary executable files called by above scripts.
│   ├── BinCluster
│   ├── ChopBin
│   ├── FakeSOAPContig
│   ├── FillTrunkByPE
│   ├── GapSize
│   ├── MST
│   ├── Orientation
│   ├── PEGraph
│   ├── ParseReadName
│   ├── Sam2ReadOnContig
│   ├── ScaffInfo2Seq
│   ├── SeedCluster
│   ├── SplitInfo
│   ├── StaticsticUnique
│   ├── Trunk2ScaffInfo
│   └── test
└── stLFR_barcode_split         # a backup script for splitting reads and obtain barcodes
    ├── barcode_list.txt        # the barcode whitelist of stLFR reads
    ├── split_barcode.pl        # the splitting main script
    └── split_barcode.sh        # a easy-to-use swap of split_barcode.pl

Preliminary

Two input data are required: the stLFR reads and the contigs.

the stLFR reads are required as 2 files :
- your-prefix.read1.your-suffix
- your-prefix.read2.your-suffix
We assume your stLFR reads and barcode information have already been splitted.

If your data have not been splitted yet, then use the split barcode script below:

# if your raw stLFR reads contain more than 1 lane, you need to cat all lines into a single file first!
YOUR-INSTALL-DIR/stLFR_barcode_split/split_barcode.sh raw_read1.fq.gz raw_read2.fq.gz

Also, you can try "1.fq_BarcodeSplit" step from stLFR_v1(https://github.com/MGI-tech-bioinformatics/stLFR_v1.git)

The 1st line of the resulting fastq file after read splitting should look like:

@CL100050407L1C002R064_8855#514_1207_1392/1     7       1

the "#514_1207_1392" part is the digital representation of original barcode sequence.

the contig must follow the SOAPdenovo contig format, which contains 2 files :
- your-prefix.contig
- your-prefix.ContigIndex

Contigs assembled using various assemblers are acceptable. We recommend MaSuRCA[2].

If your contigs generated by an assembler other than SOAPdenovo series, then you can easily convert the contigs to SOAPdenovo format by running:

YOUR-INSTALL-DIR/bin/FakeSOAPContig < your-contig-sequence-file 1>your-prefix.contig 2>your-prefix.ContigIndex

Notice : a file named "fakesoap.name2index.map.txt" will automatic generate that contains the mapping information about original name to new contig_id. Each re-run of FakeSOAPContig command will overwrite this file!

General usage

1st. Prepare the conf.ini file

cd YOUR-PROJECT-DIR
cp YOUR-INSTALL-DIR/scaffold/conf.ini ./your-conf.ini
vim conf.ini # and configure it!

As the input contig quality greatly influences the scaffolding, we recommend to use a high-quality contig set with larger N50 size (At least >10kb).

The default parameters for the bin size (MST_BIN_SIZE, HT_BIN_SIZE, GAP_BIN_SIZE) were optimized for contig N50 20kb-100kb. If your input contigs are shorter, please reduce the three bin sizes (MST_BIN_SIZE ~ a half of contig N50, HT_BIN_SIZ ~ GAP_BIN_SIZE ~ a half of MST_BIN_SIZE).

Other parameters have been optimized based on the stLFR co-barcoded sequencing platform, and do not need to be modified.

2nd. Generate the pipeline and work folder

YOUR-INSTALL-DIR/scaffold/prepare.sh ./your-conf.ini

This command will create a new work folder named by PROJECT_NAME from ./your-conf.ini, which has all pipeline scripts.

Make sure there is no folder using the same name in current path upon running this command.

3rd. Run the pipeline

cd your-work-folder
./run.sh >log_pipeline 2>&1 &

Notice : If any error happens in the middle and the running exits upon the last step, you can re-run run.sh and it will automatically detect and skip all completed previous steps.

Notice : Independent execution of each step is supported. But to get the correct final output, you must re-run all subsequent steps. That is, re-run step-[n+1 , 6] if you re-run step-n.

4th. Organize the final file structure for cleanning.

Notice : please do this ONLY after a sanity-check of everything and make sure the output is reasonable.

./clean_prepare.sh

This command will store files in three categories : output , logs and tmps .

Your can remove the logs and tmps folders to free disk.

Output Results

The final output contains 2 files :

YOUR-PREFIX.scaff_seqs       # The final scaffold sequence file
YOUR-PREFIX.scaff_agp       # Details about how we assemble INPUT into OUTPUT. in AGP[4] format.

Configuration

###############################################################################
# Project Settings.
#
PROJECT_NAME="work_dir"      #name of work dirctory
THREADS=15                   #num of threads

###############################################################################
# Toos settings 
#
STLFR_ASSEMBLER_DIR=YOUR-INTALL-DIR         # stLFR Scaffold Assembler installation directory 
BWA_DIR=YOUR-BWA-DIR                        # BWA installation directory

###############################################################################
# Input data settings
#   
#   MAKE SURE ALL PATH ARE ABSOLUTELY PATH.
#   DO NOT USE "../"  "~/" "./" !!!
#
## for input stLFR reads
R1="/home/chr19/chr19_reads1.fq.clean.gz"   # the read1 of stLFR reads
R2="/home/chr19/chr19_reads2.fq.clean.gz"   # the read2 of stLFR reads
## for input contig
SOAP_DIR="/home/soap_contig"                # the input contigs directory
SOAP_K=63                                   # the used-k-value of SOAPdenovo. If contigs are not generated using SOAPdenovo, then keep it as 63. 
PREFIX="chr19_soap2"                        # the prefix of you contig/ContigIndex

###############################################################################
# Control parameter settings 
#
SAMPLING_RATIO=1.0                          # random sample barcode to reduce clusting time. valid value : [0.1,1.0]
###################################################
# for step 1 bwa mem
BWA_K=53                                    # the mapping k-value for bwa

###################################################
# for step 2 order
MST_BIN_SIZE=7000                           # the unit bin size for order.
MST_BIN_CLUSTER=0.1                         # the bin cluster threshold for order.
MST_CLUSTER=0.1                             # the order detect threshold.

###################################################
# for step 3 orientation
HT_BIN_SIZE=3500                            # the unit bin size for orientation.
HT_BIN_CLUSTER=0.1                          # the bin cluster threshold for orientation.
RANK=4                                      # the orientaion detect rank.

###################################################
# for step 4 pe_fill
# please make sure that SEED_MIN >= HT_BIN_SIZE * 2 
MAX_INSERT_SIZE=5000                        # the max allowed insert size.
PE_SEED_MIN=1000                            # the min contig size to fill.
PE_SEARCH_MAX=5000                          # the max search length before stop.
PE_MIN_JOINBARCODES=10                      # the min joint barcode count.
PE_MIN_COUNT=3                              # the min joint PE count.
PE_FILL=5                                   # the N size between PE-joint contigs.
###################################################
# for step 5 gap_size
GAP_BIN_SIZE=4000                           # the unit bin size for gap

###################################################
# for step 6 parameters for generate sequence 
#
MIN_SCONTIG=300                             # the min sequence length that allows to write into the final scaffolds.
MIN_N=10                                    # the min N size between 2 gapped contigs.
MIN_C=1                                     # the min N size between 2 overlaped contigs.

MinHash

Calculating Jaccard similarities for N*N bins is compute-intensive for large genome (> 1GB).

To reduce the running time, user can open the minhash strategy( random sample barcodes ) by set SAMPLING_RATIO=n (0.1<n<1.0).

Notice: based on our benchmark, the minhash strategy can dramaticly ( faster than linear ) reduce the running-time for large genome but will sightly reduce the quality of final results.

Miscellaneous

Requirements
- Linux system && Bash
- gcc ( with std11 support )
- make && m4 ( default for almost every linux distribution)
Dependency
- We use bwa[3] to map stLFR reads against contigs.
  - you may replace this to any tools that can generate the correct SAM output.
Limitations
- We do hope "not bad" contigs, whose N50 >= 15K with "not bad" accuracy.
Resources
- The memory peak for human-like-genome-size is 150G.
- Some steps support multi-thread :
  - bwa part .
  - BinCluster part .

Reference

[1] Efficient and unique co-barcoding of second-generation sequencing reads from long DNA molecules enabling cost effective and accurate sequencing, haplotyping, and de novo assembly

[2] Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm

[3] Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

[4] AGP Specification

Contact

for algrothim details & discussion
- please contact [email protected] or [email protected]
for code details & bug report
- please contact [email protected]
- or please creat an issue in github.

Name		Name	Last commit message	Last commit date
Latest commit History 778 Commits
example		example
scripts		scripts
src		src
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
image.png		image.png
install.sh		install.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SLR-superscaffolder

Introduction

Table of Contents

User's Guide

Installation

Preliminary

General usage

Output Results

Configuration

MinHash

Miscellaneous

Reference

Contact

About

Releases

Packages

Contributors 4

Languages

License

BGI-Qingdao/SLR-superscaffolder

Folders and files

Latest commit

History

Repository files navigation

SLR-superscaffolder

About

Resources

License

Stars

Watchers

Forks

Languages