This script calculates the quality value (QV) of genome assemblies using strobemers, which support error-tolerant searching and matching. In text similarity searches, strobemers can still match across mutations or small shifts in the text, so they allow more flexible matching than contiguous k-mers. This pipeline uses strobemers to increase the number of matches between error-prone third-generation long-read assembly sequences and an accurate NGS read set. It thereby tolerates the sequencing errors of long reads while still evaluating the true assembly errors.
- gcc/g++ (version > 4.8.5, for C++11 support)
- make
- Meryl v1.4.1
- Strobemers
git clone https://github.com/BGI-Qingdao/strobeQV.git ./strobeQV
cd ./strobeQV/sources
make
export PATH=$PWD:$PATH
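Before running the pipeline it can help to confirm that the build tools and Meryl are reachable. The following is a minimal sketch, assuming the dependencies are exposed under the command names `g++`, `make`, and `meryl`. The dependency list names the packages but not the commands, so these names are assumptions, and `missing_tools` is a hypothetical helper that is not part of strobeQV.

```shell
# Hypothetical helper (not part of strobeQV):
# print every tool from the argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v is the POSIX way to test whether a command is available
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call; the command names are assumed from the dependency list:
# missing_tools g++ make meryl
```

An empty result means every listed command was found.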
- !! strobeQV requires the accurate NGS reads in a single file, plus either one haplotype-collapsed assembly or two haplotype-resolved assemblies of the same sample !!
- !! To change the Strobemer settings, provide all 4 Strobemer Options together !!
Usage: sh strobeQV.sh <ngs.read.fq.gz> <asm1.fasta> [asm2.fasta] <out> [strobe_n] [strobe_k] [strobe_w] [strobe_m]
<ngs.read.fq.gz>: accurate NGS read file of this sample
<asm1.fasta>: assembly 1 of the same sample
[asm2.fasta]: assembly 2, optional
<out>: output prefix; the QV of asm1, asm2 and both (asm1+asm2) is written to <out>.qv
`Strobemer Options`
[strobe_n]: number of strobes (default: 2), optional
[strobe_k]: strobe length, at most 32 (default: 7), optional
[strobe_w]: window size (default: 14), optional
[strobe_m]: minstrobe mode (default: randstrobe), optional
< >: required
[ ]: optional
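One reading of the usage line above is that only four argument counts are meaningful: 3 or 4 positional arguments, optionally followed by all four Strobemer Options. The sketch below encodes that reading as a pre-flight check. `strobeqv_arg_layout` is a hypothetical helper, and how strobeQV.sh itself parses its arguments is not confirmed here.

```shell
# Hypothetical pre-flight check (not part of strobeQV).
# Maps an argument count to the layout implied by the usage line:
#   3 = reads, asm1, out
#   4 = reads, asm1, asm2, out
#   7 or 8 = the same, followed by all four Strobemer Options
strobeqv_arg_layout() {
    case "$1" in
        3) echo "one-asm" ;;
        4) echo "two-asm" ;;
        7) echo "one-asm+options" ;;
        8) echo "two-asm+options" ;;
        *) echo "invalid" ;;   # e.g. only some of the Strobemer Options given
    esac
}
```

A wrapper script could call `strobeqv_arg_layout "$#"` and stop early when it prints `invalid`.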
# Run strobeQV
sh strobeQV.sh ngs.reads.fq.gz mat_asm.fasta pat_asm.fasta test
# One haplotype-collapsed assembly, one single NGS read file
mkdir strobeQV_eva
cd strobeQV_eva
ln -s DATA_PATH/YOUR_NGS_FILE ./ngs.reads.fq.gz
ln -s DATA_PATH/YOUR_ASM_FILE ./asm.fasta
sh INSTALL_PATH/strobeQV/strobeQV.sh ngs.reads.fq.gz asm.fasta mixed
# One haplotype-collapsed assembly, multiple NGS read files (merge them first)
cat DATA_PATH/YOUR_NGS_FILE1 DATA_PATH/YOUR_NGS_FILE2 > DATA_PATH/YOUR_NGS_FILE
mkdir strobeQV_eva
cd strobeQV_eva
ln -s DATA_PATH/YOUR_NGS_FILE ./ngs.reads.fq.gz
ln -s DATA_PATH/YOUR_ASM_FILE ./asm.fasta
sh INSTALL_PATH/strobeQV/strobeQV.sh ngs.reads.fq.gz asm.fasta mixed
# Two haplotype-resolved assemblies, one single NGS read file
mkdir strobeQV_eva
cd strobeQV_eva
ln -s DATA_PATH/YOUR_NGS_FILE ./ngs.reads.fq.gz
ln -s DATA_PATH/YOUR_ASM_FILE1 ./mat_asm.fasta
ln -s DATA_PATH/YOUR_ASM_FILE2 ./pat_asm.fasta
sh INSTALL_PATH/strobeQV/strobeQV.sh ngs.reads.fq.gz mat_asm.fasta pat_asm.fasta diploid
# Two haplotype-resolved assemblies, multiple NGS read files (merge them first)
cat DATA_PATH/YOUR_NGS_FILE1 DATA_PATH/YOUR_NGS_FILE2 > DATA_PATH/YOUR_NGS_FILE
mkdir strobeQV_eva
cd strobeQV_eva
ln -s DATA_PATH/YOUR_NGS_FILE ./ngs.reads.fq.gz
ln -s DATA_PATH/YOUR_ASM_FILE1 ./mat_asm.fasta
ln -s DATA_PATH/YOUR_ASM_FILE2 ./pat_asm.fasta
sh INSTALL_PATH/strobeQV/strobeQV.sh ngs.reads.fq.gz mat_asm.fasta pat_asm.fasta diploid
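The multi-file examples above merge the read files with `cat`. For gzip-compressed input this works because concatenated gzip files form a valid multi-member gzip stream. The sketch below merges the files and then checks the result with `gzip -t`. `merge_reads` is a hypothetical helper, and it assumes every input file is gzip-compressed.

```shell
# Hypothetical helper (not part of strobeQV).
# Usage: merge_reads <merged.fq.gz> <reads1.fq.gz> <reads2.fq.gz> [...]
# Concatenates gzip-compressed read files and verifies that the merged
# file is still a readable gzip stream.
merge_reads() {
    out=$1
    shift
    cat "$@" > "$out" && gzip -t "$out"
}
```

For example, `merge_reads DATA_PATH/YOUR_NGS_FILE DATA_PATH/YOUR_NGS_FILE1 DATA_PATH/YOUR_NGS_FILE2` would replace the plain `cat` line and fail if the merged file is corrupt.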
- !! The QV evaluation results are stored in the "<out>.qv" file !!
Each line has five columns: ASM_NAME, number of STROBEMERS ONLY IN ASM, number of TOTAL STROBEMERS IN ASM, QV, ERROR_RATE
For example:
mat_asm 18252 82877672 48.0321 1.57322e-05
pat_asm 14834 82985222 48.9383 1.27693e-05
both 33086 165862894 48.4619 1.42497e-05
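The QV and error-rate columns can be cross-checked from the two count columns. The example rows above are numerically consistent with a Merqury-style formula, error = 1 - (1 - only/total)^(1/k) and QV = -10 * log10(error), with an effective length k = 14. That value is an assumption inferred from the example numbers, not something this README states, and `qv_from_counts` is a hypothetical helper rather than the pipeline's own code.

```shell
# Sketch: recompute QV and error rate from the two count columns.
# ASSUMPTION: Merqury-style formula with effective length k = 14 (default
# third argument). This reproduces the example rows but is not documented here.
qv_from_counts() {
    awk -v only="$1" -v total="$2" -v k="${3:-14}" 'BEGIN {
        err = 1 - (1 - only / total) ^ (1 / k)   # per-base error rate
        qv  = -10 * log(err) / log(10)           # Phred-scaled quality value
        printf "%.4f %.5e\n", qv, err
    }'
}

# Example: the mat_asm row from the table above
# qv_from_counts 18252 82877672
```

Running the helper on the summed counts of mat_asm and pat_asm gives the values in the `both` row, which suggests that row is computed from the pooled counts rather than by averaging the two QVs.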