Collate fastq file before splitting #331

hexylena · 2025-03-25T13:31:36Z

It was reported to me that the _R1/_R2 from samtools fastq were not collated properly, that a single read was appearing in two wildly different places in R1/R2 which is completely silly.

I have tried to reproduce this but thus far have been unable to with a small subset

$ samtools view -b FILE.bam chrM > tmp.bam
$ du -h tmp.bam
560K    tmp.bam
$ samtools fastq -1 paired1.fq -2 paired2.fq -0 /dev/null -s /dev/null -n tmp.bam
[M::bam2fq_mainloop] discarded 480 singletons
[M::bam2fq_mainloop] processed 608 reads
$ diff <(grep ^@D paired1.fq) <(grep ^@D paired2.fq)
$

(Note the complete lack of difference between ordering.)

But if we look at the output of files which have come out of this tool, there are clear differences:

$ zless R1.fastq.gz  | grep '^@' | head -n 3
@D_____________________:1108:3364:16050
@D_____________________:2113:10647:9989
@D_____________________:2208:9374:82968
$ zless R2.fastq.gz  | grep '^@' | head -n 3
@D_____________________:1108:3364:16050
@D_____________________:1214:3361:56060
@D_____________________:1309:8329:98995

these were produced by the command

$ set -e
$ mkdir -p "$(dirname split/R1.fastq.gz)"
$ samtools fastq \
    -1 split/R1.fastq.gz \
    -2 split/R2.fastq.gz \
    -n \
    --threads 1 \
    /mnt/miniwdl/out.bam

This is indeed documented behaviour however:

If the input contains read-pairs which are to be interleaved or
written to separate files in the same order, then the input should be
first collated by name. Use samtools collate or samtools sort -n to
ensure this.

https://www.htslib.org/doc/samtools-fasta.html#DESCRIPTION

So it makes some sense to collate, or at some point ensure that the BAMs are sorted.

I think there is a discussion to be had over whether automatic collation in sensible or a waste of runtime, but on the other hand, this is maybe a small footgun and eliminating it would make sense to reduce the potential failure modes (give our focus on reducing risk and all.)

Checklist

Pull request details were added to CHANGELOG.md.
Documentation was updated (if required).
parameter_meta was added/updated (if required).
Submodule branches are on develop or a tagged commit.

hexylena · 2025-03-25T13:33:22Z

(I've left the documentation tasks open until someone has an opinion on the merging of this. If there's a signal that it's desired, I'll write the changelog entry.)

DavyCats · 2025-03-25T13:55:36Z

I think it would be good to add note in the changelog, as this is essentially a bug fix. Eg. "Fixed an issue where the samtools fastq task could produce desynced R1 and R2 files. It now uses samtools collate to group reads by readname in order to avoid this issue."

hexylena · 2025-03-25T13:59:34Z

Absolutely!

@d

It was reported to me that the _R1/_R2 from `samtools fastq` were not collated properly, that a single read was appearing in two wildly different places in R1/R2 which is completely silly. I have tried to reproduce this but thus far have been unable to: $ samtools view -b FILE.bam chrM > tmp.bam $ du -h tmp.bam 560K tmp.bam $ samtools fastq -1 paired1.fq -2 paired2.fq -0 /dev/null -s /dev/null -n tmp.bam [M::bam2fq_mainloop] discarded 480 singletons [M::bam2fq_mainloop] processed 608 reads $ diff <(grep ^@d paired1.fq) <(grep ^@d paired2.fq) $ Note the complete lack of difference between ordering. But if we look at the output of files which have come out of this tool, there are clear differences: $ zless R1.fastq.gz | grep '^@' | head -n 3 @D_____________________:1108:3364:16050 @D_____________________:2113:10647:9989 @D_____________________:2208:9374:82968 $ zless R2.fastq.gz | grep '^@' | head -n 3 @D_____________________:1108:3364:16050 @D_____________________:1214:3361:56060 @D_____________________:1309:8329:98995 these were produced by the command $ set -e $ mkdir -p "$(dirname split/R1.fastq.gz)" $ samtools fastq \ -1 split/R1.fastq.gz \ -2 split/R2.fastq.gz \ -n \ --threads 1 \ /mnt/miniwdl/out.bam This is indeed documented behaviour however: > If the input contains read-pairs which are to be interleaved or > written to separate files in the same order, then the input should be > first collated by name. Use samtools collate or samtools sort -n to > ensure this. > > https://www.htslib.org/doc/samtools-fasta.html#DESCRIPTION So it makes some sense to collate, or at some point ensure that the BAMs are sorted. I think there is a discussion to be had over whether automatic collation in sensible or a waste of runtime, but on the other hand, this is maybe a small footgun and eliminating it would make sense to reduce the potential failure modes (give our focus on reducing risk and all.)

hexylena force-pushed the fastq-collation branch from 35d7c0f to 38c5c9a Compare March 25, 2025 14:01

Merge branch 'develop' into fastq-collation

c8ecc6c

rhpvorderman approved these changes Mar 28, 2025

View reviewed changes

rhpvorderman merged commit 8bf0599 into develop Mar 28, 2025
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Collate fastq file before splitting #331

Collate fastq file before splitting #331

hexylena commented Mar 25, 2025 •

edited

Loading

hexylena commented Mar 25, 2025

DavyCats commented Mar 25, 2025

hexylena commented Mar 25, 2025

Collate fastq file before splitting #331

Collate fastq file before splitting #331

Conversation

hexylena commented Mar 25, 2025 • edited Loading

Checklist

hexylena commented Mar 25, 2025

DavyCats commented Mar 25, 2025

hexylena commented Mar 25, 2025

hexylena commented Mar 25, 2025 •

edited

Loading