Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collate fastq file before splitting #331

Merged
merged 2 commits into from
Mar 28, 2025
Merged

Collate fastq file before splitting #331

merged 2 commits into from
Mar 28, 2025

Conversation

hexylena
Copy link
Contributor

@hexylena hexylena commented Mar 25, 2025

It was reported to me that the _R1/_R2 from samtools fastq were not collated properly, that a single read was appearing in two wildly different places in R1/R2 which is completely silly.

I have tried to reproduce this but thus far have been unable to with a small subset

$ samtools view -b FILE.bam chrM > tmp.bam
$ du -h tmp.bam
560K    tmp.bam
$ samtools fastq -1 paired1.fq -2 paired2.fq -0 /dev/null -s /dev/null -n tmp.bam
[M::bam2fq_mainloop] discarded 480 singletons
[M::bam2fq_mainloop] processed 608 reads
$ diff <(grep ^@D paired1.fq) <(grep ^@D paired2.fq)
$

(Note the complete lack of difference between ordering.)

But if we look at the output of files which have come out of this tool, there are clear differences:

$ zless R1.fastq.gz  | grep '^@' | head -n 3
@D_____________________:1108:3364:16050
@D_____________________:2113:10647:9989
@D_____________________:2208:9374:82968
$ zless R2.fastq.gz  | grep '^@' | head -n 3
@D_____________________:1108:3364:16050
@D_____________________:1214:3361:56060
@D_____________________:1309:8329:98995

these were produced by the command

$ set -e
$ mkdir -p "$(dirname split/R1.fastq.gz)"
$ samtools fastq \
    -1 split/R1.fastq.gz \
    -2 split/R2.fastq.gz \
    -n \
    --threads 1 \
    /mnt/miniwdl/out.bam

This is indeed documented behaviour however:

If the input contains read-pairs which are to be interleaved or
written to separate files in the same order, then the input should be
first collated by name. Use samtools collate or samtools sort -n to
ensure this.

https://www.htslib.org/doc/samtools-fasta.html#DESCRIPTION

So it makes some sense to collate, or at some point ensure that the BAMs are sorted.

I think there is a discussion to be had over whether automatic collation in sensible or a waste of runtime, but on the other hand, this is maybe a small footgun and eliminating it would make sense to reduce the potential failure modes (give our focus on reducing risk and all.)

Checklist

  • Pull request details were added to CHANGELOG.md.
  • Documentation was updated (if required).
  • parameter_meta was added/updated (if required).
  • Submodule branches are on develop or a tagged commit.

@hexylena
Copy link
Contributor Author

(I've left the documentation tasks open until someone has an opinion on the merging of this. If there's a signal that it's desired, I'll write the changelog entry.)

@DavyCats
Copy link
Contributor

I think it would be good to add note in the changelog, as this is essentially a bug fix. Eg. "Fixed an issue where the samtools fastq task could produce desynced R1 and R2 files. It now uses samtools collate to group reads by readname in order to avoid this issue."

@hexylena
Copy link
Contributor Author

Absolutely!

It was reported to me that the _R1/_R2 from `samtools fastq` were not
collated properly, that a single read was appearing in two wildly
different places in R1/R2 which is completely silly.

I have tried to reproduce this but thus far have been unable to:

    $ samtools view -b FILE.bam chrM > tmp.bam
    $ du -h tmp.bam
    560K    tmp.bam
    $ samtools fastq -1 paired1.fq -2 paired2.fq -0 /dev/null -s /dev/null -n tmp.bam
    [M::bam2fq_mainloop] discarded 480 singletons
    [M::bam2fq_mainloop] processed 608 reads
    $ diff <(grep ^@d paired1.fq) <(grep ^@d paired2.fq)
    $

Note the complete lack of difference between ordering.

But if we look at the output of files which have come out of this tool,
there are clear differences:

    $ zless R1.fastq.gz  | grep '^@' | head -n 3
    @D_____________________:1108:3364:16050
    @D_____________________:2113:10647:9989
    @D_____________________:2208:9374:82968
    $ zless R2.fastq.gz  | grep '^@' | head -n 3
    @D_____________________:1108:3364:16050
    @D_____________________:1214:3361:56060
    @D_____________________:1309:8329:98995

these were produced by the command

    $ set -e
    $ mkdir -p "$(dirname split/R1.fastq.gz)"
    $ samtools fastq \
        -1 split/R1.fastq.gz \
        -2 split/R2.fastq.gz \
        -n \
        --threads 1 \
        /mnt/miniwdl/out.bam

This is indeed documented behaviour however:

> If the input contains read-pairs which are to be interleaved or
> written to separate files in the same order, then the input should be
> first collated by name. Use samtools collate or samtools sort -n to
> ensure this.
>
> https://www.htslib.org/doc/samtools-fasta.html#DESCRIPTION

So it makes some sense to collate, or at some point ensure that the BAMs
are sorted.

I think there is a discussion to be had over whether automatic collation
in sensible or a waste of runtime, but on the other hand, this is maybe
a small footgun and eliminating it would make sense to reduce the
potential failure modes (give our focus on reducing risk and all.)
@rhpvorderman rhpvorderman merged commit 8bf0599 into develop Mar 28, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants