Missing Fastq reads #2385

@SidWeng

adam-core version: 0.33.0
Spark version: 3.3.0
Scala version: 2.12

I read a BGZ-compressed FASTQ file with the following code:

spark.sparkContext.newAPIHadoopFile(url, classOf[SingleFastqInputFormat], classOf[Void], classOf[Text], conf)

This works fine when the file is about 70 GB.
However, when the file is about 170 GB, some reads are missing, even though the missing reads are well-formed.
The missing reads do show up when I read the same file line by line:

spark.sparkContext.newAPIHadoopFile(url, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)

Is there any known limitation in SingleFastqInputFormat, or any advice that could help me debug this issue?
