Alignment of FASTQ files for HG002 #34

Osmluke · 2024-12-04T12:32:03Z

I apologise in advance if this is a redundant or foolish question. I am still relatively new to FASTQ-to-BAM alignment, but I would seriously appreciate any guidance on the following questions about SV caller benchmarking using HG002 v0.6 as my truth set. For evaluating the callers I was originally going to use

Under: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/HG002_HiSeq300x_fastq/140528_D00360_0018_AH8VC6ADXX/Project_RM8391_RM8392/

I see that there are multiple samples (2A1, 2A2, 2F1, 2F2, etc.). Is each individual FASTQ within these sub-directories 30x coverage, or do they add up to 30x per sub-directory?

I'm looking to create a BAM file of about 30x coverage that's not biased by read group or library. How would you recommend going about this? Would it be better to merge all R1 files together and all R2 files together, and then down-sample each merged file to 30x? Or is there a better approach?

In addition to this, how would one carry out the down-sampling?

Once again, I apologise if these questions are redundant but any help would be greatly appreciated.

Thank you for your time and patience.

nate-d-olson · 2024-12-04T20:12:17Z

Hi, No need to apologize. We have a new draft SV benchmark that is more accurate, https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.019-20241113/, that you may want to use instead of the v0.6 benchmark. Check out the README for a description of the benchmark sets and recommendations for benchmarking SVs with Truvari. This is a draft benchmark; all our analysis so far indicates it is very accurate, but it has not gone through our formal evaluation process. Please let us know if you find any errors in the benchmark set.

The HiSeq dataset is a bit older. I recommend using one of the datasets from Google Health (links below). Let me know if you have any questions or run into issues accessing the data or benchmarking SVs.

Novaseq
https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/30x/HG002.novaseq.pcr-free.30x.R1.fastq.gz
https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/30x/HG002.novaseq.pcr-free.30x.R2.fastq.gz

HiSeqX (4000)
https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R1.fastq.gz
https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R2.fastq.gz
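
On the down-sampling question: the files linked above are already at 30x, so no down-sampling should be needed for them. If you do want to down-sample paired FASTQ files yourself, the key point is to make the same keep/drop decision for each R1/R2 pair so the mates stay in sync. Below is a minimal Python sketch of that idea, not a tested pipeline. It assumes gzipped FASTQ files with exactly four lines per record and R1/R2 in the same order; the function names `keep_fraction` and `downsample_pairs` and the file names in the usage note are made up for illustration.

```python
import gzip
import random


def keep_fraction(current_coverage, target_coverage):
    """Fraction of read pairs to keep to go from current to target coverage."""
    if current_coverage <= 0:
        raise ValueError("current_coverage must be positive")
    return min(1.0, target_coverage / current_coverage)


def downsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=0):
    """Randomly keep about `fraction` of read pairs, keeping R1 and R2 in sync.

    Assumes gzipped FASTQ with four lines per record and identical record
    order in both input files. Returns (pairs_kept, pairs_seen).
    """
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept = total = 0
    with gzip.open(r1_in, "rt") as f1, gzip.open(r2_in, "rt") as f2, \
         gzip.open(r1_out, "wt") as o1, gzip.open(r2_out, "wt") as o2:
        while True:
            rec1 = [f1.readline() for _ in range(4)]
            rec2 = [f2.readline() for _ in range(4)]
            if not rec1[0] and not rec2[0]:
                break  # both files ended together
            if not rec1[0] or not rec2[0]:
                raise ValueError("R1 and R2 contain different numbers of records")
            total += 1
            # One random draw per pair, so both mates are kept or dropped together.
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept, total
```

For example, if the merged input were about 300x (an assumption based on the directory name, not something I have checked), `keep_fraction(300, 30)` gives roughly 0.1, and `downsample_pairs("merged.R1.fastq.gz", "merged.R2.fastq.gz", "sub.R1.fastq.gz", "sub.R2.fastq.gz", 0.1, seed=42)` would write about 10% of the pairs. Existing tools such as `seqtk sample`, run with the same `-s` seed on both files, follow the same principle and will be faster on files of this size.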

Osmluke · 2024-12-05T06:48:10Z

Thank you very much for the information. I'll try out the newer draft benchmark for the time being and experiment with aligning the NovaSeq FASTQ files. All of this is greatly appreciated!
