Alignment of FASTQ files for HG002 #34

Open
Osmluke opened this issue Dec 4, 2024 · 2 comments

@Osmluke

Osmluke commented Dec 4, 2024

I apologise in advance if this is a redundant or foolish question. I am still relatively new to FASTQ-to-BAM alignment, but I would seriously appreciate any guidance on the following questions about SV caller benchmarking using HG002 v0.6 as my truth set. For evaluating the callers, I was originally going to use the data under:

https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/HG002_HiSeq300x_fastq/140528_D00360_0018_AH8VC6ADXX/Project_RM8391_RM8392/

I see that there are multiple samples (2A1, 2A2, 2F1, 2F2, etc.). Is each individual FASTQ within these sub-directories 30x coverage, or do they add up to 30x per sub-directory?

I'm looking to create a BAM file of about 30x coverage that's not biased by read group or library. How would you recommend going about this? Would it be better to merge all R1 files together and all R2 files together, and then down-sample each merged file to 30x? Or is there a better approach?

In addition to this, how would one carry out the down-sampling?
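
For reference, one common way to down-sample paired-end reads is to keep each read pair with probability (target coverage / current coverage), making a single keep-or-drop decision per pair so that R1 and R2 stay in sync. The sketch below is only an illustration under stated assumptions: the function names `keep_fraction` and `downsample_pairs` are hypothetical, the inputs are assumed to be gzipped four-line FASTQ files with mates in the same order, and the current coverage is assumed to be known or estimated separately. Established tools (for example `seqtk sample` run with the same seed on both files, or `samtools view -s` on an aligned BAM) are likely a better choice for real data.

```python
import gzip
import random


def keep_fraction(current_cov, target_cov):
    """Fraction of read pairs to keep to go from current to target coverage."""
    if current_cov <= 0:
        raise ValueError("current coverage must be positive")
    # Never ask for more than 100% of the reads.
    return min(1.0, target_cov / current_cov)


def read_record(handle):
    """Return the next 4-line FASTQ record as a list of lines, or None at EOF."""
    lines = [handle.readline() for _ in range(4)]
    if not lines[0]:
        return None
    if not lines[3]:
        raise ValueError("truncated FASTQ record")
    return lines


def downsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=0):
    """Keep each read pair with probability `fraction`; return (kept, total)."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept = total = 0
    with gzip.open(r1_in, "rt") as f1, gzip.open(r2_in, "rt") as f2, \
            gzip.open(r1_out, "wt") as o1, gzip.open(r2_out, "wt") as o2:
        while True:
            rec1 = read_record(f1)
            rec2 = read_record(f2)
            if rec1 is None or rec2 is None:
                # Both files must end at the same time, or the mates are out of sync.
                if rec1 is not rec2:
                    raise ValueError("R1 and R2 contain different numbers of reads")
                break
            total += 1
            # One random draw per pair, applied to both mates.
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept, total
```

For example, going from roughly 300x to 30x would use `keep_fraction(300, 30)`, i.e. about 10% of the pairs. Because each pair is kept independently, the resulting coverage is approximate rather than exact.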

Once again, I apologise if these questions are redundant but any help would be greatly appreciated.

Thank you for your time and patience.

@nate-d-olson

Hi, No need to apologize. We have a new draft SV benchmark that is more accurate, https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.019-20241113/, that you may want to use instead of the v0.6 benchmark. Check out the README for a description of the benchmark sets and recommendations for benchmarking SVs with Truvari. This is a draft benchmark; all our analysis so far indicates it is very accurate, but it has not gone through our formal evaluation process. Please let us know if you find any errors in the benchmark set.

The HiSeq dataset is a bit older. I recommend using one of the datasets from Google Health (links below). Let me know if you have any questions or run into issues accessing the data or benchmarking SVs.

Novaseq
https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/30x/HG002.novaseq.pcr-free.30x.R1.fastq.gz
https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/30x/HG002.novaseq.pcr-free.30x.R2.fastq.gz

HiSeqX (4000)
https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R1.fastq.gz
https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R2.fastq.gz

@Osmluke

Osmluke commented Dec 5, 2024

Thank you very much for the information. I'll try out the newer draft benchmark and experiment with aligning the Novaseq FASTQ files for now. All of this is greatly appreciated!
