-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VG STATS: 2000 Properly Paired Inquiry #4434
Comments
I think the feature request for a vg tool for marking/removing duplicates might be already tracked in #2662? The request is also mentioned in #3283. So people definitely want it. Giraffe insists that your paired end fragment distribution (which you can set manually with It sounds like the fragment length distribution, as learned from the non-deduplicated input reads, didn't satisfy this constraint, which is why Giraffe produced that What was the full warning message from Giraffe? It should have printed a couple more lines giving statistics about the fragment length distribution it observed: What is the fragment length distribution for this sequencing run supposed to be? If you actually are trying to run data where the fragments are meant to be very short relative to the reads, either we have to change the clusterer to not have this constraint, or you need to lie to Giraffe about your fragment length distribution with As for the statistics on the result showing 2000 reads marked properly paired, I think what is happening is that the initial 1000 pairs that we map with loose constraints, to try and learn the fragment length distribution, are being marked properly paired. But then all the reads mapped after that are falling back to single-ended mapping because of the problem that the warning indicates, and none of those are being marked properly paired. So we are actually producing output with 2000 reads that are marked properly paired. |
@adamnovak Hi, thank you for your response! The full warning message was: I am not setting a fragment length distribution for this sequencing run, just using the default. Thank you for this suggestion! |
@xchang1 I feel like we might be foggy on whether the "fragment length" is between the inner or outer ends of the reads in the fragment. Because it makes sense to have 150bp reads on a fragment with 114 bp between their inner ends, and it's weird that the clusterer wouldn't like that. |
The fragment length should always be the distance between the outer ends. The problem might be that we find the minimum distance between the outer ends, rather than the minimum distance between the inner ends plus the lengths of the reads. So the topologically minimum distance might end up being smaller than the lengths of the reads if there's a big deletion edge. I wouldn't expect that to happen a lot though. @AbdelF6 what graph are you using? Is it just the CHM13 reference? |
Hi @xchang1, yes I am using the CHM13-T2T reference to run on vg. I downloaded the indexes from this Github link: https://github.com/human-pangenomics/hpp_pangenome_resources/blob/main/hprc-v1.0-mc.md. |
Ah ok. I wouldn't expect that problem on the HPRC minigraph-cactus graph. |
Oh also how long are your reads? Maybe they're just short and |
Hi @xchang1, the reads are around 75 nucleotides base paired ends. But, we have run vg giraffe with other files, all with the same read lengths (around 75 nucleotide base paired ends), and we did not get this 2000 properly paired statistic or single-end mapping error for those runs. If the read length is around 75, what would you recommend we set the distance limit to? |
Maybe try 100? |
1. What were you trying to do?
VG stats on 3 files after aligning them using vg giraffe and the T2T-CHM13 reference.
2. What did you want to happen?
vg giraffe to use paired-end mapping and properly paired values to be accurate
3. What actually happened?
Got an error for vg giraffe:
warning[vg::giraffe]: Cannot cluster reads with a fragment distance smaller than read distance
warning[vg::giraffe]: Falling back on single-end mapping
For VG stats, had a properly paired value of 2000 for all 3 files
After removing duplicates form the files using samtools markdup, no errors about cluster reads came up. I was wondering what the 2000 value means (is this a default output for when single-end mapping occurs? Also, is there a command in vg that can allow duplicates to be removed/notify user of duplicates in file?
4. If you got a line like
Stack trace path: /somewhere/on/your/computer/stacktrace.txt
, please copy-paste the contents of that file here:5. What data and command can the vg dev team use to make the problem happen?
Files can be sent to be downloaded and for the command to be tested
6. What does running
vg version
say?vg version v1.60.0 "Annicco"
The text was updated successfully, but these errors were encountered: