Questions about output metrics #26

lingrongjin · 2024-11-05T12:59:30Z

Hi Jim,

Thanks for developing the tool; I found it quite easy to use. I have a few conceptual questions about the output metrics of sylph profile that I hope you can clarify.

From my understanding, containment ANI is calculated as the fraction of k-mers of a reference genome contained in a given sample (i.e., 95% containment ANI means 95% of the k-mers of the reference genome are contained in the sample).
Is "sequence_abundance" calculated as the number of reads assigned to each genome divided by the total number of classified reads? I noticed that sequence_abundance sums to 100 for most of my samples, but I would expect some reads that cannot be mapped to the reference genomes.

Please correct me if I'm wrong in any of these concepts and thank you for your help!

bluenote-1577 · 2024-11-05T16:52:16Z

Hi @lingrongjin

It isn't as simple as 95% of k-mers contained -> 95% ANI. Your general idea is right, but there is a formula we use; see the paper's first figure.
For sequence_abundance, that's the right idea, but sylph does not classify reads. You're right that there will be reads that cannot be classified. In that case, try the -u option, which will scale sequence_abundance by the number of "unknown" reads. See the comment in issue #6, "[FEATURE REQUESTS] - post here for suggestions/feature requests".
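To make the two points above concrete, here is a minimal Python sketch. It assumes the widely used approximation that k-mer containment is roughly ANI^k, so ANI is roughly containment^(1/k); this is not claimed to be sylph's exact formula (the paper's figure is the reference for that), and k = 31 is an assumption. The `rescale_abundances` helper and its `unknown_fraction` input are hypothetical, shown only to illustrate what scaling abundances by an unknown portion means arithmetically, not how -u is implemented.

```python
def containment_to_ani(containment: float, k: int = 31) -> float:
    """Approximate ANI from k-mer containment, assuming containment ~ ANI ** k.

    Assumption: this is the common approximation, not sylph's exact formula.
    """
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be in [0, 1]")
    if k <= 0:
        raise ValueError("k must be positive")
    return containment ** (1.0 / k)


def rescale_abundances(abundances: dict, unknown_fraction: float) -> dict:
    """Scale abundances that sum to 100 so they leave room for an unknown portion.

    Hypothetical helper: `unknown_fraction` is an illustrative input in [0, 1),
    not a value taken from sylph's output.
    """
    if not 0.0 <= unknown_fraction < 1.0:
        raise ValueError("unknown_fraction must be in [0, 1)")
    known = 1.0 - unknown_fraction
    return {name: value * known for name, value in abundances.items()}


if __name__ == "__main__":
    # 95% of k-mers contained corresponds to an ANI well above 95%.
    print(f"containment 0.95 -> ANI {containment_to_ani(0.95):.4f}")
    # Conversely, 95% ANI corresponds to far fewer than 95% of k-mers contained.
    print(f"ANI 0.95 -> containment {0.95 ** 31:.4f}")
    # Abundances summing to 100 shrink once an unknown portion is accounted for.
    print(rescale_abundances({"genome_a": 60.0, "genome_b": 40.0}, 0.25))
```

Under this approximation with k = 31, a 95% ANI genome shares only about a fifth of its k-mers with the sample, which is why the containment fraction cannot be read directly as ANI.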

Thanks

lingrongjin · 2024-11-06T08:49:51Z

Thanks for the explanation!

lingrongjin closed this as completed Nov 6, 2024
