
Question: Custom clustering cutoff? #406

@shiraz-shah

When we ran VAMB5 on a large data set of 1500 samples, we found that the VAMB clusters are at the strain level. Most species are split into roughly a dozen clusters.

When we later profile the clusters using MAGinator, we find that the marker-gene mappings are nonspecific and can't distinguish between the clusters, so cluster-level abundances end up being highly correlated within the same species. The clusters are also so narrow that they become quite sample-specific, so we end up having to agglomerate our data to the species level anyway in order to have enough statistical power to test associations with metadata.
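
To make the agglomeration step concrete, here is a minimal sketch of what we mean by collapsing cluster-level abundances to the species level. It is not VAMB or MAGinator code; the function names, the toy abundance table, and the `cluster_to_species` mapping are all hypothetical and assume a species label per cluster is already available from some taxonomic annotation.

```python
from collections import defaultdict


def agglomerate_to_species(cluster_abundance, cluster_to_species):
    """Sum per-sample abundances of all clusters assigned to the same species.

    cluster_abundance:  {cluster_id: {sample_id: abundance}}
    cluster_to_species: {cluster_id: species_label}  (hypothetical mapping)
    """
    species_abundance = defaultdict(lambda: defaultdict(float))
    for cluster, per_sample in cluster_abundance.items():
        # Clusters without a species label are kept as their own unit.
        species = cluster_to_species.get(cluster, cluster)
        for sample, value in per_sample.items():
            species_abundance[species][sample] += value
    return {sp: dict(samples) for sp, samples in species_abundance.items()}


def clusters_per_species(cluster_to_species):
    """Count how many clusters each species has been split into."""
    counts = defaultdict(int)
    for species in cluster_to_species.values():
        counts[species] += 1
    return dict(counts)


# Toy data (made up for illustration): two clusters of one species, one of another.
cluster_abundance = {
    "c1": {"s1": 2.0, "s2": 0.0},
    "c2": {"s1": 1.0, "s2": 3.0},
    "c3": {"s1": 0.5, "s2": 0.5},
}
cluster_to_species = {"c1": "sp_A", "c2": "sp_A", "c3": "sp_B"}

species_abundance = agglomerate_to_species(cluster_abundance, cluster_to_species)
split_counts = clusters_per_species(cluster_to_species)
```

The per-species cluster count from `clusters_per_species` is also the kind of number we could report across data set sizes if a benchmark would be useful.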

I get the impression that you spent a lot of effort tuning the clustering cutoff, and I also like the idea of the operational taxonomic unit being so data-driven. But we have collaborators who merge their VAMB bins into species-level clusters using something as crude as dRep, and they're getting much more meaningful statistics against metadata than we are.

Could this be 1) a bug, or 2) that larger data sets somehow push VAMB clustering towards being too fine-grained? And 3) would it make sense to have a customizable clustering cutoff?

We are happy to benchmark.

Labels: investigation (Discussion about metagenomics or binning)
