Description
When we ran VAMB5 on a large data set of 1500 samples, we found that the VAMB clusters are at the strain level: most species are split into a dozen or so clusters.
When we later profile the clusters using MAGinator, we find that the marker-gene mappings are unspecific and can't distinguish between the clusters, so cluster-level abundances end up being highly correlated within the same species. The clusters are also so narrow that they become quite sample-specific, so we end up having to agglomerate our data to the species level anyway in order to have enough statistical power to test associations with metadata.
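To make the agglomeration step concrete, here is a minimal sketch of what we mean, on toy numbers. It is not VAMB or MAGinator code: the cluster names, the species assignments, and the helper functions `agglomerate` and `pearson` are all hypothetical and only illustrate summing cluster-level abundances per species and checking how correlated two clusters of the same species are.

```python
from math import sqrt

def agglomerate(cluster_abundance, cluster_to_species):
    """Sum per-sample cluster abundances into per-sample species abundances."""
    species_abundance = {}
    for cluster, per_sample in cluster_abundance.items():
        species = cluster_to_species[cluster]
        totals = species_abundance.setdefault(species, [0.0] * len(per_sample))
        for i, value in enumerate(per_sample):
            totals[i] += value
    return species_abundance

def pearson(x, y):
    """Pearson correlation between two equally long abundance vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical toy data: abundance of each cluster across four samples.
cluster_abundance = {
    "cluster_1": [1.0, 2.0, 3.0, 4.0],
    "cluster_2": [2.0, 4.0, 6.0, 8.0],
    "cluster_3": [5.0, 1.0, 4.0, 2.0],
}
# Hypothetical species assignment for each cluster.
cluster_to_species = {
    "cluster_1": "species_A",
    "cluster_2": "species_A",
    "cluster_3": "species_B",
}

species_abundance = agglomerate(cluster_abundance, cluster_to_species)
# Correlation between two clusters assigned to the same species.
r = pearson(cluster_abundance["cluster_1"], cluster_abundance["cluster_2"])
print(species_abundance)
print(round(r, 3))
```

In our real data the within-species correlations are not perfect like in this toy case, but they are high enough that the separate clusters add little beyond the summed species-level signal.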
I get the impression that you spent a lot of effort tuning the clustering cutoff, and I also like the idea of the operational taxonomic unit being so data-driven. But we have collaborators who merge their VAMB bins into species-level clusters using something as crude as dRep, and they're getting much more meaningful statistics against metadata than we are.
Could it be 1) a bug, or 2) that larger data sets somehow push VAMB clustering towards being too fine-grained? And 3) would it make sense to have a customizable clustering cutoff?
We are happy to benchmark.