1 Features
Vclust uses a Lempel-Ziv-based pairwise sequence aligner (LZ-ANI) to calculate average nucleotide identity (ANI). LZ-ANI achieves high sensitivity in detecting matched and mismatched nucleotides, ensuring accurate ANI determination. Its efficiency comes from a simplified indel handling model, making LZ-ANI orders of magnitude faster than alignment-based tools (e.g., BLASTn, MegaBLAST) while maintaining accuracy comparable to the most sensitive BLASTn searches.
Vclust offers multiple similarity measures between two genome sequences:
- ANI: The number of identical nucleotides across local alignments divided by the total length of the alignments.
- Global ANI (gANI): The number of identical nucleotides across local alignments divided by the length of the query/reference genome.
- Total ANI (tANI): The number of identical nucleotides in the query-reference and reference-query alignments divided by the combined length of both genomes. tANI is equivalent to VIRIDIC's intergenomic similarity.
- Coverage (alignment fraction): The proportion of the query/reference sequence aligned with the reference/query sequence.
- Number of local alignments: The number of local alignments between the two genome sequences.
- Ratio between genome lengths: The length of the shorter genome divided by the longer one.
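The definitions above can be sketched in a few lines of Python. This is only an illustration of the formulas, not Vclust's or LZ-ANI's code: the `LocalAlignment` record, the `similarity_measures` function, and the output key names are hypothetical, and the sketch assumes each alignment's length is counted in query nucleotides.

```python
from dataclasses import dataclass


@dataclass
class LocalAlignment:
    """One local alignment (hypothetical record, not Vclust's output format)."""
    aligned_len: int  # alignment length in nucleotides (assumed: on the query)
    identical: int    # identical nucleotides within the alignment


def similarity_measures(q_len, r_len, q_to_r, r_to_q):
    """Compute the listed measures for a query genome against a reference.

    q_len, r_len: genome lengths in nucleotides
    q_to_r: local alignments found when aligning query to reference
    r_to_q: local alignments found in the opposite direction
    """
    ident_qr = sum(a.identical for a in q_to_r)
    ident_rq = sum(a.identical for a in r_to_q)
    aligned_qr = sum(a.aligned_len for a in q_to_r)
    return {
        # identical nucleotides / total alignment length
        "ani": ident_qr / aligned_qr if aligned_qr else 0.0,
        # identical nucleotides / query genome length
        "gani": ident_qr / q_len,
        # identical nucleotides in both directions / combined genome length
        "tani": (ident_qr + ident_rq) / (q_len + r_len),
        # aligned fraction of the query
        "coverage": aligned_qr / q_len,
        "num_alns": len(q_to_r),
        # shorter genome / longer genome
        "len_ratio": min(q_len, r_len) / max(q_len, r_len),
    }


if __name__ == "__main__":
    result = similarity_measures(
        q_len=1000,
        r_len=2000,
        q_to_r=[LocalAlignment(400, 380), LocalAlignment(200, 190)],
        r_to_q=[LocalAlignment(600, 560)],
    )
    for name, value in result.items():
        print(f"{name}\t{value:.4f}")
```

In this made-up example the query is well matched where it aligns (ANI 0.95), but only 60% of it aligns, so gANI drops to 0.57. This is why ANI is usually interpreted together with coverage.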
Vclust provides six clustering algorithms tailored to various scenarios, including taxonomic classification and dereplication of viral genomes.
- Single-linkage
- Complete-linkage
- UCLUST
- CD-HIT (Greedy incremental)
- Greedy set cover (adopted from MMseqs2)
- Leiden algorithm [optional]
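To give a feel for how threshold-based clustering works on pairwise similarities, here is a minimal sketch of single-linkage clustering using a union-find structure. It is not Clusty's implementation. The function name `single_linkage`, the `(id_a, id_b, similarity)` tuple layout, and the example threshold are assumptions made for illustration.

```python
def single_linkage(ids, pairs, threshold):
    """Cluster genomes so that any pair with similarity >= threshold is merged.

    ids: iterable of genome identifiers
    pairs: iterable of (id_a, id_b, similarity) tuples
    Returns a sorted list of clusters, each a sorted list of identifiers.
    """
    parent = {g: g for g in ids}

    def find(g):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, sim in pairs:
        if sim >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for g in parent:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    genomes = ["A", "B", "C", "D"]
    similarities = [
        ("A", "B", 0.97),
        ("B", "C", 0.96),
        ("A", "C", 0.80),
        ("C", "D", 0.70),
    ]
    print(single_linkage(genomes, similarities, threshold=0.95))
```

Here A and C share a cluster because each is linked to B, even though their own similarity (0.80) is below the threshold. Complete-linkage would not group them, since it requires every pair within a cluster to meet the threshold.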
Vclust uses three efficient C++ tools (Kmer-db, LZ-ANI, and Clusty) for prefiltering, aligning, calculating ANI, and clustering viral genomes. This combination enables the processing of millions of virus genomes within a few hours on a mid-range workstation.
For datasets containing up to 1000 viral genomes, Vclust is also available online at http://www.vclust.org.
- Features
- Installation
- Quick Start
- Usage
- Optimizing sensitivity and resource usage
- Use cases
  - Classify viruses into species and genera following ICTV standards
  - Assign viral contigs into vOTUs following MIUViG standards
  - Dereplicate viral contigs into representative genomes
  - Calculate pairwise similarities between all-versus-all genomes
  - Deduplicate (remove identical sequences) across multiple datasets
  - Process large dataset of diverse virus genomes (IMG/VR)
  - Process large dataset of highly redundant virus genomes
  - Cluster plasmid genomes into pOTUs
- FAQ: Frequently Asked Questions