
Available features

Fabio Cumbo edited this page Jul 7, 2025 · 15 revisions

The MetaSBT framework provides a set of subroutines that can be listed with the --help argument as shown below:

$ metasbt --help
usage:
    metasbt <command> [<args>]

    The metasbt commands are:
    db              List and retrieve public MetaSBT databases;
    index           Index a set of reference genomes and build the first baseline of a MetaSBT database;
    kraken          Export a MetaSBT database into a custom kraken database;
    pack            Build a compressed tarball with a MetaSBT database and report its sha256;
    profile         Profile an input genome and report the closest cluster at all the seven taxonomic levels
                    and the closest genome in a MetaSBT database;
    sketch          Sketch the input genomes;
    summarize       Summarize the content of a MetaSBT database and report some statistics;
    test            Check for dependencies and run unit tests;
                    This must be used by code maintainers only;
    unpack          Unpack a local MetaSBT tarball database;
    update          Update a MetaSBT database with new metagenome-assembled genomes.

positional arguments:
    command         metasbt command

optional arguments:
    -h, --help      show this help message and exit
    -v, --version   Show version and exit.

A description of each available subroutine follows.

1. db: list and retrieve public databases

The db subroutine queries the MetaSBT-DBs repository and lists all the publicly available MetaSBT databases:

$ metasbt db --list

It can also retrieve one of these public databases and store it locally as a compressed tarball:

$ mkdir ~/MetaSBT-DBs
$ metasbt db --download Viruses --folder ~/MetaSBT-DBs

Available options

| Option | Default | Mandatory | Description |
|---|---|---|---|
| --download | | | The database name |
| --folder | Current directory | | Store the selected database under this folder |
| --list | False | | List official public MetaSBT databases |
| --version | The most recent version | | The database version |

2. index: create a baseline with reference genomes

The index subroutine organizes and indexes a set of reference genomes from isolate sequencing based on their taxonomic classification. It uses howdesbt to rapidly index the input genomes and create a Sequence Bloom Tree for each species. Higher taxonomic levels are indexed by building new Sequence Bloom Trees from only the root nodes of the trees at the immediately lower taxonomic level.

The following command will trigger the generation of a MetaSBT database with a provided set of reference genomes:

$ metasbt index --workdir ~/MetaSBT-DBs \
                --database Viruses \
                --references ~/genomes.tsv \
                --dereplicate 0.01 \
                --increase-filter-size 50.0 \
                --completeness 50.0 \
                --contamination 5.0 \
                --nproc 32

Here, we process a set of reference genomes listed in genomes.tsv by first removing duplicates using a dereplication threshold of 0.01 on their ANI distance, and then removing low-quality genomes based on completeness and contamination thresholds of 50% and 5%, respectively.

Please note that genomes.tsv is a two-column tab-separated-values file with the paths to the genome files in FASTA format in the first column and their full taxonomic labels in the second column, as in the example below:

~/genomes/GCA_020560505.1_ASM2056050v1_genomic.fna  k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Lachnospiraceae|g__Mediterraneibacter|s__Ruminococcus_torques
~/genomes/GCA_020554925.1_ASM2055492v1_genomic.fna  k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Lachnospiraceae|g__Mediterraneibacter|s__Ruminococcus_torques
~/genomes/GCA_000450545.2_ASM45054v2_genomic.fna    k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Peptostreptococcaceae|g__Clostridioides|s__Clostridioides_difficile
~/genomes/GCA_000450565.2_ASM45056v2_genomic.fna    k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Peptostreptococcaceae|g__Clostridioides|s__Clostridioides_difficile
~/genomes/GCA_000014845.1_ASM1484v1_genomic.fna     k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli
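
A malformed references file is easier to catch before indexing starts. The following is a minimal sketch of such a check, assuming every label carries exactly the seven `k__` to `s__` levels shown in the example above. `check_references` is a hypothetical helper and is not part of MetaSBT.

```shell
# Hypothetical helper (not a MetaSBT command): sanity-check a references file.
check_references() {
    # $1: path to the two-column, tab-separated references file
    awk -F'\t' '
        BEGIN { split("k p c o f g s", prefixes, " ") }
        # Each row needs exactly two tab-separated columns
        NF != 2 {
            printf "line %d: expected 2 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        {
            # The label must have seven pipe-separated levels
            n = split($2, levels, "[|]")
            if (n != 7) {
                printf "line %d: expected 7 taxonomic levels, found %d\n", NR, n
                bad = 1
                next
            }
            # Each level must start with its one-letter prefix and two underscores
            for (i = 1; i <= 7; i++) {
                if (index(levels[i], prefixes[i] "__") != 1) {
                    printf "line %d: level %d should start with %s__\n", NR, i, prefixes[i]
                    bad = 1
                }
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Usage: check_references ~/genomes.tsv && echo "references file looks fine"
```

The function returns a non-zero status if any row is malformed, so it can gate a later call to metasbt index in a script. It does not check that the listed FASTA files exist.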

Available options

| Option | Default | Mandatory | Description |
|---|---|---|---|
| --completeness | 0.0 | | Percentage threshold on genomes completeness |
| --contamination | 100.0 | | Percentage threshold on genomes contamination |
| --database | | | The database name |
| --dereplicate | 0.0 | | Dereplicate genomes based on their ANI distance according to the specified threshold. The dereplication process is triggered in case of a threshold >0.0 |
| --filter-size | | | This is the size of the bloom filters. It automatically estimates a proper bloom filter size if not provided |
| --increase-filter-size | 50.0 | | Increase the estimated filter size by the specified percentage. It is highly recommended to increase the filter size by a good percentage in case you are planning to update the index with new genomes |
| --kmer-size | | | The kmer size. It automatically estimates a proper kmer size if not provided |
| --limit-kmer-size | 32 | | Limit the estimation of the optimal kmer size with Kitsune to this size at most |
| --min-kmer-occurrences | 2 | | Minimum number of occurrences of kmers to be considered for estimating the bloom filter size and for building the bloom filter files |
| --nproc | All the available CPUs | | Process the input genomes in parallel |
| --pack | False | | Pack the database into a compressed tarball |
| --references | | | Path to the tab-separated-values file with the list of reference genomes. It must contain two columns. The first one with the path to the actual reference genome. The second one with their fully defined taxonomic label |
| --workdir | | | Path to the working directory |
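
The effect of --increase-filter-size is simple arithmetic. The sketch below assumes the increase is applied as a plain percentage on top of the estimated size; that formula and the rounding are assumptions, since MetaSBT computes the value internally. `increased_filter_size` is a hypothetical helper used only for illustration.

```shell
# Hypothetical helper (not a MetaSBT command): illustrate the percentage increase.
increased_filter_size() {
    # $1: estimated bloom filter size, $2: percentage increase
    awk -v size="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", size * (1 + pct / 100) }'
}

# Usage: increased_filter_size 1000000 50.0   # prints 1500000
```

With the default of 50.0, an estimated size of 1,000,000 becomes 1,500,000, which leaves headroom for genomes added later with update.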

3. kraken: build a custom Kraken database

This is used to build a custom Kraken database from a MetaSBT database and the classification of its genomes, enabling the quantitative profiling of known and still unknown species in metagenomic samples:

$ metasbt kraken --workdir ~/MetaSBT-DBs \
                 --database Viruses \
                 --genomes ~/genomes.txt \
                 --ncbi-names ~/names.dmp \
                 --ncbi-nodes ~/nodes.dmp

In this case, the genomes.txt file must contain the paths to the FASTA files of all the genomes in the MetaSBT database, while names.dmp and nodes.dmp contain information about the taxonomic levels and their relationships as defined by the NIH. Both files are part of the tarball available at https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.
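
Only two files from the taxonomy tarball are needed here. As a minimal sketch, assuming taxdump.tar.gz has already been downloaded and stores names.dmp and nodes.dmp at its top level, they can be extracted on their own. `extract_taxdump` is a hypothetical helper, not a MetaSBT command.

```shell
# Hypothetical helper (not a MetaSBT command): extract only the two files
# required by `metasbt kraken` from a local copy of taxdump.tar.gz.
extract_taxdump() {
    # $1: path to the local taxdump.tar.gz, $2: output folder
    mkdir -p "$2" &&
    tar -xzf "$1" -C "$2" names.dmp nodes.dmp
}

# Usage: extract_taxdump ~/taxdump.tar.gz ~/taxonomy
```

The resulting paths can then be passed to --ncbi-names and --ncbi-nodes.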

Available options

| Option | Default | Mandatory | Description |
|---|---|---|---|
| --database | | | The database name |
| --genomes | | | Path to the file with the list of paths to the genomes. Genomes must be in the MetaSBT database in order to be processed |
| --ncbi-names | | | Path to the NCBI names.dmp file |
| --ncbi-nodes | | | Path to the NCBI nodes.dmp file |
| --workdir | | | Path to the working directory |

4. pack: pack a database into a compressed tarball

This command packs a database into a compressed tarball, ready to be stored and shared with collaborators. It also reports the SHA-256 hash of the resulting tar.gz file so that its integrity can be checked later:

$ metasbt pack --workdir ~/MetaSBT-DBs \
               --database Viruses
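
A collaborator who receives the tarball can compare its hash against the one reported by pack. The sketch below uses sha256sum from GNU coreutils, which is assumed to be installed. `verify_tarball` is a hypothetical helper, not a MetaSBT command.

```shell
# Hypothetical helper (not a MetaSBT command): compare the SHA-256 hash of a
# tarball with the hash reported when the database was packed.
verify_tarball() {
    # $1: path to the tarball, $2: expected SHA-256 hash (hex string)
    [ "$(sha256sum "$1" | cut -d' ' -f1)" = "$2" ]
}

# Usage: verify_tarball ~/Viruses.tar.gz "<hash reported by metasbt pack>" && echo "OK"
```

The function succeeds only when the two hashes match exactly, so a truncated or corrupted download is detected before the tarball is unpacked.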

Available options

| Option | Default | Mandatory | Description |
|---|---|---|---|
| --database | | | The database name |
| --workdir | | | Path to the working directory |

5. profile: characterize genomes and metagenome-assembled genomes

The profile subroutine characterizes an input genome according to its closest lineage in the database. It processes one input genome at a time:

$ metasbt profile --workdir ~/MetaSBT-DBs \
                  --database Viruses \
                  --genome ~/genome.fna \
                  --nproc 32

The actual profiles are stored in the profiles folder under the tmp directory of the selected database as tab-separated-values files reporting the closest clusters in the database under all the seven taxonomic levels, alongside their ANI distance.

Available options

| Option | Default | Mandatory | Description |
|---|---|---|---|
| --database | | | The database name |
| --genome | | | Path to the input genome. It is required if --genomes is not specified |
| --genomes | | | Path to the file with the list of paths to the input genomes. It is required if --genome is not specified |
| --workdir | | | Path to the working directory |

6. sketch: sketch genomes into bloom filters

This is used to sketch a set of genomes into bloom filters. You may not need to run this command, since index and update also take care of sketching genomes. The resulting bloom filters are stored in the sketches folder under the database directory:

$ metasbt sketch --workdir ~/MetaSBT-DBs \
                 --database Viruses \
                 --genome ~/genomes/GCA_020560505.1_ASM2056050v1_genomic.fna \
                 --nproc 32

Available options

| Option | Default | Mandatory | Description |
|---|---|---|---|
| --database | | | The database name |
| --genome | | | Path to the input genome. Required if --genomes is not provided |
| --genomes | | | Path to the file with a list of paths to the input genomes. Required if --genome is not provided |
| --workdir | | | Path to the working directory |

7. summarize: summarize the content of a database

The summarize utility extracts a few statistics from a specified database. It reports the total number of clusters at all seven taxonomic levels, specifying how many known and unknown clusters have been defined so far, as well as the total number of reference genomes and metagenome-assembled genomes. It also reports the density of the database root node, which is used to establish whether its bloom filter is saturated and thus no more genomes can be added to the database:

$ metasbt summarize --workdir ~/MetaSBT-DBs \
                    --database Viruses

Available options

| Option | Default | Mandatory | Description |
|---|---|---|---|
| --database | | | The database name |
| --workdir | | | Path to the working directory |

8. test: run unit tests

This is used to test the MetaSBT features.

Note

This is not a functional subroutine and it is intended to be used by code maintainers only.

$ metasbt test --references ~/genomes/references.tsv \
               --mags ~/genomes/mags.txt

The command above runs a series of unit tests on all the MetaSBT features listed on this page, because --feature defaults to all. A set of reference genomes and metagenome-assembled genomes must always be provided.

To run specific tests only, set --feature to the name of a subroutine instead of all (e.g., db, index, kraken, pack, profile, etc.).

Available options

| Option | Default | Mandatory | Description |
|---|---|---|---|
| --feature | all | | The feature name |
| --references | | | Path to the file with the list of paths to the reference genomes and their taxonomies |
| --mags | | | Path to the file with the list of paths to the metagenome-assembled genomes |

9. unpack: install a database

This is used to extract a database from a compressed tarball into a specific location:

$ metasbt unpack --workdir ~/MetaSBT-DBs \
                 --database Viruses \
                 --tarball ~/Viruses-20250115.tar.gz

Available options

| Option | Default | Mandatory | Description |
|---|---|---|---|
| --database | | | Rename the extracted database with whatever name is specified under this argument |
| --tarball | | | The database compressed tarball |
| --workdir | | | Path to the working directory |

10. update: update a database with new genomes

This subroutine can be used to add new metagenome-assembled genomes (MAGs) to a database.

For new MAGs, it first profiles them by comparing the input genomes with those already present in the database. An input genome is assigned to the closest genome cluster in the database if it falls within that cluster's boundaries. An input genome may remain unassigned because it is too far from everything in the database. In this case, all the unassigned genomes are clustered together, leading to the definition of new clusters at all seven taxonomic levels:

$ metasbt update --workdir ~/MetaSBT-DBs \
                 --database Viruses \
                 --genomes ~/genomes.txt \
                 --dereplicate 0.01 \
                 --completeness 50.0 \
                 --contamination 5.0 \
                 --nproc 32
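
The genomes.txt file passed to --genomes can be generated rather than written by hand. The sketch below assumes one path per line and FASTA files ending in .fna, .fa, or .fasta; both are assumptions about the input, not documented requirements. `list_genomes` is a hypothetical helper, not a MetaSBT command.

```shell
# Hypothetical helper (not a MetaSBT command): write the paths of all FASTA
# files found under a folder to a list file, one path per line.
list_genomes() {
    # $1: folder containing the genomes, $2: output list file
    find "$1" -type f \( -name '*.fna' -o -name '*.fa' -o -name '*.fasta' \) | sort > "$2"
}

# Usage: list_genomes "$HOME/mags" "$HOME/genomes.txt"
```

Passing an absolute folder path yields absolute paths in the list, which keeps the file valid regardless of the directory from which metasbt is launched.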

Available options

| Option | Default | Mandatory | Description |
|---|---|---|---|
| --completeness | 0.0 | | Percentage threshold on genomes completeness |
| --contamination | 100.0 | | Percentage threshold on genomes contamination |
| --database | | | The database name |
| --dereplicate | 0.0 | | Dereplicate genomes based on their ANI distance according to the specified threshold. The dereplication process is triggered in case of a threshold >0.0 |
| --genome | | | Path to the input genome. Required if --genomes is not provided |
| --genomes | | | Path to the file with a list of paths to the input genomes. Required if --genome is not provided |
| --nproc | All the available CPUs | | Process the input genomes in parallel |
| --pack | False | | Pack the database into a compressed tarball |
| --pruning-threshold | 0.0 | | Threshold for pruning the Sequence Bloom Tree while profiling input genomes |
| --uncertainty | 20.0 | | Uncertainty percentage for considering multiple best hits while profiling input genomes |
| --workdir | | | Path to the working directory |
