Available features

The MetaSBT framework provides a set of subroutines that can be listed with the --help argument as shown below:

$ metasbt --help
usage:
    metasbt <command> [<args>]

    The metasbt commands are:
    db              List and retrieve public MetaSBT databases;
    index           Index a set of reference genomes and build the first baseline of a MetaSBT database;
    kraken          Export a MetaSBT database into a custom kraken database;
    pack            Build a compressed tarball with a MetaSBT database and report its sha256;
    profile         Profile an input genome and report the closest cluster at all the seven taxonomic levels
                    and the closest genome in a MetaSBT database;
    sketch          Sketch the input genomes;
    summarize       Summarize the content of a MetaSBT database and report some statistics;
    test            Check for dependencies and run unit tests;
                    This must be used by code maintainers only;
    unpack          Unpack a local MetaSBT tarball database;
    update          Update a MetaSBT database with new metagenome-assembled genomes.

positional arguments:
    command         metasbt command

optional arguments:
    -h, --help      show this help message and exit
    -v, --version   Show version and exit.

A description of each available subroutine follows.

1. `db`: list and retrieve public databases

The db subroutine queries the MetaSBT-DBs repository and lists all the publicly available MetaSBT databases:

$ metasbt db --list

It can also retrieve one of these public databases and store it locally as a compressed tarball:

$ mkdir ~/MetaSBT-DBs
$ metasbt db --download Viruses --folder ~/MetaSBT-DBs

Available options

Option	Default	Description
`--download`		The database name
`--folder`	Current directory	Store the selected database under this folder
`--list`	`False`	List official public MetaSBT databases
`--version`	The most recent version	The database version

2. `index`: create a baseline with reference genomes

The index subroutine organizes and indexes a set of reference genomes from isolate sequencing based on their taxonomic classification. It uses howdesbt to rapidly index the input genomes and create a Sequence Bloom Tree for each species. Higher taxonomic levels are indexed by building new Sequence Bloom Trees that consider only the root nodes of the trees at the immediately lower taxonomic level.

The following command will trigger the generation of a MetaSBT database with a provided set of reference genomes:

$ metasbt index --workdir ~/MetaSBT-DBs \
                --database Viruses \
                --references ~/genomes.tsv \
                --dereplicate 0.01 \
                --increase-filter-size 50.0 \
                --completeness 50.0 \
                --contamination 5.0 \
                --nproc 32

Here, the reference genomes listed in genomes.tsv are processed by first removing duplicates with a dereplication threshold of 0.01 on their ANI distance, and then removing low-quality genomes that fail the completeness and contamination thresholds of 50% and 5%, respectively.
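The quality filter described above can be pictured as a simple row filter. The following is a minimal sketch, assuming a hypothetical tab-separated quality table (genome path, completeness, contamination, no header) and inclusive thresholds. The helper name and the table layout are illustrative assumptions; how MetaSBT computes and applies these values internally is not described on this page.

```shell
# Illustrative only: filter_by_quality is a hypothetical helper, not part of MetaSBT.
# Input ($1): TSV with columns <genome> <completeness %> <contamination %> (assumed layout).
# Prints the genomes with completeness >= $2 and contamination <= $3.
filter_by_quality() {
    awk -F'\t' -v min_comp="$2" -v max_cont="$3" \
        '($2 + 0) >= (min_comp + 0) && ($3 + 0) <= (max_cont + 0) { print $1 }' "$1"
}
```

For example, `filter_by_quality quality.tsv 50.0 5.0` would list the genomes that pass the thresholds used in the command above.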

Please note that genomes.tsv is a two-column tab-separated-values file with the paths to the genome files in FASTA format in the first column, and their assigned full taxonomic labels in the second column, as in the example below:

~/genomes/GCA_020560505.1_ASM2056050v1_genomic.fna  k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Lachnospiraceae|g__Mediterraneibacter|s__Ruminococcus_torques
~/genomes/GCA_020554925.1_ASM2055492v1_genomic.fna  k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Lachnospiraceae|g__Mediterraneibacter|s__Ruminococcus_torques
~/genomes/GCA_000450545.2_ASM45054v2_genomic.fna    k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Peptostreptococcaceae|g__Clostridioides|s__Clostridioides_difficile
~/genomes/GCA_000450565.2_ASM45056v2_genomic.fna    k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Peptostreptococcaceae|g__Clostridioides|s__Clostridioides_difficile
~/genomes/GCA_000014845.1_ASM1484v1_genomic.fna     k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli
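Since a malformed references file is easy to produce by hand, it may help to check it before indexing. The sketch below assumes the layout shown in the example: two tab-separated columns, and a label with seven levels separated by "|" and prefixed k__, p__, c__, o__, f__, g__, s__. The function name is hypothetical, and MetaSBT may apply different or additional checks of its own.

```shell
# Hypothetical pre-flight check for a references file; not part of MetaSBT.
# Succeeds only if every line has two tab-separated columns and the label
# has seven "|"-separated levels prefixed k__, p__, c__, o__, f__, g__, s__.
check_references_tsv() {
    awk -F'\t' '
        BEGIN { split("k p c o f g s", ranks, " "); bad = 0 }
        NF != 2 {
            printf "line %d: expected 2 columns, found %d\n", NR, NF
            bad = 1; next
        }
        {
            n = split($2, levels, "[|]")
            if (n != 7) {
                printf "line %d: expected 7 taxonomic levels, found %d\n", NR, n
                bad = 1; next
            }
            for (i = 1; i <= 7; i++) {
                # Each level must start with its rank prefix, e.g. "g__"
                if (index(levels[i], ranks[i] "__") != 1) {
                    printf "line %d: level %d should start with %s__\n", NR, i, ranks[i]
                    bad = 1
                }
            }
        }
        END { exit bad }
    ' "$1"
}
```

Running `check_references_tsv ~/genomes.tsv` prints one message per problem and returns a non-zero status if any line fails.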

Available options

Option	Default	Mandatory	Description
`--completeness`	`0.0`		Percentage threshold on genomes completeness
`--contamination`	`100.0`		Percentage threshold on genomes contamination
`--database`		⚑	The database name
`--dereplicate`	`0.0`		Dereplicate genomes based on their ANI distance according to the specified threshold. The dereplication process is triggered in case of a threshold >0.0
`--filter-size`			This is the size of the bloom filters. It automatically estimates a proper bloom filter size if not provided
`--increase-filter-size`	`50.0`		Increase the estimated filter size by the specified percentage. It is highly recommended to increase the filter size by a good percentage in case you are planning to update the index with new genomes
`--kmer-size`			The kmer size. It automatically estimates a proper kmer size if not provided
`--limit-kmer-size`	`32`		Limit the estimation of the optimal kmer size with Kitsune to this size at most
`--min-kmer-occurrences`	`2`		Minimum number of occurrences of kmers to be considered for estimating the bloom filter size and for building the bloom filter files
`--nproc`	All the available CPUs		Process the input genomes in parallel
`--pack`	`False`		Pack the database into a compressed tarball
`--references`		⚑	Path to the tab-separated-values file with the list of reference genomes. It must contain two columns. The first one with the path to the actual reference genome. The second one with their fully defined taxonomic label
`--workdir`		⚑	Path to the working directory
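The effect of --increase-filter-size in the table above is plain percentage arithmetic. A minimal sketch, assuming the increase is applied as size × (1 + percentage / 100) and rounded to the nearest integer; the rounding rule and the helper name are assumptions rather than documented MetaSBT behavior.

```shell
# Illustrative arithmetic only; increase_filter_size is a hypothetical helper.
# $1: estimated bloom filter size, $2: percentage increase.
increase_filter_size() {
    awk -v size="$1" -v pct="$2" 'BEGIN { printf "%.0f\n", size * (1 + pct / 100) }'
}
```

With the default of 50.0, an estimated size of 1000000 would become 1500000 under this assumption.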

3. `kraken`: build a custom Kraken database

This is used to build a custom Kraken database based on a MetaSBT database and its genomes classification, unlocking the quantitative profiling of known and still unknown species in metagenomic samples:

$ metasbt kraken --workdir ~/MetaSBT-DBs \
                 --database Viruses \
                 --genomes ~/genomes.txt \
                 --ncbi-names ~/names.dmp \
                 --ncbi-nodes ~/nodes.dmp

In this case, the genomes.txt file must contain the paths to the FASTA files of all the genomes in the MetaSBT database, while names.dmp and nodes.dmp contain information about the taxonomic levels and their relationships as defined by the NIH. Both files are part of the tarball available at https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.
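The input files for this subroutine can be prepared with standard tools. The sketch below assumes that the genomes are stored as *.fna files under a single folder and that taxdump.tar.gz has already been downloaded and stores names.dmp and nodes.dmp without a directory prefix. Both function names, the folder layout, and the file extension are assumptions made for illustration.

```shell
# Hypothetical helpers; folder names and the .fna extension are assumptions.

# Print the absolute path of every *.fna file under folder $1, one per line.
list_genomes() {
    find "$(cd "$1" && pwd)" -type f -name '*.fna' | sort
}

# Extract only names.dmp and nodes.dmp from a local taxdump tarball ($1) into folder $2.
extract_taxonomy_dumps() {
    tar -xzf "$1" -C "$2" names.dmp nodes.dmp
}
```

For example, `list_genomes ~/genomes > ~/genomes.txt` followed by `extract_taxonomy_dumps ~/taxdump.tar.gz ~` would produce the three files referenced in the command above.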

Available options

Option	Mandatory	Description
`--database`	⚑	The database name
`--genomes`	⚑	Path to the file with the list of paths to the genomes. Genomes must be in the MetaSBT database in order to be processed
`--ncbi-names`	⚑	Path to the NCBI names.dmp file
`--ncbi-nodes`	⚑	Path to the NCBI nodes.dmp file
`--workdir`	⚑	Path to the working directory

4. `pack`: pack a database into a compressed tarball

This command packs a database into a compressed tarball, ready to be stored securely and shared with collaborators. It also reports the SHA-256 hash of the resulting tar.gz file so that its integrity can be checked:

$ metasbt pack --workdir ~/MetaSBT-DBs \
               --database Viruses
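A collaborator who receives the tarball can recompute the hash and compare it with the one reported by pack. A minimal sketch, assuming sha256sum from GNU coreutils is available (on macOS, `shasum -a 256` prints the same format); the function name is hypothetical, and the location where pack writes the tarball is not specified here.

```shell
# Hypothetical helper: compare the SHA-256 of tarball $1 with the expected hash $2.
verify_sha256() {
    actual=$(sha256sum "$1" | awk '{ print $1 }')
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got $actual)" >&2
        return 1
    fi
}
```

It would be called as `verify_sha256 <path-to-tarball> <hash-reported-by-pack>`, with both arguments replaced by the actual values.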

Available options

Option	Default	Mandatory	Description
`--database`		⚑	The database name
`--workdir`		⚑	Path to the working directory

5. `profile`: characterize genomes and metagenome-assembled genomes

The profile subroutine characterizes an input genome according to the closest lineage in the database. It processes only one input genome at a time:

$ metasbt profile --workdir ~/MetaSBT-DBs \
                  --database Viruses \
                  --genome ~/genome.fna \
                  --nproc 32

The actual profiles are stored in the profiles folder under the tmp directory of the selected database as tab-separated-values files reporting the closest clusters in the database under all the seven taxonomic levels, alongside their ANI distance.
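When several genomes need to be profiled one at a time, the single-genome command can be wrapped in a loop. The sketch below uses only the --workdir, --database, and --genome options shown above; the wrapper name and the METASBT override variable are hypothetical and exist only to allow a dry run.

```shell
# Hypothetical wrapper: profile_each WORKDIR DATABASE LIST_FILE
# Runs `metasbt profile` once for each path listed in LIST_FILE (one path per line).
# Set METASBT=echo to print the commands instead of running them.
profile_each() {
    while IFS= read -r genome || [ -n "$genome" ]; do
        [ -n "$genome" ] || continue   # skip empty lines
        # stdin is redirected so the command cannot consume the list being read
        "${METASBT:-metasbt}" profile --workdir "$1" --database "$2" --genome "$genome" </dev/null
    done < "$3"
}
```

For example, `METASBT=echo profile_each ~/MetaSBT-DBs Viruses ~/genomes.txt` prints one profile command per listed genome without executing anything.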

Available options

Option	Mandatory	Description
`--database`	⚑	The database name
`--genome`		Path to the input genome. It is required if `--genomes` is not specified
`--genomes`		Path to the file with the list of paths to the input genomes. It is required if `--genome` is not specified
`--workdir`	⚑	Path to the working directory

6. `sketch`: sketch genomes into bloom filters

This is used to sketch a set of genomes into bloom filters. You may not need to run this command since index and update also take care of sketching genomes. The resulting bloom filters are stored into the sketches folder under the database directory:

$ metasbt sketch --workdir ~/MetaSBT-DBs \
                 --database Viruses \
                 --genome ~/genomes/GCA_020560505.1_ASM2056050v1_genomic.fna \
                 --nproc 32

Available options

Option	Mandatory	Description
`--database`	⚑	The database name
`--genome`		Path to the input genome. Required if `--genomes` is not provided
`--genomes`		Path to the file with a list of paths to the input genomes. Required if `--genome` is not provided
`--workdir`	⚑	Path to the working directory

7. `summarize`: summarize the content of a database

The summarize utility extracts a few numbers from a specified database. In particular, it reports the total number of clusters at all the seven taxonomic levels, specifying how many known and unknown clusters have been defined so far, in addition to the total number of reference genomes and metagenome-assembled genomes. It also reports the density of the database root node, which is used to establish whether its bloom filter is saturated and thus no more genomes can be added to the database:

$ metasbt summarize --workdir ~/MetaSBT-DBs \
                    --database Viruses

Available options

Option	Default	Mandatory	Description
`--database`		⚑	The database name
`--workdir`		⚑	Path to the working directory

8. `test`: run unit tests

This is used to test the MetaSBT features.

Note

This is not a functional subroutine and it is intended to be used by code maintainers only.

$ metasbt test --references ~/genomes/references.tsv \
               --mags ~/genomes/mags.txt

The command above runs a series of unit tests on all the MetaSBT features listed on this page, since --feature defaults to all. A set of reference genomes and metagenome-assembled genomes must always be provided.

Specific tests can also be run by setting --feature to the name of a subroutine (e.g., db, index, kraken, pack, profile, etc.).

Available options

Option	Default	Mandatory	Description
`--feature`	`all`		The feature name
`--references`		⚑	Path to the file with the list of paths to the reference genomes and their taxonomies
`--mags`		⚑	Path to the file with the list of paths to the metagenome-assembled genomes

9. `unpack`: install a database

This is used to extract a database from a compressed tarball into a specific location:

$ metasbt unpack --workdir ~/MetaSBT-DBs \
                 --database Viruses \
                 --tarball ~/Viruses-20250115.tar.gz

Available options

Option	Mandatory	Description
`--database`		Rename the extracted database with whatever name is specified under this argument
`--tarball`	⚑	The database compressed tarball
`--workdir`	⚑	Path to the working directory

10. `update`: update a database with new genomes

This subroutine can be used to add new metagenome-assembled genomes (MAGs) to a database.

For new MAGs, it first profiles them by comparing the input genomes with those already present in the database. An input genome is assigned to the closest genome cluster in the database if it falls within that cluster's boundaries. An input genome may remain unassigned because it is too far from everything in the database. In this case, all the unassigned genomes are clustered together, leading to the definition of new clusters at all the seven taxonomic levels:

$ metasbt update --workdir ~/MetaSBT-DBs \
                 --database Viruses \
                 --genomes ~/genomes.txt \
                 --dereplicate 0.01 \
                 --completeness 50.0 \
                 --contamination 5.0 \
                 --nproc 32
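Because genomes.txt is a plain list of paths, a quick check that every listed file exists can save a failed run. A minimal sketch with a hypothetical helper name; note that the shell does not expand a leading "~" in paths read from a file, so this check assumes absolute or relative paths.

```shell
# Hypothetical pre-flight check: report paths listed in file $1 that do not exist.
# Returns non-zero if at least one listed genome is missing.
missing_genomes() {
    status=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -n "$path" ] || continue   # skip empty lines
        if [ ! -f "$path" ]; then
            echo "missing: $path"
            status=1
        fi
    done < "$1"
    return "$status"
}
```

It could be combined with the command above as `missing_genomes genomes.txt && metasbt update ...`, so that the update only starts when all the inputs are in place.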

Available options

Option	Default	Mandatory	Description
`--completeness`	`0.0`		Percentage threshold on genomes completeness
`--contamination`	`100.0`		Percentage threshold on genomes contamination
`--database`		⚑	The database name
`--dereplicate`	`0.0`		Dereplicate genomes based on their ANI distance according to the specified threshold. The dereplication process is triggered in case of a threshold >0.0
`--genome`			Path to the input genome. Required if `--genomes` is not provided
`--genomes`			Path to the file with a list of paths to the input genomes. Required if `--genome` is not provided
`--nproc`	All the available CPUs		Process the input genomes in parallel
`--pack`	`False`		Pack the database into a compressed tarball
`--pruning-threshold`	`0.0`		Threshold for pruning the Sequence Bloom Tree while profiling input genomes
`--uncertainty`	`20.0`		Uncertainty percentage for considering multiple best hits while profiling input genomes
`--workdir`		⚑	Path to the working directory

