Available features
The MetaSBT framework provides a set of subroutines that can be listed with the --help argument as shown below:
$ metasbt --help
usage:
metasbt <command> [<args>]
The metasbt commands are:
db List and retrieve public MetaSBT databases;
index Index a set of reference genomes and build the first baseline of a MetaSBT database;
kraken Export a MetaSBT database into a custom kraken database;
pack Build a compressed tarball with a MetaSBT database and report its sha256;
profile Profile an input genome and report the closest cluster at all the seven taxonomic levels
and the closest genome in a MetaSBT database;
sketch Sketch the input genomes;
summarize Summarize the content of a MetaSBT database and report some statistics;
test Check for dependencies and run unit tests;
This must be used by code maintainers only;
unpack Unpack a local MetaSBT tarball database;
update Update a MetaSBT database with new metagenome-assembled genomes.
positional arguments:
command metasbt command
optional arguments:
-h, --help show this help message and exit
-v, --version Show version and exit.
Each of the available subroutines is described below.
The db subroutine queries the MetaSBT-DBs repository and lists all the publicly available MetaSBT databases:
$ metasbt db --list
It can also retrieve one of these public databases and store it locally as a compressed tarball:
$ mkdir ~/MetaSBT-DBs
$ metasbt db --download Viruses --folder ~/MetaSBT-DBs
| Option | Default | Mandatory | Description |
|---|---|---|---|
| `--download` | | | The database name |
| `--folder` | Current directory | | Store the selected database under this folder |
| `--list` | `False` | | List official public MetaSBT databases |
| `--version` | The most recent version | | The database version |
The index subroutine organizes and indexes a set of reference genomes from isolate sequencing based on their taxonomic classification. It uses howdesbt to rapidly index the input genomes and create a Sequence Bloom Tree for each species. Higher taxonomic levels are indexed by building new Sequence Bloom Trees from only the root nodes of the trees at the immediately lower taxonomic level.
The following command will trigger the generation of a MetaSBT database with a provided set of reference genomes:
$ metasbt index --workdir ~/MetaSBT-DBs \
--database Viruses \
--references ~/genomes.tsv \
--dereplicate 0.01 \
--increase-filter-size 50.0 \
--completeness 50.0 \
--contamination 5.0 \
--nproc 32
Here, we process a set of reference genomes listed in genomes.tsv by first removing duplicates with a dereplication threshold of 0.01 on their ANI distance, and then removing low-quality genomes based on completeness and contamination thresholds of 50% and 5%, respectively.
Please note that genomes.tsv is a two-column tab-separated-values file with the paths to the genome files in FASTA format in the first column and their full taxonomic labels in the second column, as in the example below:
~/genomes/GCA_020560505.1_ASM2056050v1_genomic.fna k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Lachnospiraceae|g__Mediterraneibacter|s__Ruminococcus_torques
~/genomes/GCA_020554925.1_ASM2055492v1_genomic.fna k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Lachnospiraceae|g__Mediterraneibacter|s__Ruminococcus_torques
~/genomes/GCA_000450545.2_ASM45054v2_genomic.fna k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Peptostreptococcaceae|g__Clostridioides|s__Clostridioides_difficile
~/genomes/GCA_000450565.2_ASM45056v2_genomic.fna k__Bacteria|p__Bacillota|c__Clostridia|o__Eubacteriales|f__Peptostreptococcaceae|g__Clostridioides|s__Clostridioides_difficile
~/genomes/GCA_000014845.1_ASM1484v1_genomic.fna k__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli
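Since a malformed references file is an easy mistake to make, it may help to check its layout before indexing. The following is a minimal sketch of such a check. The `check_references_tsv` function is hypothetical and not part of MetaSBT. It assumes that every row has exactly two tab-separated columns and that the seven taxonomic levels use the `k__`, `p__`, `c__`, `o__`, `f__`, `g__`, `s__` prefixes separated by `|`, as in the example above. It does not check that the genome files exist.

```shell
# Hypothetical helper, not part of MetaSBT: checks the layout of a references file.
# Assumption: levels are prefixed k__, p__, c__, o__, f__, g__, s__ and joined by "|".
check_references_tsv() {
    # $1: path to the two-column tab-separated references file
    awk -F '\t' '
        BEGIN { n = split("k__ p__ c__ o__ f__ g__ s__", prefix, " ") }
        NF != 2 {
            printf "line %d: expected 2 tab-separated columns, found %d\n", NR, NF
            bad = 1
            next
        }
        {
            # "[|]" is a bracket expression, so the pipe is matched literally
            m = split($2, level, "[|]")
            if (m != n) {
                printf "line %d: expected %d taxonomic levels, found %d\n", NR, n, m
                bad = 1
                next
            }
            for (i = 1; i <= n; i++) {
                if (index(level[i], prefix[i]) != 1) {
                    printf "line %d: level %d should start with %s\n", NR, i, prefix[i]
                    bad = 1
                }
            }
        }
        END { exit bad }
    ' "$1"
}
```

Calling `check_references_tsv ~/genomes.tsv` prints one message per problem and returns a non-zero status if any row is malformed, so it can be chained before `metasbt index` with `&&`.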
| Option | Default | Mandatory | Description |
|---|---|---|---|
| `--completeness` | `0.0` | | Percentage threshold on genomes completeness |
| `--contamination` | `100.0` | | Percentage threshold on genomes contamination |
| `--database` | | ⚑ | The database name |
| `--dereplicate` | `0.0` | | Dereplicate genomes based on their ANI distance according to the specified threshold. The dereplication process is triggered in case of a threshold >0.0 |
| `--filter-size` | | | This is the size of the bloom filters. It automatically estimates a proper bloom filter size if not provided |
| `--increase-filter-size` | `50.0` | | Increase the estimated filter size by the specified percentage. It is highly recommended to increase the filter size by a good percentage in case you are planning to update the index with new genomes |
| `--kmer-size` | | | The kmer size. It automatically estimates a proper kmer size if not provided |
| `--limit-kmer-size` | `32` | | Limit the estimation of the optimal kmer size with Kitsune to this size at most |
| `--min-kmer-occurrences` | `2` | | Minimum number of occurrences of kmers to be considered for estimating the bloom filter size and for building the bloom filter files |
| `--nproc` | All the available CPUs | | Process the input genomes in parallel |
| `--pack` | `False` | | Pack the database into a compressed tarball |
| `--references` | | ⚑ | Path to the tab-separated-values file with the list of reference genomes. It must contain two columns. The first one with the path to the actual reference genome. The second one with their fully defined taxonomic label |
| `--workdir` | | ⚑ | Path to the working directory |
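As a worked example of the `--increase-filter-size` arithmetic, the sketch below applies a percentage increase to an estimated filter size. The `increased_filter_size` function and the input value of 1000000 are hypothetical and only illustrate the calculation. How MetaSBT rounds the result internally is not covered here and may differ.

```shell
# Hypothetical illustration of the arithmetic only; MetaSBT's own rounding may differ.
increased_filter_size() {
    # $1: estimated filter size, $2: percentage increase (e.g. 50.0)
    awk -v size="$1" -v pct="$2" 'BEGIN { printf "%d\n", size + size * pct / 100 }'
}

increased_filter_size 1000000 50.0   # prints 1500000
```

With the default of 50.0, the filter ends up half as large again as the estimate, which leaves headroom for genomes added later with update.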
The kraken subroutine builds a custom Kraken database from a MetaSBT database and its genome classification, unlocking the quantitative profiling of known and still unknown species in metagenomic samples:
$ metasbt kraken --workdir ~/MetaSBT-DBs \
--database Viruses \
--genomes ~/genomes.txt \
--ncbi-names ~/names.dmp \
--ncbi-nodes ~/nodes.dmp
In this case, the genomes.txt file must contain the paths to the FASTA files of all the genomes in the MetaSBT database, while names.dmp and nodes.dmp contain information about the taxonomic levels and their relationships as defined by the NCBI. Both files are part of the tarball available at https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.
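A file like genomes.txt can be generated rather than written by hand. The following is a minimal sketch. The `list_genomes` function is hypothetical and not part of MetaSBT. It assumes that all the genomes of the database sit in one folder tree as FASTA files with a `.fna` extension, and it writes absolute paths as a precaution, since this page does not say whether relative paths are accepted.

```shell
# Hypothetical helper, not part of MetaSBT: writes one absolute genome path per line.
# Assumption: genomes are FASTA files with a .fna extension, as in the examples on this page.
list_genomes() {
    # $1: folder with the genome files, $2: output list (e.g. genomes.txt)
    find "$(cd "$1" && pwd)" -type f -name '*.fna' | sort > "$2"
}
```

For instance, `list_genomes ~/genomes ~/genomes.txt` would produce a sorted list that can then be passed to `--genomes`.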
| Option | Default | Mandatory | Description |
|---|---|---|---|
| `--database` | | ⚑ | The database name |
| `--genomes` | | ⚑ | Path to the file with the list of paths to the genomes. Genomes must be in the MetaSBT database in order to be processed |
| `--ncbi-names` | | ⚑ | Path to the NCBI names.dmp file |
| `--ncbi-nodes` | | ⚑ | Path to the NCBI nodes.dmp file |
| `--workdir` | | ⚑ | Path to the working directory |
The pack subroutine packs a database into a compressed tarball, ready to be stored securely and shared with collaborators. It also reports the SHA-256 hash of the resulting tar.gz file for integrity checks:
$ metasbt pack --workdir ~/MetaSBT-DBs \
--database Viruses
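A collaborator who receives the tarball can compare its hash with the one reported at packing time. The following is a minimal sketch of that check. The `verify_sha256` function is hypothetical and not part of MetaSBT. It assumes that `sha256sum` from GNU coreutils is available (on macOS, `shasum -a 256` is the usual equivalent) and that the expected hash is supplied as a lowercase hexadecimal string.

```shell
# Hypothetical helper, not part of MetaSBT: compares a file's SHA-256 with an expected value.
# Assumption: sha256sum (GNU coreutils) is installed and the expected hash is lowercase hex.
verify_sha256() {
    # $1: path to the tarball, $2: expected SHA-256 hash
    [ "$(sha256sum "$1" | awk '{ print $1 }')" = "$2" ]
}
```

The function returns a zero status only when the two hashes match, so it can guard a later step, for example by running it before `metasbt unpack` and chaining the two with `&&`.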
| Option | Default | Mandatory | Description |
|---|---|---|---|
| `--database` | | ⚑ | The database name |
| `--workdir` | | ⚑ | Path to the working directory |
The profile subroutine characterizes an input genome according to the closest lineage in the database. It processes only one input genome at a time:
$ metasbt profile --workdir ~/MetaSBT-DBs \
--database Viruses \
--genome ~/genome.fna \
--nproc 32
The actual profiles are stored in the profiles folder under the tmp directory of the selected database as tab-separated-values files reporting the closest clusters in the database at all seven taxonomic levels, alongside their ANI distance.
| Option | Default | Mandatory | Description |
|---|---|---|---|
| `--database` | | ⚑ | The database name |
| `--genome` | | | Path to the input genome. It is required if `--genomes` is not specified |
| `--genomes` | | | Path to the file with the list of paths to the input genomes. It is required if `--genome` is not specified |
| `--workdir` | | ⚑ | Path to the working directory |
The sketch subroutine sketches a set of genomes into bloom filters. You may not need to run this command, since index and update also take care of sketching genomes. The resulting bloom filters are stored in the sketches folder under the database directory:
$ metasbt sketch --workdir ~/MetaSBT-DBs \
--database Viruses \
--genome ~/genomes/GCA_020560505.1_ASM2056050v1_genomic.fna \
--nproc 32
| Option | Default | Mandatory | Description |
|---|---|---|---|
| `--database` | | ⚑ | The database name |
| `--genome` | | | Path to the input genome. Required if `--genomes` is not provided |
| `--genomes` | | | Path to the file with a list of paths to the input genomes. Required if `--genome` is not provided |
| `--workdir` | | ⚑ | Path to the working directory |
The summarize utility extracts a few statistics from a specified database. In particular, it reports the total number of clusters at all seven taxonomic levels, specifying how many known and unknown clusters have been defined so far, together with the total number of reference genomes and metagenome-assembled genomes. It also reports the density of the database root node, which is used to establish whether its bloom filter is saturated, in which case no more genomes can be added to the database:
$ metasbt summarize --workdir ~/MetaSBT-DBs \
--database Viruses
| Option | Default | Mandatory | Description |
|---|---|---|---|
| `--database` | | ⚑ | The database name |
| `--workdir` | | ⚑ | Path to the working directory |
The test subroutine tests the MetaSBT features.
Note
This is not a functional subroutine and it is intended to be used by code maintainers only.
$ metasbt test --references ~/genomes/references.tsv \
--mags ~/genomes/mags.txt
The command reported above runs a series of unit tests on all the MetaSBT features listed in this page (the --feature option defaults to all). A set of reference genomes and metagenome-assembled genomes must always be provided.
Specific tests can also be run by setting --feature to the name of a subroutine (e.g., db, index, kraken, pack, profile, etc.) instead of all.
| Option | Default | Mandatory | Description |
|---|---|---|---|
| `--feature` | `all` | | The feature name |
| `--references` | | ⚑ | Path to the file with the list of paths to the reference genomes and their taxonomies |
| `--mags` | | ⚑ | Path to the file with the list of paths to the metagenome-assembled genomes |
The unpack subroutine extracts a database from a compressed tarball into a specific location:
$ metasbt unpack --workdir ~/MetaSBT-DBs \
--database Viruses \
--tarball ~/Viruses-20250115.tar.gz
| Option | Default | Mandatory | Description |
|---|---|---|---|
| `--database` | | | Rename the extracted database with whatever name is specified under this argument |
| `--tarball` | | ⚑ | The database compressed tarball |
| `--workdir` | | ⚑ | Path to the working directory |
The update subroutine adds new metagenome-assembled genomes (MAGs) to a database.
It first profiles the new MAGs by comparing them with the genomes already present in the database. An input genome is assigned to the closest genome cluster in the database if it falls within that cluster's boundaries. An input genome may remain unassigned because it is too far from everything in the database. In this case, all the unassigned genomes are clustered together, leading to the definition of new clusters at all seven taxonomic levels:
$ metasbt update --workdir ~/MetaSBT-DBs \
--database Viruses \
--genomes ~/genomes.txt \
--dereplicate 0.01 \
--completeness 50.0 \
--contamination 5.0 \
--nproc 32
| Option | Default | Mandatory | Description |
|---|---|---|---|
| `--completeness` | `0.0` | | Percentage threshold on genomes completeness |
| `--contamination` | `100.0` | | Percentage threshold on genomes contamination |
| `--database` | | ⚑ | The database name |
| `--dereplicate` | `0.0` | | Dereplicate genomes based on their ANI distance according to the specified threshold. The dereplication process is triggered in case of a threshold >0.0 |
| `--genome` | | | Path to the input genome. Required if `--genomes` is not provided |
| `--genomes` | | | Path to the file with a list of paths to the input genomes. Required if `--genome` is not provided |
| `--nproc` | All the available CPUs | | Process the input genomes in parallel |
| `--pack` | `False` | | Pack the database into a compressed tarball |
| `--pruning-threshold` | `0.0` | | Threshold for pruning the Sequence Bloom Tree while profiling input genomes |
| `--uncertainty` | `20.0` | | Uncertainty percentage for considering multiple best hits while profiling input genomes |
| `--workdir` | | ⚑ | Path to the working directory |