KRANK consists of two main subprograms: build
and query
.
The subprogram krank build [OPTIONS]
reads input reference sequences and builds a KRANK library.
Once you have a library (or multiple libraries), krank query [OPTIONS]
performs queries against it (or against multiple libraries at the same time).
If you would rather use available libraries, you can download relevant ones and skip building your custom library.
Running each subprogram with --help
will display available subprograms together with brief descriptions regarding the program and its arguments (see).
KRANK currently does not have pre-built binaries. You need to compile with make. Requirements are as follows:
- libcurl library (this is only needed for downloading reference sequences using URLs, otherwise not needed),
- OpenMP for parallelism,
- C++11 support,
- zlib library.
That is all you need. Simply compile by running make
.
git clone https://github.com/bo1929/KRANK.git
make -C KRANK
KRANK/krank build --help
KRANK/krank query --help
KRANK can query against a single or multiple libraries.
Each KRANK library is essentially a directory containing a searchable data structure in binary format.
See a list of available libraries to download one or multiple based on your needs and quickly get started with your queries.
In the next section, we discuss the library building procedure for custom libraries, this might be useful if you have some specific new genomes that you want to be considered during the search.
Below, a basic command to query against one library (stored in the directory $LIBRARY_DIRECTORY
) is given.
krank query \
-l $LIBRARY_DIRECTORY -o $OUTPUT_DIRECTORY -q $QUERY_PATHS \
--num-threads $NUM_THREADS
You can substitute $NUM_THREADS
simply with the number of cores available.
The value of $OUTPUT_DIRECTORY
for option -o
(or --output-dir
) should be the path to the directory where you wish the results to be saved.
The option -q
(or --query-file
) can be one of the following: i) a path to the FASTA/FASTQ file containing the query sequences/reads, ii) a path to a tab-separated file containing a two-column mapping from a query ID to the corresponding query filepath to the FASTA/FASTQ file.
query_1 ./query1.fna
query_2 ./query2.fq
As mentioned earlier, you can use multiple libraries to expand your reference set by specifying corresponding directory paths after -l
(or --library-dir
, e.g., -l $LIBRARY_DIRECTORY1 $LIBRARY_DIRECTORY2
).
Querying against a file generates two output files by default: a taxonomic label assignment for each sequence (i.e., short/long read or contig) and a summary of the taxonomic composition of the entire query file, i.e., relative abundances of taxa at different ranks.
The naming convention for output reports consists of a prefix + query ID -- or prefix + filename without the extension if a single query file is given: classification_info-*
and abundance_profile-*
. The first couple of lines of the classification report should look like the below example.
SEQ_ID RANK TAXON_ID TAXON_NAME PREDICTION_SCORE MATCH_SCORE
AIJN01000024.1-771547 species 201 Campylobacter lari 0.744 0.101
AIJN01000024.1-369645 species 201 Campylobacter lari 0.978 11.4
SEQ_ID
simply corresponds to the sequence identifier in FASTA/FASTQ.
If PREDICTION_SCORE
is not close to 1, it means that KRANK also found some k-mers related to other taxa.
If it is close to 1, the prediction is not ambiguous, i.e., there is no other potential assignment.
However, this does not necessarily imply that the assignment is confident.
MATCH_SCORE
quantifies the confidence, and the highest possible value is the number of k-mers in the sequence (e.g., 122 for short reads in the default parameter configuration).
KRANK internally filters assignments with MATCH_SCORE
smaller than 0.05.
You can disable this by setting --tvote-threshold 0
.
Note that, if specified, a total vote threshold should be given for each library.
For instance, if you had --library-dir $LIBRARY_DIRECTORY1 $LIBRARY_DIRECTORY2
, there need to be two thresholds: --tvote-threshold $TVOTE_THRESHOLD1 $TVOTE_THRESHOLD2
.
For taxa appearing only in $LIBRARY_DIRECTORY1
or $LIBRARY_DIRECTORY2
, the corresponding total vote threshold will be used.
All values greater than 0.05 should be considered sufficiently high.
Below 0.05 is still worth considering especially for novel/distant queries that are of interest, or for working with vertebrates.
For relative abundances, an example for the output report would look as below.
RANK TAXON_ID TAXON_NAME READ_COUNT READ_ABUNDANCE CELL_ABUNDANCE
class 117743 Flavobacteriia 3 0.012 0.0256103
class 1236 Gammaproteobacteria 6 0.024 0.0495194
class 183924 Thermoprotei 2 0.008 0.00916362
class 186801 Clostridia 4 0.016 0.0277098
READ_COUNT
columns are simply the counts of query sequences/reads that are assigned to a particular taxon.
READ_ABUNDANCE
is the READ_COUNT
value divided by the total number of sequences given in the query/sample.
CELL_ABUNDANCE
is the corrected abundance value that incorporates genome sizes on top of READ_ABUNDANCE
, and this is the value that you should probably consider.
Note that the neither READ_ABUNDANCE
nor CELL_ABUNDANCE
columns do not sum up to 1 for a fixed taxonomic rank (e.g., phylum).
This is because some sequences can be left unassigned to any taxon.
You can further normalize these values for each taxonomic level.
If you would like to benchmark KRANK against some other tools, you can use OPAL, it has an option (-n
) which automatically does this normalization and enables comparison between normalized and un-normalized profiles.
More and more diverse libraries will be added soon.
- RefSeq bacterial and archael genomes:
- WoL-v1 dataset - archaeal and bacterial genomes:
- Human reference libraries:
For the time being only libraries for archael genomes and bacterial genomes are avaible (and one human pangenome -- potentially for contamination removal). We recommend using them after contination removal for the best performance. If you have less than 8GB, just use WoL-v1 lightweight, if you have up to 16GB you add this library (query against these two simultanously --library-dir $LIBRARY_DIRECTORY1 $LIBRARY_DIRECTORY2
). If you have up to 32GB, just use RefSeq high-sensitivity, if you have up to 64GB memory add this library.
To build a reference library, KRANK requires two input files: a taxonomy and a mapping from taxon IDs to filepaths (or URLs for FTP) of input reference sequences.
The path to the directory in which the krank library will be created (or updated) should be specified with --library-dir
or -l
. Note that you have to create the empty directory first.
The input reference sequences could be any type: assembled sequences (genomes, contigs, scaffolds, etc.) or sets of k-mers1 (but not a mixture).
562 /path/to/file/escherichia_coli-0001.fq
562 /path/to/file/escherichia_coli-0002.fq.gz
54736 /path/to/file/salmonell_abongori-0001.fa
2287 https://ftp.ncbi.nlm.nih.gov/path/to/genome.fna.gz
Note that, you can specify more than one file per taxon ID, but not vice versa.
Full paths are always preferred., filenames are not relevant and can be anything.
This file must be given with the option --input-file /path/to/file/mapping
argument2.
The taxonomy consists of two files, namely nodes.dmp
and names.dmp
, and the path to the directory containing both must be given to the option -t
(or --taxonomy-dir
).
The format of these files is given here.
The latest version of the NCBI taxonomy can be found here.
The keys (the first column) in the --input-file
must appear in the taxonomy, i.e., the first field in the nodes.dmp
and names.dmp
.
Building a library is a relatively expensive but one-time operation.
It consists of two steps: library initialization and batch building.
The subprogram krank build
can either initialize a library or construct all/some batches of the library.
To initialize a library from reference sequences using default parameters for a library of size 32GB, run the below command.
krank build \
-l $LIBRARY_DIRECTORY -t $TAXONOMY_DIRECTORY -i $MAPPING_FILE --max-memory 32000 \
--batch-size 8 --from-scratch --num-threads $NUM_THREADS
You can benefit from parallel processing to a great extent by setting --num-threads
to the number of available cores you have.
Note that the overhead is very little, the speed-up will increase with the number of cores you use.
The option --from-scratch
specifies that this command is intended to initialize a non-existing library.
Hence, KRANK will not be looking for an already initialized library at $LIBRARY_DIRECTORY
.
When --max-memory
is given, KRANK selects a good configuration of -k
, -h
and -b
to satisfy this memory constraint.
In this example, the maximum memory is 32GB and KRANK will set -k 30
, -h 14
and -b 19
.
The option --batch-size
sets the number of batches that the table will be split in log scale.
For --batch-size 8
, it is
After initializing the library, you will need to process each batch independently.
This can be achieved by running separate krank build
commands by setting the --target-batch
.
For instance, to construct the library for the first batch out of 128, run the below command.
krank build \
-l $LIBRARY_DIRECTORY -t $TAXONOMY_DIRECTORY -i $MAPPING_FILE \
--from-library --batch-size 8 --target-batch 1 \
--num-threads $NUM_THREADS
Notice that, when an initialized library is targeted, the --from-library
flag is given.
Otherwise (initialization with --from-scratch
, default), input reference k-mers would be processed again, and hash keys would be overwritten, possibly computed with a new random seed.
Hence, the --from-library
option must be given to build an already-initialized library.
Alternatively, you can just build all batches with a single command by setting the --target-batch
to 0.
krank build \
-l $LIBRARY_DIRECTORY -t $TAXONOMY_DIRECTORY -i $MAPPING_FILE \
--from-library --batch-size 8 --target-batch 0 \
--num-threads $NUM_THREADS
However, the running time would be 128 times more compared to the previous command used for a single batch.
Suppose the memory on a machine (a single cluster node or a personal computer) allows to run 4 batches in parallel.
Then, we could process batches in parallel by using xargs
as below.
printf '%d\n' {1..128} | xargs -I{} -P4 krank build \
-l $LIBRARY_DIRECTORY -t $TAXONOMY_DIRECTORY -i $MAPPING_FILE \
--from-library --batch-size 8 --target-batch {} \
--num-threads $NUM_THREADS
The most ideal scenario would be having access to a cluster, and submitting independent jobs for each batch, letting the scheduler do the job.
But first, you need to initialize a library, without giving the --target-batch
option.
Note that, KRANK first reads input and stores k-mer encodings and hash keys in the library as separate files and loads from those files in batches during the table building.
Thus, when the final library is built, you can delete them (i.e., files with the prefix lsh_enc_vec-*
) by running find $LIBRARY_DIRECTORY -type f -name "*lsh_enc_vec*" -delete
.
Two parameters define the shape of the final hash table: -h
and -b
(--num-positions
and --num-columns
, respectively).
KRANK uses -b 16
and set -h
to
The other two flags are related to k-mer selection and gradual k-mer filtering.
Setting --kmer-ranking 0
(default is 1) changes k-mer selection to random; and KRANK does not use its heuristic to filter k-mers.
The other flag is --free-size
, which disables our sampling strategy (option --adaptive-size
is on by default).
If you want to use multiple libraries to query against, using --free-size
for one of them might improve the performance.
To emulate CONSULT-II and to entirely skip the hierarchical k-mer selection step (which is only suggested if you have a large number of reference genomes), use --fast-mode
.
Otherwise, to gradually filter some k-mers to fit your input reference to the library size specified, use --selection-mode
.
Running krank
, krank build
, and krank query
with --help
will give you the list of options, their description, and default values.
KRANK version: v0.3.2
Memory-bound & accurate taxonomic classification and profiling.
Usage: ./krank [OPTIONS] SUBCOMMAND
Options:
--help
--log,--no-log{false} Extensive logging, might be too much and helpful for troubleshooting.
--verbose,--no-verbose{false}
Increased verbosity and progress report.
--seed INT Random seed for the LSH and other parts that require randomness.
Subcommands:
build Builds a reference library with given k-mers sets or reference genomes.
query Query given sequences with respect to given reference libraries.
The option --log
is likely to be more than needed and is probably not useful unless you have an error.
It would be helpful to include logs if you need to create a new issue.
The default verbosity should be sufficient (--verbose
).
Note that these options (--help
, --verbose
, --log
and --seed
) cannot be given after the subcommand (e.g., krank query --seed 0 [OPTIONS]
will throw an error).
KRANK version: v0.3.2
Builds a reference library with given k-mers sets or reference genomes.
Usage: ./krank build [OPTIONS]
Options:
--help
-l,--library-dir TEXT REQUIRED
Path to the directory containing the library.
-t,--taxonomy-dir TEXT:DIR REQUIRED
Path to the directory containing the taxonomy files (nodes.dmp and names.dmp).
-i,--input-file TEXT:FILE REQUIRED
Path to the file containing paths and taxon IDs of references.
--from-library,--from-scratch{false}
Are k-mers already encoded and stored in the library?
Default: --from-scratch, and it reads k-mer sets or sequences from given input paths.
If --from-library is given, KRANK will try to read k-mers from an already-initialized library.
--input-kmers,--input-sequences{false}
Are given input files k-mers sets (extracted with some external tool) or sequences (genomes, contigs etc.)?
If sequences, k-mers sets will be extracted internally.
Ignored if --from-library given.
Default: --input-sequences.
-m,--max-memory UINT The maximum memory limit for the final library (in megabyte - MB).
This will override -h (--num-positions), -b (--num-columns) and -k (--kmer-length).
KRANK will pick a good configuration of k, h and b values for you.
Note that this is only the final library size.
Memory requirements may exceed this limit but you can always decrease needed memory per batch by increasing -s.
-k,--kmer-length UINT Length of k-mers. Default: 29.
-w,--window-length UINT Length of minimizer window. Default: k+3.
-h,--num-positions UINT Number of positions for the LSH. Default: 13.
-b,--num-columns UINT Number of columns of the table. Default: 16.
-s,--batch-size UINT Number of bits to divide the table into batches. Default: 6, i.e., 64 batches.
--target-batch UINT The specific library batch to be built.
If 0, all batches will be processed one by one.
If not given, the library will only be initialized after reading the input data and encoding k-mers.
--kmer-ranking ENUM:value in {random->0,representative->1} OR {0,1}
Which strategy will be used for k-mer ranking? (0: random_kmer, 1: representative_kmer)
Default: representative_kmer, selected based on coverage heuristic.
--lca ENUM:value in {hard->0,soft->1} OR {0,1}
This option determines LCA computation method for k-mer labels? (0: hard_lca, 1: soft_lca)
Default: soft_lca, computed using CONSULT-II's heuristic.
--adaptive-size,--free-size{false}
Use size constraint heuristic while gradually building the library.
--num-threads UINT Number of threads to use for OpenMP-based parallelism.
--fast-mode,--selection-mode{false}
The default mode is --selection-mode which traverses the taxonomy and selects k-mers accordingly.
When --fast-mode is given, tree traversal will be skipped, and the final library will be built at the root.
With --kmer-ranking random_kmer, this is equivalent to CONSULT-II.
If --fast-mode is given, --adaptive-size will be ignored and have no effect.
Note --fast-mode is significantly faster.
--update-annotations,--build-tables{false}
When --update-annotations option is given, KRANK tries to update soft LCAs of k-mers by going over reference genomes.
Default --build-tables selects k-mers, builds tables, and also computes soft LCAs.
--remove-intermediate,--keep-intermediate{false}
When --keep-intermediate is given, KRANK will not delete batch data stored on the disk.
You may need to manually remove many small files (potentially tens of thousands). This may be desired if one wants to use --update-annotations later.
KRANK version: v0.3.2
Query given sequences with respect to given reference libraries.
Usage: ./krank query [OPTIONS]
Options:
--help
-l,--library-dir TEXT ... REQUIRED
Path(s) to the directory containing the library. Note that multiple libraries could be given to this option.
-o,--output-dir TEXT:DIR Path to the directory to output result files. Default: the current working directory.
-q,--query-file TEXT:FILE REQUIRED
Path to the tab-separated file containing paths and IDs of query FASTA/FASTQ files.
--total-vote-threshold,--tvote-threshold FLOAT ...
The minimum total vote threshold(s) to classify, the order should match the order of the given libraries. Default: 0.05.
--max-match-distance,--max-match-hdist UINT
The maximum Hamming distance for a k-mer to be considered as a match. Default: 5.
--save-match-info,--no-match-info{false}
Save matching information to --output-dir for each query, this flag is not given by default. There is no practical need to give this flag. This is for debugging purposes and maybe for alternative down-stream analyses.
--num-threads UINT Number of threads to use for OpenMP-based parallelism.
Şapcı, A. O. B., & Mirarab, S. (2024). Memory-bound k-mer selection for large and evolutionary diverse reference libraries. Genome Research, gr.279339.124. https://doi.org/10.1101/gr.279339.124
@article{sapci_krank_2024,
title = {Memory-bound k-mer selection for large and evolutionary diverse reference libraries},
issn = {1088-9051, 1549-5469},
url = {https://genome.cshlp.org/content/early/2024/08/29/gr.279339.124},
doi = {10.1101/gr.279339.124},
language = {en},
urldate = {2024-08-29},
journal = {Genome Research},
author = {Şapcı, Ali Osman Berk and Mirarab, Siavash},
month = aug,
year = {2024},
pages = {gr.279339.124},
}
Footnotes
-
Although for most users it is neither practical nor useful, it is possible to use an external tool to extract k-mer sets, such as Jellyfish, and give k-mer sets directly as the input data. These reference sequences might be in both FASTA and FASTQ formats. Both
gzip
compressed or raw files are allowed. Whatever input type you prefer, you need to provide filepaths or URLs (or a mixture of them). All you have to provide is a mapping between taxon IDs (i.e., species ID) and paths/URLs, which will be a tab-separated file. Some lines from such a file would look similar to this. ↩ -
If the input references are sets of k-mers, then use the
--input-kmers
flag (the default is the complement option--input-sequences
). Use-k
or--kmer-length
to specify the k-mer lengths, and-w
or--window-length
for minimizer window length. See this section for a discussion of these parameters. Simply set -k and -w to the same value if you do not want to use minimizers3. ↩ -
If
--input-kmers
is given, the length of each line must be equal to -w. Otherwise, it is undefined behavior. ↩