A PBWT-based algorithm for identifying all Maximal Perfect Haplotype Blocks with Wildcards (MPHBw).
SDSL by @SimonGog is required to build the project. More information available at GitHub Repo.
You can either install it system-wide with the package manager (for example with sudo apt-get install libsdsl-dev on Ubuntu and similar), or using a conda environment (for example with conda create --prefix ./local-env --channel conda-forge --override-channels compilers make sdsl-lite, albeit mamba on miniforge is suggested), or installing SDSL manually and modifying the CXXFLAGS in the Makefile (not suggested).
Run
make
inside the main folder, to build wild-pbwt, gen, and err. You could also selectively build a tool by running
make <tool-name>
For example run make wild-pbwt to only build wild-pbwt tool.
The wild-pbwt binary under the bin subfolder computes all the maximal haplotype blocks with wildcards from a given input, with the extended pBWT
./bin/wild-pbwt -f <filename> -a <t-alleles> [-c|-o y] [-v y] [-b <block_size>] [-g <buffer_size>]where:
<filename>is the input file containing the haplotype panel in ASCII format where each line represents a single haplotype and each column is a variation site. Wildcards are represented with character*.<t-alleles>is the alphabet size (not counting*), i.e., the maximum number of alleles in a single site<block_size>is the minimum block size required to count a maximal block. It defaults to 2. It requires the-oflag.<buffer_size>is the buffer_size of the file stream buffer. No need to tweak this parameter in a normal use case. Adjusting this value according to the input column size could result in a performance improvement.
If ran with -o flag, blocks will be output to standard output. Blocks colud be saved to external file adding > output_file.txt at the end of the wild-pbwt command.
If ran with -c flag, blocks will not be output, only counted.
To specify a block_size, run with -o (-c not permitted).
If ran with -v flag, the extended pBWT execution will be output to the standard error.
The gen binary under the bin subfolder allows the user to generate a matrix M (haplotypes) x N (SNPs) with a specified wildcard rate.
./bin/gen <save_directory> <t-alleles> <haplotypes_count> <SNPs_count> <wild_rate> If no wild_rate is given, a plain random matrix will be generated.
If a wild_rate is given, a plain random matrix will be generated alongside a version of that matrix with missing data.
Example:
mkdir data
./bin/gen data 3 1000 25000 5Will generate 2 matrices under the newly created data folder, tri-allelic with 1000 haplotypes and 25000 SNPs; one with no missing data and a copy of the first one with 5% missing data.
Example, insted:
./bin/gen data 2 1000 52000Will generate a matrix under data folder, bi-allelic with 1000 haplotypes and 52000 SNPs.
The err binary under the bin subfolder allows the user to generate a matrix with missing data, from an input matrix. The rate of missing data insertion for each locus is specified by wild_rate%.
./bin/err <input_matrix> <wild_rate> <path_to_output_matrix>Example:
./bin/err data/input_matrix 3 data/output_matrix Will generate output_matrix under data folder as a copy of input_matrix with 3% missing data.