Skip to content

vikshiv/mumemto

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GitHub release (latest by date) GitHub install with bioconda

mumemto: finding multi-MUMs and MEMs in pangenomes

logo

Mumemto is a tool for analyzing pangenome sequence collections. It identifies maximal unique/exact matches (multi-MUMs and multi-MEMs) present across a collection of sequences. Mumemto can visualize pangenome synteny, identify misassemblies, and provide a unifiying structure to a pangenome.

This method is uses the prefix-free parse (PFP) algorithm for suffix array construction on large, repetitive collections of text. The main workflow of mumemto is to compute the PFP over a collection of sequences, and identify multi-MUMs while computing the SA/LCP/BWT of the input collection. Note that this works best with highly repetitive texts (such as a collection of closely related genomes, likely intra-species such as a pangenome).


If you use Mumemto, please cite:

Shivakumar, V. S., & Langmead, B. (2025). Mumemto: efficient maximal matching across pangenomes. Genome Biology, 26(1), 169.

If you use the partition-merging approach (for parallelization, etc.), please cite:

Shivakumar, V. S., & Langmead, B. (2025). Partitioned Multi-MUM finding for scalable pangenomics. bioRxiv, 2025-05.

Paper available at https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03644-0.

Merging algorithm described in https://www.biorxiv.org/lookup/doi/10.1101/2025.05.20.654611.

For detailed usage, see the Mumemto wiki

Installation

Conda installation (recommended)

Mumemto is available on bioconda. Note conda installation requires python 3.9+. We recommend using a new environment:

### conda ###
conda create -n mumemto_env python #3.10 or higher
conda activate mumemto_env

conda install -c conda-forge bioconda::mumemto

pip installation

Mumemto can be installed using pip locally. We recommend using a new conda or virtual environment for this.

git clone https://github.com/vikshiv/mumemto
cd mumemto; pip install .

Docker/Singularity

Mumemto is available on docker and singularity.

Tip

You may need to bind a local directory to access files in the container, which may cause issues when globbing input files. Input filelist + docker/singularity bind mount is recommended.

### if using docker ###
docker pull vshiv123/mumemto:latest
docker run vshiv123/mumemto:latest -h

### if using singularity ###
singularity pull mumemto.sif docker://vshiv123/mumemto:latest
./mumemto.sif -h
 # or any subcommand, e.g. ./mumemto.sif viz

Compile from scratch

To build from scratch, download the source code and use cmake/make. After running the make command below, the mumemto executable will be found in the build/ folder. The following are dependencies: cmake, g++, gcc

git clone https://github.com/vikshiv/mumemto
cd mumemto

mkdir build 
cd build && cmake ..
make install

Note: For the python scripts, you may need to install dependencies separately. The following are dependencies: matplotlib, numpy, tqdm, plotly (for interactive plots), and numba (for the coverage script).

Quick start

To visualize the synteny across the FASTA files in a directory assemblies/ (each sequence is a separate fasta file):

mumemto assemblies/*.fa -o pangenome
mumemto viz -i pangenome

Getting started

Find multi-MUMs and MEMs

By default, mumemto computes multi-MUMs across a collection, without additional parameters.

mumemto -o <output_prefix> [input_fasta [...]]

Command line options

Use the -h flag to list additional options and usage: mumemto -h.

Mumemto options enable the computation of various different classes of exact matches:

visual_guide

The multi-MUM properties can be loosened to find different types of matches with three main flags:

  • -k determines the minimum number of sequences a match must occur in (e.g. for finding MUMs across smaller subsets)
  • -f controls the maximum number of occurences in each sequence (e.g. finding duplication regions)
  • -F controls the total number of occurences in the collection (e.g. filtering out matches that occur frequently due to low complexity)

Tip

-k is flexible in input format. The user can specify a positive integer, indicating the minimum number of sequences a match should appear in. Passing a negative integer indicates a subset size relative to N, the number of sequences in the collection (i.e. N - k). For instance, to specify a match must appear in at least all sequences except one, we could pass -k -1. Similarly, passing negative values to -F specifies limits relative to N. Note: when setting -F and -f together, the max total limit will be the smaller of F and N * f.

Here are some example use cases:

	 # Find all strict multi-MUMs across a collection
     mumemto [OPTIONS] [input_fasta [...]] (equivalently -k 0 -f 1 -F 0)
	 # Find partial multi-MUMs in all sequences but one
     mumemto -k -1 [OPTIONS] [input_fasta [...]]
	 # Find multi-MEMs that appear at most 3 times in each sequence
     mumemto -f 3 [OPTIONS] [input_fasta [...]]
	 # Find all MEMs that appear at most 100 times within a collection
     mumemto -f 0 -k 2 -F 100 [OPTIONS] [input_fasta [...]]

Multi-MUM merging (new in v1.3)

The output multi-MUMs from Mumemto can be merged between runs in v1.3. There are two methods to do this: anchor-based (-Mn) and string-based (-M). Anchor-based merging requires the first sequence in each partition to be the same. String-based merging does not require any overlap between partitions, however is generally slower.

Running Mumemto with -M or -Mn generates a threshold file, *.thresh and *.thresh_rev for string-based merging and *.athresh for anchor-based merging. To merge partitions, run:

mumemto merge p1.mums p2.mums <...> -o <out_prefix>.mums

The merge script automatically detects which type of merging is possible and creates an output using out_prefix which is identical to if Mumemto was run on the union of the input partitions.

Note

Merging is currently limited to strict multi-MUMs. However, partial multi-MUMs for local partitions can be found using string-based merging incrementally.

Note

In v1.3.4, the string-based threshold file format changed and is not backwards compatible. To convert string-merging threshold files from v1.3.3 or earlier, use the provided mumemto/convert_thresh.py.

Tip

Using either merge mode enables a dynamic updating of multi-MUMs. You can incrementally add assemblies as the pangenome grows and update the global set of multi-MUMs across the collection.

I/O format

The mumemto command takes in a list of fasta files as positional arguments and then generates output files using the output prefix. Alternatively, you can provide a file-list, which specifies a list of fastas (one per line). Passing in fastas as positional arguments will auto-generate a filelist that defines the order of the sequences in the output.

Tip

The output *.lengths file can also serve as an input filelist to re-run expts.

Example of file-list file:

/path/to/ecoli_1.fna
/path/to/salmonella_1.fna
/path/to/bacillus_1.fna
/path/to/staph_2.fna

Format of the output *.mums file:

[MUM length] [comma-delimited list of offsets within each sequence, in order of filelist] [comma-delimited strand indicators (one of +/-)]

If the maximum number of occurences per sequence is set to 1 (indicating MUMs), a *.mums file is generated. This contains each MUM as a separate line, where the first value is the match length, and the second is a comma-delimited list of positions where the match begins in each sequence. An empty entry indicates that the MUM was not found in that sequence (only applicable with -k flag). The MUMs are sorted in the output file lexicographically based on the match sequence.

Format of the output *.mems file:

[MEM length] [comma-delimited list of offsets for each occurence] [comma-delimited list of sequence IDs, as defined in the filelist] [comma-delimited strand indicators (one of +/-)]

If more than one occurence is allowed per sequence, the output format is in *.mems format. This contains each MEM as a separate line with the following fields: (1) the match length, (2) a comma-delimited list of offsets within a sequence, (3) the corresponding sequence ID for each offset given in (2). Similar to above, MEMs are sorted in the output file lexicographically based on the match sequence.

Bumbl format:

Mumemto can also output a binary format (*.bumbl) for faster I/O with large MUM files. Most Mumemto commands accept either *.mums or *.bumbl files interchangeably. Mumemto can output a *.bumbl file using the ``-b` flag. To convert between formats:

mumemto convert -m input.mums -o output.bumbl  # text to binary
mumemto convert -b input.bumbl -o output.mums  # binary to text

Tip

For large pangenomes, using *.bumbl files can significantly reduce file sizes and improve loading times for visualization and analysis.

Visualization

potato_synteny

Potato pangenome (assemblies from [Cheng et al., 2025])

Mumemto can visualize multi-MUMs in a synteny-like format, highlighting conservation and genomic structural diversity within a collection of sequences.

After running mumemto on a collection of FASTAs, you can generate a visualization using:

mumemto viz (-i PREFIX | -m MUMFILE)

Use mumemto viz -h to see options for customizability. As of now, only strict and partial multi-MUMs are supported (rare multi-MEM support coming soon), thus a *.mums output is required.

An interactive plot (with plotly, still experimental) can be generated with mumemto viz --interactive.

Getting Help

If you run into any issues or have any questions, please feel free to reach out to us either (1) through GitHub Issues or (2) reach out to me at vshivak1 [at] jhu.edu

Acknowledgements

Portions of code from this repo were adapted from pfp-thresholds, written by Massimiliano Rossi and cliffy, written by Omar Ahmed.

Citing Mumemto and reproducing results

Preprint: https://doi.org/10.1101/2025.01.05.631388

Scripts to reproduce the results in the preprint are found in this repo: https://github.com/vikshiv/mumemto-reproducibility-scripts

About

Mumemto: multi-MUM and MEM finding across pangenomes

Resources

License

Stars

Watchers

Forks

Packages

No packages published