Comparex

Comparex is a new algorithm for finding the Longest Common Subsequence (LCS) between two sequences. It builds on the foundational insights of the classic Myers algorithm—well-known for its efficient diff computation—while introducing improvements that can make LCS computation faster and more intuitive under certain conditions.

This README will walk you step-by-step from the fundamentals of Myers algorithm to how Comparex works, along with examples to get you started quickly.

Introduction
What is the Longest Common Subsequence?
Review of the Classic Myers Algorithm
- Key Concepts
- Myers Algorithm Steps
Inplace Twicks
Comparex: New Longest Subsequence Algorithm
- Design Goals
- Algorithm Outline
Installation
Quick Start Example
Step-by-Step Guide to Comparex Implementation
Performance Considerations
Contributing
License
Further Reading

Introduction

Diff algorithms are ubiquitous in software development (e.g., version control systems like Git). The Myers algorithm is often considered the gold standard for computing the difference between two sequences efficiently.

However, for the specific task of Longest Common Subsequence (LCS), we can sometimes streamline the process further. Comparex is an attempt to achieve a simpler, often more memory-efficient approach while retaining the essence of the Myers diff logic.

This guide is intended for:

Students and researchers wanting a deeper look into sequence-comparison algorithms.
Developers who need to integrate an LCS or diff algorithm into their applications and want an alternative to the classic approach.

What is the Longest Common Subsequence?

A subsequence of a sequence is a new sequence derived from the original by removing some elements (possibly none) without changing the order of the remaining elements.

The Longest Common Subsequence (LCS) of two sequences is the subsequence of the greatest length common to both sequences.

For example, for sequences:

A: [A, B, C, D, G, H]
B: [A, E, D, F, H, R]

One LCS is [A, D, H], having length 3.

Review of the Classic Myers Algorithm

Key Concepts

Edit Distance: The minimum number of edits (insertions, deletions, substitutions) needed to transform one sequence into another.
Longest Common Subsequence: Highly related to the edit distance problem. In fact, an LCS problem can be transformed into an edit distance problem or vice versa.

The Myers algorithm is generally known for:

Linear space complexity in its optimized form.
Greedy extension steps that expand a "furthest-reaching path."

Myers Algorithm Steps

At a high level, Myers' algorithm can be broken down into:

Track "Furthest Reaching" Positions: For each possible difference d in the length of sequences, track the furthest index in the sequences that has been matched.
Compute Paths: Use these furthest-reaching positions to advance matching indices through both sequences.
Traceback: Reconstruct the actual edits or the common subsequence from the stored information.

While often used to compute "diffs," the algorithm can also yield an LCS by tracking the matched parts.

Myers himself drew a similar figure to illustrate how the algorithm works.

While this is sufficient to illustrate the idea behind the algorithm, it is slightly incorrect. Myers suggested that the path adds a weight of 1 for a mismatch and 1 for a match, making diagonal movements exactly twice as cheap. To be more precise, a graph that represents this more accurately would look like this:

Inplace twicks

In the in-place tweaks, I will list the changes made to the code of the classic algorithm. These adjustments do not result in any performance degradation and enable the easy implementation of CompareX features. Additionally, they ensure a fair comparison between the original algorithm and the new one.

Non-negative k loop

In the original Myers algorithm, the k index loops from -d to d. In our modification, we adjust this to loop from 0 to 2d. This change allows us to stay within the realm of unsigned numbers, effectively doubling the range of values the algorithm can handle. The minor calculation adjustments required for this modification do not result in any observable performance impact.

Parametrizable size

Now that the algorithm operates entirely with unsigned numbers, it becomes straightforward to make the size parametrizable, allowing seamless switching between uint32_t, uint64_t, or other desired types.

Bitset for the Turns

During forward tracing, it is essential to track the value of x, which inherently determines the value of y. However, during backtracing, this is unnecessary. The only requirement is to record the turns taken. This method reduces the memory footprint by approximately 32x or 64x, depending on the size of size_type. By managing significantly less memory, it also enhances performance.

Note: This is a significant idea. It separates the concept of forward traversal of data from the data needed for backtracking, which will become important later on.

Shrink the Turns Vector Size

To further reduce the memory footprint, we make the following observation:

We can divide the traversal graph into three sections:

The section where moving forward introduces a new item from both sequences, A and B. Here, the size of the anti-diagonal increases by 1 with each turn.
The section where one sequence continues introducing new items while the other has finished. In this area, the size of the diagonal remains constant.
The section where neither sequence introduces new items, causing the size to decrease.

With some mathematical adjustments, we can precisely maintain the size of the turns vector to track only the necessary data.

To illustrate this, let's make the sequences differ more in size.

Also, note that region 2 may appear on either side, depending on whether A or B is longer.

If the two strings are of equal length, this region 2 will be empty.

Limit the scanning

Using the same logic from the previous point, we can not only reduce the size of the turn vectors but also apply the same approach to scanning. This eliminates a large number of expansion attempts that we know will fail because they fall outside the boundaries of one sequence or the other.

Comparex: New Longest Subsequence Algorithm

Design Goals

Algorithm Outline

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
.github/workflows		.github/workflows
.vscode		.vscode
cli		cli
doc		doc
lib		lib
.bazelrc		.bazelrc
.bazelversion		.bazelversion
.clang-format		.clang-format
.gitignore		.gitignore
BUILD.bazel		BUILD.bazel
LICENSE		LICENSE
MODULE.bazel		MODULE.bazel
MODULE.bazel.lock		MODULE.bazel.lock
README.md		README.md
f.txt		f.txt
format.sh		format.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Comparex

Table of Contents

Introduction

What is the Longest Common Subsequence?