
Conversation

@flatheadmill


This pull request adds spatial indexing to Tantivy using block kd-trees. The implementation tessellates polygons into triangles and indexes the triangles in the block kd-tree.

Search supports three query types: "intersects" finds documents whose geometries overlap a query rectangle, "within" finds documents whose geometries fall entirely inside the query bounds, and "contains" finds documents whose geometries completely enclose the query region. I've attempted to follow Tantivy's established patterns. Spatial fields use a new FieldType::Spatial variant. I used PreTokenizedString as a go-by.

The user-provided f64 geographic coordinates are serialized using XOR compression that exploits spatial locality. When consecutive vertices are close together, their bit representations share high-order bits, and XOR reveals this redundancy as zeros that varint encoding can compress. The compression falls back to an uncompressed representation when the data is incompressible. I looked at alternatives; bitshuffle+zstd might be nice for large polygons, but small polygons would be smaller than a zstd block. XOR is simple and a good place to start.
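
To make the idea concrete, here is a minimal sketch of XOR-plus-varint compression; the names and the LEB128-style varint are illustrative assumptions, not the exact encoding in this pull request:

```rust
/// Sketch: XOR each coordinate's IEEE 754 bits with the previous coordinate's
/// bits. Nearby vertices share the sign, exponent, and high mantissa bits, so
/// the XOR has long runs of leading zeros: a numerically small u64 that a
/// varint encodes in a few bytes instead of eight.
fn xor_varint_compress(coordinates: &[f64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut previous: u64 = 0;
    for &coordinate in coordinates {
        let bits = coordinate.to_bits();
        let mut delta = bits ^ previous;
        previous = bits;
        // LEB128-style varint: seven bits per byte, high bit on continuation.
        loop {
            let byte = (delta & 0x7f) as u8;
            delta >>= 7;
            if delta == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }
    out
}
```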

The block kd-tree implementation uses bulk-loaded immutable construction with recursive median partitioning, creating a somewhat balanced block kd-tree with 512 triangles per leaf. The tree stores triangles. Polygons are tessellated into triangles prior to indexing. f64 dimensions are converted to i32 prior to tessellation, and the block kd-tree stores the dimensions as i32. Triangles are stored in an encoded format that places a bounding box for the triangle in the first four words, followed by the spare x/y, the boundary flags, and finally the doc_id. The boundary flags indicate whether a side of the triangle shares a boundary with the tessellated polygon and are used for "within" and "contains" testing.
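
The layout, sketched as a struct with illustrative field names (this is not the PR's exact definition, just the word order it describes):

```rust
/// Eight 32-bit words per triangle. The bounding box leads so that tree
/// traversal can prune on the first four words without decoding the rest.
#[repr(C)]
struct EncodedTriangle {
    min_x: i32,          // words 0-3: the triangle's bounding box
    min_y: i32,
    max_x: i32,
    max_y: i32,
    spare_x: i32,        // words 4-5: the vertex not recoverable from the box
    spare_y: i32,
    boundary_flags: u32, // word 6: which edges lie on the original polygon
    doc_id: u32,         // word 7: document that owns the tessellated polygon
}
```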

The tree construction uses Rust's standard library select_nth_unstable_by_key for median partitioning, operating directly on &mut [Triangle] slices. Tree construction is a simple recursive descent based around this standard library function. The recursive descent receives slices of progressively smaller sub-ranges until reaching leaf size. This approach works identically whether the triangles come from a Vec during initial indexing or from a memory-mapped file during merge.
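
A minimal, self-contained sketch of that descent (assumed names and shapes, not the PR's exact code):

```rust
const LEAF_SIZE: usize = 512;

/// Hypothetical stand-in for the PR's encoded triangle.
#[derive(Clone, Copy)]
struct Triangle {
    bounds: [i32; 4], // min_x, min_y, max_x, max_y
}

/// Recursively partition `triangles` in place, recording each leaf's range.
fn build(triangles: &mut [Triangle], offset: usize, leaves: &mut Vec<std::ops::Range<usize>>) {
    if triangles.len() <= LEAF_SIZE {
        leaves.push(offset..offset + triangles.len());
        return;
    }
    // Split on the bounding-box dimension with the widest spread.
    let dimension = (0..4)
        .max_by_key(|&d| {
            let min = triangles.iter().map(|t| t.bounds[d]).min().unwrap() as i64;
            let max = triangles.iter().map(|t| t.bounds[d]).max().unwrap() as i64;
            max - min
        })
        .unwrap();
    let median = triangles.len() / 2;
    // O(n) in-place selection: everything left of the median is <= it,
    // everything right is >= it, and neither side is sorted.
    triangles.select_nth_unstable_by_key(median, |t| t.bounds[dimension]);
    let (left, right) = triangles.split_at_mut(median);
    build(left, offset, leaves);
    build(right, offset + median, leaves);
}
```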

Merge is implemented by writing these Triangle structures into a temporary file, memory-mapping the temporary file, and then performing an unsafe cast of the mapped memory to &mut [Triangle]. Because the file is memory-mapped, the select_nth_unstable_by_key tree construction can rely on the operating system to manage memory when merging large segments.
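
A hedged sketch of that cast using the memmap2 crate; the type and the safety argument are my assumptions, not the PR's exact code:

```rust
use std::fs::OpenOptions;
use memmap2::MmapMut;

/// Hypothetical stand-in for the PR's eight-word triangle record.
#[repr(C)]
#[derive(Clone, Copy)]
struct Triangle {
    words: [i32; 8], // bounding box, spare vertex, boundary flags, doc_id
}

fn map_triangles(path: &std::path::Path) -> std::io::Result<MmapMut> {
    let file = OpenOptions::new().read(true).write(true).open(path)?;
    // SAFETY: the file is a private temporary owned by the merge; nothing
    // else truncates or remaps it while the mapping is alive.
    unsafe { MmapMut::map_mut(&file) }
}

fn as_triangles(mmap: &mut MmapMut) -> &mut [Triangle] {
    let count = mmap.len() / std::mem::size_of::<Triangle>();
    // SAFETY: Triangle is #[repr(C)] plain old data (every bit pattern is
    // valid), the mapping is page-aligned, and the mutable borrow of `mmap`
    // keeps the mapping alive for the lifetime of the returned slice.
    unsafe { std::slice::from_raw_parts_mut(mmap.as_mut_ptr() as *mut Triangle, count) }
}
```

The `&mut [Triangle]` returned here is what the `select_nth_unstable_by_key` construction above operates on.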

If you are zealously opposed to unsafe code, then the &mut [Triangle] cast and select_nth_unstable_by_key would have to be rewritten, but I don't see how it is a problem. The Vec<Triangle> is Rust-safe. The temporary file is entirely a transient step in the block kd-tree merge: it is not a persistent file, and it does not have to have the same endianness on different architectures. A compile switch for endianness could be added if it ever did.

This implementation is certainly Lucene inspired. The triangle encoding is a 1:1 port, with the exception of replacing Lucene's orient with the `robust` crate's orient2d implementation of Shewchuk's 2D orientation predicate. The rest is more inspiration than port. Porting from Lucene 1:1 is not sensible: Lucene needs to operate on mapped memory via Java arrays, and it goes to great lengths to avoid the creation of POJOs to keep the garbage collector out of the picture. The Rust implementation can read more like Rust.

The f64-to-i32 conversion is Lucene-inspired, with roughly centimeter precision. The bulk-loaded partition construction is Lucene-inspired, but the in-place sorting and reuse of the standard library is a feature of Rust. Delta encoding for i32/u32 is Lucene-inspired. Currently using barycentric point-in-triangle instead of Lucene's orientation tests.
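
For reference, an exact barycentric test can be written with integer arithmetic alone; this is my formulation of the technique, not the PR's code. With i32 inputs the products fit comfortably in i128, so the signs are exact and no epsilon is needed:

```rust
/// True if `p` lies inside triangle (a, b, c) or on its boundary.
fn point_in_triangle(p: (i32, i32), a: (i32, i32), b: (i32, i32), c: (i32, i32)) -> bool {
    let (px, py) = (p.0 as i128, p.1 as i128);
    let (ax, ay) = (a.0 as i128, a.1 as i128);
    let (bx, by) = (b.0 as i128, b.1 as i128);
    let (cx, cy) = (c.0 as i128, c.1 as i128);
    // Unnormalized barycentric weights: each is twice the signed area of a
    // sub-triangle, and the three sum to twice the signed area of (a, b, c).
    let w_a = (by - cy) * (px - cx) + (cx - bx) * (py - cy);
    let w_b = (cy - ay) * (px - cx) + (ax - cx) * (py - cy);
    let total = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy);
    let w_c = total - w_a - w_b;
    // Inside (or on an edge) when all weights share the sign of the total.
    if total >= 0 {
        w_a >= 0 && w_b >= 0 && w_c >= 0
    } else {
        w_a <= 0 && w_b <= 0 && w_c <= 0
    }
}
```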

Dependencies added are `robust` for orient2d and `i_triangle` for its integer-based Delaunay triangulation that tracks boundary edges.

I have avoided making anything "pluggable" and bonded tightly to Tantivy and the chosen third-party libraries for simplicity and easy reading.


In the above you'll see a discussion of the use of select_nth_unstable_by_key with a &mut [Triangle] that was obtained by casting a memory-mapped file; please consider it.

Please consider the minimalist Geometry enum. Rather than create yet another object hierarchy of Point, Polygon, BoundingBox, et al., I used underlying representations that can be easily handed off to GeoRust or similar. I didn't want to make GeoRust or GeoJSON parsing a Tantivy dependency. The block kd-tree is intended for bounding box and point queries, not for the full range of queries you would find in PostGIS, like polygon-in-polygon. Such a test could be performed while iterating the search results.

Note that the block kd-tree is not going to work with geometries that cross the antimeridian, but this is a known issue, and data sets will probably already split a polygon that crosses the antimeridian into a multi-polygon with a polygon on each side of the antimeridian. This is even in the GeoJSON spec.

Note that there is a Spatial field and a Geometry type stored in that field. Would it be better to have a Geographic or Geo field since the implementation is geo-specific?

As I write this, "contains" and "within" are implemented in the block kd-tree but not exposed through the interfaces. I will implement "disjoint." There are todo!()s to implement in the Field::Spatial(_) match arms. There are some tree balancing improvements that I'd like to add. I'd like to put a lot of Geofabrik data through it and see how well it performs. However, it is ready for consideration by the Tantivy maintainers.


Run `cargo run --example geo_json` for a minimal round-trip through the index.

Implement a SPATIAL flag for use in creating a spatial field.

Encodes triangles with the bounding box in the first four words,
enabling efficient spatial pruning during tree traversal without
reconstructing the full triangle. The remaining words contain an
additional vertex and packed reconstruction metadata, allowing exact
triangle recovery when needed.

The `triangulate` function takes a polygon with floating-point lat/lon
coordinates, converts to integer coordinates with millimeter precision
(using 2^32 scaling), performs constrained Delaunay triangulation, and
encodes the resulting triangles with boundary edge information for block
kd-tree spatial indexing.

It handles polygons with holes correctly, preserving which triangle
edges lie on the original polygon boundaries versus internal
tessellation edges.
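
A sketch of the longitude half of that conversion, under the 2^32 scaling described above (my formulation; latitude would scale across its 180-degree span analogously):

```rust
/// Map a longitude in [-180.0, 180.0) onto the full i32 range: 2^32 buckets
/// across 360 degrees, so one bucket is about nine millimeters at the equator.
fn encode_longitude(longitude: f64) -> i32 {
    let scaled = longitude / 360.0 * (1u64 << 32) as f64;
    scaled.floor() as i64 as i32
}

/// Recover the bucket's lower edge; precision is lost, ordering is not.
fn decode_longitude(encoded: i32) -> f64 {
    encoded as f64 / (1u64 << 32) as f64 * 360.0
}
```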
Implemented byte-wise histogram selection to find median values without
comparisons, enabling efficient partitioning of spatial data during
block kd-tree construction. Processes values through multiple passes,
building histograms for each byte position after a common prefix,
avoiding the need to sort or compare elements directly.
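
The shape of that selection, reduced here to finding the k-th smallest u32 (illustrative only: the PR's version starts after a shared prefix and partitions triangles, and signed coordinates would first be mapped to sortable unsigned values):

```rust
/// Find the k-th smallest value (0-indexed, k < values.len()) using one
/// 256-bucket histogram per byte position, most significant byte first.
fn radix_select(values: &[u32], mut k: usize) -> u32 {
    let mut prefix: u32 = 0;
    for byte in (0..4usize).rev() {
        let shift = byte * 8;
        let mut histogram = [0usize; 256];
        for &value in values {
            // Bucket only the values that match the bytes selected so far.
            if byte == 3 || (value >> (shift + 8)) == (prefix >> (shift + 8)) {
                histogram[((value >> shift) & 0xff) as usize] += 1;
            }
        }
        // Walk the buckets to find the one holding the k-th remaining value.
        for (bucket, &count) in histogram.iter().enumerate() {
            if k < count {
                prefix |= (bucket as u32) << shift;
                break;
            }
            k -= count;
        }
    }
    prefix
}
```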
Implemented a `Surveyor` that will evaluate the bounding boxes of a set
of triangles and determine the dimension with the maximum spread and the
shared prefix for the values of that dimension.

Implements dimension-major bit-packing with zigzag encoding for signed i32
deltas, enabling compression of spatially-clustered triangles from 32-bit
coordinates down to 4-19 bits per delta depending on spatial extent.
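
Zigzag encoding in isolation, for readers unfamiliar with it (the standard technique, not PR-specific code):

```rust
/// Interleave signed values around zero: 0, -1, 1, -2, 2 become 0, 1, 2, 3, 4,
/// so small deltas of either sign pack into few bits.
fn zigzag_encode(delta: i32) -> u32 {
    ((delta << 1) ^ (delta >> 31)) as u32
}

fn zigzag_decode(encoded: u32) -> i32 {
    ((encoded >> 1) as i32) ^ -((encoded & 1) as i32)
}

/// Bits per delta for a block: the width of the largest zigzagged value.
fn bits_required(deltas: &[i32]) -> u32 {
    deltas.iter().map(|&d| 32 - zigzag_encode(d).leading_zeros()).max().unwrap_or(0)
}
```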
Implement an immutable bulk-loaded spatial index using recursive median
partitioning on bounding box dimensions. Each leaf stores up to 512
triangles with delta-compressed coordinates and doc IDs. The tree
provides three query types (intersects, within, contains) that use exact
integer arithmetic for geometric predicates and accumulate results in
bit sets for efficient deduplication across leaves.

The serialized format stores compressed leaf pages followed by the tree
structure (leaf and branch nodes), enabling zero-copy access through
memory-mapped segments without upfront decompression.

Lossless compression for floating-point lat/lon coordinates using XOR
delta encoding on IEEE 754 bit patterns with variable-length integer
encoding. Designed for per-polygon random access in the document store,
where each polygon compresses independently without requiring sequential
decompression.

The triangulation function in `triangle.rs` is now called
`delaunay_to_triangles` and it accepts the output of a Delaunay
triangulation from `i_triangle` and not a GeoRust multi-polygon. The
translation of user polygons to `i_triangle` polygons and subsequent
triangulation will take place outside of `triangle.rs`.

Implemented a geometry document field with a minimal `Geometry` enum.
Now able to add that Geometry from GeoJSON parsed from a JSON document.
Geometry is triangulated if it is a polygon, otherwise it is correctly
encoded as a degenerate triangle if it is a point or a line string.
Write accumulated triangles to a block kd-tree on commit.

Serialize the original `f64` polygon for retrieval from search.

Created a query method for intersection. Query against the memory mapped
block kd-tree. Return hits and original `f64` polygon.

Implemented merging of the block kd-trees from one or more segments
during segment merge.

Updated the block kd-tree to write to a Tantivy `WritePtr` instead of
more generic Rust I/O.

Ended up using `select_nth_unstable_by_key` from the Rust standard
library instead.

Read node structures using `from_le_bytes` instead of casting memory.
After an inspection of columnar storage, it appears that this is the
standard practice in Rust and in the Tantivy code base. Left the
structure alignment for now in case it tends to align with cache
boundaries.
@flatheadmill flatheadmill marked this pull request as draft November 4, 2025 08:10
@fulmicoton
Collaborator

@flatheadmill This is a super interesting PR. A bit of a contribution backlog has accumulated, so it might take a bit of time to get it reviewed. To make things smoother, would you be ok with a quick call to walk me through the code?

@flatheadmill
Author

Of course. Doesn't have to be quick for my sake. Check your email.

Addressed all `todo!()` markers created when adding `Spatial` field type
and `Geometry` value type to existing code paths:

- Dynamic field handling: `Geometry` not supported in dynamic JSON
  fields, return `unimplemented!()` consistently with other complex
  types.
- Fast field writer: Panic if geometry is routed incorrectly (internal
  error).
- `OwnedValue` serialization: Implement `Geometry` to GeoJSON
  serialization and reference-to-owned conversion.
- Field type: Return `None` for `get_index_record_option()` since
  spatial fields use BKD trees, not inverted index.
- Space usage tracking: Add spatial field to `SegmentSpaceUsage` with
  proper integration through `SegmentReader`.
- Spatial query explain: Implement `explain()` method following pattern of
  other binary/constant-score queries.

Fixed `MultiPolygon` deserialization bug: count total points across all
rings, not number of rings.

Added clippy expects for legitimate too_many_arguments cases in geometric
predicates.
@flatheadmill
Author

For all the talk of k-dimensions, the end result of writing the tree is a run-of-the-mill binary tree whose search functions are very easy to understand. In the end, we have a binary tree of inner nodes that have left and right pointers to child nodes and a bounding box. During a search the bounding box allows us to eliminate sub-trees that do not intersect. The binary tree terminates in leaf nodes that have a bounding box and a pointer to where triangles are stored. If you reach a leaf whose bounding box intersects your sought bounding box, you scan the triangles referenced by that leaf node.
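
A sketch of that search in code, with hypothetical node shapes (not the PR's serialized format):

```rust
struct Bounds { min_x: i32, min_y: i32, max_x: i32, max_y: i32 }

enum Node {
    Inner { bounds: Bounds, left: Box<Node>, right: Box<Node> },
    Leaf { bounds: Bounds, triangles: std::ops::Range<usize> },
}

fn overlaps(a: &Bounds, b: &Bounds) -> bool {
    a.min_x <= b.max_x && a.max_x >= b.min_x && a.min_y <= b.max_y && a.max_y >= b.min_y
}

/// Descend, pruning every subtree whose bounding box misses the query, and
/// collect the triangle ranges of the leaves that survive.
fn search(node: &Node, query: &Bounds, hits: &mut Vec<std::ops::Range<usize>>) {
    match node {
        Node::Inner { bounds, left, right } => {
            if overlaps(bounds, query) {
                search(left, query, hits);
                search(right, query, hits);
            }
        }
        Node::Leaf { bounds, triangles } => {
            if overlaps(bounds, query) {
                hits.push(triangles.clone()); // scan these triangles exactly
            }
        }
    }
}
```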


Triangles? Yes, the polygons you add to the index are tessellated into triangles. The i_triangle crate used for tessellation in this pull request has an animation of triangulation in its GitHub README. The animation is all you need to understand what we're trying to accomplish. We are tokenizing a polygon, turning it into triangles. We know we've matched a document if we match a triangle in the polygon. Triangles are described with three points and are therefore easy to store. It was not necessary to study tessellation for this implementation; it was enough to choose an implementation that met the requirements: it works on integers and it tracks boundary edges.

With the polygon broken down into triangles we can store each triangle and the DocId associated with the polygon. We also store the boundary edges that the triangle shares with the tessellated polygon. We can test whether a polygon is within a bounding box by testing whether a boundary edge crosses the bounding box. If it does, the polygon is marked as excluded; if it doesn't, it is included. Take the difference to get the set of documents within the bounding box. For intersection, all we need to know is whether the bounding box overlaps the triangle in any way.
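
The "within" bookkeeping then reduces to a set difference; a sketch with plain slices standing in for the bit sets (names are mine):

```rust
/// A document is within the query box only if at least one of its triangles
/// matched and none of its boundary edges crossed the box.
fn within(matched: &[bool], excluded: &[bool]) -> Vec<u32> {
    let mut hits = Vec::new();
    for doc_id in 0..matched.len() {
        if matched[doc_id] && !excluded[doc_id] {
            hits.push(doc_id as u32);
        }
    }
    hits
}
```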

In this way, this pull request implements an orthogonal range query to search for 2-dimensional shapes within axis-aligned partitions, i.e. bounding-box search.


To grasp the kd-concept quickly, it is enough to see a simple tree and the space it partitions. Otherwise, you could read a Medium article. You will come away with an understanding of how to partition two-dimensional space. You will know how to search for a point in a two-dimensional kd-tree. It's a matter of descending the tree and choosing a path based on the depth of the tree. There are plenty of little examples of this 2-dimensional kd-tree of points in every programming language.

You will be left wondering how to search for a triangle in two-dimensional space. Nowhere will you find a description of how to do this. A search for answers is going to trip over the fact that the word "dimension" is overloaded. Our minimal kd-tree stores 1-dimensional spatial data in a tree that it describes as 2-dimensional. When we use the term "dimension" to describe a kd-tree we are describing the number of partitioning dimensions of the tree, not the spatial dimension of the indexed forms.

What I found, what finally clicked, was a single "teach a man to fish" answer to a StackOverflow question. A 4-dimensional point can represent a bounding box, where the partitioning dimensions are the min and max of the x and the y. I wrote a program to create a kd-tree with 4 dimensions and was able to search for a box with a box "using the range (a0,+inf) x (-inf, a1) x (b0, +inf) x (-inf, b1)".
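
To make the hyper-point idea concrete: store each box as the 4-dimensional point (max_x, min_x, max_y, min_y), and that quoted range becomes an ordinary overlap test (my rendering, closed at the boundary):

```rust
/// A bounding box viewed as a point in four partitioning dimensions.
struct Box4 { min_x: i32, max_x: i32, min_y: i32, max_y: i32 }

/// Boxes intersecting the query [a0, a1] x [b0, b1] are exactly the 4-d
/// points in (a0, +inf) x (-inf, a1) x (b0, +inf) x (-inf, b1).
fn intersects(stored: &Box4, a0: i32, a1: i32, b0: i32, b1: i32) -> bool {
    stored.max_x >= a0 && stored.min_x <= a1 && stored.max_y >= b0 && stored.min_y <= b1
}
```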

Meanwhile, I had been stepping through the Lucene search where I'd see the bounding boxes in the nodes during search and knew that Lucene was depending on the properties of a kd-tree to build the tree, but not to search the tree. It did not search the tree using a 4-dimensional point, it used bounding boxes instead. It would build the bounding boxes as it partitioned the tree.


Another muddle on the way to understanding is the concept of a "block" kd-tree. Seeing BKDReader.java and BKDWriter.java in Lucene implied that Lucene had implemented a BKD-tree as defined in Bkd-Tree: A Dynamic Scalable kd-Tree, and so I assumed that whatever was in the paper was in Lucene. The paper describes an optimization based on creating LSM-like sets of progressively larger kd-trees. Kd-trees are difficult to update, which is what the paper addresses, but Lucene and Tantivy already address this problem with segments. There's only one kd-tree for a spatial field per segment, and that tree is rewritten when segments merge.

I never looked at the paper again once I began stepping through the Lucene code. There I saw the bounding box in the kd-tree node used for search and affirmed my hyper-point-as-bounding-box understanding by seeing that Lucene was indeed indexing the 2-dimensional (spatial) triangle bounding box using a 4-dimensional (partitioning) tree. I saw, too, that a single tree was built by bulk partitioning. I never referenced the original paper myself, but here I am standing on the shoulders of giants, reading the theory in practice, so I'm not the one to ask how the paper influenced the design of Lucene; I can only talk about how Lucene influenced the design of this pull request.

However, since we are not creating an LSM-like structure nor using the serialization strategy presented in the paper, I'm going to rename bkd.rs to kd.rs to put an end to the red herring for future readers. The term "block" in the paper references an I/O optimization to the serialized tree structure that is employed by neither this pull request nor Lucene.


That's enough red herrings. The tree is a 4-dimensional kd-tree built with a recursive partitioning of an array of triangles. The 4-dimensions represent the four positions of the two points in a bounding box. We partition our array of triangles on the kd-tree dimension with the widest spread, the greatest distance between the min and max value. We calculate a bounding box to encapsulate the triangles in the array and then call our partition function on the left and right partitions. We create an inner node that has a bounding box and a left and right child reference. When an array is less than a certain size, that's a leaf.

If you look at the write_leaf_pages function, you'll see the partitioning. As of this commit, that is all the k-dimensional code, the rest is bounding box based. In fact, kd-tree is probably a misnomer. We're building a binary bounding box tree (a name I just now made up) using k-dimensional partitioning.


P.S. — As you can see, the implementation in this pull request does not implement a generic kd-tree like Lucene. We do not keep the partition dimension in the node, we throw it away and use just the bounding box. Lucene uses its kd-tree for range queries, but I believe Tantivy has "fast fields" for that use case. It has some code to do kNN with kd-trees, but the primary index it uses for kNN is a Hierarchical Navigable Small World index. I'd speculate that the kd-tree kNN was an offering that was superseded by HNSW.

This reverts commit 19eab16.

Restore radix select in order to implement a merge solution that will
not require a temporary file.
Existing code behaved as if the result of `select_nth_unstable_by_key`
was either a sorted array or the product of an algorithm that gathered
partition values as in the Dutch national flag problem. The existing
code was written knowing that the former isn't true and the latter isn't
advertised. Knowing, but not remembering. Quite the oversight.