Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cascaded / Unary union #1246

Merged
merged 3 commits into from
Dec 19, 2024
Merged

Cascaded / Unary union #1246

merged 3 commits into from
Dec 19, 2024

Conversation

urschrei
Copy link
Member

@urschrei urschrei commented Oct 31, 2024

  • I agree to follow the project's code of conduct.
  • I added an entry to CHANGES.md if knowledge of this change could be valuable to users.

This is a draft since it currently has unknown perf characteristics, and there may be better ways of defining the trait implementers

@lnicola
Copy link
Member

lnicola commented Oct 31, 2024

Did you ask for some polygons? They're not WKT (I can help with the conversion to if needed), but here's a bunch of them I just happened to have on hand: https://data.europa.eu/data/datasets/d18f6f95-1d96-4b86-a4ba-291a416e45bb?locale=en.

They're a mix of polygons and multipolygons, but I'm not sure why that would be a problem.

@urschrei
Copy link
Member Author

urschrei commented Oct 31, 2024

Some preliminary perf data:

a 73.8 mb WKT file, containing 39490 polygons (extract from large-scale agricultural data, thanks @lnicola!) can be unioned into a MultiPolygon in around 11 seconds in release mode. The naïve approach:

let unioned = data
    .iter()
    .fold(MultiPolygon::<f64>::new(vec![]), |acc, poly| {
        acc.union(poly)
    });

hadn't terminated after 18 minutes, so this is at least a 100x improvement (?).

Next I need to verify that the algorithm produces valid output (no reason to think it doesn't)

cc @dabreegster please take this for a spin with more real-world data when you get a chance!

@urschrei urschrei changed the title Cascaded union Cascaded / Unary union Oct 31, 2024
@urschrei
Copy link
Member Author

urschrei commented Nov 1, 2024

Since we can union MultiPolygons, I'd like to make this available for them too, but I don't think it's possible to generically express that, because you need concrete types for the init, fold, and reduce function definitions and they vary slightly between Polygon and MultiPolygon.

I'd also be happy to write a second impl for MultiPolygon, but rustc also complains about that for some reason, even though the bounds are different:

impl<T, Container> UnaryUnion for Container
where
    T: BoolOpsNum,
    Container: IntoIterator<Item = MultiPolygon<T>> + Clone,
    MultiPolygon<T>: RTreeObject,
{ ... }

@dabreegster
Copy link
Contributor

I'm seeing great results from INSPIRE parcels. https://www.dropbox.com/scl/fi/8w9ufilq2c7wq1vp4fei0/London_Borough_of_Southwark.zip?rlkey=ejcug1i5fnkf35d2xjwcogz9m&dl=0

Using mapshaper v1.geojson -dissolve -explode -o v2.geojson format=geojson geojson-type=FeatureCollection, it takes 1.3s to dissolve 40k features into 7.5k. Using the new API in this PR (via https://github.com/acteng/will-it-fit/tree/use_unary_union), it takes 5.1s to dissolve to 1.1k. This implementation handles inputs that already overlap better than mapshaper.

Mapshaper output, darker area is features overlapping originally that weren't dissolved:
image
Rust output handles these:
image

There's more, larger INSPIRE inputs. I'm on battery power right now, so not going to push it :) But a larger run in Birmingham on 60k features ran in about 30s (I didn't time it exactly) and produced 17k outputs. I have even larger OS data I could try, but can't share the inputs.
https://github.com/acteng/will-it-fit/blob/use_unary_union/data_prep/inspire_one_area.sh

It's particularly impressive/useful that the base boolean ops work on both projected and un-projected inputs!

No feedback on this API; it seems fine to me. I have a more niche use case of limiting the maximum area of unioned polygons, so I'll probably adapt what's here to do that. I don't think baking in support for that level of customization makes sense.

@michaelkirk
Copy link
Member

michaelkirk commented Nov 1, 2024

un-projected inputs

As I understand it, the math is all fundamentally euclidean, so you're playing with fire if you're doing these calculations without first projecting to the euclidean plane.

Fire can be kind of fun though.

@michaelkirk
Copy link
Member

speaking of a multithreading flag - this feature might be a good candidate for rayon. (probably not in this PR, but it'd be interesting to see)

@kylebarron
Copy link
Member

this feature might be a good candidate for rayon

It might be good to have a separate discussion thread about rayon API design. In particular, I'd love it if any rayon integration could use a rayon thread pool provided by the caller, so that higher level apps that already use rayon don't have work split across multiple rayon pools.

@michaelkirk
Copy link
Member

michaelkirk commented Nov 1, 2024

Since we can union MultiPolygons, I'd like to make this available for them too,

Could you implement it in terms of impl BooleanOps<Scalar = Self::Scalar> (both MultiPolygon and Polygon implement that)

Something like:

- Container: IntoIterator<Item = Polygon<T>> + Clone,
+ Container: IntoIterator<Item = impl BooleanOps<Scalar = T>> + Clone,

This should work for both Iterator<Item=Polygon> and Iterator<Item=MultiPolygon>, but if you wanted a single iterator, you'd need to do something like Iterator<Item=Box<dyn BooleanOps>>

@urschrei
Copy link
Member Author

urschrei commented Nov 2, 2024

Since we can union MultiPolygons, I'd like to make this available for them too,

Could you implement it in terms of impl BooleanOps<Scalar = Self::Scalar> (both MultiPolygon and Polygon implement that)

Something like:

- Container: IntoIterator<Item = Polygon<T>> + Clone,
+ Container: IntoIterator<Item = impl BooleanOps<Scalar = T>> + Clone,

This should work for both Iterator<Item=Polygon> and Iterator<Item=MultiPolygon>, but if you wanted a single iterator, you'd need to do something like Iterator<Item=Box<dyn BooleanOps>>

You can't use impl in trait bounds, so that won't work as written (and it has to be an RTreeObject but that can probably be tagged on.

I think we will actually need separate impls even if I can figure out the above because I'm struggling to imagine how we'd specialise the fold function definition inside unary_union:

it's currently

let fold = |mut accum: MultiPolygon<T>, poly: &Polygon<T>| -> MultiPolygon<T> {
            accum = accum.union(poly);
            accum
        };

but for MultiPolygon it would have to be

let fold = |mut accum: MultiPolygon<T>, mpoly: &MultiPolygon<T>| -> MultiPolygon<T> {
            accum = accum.union(mpoly);
            accum
        };

@urschrei
Copy link
Member Author

urschrei commented Nov 2, 2024

I wonder if this would work if Polygon and MultiPolygon were enum members and BooleanOps was impl'd on the enum (followed by UnaryUnion).

@urschrei
Copy link
Member Author

urschrei commented Nov 2, 2024

No point in tying myself in knots playing type Tetris: MultiPolygon doesn't impl RTreeObject anyway (we should probably do that) so it'll have to be a manual op for now (it's not hugely complex: you flat_map the MultiPolygon containers).

I have a more niche use case of limiting the maximum area of unioned polygons, so I'll probably adapt what's here to do that. I don't think baking in support for that level of customization makes sense.

I think this might be tricky to implement either on your side or in geo as the fold operation definitionally has a scalar output: there's no way to skip geometries or end early – those geometries then wouldn't appear in the output at all, which isn't what you want.

@urschrei urschrei marked this pull request as ready for review November 2, 2024 16:30
@urschrei
Copy link
Member Author

urschrei commented Nov 2, 2024

For follow-on work, a preliminary parallel version was proposed in georust/rstar#80 (comment):

fn parallel_bottom_up_fold_reduce<T, S, I, F, R>(tree: &RTree<T>, init: I, fold: F, reduce: R) -> S
where
    T: RTreeObject,
    RTreeNode<T>: Send + Sync,
    S: Send,
    I: Fn() -> S + Send + Sync,
    F: Fn(S, &T) -> S + Send + Sync,
    R: Fn(S, S) -> S + Send + Sync,
{
    fn inner<T, S, I, F, R>(parent: &ParentNode<T>, init: &I, fold: &F, reduce: &R) -> S
    where
        T: RTreeObject,
        RTreeNode<T>: Send + Sync,
        S: Send,
        I: Fn() -> S + Send + Sync,
        F: Fn(S, &T) -> S + Send + Sync,
        R: Fn(S, S) -> S + Send + Sync,
    {
        parent
            .children()
            .into_par_iter()
            .fold(init, |accum, child| match child {
                RTreeNode::Leaf(value) => fold(accum, value),
                RTreeNode::Parent(parent) => {
                    let value = inner(parent, init, fold, reduce);

                    reduce(accum, value)
                }
            })
            .reduce(init, reduce)
    }

    inner(tree.root(), &init, &fold, &reduce)
}

@michaelkirk
Copy link
Member

type Tetris

The type tetris isn't too bad. But concerningly the test doesn't pass for multipolygon.

I'm not sure if this represents a bug in BooleanOps or in UnaryUnion (or something else): 39a2f62

I'm OK without merging support for MultiPolygons at present, but wanted to highlight this behavior in case it's meaningful to you.

@urschrei
Copy link
Member Author

urschrei commented Nov 2, 2024

type Tetris

The type tetris isn't too bad. But concerningly the test doesn't pass for multipolygon.

I'm not sure if this represents a bug in BooleanOps or in UnaryUnion (or something else): 39a2f62

I'm OK without merging support for MultiPolygons at present, but wanted to highlight this behavior in case it's meaningful to you.

Oh nice job, I burned a couple of hours on these today. The test result should be the same AFAIK. What's the actual output geometry??

@adamreichold
Copy link
Member

Some preliminary perf data:

One thing I wonder about the performance is whether the [CachedEnvelope] combinator can help with the R tree construction due the relatively expensive envelope computation?

@urschrei
Copy link
Member Author

urschrei commented Nov 3, 2024

type Tetris

The type tetris isn't too bad. But concerningly the test doesn't pass for multipolygon.
I'm not sure if this represents a bug in BooleanOps or in UnaryUnion (or something else): 39a2f62
I'm OK without merging support for MultiPolygons at present, but wanted to highlight this behavior in case it's meaningful to you.

Oh nice job, I burned a couple of hours on these today. The test result should be the same AFAIK. What's the actual output geometry??

The correct output should be:

Polygon { exterior:
    LineString([
        Coord { x: 200.38308698623, y: 288.3387931645567 },
        Coord { x: 204.07584924189865, y: 288.2701221168692 },
        Coord { x: 210.99999999939024, y: 291.99999999857494 },
        Coord { x: 209.99999999939024, y: 289.99999999857494 },
        Coord { x: 212.24082540659725, y: 285.47846008694717 },
        Coord { x: 205.630617245422, y: 287.73853614038774 },
        Coord { x: 204.00000000684082, y: 287.0000000060255 },
        Coord { x: 200.38308698623, y: 288.3387931645567 }]), interiors: [] }]

But what we get is:

[Polygon { exterior: 
    LineString([
        Coord { x: 200.38308698977124, y: 288.3387931571061 },
        Coord { x: 204.07584924543988, y: 288.2701221094186 },
        Coord { x: 205.63061724896323, y: 287.73853613293716 },
        Coord { x: 204.00000001038205, y: 287.0000000060255 },
        Coord { x: 200.38308698977124, y: 288.3387931571061 }]), interiors: [] },
Polygon { exterior: LineString([
    Coord { x: 204.0758492379893, y: 288.2701221168692 },
    Coord { x: 210.9999999954809, y: 291.99999999857494 },
    Coord { x: 209.9999999954809, y: 289.99999999857494 },
    Coord { x: 212.24082541013848, y: 285.47846008694717 },
    Coord { x: 212.24082539523732, y: 285.47846009439775 },
    Coord { x: 212.2408254026879, y: 285.47846008694717 },
    Coord { x: 205.63061724896323, y: 287.73853613293716 },
    Coord { x: 205.6306172564138, y: 287.73853614038774 },
    Coord { x: 204.07584926034104, y: 288.2701221094186 },
    Coord { x: 204.07584924543988, y: 288.2701221094186 },
    Coord { x: 204.0758492379893, y: 288.2701221168692 }]), interiors: [] }
];

That's not any kind of correct. Hm.

@urschrei
Copy link
Member Author

urschrei commented Nov 3, 2024

type Tetris

The type tetris isn't too bad. But concerningly the test doesn't pass for multipolygon.
I'm not sure if this represents a bug in BooleanOps or in UnaryUnion (or something else): 39a2f62
I'm OK without merging support for MultiPolygons at present, but wanted to highlight this behavior in case it's meaningful to you.

Oh nice job, I burned a couple of hours on these today. The test result should be the same AFAIK. What's the actual output geometry??

The correct output should be:

Polygon { exterior:
    LineString([
        Coord { x: 200.38308698623, y: 288.3387931645567 },
        Coord { x: 204.07584924189865, y: 288.2701221168692 },
        Coord { x: 210.99999999939024, y: 291.99999999857494 },
        Coord { x: 209.99999999939024, y: 289.99999999857494 },
        Coord { x: 212.24082540659725, y: 285.47846008694717 },
        Coord { x: 205.630617245422, y: 287.73853614038774 },
        Coord { x: 204.00000000684082, y: 287.0000000060255 },
        Coord { x: 200.38308698623, y: 288.3387931645567 }]), interiors: [] }]

But what we get is:

[Polygon { exterior: 
    LineString([
        Coord { x: 200.38308698977124, y: 288.3387931571061 },
        Coord { x: 204.07584924543988, y: 288.2701221094186 },
        Coord { x: 205.63061724896323, y: 287.73853613293716 },
        Coord { x: 204.00000001038205, y: 287.0000000060255 },
        Coord { x: 200.38308698977124, y: 288.3387931571061 }]), interiors: [] },
Polygon { exterior: LineString([
    Coord { x: 204.0758492379893, y: 288.2701221168692 },
    Coord { x: 210.9999999954809, y: 291.99999999857494 },
    Coord { x: 209.9999999954809, y: 289.99999999857494 },
    Coord { x: 212.24082541013848, y: 285.47846008694717 },
    Coord { x: 212.24082539523732, y: 285.47846009439775 },
    Coord { x: 212.2408254026879, y: 285.47846008694717 },
    Coord { x: 205.63061724896323, y: 287.73853613293716 },
    Coord { x: 205.6306172564138, y: 287.73853614038774 },
    Coord { x: 204.07584926034104, y: 288.2701221094186 },
    Coord { x: 204.07584924543988, y: 288.2701221094186 },
    Coord { x: 204.0758492379893, y: 288.2701221168692 }]), interiors: [] }
];

That's not any kind of correct. Hm.

I think this is a bug in the MultiPolygon union algorithm (not clear if it stems from i_overlay):

Using the test polygons, you should get the same result if you do:

let empty = MultiPolygon::<f64>::new(vec![]);
// multi_poly_1 is [poly1, poly2]
let empty_union = empty.union(&multi_poly_1);

// and
let mp1 = MultiPolygon::new(vec![poly1.clone()]);
let mp2 = MultiPolygon::new(vec![poly2.clone()]);
let mpu = &mp1.union(&mp2);

…but you don't. That's a bug unless I'm missing a detail about how MultiPolygon union is supposed to work.

@michaelkirk
Copy link
Member

michaelkirk commented Nov 3, 2024

I'm not sure if it's relevant, but it might be interesting that MultiPolygon(vec![poly1, poly2]) is not a valid multipolygon

Screenshot 2024-11-02 at 21 20 58
MULTIPOLYGON(
  ((204 287,206.69670020700084 288.2213844497616,200.38308697914755 288.338793163584,204 287)),
  ((210 290,204.07584923592933 288.2701221108328,212.24082541367974 285.47846008552216,210 290))
)

It'd be nice to be able to handle this case - but it might be a hint towards resolving it.

@michaelkirk
Copy link
Member

michaelkirk commented Nov 3, 2024

I was able to get the tests passing by making the multipolygon valid: 5900563

Note the usage of assert_relative_eq, some of the output points are slightly shifted (1e-10), but I think that's expected with the grid snapping in i_overlay.

@urschrei urschrei mentioned this pull request Nov 3, 2024
2 tasks
@urschrei
Copy link
Member Author

urschrei commented Nov 3, 2024

This will pass when #1254 merges.

I was able to get the tests passing by making the multipolygon valid: 5900563

Note the usage of assert_relative_eq, some of the output points are slightly shifted (1e-10), but I think that's expected with the grid snapping in i_overlay.

Urgh I should have checked validity before anything else. Sorry.

Copy link
Member

@michaelkirk michaelkirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get this when building:

   |
12 |     primitives::CachedEnvelope, primitives::ObjectRef, ParentNode, RTree, RTreeNode, RTreeObject,
   |                                 ^^^^^^^^^^^^^^^^^^^^^ no `ObjectRef` in `primitives`

Do you need to bump the minimum version of rstar?

@urschrei
Copy link
Member Author

urschrei commented Nov 7, 2024

Sorry, done.

@urschrei
Copy link
Member Author

urschrei commented Dec 1, 2024

The difference in union behaviour (it's not necessarily erroneous) between single-threaded and multi-threaded ops needs to be explained before we can merge this, IMO. I'll dig into it ASAP because I'd like this to land.

@urschrei
Copy link
Member Author

I didn't get a conclusive answer about the small mismatch between multi-threaded and single-threaded geometry unions, so let's press on. Please review!

@michaelkirk
Copy link
Member

Can you clarify how to reproduce the difference between serial vs. multithreaded?

Is it like this:

let wkt_str = std::fs::read_to_string("parcels.wkt").unwrap();
let multi_polygon: MultiPolygon = MultiPolygon::try_from_wkt_str(&wkt_str).unwrap();
assert_eq!(multi_polygon.0.len(), 39490);

let serial_union= {
  let start = Instant::now();
  let union = multi_polygon.unary_union();
  let duration = start.elapsed();
  println!("Time elapsed: {:.2?}", duration);
  union
};
let parallel_union = {
  let start = Instant::now();
  let union = AllowMultithreading(&multi_polygon).unary_union();
  let duration = start.elapsed();
  println!("Time elapsed: {:.2?}", duration);
  union
};

if serial_union == parallel_union {
  println!(">> output is equal!");
} else {
  println!(">> output not equal!");
}

@urschrei
Copy link
Member Author

Can you clarify how to reproduce the difference between serial vs. multithreaded?

Is it like this:

let wkt_str = std::fs::read_to_string("parcels.wkt").unwrap();
let multi_polygon: MultiPolygon = MultiPolygon::try_from_wkt_str(&wkt_str).unwrap();
assert_eq!(multi_polygon.0.len(), 39490);

let serial_union= {
  let start = Instant::now();
  let union = multi_polygon.unary_union();
  let duration = start.elapsed();
  println!("Time elapsed: {:.2?}", duration);
  union
};
let parallel_union = {
  let start = Instant::now();
  let union = AllowMultithreading(&multi_polygon).unary_union();
  let duration = start.elapsed();
  println!("Time elapsed: {:.2?}", duration);
  union
};

if serial_union == parallel_union {
  println!(">> output is equal!");
} else {
  println!(">> output not equal!");
}

Correct!

@michaelkirk
Copy link
Member

Here's my go at implementing the "simplification" based approach recommended by Nail over in iShape-Rust/iOverlay#16 (comment)

https://github.com/georust/geo/compare/mkirk/unary_union?expand=1

It is quite a bit faster!

Using our existing nl_plots test fixture as input, projected from lng/lat to a local euclidean projection.

Time elapsed (naive): 28.34ms
Time elapsed (serial recursive): 4.72ms
Time elapsed (multithreaded recursive): 2.93ms
Time elapsed (simplification): 165.13µs

I also ran a test with @urschrei's 72MB parcels.wkt and got similar results (though linearly larger, which makes sense given the much larger dataset):

Time elapsed (naive): I gave up after waiting 10 minutes.
Time elapsed (serial recursive): 6.15s
Time elapsed (multithreaded recursive): 1.62s
Time elapsed (simplification): 333.45ms

I've tried a few inputs, and haven't found one where the simplification approach is slower.

@urschrei
Copy link
Member Author

Kill your darlings 😭; I can't argue with a 5x speedup that has the same quality output. I think most of the doc changes and additions can be copied over to a new PR without much change @michaelkirk, and then we can close this in favour of that and merge.

@michaelkirk
Copy link
Member

Ok @urschrei - I've pushed up a commit that uses the new simplification based approach.

I've also reverted the dependency changes, since those are no longer relevant to the new approach.

Mostly as a matter of personal taste, I exposed a free function fn unary_union rather than a UnaryUnion trait with a unary_union method. It seemed more discoverable, less ceremonial, but I could swap back if you prefer.

@michaelkirk
Copy link
Member

(note CI is failing due to #1283)

@urschrei
Copy link
Member Author

Amazing work. I can't approve because I'm officially the PR author, so someone else will need to do it.

urschrei and others added 3 commits December 19, 2024 13:55
This makes the unary unary union algorithm available to anything that can produce an iterator of Polygons (MultiPolygon, containers of Polygon) and enables on-demand multi-threaded unary unions with a wrapper.
Also adds a test for the multithread vs. serial discrepancy.

    Time elapsed (naive): 28.34ms
    Time elapsed (serial recursive): 4.72ms
    Time elapsed (multithreaded recursive): 2.93ms
    Time elapsed (simplification): 165.13µs

This test uses our existing nl_plots test fixture as input, projected
from lng/lat to a local euclidean projection.

I also ran a test with @urschrei's 72MB parcels.wkt and got similar
results (though linearly larger, which makes sense given the much larger
dataset):

    Time elapsed (naive): I gave up after waiting 10 minutes.
    Time elapsed (serial recursive): 6.15s
    Time elapsed (multithreaded recursive): 1.62s
    Time elapsed (simplification): 333.45ms
@michaelkirk michaelkirk added this pull request to the merge queue Dec 19, 2024
Merged via the queue into georust:main with commit d28970f Dec 19, 2024
17 of 18 checks passed
@michaelkirk michaelkirk mentioned this pull request Dec 19, 2024
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants