-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cascaded / Unary union #1246
Cascaded / Unary union #1246
Conversation
Did you ask for some polygons? They're not WKT (I can help with the conversion to if needed), but here's a bunch of them I just happened to have on hand: https://data.europa.eu/data/datasets/d18f6f95-1d96-4b86-a4ba-291a416e45bb?locale=en. They're a mix of polygons and multipolygons, but I'm not sure why that would be a problem. |
Some preliminary perf data: a 73.8 mb WKT file, containing 39490 polygons (extract from large-scale agricultural data, thanks @lnicola!) can be unioned into a let unioned = data
.iter()
.fold(MultiPolygon::<f64>::new(vec![]), |acc, poly| {
acc.union(poly)
}); hadn't terminated after 18 minutes, so this is at least a 100x improvement (?). Next I need to verify that the algorithm produces valid output (no reason to think it doesn't) cc @dabreegster please take this for a spin with more real-world data when you get a chance! |
46d5a17
to
885f428
Compare
Since we can I'd also be happy to write a second impl for impl<T, Container> UnaryUnion for Container
where
T: BoolOpsNum,
Container: IntoIterator<Item = MultiPolygon<T>> + Clone,
MultiPolygon<T>: RTreeObject,
{ ... } |
I'm seeing great results from INSPIRE parcels. https://www.dropbox.com/scl/fi/8w9ufilq2c7wq1vp4fei0/London_Borough_of_Southwark.zip?rlkey=ejcug1i5fnkf35d2xjwcogz9m&dl=0 Using Mapshaper output, darker area is features overlapping originally that weren't dissolved: There's more, larger INSPIRE inputs. I'm on battery power right now, so not going to push it :) But a larger run in Birmingham on 60k features ran in about 30s (I didn't time it exactly) and produced 17k outputs. I have even larger OS data I could try, but can't share the inputs. It's particularly impressive/useful that the base boolean ops work on both projected and un-projected inputs! No feedback on this API; it seems fine to me. I have a more niche use case of limiting the maximum area of unioned polygons, so I'll probably adapt what's here to do that. I don't think baking in support for that level of customization makes sense. |
As I understand it, the math is all fundamentally euclidean, so you're playing with fire if you're doing these calculations without first projecting to the euclidean plane. Fire can be kind of fun though. |
speaking of a multithreading flag - this feature might be a good candidate for rayon. (probably not in this PR, but it'd be interesting to see) |
It might be good to have a separate discussion thread about rayon API design. In particular, I'd love it if any rayon integration could use a rayon thread pool provided by the caller, so that higher level apps that already use rayon don't have work split across multiple rayon pools. |
Could you implement it in terms of Something like: - Container: IntoIterator<Item = Polygon<T>> + Clone,
+ Container: IntoIterator<Item = impl BooleanOps<Scalar = T>> + Clone, This should work for both |
You can't use I think we will actually need separate impls even if I can figure out the above because I'm struggling to imagine how we'd specialise the it's currently let fold = |mut accum: MultiPolygon<T>, poly: &Polygon<T>| -> MultiPolygon<T> {
accum = accum.union(poly);
accum
}; but for MultiPolygon it would have to be let fold = |mut accum: MultiPolygon<T>, mpoly: &MultiPolygon<T>| -> MultiPolygon<T> {
accum = accum.union(mpoly);
accum
}; |
I wonder if this would work if Polygon and MultiPolygon were enum members and BooleanOps was impl'd on the enum (followed by UnaryUnion). |
899c782
to
79290f4
Compare
No point in tying myself in knots playing type Tetris:
I think this might be tricky to implement either on your side or in |
For follow-on work, a preliminary parallel version was proposed in georust/rstar#80 (comment): fn parallel_bottom_up_fold_reduce<T, S, I, F, R>(tree: &RTree<T>, init: I, fold: F, reduce: R) -> S
where
T: RTreeObject,
RTreeNode<T>: Send + Sync,
S: Send,
I: Fn() -> S + Send + Sync,
F: Fn(S, &T) -> S + Send + Sync,
R: Fn(S, S) -> S + Send + Sync,
{
fn inner<T, S, I, F, R>(parent: &ParentNode<T>, init: &I, fold: &F, reduce: &R) -> S
where
T: RTreeObject,
RTreeNode<T>: Send + Sync,
S: Send,
I: Fn() -> S + Send + Sync,
F: Fn(S, &T) -> S + Send + Sync,
R: Fn(S, S) -> S + Send + Sync,
{
parent
.children()
.into_par_iter()
.fold(init, |accum, child| match child {
RTreeNode::Leaf(value) => fold(accum, value),
RTreeNode::Parent(parent) => {
let value = inner(parent, init, fold, reduce);
reduce(accum, value)
}
})
.reduce(init, reduce)
}
inner(tree.root(), &init, &fold, &reduce)
} |
The type tetris isn't too bad. But concerningly the test doesn't pass for multipolygon. I'm not sure if this represents a bug in BooleanOps or in UnaryUnion (or something else): 39a2f62 I'm OK without merging support for MultiPolygons at present, but wanted to highlight this behavior in case it's meaningful to you. |
Oh nice job, I burned a couple of hours on these today. The test result should be the same AFAIK. What's the actual output geometry?? |
One thing I wonder about the performance is whether the [ |
The correct output should be: Polygon { exterior:
LineString([
Coord { x: 200.38308698623, y: 288.3387931645567 },
Coord { x: 204.07584924189865, y: 288.2701221168692 },
Coord { x: 210.99999999939024, y: 291.99999999857494 },
Coord { x: 209.99999999939024, y: 289.99999999857494 },
Coord { x: 212.24082540659725, y: 285.47846008694717 },
Coord { x: 205.630617245422, y: 287.73853614038774 },
Coord { x: 204.00000000684082, y: 287.0000000060255 },
Coord { x: 200.38308698623, y: 288.3387931645567 }]), interiors: [] }] But what we get is: [Polygon { exterior:
LineString([
Coord { x: 200.38308698977124, y: 288.3387931571061 },
Coord { x: 204.07584924543988, y: 288.2701221094186 },
Coord { x: 205.63061724896323, y: 287.73853613293716 },
Coord { x: 204.00000001038205, y: 287.0000000060255 },
Coord { x: 200.38308698977124, y: 288.3387931571061 }]), interiors: [] },
Polygon { exterior: LineString([
Coord { x: 204.0758492379893, y: 288.2701221168692 },
Coord { x: 210.9999999954809, y: 291.99999999857494 },
Coord { x: 209.9999999954809, y: 289.99999999857494 },
Coord { x: 212.24082541013848, y: 285.47846008694717 },
Coord { x: 212.24082539523732, y: 285.47846009439775 },
Coord { x: 212.2408254026879, y: 285.47846008694717 },
Coord { x: 205.63061724896323, y: 287.73853613293716 },
Coord { x: 205.6306172564138, y: 287.73853614038774 },
Coord { x: 204.07584926034104, y: 288.2701221094186 },
Coord { x: 204.07584924543988, y: 288.2701221094186 },
Coord { x: 204.0758492379893, y: 288.2701221168692 }]), interiors: [] }
]; That's not any kind of correct. Hm. |
I think this is a bug in the Using the test polygons, you should get the same result if you do: let empty = MultiPolygon::<f64>::new(vec![]);
// multi_poly_1 is [poly1, poly2]
let empty_union = empty.union(&multi_poly_1);
// and
let mp1 = MultiPolygon::new(vec![poly1.clone()]);
let mp2 = MultiPolygon::new(vec![poly2.clone()]);
let mpu = &mp1.union(&mp2); …but you don't. That's a bug unless I'm missing a detail about how |
I was able to get the tests passing by making the multipolygon valid: 5900563 Note the usage of |
This will pass when #1254 merges.
Urgh I should have checked validity before anything else. Sorry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get this when building:
|
12 | primitives::CachedEnvelope, primitives::ObjectRef, ParentNode, RTree, RTreeNode, RTreeObject,
| ^^^^^^^^^^^^^^^^^^^^^ no `ObjectRef` in `primitives`
Do you need to bump the minimum version of rstar?
Sorry, done. |
9f1e643
to
83e06bd
Compare
b4b3e2b
to
a2fe229
Compare
The difference in union behaviour (it's not necessarily erroneous) between single-threaded and multi-threaded ops needs to be explained before we can merge this, IMO. I'll dig into it ASAP because I'd like this to land. |
0313844
to
58cfbcb
Compare
I didn't get a conclusive answer about the small mismatch between multi-threaded and single-threaded geometry unions, so let's press on. Please review! |
Can you clarify how to reproduce the difference between serial vs. multithreaded? Is it like this: let wkt_str = std::fs::read_to_string("parcels.wkt").unwrap();
let multi_polygon: MultiPolygon = MultiPolygon::try_from_wkt_str(&wkt_str).unwrap();
assert_eq!(multi_polygon.0.len(), 39490);
let serial_union= {
let start = Instant::now();
let union = multi_polygon.unary_union();
let duration = start.elapsed();
println!("Time elapsed: {:.2?}", duration);
union
};
let parallel_union = {
let start = Instant::now();
let union = AllowMultithreading(&multi_polygon).unary_union();
let duration = start.elapsed();
println!("Time elapsed: {:.2?}", duration);
union
};
if serial_union == parallel_union {
println!(">> output is equal!");
} else {
println!(">> output not equal!");
} |
Correct! |
Here's my go at implementing the "simplification" based approach recommended by Nail over in iShape-Rust/iOverlay#16 (comment) https://github.com/georust/geo/compare/mkirk/unary_union?expand=1 It is quite a bit faster! Using our existing nl_plots test fixture as input, projected from lng/lat to a local euclidean projection.
I also ran a test with @urschrei's 72MB parcels.wkt and got similar results (though linearly larger, which makes sense given the much larger dataset):
I've tried a few inputs, and haven't found one where the simplification approach is slower. |
Kill your darlings 😭; I can't argue with a 5x speedup that has the same quality output. I think most of the doc changes and additions can be copied over to a new PR without much change @michaelkirk, and then we can close this in favour of that and merge. |
b5e7648
to
4af25d0
Compare
Ok @urschrei - I've pushed up a commit that uses the new simplification based approach. I've also reverted the dependency changes, since those are no longer relevant to the new approach. Mostly as a matter of personal taste, I exposed a free function |
(note CI is failing due to #1283) |
Amazing work. I can't approve because I'm officially the PR author, so someone else will need to do it. |
This makes the unary unary union algorithm available to anything that can produce an iterator of Polygons (MultiPolygon, containers of Polygon) and enables on-demand multi-threaded unary unions with a wrapper.
Also adds a test for the multithread vs. serial discrepancy. Time elapsed (naive): 28.34ms Time elapsed (serial recursive): 4.72ms Time elapsed (multithreaded recursive): 2.93ms Time elapsed (simplification): 165.13µs This test uses our existing nl_plots test fixture as input, projected from lng/lat to a local euclidean projection. I also ran a test with @urschrei's 72MB parcels.wkt and got similar results (though linearly larger, which makes sense given the much larger dataset): Time elapsed (naive): I gave up after waiting 10 minutes. Time elapsed (serial recursive): 6.15s Time elapsed (multithreaded recursive): 1.62s Time elapsed (simplification): 333.45ms
374e8b6
to
3bb1881
Compare
CHANGES.md
if knowledge of this change could be valuable to users.This is a draft since it currently has unknown perf characteristics, and there may be better ways of defining the trait implementers