Skip to content
This repository was archived by the owner on Dec 1, 2025. It is now read-only.

Conversation

@dougbrn
Copy link
Collaborator

@dougbrn dougbrn commented Mar 3, 2025

Resolves #86

Adds a wrapper for sort_values. This one is a bit different than the usual wrapper as dask has more of it's own functionality because sorting is a much more dask-sensitive operation. This is kind of a crude solution, but we effectively just call dasks sort_values when not sorting on nested layers, but then call a map_partitioned version of nested-pandas sort_values when working on nested layers.

These really act as two different functions under the same namespace, as top-level sorting invokes the neccesary parallelization and consequences of a shuffle that you'd expect with dask, dropping divisions. But sorting on nested layers is a contained problem that avoids dask shuffles and preserves the divisions information. We could break these out, but I think it would be better to make the functionality "feel" like nested-pandas so that would instead motivate breaking the functions out at the nested-pandas level too. The main issue I have with keeping them as one is that the kwarg set is subtly switching from one API to the other, and so things like optional kwargs may be confusing if a user switches their sorting from base->nested or vice-versa.

^But realistically, it may be fine to keep these as a single modified function until we get feedback that it's confusing in real use cases?

@github-actions
Copy link

github-actions bot commented Mar 3, 2025

Before [413eb33] After [d3c5477] Ratio Benchmark (Parameter)
828±3ms 834±4ms 1.01 benchmarks.NestedFrameQuery.time_run
278±1ms 280±2ms 1.01 benchmarks.NestedFrameReduce.time_run
137M 137M 1 benchmarks.NestedFrameAddNested.peakmem_run
139M 139M 1 benchmarks.NestedFrameQuery.peakmem_run
135M 136M 1 benchmarks.NestedFrameReduce.peakmem_run
271±0.5ms 267±1ms 0.99 benchmarks.NestedFrameAddNested.time_run

Click here to view all benchmarks.

@dougbrn
Copy link
Collaborator Author

dougbrn commented Mar 3, 2025

Unit tests fail due to nested-pandas not having a release version of sort_values

@codecov
Copy link

codecov bot commented Mar 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.76%. Comparing base (413eb33) to head (806977a).
Report is 14 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #88      +/-   ##
==========================================
+ Coverage   95.45%   95.76%   +0.31%     
==========================================
  Files           9        9              
  Lines         242      260      +18     
==========================================
+ Hits          231      249      +18     
  Misses         11       11              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dougbrn dougbrn requested a review from hombit March 3, 2025 22:41
Copy link
Collaborator

@hombit hombit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thank you. I'm opening an issue for LSDB!

@dougbrn dougbrn merged commit 49d26eb into main Mar 4, 2025
9 checks passed
@dougbrn dougbrn deleted the wrap_sort_values branch March 4, 2025 16:28
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wrap sort_values

3 participants