Skip to content

Conversation

@robtandy
Copy link
Contributor

Significant time is spent allocating StageServices as the python actor, RayStage and waiting for them to bind to a listening port.

This changes the semantics of RayContext and RayDataFrame such that the RayQuerySupervisor is now created when the RayContext is created and a pool of RayStages are preallocated. When a RayDataFrame is created by the context's sql() method, stages are calculated and the number of RayStage actors are requested from the pool. When the query is finished, instead of tearing down these actors, they are simply returned to the pool.

The pool size is parameterized by min size and max size values. The pool will preallocate at the minimum size and can grow up to the maximum size. Requesting workers beyond the maximum size will raise an exception. The pool is released and ray resources are torn down when the RayContext goes out of scope.

This change makes a significant difference on TPCH benchmarks. Tested on SF100, it improved the result by 25% on a machine with very fast disk, such that the overhead of creating and tearing down ray resources was a large chunk of execution time.

This PR does not handle the pool shrinking back to a minimum size only growing, let's handle that in a subsequent change.

The tpcbench.py benchmark script, and tpc.py script accept --worker-pool-min

As RayStage actors are now longer lived, they were updated to be able to accept updated ExecutionPlans to serve. This meant that debugging issues with RayStages is a little more difficult as it no longer makes sense to name them after they stage they are hosting, because that can change. As such, they now receive friendly human readable unique names which make reading debug and info output much easier.

@robtandy
Copy link
Contributor Author

Updated to squash messy commit history from source branch

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @robtandy!

@andygrove andygrove merged commit 116734d into apache:main Feb 20, 2025
1 check passed
vmingchen added a commit to vmingchen/datafusion-ray that referenced this pull request Feb 23, 2025
1. Remove a duplicate sentence.
2. Replace backquote of the sql argument with single quote (backquote in bash is for command substitution).
3. Since apache#62, the `--worker-pool-min` is a new arg without default value and need to be provided in the TPC example commands to run.
@vmingchen vmingchen mentioned this pull request Feb 23, 2025
andygrove pushed a commit that referenced this pull request Feb 28, 2025
1. Remove a duplicate sentence.
2. Replace backquote of the sql argument with single quote (backquote in bash is for command substitution).
3. Since #62, the `--worker-pool-min` is a new arg without default value and need to be provided in the TPC example commands to run.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants