Preallocate Ray Workers #62
Merged
Significant time is spent allocating `StageService`s as the python actor `RayStage` and waiting for them to bind to a listening port.

This changes the semantics of `RayContext` and `RayDataFrame` such that the `RayQuerySupervisor` is now created when the `RayContext` is created and a pool of `RayStage`s is preallocated. When a `RayDataFrame` is created by the context's `sql()` method, stages are calculated and that number of `RayStage` actors is requested from the pool. When the query is finished, instead of tearing down these actors, they are simply returned to the pool.

The pool size is parameterized by min and max values. The pool preallocates the minimum number of workers and can grow up to the maximum. Requesting workers beyond the maximum size raises an exception. The pool is released and Ray resources are torn down when the `RayContext` goes out of scope.

This change makes a significant difference on TPCH benchmarks. Tested at SF100, it improved results by 25% on a machine with a very fast disk, where the overhead of creating and tearing down Ray resources was a large chunk of execution time.
This PR does not handle the pool shrinking back to the minimum size, only growing; let's handle that in a subsequent change.
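A minimal sketch of the pooling behaviour described above, assuming a simplified `WorkerPool` and a placeholder `Worker` actor (the class and method names here are illustrative, not the actual implementation):

```python
import ray


@ray.remote
class Worker:
    """Placeholder for a long-lived stage actor."""

    def __init__(self, name: str):
        self.name = name


class WorkerPool:
    """Preallocates `min_size` actors, grows on demand up to `max_size`,
    and raises once a request would exceed the maximum."""

    def __init__(self, min_size: int, max_size: int):
        self.max_size = max_size
        self.total = min_size
        self.idle = [Worker.remote(f"worker-{i}") for i in range(min_size)]

    def acquire(self, n: int) -> list:
        # Grow the pool if the idle set cannot satisfy the request.
        while len(self.idle) < n and self.total < self.max_size:
            self.idle.append(Worker.remote(f"worker-{self.total}"))
            self.total += 1
        if len(self.idle) < n:
            raise RuntimeError(f"requested {n} workers, pool max is {self.max_size}")
        taken, self.idle = self.idle[:n], self.idle[n:]
        return taken

    def release(self, workers: list) -> None:
        # Finished queries return their actors instead of tearing them down.
        self.idle.extend(workers)

    def shutdown(self) -> None:
        # Called when the owning context goes out of scope.
        for w in self.idle:
            ray.kill(w)
        self.idle.clear()
```

A query would call `acquire()` with the number of stages it needs and `release()` the same actors when it completes.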
The `tpcbench.py` benchmark script and the `tpc.py` script accept `--worker-pool-min`.
As `RayStage` actors are now longer lived, they were updated to accept updated `ExecutionPlan`s to serve. This makes debugging issues with `RayStage`s a little more difficult, as it no longer makes sense to name them after the stage they are hosting, because that can change. As such, they now receive friendly, human-readable unique names, which makes reading debug and info output much easier.
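For illustration, a reusable stage actor could look roughly like the sketch below; `update_plan`, `name`, and the naming scheme are assumptions for the example, not the actual `RayStage` API:

```python
import ray
import uuid


@ray.remote
class Stage:
    """Illustrative long-lived stage actor."""

    def __init__(self):
        # A stable, human-readable identity that survives across queries,
        # independent of whichever plan the actor is currently serving.
        # (Naming scheme is hypothetical.)
        self.friendly_name = f"stage-actor-{uuid.uuid4().hex[:8]}"
        self.plan = None

    def update_plan(self, serialized_plan: bytes) -> None:
        # Because the actor outlives a single query, it accepts a fresh
        # execution plan each time it is checked out of the pool.
        self.plan = serialized_plan

    def name(self) -> str:
        return self.friendly_name
```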