Several future efforts will require distributed planning to be able to perform asynchronous operations involving network calls:
Right now, distributed planning is implemented as a PhysicalOptimizerRule. However, it's hard to model everything we need to do in this project as optimizer rules: PhysicalOptimizerRule::optimize is synchronous, and we'll need to execute asynchronous operations while modifying the single-node plan.
From a more philosophical standpoint, I don't think it's very worrying that distributed planning is not implemented as an optimizer rule, since we are effectively not optimizing a query. As long as the changes do not bleed through to users and can be kept internal to this project, it should be fine.
Implementation details
I don't think there's any other option besides lazily distributing a query during execution. This is similar to what we already do for worker assignment in DistributedExec, so we can keep the same design and extend it to also perform distributed planning:
- The DistributedPhysicalOptimizerRule will just place a DistributedExec on top of the plan, and nothing else.
- The existing content of DistributedPhysicalOptimizerRule can simply be moved out to its own function and made asynchronous (async distribute_plan(plan: Arc<dyn ExecutionPlan>) -> Result<Arc<dyn ExecutionPlan>>).
- The async distribute_plan() function can be called either:
  - Lazily, right before DistributedExec::prepare_plan()
  - By the users themselves, in case they are interested in inspecting the distributed plan before execution
I'd expect this change to have no impact on either functionality or performance; it just makes it possible for the annotate_plan and distribute_plan functions to be asynchronous.
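To make the proposed flow concrete, here is a minimal, self-contained sketch of the idea. It is not the project's real code: the ExecutionPlan trait below is a heavily simplified stand-in for DataFusion's, and ScanExec, StageExec, optimizer_rule, lazily_distributed, the String error type and the tiny block_on helper are all hypothetical names introduced only for illustration. It assumes the distributed plan is computed once and cached inside DistributedExec.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

// Simplified stand-in for DataFusion's ExecutionPlan trait (assumption).
trait ExecutionPlan {
    fn name(&self) -> String;
}

// Hypothetical leaf node of a single-node plan.
struct ScanExec;
impl ExecutionPlan for ScanExec {
    fn name(&self) -> String {
        "ScanExec".to_string()
    }
}

// Hypothetical node introduced by distributed planning.
struct StageExec {
    input: Arc<dyn ExecutionPlan>,
}
impl ExecutionPlan for StageExec {
    fn name(&self) -> String {
        format!("StageExec({})", self.input.name())
    }
}

// Wrapper placed on top of the plan; the distributed plan is filled in lazily.
struct DistributedExec {
    single_node: Arc<dyn ExecutionPlan>,
    distributed: Mutex<Option<Arc<dyn ExecutionPlan>>>,
}

// The synchronous optimizer rule only wraps the plan and does nothing else.
fn optimizer_rule(plan: Arc<dyn ExecutionPlan>) -> Arc<DistributedExec> {
    Arc::new(DistributedExec {
        single_node: plan,
        distributed: Mutex::new(None),
    })
}

// The former body of the rule, now async. A real implementation could
// await network calls here; this sketch just wraps the input in a stage.
async fn distribute_plan(
    plan: Arc<dyn ExecutionPlan>,
) -> Result<Arc<dyn ExecutionPlan>, String> {
    Ok(Arc::new(StageExec { input: plan }))
}

impl DistributedExec {
    // Hypothetical helper: in the proposal this would run right before
    // DistributedExec::prepare_plan(). It plans once and caches the result.
    async fn lazily_distributed(&self) -> Result<Arc<dyn ExecutionPlan>, String> {
        if let Some(plan) = self.distributed.lock().unwrap().clone() {
            return Ok(plan);
        }
        let plan = distribute_plan(Arc::clone(&self.single_node)).await?;
        *self.distributed.lock().unwrap() = Some(Arc::clone(&plan));
        Ok(plan)
    }
}

// Tiny busy-polling executor so the sketch runs without an async runtime.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

With this shape, the lazy path is block_on(exec.lazily_distributed()) on the DistributedExec returned by optimizer_rule, while users who want to inspect the distributed plan ahead of execution could call distribute_plan(plan) directly and await it themselves. In the real project an async runtime such as Tokio would take the place of block_on.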