docs: update design docs

vsoch · vsoch · commit 7b52ccff80fe · 2024-10-23T16:26:19.000-06:00
Signed-off-by: vsoch &lt;vsoch@users.noreply.github.com&gt;
diff --git a/docs/getting_started/design.md b/docs/getting_started/design.md
@@ -1,12 +1,16 @@
 # Design
 
+The ensemble operator orchestrates ensembles of jobs based on events and state machine logic. This means that you define a set of work (jobs) and rules (triggers and actions) for how to act. It is a state machine because there is no DAG calculated in advance. The rules are combined with events delivered from ensemble members, which can range from an entire HPC cluster on premises or in Kubernetes to other Kubernetes abstractions like JobSet (coming soon). Generally speaking, triggers can be anything from metrics collected about the queue for a job group (typicalling streaming models of ML) to expected job events. Custom rules that return actions of your choosing can also be written that take into account anything you can implement in Python. There are a lot of details, and the [ensemble python README](https://github.com/converged-computing/ensemble-python?tab=readme-ov-file#design) is the best place to read about the details. The remainder of the design document here will focus on the Ensemble Operator, which primarily serves to run ensemble-python in the Kubernetes space, providing actions "grow" and "shrink" that do not work (yet) on traditional HPC.
+
 ## Current Design
 
-This current design moves the responsibility to monitor one or more ensembles from the operator to a service that is deployed alongside the members. It's a much better design to do that, and it also has a refactored [ensemble-python](https://github.com/converged-computing/ensemble-python) library that works external to the operator here (and Kubernetes in general) so you can run Ensembles alongside your workload manager a la carte!
+This current design moves the responsibility to service one or more ensembles from the operator to a GRPC deployment that runs alongside the members. It's a much better design to do that, and it also has a refactored [ensemble-python](https://github.com/converged-computing/ensemble-python) library that works external to the operator here (and Kubernetes in general) so you can run Ensembles alongside your workload manager a la carte!
 
 ![img/design.png](img/design.png)
 
-More to come on this soon. Right now we have the ensemble running for Flux inside of Kubernetes, and I'll be adding grow capability soon.
+In the above, an ensemble member (a Flux Framework MiniCluster) is deployed as a single member ensemble. The ensemble member will be running ensemble-python on the lead broker (index 0 of the indexed job), where it is installed on the fly, akin to how Flux is added to the application container on the fly. The ensemble follows the work (jobs) and rules that are defined in the user-provided ensemble.yaml file. The ensemble-python library provides a simple state machine that receives job events from flux, and also uses a heartbeat (at a user defined frequency) to look for changes in metric models that might warrant an action. As an example, a rule might say to grow the cluster if the pending time for a job group goes above a threshold. We need a heartbeat to check that. For Kubernetes logic, the ensemble service is a deployment that runs a GRPC service following the same protocol (gRPC) as ensemble python knows how to interact with. It can receive events from multiple ensemble members (not shown here) and eventually handle things like fair share, etc. A headless service was explicitly not chosen because ensemble members should not share a network. Rather, the GRPC service is provided via its own exposed ClusterIP that is provided to ensemble members. For the GRPC service to make changes to ensemble members (grow/shrink) it has a paired Role and Role Binding with a Service Account to control MiniClusters in the same namespace. This is a huge improvement on the first design (discussed below) because ensemble-python works outside of Kubernetes, and there is not a huge load on the operator to interact with ensemble members.
+
+Note that while this is running in Kubernetes, it does not need to be - it works on "bare metal" Flux, but not all features can be supported. For features that are in the queue (see what I did there) please see the [ensemble-python](https://github.com/converged-computing/ensemble-python) README. Most development will happen there, as the operator doesn't need to do much aside from running it!
 
 ## Design 1