-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to run multiple playbooks sequentially #153
Comments
I think a couple of options here:
|
I was just going to mention that perhaps comprehension of some DAG type document is the best way. CLI options would be good for dev, but a document is way better - these kinds of interactions between playbooks are not intuitive and should be abstracted away from the user as much as possible. |
Could this be a level lower? DBT for example builds a DAG implicitly from tables that get referenced. So a query that depends on a table will always be executed after the creation of said table so there is no need to hardcode this logic. If we could have SQL Runner understand what order queries should be run in by building some logical parsing rules into the system perhaps we could do away with the "playbook" entirely. |
I think that would lose a lot of the power of SQL Runner? The nice thing with SQL Runner is that you can put any SQL in it - which means there isn't always an inherent DAG from table references. SQL Runner : dbt :: Assembly : C++ |
I would prefer an explicitly configured one for another reason - the deps can be implicit and not based on existence of a resource. To illustrate - in my example above, the second playbook relies on the manifest for the first. So, if the manifest is updated before the second one runs, things will go wrong. Inference of dependencies doesn't strike me as terribly suited to managing these kinds of interactions, where the order of execution is incidentally important. |
I guess my preference here would be to define these dependencies in the SQL code rather than in the playbook itself. If you take Terraform as an example - for a particular resource you define a dependency link either implicitly (referencing the output of another resource) or declared explicitly (with an in-line declaration). A SQL statement could be similarly managed and a DAG built from this instead. Moving dependencies to being discovered and understood by the application also removes any ability for you to execute steps in the wrong order. Assuming you have defined these links in your code you would never need to worry about where you place a step in your playbook. This, in-turn, removes the cognitive overhead of having to know where all of these dependencies live and ensuring you don't accidentally move a step where it shouldn't be! I can see the merit in a playbook as it is extremely explicit but the worst part for me of any datamodel development in it is the playbook setup and having to define all of the steps. I would much prefer executing a directory of SQL files and having the application figure out the best order to execute them. |
That's quite neat! Could you share a code example of an explicit declaration? |
Sure - illustrated below are both implicit and explicit declarations: // Create a resource!
resource "fake_example" "res_1" {
foo = "bar"
}
// Explicit dependency on "res_1"
resource "fake_example" "res_2" {
foo = "bar"
depends_on = ["fake_example.res_1"]
}
// Implicit dependency on "res_1"
resource "fake_example" "res_3" {
foo = "${fake_example.res_1.id}"
} Translating this to SQL Runner we would need to add ways for a block of SQL to be reference-able but given that you could consider each "resource" here as a SQL block that you would link together - or not if they had no dependency on each other and could therefore be executed in parallel. |
I need to think it over, as I do also see the merit in the suggestion, but I'm unsure if my concerns on it are legit. However, to play devil's advocate on one point:
I don't see this as necessarily being solved by defining deps in the SQL. If your logic has an implicit kind of requirement, then you still have to do that reasoning, but your defining it in many places instead of one. So potentially the problem just moves from 'which of my steps in this single playbook is wrong' to 'which of my declarations in these many sql files is wrong'. I'm talking about a case here where a rule like 'x needs to exist before y is run' is not sufficient to determine the order of execution. I think I may have confused things by using 'implicit' to refer to this case. What I mean is 'incidental to the actual structure of the model'. |
Thinking further on it - plugging in custom modules means that we can't know in advance whether or not a custom module exists, nor can we know what the module is. So in my example, if we're to have a requirement that "updating manifest should happen after any custom module is run", I don't think we can define it in the SQL file. I do think we can find a way to remove the playbook pain and satisfy that requirement though, with a bit of consideration. |
True - but you have an explicit dependency definition not a somewhat hidden one in the playbook. In a playbook there is a defined order but why the order exists is not as clear and is not enforced as opposed to the block of SQL that had clearly defined requirements of "I need this first".
Would love to dig into an example of where that would be the case! If you have a defined order in the playbook would that not denote that that is the exact level of logic being employed already? |
Hard to explain simply, but I'll try. The new web model structure aims to have immutable logic, but enable customisation. It has 4 components:
Now, if we update the manifest before the custom module, and that module fails part way thorough, it is quite difficult to recover gracefully. Restart from top is not safe. Manifest has no record of what events were in the last run. Any logic to handle this is unwieldy. So, the simplest way to achieve it is for each module to have its own playbook, and for the 'finish' playbook to run after the other 3 have run. Defining dependencies inside the SQL file can't really do this.
Yes, it is, except that I can only run a playbook one by one. The original ask here is simply to enable me to run a sequential list of playbooks, as if a factotum DAG were being run. The nicest solution to it seems to be to define the same thing as what the DAG defines, but for SQL-runnner. |
I am 100% on board, by the way, with the idea of removing the complexity of playbooks (and potentially removing the need for a user to explicitly define the playbook), by the way. I think our new model structure allows us to implement a solution (possibly even the solution you outline @jbeemster ). (having inference per-playbook, or per directory of files works). But that doesn't solve the requirement to run multiple playbooks in a defined order (which may not be inferrable), so I think we need a solution to both. |
Yeah thats fair - I think it would be nice to try and remove the need for multiple playbooks even for this case but can appreciate there is a lot of complexity going on here. Maybe a higher level abstraction is just that you have:
Essentially you would have 1 playbook but with multiple "modules" defined within them. This lets you remove the very verbose playbook definitions for steps but allows you still to enforce the sequential and controlled execution of the DAG steps. This is starting to sound a lot like Terraform! Edit: Perhaps you could still force a dependency link between modules somehow to make it very explicit that SQL in one module is required for the other - propagation of output variables could also be a nice thing to be able to pass between modules. |
Yeah I like this direction. Relevant to the idea, by the way, is that evolving this kind of structure involves propagating playbooks with how SQL-runner currently works - so there is definitely a case to manage the order of module execution in SQL-runner, rather than having massive and unwieldy factotum DAGs. |
Exciting to see this taking shape! |
Noting this here in case I forget. If we pursue the ideas outlined here, it would be nice to be able to both set global vars (for an entire DAG in our current structure), and vars which are specific to modules (ie for an individual playbook). Failing that, just setting them globally per-model is workable, any that are specific to a given step would just need unique names. This reduces scope for misconfiguration and reduces cognitive load in running any given model. |
This is currently something we do with our job orchestration tech - if there are multiple playbooks to be run in one job, a DAG dictates that job's execution.
It would be nice from the point of view of dev experience (and also simplifying orchestration) - to be able to pass a list of playbooks and have SQL-runner run them in order.
In case it's not clear, an example use case is:
This is basically one job, but the playbooks are distinct because that's the simplest way to have a modular structure which enables one to 'plug in' some custom bits before updating the manifests (without needing to maintain identical manifest logic in two places).
So, it would if SQL-runner could consider a job to consist of several playbooks.
The text was updated successfully, but these errors were encountered: