Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to run multiple playbooks sequentially #153

Open
colmsnowplow opened this issue Jul 17, 2020 · 17 comments
Open

Add ability to run multiple playbooks sequentially #153

colmsnowplow opened this issue Jul 17, 2020 · 17 comments

Comments

@colmsnowplow
Copy link
Contributor

colmsnowplow commented Jul 17, 2020

This is currently something we do with our job orchestration tech - if there are multiple playbooks to be run in one job, a DAG dictates that job's execution.

It would be nice from the point of view of dev experience (and also simplifying orchestration) - to be able to pass a list of playbooks and have SQL-runner run them in order.

In case it's not clear, an example use case is:

  • playbook for a standard web model is run. This creates & relies on a manifest.
  • playbook for a custom model is run. This may depend on the web model, its temp tables, and also its manifest. (eg adding custom events which relate to the web data).
  • playbook to complete the job is run - this updates manifests, removes temp tables, and cleans up.

This is basically one job, but the playbooks are distinct because that's the simplest way to have a modular structure which enables one to 'plug in' some custom bits before updating the manifests (without needing to maintain identical manifest logic in two places).

So, it would if SQL-runner could consider a job to consist of several playbooks.

@alexanderdean
Copy link
Member

I think a couple of options here:

  1. Allow an entry inside a playbook to be another playbook, so that Colm's approach is modelled inside a playbook
  2. Allow SQL Runner to have command line options to invoke multiple playbooks in series

@colmsnowplow
Copy link
Contributor Author

I was just going to mention that perhaps comprehension of some DAG type document is the best way.

CLI options would be good for dev, but a document is way better - these kinds of interactions between playbooks are not intuitive and should be abstracted away from the user as much as possible.

@jbeemster
Copy link
Member

Could this be a level lower? DBT for example builds a DAG implicitly from tables that get referenced. So a query that depends on a table will always be executed after the creation of said table so there is no need to hardcode this logic.

If we could have SQL Runner understand what order queries should be run in by building some logical parsing rules into the system perhaps we could do away with the "playbook" entirely.

@alexanderdean
Copy link
Member

I think that would lose a lot of the power of SQL Runner? The nice thing with SQL Runner is that you can put any SQL in it - which means there isn't always an inherent DAG from table references.

SQL Runner : dbt :: Assembly : C++

@colmsnowplow
Copy link
Contributor Author

colmsnowplow commented Jul 17, 2020

I would prefer an explicitly configured one for another reason - the deps can be implicit and not based on existence of a resource.

To illustrate - in my example above, the second playbook relies on the manifest for the first. So, if the manifest is updated before the second one runs, things will go wrong.

Inference of dependencies doesn't strike me as terribly suited to managing these kinds of interactions, where the order of execution is incidentally important.

@jbeemster
Copy link
Member

I guess my preference here would be to define these dependencies in the SQL code rather than in the playbook itself. If you take Terraform as an example - for a particular resource you define a dependency link either implicitly (referencing the output of another resource) or declared explicitly (with an in-line declaration). A SQL statement could be similarly managed and a DAG built from this instead.

Moving dependencies to being discovered and understood by the application also removes any ability for you to execute steps in the wrong order. Assuming you have defined these links in your code you would never need to worry about where you place a step in your playbook.

This, in-turn, removes the cognitive overhead of having to know where all of these dependencies live and ensuring you don't accidentally move a step where it shouldn't be!


I can see the merit in a playbook as it is extremely explicit but the worst part for me of any datamodel development in it is the playbook setup and having to define all of the steps. I would much prefer executing a directory of SQL files and having the application figure out the best order to execute them.

@alexanderdean
Copy link
Member

If you take Terraform as an example - for a particular resource you define a dependency link either implicitly (referencing the output of another resource) or declared explicitly (with an in-line declaration).

That's quite neat! Could you share a code example of an explicit declaration?

@jbeemster
Copy link
Member

Sure - illustrated below are both implicit and explicit declarations:

// Create a resource!
resource "fake_example" "res_1" {
  foo = "bar"
}

// Explicit dependency on "res_1"
resource "fake_example" "res_2" {
  foo = "bar"

  depends_on = ["fake_example.res_1"]
}

// Implicit dependency on "res_1"
resource "fake_example" "res_3" {
  foo = "${fake_example.res_1.id}"
}

Translating this to SQL Runner we would need to add ways for a block of SQL to be reference-able but given that you could consider each "resource" here as a SQL block that you would link together - or not if they had no dependency on each other and could therefore be executed in parallel.

@colmsnowplow
Copy link
Contributor Author

colmsnowplow commented Jul 17, 2020

I need to think it over, as I do also see the merit in the suggestion, but I'm unsure if my concerns on it are legit. However, to play devil's advocate on one point:

This, in-turn, removes the cognitive overhead of having to know where all of these dependencies live and ensuring you don't accidentally move a step where it shouldn't be!

I don't see this as necessarily being solved by defining deps in the SQL. If your logic has an implicit kind of requirement, then you still have to do that reasoning, but your defining it in many places instead of one. So potentially the problem just moves from 'which of my steps in this single playbook is wrong' to 'which of my declarations in these many sql files is wrong'.

I'm talking about a case here where a rule like 'x needs to exist before y is run' is not sufficient to determine the order of execution. I think I may have confused things by using 'implicit' to refer to this case. What I mean is 'incidental to the actual structure of the model'.

@colmsnowplow
Copy link
Contributor Author

Thinking further on it - plugging in custom modules means that we can't know in advance whether or not a custom module exists, nor can we know what the module is.

So in my example, if we're to have a requirement that "updating manifest should happen after any custom module is run", I don't think we can define it in the SQL file.

I do think we can find a way to remove the playbook pain and satisfy that requirement though, with a bit of consideration.

@jbeemster
Copy link
Member

I don't see this as necessarily being solved by defining deps in the SQL. If your logic has an implicit kind of requirement, then you still have to do that reasoning, but your defining it in many places instead of one. So potentially the problem just moves from 'which of my steps in this single playbook is wrong' to 'which of my declarations in these many sql files is wrong'.

True - but you have an explicit dependency definition not a somewhat hidden one in the playbook. In a playbook there is a defined order but why the order exists is not as clear and is not enforced as opposed to the block of SQL that had clearly defined requirements of "I need this first".

I'm talking about a case here where a rule like 'x needs to exist before y is run' is not sufficient to determine the order of execution. I think I may have confused things by using 'implicit' to refer to this case. What I mean is 'incidental to the actual structure of the model'.

Would love to dig into an example of where that would be the case! If you have a defined order in the playbook would that not denote that that is the exact level of logic being employed already?

@colmsnowplow
Copy link
Contributor Author

Would love to dig into an example of where that would be the case!

Hard to explain simply, but I'll try. The new web model structure aims to have immutable logic, but enable customisation. It has 4 components:

  • 'base' - contains incrementalisation logic, produces a 'events this batch' table. This is a temporary table.
  • 'web model' - runs the web model logic using 'events this batch' as an input.
  • 'custom' - optional module where someone can write their own logic - eg to produce a custom events table which joins onto the web model. This uses 'events this batch' as an input also. It also can depend on temp tables from the web model, if that's what they need to do.
  • 'finish' - this cleans up tables, and updates the manifests

Now, if we update the manifest before the custom module, and that module fails part way thorough, it is quite difficult to recover gracefully. Restart from top is not safe. Manifest has no record of what events were in the last run. Any logic to handle this is unwieldy.

So, the simplest way to achieve it is for each module to have its own playbook, and for the 'finish' playbook to run after the other 3 have run. Defining dependencies inside the SQL file can't really do this.

If you have a defined order in the playbook would that not denote that that is the exact level of logic being employed already?

Yes, it is, except that I can only run a playbook one by one. The original ask here is simply to enable me to run a sequential list of playbooks, as if a factotum DAG were being run. The nicest solution to it seems to be to define the same thing as what the DAG defines, but for SQL-runnner.

@colmsnowplow
Copy link
Contributor Author

colmsnowplow commented Jul 17, 2020

I am 100% on board, by the way, with the idea of removing the complexity of playbooks (and potentially removing the need for a user to explicitly define the playbook), by the way. I think our new model structure allows us to implement a solution (possibly even the solution you outline @jbeemster ). (having inference per-playbook, or per directory of files works).

But that doesn't solve the requirement to run multiple playbooks in a defined order (which may not be inferrable), so I think we need a solution to both.

@jbeemster
Copy link
Member

jbeemster commented Jul 17, 2020

Yeah thats fair - I think it would be nice to try and remove the need for multiple playbooks even for this case but can appreciate there is a lot of complexity going on here.

Maybe a higher level abstraction is just that you have:

  1. Modules: A directory with SQL which is then automatically able to figure out the order of execution
  2. Playbook: Defines the order of execution for Modules

Essentially you would have 1 playbook but with multiple "modules" defined within them. This lets you remove the very verbose playbook definitions for steps but allows you still to enforce the sequential and controlled execution of the DAG steps.

This is starting to sound a lot like Terraform!

Edit: Perhaps you could still force a dependency link between modules somehow to make it very explicit that SQL in one module is required for the other - propagation of output variables could also be a nice thing to be able to pass between modules.

@colmsnowplow
Copy link
Contributor Author

Yeah I like this direction.

Relevant to the idea, by the way, is that evolving this kind of structure involves propagating playbooks with how SQL-runner currently works - so there is definitely a case to manage the order of module execution in SQL-runner, rather than having massive and unwieldy factotum DAGs.

@alexanderdean
Copy link
Member

Exciting to see this taking shape!

@colmsnowplow
Copy link
Contributor Author

Noting this here in case I forget. If we pursue the ideas outlined here, it would be nice to be able to both set global vars (for an entire DAG in our current structure), and vars which are specific to modules (ie for an individual playbook).

Failing that, just setting them globally per-model is workable, any that are specific to a given step would just need unique names.

This reduces scope for misconfiguration and reduces cognitive load in running any given model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants