feat: add shard execution workflow #1557

polvalente
Contributor

Adds the initial version of the process communication structure for sharded execution.

It does not yet handle container outputs for the sharded function, nor does it bring everything together into the compiler jit function.

@polvalente polvalente self-assigned this Nov 8, 2024
@polvalente polvalente changed the base branch from main to pv-feat/experimental-sharding-backend November 8, 2024 01:20
@polvalente polvalente requested a review from josevalim November 8, 2024 01:20
@josevalim
Collaborator

Could we fully decouple the workflow definition and execution from Nx? Ideally we would have a workflow like this:

workflow = %{
  0 => %{
    code: &foo(&1, &2, ...),
    args: [1, 2]
  },
  1 => %{
    code: &bar(&1),
    args: [2]
  },
  2 => %{
    code: &baz/0,
    args: []
  }
}

And then we pass this to a ProcessExecutor that is completely independent of Nx and tensors. You could also have an Nx executor, but the overall idea is that the Executor should worry about resources and not necessarily about tensors (except for the resources where the tensors are located).
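
To make the idea concrete, here is a minimal sketch of an executor that knows nothing about Nx. It assumes that each entry in `args` is the id of another step whose result becomes the corresponding argument. The module name `WorkflowSketch` and its `run/1` function are hypothetical, and this version runs steps sequentially in a single process, with no cycle detection and no resource management:

```elixir
defmodule WorkflowSketch do
  # Hypothetical module, not part of Nx.
  # Runs a workflow shaped as %{id => %{code: fun, args: [dependency_id]}}
  # and returns a map of %{id => result}.
  def run(workflow) do
    Enum.reduce(Map.keys(workflow), %{}, fn id, results ->
      resolve(workflow, id, results)
    end)
  end

  # The step has already been computed, so there is nothing to do.
  defp resolve(_workflow, id, results) when is_map_key(results, id), do: results

  # Compute the dependencies first, then apply the step to their results.
  defp resolve(workflow, id, results) do
    %{code: code, args: deps} = Map.fetch!(workflow, id)
    results = Enum.reduce(deps, results, &resolve(workflow, &1, &2))
    inputs = Enum.map(deps, &Map.fetch!(results, &1))
    Map.put(results, id, apply(code, inputs))
  end
end

# Stand-in functions replace foo, bar and baz from the example above.
workflow = %{
  0 => %{code: fn a, b -> a + b end, args: [1, 2]},
  1 => %{code: fn x -> x * 10 end, args: [2]},
  2 => %{code: fn -> 4 end, args: []}
}

results = WorkflowSketch.run(workflow)
```

Nothing here refers to tensors: the executor only sees opaque functions and dependency ids, which is the decoupling being asked for.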

@polvalente
Contributor Author

> Could we fully decouple the workflow definition and execution from Nx? Ideally we would have a workflow like this:

We talked about this already, but to register it here: I believe this is possible, though it is probably easier to generalize after we have things ready.

Each workflow step needs to be able to determine:

  1. The input dependency mapping, which tells the workflow where to fetch each argument from (represented here as simple numbers, but it should also carry indices, since we need to represent container outputs).
  2. Some way to check whether a given dependency has produced its output.
  3. A callback for fetching the output, given the requested input section.

Most of this is already present in the current code; the coupling lies mainly in how we identify the dependencies based on data section ids.
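
As a rough illustration of those three requirements, and not the code in this PR, a step could be sketched as below. The `StepSketch` module, the `{producer_id, output_index}` dependency shape, and the use of tuples to stand in for container outputs are all assumptions made for the example:

```elixir
defmodule StepSketch do
  # Hypothetical module. A step is a map %{code: fun, deps: [{producer_id, index}]}.
  # `outputs` maps each finished producer id to a tuple of its outputs.

  # Requirement 2: has every dependency produced its output?
  def ready?(%{deps: deps}, outputs) do
    Enum.all?(deps, fn {producer, _index} -> Map.has_key?(outputs, producer) end)
  end

  # Requirement 3: fetch the requested section of a producer's output.
  def fetch(outputs, {producer, index}) do
    outputs |> Map.fetch!(producer) |> elem(index)
  end

  # Requirement 1 is the `deps` list itself: it tells us where each argument comes from.
  def run(%{code: code, deps: deps}, outputs) do
    apply(code, Enum.map(deps, &fetch(outputs, &1)))
  end
end

# The step takes element 1 of step 0's output and element 0 of step 1's output.
step = %{code: fn a, b -> a - b end, deps: [{0, 1}, {1, 0}]}

partial = %{0 => {10, 20}}
complete = Map.put(partial, 1, {5})

ready_before = StepSketch.ready?(step, partial)
ready_after = StepSketch.ready?(step, complete)
result = StepSketch.run(step, complete)
```

Keying dependencies by `{producer_id, index}` rather than a bare id is one way to let a step consume a single element of a container output. Identifying dependencies by data section ids, as the current code does, would replace that key.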

@polvalente polvalente merged commit dce2c60 into pv-feat/experimental-sharding-backend Nov 28, 2024
8 checks passed
@polvalente polvalente deleted the pv-feat/parallelize-graph-stages branch November 28, 2024 05:13