
MLBuffer destruction timeline #716

Open · reillyeon opened this issue Jul 2, 2024 · 14 comments

@reillyeon (Contributor)

While prototyping an implementation of the MLBuffer interface for Chromium's TFLite-based WebNN backend I've encountered a question: What should happen if a developer executes the following code?

let buffer = context.createBuffer(...);
context.writeBuffer(buffer, ...);
let promise = context.readBuffer(buffer);
buffer.destroy();

As currently implemented, promise will never be resolved. This is likely undesirable. We probably want one of the following behaviors:

  1. promise resolves with the contents of the buffer. More generally, the buffer is not destroyed until all pending operations referencing it are completed.
  2. promise rejects (probably with an AbortError or InvalidStateError). More generally, any incomplete operation referencing the buffer is aborted (if possible).

Option 1 is consistent with the idea that readBuffer(), writeBuffer() and dispatch() all queue tasks to run on a "timeline" separate from JavaScript execution. destroy() would be one more such method. Option 2 seems useful for quickly releasing resources, but means the developer must take care to only call destroy() when they are truly done with the buffer (e.g. after awaiting the Promise returned by readBuffer()).
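
For example, under Option 2 a developer would need to sequence destruction after every pending read settles. A minimal sketch (descriptor and inputData are placeholder names; the exact argument shapes are illustrative, not from the spec):

// Option 2 usage: only destroy() once all pending reads have settled.
const buffer = context.createBuffer(descriptor);
context.writeBuffer(buffer, inputData);
const result = await context.readBuffer(buffer); // wait for the read to settle
buffer.destroy(); // safe now: no pending operations reference the buffer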

@bbernhar commented Jul 9, 2024

I'm in favor of Option 1. This was the intent of the original proposal in #542:

> Destroy() gets called on the context timeline but doesn't actually release until the device signals completion.

@huningxin (Contributor)

> let buffer = context.createBuffer(...);
> context.writeBuffer(buffer, ...);
> let promise = context.readBuffer(buffer);
> buffer.destroy();

In the current Chromium prototype, the promise is neither resolved nor rejected, because destroy() resets the Mojo remote of the buffer. The WebNN service still reads the data from the GPU but silently fails to send it back to the renderer, so the promise in JavaScript never settles.

I guess destroy() can be fixed by rejecting the pending promises, which basically implements Option 2. The WebNN service still needs to wait until the GPU readback is done before releasing the readback buffer. The backing resource of the destroyed buffer would be released earlier, when handling the destroy() / disconnection message.

> e.g. after awaiting the Promise returned by readBuffer()

I suppose this is the typical usage.

@bbernhar

Updated #543 with additional clarification on the content/script timeline: destroy() must reject pending readBuffer() promises.

@reillyeon (Contributor, Author)

The updated proposal defines what happens to the promises returned by readBuffer(), but what about pending dispatch operations? If only one buffer involved in a dispatch is destroyed, do we cancel the dispatch and reject any readBuffer() calls on the output buffers?

@bbernhar

> If only one buffer involved in a dispatch is destroyed, do we cancel the dispatch and reject any readBuffer() calls on the output buffers?

Thanks for raising the question, Reilly. I don't believe it's possible to stop or cancel GPUs from executing in-flight commands without resetting the device. If a dispatch is pending (or still running on-device) then we can allow it to complete, which should (on the device timeline) safely release any destroyed buffers used as input/output.

@reillyeon (Contributor, Author)

> Thanks for raising the question, Reilly. I don't believe it's possible to stop or cancel GPUs from executing in-flight commands without resetting the device. If a dispatch is pending (or still running on-device) then we can allow it to complete, which should (on the device timeline) safely release any destroyed buffers used as input/output.

In that case I'm not sure of the value of causing readBuffer() to reject since the compute is still going to happen and we don't actually free any resources immediately.

@bbernhar

I don't feel strongly either way. However, I think it's atypical to call destroy() without first waiting on readBuffer(). Therefore, I'm okay with disallowing it and rejecting pending promises, per the current approach.

@reillyeon (Contributor, Author)

Thinking about this more with the additional context of MLContext.destroy(), I think rejecting the promises immediately is the right behavior, as it will make these two methods consistent, even though it makes the overall behavior somewhat inconsistent with the concept that MLBuffer operations all occur on the device timeline.
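
For comparison, a minimal sketch of the consistency argument: destroying either the buffer or the whole context would reject a pending read the same way (the exact error name is still what this thread is deciding):

// With immediate rejection, buffer and context destruction behave consistently:
const promise = context.readBuffer(buffer);
buffer.destroy(); // rejects `promise` right away, just as context.destroy() would
promise.catch(e => console.log(e.name)); // e.g. 'InvalidStateError' or 'AbortError'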

@bbernhar

> even though it makes the overall behavior somewhat inconsistent with the concept that MLBuffer operations all occur on the device timeline.

Could we have MLBuffer operations be defined across two timelines, script and device/queue? Then calling destroy() would reject pending promises on the script timeline and, as its last step there, issue the release operation as the first step on the device/queue timeline.
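
A rough sketch of what that split could look like inside an implementation (pendingReads, deviceQueue, and releaseDeviceMemory are hypothetical names, not spec or Chromium identifiers):

destroy() {
  // Script timeline: reject any observable pending promises immediately.
  for (const { reject } of this.pendingReads) {
    reject(new DOMException('Buffer was destroyed.', 'InvalidStateError'));
  }
  this.pendingReads.length = 0;
  // Device/queue timeline: the actual release is queued behind any
  // in-flight dispatch, so the device never touches freed memory.
  this.deviceQueue.enqueue(() => releaseDeviceMemory(this));
}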

MLContext.destroy() could effectively call MLBuffer.destroy()... But this leads me to wonder: why are MLGraph(s) not destroyable too? Like MLBuffer, MLGraph can hold copies of buffers (i.e. a device-internal representation) which could benefit from predictable release (and consistent destruction).

@reillyeon (Contributor, Author)

> Could we have MLBuffer operations be defined across two timelines, script and device/queue? Then calling destroy() would reject pending promises on the script timeline and, as its last step there, issue the release operation as the first step on the device/queue timeline.

Yes, I think explicitly splitting the timelines in the method's algorithm will make this clear for developers and implementers.

> MLContext.destroy() could effectively call MLBuffer.destroy()... But this leads me to wonder: why are MLGraph(s) not destroyable too? Like MLBuffer, MLGraph can hold copies of buffers (i.e. a device-internal representation) which could benefit from predictable release (and consistent destruction).

+1 to adding MLGraph.destroy() and MLGraphBuilder.destroy() methods.

@inexorabletash (Member)

Revisiting this now that we've made an initial attempt at specifying MLContext's destroy(), MLGraph's destroy(), and MLTensor's destroy():

  • readTensor() defines "abort when" only for context loss (resulting in the InvalidStateError), and obliquely "Let bytes be a byte sequence containing a copy of tensor.[[data]]. If that fails ...".
    • Suggestion: expand the "abort when" to cover this.[[data]] being released (per the destroy() steps)
  • MLGraphBuilder.destroy() hasn't been specified or prototyped. Do we still want that?

Is anything else discussed here still outstanding?

@reillyeon (Contributor, Author)

I don't think we need to cover this.[[data]] being released, because destroy() and readTensor() execute on the same timelines: if the tensor hadn't been destroyed when readTensor() was called, the data will still be around to be read when the task runs on the device timeline.

@a-sully was looking at making readTensor() also fail if there was an error during a dispatch() that would have written to the MLTensor.

We should specify MLGraphBuilder.destroy(), and specify that Symbol.dispose is an alias for these methods so that developers can use explicit resource management.
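
With Symbol.dispose aliased to destroy(), explicit resource management would look roughly like this (the createTensor/dispatch/readTensor argument shapes are illustrative, not normative):

// `using` invokes tensor[Symbol.dispose]() (i.e. destroy()) at block exit.
{
  using tensor = await context.createTensor({ dataType: 'float32', shape: [2, 2], readable: true });
  context.dispatch(graph, inputs, { output: tensor });
  const result = await context.readTensor(tensor);
} // tensor.destroy() runs here automatically, even if an exception was thrown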

@a-sully (Contributor) commented Jan 25, 2025

> @a-sully was looking at making readTensor() also fail if there was an error during a dispatch() that would have written to the MLTensor.

After further discussion on #778 and with @RafaelCintron and @bbernhar, I think we're better off firing a lost promise on the MLGraph (as we do for the MLContext) if dispatch() fails with said graph.

Benefits of this approach over the proposal in #778 to propagate errors via MLTensors include:

  • It's much simpler
  • We'd probably want to invalidate the MLGraph anyway if dispatch() fails
  • It's more clear what triggered the error (e.g. it doesn't assume readTensor() will be called at some later point)
  • If dispatch() fails, the contents of the output tensors may be undefined... but they're very likely to be unmodified (I haven’t seen anything else happen in practice) and WebGPU also doesn't bother shielding the contents of buffers after failed operations
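
A hypothetical sketch of that graph-lost approach (assuming a lost attribute on MLGraph analogous to MLContext.lost; this is the proposal, not current spec text):

// Failure surfaces on the graph itself rather than on individual tensors.
graph.lost.then(() => {
  // Recovery: rebuild the graph; output tensor contents are likely unmodified.
  console.error('Graph lost: a dispatch() using it failed.');
});
context.dispatch(graph, inputs, outputs);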

@bbernhar commented Jan 27, 2025

MLGraphBuilder would not require a destroy() if it did not create device-allocated memory for destruction to reclaim. Looking at MLGraphBuilder, I only see constants creating device-allocated memory. If we decide to specify constants as MLTensors, then we might need a destroy() on the builder after all.
