
MLBuffer destruction timeline #716

Open · reillyeon opened this issue Jul 2, 2024 · 14 comments

@reillyeon (Contributor)

While prototyping an implementation of the MLBuffer interface for Chromium's TFLite-based WebNN backend I've encountered a question: What should happen if a developer executes the following code?

let buffer = context.createBuffer(...);
context.writeBuffer(buffer, ...);
let promise = context.readBuffer(buffer);
buffer.destroy();

As currently implemented, promise will never be resolved. This is likely undesirable. We probably want one of the following behaviors:

  1. promise resolves with the contents of the buffer. More generally, the buffer is not destroyed until all pending operations referencing it are completed.
  2. promise rejects (probably with an AbortError or InvalidStateError). More generally, any incomplete operation referencing the buffer is aborted (if possible).

Option 1 is consistent with the idea that readBuffer(), writeBuffer() and dispatch() all queue tasks to run on a "timeline" separate from JavaScript execution. destroy() would be one more such method. Option 2 seems useful for quickly releasing resources, but means the developer must take care to only call destroy() when they are truly done with the buffer (e.g. after awaiting the Promise returned by readBuffer()).
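
For example, under Option 2 a developer would need to sequence destruction after every pending read settles. A minimal sketch (descriptor and inputData are placeholder names; the exact argument shapes are illustrative, not from the spec):

// Option 2 usage: only destroy() once all pending reads have settled.
const buffer = context.createBuffer(descriptor);
context.writeBuffer(buffer, inputData);
const result = await context.readBuffer(buffer); // wait for the read to settle
buffer.destroy(); // safe now: no pending operations reference the buffer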

@bbernhar commented Jul 9, 2024

I'm in favor of Option 1. This was the intent of the original proposal in #542:

> Destroy() gets called on the context timeline but doesn't actually release until the device signals completion.

@huningxin (Contributor)

> let buffer = context.createBuffer(...);
> context.writeBuffer(buffer, ...);
> let promise = context.readBuffer(buffer);
> buffer.destroy();

In the current Chromium prototype, the promise is neither resolved nor rejected, because destroy() resets the Mojo remote of the buffer. The WebNN service still reads the data from the GPU but silently fails to send it back to the renderer, so the promise in JavaScript never settles.

I guess destroy() can be fixed by rejecting the pending promises, which basically implements Option 2. The WebNN service still needs to wait until the GPU readback is done before releasing the readback buffer. The backing resource of the destroyed buffer would be released earlier, when handling the destroy() / disconnection message.

> e.g. after awaiting the Promise returned by readBuffer()

I suppose this is the typical usage.

@bbernhar

Updated #543 with additional clarification on the content/script timeline: destroy() must reject pending readBuffer() promises.

@reillyeon (Contributor, Author)

The updated proposal defines what happens to the promises returned by readBuffer(), but what about pending dispatch operations? If only one buffer involved in a dispatch is destroyed, do we cancel the dispatch and reject any readBuffer() calls on the output buffers?

@bbernhar

> If only one buffer involved in a dispatch is destroyed, do we cancel the dispatch and reject any readBuffer() calls on the output buffers?

Thanks for raising the question, Reilly. I don't believe it's possible to stop or cancel GPUs from executing in-flight commands without resetting the device. If a dispatch is pending (or still running on-device) then we can allow it to complete, which should (on the device timeline) safely release any destroyed buffers used as input/output.

@reillyeon (Contributor, Author)

> Thanks for raising the question, Reilly. I don't believe it's possible to stop or cancel GPUs from executing in-flight commands without resetting the device. If a dispatch is pending (or still running on-device) then we can allow it to complete, which should (on the device timeline) safely release any destroyed buffers used as input/output.

In that case I'm not sure of the value of causing readBuffer() to reject since the compute is still going to happen and we don't actually free any resources immediately.

@bbernhar

I don't feel strongly either way. However, I think it's atypical to call destroy() without first waiting on readBuffer(). Therefore, I'm okay with disallowing it and rejecting pending promises, per the current approach.

@reillyeon (Contributor, Author)

Thinking about this more with the additional context of MLContext.destroy(), I think rejecting the promises immediately is the right behavior, as it will make these two methods consistent, even though it makes the overall behavior somewhat inconsistent with the concept that MLBuffer operations all occur on the device timeline.
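
For comparison, a minimal sketch of the consistency argument: destroying either the buffer or the whole context would reject a pending read the same way (the exact error name is still what this thread is deciding):

// With immediate rejection, buffer and context destruction behave consistently:
const promise = context.readBuffer(buffer);
buffer.destroy(); // rejects `promise` right away, just as context.destroy() would
promise.catch(e => console.log(e.name)); // e.g. 'InvalidStateError' or 'AbortError'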

@bbernhar

> even though it makes the overall behavior somewhat inconsistent with the concept that MLBuffer operations all occur on the device timeline.

Could we have MLBuffer operations be defined across two timelines, script and device/queue? Then calling destroy() would reject pending promises on the script timeline and, as its last step there, issue the release operation as the first step on the device/queue timeline.
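
A rough sketch of what that split could look like inside an implementation (pendingReads, deviceQueue, and releaseDeviceMemory are hypothetical names, not spec or Chromium identifiers):

destroy() {
  // Script timeline: reject any observable pending promises immediately.
  for (const { reject } of this.pendingReads) {
    reject(new DOMException('Buffer was destroyed.', 'InvalidStateError'));
  }
  this.pendingReads.length = 0;
  // Device/queue timeline: the actual release is queued behind any
  // in-flight dispatch, so the device never touches freed memory.
  this.deviceQueue.enqueue(() => releaseDeviceMemory(this));
}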

MLContext.destroy() could effectively call MLBuffer.destroy()... But this leads me to wonder: why are MLGraph(s) not destroyable too? Like MLBuffer, MLGraph can hold copies of buffers (i.e. a device-internal representation) which could benefit from predictable release (and consistent destruction).

@reillyeon (Contributor, Author)

> Could we have MLBuffer operations be defined across two timelines, script and device/queue? Then calling destroy() would reject pending promises on the script timeline and, as its last step there, issue the release operation as the first step on the device/queue timeline.

Yes, I think explicitly splitting the timelines in the method's algorithm will make this clear for developers and implementers.

> MLContext.destroy() could effectively call MLBuffer.destroy()... But this leads me to wonder: why are MLGraph(s) not destroyable too? Like MLBuffer, MLGraph can hold copies of buffers (i.e. a device-internal representation) which could benefit from predictable release (and consistent destruction).

+1 to adding MLGraph.destroy() and MLGraphBuilder.destroy() methods.

@inexorabletash (Member)

Revisiting this now that we've made an initial attempt at specifying MLContext's destroy(), MLGraph's destroy(), and MLTensor's destroy():

  • readTensor() defines "abort when" only for context loss (resulting in the InvalidStateError), and obliquely "Let bytes be a byte sequence containing a copy of tensor.[[data]]. If that fails ...".
    • Suggestion: expand the "abort when" to cover this.[[data]] being released (per the destroy() steps)
  • MLGraphBuilder.destroy() hasn't been specified or prototyped. Do we still want that?

Is anything else discussed here still outstanding?

@reillyeon (Contributor, Author)

I don't think we need to cover this.[[data]] being released, because destroy() and readTensor() execute on the same timelines: if the tensor hadn't been destroyed when readTensor() was called, the data will still be around to be read when the task runs on the device timeline.

@a-sully was looking at making readTensor() also fail if there was an error during a dispatch() that would have written to the MLTensor.

We should specify MLGraphBuilder.destroy(), and specify that Symbol.dispose is an alias for these methods so that developers can use explicit resource management.
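
With Symbol.dispose aliased to destroy(), explicit resource management would look roughly like this (the createTensor/dispatch/readTensor argument shapes are illustrative, not normative):

// `using` invokes tensor[Symbol.dispose]() (i.e. destroy()) at block exit.
{
  using tensor = await context.createTensor({ dataType: 'float32', shape: [2, 2], readable: true });
  context.dispatch(graph, inputs, { output: tensor });
  const result = await context.readTensor(tensor);
} // tensor.destroy() runs here automatically, even if an exception was thrown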

@a-sully (Contributor) commented Jan 25, 2025

> @a-sully was looking at making readTensor() also fail if there was an error during a dispatch() that would have written to the MLTensor.

After further discussion on #778 and with @RafaelCintron and @bbernhar, I think we're better off firing a lost promise on the MLGraph (as we do for the MLContext) if dispatch() fails with said graph.

Benefits of this approach over the proposal in #778 to propagate errors via MLTensors include:

  • It's much simpler
  • We'd probably want to invalidate the MLGraph anyway if dispatch() fails
  • It's more clear what triggered the error (e.g. it doesn't assume readTensor() will be called at some later point)
  • If dispatch() fails, the contents of the output tensors may be undefined... but they're very likely to be unmodified (I haven’t seen anything else happen in practice) and WebGPU also doesn't bother shielding the contents of buffers after failed operations
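
A hypothetical sketch of that graph-lost approach (assuming a lost attribute on MLGraph analogous to MLContext.lost; this is the proposal, not current spec text):

// Failure surfaces on the graph itself rather than on individual tensors.
graph.lost.then(() => {
  // Recovery: rebuild the graph; output tensor contents are likely unmodified.
  console.error('Graph lost: a dispatch() using it failed.');
});
context.dispatch(graph, inputs, outputs);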

@bbernhar commented Jan 27, 2025

MLGraphBuilder would not require a destroy() if it did not create device-allocated memory for destruction to reclaim. Looking at MLGraphBuilder, I only see constants creating device-allocated memory. If we decide to specify constants as MLTensors, then we might need a destroy() on the builder after all.
