The high-level summary of how this works is that there's a dedicated wasm worker thread. When calling into any wasm that could transitively call into an async JS function, instead of directly bridging to wasm we post a message to the wasm worker thread via a SharedArrayBuffer. The wasm worker runs wasm just as it normally would until it needs to call into JS. When that happens, the worker posts a message back to the main thread (currently via a different queue, but it could be the same one, I think).
On the main thread, an infinite-looping async JS function handles the wasm -> JS bridging; it parks periodically via Atomics.waitAsync so it doesn't stall the runloop.
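As a rough illustration, here's a minimal sketch of what that main-thread loop could look like. The buffer layout and the decodeRequest/dispatchToJS/writeResult helpers are hypothetical stand-ins, not the prototype's actual code:

```js
// Slot 0 of this Int32Array view is a flag the worker flips when it has a
// pending wasm -> JS request; the rest of the buffer holds the marshaled call.
const sab = new SharedArrayBuffer(4096);
const ctrl = new Int32Array(sab, 0, 1);

async function runBridgeLoop() {
  for (;;) {
    // Park until the worker flips the flag. Unlike Atomics.wait, waitAsync
    // yields back to the runloop instead of blocking the main thread.
    const { async, value } = Atomics.waitAsync(ctrl, 0, /* expected */ 0);
    if (async) await value;
    // Drain every pending request, awaiting the async JS imports.
    while (Atomics.load(ctrl, 0) !== 0) {
      const request = decodeRequest(sab);          // hypothetical unmarshaling
      const result = await dispatchToJS(request);  // call the async JS import
      writeResult(sab, result);                    // hypothetical marshaling
      Atomics.store(ctrl, 0, 0);
      Atomics.notify(ctrl, 0); // wake the wasm worker parked on the same flag
    }
  }
}
```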
All of this bridging code could/should be handled by toolchains, and it works with shipping wasm today. The toolchains would likely want some sort of PGO to figure out which signatures are hot enough to warrant specialization; rare or call-once signatures could be handled dynamically in JS.
I think there are a few improvements on the core wasm side that would help performance:
1. All atomics are sequentially consistent today, but this code really wants acquire/release semantics.
2. There doesn't seem to be a good way to write a high-performance condition variable, as there's no Atomics.wait_rmw, so you have to unconditionally notify the other thread (see the sketch after this list).
3. There might also be a need for relaxed atomics, since it's unclear whether the spin loop could use a normal load and expect the VM not to change the practical effects under the current wasm memory model.
4. Shared objects would mean this design works with GC'd values in a much more reasonable way.
5. This could work without SAB if there were a syncPostMessage for workers, although likely with much reduced performance. (kinda JS, not wasm)
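To make point 2 concrete, here's what the worker-side post looks like in spirit (same illustrative ctrl/sab layout as above; encodeRequest is a hypothetical helper):

```js
function postToMainThread(ctrl, sab, request) {
  encodeRequest(sab, request); // hypothetical marshaling helper
  Atomics.store(ctrl, 0, 1);   // publish: wants release, gets seq_cst (point 1)
  // Point 2: with no Atomics.wait_rmw there's no cheap way to know whether
  // the other side is actually parked, so every post pays for a notify.
  Atomics.notify(ctrl, 0);
}
```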
I think 2 is the biggest perf bottleneck, since we have to unconditionally notify the other thread each time we post a message. You could have your own spin lock guarding Atomics.wait/Atomics.notify, but that's a bit of a pain, although not impossible or outstandingly impractical. It would be hard for the VM to quickly check for waiters, as the VM doesn't own the wait target's bits, so it can't use any of them to store that information and bail early.
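For what it's worth, one shape that userland workaround could take is the classic futex-style waiter count rather than a full spin lock; the two-cell layout below is illustrative, not from the prototype:

```js
const FLAG = 0, WAITERS = 1; // two cells in a shared Int32Array

// Runs on the wasm worker, where blocking Atomics.wait is allowed.
function waitForMessage(ctrl) {
  Atomics.add(ctrl, WAITERS, 1);
  // Atomics.wait re-checks FLAG against the expected value, so a flag set
  // between the add and the wait just returns "not-equal" immediately and
  // no wakeup is lost.
  while (Atomics.load(ctrl, FLAG) === 0) {
    Atomics.wait(ctrl, FLAG, 0);
  }
  Atomics.sub(ctrl, WAITERS, 1);
}

function post(ctrl) {
  Atomics.store(ctrl, FLAG, 1);
  // Skip the notify (and its cross-thread bookkeeping) when nobody is parked.
  if (Atomics.load(ctrl, WAITERS) !== 0) {
    Atomics.notify(ctrl, FLAG);
  }
}
```

Note that the notifier's store-then-load relies on today's seq_cst ordering; with weaker atomics (points 1/3) it would need an explicit fence to avoid a lost wakeup.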
I think 4/5 are the biggest adoption bottlenecks, if users need them. 4 will eventually happen, as it's on the current roadmap. It could also maybe be worked around by having a wrapper object on the main thread that holds some serializable index for the object, which actually lives in the Worker. AFAIK, there's no current plan for 5, though.
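That wrapper-object workaround might look something like this sketch (RemoteHandle and postToWorker are hypothetical names; a real version would also need a way to release table entries):

```js
// Worker side: GC'd values never leave the worker; they live in a table and
// are identified by index.
const workerObjects = [];

function exportObject(obj) {
  workerObjects.push(obj);
  return workerObjects.length - 1; // the index is plain data, safe to share
}

// Main-thread side: a wrapper that only holds the serializable index and
// forwards operations through the existing message queue.
class RemoteHandle {
  constructor(index) {
    this.index = index;
  }
  call(method, ...args) {
    // The worker looks up workerObjects[this.index] and invokes the method.
    return postToWorker({ index: this.index, method, args }); // hypothetical
  }
}
```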
To build I ran:
emcc -std=c++20 -o JSPIAsLibrary.js -O2 -s WASM=1 -s WASM_WORKERS=1 -s TOTAL_MEMORY=52428800 -s ALLOW_TABLE_GROWTH -g1 --emit-symbol-map -s EXPORTED_RUNTIME_METHODS=addFunction -s EXPORTED_FUNCTIONS=_workerPromiseFib,_wasmMain,_initJSToWasmQueue,_initWasmToJSQueue,_promiseFib JSPIAsLibrary.cpp -v

To get a build with JSPI I ran:

emcc -std=c++20 -o JSPIAsLibrary.js -O2 -s WASM=1 -s WASM_WORKERS=1 -s TOTAL_MEMORY=52428800 -s ALLOW_TABLE_GROWTH -g1 --emit-symbol-map -s EXPORTED_RUNTIME_METHODS=addFunction -s EXPORTED_FUNCTIONS=_workerPromiseFib,_wasmMain,_initJSToWasmQueue,_initWasmToJSQueue JSPIAsLibrary.cpp -v -s JSPI -s JSPI_EXPORTS=_promiseFib

The results I got running the JSPI version against this prototype were basically the same: ~0.18ms for both workerPromiseFib and promiseFib. It's possible that with improvements to points 1/2 above the library could be made faster than JSPI.