First off, great work: wonnx has been very easy to use and, besides a few missing operators, it "just works".
I'm in the optimization phase of building an app that does inference using wonnx. When I benchmark wonnx running a model (with criterion), I find it's just about as fast as onnxruntime. I figured that this probably has to do with marshaling the data to the GPU (maybe the shader created by wonnx runs a little faster but the marshaling time is a little longer). If that's the case, I figured I could get a throughput improvement by running my model in parallel. Unfortunately, I am not in control of the models and cannot retrain or re-export them with a dynamic batch size, so instead I opted to edit the ONNX model itself and clone the graph into 64 subgraphs, each with its own input. Even though it worked as expected and validated, it provided no gain in throughput (or latency, for that matter). My guess is that the shader that wonnx produces is probably not performing each subgraph in parallel, but I don't know.
My question is: is there a general method of parallelization that might yield "pretty good" results and doesn't involve re-training or other Python tasks? I don't mind editing the ONNX model to use another method like SequenceMap (if that's supported), or something similar. Or maybe there's an opportunity to expand the wonnx API to support this out-of-the-box, possibly by issuing multiple draw calls over an offset buffer? What do you think?