Commit e38c8e9

Prepare Unified Attention biases on HPU + add NumPy memory pooling (#550)

This PR moves bias preparation logic from CPU to HPU. Moving causal bias creation alone can cut batch preparation time by more than 2x (~20ms down to ~8ms), with more to be gained on top of that from shared biases. Causal biases give the most benefit, although we can shave off a couple of milliseconds by also preparing shared biases on HPU. Given their dynamicity (they depend on the number of shared tokens), it's hard to make them work under static shapes, so I've implemented some heuristics that use HPU execution when it's deemed beneficial and fall back to CPU otherwise.

Also, this seems to affect torch.compile much more than lazy mode: I've observed much bigger performance gains there, and it now seems to perform close-ish to lazy (it used to be >2x slower).

This also includes a fancy memory pooling optimization for persistent NumPy buffers, used for padding (in the `hpu_tensor` method). We don't do online padding anymore; instead we pre-allocate a padded ndarray filled with a certain value and store it in an LRU cache for later use. If someone requests a placeholder larger than whatever is in the array, we extend the memory pool to accommodate that. Then, when someone requests a smaller placeholder, we just reuse the previously allocated larger one and trim it to size.

In the end, I've managed to get the host overheads of batch preparation down to ~3ms (GSM8k scenario), compared to ~20ms with the current NumPy implementation, or ~120ms with the previous PyTorch implementation. Pretty cool, I guess.

If you want, you can disable all the stuff I added by setting `hpu_bias_acceleration` to `False` and `hpu_tensor_online_padding` to `True`. I used these for A/B testing and left them as a fallback in case something's broken and I haven't caught it.

---------

Signed-off-by: Konrad Zawora <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
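The pooling idea described above (pre-filled padded buffers, grown on demand, trimmed on reuse) can be illustrated with a minimal sketch. This is not the code from this PR: `PaddedBufferPool`, `pad_with_pool`, and the `max_entries` parameter are hypothetical names chosen for illustration, and the real `hpu_tensor` method may handle shapes, keys, and eviction differently.

```python
from collections import OrderedDict

import numpy as np


class PaddedBufferPool:
    """Hypothetical pool of pre-filled 1-D NumPy buffers, one per (dtype, fill value)."""

    def __init__(self, max_entries=8):
        self.max_entries = max_entries
        self._buffers = OrderedDict()  # insertion order doubles as LRU order

    def get(self, length, dtype, fill_value):
        key = (np.dtype(dtype).str, fill_value)
        buf = self._buffers.get(key)
        if buf is None or buf.shape[0] < length:
            # No buffer yet, or the request is larger than the pooled one:
            # grow the pool by allocating a bigger pre-filled array.
            buf = np.full(length, fill_value, dtype=dtype)
            self._buffers[key] = buf
        self._buffers.move_to_end(key)  # mark as most recently used
        while len(self._buffers) > self.max_entries:
            self._buffers.popitem(last=False)  # evict least recently used
        # A smaller request reuses the larger buffer, trimmed to size (a view).
        return buf[:length]


def pad_with_pool(pool, data, padded_length, fill_value):
    """Copy `data` into a pooled buffer and pad the tail with `fill_value`."""
    data = np.asarray(data)
    n = data.shape[0]
    if n > padded_length:
        raise ValueError("data is longer than the requested padded length")
    out = pool.get(padded_length, data.dtype, fill_value)
    out[:n] = data
    out[n:] = fill_value  # reset padding that an earlier request may have overwritten
    return out


pool = PaddedBufferPool()
first = pad_with_pool(pool, [1, 2, 3, 4, 5], 8, 0)
second = pad_with_pool(pool, [9, 9], 4, 0)  # reuses the 8-element buffer
```

Because the returned array is a view into pooled memory, `second` overwrites the start of `first`; in this sketch a caller would have to consume or copy each result before requesting the next one.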

1 parent f96e7cd commit e38c8e9

2 files changed: +477 -127 lines changed

vllm_gaudi
- extension
  - unified_batch.py
- v1/worker
  - hpu_model_runner.py
