docs: update README for amd-flashinfer library consumers by demandal25 · Pull Request #239 · ROCm/flashinfer

demandal25 · 2026-05-21T04:34:26Z

Summary

Refresh the FlashInfer+ROCm README aimed at library consumers, update the Feature Support Matrix to match what has actually landed on amd-integration, and align the ROCm MLA wrapper with the rest of the ROCm backends so that `backend="auto"` is accepted everywhere.

What changed

`README.md`

- **Intro and structure.** Tighten the intro to call out HIP-in-repo kernels vs. AITER dispatch up front; link to the Feature Support Matrix and AITER sections from the first paragraph. Cross-link CDNA3 / CDNA4 to AMD's official architecture whitepapers on first mention.
- **Feature Support Matrix.** Replaced with a five-column table (Kernel / HIP / AITER / `backend="auto"` resolves to / Notes). New ✅ rows: Cascade (feat(hip): cascade attention support on ROCm using HIP #221), MLA via AITER (feat(hip): AITER backend for batch-ragged prefill, batch-paged decode, KV-cache append, MLA, and RMSNorm #232), RoPE (feat(hip): port RoPE to ROCm #223), paged KV-cache append, RMSNorm via AITER (also #232), sliding-window decode on the AITER path (fix(rocm): correct AITER decode backend gaps - sliding window, CUDA graph, return_lse #234), activation, quantization, and opt-in `torch.compile` (Enable torch.compile under a flag #210). Every ✅ is backed by a `tests/rocm_tests/test_*_hip.py` file. FP8 status is folded into per-row notes rather than a dedicated column.
- **GPU / ROCm / PyTorch.** Consolidated into one section with arch codenames inline (gfx942 → MI300X/MI325X = CDNA3, gfx950 → MI355X = CDNA4). `pip install torch` uses `--index-url` instead of `-f` so pip cannot silently fall back to a CPU-only PyPI wheel (matches CLAUDE.md).
- **Getting Started.** Collapsed the Docker image table to the latest validated tag and pointed at Docker Hub for older releases. Dropped the manual `micromamba activate base` step (the env is auto-activated). Used the concrete image tag plus `--name=flashinfer-rocm` in the `docker run` snippet.
- **Trying the Examples.** Simplified to point at `examples/` plus one run command, with no wget-based downloads.
- **Install from Source.** Renamed from "Build from Source"; rewrote the ambiguous "Environment name varies …" note (and later removed it once the build / run blocks made the matching tag self-evident).
- **AITER Support.** Collapsed the section intro to avoid re-listing conditions already in the matrix; cross-linked Known Limitations. Rewrote the Known Limitations preamble to state the two-group split (hard errors vs. silently-ignored kwargs). Dropped the redundant Single Prefill Example (Basic Usage already shows the call pattern).
- **Environment Variables.** New section documenting runtime env vars: `FLASHINFER_USE_TORCH_CUSTOM_OPS`, `FLASHINFER_HIP_FUSED_CASCADE`, `FLASHINFER_LOGGING_LEVEL`, `FLASHINFER_DISABLE_JIT`, `ROCM_PATH` / `ROCM_HOME`. Build-time vars stay in CLAUDE.md and are linked from here.
- **Runtime Helpers.** Short snippet showing `is_aiter_supported` and `check_torch_rocm_compatibility`; calls out `validate_flashinfer_rocm_arch` as a build-time validator, not a runtime helper.
- **CPX-mode pytest notes.** Split the dense paragraph into labelled bullets (Worker count / Reruns / slow marker / HIPBLAS retry).
- **Basic Usage.** Moved to the end of the README as a closing example.
- **License and Acknowledgements.** Added; the contributing reminder lives on its own line.
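
The arch-to-product mapping listed under GPU / ROCm / PyTorch can be written down as a small lookup table. This is only an illustrative sketch: `ARCH_INFO` and `describe_arch` are hypothetical names that do not come from FlashInfer, and the handling of feature suffixes after a colon (for example `gfx942:sramecc+:xnack-`) is an assumption about how some tools report the architecture string.

```python
from typing import Optional, Tuple

# Mapping taken from the PR description: gfx target -> (architecture, products).
ARCH_INFO = {
    "gfx942": ("CDNA3", ("MI300X", "MI325X")),
    "gfx950": ("CDNA4", ("MI355X",)),
}


def describe_arch(gcn_arch_name: str) -> Optional[Tuple[str, Tuple[str, ...]]]:
    """Return (architecture, products) for a gfx target, or None if unlisted.

    Assumption: anything after the first ':' is a feature suffix and is ignored.
    """
    base = gcn_arch_name.split(":", 1)[0]
    return ARCH_INFO.get(base)
```

A caller could use the `None` result to decide that the GPU is outside the set of architectures the README documents.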

`flashinfer/mla_rocm.py` + `tests/rocm_tests/test_mla_aiter_hip.py`

- Accept `backend="auto"` as an alias for `"aiter"` on the ROCm MLA wrapper (default is now `"auto"` to match every other ROCm wrapper). Previously the wrapper raised `ValueError` on anything other than `"aiter"`, leaving MLA as the odd one out in the public API even though there is exactly one implementation to pick from on ROCm.
- New tests: `test_mla_backend_accepts_auto_and_aiter` (parametrized over both values) and `test_mla_backend_rejects_unsupported` (confirms `backend="fa2"` still raises; runs without a GPU since the check fires before the AITER probe).
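
The backend check described above might look roughly like the following. This is a minimal sketch, not the code in `flashinfer/mla_rocm.py`: the function name `resolve_mla_backend`, the constant name, and the error message wording are all hypothetical. Only the behaviour (accept `"auto"` and `"aiter"`, default to `"auto"`, raise `ValueError` otherwise) comes from the PR description.

```python
# Hypothetical stand-in for the validation performed in the wrapper's __init__.
_SUPPORTED_MLA_BACKENDS = ("auto", "aiter")


def resolve_mla_backend(backend: str = "auto") -> str:
    """Map the user-facing backend string to the single ROCm implementation."""
    if backend not in _SUPPORTED_MLA_BACKENDS:
        # The check needs no GPU, so it can fire before any AITER probing.
        raise ValueError(
            f"Unsupported MLA backend {backend!r} on ROCm; "
            f"expected one of {_SUPPORTED_MLA_BACKENDS}"
        )
    # "auto" is an alias: there is only one implementation to pick from.
    return "aiter"
```

Because the validation is a pure string check, a test that passes `backend="fa2"` can run on a machine without a GPU, which matches the note about `test_mla_backend_rejects_unsupported`.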

Test plan

- `pre-commit run -a` passes.
- `pre-commit run markdownlint --files README.md` passes after every change.
- Every TOC entry resolves to an `##` heading in the body.
- Every ✅ in the Feature Support Matrix has a backing `tests/rocm_tests/test_*_hip.py`.
- `pytest tests/rocm_tests/test_mla_aiter_hip.py`: 11 passed.
- Render the README on the PR page and visually confirm tables, code blocks, and `<details>` sections look right.
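
The TOC check in the plan above could be automated along these lines. This is a sketch under stated assumptions: `github_slug` only approximates GitHub's anchor rule (lowercase, drop punctuation, spaces to hyphens), it does not handle duplicate headings, and both function names are hypothetical.

```python
import re
from typing import List


def github_slug(heading: str) -> str:
    """Approximate the anchor GitHub generates for a heading."""
    text = heading.strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)  # drop punctuation such as commas
    return text.replace(" ", "-")


def unresolved_toc_links(markdown: str) -> List[str]:
    """Return in-page link targets that match no level-2 heading."""
    headings = {
        github_slug(m.group(1))
        for m in re.finditer(r"^## +(.+?)\s*$", markdown, flags=re.MULTILINE)
    }
    links = re.findall(r"\]\(#([^)]+)\)", markdown)
    return [anchor for anchor in links if anchor not in headings]
```

Running `unresolved_toc_links` over the README text and asserting the result is empty would turn the manual check into a repeatable one.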

🤖 Generated with Claude Code

Reframe the top-level README for developers embedding FlashInfer+ROCm. Add a minimal usage example, feature matrix with prefill backends (fa2, aiter, fa3_cdna3), consolidated GPU/ROCm/PyTorch support, AITER page-size constraints from prefill_rocm, notebook link, and a dedicated prefill backends section. Remove verbose docker details blocks in favor of inline context. Made-with: Cursor

Restore the practical sections that the prior rewrite dropped (Docker tag table, source-build instructions, CPX-mode pytest guidance, AITER install recipes) and refresh the Feature Support Matrix to reflect what has actually landed on amd-integration: Cascade, MLA (AITER), RoPE, paged KV-cache append, RMSNorm/AITER, sliding-window decode, torch.compile. Drop the stale fa3_cdna3 backend mention — it has no Python dispatch entry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Refreshes the FlashInfer+ROCm consumer-facing README to better reflect current ROCm feature availability on amd-integration, provide a quick-start usage snippet, and consolidate support/install guidance.

Changes:

Reworked README introduction + added a “Basic Usage” snippet for library consumers.
Updated the Feature Support Matrix and clarified GPU/ROCm/PyTorch support + PyTorch install instructions (--index-url).
Expanded/updated AITER section (install options, limitations, and updated capability notes) and added a License/Acknowledgements section.

Split the single "AITER backend" column into HIP and AITER columns plus a new `backend="auto"` column that spells out the exact conditions that auto-routes to AITER vs. HIP per kernel. MLA is flagged as AITER-only (no HIP fallback); RMSNorm auto stays on HIP even though AITER is available (opt-in only). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The previous matrix listed dtypes inconsistently — single-decode named fp16/bf16/fp8 explicitly while sibling rows didn't. Drop the implicit fp16/bf16 enumeration (already covered by the ✅ HIP marker) and call out fp8 only where it's actually supported: batch decode KV-cache (E4M3FNUZ), RoPE fused quant+append (E4M3FNUZ + E5M2FNUZ), paged KV-cache append HIP path. Prefill rows mark fp8 as WIP. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

…helpers

The matrix referenced a nonexistent `FLASHINFER_ENABLE_TORCH_COMPILE`; the actual gate is `FLASHINFER_USE_TORCH_CUSTOM_OPS=1` (must be set before importing flashinfer, requires PyTorch >= 2.4). While here, add an Environment Variables section covering the runtime knobs that aren't already in CLAUDE.md (`FLASHINFER_HIP_FUSED_CASCADE`, `FLASHINFER_LOGGING_LEVEL`, `FLASHINFER_DISABLE_JIT`, `ROCM_PATH`/`ROCM_HOME`) and a Runtime Helpers section pointing at `is_aiter_supported` and `check_torch_rocm_compatibility`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
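
Since the flag has to be set before flashinfer is imported, the ordering can be made explicit in code. The following is a minimal sketch: the helper name `enable_torch_custom_ops` and the guard on `sys.modules` are illustrative additions, not something the library provides; only the variable name and the value `1` come from the commit message.

```python
import os
import sys


def enable_torch_custom_ops() -> None:
    """Set the opt-in flag, refusing if flashinfer was already imported."""
    if "flashinfer" in sys.modules:
        raise RuntimeError(
            "FLASHINFER_USE_TORCH_CUSTOM_OPS must be set before importing flashinfer"
        )
    os.environ["FLASHINFER_USE_TORCH_CUSTOM_OPS"] = "1"


enable_torch_custom_ops()
# import flashinfer  # only after the flag is set (left commented in this sketch)
```

Exporting the variable in the shell before launching Python achieves the same thing; the guard above is only useful when the flag is set from inside a script.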

Keep only the current validated rocm/flashinfer image and point readers at hub.docker.com/r/rocm/flashinfer/tags for older ROCm/PyTorch combos. The full table goes stale on every release. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The base environment is activated on shell start inside the rocm/flashinfer images, so the explicit `micromamba activate base` call was misleading. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace <container-name>/<docker-image-tag> placeholders with the flashinfer-rocm container name and the actual latest image tag so the snippet is copy-pasteable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Replace the wget-based download steps with a brief pointer to the examples/ directory and a single run command. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The Basic Usage snippet at the top of the README already shows the same call pattern; the AITER-section duplicate added no extra information. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Hyperlink the first mention of CDNA3 to the ROCm MI300 microarchitecture docs and CDNA4 to AMD's MI350 product page. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the GPU product / ROCm doc links with the AMD CDNA3 and CDNA4 architecture whitepapers — the right reference for the architectures themselves rather than the cards. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

- Drop the "AMD ROCm port" redundancy with the title and lead with what ships in-tree (the HIP kernel set) versus what dispatches to AITER.
- Cross-link the Feature Support Matrix and AITER from the first paragraph so readers landing on the README see the structure immediately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Install / Feature Matrix / Build / AITER are what a new reader needs first; the code snippet reads better as a closing example. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Tighten Feature Matrix preamble and Legend; drop the duplicate AITER link (already in the intro).
- Collapse the AITER Support intro that overlapped with the matrix; cross-link Known Limitations instead of re-listing the conditions.
- Rewrite Known Limitations preamble to call out the two-group split (hard errors vs. silently-ignored kwargs) more directly.
- Split the dense CPX-mode pytest notes into labelled bullets.
- Drop the unused `validate_flashinfer_rocm_arch` import from the runtime helpers snippet and note (separately) that it's a build-time validator, not a runtime helper.
- Move the pre-commit / pytest contributing reminder out of the License paragraph into its own line.
- Fix "Python tests suite" → "Python test suite".

Verified against the codebase: env var names + defaults (`FLASHINFER_USE_TORCH_CUSTOM_OPS`, `FLASHINFER_HIP_FUSED_CASCADE`, `FLASHINFER_LOGGING_LEVEL`, `FLASHINFER_DISABLE_JIT`, `ROCM_PATH`/`ROCM_HOME`), hip_utils / aiter_utils helper signatures, attention_reference.py path, and the MI308X CPX-mode reference (decode.cuh:707).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

MLA on ROCm previously forced the user to pass `backend="aiter"` explicitly: the wrapper's `__init__` raised `ValueError` on anything other than `"aiter"`, including the `"auto"` value used by every other ROCm kernel. That left MLA as the odd one out in the public API even though it has exactly one implementation to choose from on ROCm.

Accept both `"auto"` and `"aiter"` (default is now `"auto"` to match the rest of the ROCm wrappers); any other value still raises with an updated message. The behaviour is unchanged for callers who already pass `"aiter"`.

### Test plan

- New parametrized test covering `backend="auto"` / `"aiter"` construction.
- New test that `backend="fa2"` still raises `ValueError` (runs anywhere, no GPU required since the check fires before the AITER probe).
- Full `tests/rocm_tests/test_mla_aiter_hip.py`: 11 passed.
- `pre-commit run -a`: passed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The original "Environment name varies …" wording was ambiguous in context (the surrounding section is about the Docker image tag, not a shell or micromamba env). Rewrite to spell out that it's the Docker image tag that encodes the versions, and that the -t tag and the tag passed to docker run must match. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The build/run blocks already show the matching -t tag and the docker run image tag side-by-side; the extra explanatory note added noise without new information. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

README.md:352

This guidance suggests “pass `backend="fa2"` explicitly” when AITER ignores kwargs, but several AITER-enabled non-attention APIs don’t accept `"fa2"` (e.g., `append_paged_kv_cache` / `rmsnorm` use `"native"`). Please adjust the wording to reference the correct non-AITER backend(s) depending on the API being discussed.


**Conditions that fall back to `fa2` under `backend="auto"`:**

* GPU is not gfx942 or gfx950
* `kv_layout` is not `NHD`
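
The two fallback conditions quoted above can be expressed as a small decision function. This is a hedged sketch rather than FlashInfer's dispatch code: `resolve_auto_backend` and `AITER_ARCHS` are hypothetical names, and the quoted list may not be exhaustive (other conditions in the README are not reproduced here).

```python
# Architectures named in the quoted README excerpt.
AITER_ARCHS = frozenset({"gfx942", "gfx950"})


def resolve_auto_backend(gcn_arch: str, kv_layout: str) -> str:
    """Pick a backend for backend="auto" using only the two quoted conditions."""
    if gcn_arch not in AITER_ARCHS:
        return "fa2"  # GPU is not gfx942 or gfx950
    if kv_layout != "NHD":
        return "fa2"  # kv_layout is not NHD
    return "aiter"
```

For example, `resolve_auto_backend("gfx942", "HND")` falls back to `"fa2"` because of the layout condition alone.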

- Feature matrix: add `pos_encoding_mode="NONE"` to batch decode AITER auto-routing criteria; add gfx942/gfx950 arch gate to the `append_paged_kv_cache` row.
- AITER Support: clarify the in-tree backend strings per-op (`fa2` for attention wrappers vs `native` for `append_paged_kv_cache` / `rmsnorm`) and call out the two backend-specific quirks (`rmsnorm` auto stays on HIP, batch decode auto avoids CUDA-graph / tensor cores).
- Known Limitations: promote `pos_encoding_mode != "NONE"` and batch decode's `use_cuda_graph` / `use_tensor_cores` from the silently-ignored group to the hard-error / fallback group; the AITER attention paths reject them outright.
- Runtime Helpers: add the missing `import torch` to the snippet and correct the `is_aiter_supported` comment: the function only checks ROCm build + GPU arch, not whether the `aiter` Python package can actually be imported.
- CLAUDE.md: update the README anchor link to follow the renamed "GPU, ROCm, and PyTorch Support" section so cross-references stay live.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
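
The per-op backend strings named in this commit message can be summarised as a lookup. The sketch below is illustrative only: the `"attention"` key is a stand-in for the attention wrappers as a group, and `IN_TREE_BACKEND` / `explicit_in_tree_backend` are hypothetical names, not library API.

```python
# In-tree (non-AITER) backend string per kind of op, per the commit message.
IN_TREE_BACKEND = {
    "attention": "fa2",                 # attention wrappers
    "append_paged_kv_cache": "native",
    "rmsnorm": "native",
}


def explicit_in_tree_backend(op_kind: str) -> str:
    """Return the backend string to pass when opting out of AITER for an op."""
    try:
        return IN_TREE_BACKEND[op_kind]
    except KeyError:
        raise ValueError(f"unknown op kind: {op_kind!r}") from None
```

This makes the reviewer's point concrete: advice to "pass `fa2` explicitly" only applies to the attention entry, while the other two ops expect `"native"`.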

demandal25 and others added 2 commits May 21, 2026 04:14

Copilot AI review requested due to automatic review settings May 21, 2026 04:34

Copilot started reviewing on behalf of demandal25 May 21, 2026 04:34 View session

Copilot AI reviewed May 21, 2026

demandal25 and others added 2 commits May 21, 2026 12:52

Copilot AI review requested due to automatic review settings May 21, 2026 12:54

Copilot started reviewing on behalf of demandal25 May 21, 2026 12:54 View session

Copilot AI reviewed May 21, 2026

demandal25 and others added 2 commits May 21, 2026 13:47

Copilot AI review requested due to automatic review settings May 21, 2026 13:50

Copilot started reviewing on behalf of demandal25 May 21, 2026 13:50 View session

demandal25 and others added 2 commits May 21, 2026 13:51

docs: drop manual micromamba activate from docker verify step

7c122a6

docs: use concrete image tag and container name in docker run

012c053

Copilot AI reviewed May 21, 2026

demandal25 and others added 2 commits May 21, 2026 13:59

docs(readme): simplify "Trying the Examples" to point at examples/

4cdf823

docs(readme): drop redundant Single Prefill Example from AITER section

7da63b9

Copilot AI review requested due to automatic review settings May 21, 2026 14:00

Copilot started reviewing on behalf of demandal25 May 21, 2026 14:01 View session

docs(readme): rename "Build from Source" to "Install from Source"

7b3bcdd

Copilot AI reviewed May 21, 2026

demandal25 and others added 2 commits May 21, 2026 15:25

docs(readme): link CDNA3 / CDNA4 to their architecture references

9f13ae1

Copilot AI review requested due to automatic review settings May 21, 2026 15:27

Copilot started reviewing on behalf of demandal25 May 21, 2026 15:27 View session

Copilot AI reviewed May 21, 2026

demandal25 and others added 2 commits May 21, 2026 15:37

docs(readme): move Basic Usage to the end of the README

df1cb84

Copilot AI review requested due to automatic review settings May 21, 2026 15:40

Copilot started reviewing on behalf of demandal25 May 21, 2026 15:41 View session

Copilot AI reviewed May 21, 2026

demandal25 and others added 2 commits May 21, 2026 16:06

Copilot AI review requested due to automatic review settings May 21, 2026 16:08

Copilot started reviewing on behalf of demandal25 May 21, 2026 16:08 View session

Copilot AI reviewed May 21, 2026

demandal25 changed the title ~~docs: refresh README for amd-flashinfer library consumers~~ docs: update README for amd-flashinfer library consumers May 21, 2026

demandal25 merged commit 31ea6a9 into ROCm:amd-integration May 21, 2026
1 check passed

demandal25 deleted the update-readme branch May 21, 2026 16:24
