ROCm · demandal25 · May 21, 2026 · Apr 23, 2026 · May 21, 2026 · May 21, 2026
diff --git a/README.md b/README.md
@@ -1,16 +1,26 @@
 # FlashInfer+ROCm: An AMD ROCm port of FlashInfer
 
-FlashInfer+ROCm is a port of the [FlashInfer](https://github.com/flashinfer-ai/flashinfer) library
-that adds support for AMD Instinct GPUs. The project is in active development with current focus on
-porting attention kernels to ROCm.
-
-**Versioning:** The release tag format `<upstream_version>+amd` ties each FlashInfer+ROCm release
-to its corresponding upstream tag (e.g., `0.2.5+amd.2` is second release of amd-flashinfer based on upstream version `v0.2.5`).
+FlashInfer+ROCm is an AMD ROCm port of the
+[FlashInfer](https://github.com/flashinfer-ai/flashinfer) attention,
+RoPE, normalization, sampling, and logits-processor kernels for LLM
+inference on AMD Instinct GPUs. The port targets CDNA3 (gfx942 —
+MI300X / MI325X) and CDNA4 (gfx950 — MI355X), and is aimed at developers
+embedding FlashInfer kernels into their own training or serving stack.
+
+The project is in active development with the primary focus on attention
+(single and batch prefill / decode) and the surrounding KV-cache, RoPE,
+and normalization kernels. See [CHANGELOG.md](CHANGELOG.md) for the
+full release history.
+
+**Versioning:** The release tag format `<upstream_version>+amd.<n>` ties
+each FlashInfer+ROCm release to its corresponding upstream tag (e.g.
+`0.5.3+amd.1` is the first AMD release based on upstream `v0.5.3`).
 
 ## Table of Contents
 
+* [Basic Usage](#basic-usage)
 * [Feature Support Matrix](#feature-support-matrix)
-* [GPU and ROCm Support](#gpu-and-rocm-support)
+* [GPU, ROCm, and PyTorch Support](#gpu-rocm-and-pytorch-support)
 * [Getting Started](#getting-started)
   * [Option 1: Get a Pre-built Docker Image](#option-1-get-a-pre-built-docker-image)
   * [Option 2: Install from a Wheel Package](#option-2-install-from-a-wheel-package)
@@ -20,47 +30,84 @@ to its corresponding upstream tag (e.g., `0.2.5+amd.2` is second release of amd-
   * [Building and Installing a Wheel Package](#building-and-installing-a-wheel-package)
   * [Running Tests](#running-tests)
 * [AITER Support](#aiter-support)
-  * [Single Prefill AITER example](#single-prefill-example)
+  * [Install AITER from source](#install-aiter-from-source)
+  * [Install AITER wheel package](#install-aiter-wheel-package)
+  * [Known Limitations](#known-limitations)
+  * [Single Prefill Example](#single-prefill-example)
+* [License and Acknowledgements](#license-and-acknowledgements)
+
+## Basic Usage
+
+```python
+import torch
+import flashinfer
+
+# PyTorch+ROCm still uses device="cuda" for AMD GPUs.
+q = torch.randn(1024, 32, 128, dtype=torch.float16, device="cuda")
+k = torch.randn(1024,  8, 128, dtype=torch.float16, device="cuda")  # GQA 4:1
+v = torch.randn(1024,  8, 128, dtype=torch.float16, device="cuda")
+
+# backend="auto" (default) routes to AITER when supported on gfx942/gfx950
+# and falls back to the in-tree fa2 HIP kernel otherwise.
+output = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)
+```
+
+See [`examples/`](examples/) for batch prefill, batch decode, and a
+Jupyter tutorial covering environment checks, AITER-backed prefill, and `logits_processor` on ROCm.
 
 ## Feature Support Matrix
 
 | Kernel Type | FP16 / BF16 | FP8 (E4M3, E5M2) | Has AITER backend | Notes |
 | :--- | :---: | :---: | :---: | :--- |
-| **Decode Attention** | ✅ | ✅ | No | Supports MHA, GQA, and MQA |
-| **Prefill Attention** | ✅ | WIP | ✅ | Supports MHA, GQA, and MQA |
-| **Cascade Attention** | TBD | TBD | No | Not Yet Ported |
-| **MLA** | TBD | TBD | No | Not Yet Ported |
-| **POD** | TBD | TBD | No | Not Yet Ported |
-| **Positional Encoding** | TBD | TBD | No | Not Yet Ported |
-| **Sampling** | ✅ | TBD | No | Supports Top-K/Top-P Sampling/OnlineSoftmax/SamplingFromLogits |
-| **Logits Processor** | ✅ | TBD | No | |
-| **Normalization** | ✅ | TBD | No | Supports RMS-Norm/Layer-Norm |
+| **Single / Batch Decode Attention** | ✅ | ✅ (E4M3FNUZ KV-cache) | ✅ (batch paged, fp16/bf16) | MHA, GQA, MQA; sliding-window on the AITER path; CUDA-graph support |
+| **Single / Batch Prefill Attention** | ✅ | WIP | ✅ (single, batch-paged, batch-ragged) | MHA, GQA, MQA |
+| **Cascade Attention** | ✅ | — | No | Two-level shared-prefix attention |
+| **MLA (Multi-head Latent Attention)** | ✅ (bf16, `page_size=1`) | — | ✅ (AITER-only path) | DeepSeek-style 192/128 head-dim split; **requires AITER** on ROCm |
+| **POD Attention** | TBD | TBD | No | Code present; **not yet validated on ROCm** |
+| **RoPE (Positional Encoding)** | ✅ | — | No | LLaMA-style + LLaMA 3.1 scaling; fused RoPE + paged-KV append |
+| **Paged KV-Cache Append** | ✅ | ✅ | ✅ (opt-in) | `append_paged_kv_cache` |
+| **Sampling** | ✅ | — | No | Top-K / Top-P / Min-P / OnlineSoftmax / SamplingFromLogits |
+| **Logits Processor** | ✅ | — | No | Composable processor pipeline (cap, mask, temperature, …) |
+| **Normalization** | ✅ | — | ✅ (RMSNorm only) | RMSNorm, LayerNorm, Gemma RMSNorm |
+| **Activation** | ✅ | — | No | SiLU / GELU with fused gating |
+| **Quantization** | ✅ | — | No | `packbits`, `segment_packbits` |
+| **`torch.compile`** | ✅ (opt-in) | — | n/a | Enabled via the `FLASHINFER_ENABLE_TORCH_COMPILE` env flag |
+
+Every ✅ row above is exercised by a matching test file, `tests/rocm_tests/test_*_hip.py`.
+
+## GPU, ROCm, and PyTorch Support
 
-## GPU and ROCm Support
+**Supported GPUs:** gfx942 (CDNA3 — MI300X, MI325X), gfx950 (CDNA4 — MI355X).
 
-**Supported GPU:** gfx942 (CDNA3 architecture), gfx950 (CDNA4 architecture)
+**Supported ROCm versions:** 7.0.2, 7.1.1, 7.2.
 
-**Supported ROCm versions:** 7.0.2, 7.1.1, 7.2
+**Supported PyTorch+ROCm versions:** 2.8.0, 2.9.1.
 
-## Torch Version Support
+Install the matching ROCm-enabled PyTorch wheel from
+<https://repo.radeon.com>:
 
-**Torch+ROCm:** 2.8.0, 2.9.1
+```bash
+pip install torch==2.9.1 --index-url https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/
+```
 
-**Note**: Other versions may work but have not been tested. Refer to <https://repo.radeon.com/rocm/manylinux/rocm-rel-{rocm-version}/> (replacing `{rocm-version}` with the desired ROCm version, e.g., `7.0.2`) for available versions.
+Other versions may work but have not been tested. Replace `7.2` with the
+ROCm version you need; refer to
+<https://repo.radeon.com/rocm/manylinux/rocm-rel-{rocm-version}/> for
+available wheels.
 
 ## Getting Started
 
 ### Option 1: Get a Pre-built Docker Image
 
 AMD validates and publishes [FlashInfer images](https://hub.docker.com/r/rocm/flashinfer/tags)
-with ROCm backends on Docker Hub. The following Docker image tag and associated
-inventories represent the latest available FlashInfer version from the official Docker Hub.
+with ROCm backends on Docker Hub. The following Docker image tags
+represent the latest available FlashInfer+ROCm releases:
 
 | Docker image | ROCm | FlashInfer | PyTorch | Ubuntu | Python | GPU |
 | ------------ | ---- | ---------- | ------- | ------ | ------ | --- |
-| rocm/flashinfer:flashinfer-0.5.3.amd1_rocm7.2_ubuntu24.04_py3.12_pytorch2.9.1 |7.2.0 | v0.5.3 | 2.9.1 | 24.04 | 3.12 | MI355x, MI325X, MI300X |
-| rocm/flashinfer:flashinfer-0.5.3.amd1_rocm7.0.2_ubuntu24.04_py3.12_pytorch2.9.1 | 7.0.2 | v0.5.3 | 2.9.1 | 24.04 | 3.12 | MI355x, MI325X, MI300X |
-| rocm/flashinfer:flashinfer-0.2.5.amd2_rocm7.1.1_ubuntu24.04_py3.12_pytorch2.8 | 7.1.1 | v0.2.5 | 2.8.0 | 24.04 | 3.12 | MI325X, MI300X |
+| `rocm/flashinfer:flashinfer-0.5.3.amd1_rocm7.2_ubuntu24.04_py3.12_pytorch2.9.1` | 7.2.0 | v0.5.3 | 2.9.1 | 24.04 | 3.12 | MI355X, MI325X, MI300X |
+| `rocm/flashinfer:flashinfer-0.5.3.amd1_rocm7.0.2_ubuntu24.04_py3.12_pytorch2.9.1` | 7.0.2 | v0.5.3 | 2.9.1 | 24.04 | 3.12 | MI355X, MI325X, MI300X |
+| `rocm/flashinfer:flashinfer-0.2.5.amd2_rocm7.1.1_ubuntu24.04_py3.12_pytorch2.8` | 7.1.1 | v0.2.5 | 2.8.0 | 24.04 | 3.12 | MI325X, MI300X |
 
 **Start a container:**
 
@@ -73,14 +120,14 @@ docker run -it --privileged --network=host --device=/dev/kfd --device=/dev/dri \
 **Activate the environment and verify:**
 
 ```bash
-# Activate micromamba environment (Note: env name may vary based on the image)
+# Activate the micromamba environment (env name may vary based on the image)
 micromamba activate base
 
 # Verify installation
 python -c "import flashinfer; print(flashinfer.__version__)"
 ```
 
-Expected output: `0.5.3+amd.1` (with a possible JIT backend message)
+Expected output: `0.5.3+amd.1` (with a possible JIT backend message).
 
 ### Option 2: Install from a Wheel Package
 
@@ -90,14 +137,14 @@ Install from AMD's package repository:
 pip install amd-flashinfer --index-url https://pypi.amd.com/simple/
 ```
 
-Install the needed ROCm-enabled torch package from <https://repo.radeon.com>:
+Install the matching ROCm-enabled torch package from <https://repo.radeon.com>:
 
 ```bash
-pip install torch==2.9.1 -f https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2
+pip install torch==2.9.1 --index-url https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/
 ```
 
-**NOTE**: The torch version should be exactly as available on repo.radeon.com otherwise a non-ROCm
-torch version will get installed from pypi.
+**NOTE:** Use `--index-url` (not `-f`) so pip cannot silently fall back
+to a non-ROCm torch wheel from PyPI.
 
 ### Trying the Examples
 
@@ -116,11 +163,14 @@ done
 
 **Available examples:**
 
-* `single_prefill_example.py` - Single-sequence prefill attention
-* `batch_prefill_example.py` - Batched prefill attention
-* `batch_decode_example.py` - Batched decode attention
-* `examples/amd_flashinfer_rocm_tutorial.ipynb` - Jupyter tutorial: environment verification (`hip_utils`), AITER-backed prefill examples, and `logits_processor` on ROCm
-* `examples/run_jupyter_server.sh` - Start JupyterLab from the repo root (run inside your ROCm/FlashInfer environment or Docker container)
+* `single_prefill_example.py` — single-sequence prefill attention
+* `batch_prefill_example.py` — batched prefill attention
+* `batch_decode_example.py` — batched decode attention
+* `examples/amd_flashinfer_rocm_tutorial.ipynb` — Jupyter tutorial:
+  environment verification (`hip_utils`), AITER-backed prefill examples,
+  and `logits_processor` on ROCm
+* `examples/run_jupyter_server.sh` — start JupyterLab from the repo root
+  (run inside your ROCm/FlashInfer environment or Docker container)
 
 ## Build from Source
 
@@ -184,7 +234,8 @@ docker run -it \
 </details>
 <!-- markdownlint-enable MD033 -->
 
-**Note:** Environment name varies based on Python, PyTorch, and ROCm versions.
+**Note:** Environment name varies based on Python, PyTorch, and ROCm
+versions.
 
 ### Building and Installing a Wheel Package
 
@@ -198,12 +249,12 @@ cd dist && pip install amd_flashinfer-*.whl
 **Editable install for development:**
 
 ```bash
-python -m pip install --no-build-isolation -ve.
+python -m pip install --no-build-isolation -ve .
 ```
 
-**Note:** The `--no-deps` flag assumes dependencies are pre-installed. Omit it
-to download dependencies during build. AOT builds take longer and use more disk
-space but avoid JIT compilation at runtime.
+**Note:** The `--no-deps` flag assumes dependencies are pre-installed.
+Omit it to download dependencies during build. AOT builds take longer
+and use more disk space but avoid JIT compilation at runtime.
 
 ### Running Tests
 
@@ -214,20 +265,21 @@ The Python tests suite can be run with pytest:
 pytest
 
 # Run specific test file
-pytest tests/test_decode_kernels_hip.py
+pytest tests/rocm_tests/test_batch_decode_kernels_hip.py
 
 # Run with pattern matching
-pytest -k "test_decode_kernels_hip"
+pytest -k "test_batch_decode_kernels_hip"
 
 # Verbose output
 pytest -v
 
-# To run tests parallely on multiple GPUs
-pytest -n auto # Uses all available GPUs
-pytest -n 2 # Use only two GPUs
+# Run tests in parallel across multiple GPUs
+pytest -n auto  # Uses all available GPUs
+pytest -n 2     # Use only two GPUs
 ```
 
-The default test configuration is specified in [pyproject.toml](pyproject.toml) under the `testpaths` setting.
+The default test configuration is specified in [pyproject.toml](pyproject.toml)
+under the `testpaths` setting.
 
 #### Recommended invocation on AMD CPX systems
 
@@ -256,7 +308,10 @@ pytest -n auto --reruns 2 -m "slow"
 ## AITER Support
 
 FlashInfer+ROCm supports the use of [AITER](https://github.com/ROCm/aiter) as a
-backend. The `aiter` backend is enabled for the `single_prefill` and `batch_prefill` kernels.
+backend. The `aiter` backend is enabled for the `single_prefill`,
+`batch_prefill` (paged and ragged), `batch_decode`, `append_paged_kv_cache`,
+`rmsnorm`, and `MLA` paths. MLA on ROCm is **only** available via AITER —
+there is no in-tree HIP MLA kernel yet.
 
 **On gfx942/gfx950 GPUs, `backend="auto"` (the default) automatically selects the AITER backend**
 when the call parameters are compatible (fp16/bf16, NHD layout, no custom mask, equal Q/K/V
@@ -316,10 +371,12 @@ results — pass `backend="fa2"` explicitly if you need any of these):
   `{16, 1024}` (or `{128, 256, 1024}` on `amd-aiter==0.1.10`). Other page
   sizes still work but go through an extra GPU gather to flatten paged KV
   before the AITER call.
-* Ragged (non-paged) KV is not yet implemented on the AITER batch-prefill
-  path. `BatchPrefillWithRaggedKVCacheWrapper` therefore forces the backend
-  to `fa2` regardless of whether you pass `backend="auto"` or
-  `backend="aiter"` (a warning is logged in the latter case).
+* Ragged (non-paged) batch prefill via AITER is supported through
+  `BatchPrefillWithRaggedKVCacheWrapper`. The wrapper auto-routes to
+  AITER under `backend="auto"` when the standard AITER compatibility
+  conditions are met and falls back to `fa2` otherwise.
+* MLA on ROCm currently supports only `bfloat16` and `page_size=1`
+  through the AITER backend.
 
 ### Single Prefill Example
 
@@ -340,8 +397,15 @@ q = torch.randn(seq_len, num_qo_heads, head_dim, dtype=torch.float16, device="cu
 k = torch.randn(seq_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
 v = torch.randn(seq_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
 
-# Run single prefill attention with causal masking
+# Run single prefill attention with causal masking.
 # On gfx942/gfx950, backend="auto" (default) routes to AITER automatically.
 # Pass backend="aiter" to require AITER explicitly, or backend="fa2" to skip it.
 output = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, backend="auto")
 ```
+
+## License and Acknowledgements
+
+FlashInfer+ROCm is released under the Apache-2.0 License — see
+[LICENSE](LICENSE) and [NOTICE](NOTICE). Upstream project:
+[flashinfer-ai/flashinfer](https://github.com/flashinfer-ai/flashinfer).
+Run `pre-commit run -a` and `pytest` before opening a PR.
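
The versioning rule stated in the README hunk above (`<upstream_version>+amd.<n>`, e.g. `0.5.3+amd.1`) can be checked programmatically. The following is a minimal sketch and not part of the PR: `parse_release_tag` and `ReleaseTag` are hypothetical names, and the regular expression assumes the upstream version is a plain dotted number such as `0.5.3`, optionally prefixed with `v`.

```python
import re
from typing import NamedTuple


class ReleaseTag(NamedTuple):
    upstream: str     # upstream FlashInfer version, e.g. "0.5.3"
    amd_release: int  # AMD release counter, e.g. 1


# Assumed shape: optional "v", dotted numeric upstream version,
# then the literal "+amd." followed by an integer release counter.
_TAG_RE = re.compile(r"^v?(?P<upstream>\d+(?:\.\d+)*)\+amd\.(?P<n>\d+)$")


def parse_release_tag(tag: str) -> ReleaseTag:
    """Split a FlashInfer+ROCm release tag into upstream and AMD parts."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a <upstream_version>+amd.<n> tag: {tag!r}")
    return ReleaseTag(match.group("upstream"), int(match.group("n")))


if __name__ == "__main__":
    print(parse_release_tag("0.5.3+amd.1"))
```

A check like this could be applied to the string printed by `python -c "import flashinfer; print(flashinfer.__version__)"` to confirm which upstream release an installed build tracks. Note that the Docker image tags in the table use a different spelling (`0.5.3.amd1`), which this sketch deliberately rejects.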