Updated README.md with config info and header font size (#473)
* Updated README.md with config info and header font size
* Changed AITER example to in 128 out 2048 bs 256
* Reorganized the "What is New" section and added a changelog
---------
Co-authored-by: arakowsk-amd <[email protected]>
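The AITER example configuration mentioned in the commit message (input length 128, output length 2048, batch size 256) maps naturally onto a benchmark invocation. A minimal sketch of building that command line follows; the script path `benchmarks/benchmark_latency.py` and its flag names are assumptions based on the vLLM repository layout, and `<model-id>` is a placeholder, so verify both against your checkout:

```python
# Sketch: assemble the benchmark command for the AITER example config
# (input 128, output 2048, batch size 256) cited in the commit message.
# NOTE: script path and flag names are assumed from vLLM's
# benchmarks/benchmark_latency.py; confirm against your vLLM checkout.

config = {
    "--input-len": 128,    # prompt tokens per request
    "--output-len": 2048,  # generated tokens per request
    "--batch-size": 256,   # requests batched per iteration
}

# "<model-id>" is a placeholder for the model being benchmarked.
cmd = ["python", "benchmarks/benchmark_latency.py", "--model", "<model-id>"]
for flag, value in config.items():
    cmd += [flag, str(value)]

print(" ".join(cmd))
```

Running this prints the assembled command so it can be inspected before launching an actual benchmark run inside the `rocm/vllm-dev` container.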
docs/dev-docker/README.md (+29 -19)

```diff
@@ -21,26 +21,12 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main
 
 ## What is New
 
-20250305_aiter:
-- AITER improvements
+- [Experimental AITER support](#aiter-use-cases)
+- [Experimental DeepSeek-V3 and DeepSeek-R1 support](#running-deepseek-v3-and-deepseek-r1)
+- Performance improvement for custom paged attention
 - Support for FP8 skinny GEMM
-
-20250207_aiter:
-- More performant AITER
 - Bug fixes
 
-20250205_aiter:
-- [AITER](https://github.com/ROCm/aiter) support
-- Performance improvement for custom paged attention
-- Reduced memory overhead bug fix
-
-20250124:
-- Fix accuracy issue with 405B FP8 Triton FA
-- Fixed accuracy issue with TP8
-
-20250117:
-- [Experimental DeepSeek-V3 and DeepSeek-R1 support](#running-deepseek-v3-and-deepseek-r1)
-
 ## Performance Results
 
 The data in the following tables is a reference point to help users validate observed performance. It should not be considered as the peak performance that can be delivered by AMD Instinct™ MI300X accelerator with vLLM. See the MLPerf section in this document for information about MLPerf 4.1 inference results. The performance numbers above were collected using the steps below.
@@ -62,7 +48,7 @@ The table below shows performance data where a local inference client is fed req
 
 *TP stands for Tensor Parallelism.*
 
-## Latency Measurements
+### Latency Measurements
 
 The table below shows latency measurement, which typically involves assessing the time from when the system receives an input to when the model produces a result.
 
@@ -103,6 +89,8 @@ The table below shows latency measurement, which typically involves assessing th
```
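The latency metric the README describes — time from when the system receives an input to when the model produces a result — can be sketched with a minimal per-request timing loop. The `run_model` stub below is hypothetical, standing in for a real vLLM generate call; only the timing and aggregation pattern is the point:

```python
import statistics
import time

def run_model(prompt: str) -> str:
    """Hypothetical stand-in for an actual vLLM generate() call."""
    time.sleep(0.001)  # simulate model work
    return prompt[::-1]

# Measure per-request latency: receipt of input -> model result.
latencies = []
for prompt in ["hello", "world", "vllm"] * 10:
    start = time.perf_counter()
    _ = run_model(prompt)
    latencies.append(time.perf_counter() - start)

mean_s = statistics.mean(latencies)
# Nearest-rank p99 over the sorted sample.
p99_s = sorted(latencies)[int(0.99 * (len(latencies) - 1))]
print(f"mean={mean_s * 1000:.2f} ms  p99={p99_s * 1000:.2f} ms")
```

Real benchmark harnesses report the same shape of numbers (mean and tail latency per request); this loop just makes the measurement boundary explicit.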