-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Dlight][CPU] Add CPU Backend Support for GEMV Optimization #17663
base: main
Are you sure you want to change the base?
Conversation
This PR adds Dlight CPU support with optimized GEMV scheduling, including pattern detection, loop tiling, vectorization, and parallel execution. It improves maintainability by refining target checks, reduction handling, and scheduling logic. CPU: AMD Ryzen 9 7950X 16-Core Processor MODEL: Qwen2-0.5B-q4f16_1-MLC Prompt: What is the meaning of life? Results: Baseline: prompt_tokens=27 completion_tokens=235 total_tokens=262 extra={'prompt_tokens': 27, 'completion_tokens': 235, 'prefill_tokens': 27, 'decode_tokens': 234, 'jump_forward_tokens': 0, 'prefill_tokens_per_s': 0.9777329325367138, 'decode_tokens_per_s': 0.558195154052001, 'end_to_end_latency_s': 446.823128383, 'ttft_s': 27.614902906, 'inter_token_latency_s': 1.9013750143957446} Optimized: usage: prompt_tokens=27 completion_tokens=227 total_tokens=254 extra={'prompt_tokens': 27, 'completion_tokens': 227, 'prefill_tokens': 27, 'decode_tokens': 226, 'jump_forward_tokens': 0, 'prefill_tokens_per_s': 1.0010420333327994, 'decode_tokens_per_s': 2.9349053824023454, 'end_to_end_latency_s': 103.976080401, 'ttft_s': 26.971894387, 'inter_token_latency_s': 0.4580444070528635}
6bb3b44
to
34b4466
Compare
34b4466
to
e09b152
Compare
Also cc @HongHongHongL |
python/tvm/dlight/cpu/gemv.py
Outdated
return buffer_store.value.b | ||
|
||
|
||
def is_gemv(sch: tir.Schedule, block_info: BlockInfo) -> Optional[List[tir.Buffer]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reuse gpu's util functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
saying that we can create a folder named something like "analysis" or "utils" under dlight folder, for different backends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree this is a good suggestion, dlight.analysis sounds right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Hzfengsy, I've created a folder analysis to ensure CPU and GPU backends reuse shared logic for GEMV, could you recheck it, thanks.
python/tvm/dlight/cpu/gemv.py
Outdated
return ret if 0 < len(ret) < len(block_stmt.reads) else None | ||
|
||
|
||
def normalize( # pylint: disable=too-many-locals, use-a-generator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can reuse this one as well
abc8ad4
to
33b406b
Compare
This PR adds Dlight CPU support with optimized GEMV scheduling, including pattern detection, loop tiling, vectorization, and parallel execution. It improves maintainability by refining target checks, reduction handling, and scheduling logic.
CPU: AMD Ryzen 9 7950X 16-Core Processor
MODEL: Qwen2-0.5B-q4f16_1-MLC
Prompt: What is the meaning of life?
Results:
Baseline:
prompt_tokens=27 completion_tokens=235 total_tokens=262 extra={'prompt_tokens': 27, 'completion_tokens': 235, 'prefill_tokens': 27, 'decode_tokens': 234, 'jump_forward_tokens': 0, 'prefill_tokens_per_s': 0.9777329325367138,
'decode_tokens_per_s': 0.558195154052001,
'end_to_end_latency_s': 446.823128383, 'ttft_s': 27.614902906, 'inter_token_latency_s': 1.9013750143957446}
Optimized:
usage: prompt_tokens=27 completion_tokens=227 total_tokens=254 extra={'prompt_tokens': 27, 'completion_tokens': 227, 'prefill_tokens': 27, 'decode_tokens': 226, 'jump_forward_tokens': 0, 'prefill_tokens_per_s': 1.0010420333327994,
'decode_tokens_per_s': 2.9349053824023454,
'end_to_end_latency_s': 103.976080401, 'ttft_s': 26.971894387, 'inter_token_latency_s': 0.4580444070528635}