[Dlight][CPU] Add CPU Backend Support for GEMV Optimization #17663

Open
wants to merge 5 commits into base: main

Conversation

mengshyu
Contributor

This PR adds Dlight CPU support with optimized GEMV scheduling, including pattern detection, loop tiling, vectorization, and parallel execution. It improves maintainability by refining target checks, reduction handling, and scheduling logic.
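
For reference, a minimal sketch (not the exact rule added in this PR) of the kind of TensorIR schedule transformations described above, applied to a hand-written GEMV: split for loop tiling, parallel across CPU threads, and vectorize for SIMD. The kernel shape, block name, and tile factor of 8 are illustrative assumptions.

import tvm
from tvm.script import tir as T

@T.prim_func
def gemv(a: T.handle, b: T.handle, c: T.handle):
    # Illustrative GEMV: C[i] = sum_k A[i, k] * B[k]
    A = T.match_buffer(a, (4096, 4096), "float32")
    B = T.match_buffer(b, (4096,), "float32")
    C = T.match_buffer(c, (4096,), "float32")
    for i, k in T.grid(4096, 4096):
        with T.block("gemv"):
            vi, vk = T.axis.remap("SR", [i, k])
            with T.init():
                C[vi] = T.float32(0)
            C[vi] = C[vi] + A[vi, vk] * B[vk]

sch = tvm.tir.Schedule(gemv)
block = sch.get_block("gemv")
i, k = sch.get_loops(block)
i_outer, i_inner = sch.split(i, factors=[None, 8])  # loop tiling; factor 8 is an assumption
sch.reorder(i_outer, k, i_inner)                    # keep the row tile innermost
sch.parallel(i_outer)                               # spread row tiles across CPU threads
sch.vectorize(i_inner)                              # SIMD over each row tile
lib = tvm.build(sch.mod, target="llvm")             # build for a generic CPU target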

CPU: AMD Ryzen 9 7950X 16-Core Processor
MODEL: Qwen2-0.5B-q4f16_1-MLC
Prompt: What is the meaning of life?

Results:

Metric                    Baseline                Optimized
prompt_tokens             27                      27
completion_tokens         235                     227
total_tokens              262                     254
prefill_tokens            27                      27
decode_tokens             234                     226
jump_forward_tokens       0                       0
prefill_tokens_per_s      0.9777329325367138      1.0010420333327994
decode_tokens_per_s       0.558195154052001       2.9349053824023454
end_to_end_latency_s      446.823128383           103.976080401
ttft_s                    27.614902906            26.971894387
inter_token_latency_s     1.9013750143957446      0.4580444070528635
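
In these runs, decode throughput improves from 0.558 to 2.935 tokens/s (about 5.3x), and end-to-end latency drops from 446.8 s to 104.0 s (about 4.3x).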

@tqchen
Member

tqchen commented Feb 18, 2025

cc @Hzfengsy, can you help take a look? Also cc @tlopex.

@Hzfengsy
Member

Also cc @HongHongHongL

return buffer_store.value.b


def is_gemv(sch: tir.Schedule, block_info: BlockInfo) -> Optional[List[tir.Buffer]]:
Member

Can we reuse gpu's util functions?

Member

I'm saying that we could create a folder named something like "analysis" or "utils" under the dlight folder, shared by the different backends.

Member

I agree this is a good suggestion; dlight.analysis sounds right.

Contributor Author

Hi @Hzfengsy, I've created an analysis folder so that the CPU and GPU backends reuse shared logic for GEMV. Could you recheck it? Thanks.
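
To make the intent of this thread concrete, a hypothetical sketch of the resulting layout (file paths and helper names are assumptions, not necessarily the exact layout in the PR):

# python/tvm/dlight/analysis/gemv.py   -- shared GEMV pattern analysis (is_gemv, normalize, ...)
# python/tvm/dlight/cpu/gemv.py        -- CPU GEMV scheduling rule
# python/tvm/dlight/gpu/gemv.py        -- GPU GEMV scheduling rule
# Both backend rules would then import the shared helpers instead of duplicating them:
from tvm.dlight.analysis.gemv import is_gemv, normalize  # assumed import path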

return ret if 0 < len(ret) < len(block_stmt.reads) else None


def normalize( # pylint: disable=too-many-locals, use-a-generator
Member

Maybe we can reuse this one as well
