Conversation

@achirkin
Contributor

@achirkin achirkin commented Nov 3, 2025

Improve the efficiency of process_and_fill_codes_kernel by writing the codes in larger chunks.
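
To illustrate what "writing the codes in larger chunks" can mean, here is a minimal host-side C++ sketch, not the kernel code from this PR. It packs fixed-width codes into a 64-bit staging word and stores that word only once it is full, instead of touching the output for every code. The function name `pack_codes_chunked`, the 64-bit chunk size, and the little-endian bit order are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): packs `bits`-wide codes
// (1 <= bits <= 32) into 64-bit words, lowest bits first.
// A word is appended to the output only when it is completely filled,
// so each output word is written exactly once.
std::vector<std::uint64_t> pack_codes_chunked(const std::vector<std::uint32_t>& codes,
                                              unsigned bits)
{
  std::vector<std::uint64_t> out;
  std::uint64_t staging = 0;  // accumulates codes before a single wide store
  unsigned filled       = 0;  // number of valid bits currently in `staging`
  const std::uint64_t mask = (std::uint64_t{1} << bits) - 1;

  for (std::uint32_t c : codes) {
    const std::uint64_t v = c & mask;
    staging |= v << filled;  // bits that overflow past bit 63 are dropped here
    filled += bits;
    if (filled >= 64) {
      out.push_back(staging);  // one wide store instead of many narrow ones
      filled -= 64;
      // carry over the high bits of `v` that did not fit into the stored word
      staging = (filled > 0) ? (v >> (bits - filled)) : 0;
    }
  }
  if (filled > 0) { out.push_back(staging); }  // flush the partially filled tail
  return out;
}
```

In a real kernel the staging word would typically live in a register and the store would go to global memory; the sketch only shows the accumulate-then-flush pattern.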

@achirkin achirkin requested a review from a team as a code owner November 3, 2025 15:24
@achirkin achirkin added the improvement Improves an existing functionality label Nov 3, 2025
@achirkin achirkin added the non-breaking Introduces a non-breaking change label Nov 3, 2025
@achirkin achirkin marked this pull request as draft November 4, 2025 14:55
@copy-pr-bot

copy-pr-bot bot commented Nov 4, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@achirkin
Contributor Author

achirkin commented Nov 4, 2025

Moving this to draft: although the write efficiency improves by 2-5x according to the Nsight profiler (fewer store instructions), the overall kernel runtime barely changes, because the bottleneck is reading the data and the ALU work (encoding). So the value of this PR is in question.

if (filled_bits >= BitsPerLabel) {
  filled_bits -= BitsPerLabel;
  // write the codes to global memory
  *out_codes_ptr++ = staging_codes;
Contributor

Narrowing the lane condition if (lane_id == 0) so that it guards only this line can improve warp parallelism.

3 participants