Skip to content

Conversation

@tgujar
Copy link
Contributor

@tgujar tgujar commented Nov 18, 2025

Description

MR improves performance of gather API for small string columns(avg.length <= 32 char) by using cub::DeviceMemcpy::Batched API to perform the gather for string columns with greater than ~0.5 million rows. The threshold for the the row count is decided based on benchmarking data on H100.

MR adds an additional test case to check string gather implementation where kernel is launched with > 1 CTA to verify correctness. It also adds more parameters to sweep for the gather benchmark for representing larger input row counts.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Nov 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 18, 2025
@tgujar
Copy link
Contributor Author

tgujar commented Nov 18, 2025

Benchmarking data:
gather_cub_bench.json
gather_baseline_bench.json

❯ python  nvbench_compare.py ~/data_remote/repos/forks2/cudf/cpp/build/benchmarks/gather_baseline_bench.json ~/data_remote/repos/forks2/cudf/cpp/build/benchmarks/gather_cub_bench.json
['/home/tgujar/data_remote/repos/forks2/cudf/cpp/build/benchmarks/gather_baseline_bench.json', '/home/tgujar/data_remote/repos/forks2/cudf/cpp/build/benchmarks/gather_cub_bench.json']
# copy

## [0] NVIDIA H100 NVL

|  min_width  |  max_width  |  num_rows  |  api   |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-------------|-------------|------------|--------|------------|-------------|------------|-------------|--------------|---------|----------|
|      0      |     32      |   32768    | gather | 111.099 us |       3.24% | 112.520 us |       2.96% |     1.420 us |   1.28% |   SAME   |
|      0      |     64      |   32768    | gather | 118.542 us |       3.92% | 119.767 us |       3.08% |     1.225 us |   1.03% |   SAME   |
|      0      |     128     |   32768    | gather | 119.131 us |       2.39% | 121.248 us |       2.67% |     2.117 us |   1.78% |   SAME   |
|      0      |     256     |   32768    | gather | 120.557 us |       3.07% | 122.292 us |       2.92% |     1.735 us |   1.44% |   SAME   |
|      0      |     32      |   65536    | gather | 120.678 us |       1.95% | 122.477 us |       2.01% |     1.798 us |   1.49% |   SAME   |
|      0      |     64      |   65536    | gather | 129.182 us |       1.72% | 130.930 us |       1.80% |     1.747 us |   1.35% |   SAME   |
|      0      |     128     |   65536    | gather | 127.069 us |       1.94% | 128.716 us |       1.81% |     1.647 us |   1.30% |   SAME   |
|      0      |     256     |   65536    | gather | 128.142 us |       1.43% | 129.840 us |       2.33% |     1.698 us |   1.32% |   SAME   |
|      0      |     32      |   131072   | gather | 129.837 us |       1.48% | 131.545 us |       2.09% |     1.708 us |   1.32% |   SAME   |
|      0      |     64      |   131072   | gather | 144.399 us |       1.55% | 146.016 us |       1.65% |     1.618 us |   1.12% |   SAME   |
|      0      |     128     |   131072   | gather | 143.701 us |       1.25% | 145.687 us |       1.14% |     1.986 us |   1.38% |   SLOW   |
|      0      |     256     |   131072   | gather | 143.719 us |       1.52% | 145.386 us |       1.62% |     1.667 us |   1.16% |   SAME   |
|      0      |     32      |   262144   | gather | 146.761 us |       1.20% | 148.053 us |       1.70% |     1.292 us |   0.88% |   SAME   |
|      0      |     64      |   262144   | gather | 174.832 us |       0.96% | 176.190 us |       1.47% |     1.358 us |   0.78% |   SAME   |
|      0      |     128     |   262144   | gather | 176.751 us |       0.87% | 178.364 us |       1.30% |     1.613 us |   0.91% |   SLOW   |
|      0      |     256     |   262144   | gather | 179.043 us |       0.96% | 179.119 us |       1.18% |     0.077 us |   0.04% |   SAME   |
|      0      |     32      |   524288   | gather | 179.092 us |       0.95% | 170.930 us |       1.05% |    -8.162 us |  -4.56% |   FAST   |
|      0      |     64      |   524288   | gather | 233.622 us |       0.93% | 206.112 us |       1.89% |   -27.510 us | -11.78% |   FAST   |
|      0      |     128     |   524288   | gather | 234.615 us |       0.70% | 237.850 us |       0.78% |     3.235 us |   1.38% |   SLOW   |
|      0      |     256     |   524288   | gather | 239.008 us |       0.66% | 240.098 us |       0.79% |     1.090 us |   0.46% |   SAME   |
|      0      |     32      |  1048576   | gather | 248.319 us |       1.36% | 232.996 us |       1.71% |   -15.324 us |  -6.17% |   FAST   |
|      0      |     64      |  1048576   | gather | 355.450 us |       0.61% | 295.545 us |       1.23% |   -59.905 us | -16.85% |   FAST   |
|      0      |     128     |  1048576   | gather | 350.025 us |       0.90% | 352.695 us |       0.88% |     2.670 us |   0.76% |   SAME   |
|      0      |     256     |  1048576   | gather | 352.820 us |       0.50% | 356.689 us |       0.56% |     3.869 us |   1.10% |   SLOW   |
|      0      |     32      |  2097152   | gather | 396.594 us |       0.75% | 347.368 us |       0.70% |   -49.227 us | -12.41% |   FAST   |
|      0      |     64      |  2097152   | gather | 609.615 us |       0.45% | 468.495 us |       0.60% |  -141.120 us | -23.15% |   FAST   |
|      0      |     128     |  2097152   | gather | 596.088 us |       0.25% | 598.296 us |       0.38% |     2.208 us |   0.37% |   SLOW   |
|      0      |     256     |  2097152   | gather | 603.428 us |       0.52% | 607.047 us |       0.42% |     3.619 us |   0.60% |   SLOW   |
|      0      |     32      |  4194304   | gather | 671.909 us |       0.41% | 566.581 us |       0.66% |  -105.328 us | -15.68% |   FAST   |
|      0      |     64      |  4194304   | gather |   1.087 ms |       0.19% | 781.951 us |       0.57% |  -305.371 us | -28.08% |   FAST   |
|      0      |     128     |  4194304   | gather |   1.063 ms |       0.25% |   1.062 ms |       0.36% |    -0.776 us |  -0.07% |   SAME   |
|      0      |     256     |  4194304   | gather |   1.077 ms |       0.32% |   1.073 ms |       0.30% |    -3.349 us |  -0.31% |   FAST   |
|      0      |     32      |  8388608   | gather |   1.240 ms |       0.29% | 998.687 us |       0.41% |  -240.828 us | -19.43% |   FAST   |
|      0      |     64      |  8388608   | gather |   2.066 ms |       0.16% |   1.404 ms |       0.29% |  -661.855 us | -32.04% |   FAST   |
|      0      |     128     |  8388608   | gather |   2.018 ms |       0.50% |   2.006 ms |       0.19% |   -11.918 us |  -0.59% |   FAST   |
|      0      |     256     |  8388608   | gather |   2.039 ms |       0.18% |   2.027 ms |       0.23% |   -11.670 us |  -0.57% |   FAST   |
|      0      |     32      |  16777216  | gather |   2.361 ms |       0.09% |   1.838 ms |       0.21% |  -523.242 us | -22.16% |   FAST   |
|      0      |     64      |  16777216  | gather |   4.011 ms |       0.06% |   2.655 ms |       0.09% | -1356.147 us | -33.81% |   FAST   |
|      0      |     128     |  16777216  | gather |   3.897 ms |       0.09% |   3.879 ms |       0.05% |   -18.658 us |  -0.48% |   FAST   |
|      0      |     256     |  16777216  | gather |   3.930 ms |       0.22% |   4.012 ms |       1.03% |    81.861 us |   2.08% |   SLOW   |
|      0      |     32      |  33554432  | gather |   4.571 ms |       0.06% |   3.530 ms |       0.32% | -1040.743 us | -22.77% |   FAST   |
|      0      |     64      |  33554432  | gather |   7.860 ms |       0.14% |   5.141 ms |       0.06% | -2718.947 us | -34.59% |   FAST   |
|      0      |     128     |  33554432  | gather |   7.657 ms |       0.16% |   7.668 ms |       0.03% |    11.320 us |   0.15% |   SLOW   |
|      0      |     256     |  33554432  | gather |   8.320 ms |       0.75% |   8.398 ms |       0.78% |    77.301 us |   0.93% |   SLOW   |

# Summary

- Total Matches: 44
  - Pass    (diff <= min_noise): 17
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  27

@tgujar tgujar marked this pull request as ready for review November 18, 2025 23:57
@tgujar tgujar requested a review from a team as a code owner November 18, 2025 23:57
@tgujar tgujar requested review from davidwendt and lamarrr November 18, 2025 23:57
Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for working on this.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 20, 2025
@davidwendt
Copy link
Contributor

/ok to test 95cee21

@davidwendt
Copy link
Contributor

/ok to test bbc702e

@davidwendt
Copy link
Contributor

The failed wheel-tests-cudf-polars-with-rapidsmpf runners are not required to merge this.

@davidwendt
Copy link
Contributor

/ok to test dcd49e6

@davidwendt
Copy link
Contributor

/ok to test a4b6cd1

@davidwendt davidwendt removed their assignment Dec 1, 2025
@davidwendt
Copy link
Contributor

/ok to test c8095f3

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor requests, otherwise LGTM.

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the cudf::detail::util::div_rounding_up_safe utility from cpp/include/detail/utilities/integer_utils.hpp to compute the grid_size.

@davidwendt
Copy link
Contributor

/ok to test 2a2ec83

@davidwendt
Copy link
Contributor

/ok to test 2fc013d

@davidwendt
Copy link
Contributor

/ok to test 86d6865

@davidwendt
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 918189f into rapidsai:main Dec 4, 2025
266 of 269 checks passed
@davidwendt davidwendt removed their assignment Dec 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants