@junjzhang junjzhang commented Oct 15, 2025

Motivation

A possible way to remove record_stream in normal mode: hold references to the tensors instead. See issue #455.
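The general idea of holding references can be sketched in plain Python. This is not the code in this PR: `HeldBuffers`, `hold`, `release_finished`, and `FakeEvent` are hypothetical names used only for illustration. The sketch assumes an event object with a `query()` method that returns True once the recorded work has finished (`torch.cuda.Event` provides one); a stand-in event is used here so the example runs without a GPU.

```python
from collections import deque


class HeldBuffers:
    """Keep buffers alive until the event recorded after their last use completes."""

    def __init__(self):
        self._pending = deque()  # (event, buffers) pairs in submission order

    def hold(self, event, *buffers):
        # Storing the buffers keeps them referenced, so their memory is not
        # returned to the allocator while the recorded work may still use it.
        self._pending.append((event, buffers))

    def release_finished(self):
        # Drop references, oldest first, for every event that has completed.
        # Returns the number of entries released.
        released = 0
        while self._pending and self._pending[0][0].query():
            self._pending.popleft()
            released += 1
        return released

    def __len__(self):
        return len(self._pending)


class FakeEvent:
    """Hypothetical stand-in for a CUDA event (assumption: only query() is needed)."""

    def __init__(self):
        self.done = False

    def query(self):
        return self.done


held = HeldBuffers()
event = FakeEvent()
held.hold(event, bytearray(16))
held.release_finished()  # event still pending, nothing is released
event.done = True
held.release_finished()  # event finished, the reference is dropped
```

Releasing strictly in submission order is conservative: an entry behind an unfinished event stays held a little longer, which costs memory but never frees a buffer early.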

Test results

Tested on a downstream training task; no problems encountered.
Tested on 8 nodes; all tests passed.

[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed

[tuning] SMs 24, NVL chunk 4, RDMA chunk 4, transmit: 2447.00 us, notify: 1994.00 us, BW: 49.24 GB/s (RDMA), 91.70 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8, transmit: 2336.00 us, notify: 677.52 us, BW: 51.58 GB/s (RDMA), 96.06 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12, transmit: 2339.00 us, notify: 658.60 us, BW: 51.51 GB/s (RDMA), 95.94 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16, transmit: 2343.00 us, notify: 895.24 us, BW: 51.43 GB/s (RDMA), 95.77 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20, transmit: 2357.00 us, notify: 989.53 us, BW: 51.12 GB/s (RDMA), 95.21 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24, transmit: 2377.00 us, notify: 899.59 us, BW: 50.69 GB/s (RDMA), 94.40 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28, transmit: 2398.00 us, notify: 919.23 us, BW: 50.25 GB/s (RDMA), 93.58 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32, transmit: 2428.00 us, notify: 876.37 us, BW: 49.63 GB/s (RDMA), 92.42 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4, transmit: 2446.00 us, notify: 428.26 us, BW: 49.26 GB/s (RDMA), 91.74 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8, transmit: 2335.00 us, notify: 646.25 us, BW: 51.60 GB/s (RDMA), 96.10 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12, transmit: 2327.00 us, notify: 751.40 us, BW: 51.78 GB/s (RDMA), 96.43 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16, transmit: 2337.00 us, notify: 999.82 us, BW: 51.56 GB/s (RDMA), 96.02 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20, transmit: 2348.00 us, notify: 944.60 us, BW: 51.32 GB/s (RDMA), 95.57 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24, transmit: 2367.00 us, notify: 832.81 us, BW: 50.90 GB/s (RDMA), 94.80 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28, transmit: 2389.00 us, notify: 933.12 us, BW: 50.44 GB/s (RDMA), 93.93 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32, transmit: 2404.00 us, notify: 913.03 us, BW: 50.12 GB/s (RDMA), 93.34 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4, transmit: 2439.00 us, notify: 318.43 us, BW: 49.40 GB/s (RDMA), 92.00 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8, transmit: 2334.00 us, notify: 615.93 us, BW: 51.62 GB/s (RDMA), 96.14 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12, transmit: 2326.00 us, notify: 779.14 us, BW: 51.80 GB/s (RDMA), 96.47 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16, transmit: 2338.00 us, notify: 961.88 us, BW: 51.54 GB/s (RDMA), 95.98 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20, transmit: 2344.00 us, notify: 973.96 us, BW: 51.40 GB/s (RDMA), 95.73 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24, transmit: 2360.00 us, notify: 910.55 us, BW: 51.05 GB/s (RDMA), 95.08 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28, transmit: 2381.00 us, notify: 920.34 us, BW: 50.60 GB/s (RDMA), 94.25 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32, transmit: 2400.00 us, notify: 819.04 us, BW: 50.20 GB/s (RDMA), 93.50 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4, transmit: 2435.00 us, notify: 308.75 us, BW: 49.48 GB/s (RDMA), 92.16 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8, transmit: 2332.00 us, notify: 601.73 us, BW: 51.67 GB/s (RDMA), 96.23 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12, transmit: 2329.00 us, notify: 711.45 us, BW: 51.73 GB/s (RDMA), 96.35 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16, transmit: 2335.00 us, notify: 946.27 us, BW: 51.60 GB/s (RDMA), 96.10 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20, transmit: 2347.00 us, notify: 923.43 us, BW: 51.34 GB/s (RDMA), 95.61 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24, transmit: 2361.00 us, notify: 815.65 us, BW: 51.03 GB/s (RDMA), 95.04 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28, transmit: 2380.00 us, notify: 868.54 us, BW: 50.63 GB/s (RDMA), 94.29 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32, transmit: 2396.00 us, notify: 896.06 us, BW: 50.29 GB/s (RDMA), 93.66 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4, transmit: 2443.00 us, notify: 315.88 us, BW: 49.32 GB/s (RDMA), 91.85 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8, transmit: 2337.00 us, notify: 633.58 us, BW: 51.56 GB/s (RDMA), 96.02 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12, transmit: 2327.00 us, notify: 744.71 us, BW: 51.78 GB/s (RDMA), 96.43 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16, transmit: 2332.00 us, notify: 980.10 us, BW: 51.67 GB/s (RDMA), 96.23 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20, transmit: 2346.00 us, notify: 972.22 us, BW: 51.36 GB/s (RDMA), 95.65 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24, transmit: 2360.00 us, notify: 820.88 us, BW: 51.05 GB/s (RDMA), 95.08 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28, transmit: 2382.00 us, notify: 872.97 us, BW: 50.58 GB/s (RDMA), 94.21 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32, transmit: 2397.00 us, notify: 906.19 us, BW: 50.27 GB/s (RDMA), 93.62 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4, transmit: 2440.00 us, notify: 294.39 us, BW: 49.38 GB/s (RDMA), 91.97 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8, transmit: 2330.00 us, notify: 606.91 us, BW: 51.71 GB/s (RDMA), 96.31 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12, transmit: 2382.00 us, notify: 775.80 us, BW: 50.58 GB/s (RDMA), 94.21 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16, transmit: 2338.00 us, notify: 952.89 us, BW: 51.54 GB/s (RDMA), 95.98 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20, transmit: 2344.00 us, notify: 979.52 us, BW: 51.40 GB/s (RDMA), 95.73 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24, transmit: 2356.00 us, notify: 829.35 us, BW: 51.14 GB/s (RDMA), 95.25 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28, transmit: 2383.00 us, notify: 886.39 us, BW: 50.56 GB/s (RDMA), 94.17 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32, transmit: 2396.00 us, notify: 913.22 us, BW: 50.29 GB/s (RDMA), 93.66 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4, transmit: 2446.00 us, notify: 300.67 us, BW: 49.26 GB/s (RDMA), 91.74 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8, transmit: 2333.00 us, notify: 636.02 us, BW: 51.65 GB/s (RDMA), 96.18 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12, transmit: 2326.00 us, notify: 756.02 us, BW: 51.80 GB/s (RDMA), 96.47 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16, transmit: 2334.00 us, notify: 954.54 us, BW: 51.62 GB/s (RDMA), 96.14 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20, transmit: 2343.00 us, notify: 967.82 us, BW: 51.43 GB/s (RDMA), 95.77 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24, transmit: 2357.00 us, notify: 848.03 us, BW: 51.12 GB/s (RDMA), 95.21 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28, transmit: 2375.00 us, notify: 1029.00 us, BW: 50.73 GB/s (RDMA), 94.48 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32, transmit: 2395.00 us, notify: 917.50 us, BW: 50.31 GB/s (RDMA), 93.69 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4, transmit: 2444.00 us, notify: 292.90 us, BW: 49.30 GB/s (RDMA), 91.82 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8, transmit: 2335.00 us, notify: 629.24 us, BW: 51.60 GB/s (RDMA), 96.10 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12, transmit: 2328.00 us, notify: 766.76 us, BW: 51.76 GB/s (RDMA), 96.39 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16, transmit: 2332.00 us, notify: 944.27 us, BW: 51.67 GB/s (RDMA), 96.23 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20, transmit: 2341.00 us, notify: 989.01 us, BW: 51.47 GB/s (RDMA), 95.86 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24, transmit: 2360.00 us, notify: 879.64 us, BW: 51.05 GB/s (RDMA), 95.08 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28, transmit: 2376.00 us, notify: 966.13 us, BW: 50.71 GB/s (RDMA), 94.44 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32, transmit: 2403.00 us, notify: 916.29 us, BW: 50.14 GB/s (RDMA), 93.38 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 4, transmit: 2443.00 us, notify: 286.33 us, BW: 49.32 GB/s (RDMA), 91.85 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 8, transmit: 2335.00 us, notify: 624.65 us, BW: 51.60 GB/s (RDMA), 96.10 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 12, transmit: 2327.00 us, notify: 732.50 us, BW: 51.78 GB/s (RDMA), 96.43 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 16, transmit: 2338.00 us, notify: 928.79 us, BW: 51.54 GB/s (RDMA), 95.98 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 20, transmit: 2344.00 us, notify: 967.01 us, BW: 51.40 GB/s (RDMA), 95.73 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 24, transmit: 2357.00 us, notify: 912.57 us, BW: 51.12 GB/s (RDMA), 95.21 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 28, transmit: 2378.00 us, notify: 982.52 us, BW: 50.67 GB/s (RDMA), 94.36 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 32, transmit: 2402.00 us, notify: 884.01 us, BW: 50.16 GB/s (RDMA), 93.42 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 4, transmit: 2436.00 us, notify: 357.19 us, BW: 49.46 GB/s (RDMA), 92.12 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 8, transmit: 2336.00 us, notify: 689.41 us, BW: 51.58 GB/s (RDMA), 96.06 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 12, transmit: 2327.00 us, notify: 770.62 us, BW: 51.78 GB/s (RDMA), 96.43 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 16, transmit: 2338.00 us, notify: 943.69 us, BW: 51.54 GB/s (RDMA), 95.98 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 20, transmit: 2347.00 us, notify: 974.51 us, BW: 51.34 GB/s (RDMA), 95.61 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 24, transmit: 2355.00 us, notify: 866.46 us, BW: 51.16 GB/s (RDMA), 95.29 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 28, transmit: 2378.00 us, notify: 929.25 us, BW: 50.67 GB/s (RDMA), 94.36 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 32, transmit: 2401.00 us, notify: 902.59 us, BW: 50.18 GB/s (RDMA), 93.46 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 4, transmit: 2447.00 us, notify: 361.12 us, BW: 49.24 GB/s (RDMA), 91.70 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 8, transmit: 2331.00 us, notify: 672.83 us, BW: 51.69 GB/s (RDMA), 96.27 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 12, transmit: 2327.00 us, notify: 710.19 us, BW: 51.78 GB/s (RDMA), 96.43 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 16, transmit: 2336.00 us, notify: 866.34 us, BW: 51.58 GB/s (RDMA), 96.06 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 20, transmit: 2347.00 us, notify: 964.72 us, BW: 51.34 GB/s (RDMA), 95.61 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 24, transmit: 2356.00 us, notify: 893.31 us, BW: 51.14 GB/s (RDMA), 95.25 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 28, transmit: 2383.00 us, notify: 944.42 us, BW: 50.56 GB/s (RDMA), 94.17 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 32, transmit: 2400.00 us, notify: 957.04 us, BW: 50.20 GB/s (RDMA), 93.50 GB/s (NVL) 
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 12, RDMA chunk 12, transmit: 2326.00 us, notify: 779.14 us, BW: 51.80 GB/s (RDMA), 96.47 GB/s (NVL)

[tuning] SMs 24, NVL chunk 4, RDMA chunk 4, transmit: 4488.00 us, notify: 1193.00 us, BW: 52.07 GB/s (RDMA), 96.97 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8, transmit: 4484.00 us, notify: 1884.00 us, BW: 52.11 GB/s (RDMA), 97.06 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12, transmit: 4470.00 us, notify: 5016.00 us, BW: 52.28 GB/s (RDMA), 97.36 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16, transmit: 30180.00 us, notify: 20559.00 us, BW: 7.74 GB/s (RDMA), 14.42 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20, transmit: 23556.00 us, notify: 41167.00 us, BW: 9.92 GB/s (RDMA), 18.48 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24, transmit: 18894.00 us, notify: 55680.00 us, BW: 12.37 GB/s (RDMA), 23.03 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28, transmit: 10955.00 us, notify: 62392.00 us, BW: 21.33 GB/s (RDMA), 39.73 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32, transmit: 6721.00 us, notify: 73571.00 us, BW: 34.77 GB/s (RDMA), 64.75 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 4, transmit: 4479.00 us, notify: 1135.00 us, BW: 52.17 GB/s (RDMA), 97.16 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 8, transmit: 4481.00 us, notify: 1876.00 us, BW: 52.15 GB/s (RDMA), 97.12 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 12, transmit: 4476.00 us, notify: 1996.00 us, BW: 52.21 GB/s (RDMA), 97.23 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 16, transmit: 4845.00 us, notify: 25621.00 us, BW: 48.23 GB/s (RDMA), 89.82 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 20, transmit: 9222.00 us, notify: 41026.00 us, BW: 25.34 GB/s (RDMA), 47.19 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 24, transmit: 5594.00 us, notify: 56775.00 us, BW: 41.77 GB/s (RDMA), 77.80 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 28, transmit: 7591.00 us, notify: 40542.00 us, BW: 30.78 GB/s (RDMA), 57.33 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 8, RDMA chunk 32, transmit: 6151.00 us, notify: 59678.00 us, BW: 37.99 GB/s (RDMA), 70.75 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 4, transmit: 4490.00 us, notify: 1114.00 us, BW: 52.04 GB/s (RDMA), 96.93 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 8, transmit: 4486.00 us, notify: 1854.00 us, BW: 52.09 GB/s (RDMA), 97.01 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 12, transmit: 4553.00 us, notify: 2653.00 us, BW: 51.32 GB/s (RDMA), 95.58 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 16, transmit: 5091.00 us, notify: 28888.00 us, BW: 45.90 GB/s (RDMA), 85.48 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 20, transmit: 5379.00 us, notify: 29969.00 us, BW: 43.44 GB/s (RDMA), 80.91 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 24, transmit: 5604.00 us, notify: 57656.00 us, BW: 41.70 GB/s (RDMA), 77.66 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 28, transmit: 5759.00 us, notify: 40969.00 us, BW: 40.58 GB/s (RDMA), 75.57 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 12, RDMA chunk 32, transmit: 5769.00 us, notify: 46458.00 us, BW: 40.51 GB/s (RDMA), 75.44 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 4, transmit: 4491.00 us, notify: 1155.00 us, BW: 52.03 GB/s (RDMA), 96.90 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 8, transmit: 4490.00 us, notify: 1827.00 us, BW: 52.04 GB/s (RDMA), 96.93 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 12, transmit: 4476.00 us, notify: 1944.00 us, BW: 52.21 GB/s (RDMA), 97.23 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 16, transmit: 5261.00 us, notify: 22370.00 us, BW: 44.42 GB/s (RDMA), 82.72 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 20, transmit: 5183.00 us, notify: 20125.00 us, BW: 45.09 GB/s (RDMA), 83.97 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 24, transmit: 5808.00 us, notify: 41073.00 us, BW: 40.23 GB/s (RDMA), 74.93 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 28, transmit: 5831.00 us, notify: 24971.00 us, BW: 40.07 GB/s (RDMA), 74.64 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 16, RDMA chunk 32, transmit: 5839.00 us, notify: 44763.00 us, BW: 40.02 GB/s (RDMA), 74.53 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 4, transmit: 4492.00 us, notify: 1147.00 us, BW: 52.02 GB/s (RDMA), 96.88 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 8, transmit: 4483.00 us, notify: 1882.00 us, BW: 52.13 GB/s (RDMA), 97.08 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 12, transmit: 4468.00 us, notify: 2183.00 us, BW: 52.30 GB/s (RDMA), 97.40 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 16, transmit: 6100.00 us, notify: 22332.00 us, BW: 38.31 GB/s (RDMA), 71.34 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 20, transmit: 5698.00 us, notify: 30857.00 us, BW: 41.01 GB/s (RDMA), 76.38 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 24, transmit: 5681.00 us, notify: 32847.00 us, BW: 41.13 GB/s (RDMA), 76.61 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 28, transmit: 6789.00 us, notify: 19764.00 us, BW: 34.42 GB/s (RDMA), 64.10 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 20, RDMA chunk 32, transmit: 6262.00 us, notify: 33909.00 us, BW: 37.32 GB/s (RDMA), 69.50 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 4, transmit: 4487.00 us, notify: 1140.00 us, BW: 52.08 GB/s (RDMA), 96.99 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 8, transmit: 4487.00 us, notify: 1804.00 us, BW: 52.08 GB/s (RDMA), 96.99 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 12, transmit: 4560.00 us, notify: 1890.00 us, BW: 51.24 GB/s (RDMA), 95.44 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 16, transmit: 5213.00 us, notify: 17296.00 us, BW: 44.83 GB/s (RDMA), 83.48 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 20, transmit: 5446.00 us, notify: 17394.00 us, BW: 42.91 GB/s (RDMA), 79.91 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 24, transmit: 6398.00 us, notify: 28617.00 us, BW: 36.52 GB/s (RDMA), 68.02 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 28, transmit: 5749.00 us, notify: 9062.00 us, BW: 40.65 GB/s (RDMA), 75.70 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 24, RDMA chunk 32, transmit: 5908.00 us, notify: 23289.00 us, BW: 39.55 GB/s (RDMA), 73.66 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 4, transmit: 4485.00 us, notify: 1142.00 us, BW: 52.10 GB/s (RDMA), 97.03 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 8, transmit: 4488.00 us, notify: 1796.00 us, BW: 52.07 GB/s (RDMA), 96.97 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 12, transmit: 4469.00 us, notify: 1943.00 us, BW: 52.29 GB/s (RDMA), 97.38 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 16, transmit: 5193.00 us, notify: 12090.00 us, BW: 45.00 GB/s (RDMA), 83.80 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 20, transmit: 5436.00 us, notify: 11975.00 us, BW: 42.99 GB/s (RDMA), 80.06 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 24, transmit: 5700.00 us, notify: 22701.00 us, BW: 41.00 GB/s (RDMA), 76.35 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 28, transmit: 5801.00 us, notify: 10570.00 us, BW: 40.28 GB/s (RDMA), 75.02 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 28, RDMA chunk 32, transmit: 5847.00 us, notify: 13908.00 us, BW: 39.97 GB/s (RDMA), 74.43 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 4, transmit: 4485.00 us, notify: 1130.00 us, BW: 52.10 GB/s (RDMA), 97.03 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 8, transmit: 4481.00 us, notify: 1811.00 us, BW: 52.15 GB/s (RDMA), 97.12 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 12, transmit: 4466.00 us, notify: 1993.00 us, BW: 52.32 GB/s (RDMA), 97.45 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 16, transmit: 5549.00 us, notify: 10151.00 us, BW: 42.11 GB/s (RDMA), 78.43 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 20, transmit: 5326.00 us, notify: 8242.00 us, BW: 43.87 GB/s (RDMA), 81.71 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 24, transmit: 6027.00 us, notify: 17828.00 us, BW: 38.77 GB/s (RDMA), 72.21 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 28, transmit: 6265.00 us, notify: 5787.00 us, BW: 37.30 GB/s (RDMA), 69.46 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 32, RDMA chunk 32, transmit: 5896.00 us, notify: 14804.00 us, BW: 39.63 GB/s (RDMA), 73.81 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 4, transmit: 4491.00 us, notify: 1121.00 us, BW: 52.03 GB/s (RDMA), 96.90 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 8, transmit: 4491.00 us, notify: 1794.00 us, BW: 52.03 GB/s (RDMA), 96.90 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 12, transmit: 4469.00 us, notify: 1983.00 us, BW: 52.29 GB/s (RDMA), 97.38 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 16, transmit: 5303.00 us, notify: 9594.00 us, BW: 44.07 GB/s (RDMA), 82.07 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 20, transmit: 5758.00 us, notify: 12371.00 us, BW: 40.58 GB/s (RDMA), 75.58 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 24, transmit: 5762.00 us, notify: 14330.00 us, BW: 40.55 GB/s (RDMA), 75.53 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 28, transmit: 5914.00 us, notify: 6499.00 us, BW: 39.51 GB/s (RDMA), 73.59 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 36, RDMA chunk 32, transmit: 5840.00 us, notify: 10584.00 us, BW: 40.01 GB/s (RDMA), 74.52 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 4, transmit: 4489.00 us, notify: 1130.00 us, BW: 52.06 GB/s (RDMA), 96.95 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 8, transmit: 4490.00 us, notify: 1854.00 us, BW: 52.04 GB/s (RDMA), 96.93 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 12, transmit: 4474.00 us, notify: 1940.00 us, BW: 52.23 GB/s (RDMA), 97.27 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 16, transmit: 4871.00 us, notify: 8777.00 us, BW: 47.97 GB/s (RDMA), 89.34 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 20, transmit: 5407.00 us, notify: 3117.00 us, BW: 43.22 GB/s (RDMA), 80.49 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 24, transmit: 5619.00 us, notify: 12994.00 us, BW: 41.59 GB/s (RDMA), 77.45 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 28, transmit: 5537.00 us, notify: 2889.00 us, BW: 42.20 GB/s (RDMA), 78.60 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 40, RDMA chunk 32, transmit: 5886.00 us, notify: 9079.00 us, BW: 39.70 GB/s (RDMA), 73.94 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 4, transmit: 4488.00 us, notify: 1131.00 us, BW: 52.07 GB/s (RDMA), 96.97 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 8, transmit: 4484.00 us, notify: 1840.00 us, BW: 52.11 GB/s (RDMA), 97.06 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 12, transmit: 4531.00 us, notify: 1940.00 us, BW: 51.57 GB/s (RDMA), 96.05 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 16, transmit: 5233.00 us, notify: 9009.00 us, BW: 44.65 GB/s (RDMA), 83.16 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 20, transmit: 5412.00 us, notify: 6282.00 us, BW: 43.18 GB/s (RDMA), 80.41 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 24, transmit: 5853.00 us, notify: 11069.00 us, BW: 39.92 GB/s (RDMA), 74.35 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 28, transmit: 5663.00 us, notify: 3839.00 us, BW: 41.26 GB/s (RDMA), 76.85 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 44, RDMA chunk 32, transmit: 5866.00 us, notify: 16774.00 us, BW: 39.84 GB/s (RDMA), 74.19 GB/s (NVL) 
[tuning] Best dispatch (BF16): SMs 24, NVL chunk 32, RDMA chunk 12, transmit: 4466.00 us, notify: 1993.00 us, BW: 52.32 GB/s (RDMA), 97.45 GB/s (NVL)

[tuning] SMs 24, NVL chunk 1, RDMA chunk 8, transmit: 4655.00 us, notify: 1395.00 us, BW: 50.20 GB/s (RDMA), 93.49 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 1, RDMA chunk 12, transmit: 4714.00 us, notify: 1691.00 us, BW: 49.57 GB/s (RDMA), 92.32 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 1, RDMA chunk 16, transmit: 4730.00 us, notify: 1644.00 us, BW: 49.40 GB/s (RDMA), 92.01 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 1, RDMA chunk 20, transmit: 4800.00 us, notify: 1688.00 us, BW: 48.68 GB/s (RDMA), 90.67 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 1, RDMA chunk 24, transmit: 4895.00 us, notify: 1647.00 us, BW: 47.74 GB/s (RDMA), 88.91 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 1, RDMA chunk 28, transmit: 4968.00 us, notify: 2011.00 us, BW: 47.04 GB/s (RDMA), 87.60 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 1, RDMA chunk 32, transmit: 5151.00 us, notify: 6059.00 us, BW: 45.37 GB/s (RDMA), 84.49 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 2, RDMA chunk 8, transmit: 4691.00 us, notify: 1446.00 us, BW: 49.81 GB/s (RDMA), 92.77 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 2, RDMA chunk 12, transmit: 4745.00 us, notify: 1604.00 us, BW: 49.25 GB/s (RDMA), 91.72 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 2, RDMA chunk 16, transmit: 4755.00 us, notify: 1746.00 us, BW: 49.14 GB/s (RDMA), 91.52 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 2, RDMA chunk 20, transmit: 4838.00 us, notify: 1680.00 us, BW: 48.30 GB/s (RDMA), 89.95 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 2, RDMA chunk 24, transmit: 4925.00 us, notify: 1840.00 us, BW: 47.45 GB/s (RDMA), 88.37 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 2, RDMA chunk 28, transmit: 4993.00 us, notify: 3131.00 us, BW: 46.80 GB/s (RDMA), 87.16 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 2, RDMA chunk 32, transmit: 5034.00 us, notify: 6706.00 us, BW: 46.42 GB/s (RDMA), 86.45 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 3, RDMA chunk 8, transmit: 4750.00 us, notify: 1343.00 us, BW: 49.20 GB/s (RDMA), 91.62 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 3, RDMA chunk 12, transmit: 4823.00 us, notify: 1552.00 us, BW: 48.45 GB/s (RDMA), 90.23 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 3, RDMA chunk 16, transmit: 4793.00 us, notify: 1626.00 us, BW: 48.75 GB/s (RDMA), 90.80 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 3, RDMA chunk 20, transmit: 4888.00 us, notify: 1623.00 us, BW: 47.81 GB/s (RDMA), 89.03 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 3, RDMA chunk 24, transmit: 4946.00 us, notify: 1502.00 us, BW: 47.25 GB/s (RDMA), 87.99 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 3, RDMA chunk 28, transmit: 5019.00 us, notify: 1826.00 us, BW: 46.56 GB/s (RDMA), 86.71 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 3, RDMA chunk 32, transmit: 5191.00 us, notify: 7244.00 us, BW: 45.02 GB/s (RDMA), 83.84 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 8, transmit: 4743.00 us, notify: 1483.00 us, BW: 49.27 GB/s (RDMA), 91.76 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 12, transmit: 4801.00 us, notify: 1788.00 us, BW: 48.67 GB/s (RDMA), 90.65 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 16, transmit: 4881.00 us, notify: 1681.00 us, BW: 47.87 GB/s (RDMA), 89.16 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 20, transmit: 4871.00 us, notify: 1716.00 us, BW: 47.97 GB/s (RDMA), 89.34 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 24, transmit: 4992.00 us, notify: 1542.00 us, BW: 46.81 GB/s (RDMA), 87.18 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 28, transmit: 5070.00 us, notify: 1819.00 us, BW: 46.09 GB/s (RDMA), 85.84 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 4, RDMA chunk 32, transmit: 5135.00 us, notify: 2063.00 us, BW: 45.51 GB/s (RDMA), 84.75 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 5, RDMA chunk 8, transmit: 4824.00 us, notify: 1526.00 us, BW: 48.44 GB/s (RDMA), 90.22 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 5, RDMA chunk 12, transmit: 4816.00 us, notify: 1359.00 us, BW: 48.52 GB/s (RDMA), 90.37 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 5, RDMA chunk 16, transmit: 4937.00 us, notify: 1597.00 us, BW: 47.33 GB/s (RDMA), 88.15 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 5, RDMA chunk 20, transmit: 4943.00 us, notify: 1700.00 us, BW: 47.27 GB/s (RDMA), 88.04 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 5, RDMA chunk 24, transmit: 4990.00 us, notify: 1815.00 us, BW: 46.83 GB/s (RDMA), 87.21 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 5, RDMA chunk 28, transmit: 5088.00 us, notify: 1960.00 us, BW: 45.93 GB/s (RDMA), 85.53 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 5, RDMA chunk 32, transmit: 5232.00 us, notify: 5364.00 us, BW: 44.66 GB/s (RDMA), 83.18 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 6, RDMA chunk 8, transmit: 4866.00 us, notify: 1636.00 us, BW: 48.02 GB/s (RDMA), 89.44 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 6, RDMA chunk 12, transmit: 4911.00 us, notify: 1478.00 us, BW: 47.58 GB/s (RDMA), 88.62 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 6, RDMA chunk 16, transmit: 4864.00 us, notify: 1553.00 us, BW: 48.04 GB/s (RDMA), 89.47 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 6, RDMA chunk 20, transmit: 4988.00 us, notify: 1658.00 us, BW: 46.85 GB/s (RDMA), 87.25 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 6, RDMA chunk 24, transmit: 5068.00 us, notify: 2561.00 us, BW: 46.11 GB/s (RDMA), 85.87 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 6, RDMA chunk 28, transmit: 5273.00 us, notify: 3051.00 us, BW: 44.32 GB/s (RDMA), 82.53 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 6, RDMA chunk 32, transmit: 5165.00 us, notify: 1862.00 us, BW: 45.24 GB/s (RDMA), 84.26 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 7, RDMA chunk 8, transmit: 4917.00 us, notify: 1648.00 us, BW: 47.52 GB/s (RDMA), 88.51 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 7, RDMA chunk 12, transmit: 4961.00 us, notify: 1617.00 us, BW: 47.10 GB/s (RDMA), 87.72 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 7, RDMA chunk 16, transmit: 4955.00 us, notify: 1527.00 us, BW: 47.16 GB/s (RDMA), 87.83 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 7, RDMA chunk 20, transmit: 5005.00 us, notify: 1530.00 us, BW: 46.69 GB/s (RDMA), 86.95 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 7, RDMA chunk 24, transmit: 5098.00 us, notify: 3796.00 us, BW: 45.84 GB/s (RDMA), 85.37 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 7, RDMA chunk 28, transmit: 5166.00 us, notify: 1930.00 us, BW: 45.23 GB/s (RDMA), 84.24 GB/s (NVL) 
[tuning] SMs 24, NVL chunk 7, RDMA chunk 32, transmit: 5214.00 us, notify: 2544.00 us, BW: 44.82 GB/s (RDMA), 83.47 GB/s (NVL) 
[tuning] Best combine: SMs 24, NVL chunk 1, RDMA chunk 8, transmit: 4655.00 us, notify: 1395.00 us, BW: 50.20 GB/s (RDMA), 93.49 GB/s (NVL)
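For anyone comparing runs, a small parser can pull the fastest configuration out of a tuning log like the one above. This is a sketch, not the benchmark's own selection code: `best_config` is a hypothetical helper, and picking the lowest transmit time (first occurrence on ties) is an assumption that happens to agree with the three "Best" lines shown here. The sample lines are copied from the combine log above.

```python
import re

# Matches the per-configuration fields of a "[tuning]" line.
LINE = re.compile(
    r"SMs (?P<sms>\d+), NVL chunk (?P<nvl>\d+), RDMA chunk (?P<rdma>\d+), "
    r"transmit: (?P<transmit>[\d.]+) us"
)


def best_config(lines):
    """Return (transmit_us, sms, nvl_chunk, rdma_chunk) with the lowest transmit time."""
    best = None
    for line in lines:
        if "Best" in line:
            continue  # skip summary lines, they repeat an earlier entry
        m = LINE.search(line)
        if m is None:
            continue  # not a tuning entry
        row = (float(m["transmit"]), int(m["sms"]), int(m["nvl"]), int(m["rdma"]))
        # Strict comparison keeps the first entry when transmit times tie.
        if best is None or row[0] < best[0]:
            best = row
    return best


sample = [
    "[tuning] SMs 24, NVL chunk 1, RDMA chunk 8, transmit: 4655.00 us, notify: 1395.00 us, BW: 50.20 GB/s (RDMA), 93.49 GB/s (NVL)",
    "[tuning] SMs 24, NVL chunk 1, RDMA chunk 12, transmit: 4714.00 us, notify: 1691.00 us, BW: 49.57 GB/s (RDMA), 92.32 GB/s (NVL)",
    "[tuning] SMs 24, NVL chunk 2, RDMA chunk 8, transmit: 4691.00 us, notify: 1446.00 us, BW: 49.81 GB/s (RDMA), 92.77 GB/s (NVL)",
]
print(best_config(sample))  # (4655.0, 24, 1, 8)
```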

@junjzhang junjzhang force-pushed the feat/remove_record_stream branch from 51540fa to 10e89ee Compare October 16, 2025 09:28