Add GPU profiles by blegat · Pull Request #74 · blegat/ArrayDiff.jl

blegat · 2026-05-27T21:58:30Z

For Hand-CUDA

Profiler ran for 819.68 µs, capturing 558 events.

Host-side activity: calling CUDA APIs took 318.53 µs (38.86% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                 │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    7.24% │   59.37 µs │     3 │  19.79 µs ± 21.07  (  6.91 ‥ 44.11)   │ cuLaunchKernel                                       │
│    6.98% │   57.22 µs │     8 │   7.15 µs ± 3.22   (  4.29 ‥ 13.83)   │ cuMemcpyHtoDAsync                                    │
│    4.19% │   34.33 µs │     4 │   8.58 µs ± 1.51   (  7.15 ‥ 10.49)   │ cuLaunchKernelEx                                     │
│    3.98% │   32.66 µs │    13 │   2.51 µs ± 0.67   (  1.67 ‥ 4.53)    │ cuMemAllocFromPoolAsync                              │
│    1.75% │   14.31 µs │     1 │                                       │ cudaLaunchKernelExC                                  │
│    1.45% │   11.92 µs │     1 │                                       │ cudaLaunchKernel                                     │
│    0.76% │     6.2 µs │     1 │                                       │ cudaEventRecord                                      │
│    0.73% │    5.96 µs │     8 │ 745.06 ns ± 990.74 (   0.0 ‥ 3099.44) │ cudaGetLastError                                     │
│    0.61% │    5.01 µs │     3 │   1.67 µs ± 0.63   (  1.19 ‥ 2.38)    │ cuKernelGetFunction                                  │
│    0.47% │    3.81 µs │     3 │   1.27 µs ± 1.18   (  0.48 ‥ 2.62)    │ cudaGetDevice                                        │
│    0.44% │    3.58 µs │     3 │   1.19 µs ± 1.04   (  0.48 ‥ 2.38)    │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.26% │    2.15 µs │     1 │                                       │ cuStreamSynchronize                                  │
│    0.26% │    2.15 µs │     3 │ 715.26 ns ± 412.95 (476.84 ‥ 1192.09) │ cudaDeviceGetAttribute                               │
│    0.12% │  953.67 ns │     1 │                                       │ cuKernelGetName                                      │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 128.51 µs (15.68% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│    2.91% │   23.84 µs │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, NDRange<2, DynamicSize, DynamicSize, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>>>, CuDeviceArray<Float32, 2, 1>, Broadcasted<CuArrayStyle<2, DeviceMemory>, Tuple<OneTo<Int64>, OneTo<Int64>>, _, Tuple<Extruded<CuDeviceArray<Float32, 2, 1>, Tuple<Bool, Bool>, Tuple<Int64, Int64>>, Extruded<CuDeviceArray<Float32, 2, 1>, Tuple<Bool, Bool>, Tuple<Int64, I ⋯
│    2.76% │   22.65 µs │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, NDRange<2, DynamicSize, DynamicSize, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>>>, CuDeviceArray<Float32, 2, 1>, Broadcasted<CuArrayStyle<2, DeviceMemory>, Tuple<OneTo<Int64>, OneTo<Int64>>, tanh, Tuple<Extruded<CuDeviceArray<Float32, 2, 1>, Tuple<Bool, Bool>, Tuple<Int64, Int64>>>>)                                                                     ⋯
│    2.62% │   21.46 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1>(cutlass_80_simt_sgemm_32x128_8x5_nn_align1::Params)                                                                                                                                                                                                                                                                                                                                                                                                                                                         ⋯
│    2.50% │    20.5 µs │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, NDRange<2, DynamicSize, DynamicSize, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>>>, CuDeviceArray<Float32, 2, 1>, Broadcasted<CuArrayStyle<2, DeviceMemory>, Tuple<OneTo<Int64>, OneTo<Int64>>, _, Tuple<Int64, Broadcasted<CuArrayStyle<2, DeviceMemory>, void, literal_pow, Tuple<KernelRefValue<_>, Extruded<CuDeviceArray<Float32, 2, 1>, Tuple<Bool, Bool>,  ⋯
│    1.54% │   12.64 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, 6, 3, 4>(int, int, int, BatchedTensor, int, BatchedTensor, int, BatchedTensor, int, BatchedTensor, int, int, int, float const*, float const*, float, float, int, cublasLtEpilogue_t, int, void const*, long)                                                                                                                                                                                                                                                                                                ⋯
│    1.43% │   11.68 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1>(cutlass_80_simt_sgemm_128x32_8x5_nt_align1::Params)                                                                                                                                                                                                                                                                                                                                                                                                                                                         ⋯
│    1.31% │   10.73 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1>(cutlass_80_simt_sgemm_128x32_8x5_nn_align1::Params)                                                                                                                                                                                                                                                                                                                                                                                                                                                         ⋯
│    0.26% │    2.15 µs │     8 │ 268.22 ns ± 152.8  (   0.0 ‥ 476.84) │ [copy pageable to device memory]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ⋯
│    0.23% │    1.91 µs │     1 │                                      │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, float, float, false, float, float, float, true, false, false, false>(cublasLt::cublasSplitKParams<float>, float const*, float const*, float*, float*, float const*, float const*, float const*, float const*, float*, void*, long, float*, int*, float*, float*, float const*, float const*, float const*, float const*, float const*)                                                                                                                                                                          ⋯
│    0.12% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, NDRange<2, DynamicSize, DynamicSize, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>>>, CuDeviceArray<Float32, 2, 1>, Broadcasted<CuArrayStyle<2, DeviceMemory>, Tuple<OneTo<Int64>, OneTo<Int64>>, _, Tuple<Int64, Broadcasted<CuArrayStyle<2, DeviceMemory>, void, _, Tuple<Extruded<CuDeviceArray<Float32, 2, 1>, Tuple<Bool, Bool>, Tuple<Int64, Int64>>, Extrude ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 column omitted

For ArrayDiff

Profiler ran for 1.38 ms, capturing 1476 events.

Host-side activity: calling CUDA APIs took 627.52 µs (45.34% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬──────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                 │
├──────────┼────────────┼───────┼───────────────────────────────────────┼──────────────────────────────────────────────────────┤
│    7.08% │   97.99 µs │    14 │    7.0 µs ± 3.75   (  5.01 ‥ 19.79)   │ cuLaunchKernelEx                                     │
│    5.37% │   74.39 µs │    14 │   5.31 µs ± 2.02   (  3.81 ‥ 10.73)   │ cuMemcpyHtoDAsync                                    │
│    4.29% │   59.37 µs │     5 │  11.87 µs ± 2.25   (  9.78 ‥ 15.5)    │ cuMemcpyDtoHAsync                                    │
│    2.79% │   38.62 µs │    15 │   2.57 µs ± 1.18   (  1.43 ‥ 6.2)     │ cuMemAllocFromPoolAsync                              │
│    2.62% │   36.24 µs │     4 │   9.06 µs ± 5.17   (  3.81 ‥ 15.97)   │ cuMemsetD32Async                                     │
│    2.53% │   35.05 µs │     5 │   7.01 µs ± 2.49   (  4.77 ‥ 10.97)   │ cuLaunchKernel                                       │
│    1.67% │   23.13 µs │     2 │  11.56 µs ± 2.19   ( 10.01 ‥ 13.11)   │ cuMemcpyDtoDAsync                                    │
│    1.03% │   14.31 µs │     2 │   7.15 µs ± 1.35   (   6.2 ‥ 8.11)    │ cudaLaunchKernelExC                                  │
│    0.72% │   10.01 µs │    11 │ 910.33 ns ± 234.04 (715.26 ‥ 1430.51) │ cuStreamSynchronize                                  │
│    0.59% │    8.11 µs │     1 │                                       │ cudaLaunchKernel                                     │
│    0.40% │    5.48 µs │     1 │                                       │ cudaMemsetAsync                                      │
│    0.29% │    4.05 µs │     3 │   1.35 µs ± 0.5    (  0.95 ‥ 1.91)    │ cudaEventRecord                                      │
│    0.29% │    4.05 µs │    12 │ 337.76 ns ± 448.45 (   0.0 ‥ 1668.93) │ cudaGetLastError                                     │
│    0.26% │    3.58 µs │     5 │ 715.26 ns ± 337.17 (476.84 ‥ 1192.09) │ cuKernelGetFunction                                  │
│    0.21% │    2.86 µs │     5 │  572.2 ns ± 361.58 (238.42 ‥ 1192.09) │ cudaGetDevice                                        │
│    0.16% │    2.15 µs │     5 │ 429.15 ns ± 199.48 (238.42 ‥ 715.26)  │ cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags │
│    0.16% │    2.15 µs │     5 │ 429.15 ns ± 106.62 (238.42 ‥ 476.84)  │ cudaDeviceGetAttribute                               │
│    0.05% │  715.26 ns │     4 │ 178.81 ns ± 228.27 (   0.0 ‥ 476.84)  │ cuCtxPushCurrent                                     │
│    0.05% │  715.26 ns │     4 │ 178.81 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuCtxPopCurrent                                      │
│    0.05% │  715.26 ns │     4 │ 178.81 ns ± 228.27 (   0.0 ‥ 476.84)  │ cuCtxGetDevice                                       │
│    0.03% │  476.84 ns │     1 │                                       │ cuKernelGetName                                      │
│    0.02% │  238.42 ns │     4 │   59.6 ns ± 119.21 (   0.0 ‥ 238.42)  │ cuDeviceGet                                          │
└──────────┴────────────┴───────┴───────────────────────────────────────┴──────────────────────────────────────────────────────┘

Device-side activity: GPU was busy for 224.11 µs (16.19% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ⋯
├──────────┼────────────┼───────┼──────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│    3.43% │   47.45 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x64_8x5_nt_align1>(cutlass_80_simt_sgemm_128x64_8x5_nt_align1::Params)                                                                                                                                                                                                                                                                                                                                                                                                                                                         ⋯
│    2.10% │   29.09 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_64x128_8x5_tn_align1>(cutlass_80_simt_sgemm_64x128_8x5_tn_align1::Params)                                                                                                                                                                                                                                                                                                                                                                                                                                                         ⋯
│    1.84% │   25.51 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_32x128_8x5_nn_align1>(cutlass_80_simt_sgemm_32x128_8x5_nn_align1::Params)                                                                                                                                                                                                                                                                                                                                                                                                                                                         ⋯
│    1.29% │   17.88 µs │     1 │                                      │ void magma_sgemmEx_kernel<float, float, float, true, false, 6, 4, 6, 3, 4>(int, int, int, BatchedTensor, int, BatchedTensor, int, BatchedTensor, int, BatchedTensor, int, int, int, float const*, float const*, float, float, int, cublasLtEpilogue_t, int, void const*, long)                                                                                                                                                                                                                                                                                                ⋯
│    1.26% │    17.4 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 1, 1>, Broadcasted<CuArrayStyle<1, DeviceMemory>, Tuple<OneTo<Int64>>, ifelse, Tuple<Broadcasted<CuArrayStyle<1, DeviceMemory>, void, _, Tuple<Broadcasted<CuArrayStyle<1, DeviceMemory>, void, __, Tuple<Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>, Int64>>, Broadcasted<C ⋯
│    0.95% │   13.11 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 1, 1>, Broadcasted<CuArrayStyle<1, DeviceMemory>, Tuple<OneTo<Int64>>, tanh, Tuple<Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>>>)                                                                                                                                             ⋯
│    0.84% │   11.68 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1>(cutlass_80_simt_sgemm_128x32_8x5_nt_align1::Params)                                                                                                                                                                                                                                                                                                                                                                                                                                                         ⋯
│    0.79% │   10.97 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 1, 1>, Broadcasted<CuArrayStyle<1, DeviceMemory>, Tuple<OneTo<Int64>>, _, Tuple<Float32, Broadcasted<CuArrayStyle<1, DeviceMemory>, void, _, Tuple<Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>, Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>>>>>)        ⋯
│    0.78% │   10.73 µs │     1 │                                      │ void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nn_align1>(cutlass_80_simt_sgemm_128x32_8x5_nn_align1::Params)                                                                                                                                                                                                                                                                                                                                                                                                                                                         ⋯
│    0.67% │     9.3 µs │     5 │   1.86 µs ± 3.12   (  0.24 ‥ 7.39)   │ [set device memory]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ⋯
│    0.45% │     6.2 µs │     5 │   1.24 µs ± 0.2    (  0.95 ‥ 1.43)   │ [copy device to pageable memory]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ⋯
│    0.28% │    3.81 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing___is_equal_, Tuple<Bool, Tuple>>, CartesianIndices<1, __is_missing___is_equal_<OneTo<Int64>>>, __is_missing___is_equal_<OneTo<Int64>>, Val<false>, CuDeviceArray<Tuple<Bool, Tuple>, 2, 1>, Broadcasted<CuArrayStyle<1, DeviceMemory>, OneTo<Int64>, mapper, __is_missing___is_equal_<Val<false><Float32, 1, 1>, Float32>>)                                                                                                                                                                                 ⋯
│    0.19% │    2.62 µs │     2 │   1.31 µs ± 0.17   (  1.19 ‥ 1.43)   │ void cublasLt::splitKreduce_kernel<32, 16, int, float, float, float, float, false, float, float, float, true, false, false, false>(cublasLt::cublasSplitKParams<float>, float const*, float const*, float*, float*, float const*, float const*, float const*, float const*, float*, void*, long, float*, int*, float*, float*, float const*, float const*, float const*, float const*, float const*)                                                                                                                                                                          ⋯
│    0.17% │    2.38 µs │     1 │                                      │ partial_mapreduce_grid(identity, reducer, NamedTuple<__is_missing___is_equal_, Tuple<Bool, Tuple>>, CartesianIndices<2, __is_missing___is_equal_<OneTo<Int64>, Int64>>, __is_missing___is_equal_<OneTo<Int64>, Int64>, Val<false>, CuDeviceArray<Tuple<Bool, Tuple>, 2, 1>, CuDeviceArray)                                                                                                                                                                                                                                                                                    ⋯
│    0.17% │    2.38 µs │     2 │   1.19 µs ± 0.0    (  1.19 ‥ 1.19)   │ [copy device to device memory]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ⋯
│    0.17% │    2.38 µs │    14 │  170.3 ns ± 145.73 (   0.0 ‥ 476.84) │ [copy pageable to device memory]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ⋯
│    0.14% │    1.91 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, void, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, Val<true>, CuDeviceArray<Float32, 2, 1>, CuDeviceArray<Float32, 2, 1>)                                                                                                                                                                                                                                                                                                                                        ⋯
│    0.12% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(identity, add_sum, Float32, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>, Val<true>, CuDeviceArray<Float32, 1, 1>, CuDeviceArray<Float32, 1, 1>)                                                                                                                                                                                                                                                                                                                                                                 ⋯
│    0.12% │    1.67 µs │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 1, 1>, Broadcasted<CuArrayStyle<1, DeviceMemory>, Tuple<OneTo<Int64>>, _, Tuple<Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>, Broadcasted<CuArrayStyle<1, DeviceMemory>, void, _, Tuple<Float32, Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>>>>>)        ⋯
│    0.12% │    1.67 µs │     1 │                                      │ partial_mapreduce_grid(_, add_sum, Float32, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, Val<true>, CuDeviceArray<Float32, 2, 1>, CuDeviceArray<Float32, 2, 1>)                                                                                                                                                                                                                                                                                                                                            ⋯
│    0.10% │    1.43 µs │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, NDRange<2, DynamicSize, DynamicSize, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>>>, CuDeviceArray<Float32, 2, 1>, Broadcasted<CuArrayStyle<2, DeviceMemory>, Tuple<OneTo<Int64>, OneTo<Int64>>, _, Tuple<Extruded<CuDeviceArray<Float32, 2, 1>, Tuple<Bool, Bool>, Tuple<Int64, Int64>>, Extruded<CuDeviceArray<Float32, 2, 1>, Tuple<Bool, Bool>, Tuple<Int64, I ⋯
│    0.07% │  953.67 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 1, 1>, Broadcasted<CuArrayStyle<1, DeviceMemory>, Tuple<OneTo<Int64>>, identity, Tuple<Extruded<SubArray<Float32, 0, CuDeviceArray<Float32, 1, 1>, Tuple<ReshapedArray<Int64, 0, UnitRange<Int64>, Tuple<>>>, true>, Tuple<>, Tuple<>>>>)                                                           ⋯
│    0.05% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 1, 1>, Broadcasted<CuArrayStyle<1, DeviceMemory>, Tuple<OneTo<Int64>>, _, Tuple<Broadcasted<CuArrayStyle<1, DeviceMemory>, void, _, Tuple<Int64, Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>>>, Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>>>)          ⋯
│    0.05% │  715.26 ns │     1 │                                      │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, CuDeviceArray<Float32, 1, 1>, Broadcasted<CuArrayStyle<1, DeviceMemory>, Tuple<OneTo<Int64>>, _, Tuple<Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>, Extruded<CuDeviceArray<Float32, 1, 1>, Tuple<Bool>, Tuple<Int64>>>>)                                                                             ⋯
│    0.03% │  476.84 ns │     1 │                                      │ gpu_broadcast_kernel_cartesian(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>>>, SubArray<Float32, 0, CuDeviceArray<Float32, 1, 1>, Tuple<ReshapedArray<Int64, 0, UnitRange<Int64>, Tuple<>>>, true>, Broadcasted<CuArrayStyle<0, DeviceMemory>, Tuple<>, identity, Tuple<Float32>>)                                                                                                                     ⋯
└──────────┴────────────┴───────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        1 column omitted

codecov · 2026-05-27T22:06:19Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.23%. Comparing base (2275500) to head (8bc418d).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #74      +/-   ##
==========================================
+ Coverage   90.61%   92.23%   +1.61%     
==========================================
  Files          24       25       +1     
  Lines        3017     3219     +202     
==========================================
+ Hits         2734     2969     +235     
+ Misses        283      250      -33

☔ View full report in Codecov by Sentry.

Add GPU profiles

b7e3118

Fix format

8bc418d

blegat merged commit cfd3e35 into main May 28, 2026
3 checks passed


blegat merged 2 commits into main from bl/profile
