Redundant accesses due to missing predicates in resize #3710

naoyam · 2025-01-15T02:54:38Z

Just realized a pad may be scheduled in a redundant way with the resize scheduler. For example:

TEST_F(PredicateIndexingTest, SplitAndResize) {
  Fusion fusion;
  FusionGuard fg(&fusion);

  const int64_t i0 = 4;
  const int64_t i1 = 32;

  // [i0*i1]
  auto tv0 = makeContigConcreteTensor({i0 * i1});
  fusion.addInput(tv0);

  auto tv1 = set(tv0);
  // [i0, i1]
  auto tv2 = reshape(tv1, {IrBuilder::create<Val>(i0), IrBuilder::create<Val>(i1)});
  // [i0, i1 * 2]
  auto tv3 = pad(tv2, {zero, IrBuilder::create<Val>(i1)});
  auto tv4 = set(tv3);
  fusion.addOutput(tv4);

  scheduler_tools::propagateResizeToInputs(tv3->definition());

Here, it just manually reproduces what the resize scheduler would do by propagating the resize of the pad. The generated kernel would look like:

__global__ void CUDAGeneratedKernel(Tensor<float, 1, 1> T0, Tensor<float, 2, 2> T4) {
  float T1[256LL];
  #pragma unroll
  for(nvfuser_index_t i0 = 0LL; i0 < 4LL; ++i0) {
    nvfuser_index_t i1;
    i1 = 64LL * i0;
    #pragma unroll
    for(nvfuser_index_t i2 = 0LL; i2 < 64LL; ++i2) {
      T1[(i1 + i2)] = 0LL;
    }
  }
  #pragma unroll
  for(nvfuser_index_t i0 = 0LL; i0 < 4LL; ++i0) {
    nvfuser_index_t i3;
    i3 = 32LL * i0;
    nvfuser_index_t i4;
    i4 = 64LL * i0;
    #pragma unroll
    for(nvfuser_index_t i2 = 0LL; i2 < 64LL; ++i2) {
      if (((96LL + i2) < 128LL)) {
        T1[(i4 + i2)]
           = T0[(i3 + i2)];
      }
    }
  }
  float T2[256LL];
  #pragma unroll
  for(nvfuser_index_t i5 = 0LL; i5 < 4LL; ++i5) {
    nvfuser_index_t i6;
    i6 = 64LL * i5;
    #pragma unroll
    for(nvfuser_index_t i7 = 0LL; i7 < 64LL; ++i7) {
      nvfuser_index_t i8;
      i8 = i6 + i7;
      T2[i8]
         = T1[i8];
    }
  }
  // Alias Allocation - register
  auto& T3 = T1;
  #pragma unroll
  for(nvfuser_index_t i9 = 0LL; i9 < 4LL; ++i9) {
    nvfuser_index_t i10;
    i10 = 64LL * i9;
    #pragma unroll
    for(nvfuser_index_t i11 = 0LL; i11 < 64LL; ++i11) {
      nvfuser_index_t i12;
      i12 = i10 + i11;
      T3[i12]
         = (i11 < 32LL) ? T2[i12] : 0.000000000e+00f;
    }
  }
  #pragma unroll
  for(nvfuser_index_t i13 = 0LL; i13 < 4LL; ++i13) {
    nvfuser_index_t i14;
    i14 = 64LL * i13;
    #pragma unroll
    for(nvfuser_index_t i15 = 0LL; i15 < 64LL; ++i15) {
      nvfuser_index_t i16;
      i16 = i14 + i15;
      T4[i16]
         = T3[i16];
    }
  }
}

Here, the important part is the predicate of the T0 read:

  for(nvfuser_index_t i0 = 0LL; i0 < 4LL; ++i0) {
    nvfuser_index_t i3;
    i3 = 32LL * i0;
    nvfuser_index_t i4;
    i4 = 64LL * i0;
    #pragma unroll
    for(nvfuser_index_t i2 = 0LL; i2 < 64LL; ++i2) {
      if (((96LL + i2) < 128LL)) {   // <-- here
        T1[(i4 + i2)]
           = T0[(i3 + i2)];
      }
    }
  }

This is because T1 is scheduled as:

T1_l_float[iS11{4}, iS13{64}]
 logical domain : (iS1{128})
 contiguity: t
  Outer split: iS1{128} by factor 4 -> iS11{4}, iS12{32}
  Resize: iS12{32} by 0 and 32 -> iS13{64}
 loop domain : (iS11{4}, iS13{64})

The loop indices of i0 and i2 correspond to the loop IDs of iS11 and iS13, respectively. We predicate the logical domain of T1 by generating the corresponding index for iS1.

While this should have no issue with the correctness, half of the accesses would be redundant as they would be masked out by the actual pad expression:

      T3[i12]
         = (i11 < 32LL) ? T2[i12] : 0.000000000e+00f;

This is because of the split of the reshape. Due to the split, indices that exceed the resize input extent of iS12 may not be predicated out. This is quite similar to the non-divisible split, but because of the resize it can happen even with divisible splits like this case.

I don't think this is a correctness issue, but it's likely a performance issue unless nvrtc is smart enough to figure out the accesses are indeed redundant and eliminate the redundant accesses, which is unlikely.

The text was updated successfully, but these errors were encountered:

naoyam self-assigned this Jan 15, 2025

naoyam added a commit that referenced this issue Jan 15, 2025

Fixes #3710

c8baf1e

naoyam added a commit that referenced this issue Jan 15, 2025

Fixes #3710

73d36ad

naoyam linked a pull request Jan 15, 2025 that will close this issue

Avoid redundancy with resize ops #3713

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redundant accesses due to missing predicates in resize #3710

Redundant accesses due to missing predicates in resize #3710

naoyam commented Jan 15, 2025 •

edited

Loading

Redundant accesses due to missing predicates in resize #3710

Redundant accesses due to missing predicates in resize #3710

Comments

naoyam commented Jan 15, 2025 • edited Loading

naoyam commented Jan 15, 2025 •

edited

Loading