-
| 
         My question is just as in the title. To elaborate: in the middle of a software-pipelined mainloop, is it safe to break out early once a stopping condition is met? As an illustrative example, I attach a relevant code block below. I apologize that the code block is long, but it is necessary to provide the full context for the question. The early-stopping conditions are checked and executed around the lines:   still_going = true;
  compute(stage_idx, i, &still_going);  // GEMM, also sets the value of still_going
  if (!still_going) {
      break;
  }
  }[CODE] (long)    using PipelineTmaAsync = cutlass::PipelineTmaAsync<NUM_STAGES>;
    using PipelineState = cutlass::PipelineState<NUM_STAGES>;
    using BarrierType = typename PipelineTmaAsync::ProducerBarrierType;
    static constexpr auto num_consumers = cute::thr_size(TiledMma{});
    auto pipeline_params = typename PipelineTmaAsync::Params{};
    pipeline_params.transaction_bytes = tma_size_bytes;
    pipeline_params.role = PipelineTmaAsync::ThreadCategory::ProducerConsumer;
    pipeline_params.is_leader = threadIdx.x == 0;
    pipeline_params.num_consumers = num_consumers;
    auto pipeline = PipelineTmaAsync{shared_storage.pipeline, pipeline_params, ClusterShape{}};
    auto smem_pipe_read = PipelineState{};
    auto smem_pipe_write = cutlass::make_producer_start_state<PipelineTmaAsync>();
    const auto num_blocks_tma_prologue = cute::min(num_blocks, NUM_STAGES);
    const auto num_blocks_mma_prologue = cute::min(1, num_blocks_tma_prologue);
    const auto num_blocks_mma_mainloop = num_blocks - num_blocks_mma_prologue;
    /********************************************************************
     * `still_going` tracks whether an early-stopping condition is met. *
     ********************************************************************/
    bool still_going = false;
    int block_idx = 0;
    // TMA Prologue
    CUTE_UNROLL
    for (int i = 0; i < num_blocks_tma_prologue; ++i) {
        pipeline.producer_acquire(smem_pipe_write);
        auto stage_idx = smem_pipe_write.index();
        auto tma_mbar = pipeline.producer_get_barrier(smem_pipe_write);
        fetch_data(tma_mbar, i, stage_idx);  // involves a TMA load
        pipeline.producer_commit(smem_pipe_write, tma_size_bytes);
        ++smem_pipe_write;
    }
    block_idx += num_blocks_tma_prologue;
    // MMA Prologue
    CUTE_NO_UNROLL
    for (int i = 0; i < num_blocks_mma_prologue; ++i) {
        pipeline.consumer_wait(smem_pipe_read);
        auto stage_idx = smem_pipe_read.index();
        still_going = true;
        compute(stage_idx, i, &still_going);  // GEMM, also sets the value of still_going
        if (!still_going) {
            break;
        }
        pipeline.consumer_release(smem_pipe_read);
        ++smem_pipe_read;
    }
    // Main loop: MMA and TMA.
    CUTE_NO_UNROLL
    for (int i = 0; i < num_blocks_mma_mainloop; ++i) {
        pipeline.consumer_wait(smem_pipe_read);
        auto stage_idx = smem_pipe_read.index();
        still_going = false;
        compute(stage_idx, i, &still_going);  // GEMM, also sets the value of still_going
        if (!still_going) {
            break;
        }
        // next read stage
        if (block_idx < num_blocks) {
            pipeline.producer_acquire(smem_pipe_write);
            auto stage_idx = smem_pipe_write.index();
            auto tma_mbar = pipeline.producer_get_barrier(smem_pipe_write);
            fetch_data(tma_mbar, block_idx, stage_idx);  // involves a TMA load 
            pipeline.producer_commit(smem_pipe_write, tma_size_bytes);
            ++smem_pipe_write;
            ++block_idx;
        }
        pipeline.consumer_release(smem_pipe_read);
        ++smem_pipe_read;
    }
    // Wait on all GMMAs
    cute::warpgroup_wait<0>();
    cute::warpgroup_fence_operand(rO);
    if constexpr (size(ClusterShape{}) > 1) {
        cute::cluster_sync();
    } else {
        __syncthreads();
    }Truly appreciate your help!  | 
  
Beta Was this translation helpful? Give feedback.
Replies: 1 comment 6 replies
-
| 
         Yes this is fine in general. Ideally we would model this as some kind of while loop around an updating k tile counter etc. You just have to be really careful to make sure the pipeline states for producers and consumers agree if you terminate early in case this is a persistent kernel or you are fusing with another collective later in the lifetime of kernel. 
  | 
  
Beta Was this translation helpful? Give feedback.
Yes this is fine in general. Ideally we would model this as some kind of while loop around an updating k tile counter etc. You just have to be really careful to make sure the pipeline states for producers and consumers agree if you terminate early in case this is a persistent kernel or you are fusing with another collective later in the lifetime of kernel.
`still_going` for each other. Is that happening inside `compute()`?