Infinite Reaper Loop with Sequence Batcher. #8449

@tjad

Description
At some point the sequence batcher goes into an infinite loop trying to clear stale sequence IDs.
This is a critical/fatal issue: it prevents the affected model from doing any further processing.

Other models that also use the sequence batcher still seem to work fine; only this one model's sequence batcher instance is affected.

This bug also existed in previous versions of Triton, except that it would send the main Triton process into an infinite loop (using 100% CPU) and prevent all models from working. Triton seems better now in that the problem is confined to a single model and does not stop other models from continuing to process.

We have multiple models running with the sequence batcher. One of the three models is affected in isolation: its sequence batcher enters this loop and all further processing for that model halts. In the log excerpt, the same CORRIDs are reported idle again roughly every 50 ms:

I1009 03:39:04.357965 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 15: max sequence idle exceeded"
I1009 03:39:04.357969 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 15"
I1009 03:39:04.357972 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 41: max sequence idle exceeded"
I1009 03:39:04.357977 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 41"
I1009 03:39:04.357980 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 13: max sequence idle exceeded"
I1009 03:39:04.357984 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 13"
I1009 03:39:04.357988 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 39: max sequence idle exceeded"
I1009 03:39:04.357992 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 39"
I1009 03:39:04.357996 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 11: max sequence idle exceeded"
I1009 03:39:04.358000 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 11"
I1009 03:39:04.358004 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 37: max sequence idle exceeded"
I1009 03:39:04.358008 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 37"
I1009 03:39:04.408104 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 17: max sequence idle exceeded"
I1009 03:39:04.408124 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 17"
I1009 03:39:04.408128 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 9: max sequence idle exceeded"
I1009 03:39:04.408132 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 9"
I1009 03:39:04.408136 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 5: max sequence idle exceeded"
I1009 03:39:04.408140 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 5"
I1009 03:39:04.408144 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 3: max sequence idle exceeded"
I1009 03:39:04.408148 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 3"
I1009 03:39:04.408152 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 1: max sequence idle exceeded"
I1009 03:39:04.408156 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 1"
I1009 03:39:04.408160 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 15: max sequence idle exceeded"
I1009 03:39:04.408163 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 15"
I1009 03:39:04.408167 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 41: max sequence idle exceeded"
I1009 03:39:04.408171 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 41"
I1009 03:39:04.408175 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 13: max sequence idle exceeded"
I1009 03:39:04.408200 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 13"
I1009 03:39:04.408205 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 39: max sequence idle exceeded"
I1009 03:39:04.408209 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 39"
I1009 03:39:04.408213 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 11: max sequence idle exceeded"
I1009 03:39:04.408217 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 11"
I1009 03:39:04.408220 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 37: max sequence idle exceeded"
I1009 03:39:04.408225 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 37"
I1009 03:39:04.458325 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 17: max sequence idle exceeded"
I1009 03:39:04.458344 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 17"
I1009 03:39:04.458349 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 9: max sequence idle exceeded"
I1009 03:39:04.458354 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 9"
I1009 03:39:04.458357 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 5: max sequence idle exceeded"
I1009 03:39:04.458361 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 5"
I1009 03:39:04.458365 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 3: max sequence idle exceeded"
I1009 03:39:04.458369 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 3"
I1009 03:39:04.458373 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 1: max sequence idle exceeded"
I1009 03:39:04.458376 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 1"
I1009 03:39:04.458380 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 15: max sequence idle exceeded"
I1009 03:39:04.458384 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 15"
I1009 03:39:04.458388 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 41: max sequence idle exceeded"
I1009 03:39:04.458392 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 41"
I1009 03:39:04.458396 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 13: max sequence idle exceeded"
I1009 03:39:04.458400 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 13"
I1009 03:39:04.458404 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 39: max sequence idle exceeded"
I1009 03:39:04.458408 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 39"
I1009 03:39:04.458412 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 11: max sequence idle exceeded"
I1009 03:39:04.458416 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 11"
I1009 03:39:04.458420 42 sequence_batch_scheduler.cc:1079] "Reaper: CORRID 37: max sequence idle exceeded"
I1009 03:39:04.458424 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 37"
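
To confirm from a longer log that the same CORRIDs really are being reaped repeatedly, something like the following may help. It is only a sketch: the regular expression is fitted to the "found idle" lines shown above, and the helper names (`reaper_events`, `repeated_corrids`) are made up for this example, not part of Triton.

```python
import re
from collections import Counter

# Fitted to the "Reaper: found idle CORRID N" lines in the excerpt above.
REAPER_RE = re.compile(
    r'I\d{4} (\d{2}):(\d{2}):(\d{2})\.(\d{6}) .*"Reaper: found idle CORRID (\d+)"'
)

def reaper_events(lines):
    """Yield (seconds_since_midnight, corrid) for each 'found idle' line."""
    for line in lines:
        m = REAPER_RE.search(line)
        if m:
            h, mi, s, us, corrid = m.groups()
            t = int(h) * 3600 + int(mi) * 60 + int(s) + int(us) / 1e6
            yield t, int(corrid)

def repeated_corrids(lines):
    """Return {corrid: count} for CORRIDs reported idle more than once."""
    counts = Counter(corrid for _, corrid in reaper_events(lines))
    return {c: n for c, n in counts.items() if n > 1}

# Four lines taken from the excerpt above.
sample = [
    'I1009 03:39:04.357969 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 15"',
    'I1009 03:39:04.408124 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 17"',
    'I1009 03:39:04.408163 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 15"',
    'I1009 03:39:04.458384 42 sequence_batch_scheduler.cc:1101] "Reaper: found idle CORRID 15"',
]
print(repeated_corrids(sample))  # {15: 3}
```

A healthy reaper would be expected to report each idle CORRID once per sequence, so any CORRID with a high count over a short window points at the loop described here.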

Triton Information
2.59 from container (nvcr.io/nvidia/tritonserver:25.07-py3)

Running on GCP with 2x L4 GPUs. Each model is deployed to one GPU only; no model is duplicated, either on a single GPU or across GPUs.

Are you using the Triton container or did you build it yourself?
nvcr.io/nvidia/tritonserver:25.07-py3

To Reproduce
Not sure. I suspect the END of a sequence is not being sent. This happens intermittently and is difficult to reproduce in other environments using the identical image, but eventually it ends up in this state, which suggests some sort of race condition.

I have tried to reproduce the issue by sending the sequence batcher controls (START, END) set, unset, or omitted entirely. We only use START/END.
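
For anyone who wants to try the same kind of reproduction, the attempts above can be sketched roughly as follows. This is an assumption-laden outline, not our actual client: `sequence_flags` and `send_sequence` are hypothetical helpers, `client` is assumed to be a `tritonclient` `InferenceServerClient` (its `infer` call accepts `sequence_id`, `sequence_start` and `sequence_end`), and the model name and inputs are left to the caller.

```python
def sequence_flags(num_requests, send_start=True, send_end=True):
    """Return per-request (sequence_start, sequence_end) flags for one sequence."""
    flags = []
    for i in range(num_requests):
        start = send_start and i == 0               # START only on the first request
        end = send_end and i == num_requests - 1    # END only on the last request
        flags.append((start, end))
    return flags

def send_sequence(client, model_name, corrid, inputs_per_request, **flag_opts):
    """Send one sequence of requests under a single correlation ID.

    `client` is assumed to be a tritonclient InferenceServerClient;
    `inputs_per_request` is a list of input lists, one per request.
    """
    flags = sequence_flags(len(inputs_per_request), **flag_opts)
    results = []
    for inputs, (start, end) in zip(inputs_per_request, flags):
        results.append(client.infer(
            model_name, inputs,
            sequence_id=corrid, sequence_start=start, sequence_end=end,
        ))
    return results

# A three-request sequence that never sends END, i.e. the suspected trigger.
print(sequence_flags(3, send_end=False))
```

Calling `send_sequence(..., send_end=False)` for many CORRIDs and then waiting longer than the idle timeout should force the reaper to run, which is the path suspected here.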

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

sequence_batching{
    oldest{
      max_queue_delay_microseconds: 10000
    }
    max_sequence_idle_microseconds: 5000000
    control_input [
        {
            name: "START",
            control [
                {
                    kind: CONTROL_SEQUENCE_START
                    fp32_false_true: [0, 1]
                }
            ]
        },
        {
            name: "READY"
            control [
                {
                    kind: CONTROL_SEQUENCE_READY
                    fp32_false_true: [0, 1]
                }
            ]
        },
        {
            name: "CORRID",
            control [
                {
                    kind: CONTROL_SEQUENCE_CORRID
                    data_type: TYPE_UINT64
                }
            ]
        },
        {
            name: "END",
            control [
                {
                    kind: CONTROL_SEQUENCE_END
                    fp32_false_true: [0, 1]
                }
            ]
        }
    ]
}
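
For reference, the two timing values in this config work out to a 5 s idle limit and a 10 ms queue delay. A small sketch for checking whether a client's own request timestamps would leave a sequence idle past that limit is shown here; `idle_gaps` is a hypothetical helper and the timestamps in the example are made up.

```python
MAX_SEQUENCE_IDLE_US = 5_000_000  # max_sequence_idle_microseconds from the config above
MAX_QUEUE_DELAY_US = 10_000       # max_queue_delay_microseconds from the config above

def idle_gaps(request_times_s, idle_limit_us=MAX_SEQUENCE_IDLE_US):
    """Return (index, gap_seconds) for each gap between requests that exceeds the limit."""
    limit_s = idle_limit_us / 1_000_000
    gaps = []
    for i in range(1, len(request_times_s)):
        gap = request_times_s[i] - request_times_s[i - 1]
        if gap > limit_s:
            gaps.append((i, gap))
    return gaps

# Made-up send times (seconds) for one sequence; the third request arrives 6 s late.
print(idle_gaps([0.0, 1.0, 7.0, 7.5]))  # [(2, 6.0)]
```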

Expected behavior
Once a CORRID has been reaped, the reaper should not need to reap that CORRID again, so that subsequent sequences can be processed.
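
To make the expectation concrete, here is a toy model of the behavior I would expect. It is not how `sequence_batch_scheduler.cc` is implemented; `reap_idle` and the dictionary of last-activity times are invented purely for illustration.

```python
def reap_idle(last_activity_us, now_us, idle_limit_us):
    """Remove and return CORRIDs idle longer than the limit (toy model, not Triton code)."""
    reaped = [c for c, t in last_activity_us.items() if now_us - t > idle_limit_us]
    for c in reaped:
        del last_activity_us[c]  # once reaped, the CORRID is no longer tracked
    return reaped

# CORRIDs 15 and 41 were last active at t=0; CORRID 13 was active at t=4 s.
active = {15: 0, 41: 0, 13: 4_000_000}
first = reap_idle(active, now_us=6_000_000, idle_limit_us=5_000_000)
second = reap_idle(active, now_us=6_050_000, idle_limit_us=5_000_000)
print(first, second)  # [15, 41] []
```

In the log above, by contrast, the equivalent of the second pass (50 ms later) reports the same CORRIDs again.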

Labels: bug
