
Ingester latency and in-flight requests spike right after startup with empty TSDB #3349

Open

Description

@pracucci

Describe the bug
When starting a brand new ingester (empty disk, running blocks storage), as soon as the ingester is registered to the ring and its state switches to ACTIVE, it suddenly receives a flood of new series. If you target each ingester to hold about 1.5M active series, it has to add 1.5M series to its TSDB in a matter of a few seconds.
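
For a sense of the rate involved, a quick back-of-envelope sketch (the 15s scrape interval is an assumption, not something stated in this issue): if every series sends its first sample within one scrape interval of the ingester turning ACTIVE, all series creations are packed into that window.

```go
package main

import "fmt"

func main() {
	const (
		activeSeriesPerIngester = 1_500_000
		scrapeIntervalSeconds   = 15 // assumed typical interval
	)
	// Every first sample of a series forces a series creation in the head,
	// so a cold ingester must absorb all of them within one interval.
	fmt.Printf("~%d series creations/sec right after startup\n",
		activeSeriesPerIngester/scrapeIntervalSeconds) // => ~100000/sec
}
```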

Today, while scaling out a large number of ingesters (50), a few of them exhibited very high latency and a high number of in-flight requests. The high number of in-flight requests caused memory usage to grow until some of these ingesters were OOMKilled.
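
Since unbounded in-flight requests are what drove the memory growth, here is a minimal sketch of how an ingester could expose and cap them (metric name, limit, and handler are illustrative, not Cortex's actual code):

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Illustrative metric name; not necessarily what Cortex exposes.
var inflightGauge = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "ingester_inflight_push_requests",
	Help: "Current number of in-flight push requests.",
})

var inflight int64

const maxInflight = 2000 // illustrative cap, would need tuning

func handlePush(w http.ResponseWriter, r *http.Request) {
	// Shed load instead of queueing without bound: every queued request
	// holds memory, which is what led to the OOM kills described above.
	if atomic.AddInt64(&inflight, 1) > maxInflight {
		atomic.AddInt64(&inflight, -1)
		http.Error(w, "too many in-flight push requests", http.StatusTooManyRequests)
		return
	}
	inflightGauge.Inc()
	defer func() {
		inflightGauge.Dec()
		atomic.AddInt64(&inflight, -1)
	}()

	// ... decode the remote-write payload and append it to the TSDB head ...
	w.WriteHeader(http.StatusOK)
}

func main() {
	prometheus.MustRegister(inflightGauge)
	http.HandleFunc("/push", handlePush)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```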

I've been able to profile the affected ingesters, and the following is what I've found so far.

1. Number of in-flight push requests skyrockets right after ingester startup

[screenshot: in-flight push requests (2020-10-14 17:04:02)]

2. The number of TSDB appenders skyrockets too

[screenshot: number of TSDB appenders (2020-10-14 17:02:59)]

3. Average cortex_ingester_tsdb_appender_add_duration_seconds skyrockets too

[screenshot: average appender add duration (2020-10-14 17:06:17)]

4. Lock contention in Head.getOrCreateWithID()

Unsurprisingly, looking at the number of active goroutines, 99.9% were blocked in Head.getOrCreateWithID() due to lock contention.

[screenshot: goroutine profile (2020-10-14 12:55:34)]
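
To make the failure mode concrete, here is a toy model of that contention (a deliberately simplified sketch; the real Prometheus head is more elaborate, e.g. it stripes its locks): lookups of existing series take a shared read lock, while series creations take the exclusive write lock. On an empty head every single call lands on the exclusive path, so concurrent pushers serialize, which is exactly the picture in the goroutine dump above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// head is a toy stand-in for the TSDB head's series map; the real
// Head.getOrCreateWithID() is more sophisticated but has the same shape.
type head struct {
	mtx    sync.RWMutex
	series map[string]uint64
	nextID uint64
}

func (h *head) getOrCreate(key string) uint64 {
	h.mtx.RLock()
	id, ok := h.series[key]
	h.mtx.RUnlock()
	if ok {
		return id // fast path: series already exists (steady state)
	}
	h.mtx.Lock() // slow path: on an empty head, every call ends up here
	defer h.mtx.Unlock()
	if id, ok := h.series[key]; ok {
		return id // another goroutine created it while we waited
	}
	h.nextID++
	h.series[key] = h.nextID
	return h.nextID
}

// run simulates pushers: goroutines x perG series, all through one head.
func run(h *head, goroutines, perG int) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < perG; i++ {
				h.getOrCreate(fmt.Sprintf("series_%d_%d", g, i))
			}
		}(g)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	h := &head{series: make(map[string]uint64, 1_500_000)}
	cold := run(h, 100, 15_000) // 1.5M brand-new series: all slow path
	warm := run(h, 100, 15_000) // same series again: all fast path
	fmt.Printf("cold (all creations): %v\nwarm (all lookups):   %v\n", cold, warm)
}
```

The cold pass exercises the exclusive lock 1.5M times while 100 goroutines compete for it; the warm pass replays the same series and never leaves the read path.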

To Reproduce
I haven't found a way to easily reproduce it locally or with a stress test yet, but unfortunately it looks like it's not that difficult to reproduce in production (where debugging is harder).
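
As a possible starting point for such a stress test, the sketch below pushes 1.5M brand-new series to a local ingester via remote write, from many workers at once, mimicking what distributors do the moment a fresh ingester turns ACTIVE. The endpoint, tenant header, and sizing are assumptions about a typical local Cortex setup, and this is untested as a reproduction:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"time"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

const (
	pushURL     = "http://localhost:9009/api/v1/push" // assumed local Cortex
	tenant      = "stress-test"                       // assumed tenant ID
	workers     = 100
	batchSize   = 1000 // series per request
	batchesEach = 15   // workers * batchSize * batchesEach = 1.5M series
)

func buildBatch(worker, batch int) []byte {
	ts := time.Now().UnixMilli()
	req := prompb.WriteRequest{}
	for i := 0; i < batchSize; i++ {
		req.Timeseries = append(req.Timeseries, prompb.TimeSeries{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "stress_metric"},
				// Unique label value per series: every push creates
				// new head series, as on a cold ingester.
				{Name: "series", Value: fmt.Sprintf("%d_%d_%d", worker, batch, i)},
			},
			Samples: []prompb.Sample{{Value: 1, Timestamp: ts}},
		})
	}
	data, _ := proto.Marshal(&req) // error handling elided in this sketch
	return snappy.Encode(nil, data)
}

func push(body []byte) error {
	httpReq, _ := http.NewRequest("POST", pushURL, bytes.NewReader(body))
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")
	httpReq.Header.Set("X-Scope-OrgID", tenant)
	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	var wg sync.WaitGroup
	start := time.Now()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for b := 0; b < batchesEach; b++ {
				if err := push(buildBatch(w, b)); err != nil {
					fmt.Println("push error:", err)
				}
			}
		}(w)
	}
	wg.Wait()
	fmt.Println("pushed 1.5M new series in", time.Since(start))
}
```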

Storage Engine

  • [x] Blocks
  • [ ] Chunks
