Description
Describe the bug
When starting a brand new ingester (empty disk, running blocks storage), as soon as the ingester is registered to the ring and its state switches to ACTIVE, it suddenly receives a bunch of new series. If you target each ingester to hold about 1.5M active series, it has to add 1.5M series to TSDB in a matter of a few seconds.
Today, while scaling out a large number of ingesters (50), a few of them experienced very high latency and a high number of in-flight requests. The high number of in-flight requests caused memory usage to grow until some of these ingesters were OOMKilled.
I've been able to profile the affected ingesters, and the following is what I've found so far:
1. The number of in-flight push requests skyrockets right after ingester startup
2. The number of TSDB appenders skyrockets too
3. The average cortex_ingester_tsdb_appender_add_duration_seconds skyrockets too
4. Lock contention in Head.getOrCreateWithID()
Unsurprisingly, looking at the number of active goroutines, 99.9% were blocked in Head.getOrCreateWithID() due to lock contention.
To Reproduce
I haven't found a way to easily reproduce it locally or with a stress test yet, but unfortunately it doesn't look that difficult to reproduce in production (where debugging is harder).
Storage Engine
- Blocks
- Chunks