High spin lock activity and slow performance

We are evaluating TCMalloc with one of our large services and are trying to understand why its performance is below our baseline.

Looking at a profile, we noticed that many TCMalloc cycles are spent in the `SpinLock` class, which is predominantly called by way of the `TransferCache` and `CentralFreeList` classes. Looking at the TCMalloc statistics, we see that the ratio of per-CPU cache overflows to underflows is very close to 1.0, which might explain the contention for global resources. Here is a sample of what we see:

```
------------------------------------------------
Number of per-CPU cache underflows, overflows, and reclaims
------------------------------------------------
Total  :  1504795503 underflows,  1397507411 overflows, overflows / underflows:  0.93,            0 reclaims,       26008 resizes
cpu   0:      164336 underflows,      100338 overflows, overflows / underflows:  0.61,            0 reclaims,         116 resizes
cpu   1:    12795170 underflows,    13256567 overflows, overflows / underflows:  1.04,            0 reclaims,         165 resizes
cpu   2:    10301002 underflows,    10339379 overflows, overflows / underflows:  1.00,            0 reclaims,         165 resizes
cpu   3:    10581923 underflows,    10259379 overflows, overflows / underflows:  0.97,            0 reclaims,         165 resizes
cpu   4:     8324154 underflows,     8403909 overflows, overflows / underflows:  1.01,            0 reclaims,         165 resizes
cpu   5:     9256510 underflows,     8486369 overflows, overflows / underflows:  0.92,            0 reclaims,         160 resizes
cpu   6:     7312773 underflows,     7102156 overflows, overflows / underflows:  0.97,            0 reclaims,         165 resizes
cpu   7:     7051987 underflows,     7201721 overflows, overflows / underflows:  1.02,            0 reclaims,         165 resizes
cpu   8:    10577188 underflows,    10553890 overflows, overflows / underflows:  1.00,            0 reclaims,         162 resizes
...
```
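For anyone who wants to track this ratio over time rather than read it off the dump, here is a minimal sketch that parses one row of the table above and recomputes overflows / underflows. The names `CpuCacheCounters`, `ParseCountersRow`, and `OverflowToUnderflowRatio` are hypothetical helpers written for this post, not part of TCMalloc, and the parser assumes the row layout shown in the sample.

```cpp
#include <cstdio>
#include <string>

// Counters from one row of the "per-CPU cache underflows, overflows" table.
// Hypothetical type for illustration; not a TCMalloc API.
struct CpuCacheCounters {
  unsigned long long underflows;
  unsigned long long overflows;
};

// Parses a row such as
//   "cpu   1:    12795170 underflows,    13256567 overflows, ..."
// or the "Total  :" row. Returns false if the row does not match the
// layout assumed from the sample output.
bool ParseCountersRow(const std::string& row, CpuCacheCounters* out) {
  const std::string::size_type colon = row.find(':');
  if (colon == std::string::npos) return false;
  unsigned long long underflows = 0;
  unsigned long long overflows = 0;
  const int matched =
      std::sscanf(row.c_str() + colon + 1, " %llu underflows, %llu overflows",
                  &underflows, &overflows);
  if (matched != 2) return false;
  out->underflows = underflows;
  out->overflows = overflows;
  return true;
}

// Recomputes the ratio printed in the stats; returns 0.0 when there were
// no underflows so the caller never divides by zero.
double OverflowToUnderflowRatio(const CpuCacheCounters& counters) {
  if (counters.underflows == 0) return 0.0;
  return static_cast<double>(counters.overflows) /
         static_cast<double>(counters.underflows);
}
```

Feeding it the `Total` row from the sample reproduces the reported 0.93 when rounded to two places.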

We followed advice from the tuning guide: we increased the page size from 8KiB to 32KiB and then to 256KiB, and increased the per-CPU cache size to 40MiB (and even larger values). We also release memory periodically with `ProcessBackgroundActions()`.

Increasing the page size to 32KiB seemed to offset the performance loss by half, but going to 256KiB did not offer much additional benefit.

Increasing the per-CPU cache size, the suggested way to address an overflow-to-underflow ratio close to 1.0, had only a negligible effect. Although TCMalloc reports `PARAMETER tcmalloc_max_per_cpu_cache_size 41943040`, the amount of memory in the per-CPU caches is only a small fraction of that maximum (about 1 MiB or less per CPU in the sample below):

```
------------------------------------------------
Bytes in per-CPU caches (per cpu limit: 41943040 bytes)
------------------------------------------------
cpu   0:       362584 bytes (    0.3 MiB) with      596200 bytes unallocated  active populated physical-populated
cpu   1:       740560 bytes (    0.7 MiB) with           0 bytes unallocated  active populated physical-populated
cpu   2:       369256 bytes (    0.4 MiB) with           0 bytes unallocated  active populated physical-populated
cpu   3:       538232 bytes (    0.5 MiB) with           0 bytes unallocated  active populated physical-populated
cpu   4:       871312 bytes (    0.8 MiB) with           0 bytes unallocated  active populated physical-populated
cpu   5:       702480 bytes (    0.7 MiB) with           0 bytes unallocated  active populated physical-populated
cpu   6:      1072800 bytes (    1.0 MiB) with           0 bytes unallocated  active populated physical-populated
cpu   7:       667552 bytes (    0.6 MiB) with           0 bytes unallocated  active populated physical-populated
cpu   8:       794504 bytes (    0.8 MiB) with           0 bytes unallocated  active populated physical-populated
...
```
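To put a number on how small that fraction is, here is a rough sketch that parses one row of the table above and divides the bytes in use by the configured limit. `CpuCacheUsage`, `ParseUsageRow`, and `FractionOfLimit` are hypothetical helpers for this post, not TCMalloc APIs, and the parser assumes the row layout shown in the sample.

```cpp
#include <cstdio>
#include <string>

// One row of the "Bytes in per-CPU caches" table.
// Hypothetical type for illustration; not a TCMalloc API.
struct CpuCacheUsage {
  int cpu;
  unsigned long long bytes_used;
  unsigned long long bytes_unallocated;
};

// Parses a row such as
//   "cpu   6:      1072800 bytes (    1.0 MiB) with           0 bytes unallocated ..."
// Returns false if the row does not match the assumed layout.
bool ParseUsageRow(const std::string& row, CpuCacheUsage* out) {
  int cpu = 0;
  unsigned long long used = 0;
  unsigned long long unallocated = 0;
  // %*f skips the MiB figure, which is redundant with the byte count.
  const int matched = std::sscanf(
      row.c_str(), " cpu %d: %llu bytes ( %*f MiB) with %llu bytes unallocated",
      &cpu, &used, &unallocated);
  if (matched != 3) return false;
  out->cpu = cpu;
  out->bytes_used = used;
  out->bytes_unallocated = unallocated;
  return true;
}

// Fraction of the per-CPU limit that is currently in use, in [0, 1] for
// well-formed input. Returns 0.0 if the limit is zero.
double FractionOfLimit(const CpuCacheUsage& usage,
                       unsigned long long per_cpu_limit_bytes) {
  if (per_cpu_limit_bytes == 0) return 0.0;
  return static_cast<double>(usage.bytes_used) /
         static_cast<double>(per_cpu_limit_bytes);
}
```

With the 41943040-byte limit from the header, the largest row in the sample (cpu 6, 1072800 bytes) comes out at roughly 2.6% of the limit.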

Based on the documentation, an overflow/underflow ratio of ~1.0 indicates a cache that is not large enough, yet the actual cache utilization shown above is far below the configured limit, which we find confusing.

Are there other tunings we can try that might address this problem, or other signals we should look at to understand what is going wrong?
