Hi there,
I'm currently evaluating tcmalloc as a potential replacement for our aging memory allocator (jemalloc 3.6.0). One thing I've noticed is that tcmalloc's virtual memory consumption is much higher than that of other allocators such as mimalloc and gperftools tcmalloc. Is there something I'm doing wrong?
For testing, I built tcmalloc as a shared library following #27 (comment) and injected it into our application via LD_PRELOAD. (I know this isn't recommended because of potential ODR issues with Abseil; this is just for testing.) I then ran one of our fluid simulation workloads. When building with Bazel, I made sure to specify -c opt, which should have produced a release build of the allocator, but I'm new to Bazel, so perhaps I missed something crucial.
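As a quick sanity check that the preload actually takes effect, I scan /proc/self/maps for the injected library at startup. This is only a minimal sketch; the "libtcmalloc" substring is an assumption based on how I named the shared object:

#include <fstream>
#include <iostream>
#include <string>

// Minimal sketch: confirm the LD_PRELOADed allocator is actually mapped
// into the process by scanning /proc/self/maps. The "libtcmalloc" substring
// is an assumption based on how the shared object was named.
int main() {
  std::ifstream maps("/proc/self/maps");
  std::string line;
  bool found = false;
  while (std::getline(maps, line)) {
    if (line.find("libtcmalloc") != std::string::npos) {
      found = true;
      break;
    }
  }
  std::cout << (found ? "tcmalloc is preloaded\n" : "tcmalloc NOT found\n");
  return found ? 0 : 1;
}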
This is the virtual memory consumption over time for a number of allocators. tcmalloc_google is this repo, and tcmalloc is from gperftools:
As you can see, tcmalloc_google_stock's virtual memory consumption is easily double that of the other allocators and goes past the bounds of the graph. The maximum it hits is 145.44 GB, almost 3 times higher than the next largest.
The tuning guide recommends changing the THP settings, but we unfortunately cannot control those settings on customer machines, so I've set that advice aside for now.
For reference, this was run on an Ubuntu 22.04.4 machine with the default settings:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
$ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise [madvise] never
$ cat /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
511
$ cat /proc/sys/vm/overcommit_memory
0
Looking at the existing issues, I came across this post: #260 (comment). I tried exporting GLIBC_TUNABLES=glibc.pthread.rseq=0 before running our workload, but it didn't seem to have much of an effect (about -1.5 GB VSIZE; labeled rseq0 in the graph above).
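To check whether the tunable actually changes tcmalloc's behavior, I query MallocExtension at startup; with glibc's own rseq registration disabled, tcmalloc should be able to register rseq itself and keep its per-CPU caches. A minimal sketch:

#include <cstdio>

#include "tcmalloc/malloc_extension.h"

int main() {
  // With GLIBC_TUNABLES=glibc.pthread.rseq=0, glibc leaves rseq registration
  // to tcmalloc, so the per-CPU caches should report as active here.
  std::printf("per-CPU caches active: %s\n",
              tcmalloc::MallocExtension::PerCpuCachesActive() ? "yes" : "no");
  return 0;
}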
I also tried creating a new thread in our application that calls ProcessBackgroundActions. This makes quite an improvement, but VSIZE is still excessive. (I configured the release rate to 10 MB/s; labeled rr10MBps in the graph above.)
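For reference, this is roughly how I wired it up (a minimal sketch; the 10 MB/s release rate is the one mentioned above):

#include <thread>

#include "tcmalloc/malloc_extension.h"

int main() {
  // Ask tcmalloc to release free memory back to the OS at ~10 MB/s.
  tcmalloc::MallocExtension::SetBackgroundReleaseRate(
      tcmalloc::MallocExtension::BytesPerSecond{10 * 1024 * 1024});

  // ProcessBackgroundActions() runs an endless loop of periodic work,
  // so it gets its own dedicated thread.
  std::thread background(tcmalloc::MallocExtension::ProcessBackgroundActions);
  background.detach();

  // ... run the workload ...
  return 0;
}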
One additional thing I've noticed is that tcmalloc seems to be slightly slower than the allocator provided by gperftools. tcmalloc takes around 4430 s to finish this benchmark, while gperftools tcmalloc takes 4271 s (159 seconds, or 3.6%, less) with considerably lower virtual memory consumption. These numbers were collected on an otherwise idle machine with 24 cores (16P + 8E), with turbo boost disabled, the cores locked to their base frequencies, and the CPU governor set to performance for consistent results.
Virtual memory consumption is a big issue for us as many of our customers configure their systems to have zero swap space and monitor the virtual memory size. If the virtual memory size gets close to the physical memory size, they'll kill the application as it's "swapping." We've tried to inform users that virtual memory consumption is not an indication of swapping, but alas, this practice is very common in our industry, so we must make concessions. This is the main reason why we've been stuck on jemalloc 3.6.0 for so long.
I dumped the stats right at the end of the workload; you can find them here: stats.txt
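For completeness, this is roughly how the dump was produced (a minimal sketch; GetStats() returns tcmalloc's human-readable summary, and the output path is arbitrary):

#include <fstream>

#include "tcmalloc/malloc_extension.h"

// Write tcmalloc's human-readable statistics to a file at the end of
// the workload.
void DumpAllocatorStats(const char* path) {
  std::ofstream out(path);
  out << tcmalloc::MallocExtension::GetStats();
}

In our case, DumpAllocatorStats("stats.txt") is called once the simulation completes.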
Any help that you could provide would be greatly appreciated. Thanks!