
Slow memory leak in RedisChannelLayer.receive_buffer due to lack of cleanup/expiry #212

Closed
@ryanpetrello

Description


I'm observing what looks like a slow memory leak in channels-redis in a project I maintain that uses Daphne (Ansible AWX):

/var/lib/awx/venv/awx/bin/pip3 freeze | egrep "chan|daphne|Twisted"
channels==2.4.0
channels-redis==2.4.2
daphne==2.4.1
Twisted==20.3.0

https://github.com/ansible/awx/blob/devel/awx/main/consumers.py#L114

This looks quite similar to a recently reported (and resolved) memory leak: django/channels#1181

After noticing a slow memory leak in Daphne at high volumes of (large) messages:

[screenshot: Daphne process memory usage graph showing the slow leak]

I decided to add some instrumentation to my routing code with tracemalloc to try to spot the leak:

diff --git a/awx/main/routing.py b/awx/main/routing.py
index 2866d46ed0..71e40c693f 100644
--- a/awx/main/routing.py
+++ b/awx/main/routing.py
@@ -1,3 +1,6 @@
+from datetime import datetime
+import tracemalloc
+
 import redis
 import logging

@@ -10,11 +13,16 @@ from channels.routing import ProtocolTypeRouter, URLRouter
 from . import consumers


+TRACE_INTERVAL = 60 * 5  # take a tracemalloc snapshot every 5 minutes
+
+
 logger = logging.getLogger('awx.main.routing')


 class AWXProtocolTypeRouter(ProtocolTypeRouter):
     def __init__(self, *args, **kwargs):
+        tracemalloc.start()


+class SnapshottedEventConsumer(consumers.EventConsumer):
+
+    last_snapshot = datetime.min
+
+    async def connect(self):
+        if (datetime.now() - SnapshottedEventConsumer.last_snapshot).total_seconds() > TRACE_INTERVAL:
+            snapshot = tracemalloc.take_snapshot()
+            top_stats = snapshot.statistics('lineno')
+            top_stats = [
+                stat for stat in top_stats[:25]
+                if 'importlib._bootstrap_external' not in str(stat)
+            ]
+            for stat in top_stats[:25]:
+                logger.error('[TRACE] ' + str(stat))
+            SnapshottedEventConsumer.last_snapshot = datetime.now()
+        await super().connect()
+
+
 websocket_urlpatterns = [
-    url(r'websocket/$', consumers.EventConsumer),
+    url(r'websocket/$', SnapshottedEventConsumer),

...and what I found was that we were leaking the actual messages somewhere:

2020-07-23 18:20:44,628 ERROR    awx.main.routing [TRACE] /var/lib/awx/venv/awx/lib64/python3.6/site-packages/channels_redis/core.py:782: size=121 MiB, count=201990, average=626 B
2020-07-23 18:20:44,630 ERROR    awx.main.routing [TRACE] /usr/lib64/python3.6/asyncio/base_events.py:295: size=7628 KiB, count=71910, average=109 B
2020-07-23 18:20:44,630 ERROR    awx.main.routing [TRACE] /usr/lib64/python3.6/asyncio/base_events.py:615: size=342 KiB, count=663, average=528 B
2020-07-23 18:20:44,631 ERROR    awx.main.routing [TRACE] /var/lib/awx/venv/awx/lib64/python3.6/site-packages/redis/client.py:91: size=191 KiB, count=1791, average=109 B
2020-07-23 18:20:44,632 ERROR    awx.main.routing [TRACE] /usr/lib64/python3.6/tracemalloc.py:65: size=142 KiB, count=2020, average=72 B

I spent some time reading over https://github.com/django/channels_redis/blob/master/channels_redis/core.py and was immediately suspicious of self.receive_buffer. In my reproduction, here's what I'm seeing:

  1. I can open a new browser tab that establishes a new websocket connection.
  2. In the background, I'm generating very high volumes of data (in our case, the data represents the output of a very noisy Ansible playbook) that is going over the websocket and being printed to the user in their browser.
  3. Part of the way through this process, I'm closing the browser, and disconnecting.
  4. In the meantime, however, self.receive_buffer[channel] for the prior browser client keeps growing, and those messages are never freed. I can open subsequent browser sessions, establish new websocket connections, and, using an interactive debugger (see the sketch after this list), see that the channel layer's receive_buffer is holding on to stale messages for old channels that will never be delivered (or freed).
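
For concreteness, the inspection in step 4 looks roughly like this. It's just a sketch, and it assumes (based on my reading of core.py in 2.4.2) that receive_buffer maps channel names to per-channel asyncio.Queue objects and that the consumer exposes the layer as self.channel_layer:

# From a breakpoint inside a running consumer (e.g. EventConsumer.connect),
# dump how many undelivered messages each channel is still holding.
# Channels belonging to long-gone browser sessions keep showing up here,
# and their counts only ever grow.
for channel, buffer in self.channel_layer.receive_buffer.items():
    print(channel, buffer.qsize())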

I think the root of this issue is that the self.receive_buffer data structure doesn't have any form of internal cleanup or expiry, which this comment in core.py seems to acknowledge:

# TODO: message expiry?

In practice, having only a send/receive interface may make this a hard (and expensive) problem to solve, since it means tracking and managing in-memory expirations. What stands out to me about the Redis channel layer is that it does expire messages in Redis, but once messages are loaded into memory, there doesn't seem to be any expiration logic for the receive_buffer data structure (and I'm able to see it grow in an unbounded way fairly easily in my reproduction).
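
To sketch what I mean by in-memory expiry (purely illustrative, not tied to the actual channels_redis internals; the class and method names here are hypothetical), something along these lines would keep a timestamp alongside each buffered message and prune both stale messages and empty channels whenever the buffer is touched:

import time
from collections import defaultdict, deque


class ExpiringReceiveBuffer:
    """Illustration only: a per-channel buffer that drops messages older
    than `expiry` seconds and forgets channels whose buffers are empty."""

    def __init__(self, expiry=60):
        self.expiry = expiry
        self._buffers = defaultdict(deque)

    def put(self, channel, message):
        # Record when the message was buffered so it can be expired later.
        self._buffers[channel].append((time.monotonic(), message))
        self._prune()

    def get(self, channel):
        self._prune()
        buffer = self._buffers.get(channel)
        return buffer.popleft()[1] if buffer else None

    def _prune(self):
        cutoff = time.monotonic() - self.expiry
        for channel in list(self._buffers):
            buffer = self._buffers[channel]
            # Drop anything buffered before the cutoff...
            while buffer and buffer[0][0] < cutoff:
                buffer.popleft()
            # ...and forget channels that no longer hold any messages.
            if not buffer:
                del self._buffers[channel]

Even something this simple would keep the buffer from accumulating messages for channels whose consumers disconnected long ago.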
