
Slow memory leak in RedisChannelLayer.receive_buffer due to lack of cleanup/expiry #212

Closed
@ryanpetrello

Description


I'm observing what looks like a slow memory leak in channels-redis in a project I maintain that uses Daphne (Ansible AWX):

/var/lib/awx/venv/awx/bin/pip3 freeze | egrep "chan|daphne|Twisted"
channels==2.4.0
channels-redis==2.4.2
daphne==2.4.1
Twisted==20.3.0

https://github.com/ansible/awx/blob/devel/awx/main/consumers.py#L114

This looks quite similar to a recently reported (and resolved) memory leak: django/channels#1181

After noticing a slow memory leak in Daphne at high volumes of (large) messages:

[screenshot: Daphne process memory usage graph showing the slow leak]

I decided to add some instrumentation to my routing code with tracemalloc to try to spot the leak:

diff --git a/awx/main/routing.py b/awx/main/routing.py
index 2866d46ed0..71e40c693f 100644
--- a/awx/main/routing.py
+++ b/awx/main/routing.py
@@ -1,3 +1,6 @@
+from datetime import datetime
+import tracemalloc
+
 import redis
 import logging

@@ -10,11 +13,16 @@ from channels.routing import ProtocolTypeRouter, URLRouter
 from . import consumers


+TRACE_INTERVAL = 60 * 5  # take a tracemalloc snapshot every 5 minutes
+
+
 logger = logging.getLogger('awx.main.routing')


 class AWXProtocolTypeRouter(ProtocolTypeRouter):
     def __init__(self, *args, **kwargs):
+        tracemalloc.start()


+class SnapshottedEventConsumer(consumers.EventConsumer):
+
+    last_snapshot = datetime.min
+
+    async def connect(self):
+        if (datetime.now() - SnapshottedEventConsumer.last_snapshot).total_seconds() > TRACE_INTERVAL:
+            snapshot = tracemalloc.take_snapshot()
+            top_stats = snapshot.statistics('lineno')
+            top_stats = [
+                stat for stat in top_stats[:25]
+                if 'importlib._bootstrap_external' not in str(stat)
+            ]
+            for stat in top_stats[:25]:
+                logger.error('[TRACE] ' + str(stat))
+            SnapshottedEventConsumer.last_snapshot = datetime.now()
+        await super().connect()
+
+
 websocket_urlpatterns = [
-    url(r'websocket/$', consumers.EventConsumer),
+    url(r'websocket/$', SnapshottedEventConsumer),

...and what I found was that we were leaking the actual messages somewhere:

2020-07-23 18:20:44,628 ERROR    awx.main.routing [TRACE] /var/lib/awx/venv/awx/lib64/python3.6/site-packages/channels_redis/core.py:782: size=121 MiB, count=201990, average=626 B
2020-07-23 18:20:44,630 ERROR    awx.main.routing [TRACE] /usr/lib64/python3.6/asyncio/base_events.py:295: size=7628 KiB, count=71910, average=109 B
2020-07-23 18:20:44,630 ERROR    awx.main.routing [TRACE] /usr/lib64/python3.6/asyncio/base_events.py:615: size=342 KiB, count=663, average=528 B
2020-07-23 18:20:44,631 ERROR    awx.main.routing [TRACE] /var/lib/awx/venv/awx/lib64/python3.6/site-packages/redis/client.py:91: size=191 KiB, count=1791, average=109 B
2020-07-23 18:20:44,632 ERROR    awx.main.routing [TRACE] /usr/lib64/python3.6/tracemalloc.py:65: size=142 KiB, count=2020, average=72 B

I spent some time reading over https://github.com/django/channels_redis/blob/master/channels_redis/core.py and was immediately suspicious of self.receive_buffer. In my reproduction, here's what I'm seeing:

  1. I can open a new browser tab that establishes a new websocket connection.
  2. In the background, I'm generating very high volumes of data (in our case, the data represents the output of a very noisy Ansible playbook) that is going over the websocket and being printed to the user in their browser.
  3. Part of the way through this process, I'm closing the browser, and disconnecting.
  4. In the meantime, however, self.receive_buffer[channel] for the prior browser client keeps growing, and those messages are never freed. I can open subsequent browser sessions, establish new websocket connections, and, using an interactive debugger (see the sketch after this list), see that the channel layer's receive_buffer is holding on to stale messages for old channels that will never be delivered (or freed).
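
For concreteness, the inspection in step 4 looks roughly like this. It's just a sketch, and it assumes (based on my reading of core.py in 2.4.2) that receive_buffer maps channel names to per-channel asyncio.Queue objects and that the consumer exposes the layer as self.channel_layer:

# From a breakpoint inside a running consumer (e.g. EventConsumer.connect),
# dump how many undelivered messages each channel is still holding.
# Channels belonging to long-gone browser sessions keep showing up here,
# and their counts only ever grow.
for channel, buffer in self.channel_layer.receive_buffer.items():
    print(channel, buffer.qsize())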

I think the root of this issue is that the self.receive_buffer data structure doesn't have any form of internal cleanup or expiry, which this comment in core.py seems to acknowledge:

# TODO: message expiry?

In practice, having only a send/receive interface may make this a hard (and expensive) problem to solve, since it means tracking and managing in-memory expirations. What stands out to me about the Redis channel layer is that it does expire messages in Redis, but once messages are loaded into memory, there doesn't seem to be any expiration logic for the receive_buffer data structure (and I'm able to see it grow in an unbounded way fairly easily in my reproduction).
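
To sketch what I mean by in-memory expiry (purely illustrative, not tied to the actual channels_redis internals; the class and method names here are hypothetical), something along these lines would keep a timestamp alongside each buffered message and prune both stale messages and empty channels whenever the buffer is touched:

import time
from collections import defaultdict, deque


class ExpiringReceiveBuffer:
    """Illustration only: a per-channel buffer that drops messages older
    than `expiry` seconds and forgets channels whose buffers are empty."""

    def __init__(self, expiry=60):
        self.expiry = expiry
        self._buffers = defaultdict(deque)

    def put(self, channel, message):
        # Record when the message was buffered so it can be expired later.
        self._buffers[channel].append((time.monotonic(), message))
        self._prune()

    def get(self, channel):
        self._prune()
        buffer = self._buffers.get(channel)
        return buffer.popleft()[1] if buffer else None

    def _prune(self):
        cutoff = time.monotonic() - self.expiry
        for channel in list(self._buffers):
            buffer = self._buffers[channel]
            # Drop anything buffered before the cutoff...
            while buffer and buffer[0][0] < cutoff:
                buffer.popleft()
            # ...and forget channels that no longer hold any messages.
            if not buffer:
                del self._buffers[channel]

Even something this simple would keep the buffer from accumulating messages for channels whose consumers disconnected long ago.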
