Description
I'm observing what looks like a slow memory leak in channels-redis
in a project I maintain that uses Daphne (Ansible AWX):
/var/lib/awx/venv/awx/bin/pip3 freeze | egrep "chan|daphne|Twisted"
channels==2.4.0
channels-redis==2.4.2
daphne==2.4.1
Twisted==20.3.0
https://github.com/ansible/awx/blob/devel/awx/main/consumers.py#L114
This looks quite similar to a recently reported (and resolved) memory leak: django/channels#1181
After noticing a slow memory leak in Daphne at high volumes of (large) messages, I decided to add some instrumentation to my routing code with tracemalloc to try to spot the leak:
diff --git a/awx/main/routing.py b/awx/main/routing.py
index 2866d46ed0..71e40c693f 100644
--- a/awx/main/routing.py
+++ b/awx/main/routing.py
@@ -1,3 +1,6 @@
+from datetime import datetime
+import tracemalloc
+
 import redis
 import logging
@@ -10,11 +13,16 @@ from channels.routing import ProtocolTypeRouter, URLRouter
 from . import consumers
+TRACE_INTERVAL = 60 * 5  # take a tracemalloc snapshot every 5 minutes
+
+
 logger = logging.getLogger('awx.main.routing')
 class AWXProtocolTypeRouter(ProtocolTypeRouter):
     def __init__(self, *args, **kwargs):
+        tracemalloc.start()
+class SnapshottedEventConsumer(consumers.EventConsumer):
+
+    last_snapshot = datetime.min
+
+    async def connect(self):
+        if (datetime.now() - SnapshottedEventConsumer.last_snapshot).total_seconds() > TRACE_INTERVAL:
+            snapshot = tracemalloc.take_snapshot()
+            top_stats = snapshot.statistics('lineno')
+            top_stats = [
+                stat for stat in top_stats[:25]
+                if 'importlib._bootstrap_external' not in str(stat)
+            ]
+            for stat in top_stats[:25]:
+                logger.error('[TRACE] ' + str(stat))
+            SnapshottedEventConsumer.last_snapshot = datetime.now()
+        await super().connect()
+
+
 websocket_urlpatterns = [
-    url(r'websocket/$', consumers.EventConsumer),
+    url(r'websocket/$', SnapshottedEventConsumer),
...and what I found was that we were leaking the actual messages somewhere:
2020-07-23 18:20:44,628 ERROR awx.main.routing [TRACE] /var/lib/awx/venv/awx/lib64/python3.6/site-packages/channels_redis/core.py:782: size=121 MiB, count=201990, average=626 B
2020-07-23 18:20:44,630 ERROR awx.main.routing [TRACE] /usr/lib64/python3.6/asyncio/base_events.py:295: size=7628 KiB, count=71910, average=109 B
2020-07-23 18:20:44,630 ERROR awx.main.routing [TRACE] /usr/lib64/python3.6/asyncio/base_events.py:615: size=342 KiB, count=663, average=528 B
2020-07-23 18:20:44,631 ERROR awx.main.routing [TRACE] /var/lib/awx/venv/awx/lib64/python3.6/site-packages/redis/client.py:91: size=191 KiB, count=1791, average=109 B
2020-07-23 18:20:44,632 ERROR awx.main.routing [TRACE] /usr/lib64/python3.6/tracemalloc.py:65: size=142 KiB, count=2020, average=72 B
I spent some time reading over https://github.com/django/channels_redis/blob/master/channels_redis/core.py and was immediately suspicious of self.receive_buffer. In my reproduction, here's what I'm seeing:
- I can open a new browser tab that establishes a new websocket connection.
- In the background, I'm generating very high volumes of data (in our case, the data represents the output of a very noisy Ansible playbook) that is going over the websocket and being printed to the user in their browser.
- Part of the way through this process, I'm closing the browser, and disconnecting.
- In the meantime, however, self.receive_buffer[channel] for the prior browser client is still growing, and these messages never end up being freed. I can open subsequent browser sessions, establish new websocket connections, and, using an interactive debugger, I can see that the channel layer's receive_buffer is keeping around stale messages for old channels which will never be delivered (or freed) (see the inspection sketch below).
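
For reference, here's roughly how I've been inspecting the buffer from a Django shell while reproducing this. It's only a sketch that assumes channels-redis 2.4.x, where receive_buffer appears to be a per-layer dict mapping channel names to asyncio queues; the exact attribute and types may differ in other versions:

# Sketch of the inspection described above (assumes channels-redis 2.4.x,
# where RedisChannelLayer.receive_buffer maps channel name -> asyncio queue).
from channels.layers import get_channel_layer

layer = get_channel_layer()

# One entry per channel name that has had messages buffered in this process.
# Channels belonging to long-gone browser sessions keep showing up here, and
# their queue sizes keep growing instead of draining.
for channel, queue in layer.receive_buffer.items():
    print(channel, queue.qsize())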
I think the root of this issue is that the self.receive_buffer data structure doesn't have any form of internal cleanup or expiry (which is supported by this code comment):
channels_redis/channels_redis/core.py, line 535 at 90129a6
In practice, having only a send / receive interface might make this a hard/expensive problem to solve (tracking and managing in-memory expirations). One thing that stands out to me about the Redis channel layer is that it does do expiration in Redis, but once messages are loaded into memory, there doesn't seem to be any sort of expiration logic for the receive_buffer data structure (and I'm able to see it grow in an unbounded way fairly easily in my reproduction).
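
To make "in-memory expiration" a bit more concrete, here's a rough sketch of the kind of cleanup I have in mind. None of this exists in channels-redis today, and the names (ExpiringReceiveBuffer, BUFFER_EXPIRY, prune) are purely hypothetical:

import time

BUFFER_EXPIRY = 60  # seconds; hypothetical, analogous to the Redis-side expiry option

class ExpiringReceiveBuffer:
    # Hypothetical stand-in for the per-channel in-memory buffer.

    def __init__(self):
        self.queues = {}     # channel name -> list of buffered messages
        self.last_read = {}  # channel name -> monotonic time of the last read

    def put(self, channel, message):
        self.queues.setdefault(channel, []).append(message)
        self.last_read.setdefault(channel, time.monotonic())

    def get(self, channel):
        self.last_read[channel] = time.monotonic()
        queue = self.queues.get(channel)
        return queue.pop(0) if queue else None

    def prune(self):
        # Drop buffered messages for channels nobody has read recently, so a
        # disconnected client can't pin its messages in process memory forever.
        now = time.monotonic()
        for channel in list(self.queues):
            if now - self.last_read.get(channel, now) > BUFFER_EXPIRY:
                self.queues.pop(channel, None)
                self.last_read.pop(channel, None)

Something along these lines, pruned periodically or on each receive, would at least bound how long undeliverable messages can live in the process.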