Memory leak and high CPU usage caused by frequent Data Channel restarts #357

Open
sirzooro opened this issue Nov 29, 2024 · 5 comments

Comments

@sirzooro
Contributor

sirzooro commented Nov 29, 2024

Your environment.

  • Version: sctp v1.8.16, webrtc v3.2.42
  • Browser: Chrome

What did you do?

I have two apps: a server written in Go and a JS client running in Chrome. These apps open a WebRTC session with a DataChannel and no RTP streams. The client then requests a stream of data from the server, which is sent via the data channel. When data stops flowing (e.g. due to a network issue), the client closes the data channel and requests a new one from the server (the WebRTC session is not restarted when this happens). In my case this data channel restart was triggered roughly every 10-20 minutes.

What did you expect?

The server app can run 24/7 for a long time without issues.

What happened?

Memory and CPU usage grow, and the server app has to be restarted periodically as a workaround. Before a restart, the server app had lots of memory allocated from pion/sctp/Stream.packetize. It looks like data that the client had not received before it closed the data channel got stuck in some pion sctp queue and was never dropped after its data channel was closed. It stays there until the WebRTC session closes or the whole app is restarted.

Additionally, I noticed that new data channels are added to the pion/webrtc/SCTPTransport.dataChannels list but never removed from it. This also caused a small memory leak in my case, although a much smaller one than the leak described above.
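
To illustrate the kind of cleanup meant here, a minimal sketch of a registry that forgets a channel when it closes. The `channelRegistry` type and its methods are hypothetical stand-ins written for this example, not pion's actual `SCTPTransport` code:

```go
package main

import (
	"fmt"
	"sync"
)

// channelRegistry is a hypothetical stand-in for a transport's list of data
// channels. It is keyed by stream ID so that closed channels can be removed.
type channelRegistry struct {
	mu       sync.Mutex
	channels map[uint16]string // stream ID -> label
}

func newChannelRegistry() *channelRegistry {
	return &channelRegistry{channels: make(map[uint16]string)}
}

// add registers a channel when it is opened.
func (r *channelRegistry) add(id uint16, label string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.channels[id] = label
}

// remove drops the entry when the channel is closed, so the registry does
// not grow with every restart.
func (r *channelRegistry) remove(id uint16) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.channels, id)
}

// size reports how many channels are currently tracked.
func (r *channelRegistry) size() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.channels)
}

func main() {
	reg := newChannelRegistry()
	// Simulate repeated data channel restarts on a single session.
	for id := uint16(0); id < 100; id++ {
		reg.add(id, "data")
		reg.remove(id) // would be called from the channel's close path
	}
	fmt.Println("tracked channels:", reg.size())
}
```

Without the `remove` call, the map would keep one entry per restart for the lifetime of the session, which matches the slow growth described above.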

CC @enobufs @edaniels

Edit: I tried to reproduce it and was sometimes able to get the following error:
sctp ERROR: 2024/11/29 23:01:24 [0xc001132000] stream 99 not found)
It is logged from here: https://github.com/pion/pion/blob/94171946f00b6acd784fa5c520acdc96aaea5a8b/sctp/association.go#L2259

Probably these chunks should be marked as abandoned?
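
A rough sketch of what "marking as abandoned" could look like for the bookkeeping side only. The `chunk` and `pendingQueue` types are hypothetical and heavily simplified; this is not pion's association code, and it leaves out the protocol side (telling the peer to skip the abandoned data):

```go
package main

import "fmt"

// chunk is a hypothetical, simplified DATA chunk.
type chunk struct {
	streamID  uint16
	payload   []byte
	abandoned bool
}

// pendingQueue is a hypothetical stand-in for an association's outbound queue.
type pendingQueue struct {
	chunks []*chunk
}

// push queues a payload for the given stream.
func (q *pendingQueue) push(streamID uint16, payload []byte) {
	q.chunks = append(q.chunks, &chunk{streamID: streamID, payload: payload})
}

// abandonStream marks every queued chunk of the given stream as abandoned
// and releases its payload. It returns the number of chunks affected.
func (q *pendingQueue) abandonStream(streamID uint16) int {
	n := 0
	for _, c := range q.chunks {
		if c.streamID == streamID && !c.abandoned {
			c.abandoned = true
			c.payload = nil // let the garbage collector reclaim the data
			n++
		}
	}
	return n
}

// bytesHeld reports how much payload memory the queue still references.
func (q *pendingQueue) bytesHeld() int {
	total := 0
	for _, c := range q.chunks {
		total += len(c.payload)
	}
	return total
}

func main() {
	q := &pendingQueue{}
	q.push(99, make([]byte, 1024))
	q.push(99, make([]byte, 1024))
	q.push(100, make([]byte, 512))

	n := q.abandonStream(99) // stream 99 was closed by the client
	fmt.Printf("abandoned %d chunks, %d bytes still held\n", n, q.bytesHeld())
}
```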

@sirzooro
Contributor Author

sirzooro commented Dec 3, 2024

@jerry-tao it looks like we need something like #239. Could you revisit that issue? If we cannot remove chunks when a stream closes, maybe we need a timeout that checks the queues for them? Also keep in mind that when the CPU is busy, some chunks may stay in the pending queue longer than usual, so some extra protection against this may be needed too.
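
One possible shape for such a periodic check, as a sketch under stated assumptions: `queuedChunk`, `sweep` and the `graceSweeps` parameter are hypothetical names for this example, and the grace period is counted in sweep passes instead of wall-clock time to keep the example deterministic. In a real implementation the sweep would presumably be driven by a timer.

```go
package main

import "fmt"

// queuedChunk is a hypothetical, simplified entry in a pending queue.
type queuedChunk struct {
	streamID uint16
	sweeps   int // sweep passes seen since the chunk's stream was closed
}

// sweep is meant to be called periodically (for example from a time.Ticker).
// Chunks of open streams are never touched, so a busy CPU that delays them
// cannot cause data loss. Chunks of closed streams get graceSweeps passes to
// be acknowledged normally before they are dropped.
func sweep(chunks []queuedChunk, closed map[uint16]bool, graceSweeps int) []queuedChunk {
	kept := make([]queuedChunk, 0, len(chunks))
	for _, c := range chunks {
		if closed[c.streamID] {
			c.sweeps++
			if c.sweeps > graceSweeps {
				continue // stream is gone and the grace period has expired
			}
		}
		kept = append(kept, c)
	}
	return kept
}

func main() {
	chunks := []queuedChunk{{streamID: 1}, {streamID: 1}, {streamID: 2}}
	closed := map[uint16]bool{1: true} // stream 1 was closed, stream 2 is open

	for pass := 1; pass <= 3; pass++ {
		chunks = sweep(chunks, closed, 2)
		fmt.Printf("pass %d: %d chunks queued\n", pass, len(chunks))
	}
}
```

Restricting the sweep to closed streams is a design choice of this sketch: slow chunks on streams that are still open are left alone however long they wait.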

@sirzooro
Contributor Author

sirzooro commented Dec 4, 2024

I ran some tests trying to reproduce this and found that Chrome sends SACKs for chunks enqueued after the stream was closed, and pion then removes them from the pending queue. So this may be caused by retransmissions on the pion side or by delayed SACKs from Chrome. This needs more testing.

@jerry-tao
Member

It seems @edaniels and @enobufs are planning a V2 in #314; you could attach this to it.
Could you try the #239 patch to see if it solves your problem?

@sirzooro
Contributor Author

sirzooro commented Dec 5, 2024

Hi @jerry-tao, I have tried your patch. Now memory is reclaimed much faster after a stream is closed (I also lowered the max RTO value to a few seconds). So my problem was caused by retransmissions.

I saw that your patch was reverted because it caused issues with some tests. Could you add it again, but this time together with a configuration option, so that it is disabled by default? By doing it this way the tests would not break, and people like me who need this feature could enable it at runtime via webrtc.SettingsEngine.
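
A minimal sketch of the opt-in idea. The `Config` struct and its `DropPendingOnStreamClose` field are hypothetical names made up for this example; they are not existing pion/sctp or webrtc.SettingsEngine options:

```go
package main

import "fmt"

// Config is hypothetical. The zero value keeps today's behavior, so existing
// users and tests are unaffected unless they opt in.
type Config struct {
	DropPendingOnStreamClose bool
}

// association is a simplified stand-in that only tracks pending payloads.
type association struct {
	cfg     Config
	pending map[uint16][][]byte // stream ID -> queued payloads
}

func newAssociation(cfg Config) *association {
	return &association{cfg: cfg, pending: make(map[uint16][][]byte)}
}

// send queues a payload on the given stream.
func (a *association) send(streamID uint16, payload []byte) {
	a.pending[streamID] = append(a.pending[streamID], payload)
}

// closeStream discards the stream's queued data only when the option is set.
func (a *association) closeStream(streamID uint16) {
	if a.cfg.DropPendingOnStreamClose {
		delete(a.pending, streamID)
	}
}

// pendingCount returns the number of payloads still queued.
func (a *association) pendingCount() int {
	n := 0
	for _, chunks := range a.pending {
		n += len(chunks)
	}
	return n
}

func main() {
	for _, cfg := range []Config{{}, {DropPendingOnStreamClose: true}} {
		a := newAssociation(cfg)
		a.send(1, []byte("unacked"))
		a.closeStream(1)
		fmt.Printf("drop enabled=%v pending=%d\n",
			cfg.DropPendingOnStreamClose, a.pendingCount())
	}
}
```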

@jerry-tao
Member

The approach discussed in #314, blocking I/O, could resolve this issue in a better way, if I’ve understood it correctly.
