bug: libp2p-kad-dht goroutine leak crash #243
Comments
Looks like there is an unattended context at https://github.com/libp2p/go-libp2p-kad-dht/blob/master/providers/providers_manager.go#L263 The problem is probably in the subpub package (multirpc).
The library seems to generally have no default limit, and we've encountered gateways in some environments spiking to as many as 200K goroutines, so this seems like a likely culprit. And even if this is not the cause, setting a limit is a good idea anyway. As a start, we're using four times MaxDHTpeers as the discovery limit for both FindPeers and Advertise. It seems high enough to keep things working, but low enough that we can't run out of memory with hundreds of thousands of goroutines. Updates vocdoni#243.
When the connection to a peer is lost, broadcastHandler errors in its SendMessage call, and the entire goroutine stops. No goroutine continues receiving on the write channel, and sooner rather than later, sends to the write channel start blocking.

This starts causing deadlocks further up in IPFSsync. SubPub.Subscribe and SubPub.PeerStreamWrite can now block forever, and further up the chain in IPFSsync, that can mean some goroutines hold onto mutexes forever.

On one hand, this chain of events can hang IPFSsync, stopping it from doing anything useful until a restart. On the other hand, it causes goroutine leaks. When more calls to IPFSsync.Handle come through, using new goroutines via the router, those try to grab the deadlocked mutexes and hang forever.

First, fix the root cause: peerSub now has a "closed" channel, which gets closed by peersManager when the peer is dropped. Its goroutines, both for reading and writing messages, keep running until that happens.

Second, make the symptom of the deadlock less severe: prevent blocking on channel sends forever. Any send on the "write" channel now stops on "closed". And the send on BroadcastWriter, which could also block forever, now has a fallback timeout of five minutes.

Updates vocdoni#243. Perhaps not a total fix, as there might be other leaks.
The previous commits of mine did fix deadlocks and potential goroutine leaks, but those were minor goroutine leaks and unrelated to the much larger goroutine leak that litxi is running into. I've done two hours of digging into this, and I've got good news and bad news.

The good news is that I went over the code step by step multiple times, and I could not spot a mistake on our part. It's worth noting that the subpub code is hard to follow though, so I could have missed something.

The bad news is that I don't see an easy fix for what I think is happening here. We essentially get tens of thousands of goroutines stuck handling "GetProviders" libp2p messages in go-libp2p-kad-dht. And it makes sense; its ProviderManager handles all of those requests serially, without even supporting parallel "gets". So if a node discovers a thousand other nodes at once, a thousand such messages will go in each direction, and the queue will get pretty long for some time. Handling each request involves some input/output as well, which might block. If the messages come in faster than the node can keep up with, they will pile up and use more and more memory. Each pending request uses one goroutine as well.

I'm not sure why so many messages are coming in for just one of the gateways. Three ideas come to mind:
I tend to think this has something to do with litxi's network setup, because as far as I can tell no other gateway has this issue with go-libp2p-kad-dht. Ideally, we'd fix the root cause of all those peers, which would presumably be a devops/systems fix. That devops fix aside, I think our best bet for mitigating the goroutine leak in the source code would be upstream's libp2p/go-libp2p-kad-dht#675, which already has a PR. It could be a few weeks until that's merged, but we could always try using that branch as a temporary measure. Another potential fix would be to switch to gossipsub, since https://github.com/libp2p/go-libp2p-pubsub does not import the kad-dht library, which has this potential memory leak :)
I was finally able to catch gw5 while it was using 30k goroutines. It, too, was filled with kad-dht goroutines like:
So the goroutine leak seems to be exactly the same as litxi. We also can't put the blame on litxi for who is causing all these sudden requests - gw5 was getting spikes of 10k goroutines a few days ago, during which time litxi was down. Here are my two current best guesses as to the root causes of the spike in requests:
I'm not sure how likely the second one is, though. Do we have anything else that uses libp2p-kad-dht besides vocdoni-node gateways themselves?
Nothing else uses libp2p, so it should be 1
It could also be something in between, like a frontend or client sending 100 requests to a gateway, and handling those 100 requests results in 100+ libp2p requests done by the gateway.
Each file a client tries to get from the gateway (header images, voting process metadata, etc.) goes to the IPFS daemon, handled by the Retrieve method here: https://github.com/vocdoni/vocdoni-node/blob/master/data/ipfs.go#L253 There is an LRU cache (I introduced it recently, about a month ago) that holds a maximum of 128 files; if a file is in the cache, its data is served directly from there without querying the IPFS daemon: https://github.com/vocdoni/vocdoni-node/blob/master/data/ipfs.go#L44 Apart from that (files fetched by clients), libp2p-kad-dht has no direct interaction with client requests.
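For readers unfamiliar with the cache-in-front-of-the-daemon idea, here is a rough stdlib-only sketch of it. This is not the code in data/ipfs.go: `fileCache`, `retrieve`, the `fetch` callback and the example path are all hypothetical, the sketch is not safe for concurrent use, and the real implementation may well rely on a ready-made LRU package instead.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	path string
	data []byte
}

// fileCache is a hypothetical, non-thread-safe LRU cache keyed by file path.
type fileCache struct {
	max   int
	order *list.List // front = most recently used
	items map[string]*list.Element
}

func newFileCache(max int) *fileCache {
	return &fileCache{max: max, order: list.New(), items: make(map[string]*list.Element)}
}

// get returns the cached data and marks the entry as recently used.
func (c *fileCache) get(path string) ([]byte, bool) {
	el, ok := c.items[path]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).data, true
}

// add stores data and evicts the least recently used entry when over capacity.
func (c *fileCache) add(path string, data []byte) {
	if el, ok := c.items[path]; ok {
		el.Value.(*entry).data = data
		c.order.MoveToFront(el)
		return
	}
	c.items[path] = c.order.PushFront(&entry{path: path, data: data})
	if c.order.Len() > c.max {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).path)
	}
}

// retrieve consults the cache first and only calls fetch on a miss.
func (c *fileCache) retrieve(path string, fetch func(string) ([]byte, error)) ([]byte, error) {
	if data, ok := c.get(path); ok {
		return data, nil
	}
	data, err := fetch(path)
	if err != nil {
		return nil, err
	}
	c.add(path, data)
	return data, nil
}

func main() {
	fetches := 0
	// fetch stands in for the round trip to the IPFS daemon.
	fetch := func(path string) ([]byte, error) {
		fetches++
		return []byte("contents of " + path), nil
	}
	cache := newFileCache(128)
	cache.retrieve("/ipfs/example", fetch)
	cache.retrieve("/ipfs/example", fetch) // served from the cache
	fmt.Println(fetches)
}
```

Repeated requests for the same popular file never reach the daemon, which is why client file fetches are an unlikely source of sustained DHT traffic.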
On the other hand, we have the ipfsSync daemon, which uses subpub. It is constantly receiving protocol messages (HELLO and UPDATE) from other cluster peers. Currently gw5 has 7 cluster peers, so it's receiving messages almost every second. However, I don't see why that should end up in a memory leak.
That could explain it, actually. Right now, libp2p-kad-dht only handles one request at a time, be it read-only or read-write. If handling one of those requests blocks for a little while, then all other requests pile up, "leaking" their goroutines until they can be processed. Handling one request could block if an operation on the underlying database blocks, for example. This could happen because the database is currently doing periodic work like flushing or compressing, or because another goroutine is currently writing the same keys, or because the disk/IO are currently being used heavily by another process.
This pulls in libp2p/go-libp2p-kad-dht#730, which lives on a branch shortly after 0.12.2. Essentially, this handles priority for requests a bit better, and drops unimportant requests if they come in too fast. This should prevent kad-dht from using tons of memory. Updates vocdoni#243.
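The general idea of dropping unimportant requests when they arrive too fast can be sketched with a bounded queue and a non-blocking send. This is only an illustration of the technique, not the implementation in the upstream PR: `request`, `enqueue` and the `important` flag are hypothetical names, and the real priority rules are not shown here.

```go
package main

import "fmt"

// request is a hypothetical message waiting for the single handler goroutine.
type request struct {
	id        int
	important bool
}

// enqueue hands r to the handler's queue. Important requests wait for room;
// unimportant ones are dropped when the queue is full, so a slow handler
// cannot accumulate an unbounded number of parked goroutines.
func enqueue(queue chan<- request, r request) bool {
	if r.important {
		queue <- r
		return true
	}
	select {
	case queue <- r:
		return true
	default:
		return false // queue full: drop instead of blocking
	}
}

func main() {
	queue := make(chan request, 4)
	accepted, dropped := 0, 0

	// Nothing drains the queue here, simulating a handler that is blocked
	// on disk I/O while a burst of requests comes in.
	for i := 0; i < 10; i++ {
		if enqueue(queue, request{id: i}) {
			accepted++
		} else {
			dropped++
		}
	}
	fmt.Println(accepted, dropped)
}
```

Memory use is then bounded by the queue capacity rather than by how fast peers can send requests, at the cost of some unanswered low-priority queries.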
Describe the bug
One of the production gateways dies from time to time due to a goroutine leak, reaching the memory limit.
To Reproduce (please complete the following information)
Compressed logs:
log.gz