
pending dial queue never empties, causing node to not accept new connections #3289

@marcus-pousette

  • Version:
    libp2p: 2.10.0

  • Platform:

  • Subsystem:
    ReconnectQueue

Severity:

Critical

Description:

I have a fairly simple test in which I dial a node, wait for the dialing node to confirm that both sides share the same protocol (handler), close the connection at some point, and then repeat the whole cycle.

I can do this twice before my remote node stops accepting new incoming dials. Logging its stats, I see:

connections total=0 inbound=0 outbound=0 | dialQueue pending=2

(it stays like this for more than 24 hours)

Steps to reproduce the error:

https://github.com/libp2p/js-libp2p/blob/main/packages/libp2p/src/connection-manager/reconnect-queue.ts

this.queue.add(async (options) => {
  await pRetry(async (attempt) => {
    if (!this.started) {
      return
    }

    try {
      await this.connectionManager.openConnection(peerId, {
        signal: options?.signal
      })
    } catch (err) {
      this.log('reconnecting to %p attempt %d of %d failed - %e', peerId, attempt, this.retries, err)
      throw err
    }
  }, {
    signal: options?.signal,
    retries: this.retries,
    factor: this.backoffFactor,
    minTimeout: this.retryInterval
  })
}, {
  peerId
})

I hot-patched this code to log right before this.connectionManager.openConnection and again in a finally block. The call seems to be stuck forever for me: I never see the log from the finally block.

I wonder whether a timeout signal should be passed to this.connectionManager.openConnection.
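A minimal sketch of the timeout idea, not a patch against reconnect-queue.ts. The names `hangingDial`, `withDialTimeout` and `dialOnce` are hypothetical, and `hangingDial` is only a stand-in for a dial that never settles on its own. It assumes the real dial rejects when its signal aborts, which is part of what this issue is questioning.

```typescript
type Dial = (signal: AbortSignal) => Promise<void>

// Hypothetical stand-in for a dial that never completes by itself.
// It only rejects once the signal it was given is aborted.
const hangingDial: Dial = async (signal) => {
  await new Promise<void>((_resolve, reject) => {
    if (signal.aborted) {
      reject(signal.reason)
      return
    }
    signal.addEventListener('abort', () => { reject(signal.reason) }, { once: true })
  })
}

// Derive a signal that aborts when either the parent signal (e.g. the
// queue's signal) aborts or the per-attempt timeout elapses.
function withDialTimeout (timeoutMs: number, parent?: AbortSignal): { signal: AbortSignal, clear: () => void } {
  const controller = new AbortController()
  const timer = setTimeout(() => { controller.abort(new Error('dial timed out')) }, timeoutMs)
  const onParentAbort = (): void => { controller.abort(parent?.reason) }

  if (parent?.aborted === true) {
    onParentAbort()
  } else {
    parent?.addEventListener('abort', onParentAbort, { once: true })
  }

  return {
    signal: controller.signal,
    // Release the timer and listener once the attempt has settled.
    clear: () => {
      clearTimeout(timer)
      parent?.removeEventListener('abort', onParentAbort)
    }
  }
}

// Run one dial attempt and report how it ended.
async function dialOnce (dial: Dial, timeoutMs: number, parent?: AbortSignal): Promise<string> {
  const { signal, clear } = withDialTimeout(timeoutMs, parent)
  try {
    await dial(signal)
    return 'connected'
  } catch (err) {
    return (err as Error).message
  } finally {
    clear()
  }
}
```

With something like this in place, each retry attempt is bounded, so a dial that never settles would surface as a failed attempt and leave the queue instead of staying pending.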

Another potentially problematic code path I am thinking about: if a peer redials before a connection has been set up, do we close or abort the dial that is already in progress, so that we can restart immediately without having to wait for a potential timeout?
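To illustrate the question, here is a rough sketch of "abort the in-flight dial when a newer one arrives". `SupersedingDialer` is a hypothetical helper, not something that exists in libp2p, and peers are identified by plain strings for brevity. It again assumes the underlying dial rejects when its signal aborts.

```typescript
type PeerDial = (peer: string, signal: AbortSignal) => Promise<void>

// Hypothetical helper: keeps at most one in-flight dial per peer and
// aborts the previous one when a new dial for the same peer is requested.
class SupersedingDialer {
  private readonly inflight = new Map<string, AbortController>()
  private readonly dial: PeerDial

  constructor (dial: PeerDial) {
    this.dial = dial
  }

  // Number of dials that have not settled yet.
  get pending (): number {
    return this.inflight.size
  }

  async redial (peer: string): Promise<'connected' | 'superseded' | 'failed'> {
    // Abort any dial to this peer that is still in progress.
    this.inflight.get(peer)?.abort(new Error('superseded by a newer dial'))

    const controller = new AbortController()
    this.inflight.set(peer, controller)

    try {
      await this.dial(peer, controller.signal)
      return 'connected'
    } catch {
      // Distinguish "replaced by a newer dial" from a genuine dial failure.
      return controller.signal.aborted ? 'superseded' : 'failed'
    } finally {
      // Only remove the entry if it still belongs to this attempt.
      if (this.inflight.get(peer) === controller) {
        this.inflight.delete(peer)
      }
    }
  }
}
```

The design choice here is that the newest dial always wins, so a stuck earlier attempt cannot keep the pending count above zero.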

I apologize for not providing an isolated, reproducible example, but I wanted to file this issue quickly to raise awareness, get help, and let other developers who run into this see it too.

Metadata

Assignees

No one assigned

Labels

need/triage: Needs initial labeling and prioritization
