server: using `experimental-dns-srv` can sometimes cause a wrong port to be used

Reported by @meridional 

Setup: we create MR clusters in CC using one Kubernetes cluster per region. The nodes in each region share a single SRV hostname that represents their grpc port, and we use the following flags for nodes to discover each other:

```
--listen-addr 0.0.0.0:26258 
--http-addr 0.0.0.0:8080 
--sql-addr 0.0.0.0:26257
--advertise-addr cockroachdb-42bqr.cockroachdb.us-east4.svc.cluster.local:26258
--advertise-sql-addr cockroachdb-42bqr.cockroachdb.us-east4.svc.cluster.local:26257
--join _grpc._tcp.cockroachdb.asia-southeast1.svc,_grpc._tcp.cockroachdb.us-east4.svc,_grpc._tcp.cockroachdb.us-west2.svc
```

The above is taken from one of the nodes in a 3-region cluster, running on version http://us-docker.pkg.dev/cockroach-cloud-images/cockroachdb/cockroach:v22.2.7.

The DNS records are set up using k8s services with the following spec:

```yaml
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: sql
    port: 26257
    protocol: TCP
    targetPort: 26257
  - name: grpc
    port: 26258
    protocol: TCP
    targetPort: grpc
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  publishNotReadyAddresses: true
  selector:
    crdb.cockroachlabs.com/cluster: cockroachdb
    svc: cockroachdb
  sessionAffinity: None
  type: ClusterIP
```

The service supports lookups for SRV records as well as A records for the names we use in the `--join` flag, such as `_grpc._tcp.cockroachdb.asia-southeast1.svc`. We don't control how and when the records are generated, but an educated guess is that k8s populates them after the crdb pods start running (and have an assigned pod IP).
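For reference, the names in the `--join` flag follow the Kubernetes convention of publishing one SRV record per named port of a headless service, in the form `_<port-name>._<protocol>.<service>.<namespace>.svc`. The sketch below derives those names from the spec above. It assumes the service is called `cockroachdb` and that the namespace matches the region name (inferred from the `--advertise-addr` value); `servicePort` and `srvName` are hypothetical helpers written for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// servicePort mirrors the two fields of a Service port that matter for SRV naming.
// Hypothetical type for this illustration only.
type servicePort struct {
	Name     string
	Protocol string
}

// srvName builds the SRV record name Kubernetes DNS is expected to publish
// for a named port of a headless service.
func srvName(p servicePort, service, namespace string) string {
	return fmt.Sprintf("_%s._%s.%s.%s.svc",
		p.Name, strings.ToLower(p.Protocol), service, namespace)
}

func main() {
	ports := []servicePort{
		{Name: "sql", Protocol: "TCP"},
		{Name: "grpc", Protocol: "TCP"},
		{Name: "http", Protocol: "TCP"},
	}
	for _, p := range ports {
		// Service and namespace values are assumptions taken from the flags above.
		fmt.Println(srvName(p, "cockroachdb", "us-east4"))
	}
}
```

Only the `grpc` entry is used for joining; its SRV record should carry port 26258.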

The issue: if region A's nodes start running before region B's nodes do (so B's DNS records are still missing), region A tries to join B on port 26257 and gets stuck in a retry loop, even after B's nodes are up. Restarting the nodes in region A fixes the issue. Logs from region A when the issue happens:

```
W230410 16:42:17.662906 54727 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1935  ‹[core]›‹[Channel #5841 SubChannel #5842] grpc: addrConn.createTransport failed to connect to {›
W230410 16:42:17.662906 54727 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1935 +‹  "Addr": "_grpc._tcp.cockroachdb.asia-southeast1.svc:26257",›
W230410 16:42:17.662906 54727 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1935 +‹  "ServerName": "_grpc._tcp.cockroachdb.asia-southeast1.svc:26257",›
W230410 16:42:17.662906 54727 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1935 +‹  "Attributes": null,›
W230410 16:42:17.662906 54727 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1935 +‹  "BalancerAttributes": null,›
W230410 16:42:17.662906 54727 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1935 +‹  "Type": 0,›
W230410 16:42:17.662906 54727 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1935 +‹  "Metadata": null›
W230410 16:42:17.662906 54727 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 1935 +‹}. Err: connection error: desc = "transport: authentication handshake failed: context deadline exceeded"›
W230410 16:42:17.663353 108 server/init.go:420 ⋮ [n?] 1936  outgoing join rpc to ‹_grpc._tcp.cockroachdb.asia-southeast1.svc:26257› unsuccessful: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"›
```

Logs from region B:

```
E230410 16:40:02.963899 178402 1@server/server_sql.go:1464 ⋮ [n1,client=‹10.16.0.11:33866›] 4295  serving SQL client conn: message size ‹352 MiB› bigger than maximum allowed message size ‹16 MiB›
```

`10.16.0.11` is the IP of a crdb pod in region A.
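The 352 MiB figure is consistent with a TLS handshake arriving on the SQL port. A pgwire startup message begins with a 4-byte big-endian length, while a TLS ClientHello record begins with the bytes `0x16 0x03 0x01` followed by the high byte of the record length. The sketch below works through the arithmetic, assuming that high byte is `0x00`; `startupLength` is a hypothetical helper, not the server's parsing code.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// startupLength interprets the first four bytes of a connection the way a
// pgwire server reads the length prefix of a startup message.
func startupLength(firstBytes []byte) uint32 {
	return binary.BigEndian.Uint32(firstBytes[:4])
}

func main() {
	// TLS record header: handshake (0x16), legacy version 0x0301,
	// then the high byte of the record length (assumed 0x00 here).
	hello := []byte{0x16, 0x03, 0x01, 0x00}
	n := startupLength(hello)
	fmt.Printf("%d bytes = %d MiB\n", n, n>>20) // prints "369295616 bytes = 352 MiB"
}
```

A high byte of `0x01` or `0x02` still rounds down to 352 MiB, so the region B error fits region A's gRPC client dialing port 26257.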

Jira issue: CRDB-27433

server: using `experimental-dns-srv` can sometimes cause a wrong port to be used #102415