RFE: Start nodes without immediately accepting KV or SQL requests #70122

Open
@bobvawter

Description

This RFE is motivated by a desire to reduce the risk of adding new nodes to production environments, especially those with non-trivial network configurations. Without presupposing an implementation, it would be useful to be able to require newly-added nodes to be explicitly activated by the operator after they have joined the RPC/gossip mesh, but before they begin accepting KV or SQL requests.

Many of our enterprise customers do not have the luxury of working in flat network topologies, where arbitrary in- or cross-region traffic is guaranteed to "just work". Consider this actual customer scenario:

  • Kubernetes pod IPs are not directly reachable; each pod must instead have its own dedicated Service, necessitating the use of the --advertise-addr flag.
  • Every network flow between a pair of IPs and/or Regions must be accounted for by firewall rules, acted upon by some other team within the company.
  • The teams that manage the CockroachDB cluster, k8s configurations, and network firewalls are disjoint and high-latency.

These sorts of O(n) or O(n^2) configuration issues would ideally be taken care of in an automated, repeatable fashion, but that is not a reality in all situations. We have had customers suffer cluster dysfunction due to asymmetric network reachability that could not be tested for without actually launching a new CockroachDB node. Past discussions about a network-quality simulator have uniformly converged on "use CockroachDB itself".
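
To make the asymmetric-reachability failure mode concrete, here is a minimal Go sketch. It assumes a hypothetical boolean matrix in which `reach[i][j]` records whether node i can open a connection to node j; how that matrix would be gathered is out of scope, and none of these names come from CockroachDB. It also shows why the rule count grows quadratically: a full mesh of n nodes has n*(n-1) directed flows.

```go
package main

import "fmt"

// directedFlows returns the number of directed node-to-node flows that a
// full mesh of n nodes needs, which is where the O(n^2) rule count comes from.
func directedFlows(n int) int {
	return n * (n - 1)
}

// asymmetricPairs returns every pair (i, j) with i < j where exactly one
// direction of the connection works. reach[i][j] is true when node i can
// open a connection to node j. The matrix is a hypothetical input.
func asymmetricPairs(reach [][]bool) [][2]int {
	var pairs [][2]int
	for i := range reach {
		for j := i + 1; j < len(reach); j++ {
			if reach[i][j] != reach[j][i] {
				pairs = append(pairs, [2]int{i, j})
			}
		}
	}
	return pairs
}

func main() {
	// Three existing nodes (0-2) plus one newly added node (3).
	// The rule allowing traffic from node 1 to node 3 is missing.
	reach := [][]bool{
		{true, true, true, true},
		{true, true, true, false},
		{true, true, true, true},
		{true, true, true, true},
	}
	fmt.Println(directedFlows(4))        // 12
	fmt.Println(asymmetricPairs(reach)) // [[1 3]]
}
```

A check of this shape is only useful if the new node can join the mesh without yet serving traffic, which is the gap this RFE describes.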

As a straw-man proposal, here is a possible set of ergonomics around an implementation:

  • A new cluster setting cluster.require_node_activation
  • When a new node is started with cockroach start, it will connect to existing nodes and obtain a node ID, but behave as though it were drained and not be a valid target for rebalancing.
  • Operators (human or otherwise) would be able to verify node functionality (e.g.: examine the network latency data to verify that full-mesh communication is possible with the newly-added node).
  • An explicit cockroach node activate # command is executed at a time of the operator's choosing.
  • Once a node has been marked as activated, it can never be deactivated, just drained and/or decommissioned.
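
The gating logic in the bullets above can be sketched as follows. This is an illustration of the straw-man only, not a proposed implementation: the `Node` type, its fields, and both methods are hypothetical, and the `requireActivation` parameter merely stands in for the proposed cluster.require_node_activation setting.

```go
package main

import "fmt"

// Node models only the fields this sketch needs; it is not a CockroachDB type.
type Node struct {
	ID        int
	Activated bool // set once by the operator, never cleared
	Draining  bool
}

// Activate marks the node as activated. There is deliberately no inverse
// operation: per the proposal, an activated node can only be drained or
// decommissioned.
func (n *Node) Activate() {
	n.Activated = true
}

// AcceptsRequests reports whether the node should serve KV or SQL requests
// and be treated as a rebalancing target. requireActivation stands in for
// the proposed cluster setting.
func (n *Node) AcceptsRequests(requireActivation bool) bool {
	if n.Draining {
		return false
	}
	return n.Activated || !requireActivation
}

func main() {
	n := &Node{ID: 4}
	fmt.Println(n.AcceptsRequests(true)) // false: joined but not yet activated

	n.Activate()
	fmt.Println(n.AcceptsRequests(true)) // true: operator has activated the node

	n.Draining = true
	fmt.Println(n.AcceptsRequests(true)) // false: draining overrides activation
}
```

Modeling activation as a one-way flag, rather than as a state that can be toggled, mirrors the last bullet: once activated, the only ways to take a node out of service are the existing drain and decommission paths.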

Jira issue: CRDB-9952

Metadata

    Labels

    A-configurability: Pertains to cluster settings, CLI flags, env vars etc
    C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
    O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
    P-3: Issues/test failures with no fix SLA
    T-server-and-security: DB Server & Security
