Description
Holorouter Version
9.0.2
VCF Version
VCF 9.0.1.0
Target Type
vCenter Cluster
Describe the bug
In Holorouter, dnsmasq eventually becomes unresponsive. For example, a lookup of ops-a.site-a.vcf.lab may time out even though the dnsmasq-dns pods remain Running and the service endpoints still exist.
Lookups against the holorouter front-end IP and the dns-service pod/cluster IPs (from the holorouter itself) all time out. This breaks VCF component name resolution until manual intervention.
I have a zone delegation and a static route configured on my router, so my machines can reach vcf.lab without depending on the webtop or anything fancy.
When the issue occurs, lookups from the LAN against this delegated DNS server return connection refused. Internally on the holorouter, I just see timeouts.
This resembles prior AAAA/upstream issues; however, filter-AAAA is already present in the current configs.
Reproduction steps
I don't yet have a deterministic repro sequence, but I suspect it results from performance-related pressure or intermittent upstream DNS failures; previously logged issues have pointed at "unreliable upstream DNS."
That said, I think dnsmasq in the Holorouter's architecture could be a lot more resilient anyway. In practice, the issue appears occasionally during normal operation, long after a successful (and possibly intensive) extended deployment.
Observed checks during failure:
kubectl get pods -n default -l app=dnsmasq-dns shows pods Running
kubectl get endpoints -n default dns-service shows valid pod IPs
nslookup ops-a.site-a.vcf.lab <management IP> times out
nslookup ops-a.site-a.vcf.lab <dns-service-IP or clusterIP> times out
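For what it's worth, the timeout symptom above can be reproduced without nslookup. This is a minimal sketch (the resolver IP and record name below are placeholders, not values from my setup) that sends a single raw UDP DNS query and reports whether any reply arrives before a deadline:

```python
import socket
import struct

def build_dns_query(name, qtype=1):
    # 12-byte header: id=0x1234, flags=0x0100 (recursion desired), 1 question.
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    # QNAME is length-prefixed labels terminated by a zero byte.
    qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split(".")) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack(">HH", qtype, 1)

def dns_responsive(server, name, timeout=2.0):
    # Send one UDP query and report whether *any* reply arrives in time.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(build_dns_query(name), (server, 53))
        s.recvfrom(512)
        return True
    except socket.timeout:
        return False
    finally:
        s.close()
```

During a failure window, `dns_responsive("<dns-service-IP>", "ops-a.site-a.vcf.lab")` returning False while the pods still show Running matches what I observe.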
Workaround that sometimes works:
Delete and reapply the dnsmasq manifests, or restart/delete the dnsmasq pods. This usually brings everything back. My ConfigMap is never empty/blank, and I don't see AAAA filtering being lost.
Expected behavior
If dnsmasq becomes unhealthy or stops answering queries, Kubernetes should detect this with an appropriate liveness/health check and self-heal, e.g. by removing the bad endpoints and restarting the pods.
DNS should remain available for delegated/internal records without requiring manual reapply/delete/apply cycles.
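One possible shape for that self-healing, sketched as an exec livenessProbe on the dnsmasq container. The container name and the probed record here are assumptions for illustration; the real manifests may differ:

```yaml
# Hypothetical fragment of the dnsmasq-dns pod spec; names are illustrative.
containers:
  - name: dnsmasq
    livenessProbe:
      exec:
        # Query the local dnsmasq for a record it must own; a hung
        # process fails this even while the container keeps running.
        command:
          - nslookup
          - ops-a.site-a.vcf.lab
          - 127.0.0.1
      initialDelaySeconds: 10
      periodSeconds: 15
      timeoutSeconds: 3
      failureThreshold: 3
```

With something like this, the kubelet would restart the container after three consecutive failed lookups instead of waiting for manual intervention.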
Relevant logs
Unfortunately, the issue is opaque, and runtime/deployment/pod logs have proven unhelpful so far.
Additional context
Small note, and probably nothing, but I originally deployed 9.0.1 to reproduce an issue I have been troubleshooting in VCF Ops, and have since upgraded to 9.0.2 via VCF Ops. I've encountered this bug once during an NSX precheck, and again much later while configuring log integration. It doesn't obviously correlate with load on the underlying host or anything like that.