Description
Holorouter Version
9.0.2
VCF Version
VCF 9.0.1.0
Target Type
vCenter Cluster
Describe the bug
In Holorouter, dnsmasq eventually becomes unresponsive. For example, a lookup of ops-a.site-a.vcf.lab may time out even though the dnsmasq-dns pods remain Running and the service endpoints still exist.
Lookups against the holorouter front-end IP and the dns-service pod/cluster IPs (from the holorouter itself) all time out. This breaks VCF component name resolution until manual intervention.
I have a zone delegation and a static route configured on my router, so my machines can reach vcf.lab without depending on the webtop or anything fancy.
When the issue occurs, lookups from the LAN against this delegated DNS server return connection refused. Internally on the holorouter, I just see timeouts.
This resembles prior AAAA/upstream issues; however, filter-AAAA is already present in the current configs.
Reproduction steps
I don't yet have a deterministic repro sequence, but I suspect it results from performance-related pressure or intermittent upstream DNS failures; previously logged issues have pointed at "unreliable upstream DNS."
That said, I think dnsmasq in the Holorouter's architecture could be a lot more resilient anyway. In practice, the issue appears occasionally during normal operation, long after a successful (and possibly intensive) extended deployment.
Observed checks during failure:
kubectl get pods -n default -l app=dnsmasq-dns shows pods Running
kubectl get endpoints -n default dns-service shows valid pod IPs
nslookup ops-a.site-a.vcf.lab <management IP> times out
nslookup ops-a.site-a.vcf.lab <dns-service-IP or clusterIP> times out
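For what it's worth, the timeout symptom above can be reproduced without nslookup. This is a minimal sketch (the resolver IP and record name below are placeholders, not values from my setup) that sends a single raw UDP DNS query and reports whether any reply arrives before a deadline:

```python
import socket
import struct

def build_dns_query(name, qtype=1):
    # 12-byte header: id=0x1234, flags=0x0100 (recursion desired), 1 question.
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    # QNAME is length-prefixed labels terminated by a zero byte.
    qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split(".")) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack(">HH", qtype, 1)

def dns_responsive(server, name, timeout=2.0):
    # Send one UDP query and report whether *any* reply arrives in time.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(build_dns_query(name), (server, 53))
        s.recvfrom(512)
        return True
    except socket.timeout:
        return False
    finally:
        s.close()
```

During a failure window, `dns_responsive("<dns-service-IP>", "ops-a.site-a.vcf.lab")` returning False while the pods still show Running matches what I observe.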
Workaround that sometimes works:
Delete and reapply the dnsmasq manifests, or restart/delete the dnsmasq pods. This usually brings everything back. My ConfigMap is never empty/blank, and I don't see AAAA filtering being lost.
Expected behavior
If dnsmasq becomes unhealthy or stops answering queries, Kubernetes should detect this with an appropriate liveness/health check and self-heal, e.g. by removing the bad endpoints and restarting the pods.
DNS should remain available for delegated/internal records without requiring manual reapply/delete/apply cycles.
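One possible shape for that self-healing, sketched as an exec livenessProbe on the dnsmasq container. The container name and the probed record here are assumptions for illustration; the real manifests may differ:

```yaml
# Hypothetical fragment of the dnsmasq-dns pod spec; names are illustrative.
containers:
  - name: dnsmasq
    livenessProbe:
      exec:
        # Query the local dnsmasq for a record it must own; a hung
        # process fails this even while the container keeps running.
        command:
          - nslookup
          - ops-a.site-a.vcf.lab
          - 127.0.0.1
      initialDelaySeconds: 10
      periodSeconds: 15
      timeoutSeconds: 3
      failureThreshold: 3
```

With something like this, the kubelet would restart the container after three consecutive failed lookups instead of waiting for manual intervention.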
Relevant logs
Unfortunately, the issue is opaque, and runtime/deployment/pod logs have proven unhelpful so far.
Additional context
Small note, and probably nothing, but I originally deployed 9.0.1 to reproduce an issue I have been troubleshooting in VCF Ops, and have since upgraded to 9.0.2 via VCF Ops. I've encountered this bug once during an NSX precheck, and again much later while configuring log integration. It doesn't obviously correlate with load on the underlying host or anything like that.