Describe the bug
Unbound starts to drop some queries and increment its requestlist.exceeded counters when the caches are full or nearly full.
To reproduce
Steps to reproduce the behavior:
- Unbound 1.24.2-1, running in a VM with 12 cores for a couple of days, with 12 threads configured. Validating resolver with a rather small RPZ, acting as a backend in a pool behind a load balancer. A standard load of about 1k qps in an ISP environment; cache hit rate is about 20% due to a cache on the balancer. No excessive load peaks, no silly traffic, no bad clients, no suspicious queries in the request list.
- When Unbound is restarted and everything is fresh, it is fine for a couple of days. I see dropped queries only when there is a real issue with an upstream authoritative NS and the SERVFAIL is not cached yet. After some days I see random query drops at a rate of about 0.5% and rising, even for domains that are clearly without any issues (google.com, apple.com, or any other frequently used domain). I can also see the requestlist.exceeded counters going up for some threads; the number of events there seems to correspond to the number of dropped queries.
Drops are confirmed by a packet capture directly on the VM.
While analyzing this I noticed that the moment Unbound starts to drop roughly corresponds to the moment the configured caches become full or nearly full (Grafana is in use for monitoring). Until that moment everything seems fine.
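For reference, both the cache fill level and the requestlist.exceeded counters can be read from `unbound-control stats_noreset` output. A minimal parsing sketch (the sample values below are illustrative, not taken from my servers; the configured sizes assume the first config variant listed below):

```python
# Sketch: parse `unbound-control stats_noreset` output and report cache fill
# ratios alongside the summed per-thread requestlist.exceeded counters.
# The sample text is illustrative; real values come from the running daemon.
sample = """\
thread0.requestlist.exceeded=17
thread1.requestlist.exceeded=0
mem.cache.rrset=2147000000
mem.cache.message=134000000
"""

def parse_stats(text):
    """Turn `key=value` lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition("=")
        if value:
            stats[key] = float(value)
    return stats

stats = parse_stats(sample)

# Configured sizes (assumption: msg-cache-size: 128M, rrset-cache-size: 2G).
MSG_CACHE_SIZE = 128 * 1024**2
RRSET_CACHE_SIZE = 2 * 1024**3

msg_fill = stats["mem.cache.message"] / MSG_CACHE_SIZE
rrset_fill = stats["mem.cache.rrset"] / RRSET_CACHE_SIZE
exceeded = sum(v for k, v in stats.items()
               if k.endswith("requestlist.exceeded"))
print(f"msg cache {msg_fill:.0%}, rrset cache {rrset_fill:.0%}, "
      f"exceeded={exceeded:.0f}")
```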
- The cache size does not matter here, nor does the number of cache slabs. I can reproduce this with the default slab values and these caches:
num-threads: 12
num-queries-per-thread: 4096
msg-cache-size: 128M
rrset-cache-size: 2G
key-cache-size: 256M
neg-cache-size: 64M
infra-cache-numhosts: 1000000
infra-cache-max-rtt: 4000
max-query-restarts: 7
and also with the following settings:
msg-cache-size: 8G
rrset-cache-size: 16G
key-cache-size: 2G
neg-cache-size: 512M
msg-cache-slabs: 64
rrset-cache-slabs: 64
infra-cache-slabs: 64
key-cache-slabs: 64
and even with these settings:
msg-cache-size: 8G
rrset-cache-size: 16G
key-cache-size: 2G
neg-cache-size: 512M
msg-cache-slabs: 32
rrset-cache-slabs: 64
infra-cache-slabs: 16
key-cache-slabs: 16
- This issue is fully reproducible on all 11 Unbound servers in the pool. The normal "drop rate" is up to 0.2 qps; when this issue appears it goes over 1-1.5 qps.
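The drop-rate figures above come from comparing counter deltas between monitoring scrapes; the arithmetic is just (a sketch with made-up sample deltas, assuming per-interval counter differences exported by the monitoring system):

```python
# Sketch: convert per-interval counter deltas into a qps drop rate.
# All numbers below are illustrative, not measured values.
interval_s = 60
queries_delta = 60_000   # total queries in the interval (~1k qps load)
exceeded_delta = 72      # sum of thread*.requestlist.exceeded deltas

drop_qps = exceeded_delta / interval_s
drop_pct = exceeded_delta / queries_delta * 100
print(f"{drop_qps:.1f} qps dropped ({drop_pct:.2f}% of queries)")
```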
Expected behavior
No drops
System:
- Unbound version: Debian package 1.24.2-1
- OS: Debian Trixie
# unbound -V
Version 1.24.2
Configure line: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-pythonmodule --with-pyunbound --enable-subnet --enable-dnstap --enable-systemd --enable-cachedb --with-libhiredis --with-libnghttp2 --with-chroot-dir= --with-dnstap-socket-path=/run/dnstap.sock --disable-rpath --with-pidfile=/run/unbound.pid --with-libevent --enable-tfo-client --with-rootkey-file=/usr/share/dns/root.key --enable-tfo-server
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.5.4 30 Sep 2025
Linked modules: dns64 python cachedb subnetcache respip validator iterator
TCP Fastopen feature available
Additional information
I am also running Knot Resolver in the same server pool, where I do not see this issue. The more cache is configured, the longer it takes until the issue appears. I can provide graphical stats of drops over time and of cache usage, as well as the full config. I saw this in previous versions of Unbound as well, so I upgraded to the latest and it still persists.