Unbound starts to drop queries when caches are nearly full #1414

@rygl

Description

Describe the bug
Unbound starts to drop some queries, and the requestlist.exceeded counters increase, when the caches are full or nearly full.

To reproduce
Steps to reproduce the behavior:

  1. Unbound 1.24.2-1, running in a VM with 12 cores for a couple of days. 12 threads configured. A validating resolver with a rather small RPZ, acting as a backend in a pool behind a load balancer. A standard load of about 1k qps in an ISP environment. Cache hit rate is about 20% due to the cache on the balancer. No excessive load peaks, no silly traffic, no bad clients. No suspicious queries in the request list.

  2. When Unbound is restarted and everything is fresh, it is fine for a couple of days. I can see dropped queries only when there is a real issue with an upstream authoritative NS and the SERVFAIL is not cached yet. After some days I can see random drops of queries at a rate of about 0.5% and increasing, even for domains that are clearly without any issues (google.com, apple.com or any other frequently used domain). I can also see that the requestlist.exceeded counters are going up for some threads. The number of events here seems to correspond to the number of dropped queries.
    Drops are confirmed by a packet capture directly on the VM.

When analyzing this I noticed that the moment when Unbound starts to drop queries roughly corresponds to the moment when the configured caches are full or nearly full (Grafana in use). Until that moment everything seems to be fine.
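For anyone wanting to correlate the same counters, below is a minimal Python sketch, assuming the key=value output of `unbound-control stats_noreset`. The sample text is hypothetical, but the counter names (mem.cache.rrset, mem.cache.message, threadN.requestlist.exceeded) are the real ones:

```python
# Minimal sketch: parse `unbound-control stats_noreset` key=value output
# and report cache fill plus per-thread requestlist.exceeded counters.
# `sample` is hypothetical output using the real counter names.

sample = """\
mem.cache.rrset=2040109465
mem.cache.message=133169152
thread0.requestlist.exceeded=0
thread1.requestlist.exceeded=17
total.requestlist.exceeded=17
"""

def parse_stats(text):
    """Turn `name=value` lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            stats[key] = float(value)
    return stats

stats = parse_stats(sample)

# Fill ratio against the smaller configuration in this report
# (rrset-cache-size: 2G, msg-cache-size: 128M).
rrset_fill = stats["mem.cache.rrset"] / (2 * 1024**3)
msg_fill = stats["mem.cache.message"] / (128 * 1024**2)
print(f"rrset cache fill: {rrset_fill:.1%}, msg cache fill: {msg_fill:.1%}")

# Threads whose request list overflowed since the last stats reset.
exceeded = {k: v for k, v in stats.items()
            if k.startswith("thread") and k.endswith("requestlist.exceeded") and v > 0}
print("threads with requestlist.exceeded > 0:", exceeded)
```

In practice the text would come from running `unbound-control stats_noreset` (e.g. via `subprocess.run`); note that mem.cache.* reports bytes in use, so the fill ratio is an approximation of how close the cache is to its configured limit.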

  3. The cache size does not matter here, and neither does the number of cache slabs. I can reproduce this with default slab values and these caches:
        num-threads: 12
        num-queries-per-thread: 4096

        msg-cache-size:    128M
        rrset-cache-size:  2G
        key-cache-size:    256M
        neg-cache-size:    64M

        infra-cache-numhosts: 1000000
        infra-cache-max-rtt:  4000
        max-query-restarts:   7

and also with the following settings:

        msg-cache-size:    8G
        rrset-cache-size:  16G
        key-cache-size:    2G
        neg-cache-size:    512M

        msg-cache-slabs:   64
        rrset-cache-slabs: 64
        infra-cache-slabs: 64
        key-cache-slabs:   64

and even with these settings:

        msg-cache-size:    8G
        rrset-cache-size:  16G
        key-cache-size:    2G
        neg-cache-size:    512M

        msg-cache-slabs:   32
        rrset-cache-slabs: 64
        infra-cache-slabs: 16
        key-cache-slabs:   16
  4. This issue is fully reproducible on all the Unbound servers in the pool, 11 in total. The normal "drop rate" is up to 0.2 qps; when this issue appears it goes over 1-1.5 qps.
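As a side note on the slab settings above: Unbound divides each cache's configured size evenly across its slabs (more slabs mainly reduce lock contention), so even the third configuration leaves each slab a large memory budget. A quick arithmetic sketch, using the sizes from that configuration:

```python
# Per-slab memory for the third configuration above:
# Unbound divides each cache's size evenly across its slabs.
G, M = 1024**3, 1024**2

caches = {
    "msg":   (8 * G, 32),   # msg-cache-size: 8G,   msg-cache-slabs: 32
    "rrset": (16 * G, 64),  # rrset-cache-size: 16G, rrset-cache-slabs: 64
    "key":   (2 * G, 16),   # key-cache-size: 2G,   key-cache-slabs: 16
}
for name, (size, slabs) in caches.items():
    print(f"{name}-cache: {size // slabs // M} MiB per slab")
# msg-cache: 256 MiB, rrset-cache: 256 MiB, key-cache: 128 MiB per slab
```

This supports the reporter's point that slab count is not the limiting factor here: every slab is still hundreds of MiB.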

Expected behavior
No drops

System:

  • Unbound version: Debian package 1.24.2-1
  • OS: Debian Trixie
# unbound -V
Version 1.24.2

Configure line: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-pythonmodule --with-pyunbound --enable-subnet --enable-dnstap --enable-systemd --enable-cachedb --with-libhiredis --with-libnghttp2 --with-chroot-dir= --with-dnstap-socket-path=/run/dnstap.sock --disable-rpath --with-pidfile=/run/unbound.pid --with-libevent --enable-tfo-client --with-rootkey-file=/usr/share/dns/root.key --enable-tfo-server
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.5.4 30 Sep 2025
Linked modules: dns64 python cachedb subnetcache respip validator iterator
TCP Fastopen feature available

Additional information
I am also running Knot Resolver in the pool of servers, where I do not see this issue. The more cache is configured, the more time it takes until the issue appears. I can provide graphical stats of drops over time and cache usage, as well as the full config. I could see this in previous versions of Unbound too, so I upgraded to the latest and it still persists.
