System information
| Type | Version/Name |
|---|---|
| Distribution Name | FreeBSD |
| Distribution Version | 14.3-STABLE stable/14-n273382-8e49c6c84cae |
| Kernel Version | 14.3-STABLE 1403508 |
| Architecture | amd64 |
| OpenZFS Version | zfs-2.2.9-FreeBSD_g079ba86d7 |
Describe the problem you're observing
We use ZFS extensively on our servers running FreeBSD. We recently upgraded the OS from FreeBSD 14.0 to FreeBSD 14.3. Soon thereafter we started encountering rare, mysterious SQLite3 database corruptions. (We have literally millions of such databases.)
And today, after upgrading yet another server and overloading it a bit, we seem to have found the culprit. We have a service that routinely scans a collection of files for changes to mtime. While the server was fairly heavily loaded, that service observed a spurious ENOENT error on a file that had not been modified for months. During that period, we noticed [kernel{arc_prune}] to be very busy. Apart from the spurious ENOENT, around the same time we noticed problems when building a FreeBSD port:
chmod /tmp/usr/ports/devel/cmake-core/work/cmake-3.31.10/Tests/CMakeOnly/CheckOBJCCompilerFlag : Bad file descriptor
This is the first time we have ever encountered this kind of error while building ports.
Spurious ENOENT errors on files that actually exist would perfectly explain all previous occurrences of SQLite3 database corruption on other servers, because we use the DELETE journaling mode, where a failure to find or remove the journal file is a disaster.
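To illustrate why the journal file matters in this mode, here is a minimal sketch using Python's standard `sqlite3` module. The function name `journal_lifecycle` and the table `t` are made up for illustration; this is not our production code. It opens a database in DELETE journaling mode and checks for the `-journal` side file during and after a write transaction.

```python
import os
import sqlite3
import sys


def journal_lifecycle(db_path):
    """Return (journal_mode, journal_seen_during_txn, journal_seen_after_commit).

    Hypothetical helper: it only demonstrates that DELETE mode relies on a
    "<db>-journal" side file that is removed when the transaction commits.
    """
    journal_path = db_path + "-journal"
    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        mode = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
        conn.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("INSERT INTO t VALUES (1)")
        # The rollback journal normally exists at this point.
        during = os.path.exists(journal_path)
        conn.execute("COMMIT")
        # In DELETE mode the journal is unlinked at commit.
        after = os.path.exists(journal_path)
    finally:
        conn.close()
    return mode, during, after


if __name__ == "__main__":
    print(journal_lifecycle(sys.argv[1]))
```

If a lookup of the journal file wrongly reports ENOENT while the file really is there, SQLite has no way to tell that a rollback journal still needs to be handled, which fits the corruptions we see.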
It all points to some race between stat(2) (and probably other syscalls as well) and an actively busy arc_prune: as if it is pruning metadata that the syscall is concurrently requesting, and, instead of falling back to the cold data on disk, the syscall returns immediately, as if the file system entity no longer existed. A subsequent stat(2), though, finds the entity without problems.
Describe how to reproduce the problem
Produce a lot of CPU load and disk I/O, and stat(2) various files that you are sure must exist. Occasionally you will observe the ENOENT error.
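A minimal sketch of the probing side, assuming the load is generated separately (for example by a port build running in parallel). The function name `probe` and the `rounds` parameter are hypothetical, and this is not the service mentioned above. It calls stat on each given path repeatedly and, on failure, retries once to see whether the error was transient.

```python
import errno
import os
import sys


def probe(paths, rounds=1000):
    """stat() every path `rounds` times.

    Returns a list of (path, errno, retry_ok) tuples, one per failed stat().
    retry_ok is True when an immediate second stat() succeeded, i.e. the
    first error was spurious.
    """
    failures = []
    for _ in range(rounds):
        for path in paths:
            try:
                os.stat(path)
            except OSError as exc:
                try:
                    os.stat(path)
                    retry_ok = True
                except OSError:
                    retry_ok = False
                failures.append((path, exc.errno, retry_ok))
    return failures


if __name__ == "__main__":
    # Pass files that are known to exist and are not being modified.
    for path, err, retry_ok in probe(sys.argv[1:]):
        name = errno.errorcode.get(err, str(err))
        print(f"{path}: {name}, immediate retry ok: {retry_ok}")
```

On a healthy system this prints nothing for existing files. An ENOENT entry with `retry ok: True` would match what we observed.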