errors corrected by scrub but device stats still showing 0 #974

Open
@gschintgen

I recently observed my first case of a read-time correction:

Mar 17 03:01:44 server kernel: BTRFS warning (device sdb1): csum failed root 1562 ino 59747 off 1602211840 csum 0xf6f45a1f expected csum 0x0e8a49e7 mirror 2
Mar 17 03:01:44 server kernel: BTRFS error (device sdb1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Mar 17 03:01:44 server kernel: BTRFS info (device sdb1): read error corrected: ino 59747 off 1602211840 (dev /dev/sdc1 sector 1709502656)

I was alerted to this by my monitoring script, which keeps an eye on btrfs device stats (they had always been 0 until now).
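For illustration, a check of this kind could look roughly like the Python sketch below. It is not the actual monitoring script: the function name `nonzero_stats` is hypothetical, and the code only assumes the plain-text format that `btrfs device stats` prints (one `[device].counter  value` pair per line). The sample text is an excerpt of that output, pasted in rather than obtained by running the command.

```python
import re

# Matches lines such as "[/dev/sdc1].corruption_errs  4"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")


def nonzero_stats(text):
    """Return {(device, counter): value} for every counter above zero.

    `text` is assumed to be the plain output of `btrfs device stats <mountpoint>`.
    Lines that do not match the expected format are ignored.
    """
    found = {}
    for line in text.splitlines():
        m = STAT_LINE.match(line.strip())
        if m and int(m.group("value")) > 0:
            found[(m.group("dev"), m.group("name"))] = int(m.group("value"))
    return found


# Excerpt of `btrfs device stats` output, used here as sample input.
sample = """\
[/dev/sdb1].write_io_errs    0
[/dev/sdb1].corruption_errs  0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].corruption_errs  4
"""

for (dev, name), value in nonzero_stats(sample).items():
    print(f"ALERT: {dev} {name} = {value}")
```

A check like this only fires once a counter is actually incremented, so it depends entirely on every corrected error reaching the device stats.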

When I took a closer look at my logs, I noticed that errors had already been found and corrected by the monthly scrub months earlier:

# journalctl -g 'csum='
May 01 09:26:06 server btrfs-scrub[3160846]: Error summary:    csum=2
Jun 01 10:57:30 server btrfs-scrub[55336]: Error summary:    csum=1
-- Boot d914366d1f104b07b3d8b4b97dbb43b5 --
-- Boot 67bceee111804b87b9985f381609928e --
-- Boot 7b08640341ac4b21b00deec76eebeff0 --
-- Boot fe2cd19f44d34bcb93ee647626f97428 --
-- Boot 5e5fe89ab6dd4bcd99848a7d022661e7 --
-- Boot 286afe30be4547f991116a465baa0d16 --
Oct 01 09:46:49 server btrfs-scrub[3470551]: Error summary:    csum=3
Nov 01 10:18:03 server btrfs-scrub[2197016]: Error summary:    csum=1
-- Boot 7a57608c89b54b75ad75e19b9d9532d3 --
-- Boot e6c81574ae6048148fdb11fc9fca134f --
Jan 01 10:03:59 server btrfs-scrub[1388868]: Error summary:    csum=3
-- Boot b81b09bb7fa4438881c613f184fdc3da --
Feb 01 10:48:54 server btrfs-scrub[2128061]: Error summary:    csum=3
-- Boot 3befaeadcd1b4c71bae23e39c50f2eae --
Mar 01 10:36:43 server btrfs-scrub[4131960]: Error summary:    csum=1

Yet, none of those entered the device stats, even though the btrfs-device man page states:

The current values are printed at mount time and updated during filesystem lifetime or from a scrub run.
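To put a number on the mismatch, the scrub summaries from the journal can be tallied and compared with the counters. The sketch below does this for the `Error summary:` lines quoted above; the function name `scrub_csum_counts` is hypothetical, and the regular expression only assumes the line format seen in this journal output.

```python
import re

# Matches e.g. "btrfs-scrub[3160846]: Error summary:    csum=2"
SUMMARY = re.compile(r"btrfs-scrub\[\d+\]: Error summary:.*?\bcsum=(\d+)")


def scrub_csum_counts(journal_text):
    """Return the csum error count of each scrub summary line, in order."""
    counts = []
    for line in journal_text.splitlines():
        m = SUMMARY.search(line)
        if m:
            counts.append(int(m.group(1)))
    return counts


# The scrub summary lines from the journal excerpt above (boot markers omitted).
journal = """\
May 01 09:26:06 server btrfs-scrub[3160846]: Error summary:    csum=2
Jun 01 10:57:30 server btrfs-scrub[55336]: Error summary:    csum=1
Oct 01 09:46:49 server btrfs-scrub[3470551]: Error summary:    csum=3
Nov 01 10:18:03 server btrfs-scrub[2197016]: Error summary:    csum=1
Jan 01 10:03:59 server btrfs-scrub[1388868]: Error summary:    csum=3
Feb 01 10:48:54 server btrfs-scrub[2128061]: Error summary:    csum=3
Mar 01 10:36:43 server btrfs-scrub[4131960]: Error summary:    csum=1
"""

counts = scrub_csum_counts(journal)
print(f"{len(counts)} scrub runs reported {sum(counts)} csum errors in total")
```

None of the checksum errors tallied this way were reflected in the device stats before the read-time correction on Mar 17.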

Currently I'm using the following versions:

# uname -a
Linux shigi-server 6.8.0-52-generic #53~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 15 19:18:46 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
# btrfs --version
btrfs-progs v5.16.2

I also checked that the culprit is always the same drive: sdc. After all of these errors (and possibly others I have not found), the device stats are as follows:

# btrfs device stats /mnt/btrfsraid
[/dev/sdb1].write_io_errs    0
[/dev/sdb1].read_io_errs     0
[/dev/sdb1].flush_io_errs    0
[/dev/sdb1].corruption_errs  0
[/dev/sdb1].generation_errs  0
[/dev/sde3].write_io_errs    0
[/dev/sde3].read_io_errs     0
[/dev/sde3].flush_io_errs    0
[/dev/sde3].corruption_errs  0
[/dev/sde3].generation_errs  0
[/dev/sdc1].write_io_errs    0
[/dev/sdc1].read_io_errs     0
[/dev/sdc1].flush_io_errs    0
[/dev/sdc1].corruption_errs  4
[/dev/sdc1].generation_errs  0

The 4 corruption errors on sdc1 accumulated over the last three days during regular use, and each of them has its own entry in the systemd journal. Needless to say, this drive will be replaced ASAP. (SMART is completely oblivious to this, even though I have regular extended self-tests scheduled.)
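The per-error journal entries can be cross-checked against the device stats by pulling the counters out of the kernel's `bdev ... errs:` lines. The following is a minimal sketch, assuming the line format shown in the kernel log above and assuming that each such line carries the device's running totals at that moment; `latest_counters` is a hypothetical helper name.

```python
import re

# Matches e.g. "bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0"
BDEV = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def latest_counters(log_text):
    """Return {device: {counter: value}} taken from the last matching line per device."""
    latest = {}
    for line in log_text.splitlines():
        m = BDEV.search(line)
        if m:
            fields = m.groupdict()
            dev = fields.pop("dev")
            latest[dev] = {name: int(value) for name, value in fields.items()}
    return latest


# The kernel log line from the read-time correction quoted earlier.
log = (
    "Mar 17 03:01:44 server kernel: BTRFS error (device sdb1): "
    "bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0\n"
)

print(latest_counters(log))
```

Feeding the whole kernel journal through such a helper should, under the assumption above, end up with the same values that `btrfs device stats` reports.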
