fix(orch): raise kernel I/O timeout above backend fetch timeout to prevent EIO #2342

Merged
jakubno merged 1 commit into main from fix/nbd-timeouts
Apr 10, 2026

Conversation

@jakubno
Member

@jakubno jakubno commented Apr 9, 2026

Fixes NBD reliability issues that caused sandboxes to die with I/O errors:

The kernel NBD driver had ioTimeout = deadconnTimeout = 30s. When a GCS chunk fetch took longer than 30s (cold cache, GCS latency spike), the kernel gave up waiting for a read response, declared the connection dead, and returned EIO to the Firecracker guest. The guest's block device then errored out, causing the VM to crash.

Raises ioTimeout to 90s — above the 60s GCS fetch timeout — so the kernel waits long enough for the response before declaring the connection dead.

Observable symptoms before this fix:

  • "block nbd: Send data failed (result -11)" → "Dead connection" in host kernel logs
  • "nbd error cmd write" + "closing handler for NBD commands" in orchestrator
  • "Failed to execute In virtio block request: FileEngine(...IOError...)" in Firecracker logs
  • "error flushing cow device: failed to fsync path: input/output error" during cleanup

@cursor

cursor bot commented Apr 9, 2026

PR Summary

Medium Risk
Touches low-level NBD mount behavior by changing kernel timeout defaults; misconfiguration could cause either premature EIOs or slower failure detection in production.

Overview
Raises the kernel NBD per-request I/O timeout to exceed backend fetch latency (with separate dead-connection timeout), makes both timeouts configurable via new MountOptions, and adds a slow-backend regression test that reproduces the EIO failure with short timeouts and verifies reads succeed when the timeout is sufficiently large.

Reviewed by Cursor Bugbot for commit 8f26c76.

@jakubno jakubno marked this pull request as ready for review April 10, 2026 11:34

@claude claude bot left a comment

LGTM — straightforward fix raising NBD I/O timeout above the GCS fetch timeout, with regression tests validating both the failure and the fix.

Extended reasoning...

Overview

Changes 3 files in the NBD subsystem: splits the old single connectTimeout=30s constant into separate ioTimeout=90s and deadconnTimeout=30s constants, adds a functional options pattern (MountOption) to allow test overrides, updates the GetNBDDevice test helper to pass through options, and adds a slow-backend regression test.

Security risks

None. These are kernel timeout parameters for an internal block device driver, with no auth, permissions, or data-exposure surface.

Level of scrutiny

Low. The root cause is clearly documented (kernel I/O timeout < GCS fetch timeout → EIO), the fix is minimal and targeted (one constant change, 90s > 60s), and the new regression tests directly validate both the broken behavior (short timeout → EIO) and the fixed behavior (sufficient timeout → success). The functional options pattern is backwards-compatible — existing callers of NewDirectPathMount need no changes.

Other factors

No existing callers are broken (variadic opts). The deadconnTimeout stays at 30s; it is the separate timer for declaring a connection dead and does not need to change. No bugs were flagged by the automated bug hunting system.

@jakubno jakubno merged commit d0ac010 into main Apr 10, 2026
48 checks passed
@jakubno jakubno deleted the fix/nbd-timeouts branch April 10, 2026 15:26