fix(orch): raise kernel I/O timeout above backend fetch timeout to prevent EIO#2342
fix(orch): raise kernel I/O timeout above backend fetch timeout to prevent EIO#2342
Conversation
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit 8f26c76. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
LGTM — straightforward fix raising NBD I/O timeout above the GCS fetch timeout, with regression tests validating both the failure and the fix.
Extended reasoning...
Overview
Changes 3 files in the NBD subsystem: splits the old single connectTimeout=30s constant into separate ioTimeout=90s and deadconnTimeout=30s constants, adds a functional options pattern (MountOption) to allow test overrides, updates the GetNBDDevice test helper to pass through options, and adds a slow-backend regression test.
Security risks
None. These are kernel timeout parameters for an internal block device driver, with no auth, permissions, or data-exposure surface.
Level of scrutiny
Low. The root cause is clearly documented (kernel I/O timeout < GCS fetch timeout → EIO), the fix is minimal and targeted (one constant change, 90s > 60s), and the new regression tests directly validate both the broken behavior (short timeout → EIO) and the fixed behavior (sufficient timeout → success). The functional options pattern is backwards-compatible — existing callers of NewDirectPathMount need no changes.
Other factors
No existing callers are broken (variadic opts). The deadconnTimeout stays at 30s, which is the separate per-connection death declaration timer and does not need to change. No bugs were flagged by the automated bug hunting system.
Fixes NBD reliability issues that caused sandboxes to die with I/O errors:
The kernel NBD driver had
ioTimeout=deadconnTimeout=30s. When a GCS chunk fetch took longer than 30s (cold cache, GCS latency spike), the kernel gave up waiting for a read response, declared the connection dead, and returned EIO to the Firecracker guest. The guest's block device then errored out, causing the VM to crash.Raises
ioTimeoutto90s— above the60sGCS fetch timeout — so the kernel waits long enough for the response before declaring the connection dead.Observable symptoms before this fix: