fix(orch): raise kernel I/O timeout above backend fetch timeout to prevent EIO by jakubno · Pull Request #2342 · e2b-dev/infra

jakubno · 2026-04-09T15:39:21Z

Fixes NBD reliability issues that caused sandboxes to die with I/O errors:

The kernel NBD driver had ioTimeout = deadconnTimeout = 30s. When a GCS chunk fetch took longer than 30s (cold cache, GCS latency spike), the kernel gave up waiting for a read response, declared the connection dead, and returned EIO to the Firecracker guest. The guest's block device then errored out, causing the VM to crash.

Raises ioTimeout to 90s — above the 60s GCS fetch timeout — so the kernel waits long enough for the response before declaring the connection dead.

Observable symptoms before this fix:

- "block nbd: Send data failed (result -11)" → "Dead connection" in host kernel logs
- "nbd error cmd write" + "closing handler for NBD commands" in orchestrator
- "Failed to execute In virtio block request: FileEngine(...IOError...)" in Firecracker logs
- "error flushing cow device: failed to fsync path: input/output error" during cleanup


cursor · 2026-04-09T15:39:27Z

PR Summary

Medium Risk
Touches low-level NBD mount behavior by changing kernel timeout defaults; misconfiguration could cause either premature EIOs or slower failure detection in production.

Overview
Raises the kernel NBD per-request I/O timeout to exceed backend fetch latency (with separate dead-connection timeout), makes both timeouts configurable via new MountOptions, and adds a slow-backend regression test that reproduces the EIO failure with short timeouts and verifies reads succeed when the timeout is sufficiently large.

Reviewed by Cursor Bugbot for commit 8f26c76.

claude

LGTM — straightforward fix raising NBD I/O timeout above the GCS fetch timeout, with regression tests validating both the failure and the fix.

Extended reasoning...

Overview

Changes 3 files in the NBD subsystem: splits the old single connectTimeout=30s constant into separate ioTimeout=90s and deadconnTimeout=30s constants, adds a functional options pattern (MountOption) to allow test overrides, updates the GetNBDDevice test helper to pass through options, and adds a slow-backend regression test.

Security risks

None. These are kernel timeout parameters for an internal block device driver, with no auth, permissions, or data-exposure surface.

Level of scrutiny

Low. The root cause is clearly documented (kernel I/O timeout < GCS fetch timeout → EIO), the fix is minimal and targeted (one constant change, 90s > 60s), and the new regression tests directly validate both the broken behavior (short timeout → EIO) and the fixed behavior (sufficient timeout → success). The functional options pattern is backwards-compatible — existing callers of NewDirectPathMount need no changes.

Other factors

No existing callers are broken (variadic opts). The deadconnTimeout stays at 30s, which is the separate per-connection death declaration timer and does not need to change. No bugs were flagged by the automated bug hunting system.

fix(orch): raise kernel I/O timeout above backend fetch timeout to prevent EIO (8f26c76)

e2b-request-same-site-reviewers bot assigned arkamar Apr 9, 2026

jakubno marked this pull request as ready for review April 10, 2026 11:34

jakubno requested review from ValentaTomas and dobrac as code owners April 10, 2026 11:34

claude bot reviewed Apr 10, 2026


arkamar approved these changes Apr 10, 2026


jakubno merged commit d0ac010 into main Apr 10, 2026
48 checks passed

jakubno deleted the fix/nbd-timeouts branch April 10, 2026 15:26
