Description
I'm not clear on most of the details here, but I gather from today's dogfood update that under some conditions, Linux guests on the Oxide system can mark their Crucible-backed storage device read-only, and that this condition is permanent. Self-service update can trigger it, but I'd be surprised if it were specific to self-service update. The update process can cause multiple Crucible downstairs instances for the same disk to fail transiently for some time. We assumed (apparently incorrectly) that upstack software would always recover from transient failures, but the read-only state appears to be permanent. If so, the same problem can probably occur without self-service update being involved at all, for example after a set of sled failures.
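For anyone checking whether a guest has been affected, here is a rough Python sketch that lists read-only mounts from text in the `/proc/mounts` format. It assumes the condition shows up inside the guest as a mount carrying the `ro` option, which this report does not confirm, and the function name `read_only_mounts` is made up for illustration. Some filesystems are legitimately mounted read-only, so treat any hit as a prompt to look closer rather than as proof.

```python
def read_only_mounts(mounts_text):
    """Return (device, mountpoint) pairs whose mount options include 'ro'.

    Expects text in the /proc/mounts format:
        device mountpoint fstype options dump pass
    Hypothetical helper for illustration only.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, _fstype, options = fields[:4]
        # Options are comma-separated; match 'ro' exactly so that
        # options such as 'errors=remount-ro' do not count as a hit.
        if "ro" in options.split(","):
            found.append((device, mountpoint))
    return found
```

Inside a guest, this could be run as `read_only_mounts(open("/proc/mounts").read())` and the result compared against the disks expected to be writable.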
@askfongjojo has more details about a specific instance affected by this.
@iliana mentioned in chat that "this might be why AWS tells customers to set the nvme timeout to INT_MAX".
See also oxidecomputer/crucible#1555.