Skip to content

storage: make cmdTimeout configurable via plugin config#236

Merged
kinwin-ustc merged 1 commit into
TencentCloud:masterfrom
WaylandYang:feat/configurable-cmd-timeout
May 15, 2026
Merged

storage: make cmdTimeout configurable via plugin config#236
kinwin-ustc merged 1 commit into
TencentCloud:masterfrom
WaylandYang:feat/configurable-cmd-timeout

Conversation

@WaylandYang
Copy link
Copy Markdown
Contributor

Implements proposal (1) from #235.

Adds a new optional cmd_timeout field to the storage plugin's TOML config. When unset (or zero) the existing 3 s default is preserved exactly — no behavior change for current deployments.

Motivation

shell.go's cp / truncate / e2fsck / resize2fs / mkfs.ext4 chain shares a single 3 s timeout per command. On the pre-formatted pool fast path the small images comfortably finish inside 3 s. On the live-create slow path (writable_layer_size that doesn't match pool_default_format_size_list), 3 s on multi-GiB images is tight under concurrent load — that's the contention window where the reflink-copy race in #235 surfaces.

Letting operators raise the per-command timeout from config is the smallest knob that lets the live-create path survive without recompiling cubelet.

Config example

[plugins."io.cubelet.internal.v1.storage"]
  cmd_timeout = "30s"

What's deliberately not changed

  • Default value stays 3 s. Bumping the default is a behavior change you may want to do separately after measuring P95 on a host that actually hits the slow path.
  • Single timeout shared by all the commands. Splitting per-command (e.g. faster cp, slower e2fsck) is doable but expands scope; happy to do as a follow-up PR if you'd like.
  • shell.go's cmdTimeout is now var instead of const, set during (*local).init. Other styles (passing timeout into each function, plumbing through a context) are also options; this matches the existing package-level state in this package (localStorage, defaultPoolSize, ...).

Out-of-scope observations

Tested

  • go vet ./storage/... — clean
  • go build ./storage/... — clean
  • gofmt -l — clean

@fslongjin
Copy link
Copy Markdown
Member

/cubebot review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

Claude finished @fslongjin's task in 3m 11s —— View job


PR Review: storage: make cmdTimeout configurable via plugin config

  • Gather context: read changed files and diff
  • Analyze implementation for bugs, security, and correctness
  • Post review feedback

Overall: The approach is clean and minimal — a tomlext.Duration config field, zero-means-default, package-level var set once during init. This matches existing patterns (ReconcileInterval, defaultPoolSize, etc.). A few things to consider:


Bug: cmdTimeout is not applied during checkPoolType

checkPoolType calls newExt4BaseRaw and newExt4RawByReflinkCopy (both use cmdTimeout), but it runs before localStorage.init(ic), which is where cmdTimeout gets set from config:

plugin.go:84-97 (init order):

checkPoolType(localStorage.config)   // ← uses cmdTimeout, but it's still 3s here
...
if err := localStorage.init(ic); err != error {  // ← cmdTimeout is set here

local.go:182-185 (where it's actually set):

if l.config.CmdTimeout == 0 {
    l.config.CmdTimeout = tomlext.FromStdTime(defaultCmdTimeout)
}
cmdTimeout = tomlext.ToStdTime(l.config.CmdTimeout)

So during checkPoolType, the user's cmd_timeout config is silently ignored and the 3s default is used instead. In practice the probe image is tiny (512000 bytes), so 3s is likely fine — but it's still a logical gap that could surprise operators who set a high timeout expecting it to apply everywhere.

Suggested fix: Move the cmdTimeout assignment earlier, before checkPoolType:

Fix this →


Minor: No validation for negative durations

tomlext.Duration.UnmarshalText will happily parse "-1s", producing a negative time.Duration. A negative timeout passed to context.WithTimeout immediately expires the context, making every shell command fail silently at startup. Consider adding a guard:

if l.config.CmdTimeout < 0 {
    return fmt.Errorf("cmd_timeout must be non-negative, got %v", l.config.CmdTimeout)
}

This is low severity — operators are unlikely to misconfigure this — but a single comparison prevents a confusing failure mode.


Nit: Comment in shell.go references local.init

The comment on cmdTimeout says "set by (l *local).init", which is accurate today but creates a subtle coupling — if someone moves the assignment (e.g. to fix the checkPoolType ordering above), the comment becomes stale. Consider a more generic phrasing like "set during plugin initialization" rather than naming a specific function.


What looks good

  • Backward compatibility: Zero/unset falls through to the existing 3s default. No behavior change for current deployments.
  • Consistent pattern: Matches how ReconcileInterval and other config fields handle defaults.
  • var instead of const: Pragmatic choice, consistent with localStorage, defaultPoolSize, etc.
  • tomlext.Duration: Reuses the existing TOML duration type with ParseDuration semantics — "30s", "5m", etc. work naturally.
  • PR scope: Focused on the single knob. Per-command timeouts and better error diagnostics correctly deferred.

Comment thread Cubelet/storage/local.go Outdated
if l.config.ReconcileInterval == 0 {
l.config.ReconcileInterval = tomlext.FromStdTime(5 * time.Minute)
}
if l.config.CmdTimeout == 0 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more appropriate to move this initialization code to the InitFn in plugin.go

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hlz260 — good catch, same ordering issue cubebot flagged. Pushed an update that moves cmdTimeout init into InitFn before checkPoolType, plus a non-negative guard and the comment cleanup cubebot also suggested. PTAL.

Implements proposal (1) from TencentCloud#235.

Adds an optional `cmd_timeout` field to the storage plugin's TOML
config. When unset, the existing 3 s default is preserved — no
behavior change. Multi-GiB ext4 ops on the live-create slow path
can need longer than 3 s under concurrent load; this knob lets
operators raise it without recompiling.

Example:

    [plugins."io.cubelet.internal.v1.storage"]
      cmd_timeout = "30s"

Signed-off-by: Wayland Yang <wayland0916@gmail.com>
@WaylandYang WaylandYang force-pushed the feat/configurable-cmd-timeout branch from 98cab76 to b40deec Compare May 14, 2026 08:04
Copy link
Copy Markdown
Contributor

@hlz260 hlz260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kinwin-ustc kinwin-ustc merged commit 2d327d2 into TencentCloud:master May 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants