Skip to content

fix(managedstream): stop terminal ingest retry storms#294

Closed
michiosw wants to merge 1 commit into
mainfrom
feat/stop-managedstream-terminal-400-storm
Closed

fix(managedstream): stop terminal ingest retry storms#294
michiosw wants to merge 1 commit into
mainfrom
feat/stop-managedstream-terminal-400-storm

Conversation

@michiosw

@michiosw michiosw commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Where We Are

Managed observe retried hosted ingest 400s as if smaller batches might fix them. An organization mismatch or schema/config validation error is terminal until inputs change, so one broken install could replay rejected ledger data and create an ingest storm.

Where We Want To Go

Terminal hosted 400/422 validation failures should stop hot-looping without dropping unsent ledger rows. Payload-size responses should still get smaller-batch retry. A credential/config change should clear the terminal state and let streaming resume.

How do we get there

Classify hosted ingest failures before retrying. Keep smaller-batch retry for 413 and explicit size-like 400/422 responses, pause terminal hosted validation failures in persisted cooldown state without advancing the cursor, and clear that state when the effective upload config changes. Added regressions for org mismatch, generic 400, 422 validation, auth failures, malformed cooldown state, credential refresh, scalar size-like validation, and persisted state shape.

Verified with go test ./..., go vet ./..., go test -race ./internal/managedstream ./internal/managedobserve, codex-review, and autoreview.

Copy link
Copy Markdown
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@michiosw michiosw force-pushed the feat/stop-managedstream-terminal-400-storm branch from 87c52bc to 4cf2d22 Compare June 15, 2026 15:15
@michiosw michiosw marked this pull request as ready for review June 15, 2026 15:15
@greptile-apps

greptile-apps Bot commented Jun 15, 2026

Copy link
Copy Markdown

Greptile Summary

This PR changes managed stream ingest retries to persist terminal hosted failures. The main changes are:

  • Adds cooldown failure fields to managed stream state.
  • Classifies hosted 400/422 responses before smaller-batch retry.
  • Preserves cursor position for terminal validation failures.
  • Clears terminal failure state when the stream config fingerprint changes.
  • Adds tests for cooldown, auth, redaction, malformed state, and state shape.

Confidence Score: 2/5

This should be fixed before merging.

  • Terminal hosted failures can still be replayed on every cooldown expiry with unchanged inputs.
  • Some effective upload config changes do not clear the persisted pause state.
  • The daemon path does not reload refreshed credentials/config while running.

internal/managedstream/stream.go and internal/managedobserve/daemon.go

Important Files Changed

Filename Overview
internal/managedstream/stream.go Adds terminal failure cooldown state and retry classification, but the cooldown and fingerprint logic can keep streaming paused or retrying incorrectly.
internal/managedobserve/daemon.go Starts managed streaming with a fixed snapshot of config and credentials, which prevents live refresh from clearing terminal stream state.

Reviews (1): Last reviewed commit: "fix(managedstream): stop terminal ingest..." | Re-trigger Greptile

if err != nil {
return false, err
}
return now.Before(until), nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Cooldown still retries After CooldownUntil passes, cooldownActive returns false for the same FailureConfig, so Flush posts the same terminally rejected rows again and then records a new 15 minute cooldown. For an organization mismatch or schema/config 400, this keeps replaying the same unsent ledger data every cooldown window even though nothing changed, which only slows the retry storm instead of stopping terminal failures until inputs change.

Comment on lines +430 to +435
values := strings.Join([]string{
strings.TrimSpace(opts.CloudURL),
strings.TrimSpace(opts.OrganizationID),
strings.TrimSpace(opts.InstallationID),
strings.TrimSpace(opts.InstallToken),
}, "\x00")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Fingerprint misses payload config The persisted failure only clears when this fingerprint changes, but the request payload also includes DeviceLabel and the per-flush DeploymentVersion(). If hosted rejects device.label or device.deployment_version with a terminal 400/422, fixing that effective upload config does not change FailureConfig, and the cooldown check runs before the payload is rebuilt. The stream can remain paused even after the bad device/deployment value was corrected.

@michiosw

Copy link
Copy Markdown
Contributor Author

Closing this version. The persisted cooldown approach is too broad and Greptile found that unchanged terminal failures can still replay after cooldown expiry, some config changes do not clear the pause, and the daemon keeps a fixed config/credential snapshot. I am replacing it with a leaner implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant