fix(managedstream): stop terminal ingest retry storms #294
Greptile Summary
This PR changes managed stream ingest retries to persist terminal hosted failures.
Confidence Score: 2/5
This should be fixed before merging.
Important Files Changed: internal/managedstream/stream.go and internal/managedobserve/daemon.go
Reviews (1): Last reviewed commit: "fix(managedstream): stop terminal ingest..."
    if err != nil {
        return false, err
    }
    return now.Before(until), nil
Cooldown still retries
After CooldownUntil passes, cooldownActive returns false for the same FailureConfig, so Flush posts the same terminally rejected rows again and then records a new 15-minute cooldown. For an organization mismatch or a schema/config 400, this replays the same unsent ledger data every cooldown window even though nothing has changed. That slows the retry storm but does not stop terminal failures until inputs change.
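The difference between a time-based cooldown and a pause keyed on the rejected config can be sketched as follows. This is a minimal illustration, not the PR's code: `failureState`, `cooldownOnly`, and `pausedUntilChanged` are hypothetical names, and only `FailureConfig` and `CooldownUntil` are taken from the review comment.

```go
package main

import (
	"fmt"
	"time"
)

// failureState is a hypothetical stand-in for the persisted failure record.
type failureState struct {
	FailureConfig string    // fingerprint of the config that hosted rejected
	CooldownUntil time.Time // deadline of the time-based cooldown
}

// cooldownOnly mirrors the behavior the review describes: once the deadline
// passes, the same rejected config is allowed to post again.
func cooldownOnly(s failureState, now time.Time) bool {
	return now.Before(s.CooldownUntil)
}

// pausedUntilChanged keeps a terminal failure paused for as long as the
// current fingerprint equals the rejected one, regardless of elapsed time.
func pausedUntilChanged(s failureState, current string) bool {
	return s.FailureConfig != "" && s.FailureConfig == current
}

func main() {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	s := failureState{FailureConfig: "abc", CooldownUntil: start.Add(15 * time.Minute)}
	later := start.Add(16 * time.Minute)

	fmt.Println(cooldownOnly(s, later))       // false: rejected rows would be posted again
	fmt.Println(pausedUntilChanged(s, "abc")) // true: inputs unchanged, still paused
	fmt.Println(pausedUntilChanged(s, "def")) // false: config changed, streaming may resume
}
```

Under this assumption the cooldown deadline no longer decides whether a terminal failure replays; only a fingerprint change does.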
    values := strings.Join([]string{
        strings.TrimSpace(opts.CloudURL),
        strings.TrimSpace(opts.OrganizationID),
        strings.TrimSpace(opts.InstallationID),
        strings.TrimSpace(opts.InstallToken),
    }, "\x00")
Fingerprint misses payload config
The persisted failure only clears when this fingerprint changes, but the request payload also includes DeviceLabel and the per-flush DeploymentVersion(). If hosted rejects device.label or device.deployment_version with a terminal 400/422, fixing that effective upload config does not change FailureConfig, and the cooldown check runs before the payload is rebuilt. The stream can therefore remain paused even after the bad device or deployment value has been corrected.
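One way to address this would be to fold every field that shapes the request into the fingerprint. The sketch below is an assumption, not the PR's implementation: the `uploadConfig` type and `fingerprint` function are hypothetical, and hashing the joined values with SHA-256 is a choice made here for illustration (the original snippet only shows the NUL-joined string).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// uploadConfig is a hypothetical view of everything that shapes the request.
type uploadConfig struct {
	CloudURL          string
	OrganizationID    string
	InstallationID    string
	InstallToken      string
	DeviceLabel       string // part of the payload, missing from the original fingerprint
	DeploymentVersion string // resolved per flush, also missing from the original
}

// fingerprint joins the trimmed fields with NUL separators, as the original
// snippet does, and hashes the result so the token is not persisted in clear.
func fingerprint(c uploadConfig) string {
	fields := []string{
		c.CloudURL, c.OrganizationID, c.InstallationID,
		c.InstallToken, c.DeviceLabel, c.DeploymentVersion,
	}
	for i := range fields {
		fields[i] = strings.TrimSpace(fields[i])
	}
	sum := sha256.Sum256([]byte(strings.Join(fields, "\x00")))
	return hex.EncodeToString(sum[:])
}

func main() {
	bad := uploadConfig{OrganizationID: "org-1", DeviceLabel: "bad label"}
	fixed := bad
	fixed.DeviceLabel = "laptop"
	// Correcting the device label now changes the fingerprint.
	fmt.Println(fingerprint(bad) != fingerprint(fixed))
}
```

For this to help, the fingerprint would have to be computed after DeploymentVersion() is resolved for the current flush, so the check sees the same values the payload will carry.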
Closing this version. The persisted cooldown approach is too broad, and Greptile found three problems: unchanged terminal failures can still replay after the cooldown expires, some config changes do not clear the pause, and the daemon keeps a fixed config/credential snapshot. I am replacing it with a leaner implementation.

Where We Are
Managed observe retried hosted ingest 400s as if smaller batches might fix them. An organization mismatch or schema/config validation error is terminal until inputs change, so one broken install could replay rejected ledger data and create an ingest storm.
Where We Want To Go
Terminal hosted 400/422 validation failures should stop hot-looping without dropping unsent ledger rows. Payload-size responses should still get smaller-batch retry. A credential/config change should clear the terminal state and let streaming resume.
How do we get there
Classify hosted ingest failures before retrying. Keep smaller-batch retry for 413 and explicit size-like 400/422 responses, pause terminal hosted validation failures in persisted cooldown state without advancing the cursor, and clear that state when the effective upload config changes. Added regressions for org mismatch, generic 400, 422 validation, auth failures, malformed cooldown state, credential refresh, scalar size-like validation, and persisted state shape.
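The classification step described above might look roughly like this. It is a sketch under stated assumptions: the `action` type, `classify`, `looksSizeRelated`, and the size hint strings are all hypothetical, and how the PR treats auth failures is not stated in the description, so they fall through to a plain retry here.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// action is a hypothetical result type for the retry decision.
type action int

const (
	retryLater        action = iota // transient or unclassified: ordinary retry
	retrySmallerBatch               // payload too big: split the batch and retry
	pauseTerminal                   // rejected until inputs change: keep the cursor, stop posting
)

// looksSizeRelated reports whether a response body hints at a payload-size
// problem. The hint strings are assumptions, not the hosted API's wording.
func looksSizeRelated(body string) bool {
	b := strings.ToLower(body)
	for _, hint := range []string{"too large", "payload size", "batch size"} {
		if strings.Contains(b, hint) {
			return true
		}
	}
	return false
}

// classify maps a hosted ingest response to a retry decision.
func classify(status int, body string) action {
	switch status {
	case http.StatusRequestEntityTooLarge: // 413
		return retrySmallerBatch
	case http.StatusBadRequest, http.StatusUnprocessableEntity: // 400, 422
		if looksSizeRelated(body) {
			return retrySmallerBatch
		}
		return pauseTerminal
	default:
		// Assumption: everything else, including auth failures, retries normally.
		return retryLater
	}
}

func main() {
	fmt.Println(classify(413, "") == retrySmallerBatch)
	fmt.Println(classify(400, "organization mismatch") == pauseTerminal)
	fmt.Println(classify(422, "batch size exceeds limit") == retrySmallerBatch)
}
```

Classifying before any retry means a terminal rejection never enters the smaller-batch loop, which is the path that produced the storm.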
Verified with go test ./..., go vet ./..., go test -race ./internal/managedstream ./internal/managedobserve, codex-review, and autoreview.