fix(managedstream): stop terminal ingest retry storms by michiosw · Pull Request #294 · kontext-security/kontext-cli

michiosw · 2026-06-15T14:34:34Z

Where We Are

Managed observe retried hosted ingest 400s as if smaller batches might fix them. An organization mismatch or schema/config validation error is terminal until inputs change, so one broken install could replay rejected ledger data and create an ingest storm.

Where We Want To Go

Terminal hosted 400/422 validation failures should stop hot-looping without dropping unsent ledger rows. Payload-size responses should still get smaller-batch retry. A credential/config change should clear the terminal state and let streaming resume.

How do we get there

Classify hosted ingest failures before retrying. Keep smaller-batch retry for 413 and explicit size-like 400/422 responses, pause terminal hosted validation failures in persisted cooldown state without advancing the cursor, and clear that state when the effective upload config changes. Added regressions for org mismatch, generic 400, 422 validation, auth failures, malformed cooldown state, credential refresh, scalar size-like validation, and persisted state shape.

Verified with go test ./..., go vet ./..., go test -race ./internal/managedstream ./internal/managedobserve, codex-review, and autoreview.

michiosw · 2026-06-15T14:34:48Z

fix(managedstream): stop terminal ingest retry storms #294 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

greptile-apps · 2026-06-15T15:17:53Z

Greptile Summary

This PR changes managed stream ingest retries to persist terminal hosted failures. The main changes are:

Adds cooldown failure fields to managed stream state.
Classifies hosted 400/422 responses before smaller-batch retry.
Preserves cursor position for terminal validation failures.
Clears terminal failure state when the stream config fingerprint changes.
Adds tests for cooldown, auth, redaction, malformed state, and state shape.

Confidence Score: 2/5

This should be fixed before merging.

Terminal hosted failures can still be replayed on every cooldown expiry with unchanged inputs.
Some effective upload config changes do not clear the persisted pause state.
The daemon path does not reload refreshed credentials/config while running.

internal/managedstream/stream.go and internal/managedobserve/daemon.go

Important Files Changed

Filename	Overview
internal/managedstream/stream.go	Adds terminal failure cooldown state and retry classification, but the cooldown and fingerprint logic can keep streaming paused or retrying incorrectly.
internal/managedobserve/daemon.go	Starts managed streaming with a fixed snapshot of config and credentials, which prevents live refresh from clearing terminal stream state.

_{Reviews (1): Last reviewed commit: "fix(managedstream): stop terminal ingest..." | Re-trigger Greptile}

greptile-apps · 2026-06-15T15:17:57Z

+	if err != nil {
+		return false, err
+	}
+	return now.Before(until), nil


Cooldown still retries After CooldownUntil passes, cooldownActive returns false for the same FailureConfig, so Flush posts the same terminally rejected rows again and then records a new 15 minute cooldown. For an organization mismatch or schema/config 400, this keeps replaying the same unsent ledger data every cooldown window even though nothing changed, which only slows the retry storm instead of stopping terminal failures until inputs change.

greptile-apps · 2026-06-15T15:17:57Z

+	values := strings.Join([]string{
+		strings.TrimSpace(opts.CloudURL),
+		strings.TrimSpace(opts.OrganizationID),
+		strings.TrimSpace(opts.InstallationID),
+		strings.TrimSpace(opts.InstallToken),
+	}, "\x00")


Fingerprint misses payload config The persisted failure only clears when this fingerprint changes, but the request payload also includes DeviceLabel and the per-flush DeploymentVersion(). If hosted rejects device.label or device.deployment_version with a terminal 400/422, fixing that effective upload config does not change FailureConfig, and the cooldown check runs before the payload is rebuilt. The stream can remain paused even after the bad device/deployment value was corrected.

michiosw · 2026-06-15T15:20:31Z

Closing this version. The persisted cooldown approach is too broad and Greptile found that unchanged terminal failures can still replay after cooldown expiry, some config changes do not clear the pause, and the daemon keeps a fixed config/credential snapshot. I am replacing it with a leaner implementation.

fix(managedstream): stop terminal ingest retry storms

4cf2d22

michiosw force-pushed the feat/stop-managedstream-terminal-400-storm branch from 87c52bc to 4cf2d22 Compare June 15, 2026 15:15

michiosw marked this pull request as ready for review June 15, 2026 15:15

greptile-apps Bot reviewed Jun 15, 2026

View reviewed changes

michiosw closed this Jun 15, 2026

michiosw mentioned this pull request Jun 15, 2026

fix(managedstream): avoid terminal ingest retry storms #295

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(managedstream): stop terminal ingest retry storms#294

fix(managedstream): stop terminal ingest retry storms#294
michiosw wants to merge 1 commit into
mainfrom
feat/stop-managedstream-terminal-400-storm

michiosw commented Jun 15, 2026 •

edited

Loading

Uh oh!

michiosw commented Jun 15, 2026

Uh oh!

greptile-apps Bot commented Jun 15, 2026

Uh oh!

greptile-apps Bot Jun 15, 2026

Uh oh!

greptile-apps Bot Jun 15, 2026

Uh oh!

michiosw commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

michiosw commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Where We Are

Where We Want To Go

How do we get there

Uh oh!

michiosw commented Jun 15, 2026

Uh oh!

greptile-apps Bot commented Jun 15, 2026

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Uh oh!

greptile-apps Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

michiosw commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

michiosw commented Jun 15, 2026 •

edited

Loading