fix(watcher): self-heal orphaned watchers when agent profile is soft-deleted#1094
fix(watcher): self-heal orphaned watchers when agent profile is soft-deleted#1094nlenepveu wants to merge 3 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughDetects soft-deleted agent profiles, stamps and disables referencing watches across GitHub/Jira/Linear during dispatch pre-flight or forced profile deletion, persists error metadata in watch rows, enumerates watcher dependencies for the deletion UI, and wires these adapters at orchestrator startup. ChangesWatcher Self-Heal on Soft-Deleted Profiles
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels:
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
| Filename | Overview |
|---|---|
| apps/backend/internal/orchestrator/watcher_dispatch.go | Adds ProfileLookup interface, preflightDeletedProfile, SelfHeal to WatcherSource, and the formatDeletedProfileCause helper — the core self-heal pipeline. Minor: formatDeletedProfileCause uses string concatenation instead of fmt.Sprintf %q, producing malformed output for profile names with embedded double quotes. |
| apps/backend/internal/orchestrator/watcher_dispatch_wiring.go | Renames initWatcherCoordinator to initWatcherCoordinatorLocked and holds s.mu across SetIssueTaskCreator, fixing the previously flagged unsynchronised access. Adds SetProfileLookup, getProfileLookup, and GitHub-specific preflight helpers with correct lock ordering (s.mu → c.mu, never nested in reverse). |
| apps/web/components/settings/agent-profile-delete-dialog.tsx | Expands AgentProfileDeleteConflictDialog to accept the new AgentProfileDeleteConflict shape and renders a watchers section. The static footer text is always shown even in watcher-only or session-only conflict scenarios. |
| apps/backend/internal/agent/runtime/lifecycle/profile_resolver.go | Adds DeletedProfileError and checkSoftDeleted helper; ResolveProfile now disambiguates soft-deleted from never-existed via a secondary GetAgentProfileIncludingDeleted lookup. Logic is sound and correctly fail-opens on non-ErrNoRows errors. |
| apps/backend/internal/agent/settings/controller/profile_crud.go | Extends prepareProfileDeletion to enumerate referencing watchers and returns ErrProfileInUseDetail with both ActiveSessions and Watchers; adds disableReferencingWatchers for the force-delete eager path. |
| apps/backend/cmd/kandev/orchestrator.go | Adds profileLookupAdapter, watcherDepsAdapter (lists/disables watchers across all four integrations), and wiring for WatcherDependencyChecker. All integration fields are nil-safe; DisableWatchersByAgentProfile is best-effort across integrations and logs per-failure. |
| apps/backend/internal/github/store.go | Adds addWatchSelfHealColumns (column-precheck ALTERs for both watch tables), tableColumns helper, and DisableIssueWatchWithError / DisableReviewWatchWithError. Uses fail-loud errors for self-heal columns as documented. |
| apps/backend/internal/linear/store_issue_watch.go | Splits issueWatchColumns into insert/select variants (select uses COALESCE for last_error), adds LastError/LastErrorAt to issueWatchRow and IssueWatch, and implements DisableIssueWatchWithError. |
| apps/backend/internal/jira/store.go | Mirrors linear store changes: column-precheck migrations, split insert/select columns with COALESCE, and DisableIssueWatchWithError. Consistent with the established Jira store migration idiom. |
| apps/backend/internal/agent/settings/store/sqlite.go | Extracts agentProfileSelectColumns constant shared across all read paths, adds GetAgentProfileIncludingDeleted alongside GetAgentProfile. |
| apps/web/app/actions/agents.ts | Expands DeleteProfileResult conflict shape with watchers array; 409 detection now triggers on either active_sessions or watchers being present in the response body. |
Sequence Diagram
sequenceDiagram
participant Bus
participant Coord as WatcherDispatchCoordinator
participant Lookup as ProfileLookup
participant Src as WatcherSource (Linear/Jira)
participant Store as Watcher Store
participant GH as Service (GitHub legacy)
Bus->>Coord: Dispatch(evt)
Coord->>Lookup: LookupProfile(agentProfileID)
alt profile soft-deleted
Lookup-->>Coord: (true, name, nil)
Coord->>Src: SelfHeal(evt, cause)
Src->>Store: DisableIssueWatchWithError(id, cause)
Store-->>Src: ok
Coord-->>Bus: return (no task created)
else profile live or lookup error
Lookup-->>Coord: (false, _, nil) or (_, _, err)
Coord->>Src: Reserve → BuildTaskRequest → CreateTask → AttachTaskID
end
Bus->>GH: createIssueTask(evt)
GH->>Lookup: LookupProfile(evt.AgentProfileID)
alt profile soft-deleted
Lookup-->>GH: (true, name, nil)
GH->>Store: DisableIssueWatchWithError(watchID, cause)
GH-->>Bus: return
else live
GH->>Src: reserveIssueWatch → createTask → …
end
Reviews (10): Last reviewed commit: "fix(e2e): unwrap {watches} envelope in l..." | Re-trigger Greptile
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/backend/internal/agent/settings/controller/reconciler_test.go`:
- Around line 120-121: The test fake fakeStore currently returns (nil, nil) from
GetAgentProfileIncludingDeleted which hides broken call paths; update
fakeStore.GetAgentProfileIncludingDeleted to return realistic semantics by
returning a valid *models.AgentProfile (with minimal required fields) and nil
error for known/test IDs and returning a not-found error (or
repository-equivalent error like sql.ErrNoRows or a declared ErrNotFound) for
unknown IDs so callers can exercise both success and failure paths.
In `@apps/backend/internal/linear/store_issue_watch_disable_test.go`:
- Around line 30-34: The assertion that watch.LastErrorAt lies between before :=
time.Now().UTC() and after := time.Now().UTC() is flaky due to DB timestamp
precision; modify the test around DisableIssueWatchWithError (and the similar
check at lines 52-53) to normalize precision or allow a small tolerance: either
truncate before/after and watch.LastErrorAt to time.Second (or the DB precision)
before comparing, or assert LastErrorAt is within ±1 second of the expected
window so the check is stable across different DB timestamp resolutions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1175cea5-70b4-4488-af3c-7a8daaec2506
📒 Files selected for processing (39)
apps/backend/cmd/kandev/main.goapps/backend/cmd/kandev/orchestrator.goapps/backend/internal/agent/runtime/lifecycle/profile_resolver.goapps/backend/internal/agent/runtime/lifecycle/profile_resolver_test.goapps/backend/internal/agent/settings/controller/controller.goapps/backend/internal/agent/settings/controller/profile_crud.goapps/backend/internal/agent/settings/controller/profile_crud_watcher_deps_test.goapps/backend/internal/agent/settings/controller/reconciler_test.goapps/backend/internal/agent/settings/handlers/handlers.goapps/backend/internal/agent/settings/store/errors.goapps/backend/internal/agent/settings/store/sqlite.goapps/backend/internal/agent/settings/store/sqlite_soft_delete_test.goapps/backend/internal/agent/settings/store/store.goapps/backend/internal/github/models.goapps/backend/internal/github/service.goapps/backend/internal/github/store.goapps/backend/internal/github/store_watch_disable_test.goapps/backend/internal/jira/models.goapps/backend/internal/jira/store.goapps/backend/internal/jira/store_issue_watch_disable_test.goapps/backend/internal/linear/models.goapps/backend/internal/linear/store.goapps/backend/internal/linear/store_issue_watch.goapps/backend/internal/linear/store_issue_watch_disable_test.goapps/backend/internal/orchestrator/event_handlers_github.goapps/backend/internal/orchestrator/event_handlers_github_selfheal_test.goapps/backend/internal/orchestrator/event_handlers_github_test.goapps/backend/internal/orchestrator/event_handlers_jira.goapps/backend/internal/orchestrator/event_handlers_jira_test.goapps/backend/internal/orchestrator/event_handlers_linear.goapps/backend/internal/orchestrator/service.goapps/backend/internal/orchestrator/source_jira.goapps/backend/internal/orchestrator/source_jira_test.goapps/backend/internal/orchestrator/source_linear.goapps/backend/internal/orchestrator/source_linear_test.goapps/backend/internal/orchestrator/watcher_dispatch.goapps/backend/internal/orchestrator/watcher_dispatch_selfheal_test.goapps/backend/internal/orchestrator/watcher_dispatch_test.goapps/backend/internal/orchestrator/watcher_dispatch_wiring.go
There was a problem hiding this comment.
5 issues found across 39 files
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
…y check watcherDepsAdapter previously called ListAll*IssueWatches and counted every row including already-disabled ones. After the self-heal pre-flight starts running, watchers will routinely sit in enabled=0 state — those should not trigger a 409 confirmation prompt when the user deletes the orphaned profile that originally bound them. Switch to ListEnabled* in all three adapters (linear, jira, github). Adds ListEnabledIssueWatches / ListEnabledReviewWatches shims on github.Service to mirror the store-level methods (linear and jira already exposed them). Spotted by Greptile on PR kdlbs#1094 cycle 4.
carlosflorencio
left a comment
There was a problem hiding this comment.
Reviewed end to end. The self-heal logic is solid: preflight runs before Reserve (so a deleted-profile event never leaves a stale dedup row), the typed *DeletedProfileError is consumed via errors.Is/errors.As everywhere it matters, all three disable-watcher store tests already use a ±1-second window so the timestamp-precision flake CodeRabbit raised earlier is mitigated, and the 409 watchers field is additive on the wire. Test coverage matches the body checklist case-for-case.
Four small things worth doing before merge, none of them blockers, all left as inline comments:
github/store.goswallows real migration errors with_, _ =. The newlast_error/last_error_atcolumns are the only ones whose readers panic on scan if they end up missing, so the fail-loud guard the Jira/Linear stores use is worth borrowing.GetAgentProfileIncludingDeletedduplicates the 35-column SELECT projection fromGetAgentProfile. Worth extracting as aconstso the next column add can't drift between them.s.watcherCoordinatorands.issueTaskCreatorare written without the mutex thatSetProfileLookupcarefully added in 0f4824d. Production is safe because both setters run at boot, but the asymmetry will catch out a future test under-race.- The new GitHub preflight logger calls concatenate
kindinto the message string instead of passing it as a zap field, which fragments log aggregation.
(Skipped CodeRabbit's reconciler_test.go fakeStore finding because it's wrong: the fake delegates to GetAgentProfile and returns sql.ErrNoRows on miss, not (nil, nil).)
Nice work overall, looking forward to seeing the follow-up that covers the adjacent surfaces (automations, office_routines, etc.) you called out at the bottom.
Generated by Claude Code
| _, _ = s.db.Exec(`ALTER TABLE github_review_watches ADD COLUMN last_error TEXT NOT NULL DEFAULT ''`) | ||
| _, _ = s.db.Exec(`ALTER TABLE github_review_watches ADD COLUMN last_error_at DATETIME`) | ||
| _, _ = s.db.Exec(`ALTER TABLE github_issue_watches ADD COLUMN last_error TEXT NOT NULL DEFAULT ''`) | ||
| _, _ = s.db.Exec(`ALTER TABLE github_issue_watches ADD COLUMN last_error_at DATETIME`) |
There was a problem hiding this comment.
These four ALTER TABLE calls drop every error on the floor with _, _ =. I know the surrounding initSchema already does this, but the new columns are the only ones where the reader side panics if they end up missing: models.IssueWatch.LastError / LastErrorAt get scanned unconditionally, so a driver-level failure here (lock contention, FS permissions, disk pressure, etc.) turns into a confusing scan panic on the next poll instead of a clear boot error.
The sibling stores already solve this. apps/backend/internal/jira/store.go:116-132 and apps/backend/internal/linear/store.go:107 use a tableColumns() precheck so a fresh install skips the ALTER and only real failures bubble up. Worth doing the same here:
| _, _ = s.db.Exec(`ALTER TABLE github_review_watches ADD COLUMN last_error TEXT NOT NULL DEFAULT ''`) | |
| _, _ = s.db.Exec(`ALTER TABLE github_review_watches ADD COLUMN last_error_at DATETIME`) | |
| _, _ = s.db.Exec(`ALTER TABLE github_issue_watches ADD COLUMN last_error TEXT NOT NULL DEFAULT ''`) | |
| _, _ = s.db.Exec(`ALTER TABLE github_issue_watches ADD COLUMN last_error_at DATETIME`) | |
| if err := s.addWatchSelfHealColumns(); err != nil { | |
| return err | |
| } |
with a helper alongside the other migration helpers in this file:
func (s *Store) addWatchSelfHealColumns() error {
for _, table := range []string{"github_review_watches", "github_issue_watches"} {
cols, err := s.tableColumns(table)
if err != nil {
return fmt.Errorf("read %s columns: %w", table, err)
}
if _, ok := cols["last_error"]; !ok {
if _, err := s.db.Exec("ALTER TABLE " + table + " ADD COLUMN last_error TEXT NOT NULL DEFAULT ''"); err != nil {
return fmt.Errorf("add %s.last_error: %w", table, err)
}
}
if _, ok := cols["last_error_at"]; !ok {
if _, err := s.db.Exec("ALTER TABLE " + table + " ADD COLUMN last_error_at DATETIME"); err != nil {
return fmt.Errorf("add %s.last_error_at: %w", table, err)
}
}
}
return nil
}github.Store doesn't have a tableColumns helper yet, so you'd need to copy the small one from jira/store.go:264 over. Cheap, and it pays off the moment the next ALTER lands.
Generated by Claude Code
There was a problem hiding this comment.
Addressed in 07fdf0d — added tableColumns + addWatchSelfHealColumns to github/store.go and replaced the four _, _ = ALTER calls. Idempotent precheck on github_review_watches + github_issue_watches; mirrors the jira/linear shape, with the boot now failing loud rather than silently dropping a driver error that the unconditional readers would later turn into a scan panic.
| func (r *sqliteRepository) GetAgentProfileIncludingDeleted(ctx context.Context, id string) (*models.AgentProfile, error) { | ||
| row := r.ro.QueryRowContext(ctx, r.ro.Rebind(` | ||
| SELECT id, agent_id, name, agent_display_name, model, mode, migrated_from, | ||
| auto_approve, dangerously_skip_permissions, allow_indexing, | ||
| cli_passthrough, user_modified, plan, cli_flags, | ||
| COALESCE(env_vars, '[]'), | ||
| created_at, updated_at, deleted_at, | ||
| COALESCE(workspace_id, ''), COALESCE(role, ''), COALESCE(icon, ''), | ||
| COALESCE(reports_to, ''), COALESCE(skill_ids, '[]'), | ||
| COALESCE(desired_skills, '[]'), COALESCE(custom_prompt, ''), | ||
| COALESCE(status, 'idle'), COALESCE(pause_reason, ''), | ||
| last_run_finished_at, | ||
| COALESCE(max_concurrent_sessions, 1), COALESCE(cooldown_sec, 0), | ||
| COALESCE(skip_idle_runs, 0), COALESCE(consecutive_failures, 0), | ||
| COALESCE(failure_threshold, 3), COALESCE(executor_preference, ''), | ||
| COALESCE(budget_monthly_cents, 0), | ||
| COALESCE(settings, '{}'), COALESCE(permissions, '{}') | ||
| FROM agent_profiles WHERE id = ? | ||
| `), id) | ||
| profile, err := scanAgentProfile(row) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| return r.applyLegacyBackfill(ctx, profile), nil | ||
| } |
There was a problem hiding this comment.
This SELECT projection is byte-for-byte identical to GetAgentProfile at lines 732-756 except for the trailing AND deleted_at IS NULL. The next column added to agent_profiles will need to land in both spots, and if it only makes it into one, the soft-delete path will silently scan zero values for that field, which is exactly the kind of bug this PR is trying to prevent in a different layer.
The ListAgentProfiles query right below at 787 has the same shape too. A const extracted once would cover all three:
const agentProfileSelectColumns = `
SELECT id, agent_id, name, agent_display_name, model, mode, migrated_from,
auto_approve, dangerously_skip_permissions, allow_indexing,
cli_passthrough, user_modified, plan, cli_flags,
COALESCE(env_vars, '[]'),
created_at, updated_at, deleted_at,
COALESCE(workspace_id, ''), COALESCE(role, ''), COALESCE(icon, ''),
COALESCE(reports_to, ''), COALESCE(skill_ids, '[]'),
COALESCE(desired_skills, '[]'), COALESCE(custom_prompt, ''),
COALESCE(status, 'idle'), COALESCE(pause_reason, ''),
last_run_finished_at,
COALESCE(max_concurrent_sessions, 1), COALESCE(cooldown_sec, 0),
COALESCE(skip_idle_runs, 0), COALESCE(consecutive_failures, 0),
COALESCE(failure_threshold, 3), COALESCE(executor_preference, ''),
COALESCE(budget_monthly_cents, 0),
COALESCE(settings, '{}'), COALESCE(permissions, '{}')
FROM agent_profiles`Then GetAgentProfile calls Rebind(agentProfileSelectColumns + " WHERE id = ? AND deleted_at IS NULL") and this method calls Rebind(agentProfileSelectColumns + " WHERE id = ?"). Touching ListAgentProfiles too is optional, but folding it in keeps all three in lockstep going forward.
cubic flagged this one too.
Generated by Claude Code
There was a problem hiding this comment.
Addressed in 07fdf0d — extracted agentProfileSelectColumns const at the top of internal/agent/settings/store/sqlite.go. All three call sites (GetAgentProfile, GetAgentProfileIncludingDeleted, ListAgentProfiles) now use Rebind(agentProfileSelectColumns + " WHERE ..."). Next column add lands in one spot.
| func (s *Service) preflightDeletedProfileForGitHub( | ||
| ctx context.Context, kind, profileID, watchID string, | ||
| disable func(ctx context.Context, watchID, cause string) error, | ||
| ) bool { | ||
| lookup := s.getProfileLookup() | ||
| if lookup == nil || profileID == "" { | ||
| return false | ||
| } | ||
| deleted, name, err := lookup.LookupProfile(ctx, profileID) | ||
| if err != nil { | ||
| s.logger.Warn("github "+kind+": profile lookup failed, falling through", | ||
| zap.String("profile_id", profileID), zap.Error(err)) | ||
| return false | ||
| } | ||
| if !deleted { | ||
| return false | ||
| } | ||
| cause := formatDeletedProfileCause(profileID, name) | ||
| s.logger.Warn("github "+kind+": agent profile soft-deleted, self-healing watcher", | ||
| zap.String("watch_id", watchID), | ||
| zap.String("profile_id", profileID), | ||
| zap.String("profile_name", name)) | ||
| if disable != nil { | ||
| if err := disable(ctx, watchID, cause); err != nil { | ||
| s.logger.Error("github "+kind+": self-heal disable failed", | ||
| zap.String("watch_id", watchID), zap.Error(err)) | ||
| } | ||
| } |
There was a problem hiding this comment.
Tiny one: the three log calls below concatenate kind into the message string ("github "+kind+": ..."). That gives the log aggregator two separate message strings to group on ("github issue: ..." and "github review: ...") instead of one filterable family. Most of the orchestrator passes that kind of axis as a structured zap field instead.
| func (s *Service) preflightDeletedProfileForGitHub( | |
| ctx context.Context, kind, profileID, watchID string, | |
| disable func(ctx context.Context, watchID, cause string) error, | |
| ) bool { | |
| lookup := s.getProfileLookup() | |
| if lookup == nil || profileID == "" { | |
| return false | |
| } | |
| deleted, name, err := lookup.LookupProfile(ctx, profileID) | |
| if err != nil { | |
| s.logger.Warn("github "+kind+": profile lookup failed, falling through", | |
| zap.String("profile_id", profileID), zap.Error(err)) | |
| return false | |
| } | |
| if !deleted { | |
| return false | |
| } | |
| cause := formatDeletedProfileCause(profileID, name) | |
| s.logger.Warn("github "+kind+": agent profile soft-deleted, self-healing watcher", | |
| zap.String("watch_id", watchID), | |
| zap.String("profile_id", profileID), | |
| zap.String("profile_name", name)) | |
| if disable != nil { | |
| if err := disable(ctx, watchID, cause); err != nil { | |
| s.logger.Error("github "+kind+": self-heal disable failed", | |
| zap.String("watch_id", watchID), zap.Error(err)) | |
| } | |
| } | |
| func (s *Service) preflightDeletedProfileForGitHub( | |
| ctx context.Context, kind, profileID, watchID string, | |
| disable func(ctx context.Context, watchID, cause string) error, | |
| ) bool { | |
| lookup := s.getProfileLookup() | |
| if lookup == nil || profileID == "" { | |
| return false | |
| } | |
| deleted, name, err := lookup.LookupProfile(ctx, profileID) | |
| if err != nil { | |
| s.logger.Warn("github watcher: profile lookup failed, falling through", | |
| zap.String("kind", kind), | |
| zap.String("profile_id", profileID), | |
| zap.Error(err)) | |
| return false | |
| } | |
| if !deleted { | |
| return false | |
| } | |
| cause := formatDeletedProfileCause(profileID, name) | |
| s.logger.Warn("github watcher: agent profile soft-deleted, self-healing", | |
| zap.String("kind", kind), | |
| zap.String("watch_id", watchID), | |
| zap.String("profile_id", profileID), | |
| zap.String("profile_name", name)) | |
| if disable != nil { | |
| if err := disable(ctx, watchID, cause); err != nil { | |
| s.logger.Error("github watcher: self-heal disable failed", | |
| zap.String("kind", kind), | |
| zap.String("watch_id", watchID), | |
| zap.Error(err)) | |
| } | |
| } | |
| return true | |
| } |
Generated by Claude Code
There was a problem hiding this comment.
Addressed in 07fdf0d — kind is now passed as zap.String("kind", kind) in all three log calls. Message strings collapse to "github watcher: profile lookup failed, falling through" / "github watcher: agent profile soft-deleted, self-healing" / "github watcher: self-heal disable failed" so the aggregator groups them as one filterable family.
| if lookup := s.getProfileLookup(); lookup != nil { | ||
| s.watcherCoordinator.SetProfileLookup(lookup) | ||
| } | ||
| } |
There was a problem hiding this comment.
A bit of an asymmetric locking situation here that's worth tightening. Commit 0f4824d correctly put s.profileLookup behind s.mu (read in getProfileLookup, write in SetProfileLookup). But s.watcherCoordinator and s.issueTaskCreator are still written without the lock, both in this initWatcherCoordinator body and in SetIssueTaskCreator over at event_handlers_github.go:133-136. Meanwhile SetProfileLookup reads s.watcherCoordinator under the lock and dispatchWatcherEvent (line 161 below) reads it with no lock at all.
In production both setters run sequentially at boot before any bus handlers fire, so today it's fine. But the comment on lines 26-31 explicitly says tests swap creators between scenarios, and that's the path -race will eventually catch.
Suggested fix: take s.mu.Lock() around the whole body here, and have SetIssueTaskCreator set s.issueTaskCreator under the same lock before calling this. Inside the locked section, read s.profileLookup directly instead of calling getProfileLookup() (which re-takes the read lock and would deadlock).
Generated by Claude Code
There was a problem hiding this comment.
Addressed in 07fdf0d — renamed to initWatcherCoordinatorLocked (callers MUST hold s.mu write). SetIssueTaskCreator now takes the lock, writes s.issueTaskCreator, then calls the locked variant. Inside the locked section we read s.profileLookup directly (no getProfileLookup() re-entry deadlock). dispatchWatcherEvent reads both fields via new getIssueTaskCreator() / getWatcherCoordinator() RLock helpers. Race-detector tests still pass.
| "error": "agent profile is used by active session(s)", | ||
| "error": "agent profile is used by active session(s) or watcher(s)", | ||
| "active_sessions": inUseErr.ActiveSessions, | ||
| "watchers": inUseErr.Watchers, |
There was a problem hiding this comment.
The backend now returns watchers in the 409 response, but the frontend delete flow still only models active sessions. deleteAgentProfileAction only returns { status: "conflict", activeSessions }, and AgentProfileDeleteConflictDialog only renders sessions.
That means the new backend behavior does not deliver the intended confirmation UX for the main new case: “this profile is referenced by watchers, continue?” A watcher-only conflict can become a generic failed-delete path or a dialog without any watcher details, depending on the response shape.
Can we carry watchers through DeleteProfileResult, add a WatcherReference type, and render them in AgentProfileDeleteConflictDialog alongside active sessions? A small frontend test for a watcher-only 409 would cover the regression.
There was a problem hiding this comment.
Addressed in 07fdf0d (+ formatting in e31bfaa). Threaded watchers through: new WatcherReference type in agent-profile-errors.ts, DeleteProfileResult carries both lists, AgentProfileDeleteConflict carrier prop on AgentProfileDeleteConflictDialog renders watchers grouped by integration kind (Linear / Jira / GitHub Issues / GitHub PR Reviews). 409 trigger now fires on either active_sessions OR watchers non-empty so a watcher-only conflict pops the dialog. New agent-profile-delete-dialog.test.tsx covers all four shapes (watcher-only / sessions-only / combined / null).
| // then cleans up ephemeral tasks. Returns *ErrProfileInUseDetail when force | ||
| // is false and any active session OR referencing watcher exists — the UI uses | ||
| // the breakdown to render a confirmation dialog. Force=true skips both checks | ||
| // and lets the dispatch coordinator's pre-flight handle the watcher self-heal |
There was a problem hiding this comment.
This force-delete path does not actually disable the referencing watchers. It only skips the dependency check and relies on “the dispatch coordinator’s pre-flight handle the watcher self-heal on its next poll,” but that pre-flight only runs after a watcher emits a new unseen issue/PR event.
If the watcher has no new matching external item after the profile is deleted, it stays enabled indefinitely while pointing at a soft-deleted profile. That leaves the UI and stored watcher state inconsistent with the confirmation copy.
I think the forced delete path should disable the currently referencing watchers immediately, using the same Disable*WithError methods added in this PR, and stamp the deleted-profile cause there. The self-heal preflight can remain as a fallback for profiles deleted by reconciler/other paths.
There was a problem hiding this comment.
Addressed in 07fdf0d — WatcherDependencyChecker grew DisableWatchersByAgentProfile(ctx, profileID, cause). When prepareProfileDeletion runs with force=true, the new disableReferencingWatchers helper invokes it and logs the disable count. The cmd/kandev adapter dispatches to the per-integration Disable*WithError store methods (linear / jira / github_issue / github_review). The lazy preflight stays as the safety net for reconciler-driven deletes that don't pass through this controller. Test covers the force-path eager disable.
|
Thanks for working on this! Added some comments to further improve the fix 💪 |
|
Thanks for the thorough review @carlosflorencio — your verdict checklist is exactly the maintainer bar I was hoping to hit. Folding all six findings into this PR:
Pushing the round shortly. |
- github/store: replace silent _, _ = ALTER calls for the new last_error / last_error_at columns with a column-precheck (mirrors jira/linear). These are the only columns whose readers scan unconditionally — a driver-level failure would turn into a confusing scan panic instead of a clear boot error. New tableColumns helper. - agent/settings/store: extract agentProfileSelectColumns const so GetAgentProfile / GetAgentProfileIncludingDeleted / ListAgentProfiles stay in lockstep on future column additions. - orchestrator: pass kind as a zap.String field instead of concatenating into the log message; log aggregator groups "github watcher: ..." as one family with kind on the axis. - orchestrator: tighten lock symmetry. SetIssueTaskCreator now holds s.mu while writing s.issueTaskCreator and (re)initialising the coordinator. initWatcherCoordinatorLocked reads s.profileLookup directly (avoids read-lock re-entry deadlock). dispatchWatcherEvent uses lock-aware reads for s.issueTaskCreator and s.watcherCoordinator. - agent/settings/controller: force-delete now eagerly disables every referencing watcher with the deletion cause via new DisableWatchersByAgentProfile on WatcherDependencyChecker, instead of relying on the lazy preflight which never fires for filters that don't match new external events. - web: thread watchers through the 409 conflict response. New WatcherReference type, AgentProfileDeleteConflict carrier, dialog renders watcher rows grouped by integration kind. Frontend test covers watcher-only, sessions-only, combined, and null-conflict shapes.
|
There are some conflicts now, let me know if you want me to address them. Nice work 💪 |
|
I went back through this carefully and checked each of the six change requests against the current head. All six are genuinely addressed:
Test coverage is good. The only thing left is the rebase you already mentioned, since the branch is still conflicting with main and the failing E2E shards are tied to the stale base. Once it is rebased and CI is green this is good to go. Two minor optional notes: the eager-disable last_error string omits the profile name while the lazy path includes it, and force delete disables watchers just before the row delete, so a failed delete would leave them disabled. Both are low risk. |
…deleted When the reconciler soft-deletes an agent profile, watchers bound to it previously looped forever on "profile not found". The dispatch pipeline now self-heals: a coordinator/GitHub pre-flight detects the soft-deleted profile, disables the watcher with a stamped last_error, and short-circuits — no dead-on-arrival task, no dedup reservation. The 409 delete response carries the referencing watchers so the UI can confirm, and force-delete disables them eagerly. Folds in all PR kdlbs#1094 review feedback (CodeRabbit, cubic, greptile, maintainer).
Two low-risk follow-ups from the PR kdlbs#1094 maintainer review: - Eager force-delete cause now includes the (truncated) profile name via a formatDeletedProfileCause helper, matching the orchestrator preflight's cause shape — the settings banner shows "Kilo Profile" not a bare UUID. - Disable referencing watchers AFTER the row delete succeeds, not before, so a failed delete never strands watchers disabled against a still-live profile. The dispatch preflight remains the fallback if the post-delete disable itself fails.
b67df82 to
ae22ef0
Compare
|
Rebased onto latest
Also folded in your two optional notes:
Local gates all green: |
`getLinearIssueWatch` called `.find()` directly on the
`GET /api/v1/linear/watches/issue` response, but that endpoint returns a
`{ watches: [...] }` envelope (it always has — same on the pre-rebase
branch), so the call threw `TypeError: watches.find is not a function` and
failed the "force-deleting profile with watcher disables the watcher row"
spec in E2E Shard 5/10. Destructure the envelope before searching.
Also refresh the spec's explanatory comment: the eager watcher-disable now
runs in DeleteProfile after the row delete succeeds (not in
prepareProfileDeletion), per the maintainer's optional-note follow-up.
Verified locally: `pnpm e2e:run --host tests/settings/agent-profile-delete.spec.ts` → 1 passed.
|
CI is fully green now ✅ — all 10 E2E shards pass, including Shard 5/10 which was the only one still red after the rebase. That shard turned out to be a pre-existing bug in this PR's own e2e helper, not the stale base: Final state: rebased onto latest |
|
Heads up: one CI check came back red after the last push — Run Backend Tests (run 26719846305) — but it's a goroutine-leak flake, not a regression:
I don't have admin rights to re-run the job — could you kick off a re-run of the failed Backend Tests job? It should go green. Everything else (lint, Build, Frontend, Backend windows, all 10 E2E shards incl. the previously-failing 5/10) is passing. |
Summary
When the reconciler soft-deletes an agent profile because its agent type left the registry (CLI uninstalled, feature flag toggled off, host-utility probe failed), watchers that reference that profile loop forever on
"profile not found: sql: no rows in result set". Each poll creates a dead-on-arrival task; the watcher never disables itself; the user has to manually clean up.This PR makes the dispatch pipeline self-healing.
Behaviour change
WatcherDispatchCoordinator.Dispatch) checks the watcher's bound profile beforeReserve. If the profile is soft-deleted, the source'sSelfHealruns (setsenabled=0+ stampslast_error/last_error_at) and the dispatch short-circuits — no task created, no dedup reservation taken. Linear and Jira flow through this seam.createIssueTask/createReviewTask). The same check fires at the top of each function via a shared helper.ProfileResolver.ResolveProfilenow returns a typed*DeletedProfileError(wrappingstore.ErrAgentProfileDeleted) when the row is found but soft-deleted, distinguishing it from a genuinely missing row. Backed by a newGetAgentProfileIncludingDeletedmethod on the settings store.ErrProfileInUseDetailcarries a newWatchers []WatcherReferencefield. The 409 response now serializes it alongsideactive_sessionsso the delete confirmation dialog can say "this will also disable N watchers — continue?".force=truebypasses both checks; the dispatch pre-flight handles the cleanup on the next poll.Architecture
GitHub bypasses the coordinator but uses the same
ProfileLookupviaService.preflightDeletedProfileForGitHub*.Schema
Four watcher tables (
linear_issue_watches,jira_issue_watches,github_issue_watches,github_review_watches) gainedlast_error TEXT NOT NULL DEFAULT ''andlast_error_at DATETIMEcolumns via idempotentALTER TABLE ... ADD COLUMNmigrations. No FK changes.Test plan
make fmtcleango vet ./...cleango test -race -count=1 ./internal/...passesgolangci-lint run ./...— 0 issues*DeletedProfileErrorreturned on soft-deleted lookup;ErrAgentProfileDeletedmatches viaerrors.IsDisable*WithErrorflipsenabled, stampslast_error+last_error_atcreateIssueTaskandcreateReviewTaskself-heal on deleted profile and skip preflight for legacy empty-id eventsErrProfileInUseDetailblocks on watchers,force=truebypasses, no-watchers preserves happy pathOut of scope / follow-ups
automations,office_routines,office_channels,office_agent_memory) also storeagent_profile_idand could be orphaned by the same reconciler. Their dispatch paths differ enough to warrant separate design — filing as a follow-up.watchersarray in the 409 response. The backend wire format is in place; the React component update is a separate PR.