Skip to content

feat(sources): add Discord source indexing#750

Closed
aaf2tbz wants to merge 13 commits into
mainfrom
codex/discord-source-clean
Closed

feat(sources): add Discord source indexing#750
aaf2tbz wants to merge 13 commits into
mainfrom
codex/discord-source-clean

Conversation

@aaf2tbz
Copy link
Copy Markdown
Collaborator

@aaf2tbz aaf2tbz commented May 23, 2026

Summary

Rebuilds #728 as a clean Discord source implementation on current main. Adds Discord guild/channel/thread indexing through the live Discord REST API v10, source config/CLI/API support, direct embeddings, knowledge graph structure, and docs.

Changes

  • Adds discord source config support in @signet/core, including guild ID validation, token references, channel filters, message bounds, thread inclusion, and ISO since normalization.
  • Adds daemon Discord REST fetching, message chunking/embedding, graph indexing, startup sync, purge support, and /api/sources/discord.
  • Adds signet sources add discord CLI support.
  • Fixes feat(sources): add Discord source with live API indexing + knowledge graph #728 reviewer findings: guild-level active-thread route, real since filtering, and stable user-ID participant graph keys.
  • Documents Discord Sources in docs/SOURCES.md and docs/API.md.

Type

  • feat — new user-facing feature (bumps minor)
  • fix — bug fix
  • refactor — restructure without behavior change
  • chore — build, deps, config, docs
  • perf — performance improvement
  • test — test coverage

Packages affected

  • @signet/core
  • @signet/daemon
  • @signet/cli / dashboard
  • @signet/sdk
  • @signet/connector-*
  • @signet/web
  • predictor
  • Other: docs

Screenshots

N/A — no UI changes.

PR Readiness (MANDATORY)

  • Spec alignment validated (INDEX.md + dependencies.yaml)
  • Agent scoping verified on all new/changed data queries
  • Input/config validation and bounds checks added
  • Error handling and fallback paths tested (no silent swallow)
  • Security checks applied to admin/mutation endpoints
  • Docs updated for API/spec/status changes
  • Regression tests added for each bug fix
  • Lint/typecheck/tests pass locally

Migration Notes (if applicable)

  • Migration is idempotent
  • Daemon Rust parity reviewed or explicitly N/A
  • Rollback / compatibility note included in PR description

Rollback / compatibility: no database migration or persisted schema change. Discord source entries are stored in existing sources.json; rollback is removing the Discord source config entry and reverting these commits. Rust daemon parity is N/A because this extends the TypeScript daemon source indexing path only.

Testing

  • bun test passes
  • bun run typecheck passes
  • bun run lint passes
  • Tested against running daemon
  • N/A

Validated with:

  • bunx biome check platform/core/src/sources-config.ts platform/core/src/sources-config.test.ts platform/daemon/src/discord-source-fetch.ts platform/daemon/src/discord-source-fetch.test.ts platform/daemon/src/discord-source-bridge.ts platform/daemon/src/discord-source-embeddings.ts platform/daemon/src/discord-source-embeddings.test.ts platform/daemon/src/discord-source-graph.ts platform/daemon/src/discord-source-graph.test.ts platform/daemon/src/routes/sources-routes.ts platform/daemon/src/routes/sources-routes.test.ts surfaces/cli/src/commands/sources.ts surfaces/cli/src/features/sources.ts
  • cd platform/core && bun test src/sources-config.test.ts
  • cd platform/daemon && bun test src/discord-source-fetch.test.ts src/discord-source-embeddings.test.ts src/discord-source-graph.test.ts src/routes/sources-routes.test.ts
  • cd platform/core && bun run build
  • cd platform/daemon && bun run build.ts

Known local validation limits:

  • cd platform/daemon && bun run build:types still fails on existing broad repo rootDir/SQLite typing issues outside this Discord source change.
  • cd surfaces/cli && bun run build:cli requires connector workspace packages to be built/linked first and failed resolving those connector workspaces in this fresh worktree.

AI disclosure

  • No AI tools were used in this PR
  • AI tools were used (see Assisted-by tags in commits)

Notes

Replaces #728. Old PR #728 is closed in favor of this clean replacement.

@aaf2tbz aaf2tbz marked this pull request as ready for review May 23, 2026 04:48
@aaf2tbz aaf2tbz changed the title [codex] feat(sources): add Discord source indexing feat(sources): add Discord source indexing May 23, 2026
@PR-Reviewer-Ant
Copy link
Copy Markdown
Collaborator

Hi @aaf2tbz - I'm taking a look at the feature work in feat(sources): add Discord source indexing (commit a61abdd9) and will follow up shortly.

This comment is updated in place by pr-reviewer.

Copy link
Copy Markdown
Collaborator

@PR-Reviewer-Ant PR-Reviewer-Ant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review metadata
  • Reviewer: pr-reviewer
  • Model: gpt-5.5
  • Commit: a61abdd9

I found two issues worth fixing: the daemon records a Discord source as indexed even when sync was skipped, and the token reference field can persist an obvious raw Discord token into source config.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - Both findings are directly visible in the changed code: syncDiscordSource returns 0 on missing tokens, while daemon.ts marks the source indexed for any resolved sync promise; sources-config.ts only trims tokenRef before saving it into sources.json.

Comment thread platform/daemon/src/daemon.ts
Comment thread platform/core/src/sources-config.ts
Copy link
Copy Markdown
Collaborator

@PR-Reviewer-Ant PR-Reviewer-Ant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review metadata
  • Reviewer: pr-reviewer
  • Model: gpt-5.5
  • Commit: b7fade31

I found two data-lifecycle issues in the Discord source indexing path. The new startup sync can outlive source removal, and source-owned participant entities can be reassigned across Discord sources in a way that makes purge behavior unreliable.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - The issues are visible directly in the changed daemon startup sync, delete/purge route, and Discord graph upsert code. I did not need a live Discord repro to see that the sync is not tracked as in-flight and that entity upserts key by canonical_name plus agent_id while later purges delete by source_id.

Since my last review:

  • Fixed or likely addressed: 2
  • Still blocking: 0
  • Newly found: 2
  • Suppressed due to human rebuttal/out-of-scope: 0

Comment thread platform/daemon/src/daemon.ts
Comment thread platform/daemon/src/discord-source-graph.ts
Copy link
Copy Markdown
Collaborator

@PR-Reviewer-Ant PR-Reviewer-Ant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review metadata
  • Reviewer: pr-reviewer
  • Model: gpt-5.5
  • Commit: 56b4571a

I found a couple of correctness issues in the Discord indexing path: one can leave stale graph rows behind after channels or participants disappear, and one can fail ordinary Discord API calls when the response body is empty or non-JSON.

Confidence: Medium [sufficient_diff_evidence, missing_runtime_repro] - The stale graph behavior follows directly from discord-source-graph.ts only upserting current entities/dependencies and only purging on full source removal; there is no per-sync deletion for channels, threads, conversations, or participants that disappear. The Discord response parsing issue is visible in discord-source-fetch.ts where every non-204 response is parsed with response.json(), including error responses. I did not have a live Discord runtime trace, so confidence is medium rather than high.

Since my last review:

  • Fixed or likely addressed: 2
  • Still blocking: 0
  • Newly found: 2
  • Suppressed due to human rebuttal/out-of-scope: 0

Comment thread platform/daemon/src/discord-source-graph.ts
Comment thread platform/daemon/src/discord-source-fetch.ts
Comment thread platform/daemon/src/discord-source-graph.ts Fixed
Copy link
Copy Markdown
Collaborator

@PR-Reviewer-Ant PR-Reviewer-Ant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review metadata
  • Reviewer: pr-reviewer
  • Model: gpt-5.5
  • Commit: b1a9d633

I found a data-integrity problem in the Discord graph upsert path: overlapping Discord source configs can make one source take ownership of another source's graph rows. That should be fixed before merge because purge/reconcile can then delete or mutate rows that still belong to another configured source.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - The changed graph code explicitly looks up existing entities by canonical_name and agent_id only, while several Discord canonical names are not source-scoped. The PR also allows distinct Discord source IDs for different guild sets, so overlapping guild configs are reachable from the added source config path.

Since my last review:

  • Fixed or likely addressed: 2
  • Still blocking: 0
  • Newly found: 1
  • Suppressed due to human rebuttal/out-of-scope: 0

Comment thread platform/daemon/src/discord-source-graph.ts
Copy link
Copy Markdown
Collaborator

@PR-Reviewer-Ant PR-Reviewer-Ant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review metadata
  • Reviewer: pr-reviewer
  • Model: gpt-5.5
  • Commit: 01c3db69

I found one correctness issue in the Discord message fetch path: a malformed or unexpected message id can abort the whole channel sync instead of being reported as a fetch error. The rest of the implementation broadly matches the PR description from the provided diff.

Confidence: Medium [sufficient_diff_evidence, missing_runtime_repro] - The issue is directly visible in platform/daemon/src/discord-source-fetch.ts where BigInt(msg.id) is called inside the message loop without validation or containment. I did not have a live Discord/runtime repro, so confidence is medium rather than high.

Since my last review:

  • Fixed or likely addressed: 1
  • Still blocking: 0
  • Newly found: 1
  • Suppressed due to human rebuttal/out-of-scope: 0

Comment thread platform/daemon/src/discord-source-fetch.ts
Copy link
Copy Markdown
Collaborator

@PR-Reviewer-Ant PR-Reviewer-Ant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review metadata
  • Reviewer: pr-reviewer
  • Model: gpt-5.5
  • Commit: 349f6f6f

I found one lifecycle bug in the new Discord startup sync: it is not tracked as an in-flight source index job, so removal can race with indexing and leave supposedly purged Discord rows behind.

Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The diff shows Discord sync is launched directly from daemon startup without source-index in-flight tracking, while DELETE only keeps a tombstone when isSourceIndexInFlight(source.id) is true. I did not run a live daemon reproduction, but the race follows from the changed daemon startup path and existing deletion logic in sources-routes.ts.

Since my last review:

  • Fixed or likely addressed: 1
  • Still blocking: 0
  • Newly found: 1
  • Suppressed due to human rebuttal/out-of-scope: 0

Comment thread platform/daemon/src/daemon.ts
Copy link
Copy Markdown
Collaborator

@PR-Reviewer-Ant PR-Reviewer-Ant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review metadata
  • Reviewer: pr-reviewer
  • Model: gpt-5.5
  • Commit: 996dcbd7

I found two lifecycle bugs in the Discord source path: adding a source does not actually kick off indexing, and refreshes can leave stale Discord chunks searchable after the current fetch returns no messages.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - The add route returns queued:false after only writing the source config, and the only sync wiring in daemon.ts runs during startup. The stale-chunk issue follows directly from discord-source-bridge.ts only calling indexDiscordSourceEmbeddings when msgResult.data.length > 0, while stale cleanup lives inside that embedding function.

Since my last review:

  • Fixed or likely addressed: 1
  • Still blocking: 0
  • Newly found: 2
  • Suppressed due to human rebuttal/out-of-scope: 0

Comment thread platform/daemon/src/routes/sources-routes.ts
Comment thread platform/daemon/src/discord-source-bridge.ts
@aaf2tbz aaf2tbz marked this pull request as draft May 23, 2026 16:43
@aaf2tbz aaf2tbz marked this pull request as ready for review May 23, 2026 16:56
Copy link
Copy Markdown
Collaborator

@PR-Reviewer-Ant PR-Reviewer-Ant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review metadata
  • Reviewer: pr-reviewer
  • Model: gpt-5.5
  • Commit: 0edad3fe

I found one data-integrity bug in the Discord sync path: transient Discord fetch failures can be interpreted as an authoritative empty state and purge previously indexed graph rows.

Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The new Discord bridge logs fetch errors but still feeds empty/partial results into reconciliation. The provided diff includes the relevant sync and reconcile code, but the full routes file and existing source-index conventions were truncated in the prompt.

Since my last review:

  • Fixed or likely addressed: 2
  • Still blocking: 0
  • Newly found: 1
  • Suppressed due to human rebuttal/out-of-scope: 0

}
if (isSourceActive()) {
reconcileDiscordGuildStructure({
agentId,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reconciliation runs even when fetchGuildChannels() failed and returned data: [] with errors. In that case currentChannelIds is empty and reconciledChannels is empty, so reconcileDiscordGuildStructure() will treat a transient Discord/API/permission failure as an authoritative empty guild and delete previously indexed channel, conversation, and participant graph rows. The archived-thread path has the same shape: fetchPublicArchivedThreads() errors are ignored, so previously indexed archived thread conversations can be dropped from conversationPaths and purged. Please only reconcile after a successful authoritative listing, or preserve existing rows when fetches are partial/failed.

@NicholaiVogel
Copy link
Copy Markdown
Collaborator

Thanks for the work on this. This PR did several things well: it proved the Discord source shape in Signet, added useful config validation and token-reference guardrails, exercised Discord REST API v10 fetching, wired CLI/API entry points, and surfaced the right lifecycle problems around source removal, stale rows, and partial fetches.

After comparing this with Discrawl and the existing Obsidian Sources architecture, we are going to close this PR and replace it with a cleaner stacked implementation. The main issue is not that the code is unusable; it is that the abstraction boundary is too provider-specific. We do not want Discord to grow a second source pipeline beside Obsidian. The replacement work will create a unified Sources substrate first, then move Obsidian and Discord onto that substrate.

The replacement direction is:

  • provider adapters handle provider-specific discovery and fetching only;
  • shared source infrastructure owns index jobs, cancellation, tombstones, purge, chunks, embeddings, stats, recall, and graph ingestion;
  • Discord will target Discrawl-level completeness, including multiple guilds, full channel/thread coverage, members, mentions, attachments metadata, sync cursors, and safe partial-failure handling;
  • semantic entities, aspects, attributes, claims, and dependencies will be derived through the shared Signet source graph path with provenance back to the Discord source.

We will reuse the good pieces from this PR where they fit: token-ref validation, REST fetch helpers/tests, CLI/API ergonomics, and lifecycle regression cases. The Discord-specific bridge/embedding/graph split will not be the foundation for the replacement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants