feat(sources): add GitHub source indexing by aaf2tbz · Pull Request #749 · Signet-AI/signetai

aaf2tbz · 2026-05-23T03:57:30Z

Summary

Rebuilds #727 as a clean single-commit GitHub source implementation on current main. Adds GitHub repo sources for issues, PRs, discussions, docs, embeddings, graph structure, polling, CLI, API, and docs.

Changes

Adds github source config support in @signet/core with agent-scoped ownership, validation, defaults, and parser coverage.
Adds daemon GitHub fetching, embedding, graph indexing, stale reconciliation, polling bridge, source-index jobs, and purge support.
Adds POST /api/sources/github plus delete hardening for GitHub source ownership and in-flight syncs.
Adds signet sources add github CLI support and API docs.
Rebuilds the old feat(sources): add GitHub project source with embedding + knowledge graph #727 feature shape while tightening the review-failure areas: token resolution, admin permissions, bounded fetches, partial-failure handling, canonical graph keys, and source deletion races.

Type

feat — new user-facing feature (bumps minor)
fix — bug fix
refactor — restructure without behavior change
chore — build, deps, config, docs
perf — performance improvement
test — test coverage

Packages affected

Screenshots

N/A — no UI changes.

PR Readiness (MANDATORY)

Spec alignment validated (INDEX.md + dependencies.yaml)
Agent scoping verified on all new/changed data queries
Input/config validation and bounds checks added
Error handling and fallback paths tested (no silent swallow)
Security checks applied to admin/mutation endpoints
Docs updated for API/spec/status changes
Regression tests added for each bug fix
Lint/typecheck/tests pass locally

Migration Notes (if applicable)

Migration is idempotent
Daemon Rust parity reviewed or explicitly N/A
Rollback / compatibility note included in PR description

Rollback / compatibility: no database migration or persisted schema change. GitHub source entries are stored in existing sources.json; rollback is removing the GitHub source config entry and reverting this commit. Rust daemon parity is N/A because this extends the TypeScript daemon source indexing path only.

Testing

Validated with:

bunx biome check platform/core/src/sources-config.ts platform/core/src/sources-config.test.ts platform/daemon/src/github-source-bridge.ts platform/daemon/src/github-source-fetch.ts platform/daemon/src/github-source-graph.ts platform/daemon/src/routes/sources-routes.ts surfaces/cli/src/features/sources.ts surfaces/cli/src/commands/sources.ts
bun test platform/core/src/sources-config.test.ts platform/daemon/src/github-source-embeddings.test.ts platform/daemon/src/github-source-graph.test.ts
bun test platform/daemon/src/routes/sources-routes.test.ts
cd platform/core && bun run build
cd platform/daemon && bun run build

cd platform/daemon && bun run build:types still hits existing broad repo TypeScript/rootDir issues unrelated to this branch, so targeted tests plus production builds were used for local validation.

AI disclosure

No AI tools were used in this PR
AI tools were used (see Assisted-by tags in commits)

Notes

This replaces #727, which accumulated a long review/comment trail after the broad feature landed before the hardening was complete. This PR keeps the same feature goal but presents it as one reviewed implementation commit.

Assisted-by: Codex:gpt-5

PR-Reviewer-Ant · 2026-05-23T15:25:57Z

Hi @aaf2tbz - I'm taking a look at the feature work in feat(sources): add GitHub source indexing (commit 95964bed) and will follow up shortly.

This comment is updated in place by pr-reviewer.

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: 95964bed

I found one blocking scoping bug in the new GitHub source config path: sources added without an explicit agent id can be persisted under default and then skipped by the daemon bridge. I also found a smaller bounded-fetch gap for wildcard docs indexing.

Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The default-agent mismatch is visible in platform/core/src/sources-config.ts and platform/daemon/src/github-source-bridge.ts: config creation defaults missing agentId to default, while the bridge only syncs entries whose source.agentId equals resolveDaemonAgentId(). The route and CLI changed files were referenced by the PR but not included in the available checked-out prompt, so I could not verify whether every caller always passes the daemon agent id.

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: d9a65a51

I found one boundedness issue in the new GitHub source fetcher. The implementation mostly follows the stated shape, but wildcard repo sources can fan out without a hard cap before the per-repo limits ever apply.

Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The unbounded pagination is visible directly in expandRepoGlob, and the PR description explicitly claims bounded fetches. I did not run a live daemon or GitHub API repro, so the review is based on the supplied diff and changed-file excerpts.

Since my last review:

Fixed or likely addressed: 2
Still blocking: 0
Newly found: 1
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: 78b8760e

I found one boundary issue in the GitHub docs path support: the implementation says it indexes selected docs, but direct doc paths can point at arbitrary repo files. That makes the no-code-indexing boundary too easy to cross.

Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_cross_module_context] - The config validation and fetch path are visible in the diff: direct docPaths are only checked for relative path safety, then fetched and embedded as docs. I could not inspect the full routes/CLI diff because the provided context truncates those files, so I am treating this as a boundary warning rather than a blocking API-contract finding.

Since my last review:

Fixed or likely addressed: 1
Still blocking: 0
Newly found: 1
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: 43fab43e

I found one lifecycle gap in the new GitHub bridge: it runs syncs but never records successful completion back to the source config or source-index job state. That makes the advertised async source-indexing contract incomplete and leaves the source looking perpetually unindexed.

Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The bridge imports and uses the in-flight guard, but the changed startGitHubSourceBridge path only marks and clears in-flight state around syncGitHubSource. I do not see a corresponding markSourceIndexed, completeSourceIndexJobFromProgress, or failure completion in the GitHub bridge code included in the diff; a daemon runtime check would confirm the exact /api/sources status shape.

Since my last review:

Fixed or likely addressed: 1
Still blocking: 0
Newly found: 1
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: 52456e4d

I did not find a concrete correctness, security, data integrity, or race-condition issue in the provided diff. The implementation appears to cover the claimed GitHub source config validation, bounded fetching, indexing, stale reconciliation, purge hooks, CLI/API/docs surfaces, and targeted tests from the material available here.

Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The provided diff and changed-file snapshots include the core config parser, daemon bridge, GitHub fetcher, embedding/graph indexers, daemon startup integration, docs, and focused tests. I could not run or inspect the full local checkout because the sandbox did not contain the referenced PR files, and the PR itself notes no running-daemon validation, so runtime behavior of the API/CLI path remains unproven here.

Since my last review:

Fixed or likely addressed: 1
Still blocking: 0
Newly found: 0
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: bc23c567

I found one behavior mismatch: the new GitHub source state option is applied to issues and PRs, but discussions ignore it and are indexed regardless of the requested state.

Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The bridge passes settings.state into fetchIssues and fetchPullRequests, but the discussions branch calls fetchDiscussions without any state value and the fetch implementation hard-codes discussion resources as open. I could not inspect live route behavior because the provided workspace did not contain the repo files.

Since my last review:

Fixed or likely addressed: 0
Still blocking: 0
Newly found: 1
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: 2211bf68

I found one bounded-fetch bug in the GitHub issues path. The implementation mostly matches the PR shape, but this path can still walk far more GitHub pages than the configured cap on PR-heavy repositories.

Confidence: Medium [sufficient_diff_evidence, missing_runtime_repro] - The issue is visible directly in platform/daemon/src/github-source-fetch.ts: fetchIssues only increments its cap counter for non-PR issues, but the GitHub issues endpoint returns both issues and pull requests. I could not run a targeted runtime repro because the provided workspace contains the review prompt rather than the full checkout.

Since my last review:

Fixed or likely addressed: 1
Still blocking: 0
Newly found: 1
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: 10cdde5b

I found one correctness issue in the GitHub issue fetcher: the new indexing cap is applied to scanned /issues rows instead of indexed issue resources, so PR-heavy repos can silently under-index issues. The rest of the visible diff generally matches the stated feature shape and includes the expected ownership, permission, and purge hooks.

Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_cross_module_context] - The issue is directly visible in platform/daemon/src/github-source-fetch.ts: fetchIssues increments the cap counter before filtering pull_request rows, and the added test codifies returning only one issue when maxItems is two and one row is a PR. I did not have the full route/CLI file contents in the prompt, so endpoint wiring was reviewed only from the provided diff and docs.

Since my last review:

Fixed or likely addressed: 1
Still blocking: 0
Newly found: 1
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: 3f0f06f9

I found a concrete correctness bug in the new discussions fetch path: the GraphQL query never requests the field that the parser treats as required, so discussion indexing can fail at runtime as soon as a repository returns discussion nodes.

Confidence: High [sufficient_diff_evidence, targeted_context_included] - The new fetchDiscussions implementation in platform/daemon/src/github-source-fetch.ts includes the full GraphQL query and the parser in the diff. The query selects number, title, body, url, labels, createdAt, updatedAt, answer, and comments, but the parser immediately reads d.state.toLowerCase(); that value is not returned by the query or guaranteed by the GitHub GraphQL API response shape shown here.

Since my last review:

Fixed or likely addressed: 1
Still blocking: 0
Newly found: 1
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: a3281e02

I found two non-blocking correctness issues in the GitHub source path: partial sync failures are reported as completed jobs, and PR label filters are interpolated into GitHub search syntax without escaping.

Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_cross_module_context] - The bridge and fetch behavior is directly visible in platform/daemon/src/github-source-bridge.ts and platform/daemon/src/github-source-fetch.ts. I did not have the full untruncated route and CLI file bodies, so I limited findings to issues provable from the included changed code.

Since my last review:

Fixed or likely addressed: 1
Still blocking: 0
Newly found: 2
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant · 2026-05-23T19:51:24Z

+					continue;
+				}
+				try {
+					const result = await syncGitHubSource(source, { ...options, agentId });


This marks the source-index job complete even when syncGitHubSource returned hadErrors: true. Several fetch paths intentionally convert resource/comment/repo failures into hadErrors instead of throwing, so a source can fail partially, or even index zero rows after caught repo failures, while the progress API says the job completed. That undercuts the PR's partial-failure handling claim and makes operators trust stale or incomplete GitHub source state. Consider representing this as failed/partial, or at least preserving the error state in the job instead of calling the normal complete path.

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: d353af76

I found one deletion race in the new GitHub source bridge. It can leave embeddings behind for a source after the source has been deleted, which undercuts the PR's stated delete hardening.

Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The race is visible in github-source-bridge.ts: sourceActiveCheck is evaluated before structure indexing, but the embedding path can perform async work afterward without rechecking whether the source still exists. I did not run a live daemon repro because the sandbox only includes the supplied PR diff, not a checkout.

Since my last review:

Fixed or likely addressed: 2
Still blocking: 0
Newly found: 1
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: 75a4686c

I found one correctness issue in the daemon indexing bridge: GitHub sources are stored with per-agent ownership, but the poller only syncs sources owned by the daemon’s single resolved agent. That leaves valid non-default agent sources permanently unindexed.

Confidence: Medium [sufficient_diff_evidence, missing_runtime_repro] - The PR explicitly claims agent-scoped ownership and daemon polling. The bridge filters sources by one resolved daemon agent before syncing, which is directly visible in platform/daemon/src/github-source-bridge.ts. I did not run the daemon because the provided environment is read-only and the repository files were not present under the working directory.

Since my last review:

Fixed or likely addressed: 1
Still blocking: 0
Newly found: 1
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: ba69326c

I found two correctness issues in the new GitHub source indexing path: wildcard repo matching can include repos outside the configured pattern, and requested discussion indexing is reported as successful even when it is skipped for lack of a token.

Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The wildcard regex and discussion skip behavior are directly visible in the added GitHub source files. The checkout available here only included the review prompt, so I could not inspect the full route implementation for additional validation behavior.

Since my last review:

Fixed or likely addressed: 1
Still blocking: 0
Newly found: 2
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: 9e20bea7

I found one correctness issue in the GitHub source defaults: the default config includes discussions, but discussion indexing is treated as an error whenever no token is configured. That makes the advertised no-token/default GitHub source path report a failed sync every time.

Confidence: High [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The default resourceTypes are set from DEFAULT_GITHUB_RESOURCE_TYPES in platform/core/src/sources-config.ts, which includes discussions, and syncGitHubSource marks missing-token discussions as hadErrors. I did not need a live GitHub repro because the failure follows directly from the changed control flow.

Since my last review:

Fixed or likely addressed: 2
Still blocking: 0
Newly found: 1
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: 6e4a2932

I found one correctness issue in the GitHub source update path: updating an existing source can silently drop its token and custom indexing settings.

Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The core config update path is visible in platform/core/src/sources-config.ts and the API docs explicitly describe POST /api/sources/github as add-or-update. I could not inspect the full route/CLI implementations from the provided context, but any caller that omits optional fields while updating the same repo set will hit this core behavior.

Since my last review:

Fixed or likely addressed: 1
Still blocking: 0
Newly found: 1
Suppressed due to human rebuttal/out-of-scope: 0

PR-Reviewer-Ant

Review metadata

Reviewer: pr-reviewer
Model: gpt-5.5
Commit: 935fdc1d

I did not find a concrete blocker in the provided diff. The rebuilt GitHub source path has targeted validation, agent-scoped source identity, in-flight job handling, partial-error reporting, and focused tests around the prior risky areas.

Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_cross_module_context] - The visible diff and changed-file excerpts cover the core source config, daemon bridge, fetch, embedding, docs, and daemon startup behavior, and those are enough to check the main claims around validation, polling, token resolution shape, and deletion race guards. The prompt truncates parts of github-source-fetch.ts and does not include full sources-routes.ts, github-source-graph.ts, or CLI command contents, so I cannot independently verify every API/CLI ownership path or graph write detail from this review context alone.

Since my last review:

Fixed or likely addressed: 1
Still blocking: 0
Newly found: 0
Suppressed due to human rebuttal/out-of-scope: 0

aaf2tbz · 2026-05-24T01:45:01Z

Ready for human review.

NicholaiVogel · 2026-05-24T17:25:29Z

Closing this in favor of a provider-substrate rebuild.

This PR proved out the GitHub source behavior and captured useful hardening around bounded fetches, token handling, docs-path limits, partial failures, update preservation, and deletion races. Since it was opened, #757 landed the provider-neutral source substrate, and #759 showed the intended shape for non-Obsidian providers through the shared source job path.

Rather than merge a conflicted branch with a parallel GitHub-specific bridge/poller, I’m going to reimplement this on current main as a github source provider. The new PR should reuse the fetcher/tests from here where they still fit, but route config, indexing, purge, progress, and deletion through the shared source provider/job infrastructure.

Thanks for the review loop here; the findings from this PR become the regression checklist for the rebuild.

aaf2tbz mentioned this pull request May 23, 2026

feat(sources): add GitHub project source with embedding + knowledge graph #727

Closed

11 tasks

aaf2tbz force-pushed the codex/github-source-clean branch from 24fefaa to f812bca Compare May 23, 2026 04:03

feat(sources): add GitHub source indexing

8f0d415

Assisted-by: Codex:gpt-5

aaf2tbz force-pushed the codex/github-source-clean branch from f812bca to 8f0d415 Compare May 23, 2026 04:26

aaf2tbz marked this pull request as ready for review May 23, 2026 04:27

aaf2tbz requested a review from PR-Reviewer-Ant May 23, 2026 04:33

aaf2tbz changed the title ~~[codex] feat(sources): add GitHub source indexing~~ feat(sources): add GitHub source indexing May 23, 2026

Merge branch 'main' into codex/github-source-clean

95964be

PR-Reviewer-Ant requested changes May 23, 2026

View reviewed changes

Comment thread platform/core/src/sources-config.ts

Comment thread platform/daemon/src/github-source-fetch.ts

Fix GitHub source agent scoping and doc bounds

d9a65a5