feat(sources): add GitHub source indexing#749
Conversation
24fefaa to
f812bca
Compare
Assisted-by: Codex:gpt-5
f812bca to
8f0d415
Compare
|
Hi @aaf2tbz - I'm taking a look at the feature work in This comment is updated in place by pr-reviewer. |
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
95964bed
I found one blocking scoping bug in the new GitHub source config path: sources added without an explicit agent id can be persisted under default and then skipped by the daemon bridge. I also found a smaller bounded-fetch gap for wildcard docs indexing.
Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The default-agent mismatch is visible in platform/core/src/sources-config.ts and platform/daemon/src/github-source-bridge.ts: config creation defaults missing agentId to default, while the bridge only syncs entries whose source.agentId equals resolveDaemonAgentId(). The route and CLI changed files were referenced by the PR but not included in the available checked-out prompt, so I could not verify whether every caller always passes the daemon agent id.
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
d9a65a51
I found one boundedness issue in the new GitHub source fetcher. The implementation mostly follows the stated shape, but wildcard repo sources can fan out without a hard cap before the per-repo limits ever apply.
Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The unbounded pagination is visible directly in expandRepoGlob, and the PR description explicitly claims bounded fetches. I did not run a live daemon or GitHub API repro, so the review is based on the supplied diff and changed-file excerpts.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
78b8760e
I found one boundary issue in the GitHub docs path support: the implementation says it indexes selected docs, but direct doc paths can point at arbitrary repo files. That makes the no-code-indexing boundary too easy to cross.
Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_cross_module_context] - The config validation and fetch path are visible in the diff: direct docPaths are only checked for relative path safety, then fetched and embedded as docs. I could not inspect the full routes/CLI diff because the provided context truncates those files, so I am treating this as a boundary warning rather than a blocking API-contract finding.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
43fab43e
I found one lifecycle gap in the new GitHub bridge: it runs syncs but never records successful completion back to the source config or source-index job state. That makes the advertised async source-indexing contract incomplete and leaves the source looking perpetually unindexed.
Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The bridge imports and uses the in-flight guard, but the changed startGitHubSourceBridge path only marks and clears in-flight state around syncGitHubSource. I do not see a corresponding markSourceIndexed, completeSourceIndexJobFromProgress, or failure completion in the GitHub bridge code included in the diff; a daemon runtime check would confirm the exact /api/sources status shape.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
52456e4d
I did not find a concrete correctness, security, data integrity, or race-condition issue in the provided diff. The implementation appears to cover the claimed GitHub source config validation, bounded fetching, indexing, stale reconciliation, purge hooks, CLI/API/docs surfaces, and targeted tests from the material available here.
Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The provided diff and changed-file snapshots include the core config parser, daemon bridge, GitHub fetcher, embedding/graph indexers, daemon startup integration, docs, and focused tests. I could not run or inspect the full local checkout because the sandbox did not contain the referenced PR files, and the PR itself notes no running-daemon validation, so runtime behavior of the API/CLI path remains unproven here.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 0
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
bc23c567
I found one behavior mismatch: the new GitHub source state option is applied to issues and PRs, but discussions ignore it and are indexed regardless of the requested state.
Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The bridge passes settings.state into fetchIssues and fetchPullRequests, but the discussions branch calls fetchDiscussions without any state value and the fetch implementation hard-codes discussion resources as open. I could not inspect live route behavior because the provided workspace did not contain the repo files.
Since my last review:
- Fixed or likely addressed: 0
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
2211bf68
I found one bounded-fetch bug in the GitHub issues path. The implementation mostly matches the PR shape, but this path can still walk far more GitHub pages than the configured cap on PR-heavy repositories.
Confidence: Medium [sufficient_diff_evidence, missing_runtime_repro] - The issue is visible directly in platform/daemon/src/github-source-fetch.ts: fetchIssues only increments its cap counter for non-PR issues, but the GitHub issues endpoint returns both issues and pull requests. I could not run a targeted runtime repro because the provided workspace contains the review prompt rather than the full checkout.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
10cdde5b
I found one correctness issue in the GitHub issue fetcher: the new indexing cap is applied to scanned /issues rows instead of indexed issue resources, so PR-heavy repos can silently under-index issues. The rest of the visible diff generally matches the stated feature shape and includes the expected ownership, permission, and purge hooks.
Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_cross_module_context] - The issue is directly visible in platform/daemon/src/github-source-fetch.ts: fetchIssues increments the cap counter before filtering pull_request rows, and the added test codifies returning only one issue when maxItems is two and one row is a PR. I did not have the full route/CLI file contents in the prompt, so endpoint wiring was reviewed only from the provided diff and docs.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
3f0f06f9
I found a concrete correctness bug in the new discussions fetch path: the GraphQL query never requests the field that the parser treats as required, so discussion indexing can fail at runtime as soon as a repository returns discussion nodes.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The new fetchDiscussions implementation in platform/daemon/src/github-source-fetch.ts includes the full GraphQL query and the parser in the diff. The query selects number, title, body, url, labels, createdAt, updatedAt, answer, and comments, but the parser immediately reads d.state.toLowerCase(); that value is not returned by the query or guaranteed by the GitHub GraphQL API response shape shown here.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
a3281e02
I found two non-blocking correctness issues in the GitHub source path: partial sync failures are reported as completed jobs, and PR label filters are interpolated into GitHub search syntax without escaping.
Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_cross_module_context] - The bridge and fetch behavior is directly visible in platform/daemon/src/github-source-bridge.ts and platform/daemon/src/github-source-fetch.ts. I did not have the full untruncated route and CLI file bodies, so I limited findings to issues provable from the included changed code.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 2
- Suppressed due to human rebuttal/out-of-scope: 0
| continue; | ||
| } | ||
| try { | ||
| const result = await syncGitHubSource(source, { ...options, agentId }); |
There was a problem hiding this comment.
This marks the source-index job complete even when syncGitHubSource returned hadErrors: true. Several fetch paths intentionally convert resource/comment/repo failures into hadErrors instead of throwing, so a source can fail partially, or even index zero rows after caught repo failures, while the progress API says the job completed. That undercuts the PR's partial-failure handling claim and makes operators trust stale or incomplete GitHub source state. Consider representing this as failed/partial, or at least preserving the error state in the job instead of calling the normal complete path.
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
d353af76
I found one deletion race in the new GitHub source bridge. It can leave embeddings behind for a source after the source has been deleted, which undercuts the PR's stated delete hardening.
Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The race is visible in github-source-bridge.ts: sourceActiveCheck is evaluated before structure indexing, but the embedding path can perform async work afterward without rechecking whether the source still exists. I did not run a live daemon repro because the sandbox only includes the supplied PR diff, not a checkout.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
75a4686c
I found one correctness issue in the daemon indexing bridge: GitHub sources are stored with per-agent ownership, but the poller only syncs sources owned by the daemon’s single resolved agent. That leaves valid non-default agent sources permanently unindexed.
Confidence: Medium [sufficient_diff_evidence, missing_runtime_repro] - The PR explicitly claims agent-scoped ownership and daemon polling. The bridge filters sources by one resolved daemon agent before syncing, which is directly visible in platform/daemon/src/github-source-bridge.ts. I did not run the daemon because the provided environment is read-only and the repository files were not present under the working directory.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
ba69326c
I found two correctness issues in the new GitHub source indexing path: wildcard repo matching can include repos outside the configured pattern, and requested discussion indexing is reported as successful even when it is skipped for lack of a token.
Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The wildcard regex and discussion skip behavior are directly visible in the added GitHub source files. The checkout available here only included the review prompt, so I could not inspect the full route implementation for additional validation behavior.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 2
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
9e20bea7
I found one correctness issue in the GitHub source defaults: the default config includes discussions, but discussion indexing is treated as an error whenever no token is configured. That makes the advertised no-token/default GitHub source path report a failed sync every time.
Confidence: High [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The default resourceTypes are set from DEFAULT_GITHUB_RESOURCE_TYPES in platform/core/src/sources-config.ts, which includes discussions, and syncGitHubSource marks missing-token discussions as hadErrors. I did not need a live GitHub repro because the failure follows directly from the changed control flow.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
6e4a2932
I found one correctness issue in the GitHub source update path: updating an existing source can silently drop its token and custom indexing settings.
Confidence: Medium [sufficient_diff_evidence, missing_cross_module_context] - The core config update path is visible in platform/core/src/sources-config.ts and the API docs explicitly describe POST /api/sources/github as add-or-update. I could not inspect the full route/CLI implementations from the provided context, but any caller that omits optional fields while updating the same repo set will hit this core behavior.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
935fdc1d
I did not find a concrete blocker in the provided diff. The rebuilt GitHub source path has targeted validation, agent-scoped source identity, in-flight job handling, partial-error reporting, and focused tests around the prior risky areas.
Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_cross_module_context] - The visible diff and changed-file excerpts cover the core source config, daemon bridge, fetch, embedding, docs, and daemon startup behavior, and those are enough to check the main claims around validation, polling, token resolution shape, and deletion race guards. The prompt truncates parts of github-source-fetch.ts and does not include full sources-routes.ts, github-source-graph.ts, or CLI command contents, so I cannot independently verify every API/CLI ownership path or graph write detail from this review context alone.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 0
- Suppressed due to human rebuttal/out-of-scope: 0
|
Ready for human review. |
|
Closing this in favor of a provider-substrate rebuild. This PR proved out the GitHub source behavior and captured useful hardening around bounded fetches, token handling, docs-path limits, partial failures, update preservation, and deletion races. Since it was opened, #757 landed the provider-neutral source substrate, and #759 showed the intended shape for non-Obsidian providers through the shared source job path. Rather than merge a conflicted branch with a parallel GitHub-specific bridge/poller, I’m going to reimplement this on current Thanks for the review loop here; the findings from this PR become the regression checklist for the rebuild. |
Summary
Rebuilds #727 as a clean single-commit GitHub source implementation on current
main. Adds GitHub repo sources for issues, PRs, discussions, docs, embeddings, graph structure, polling, CLI, API, and docs.Changes
githubsource config support in@signet/corewith agent-scoped ownership, validation, defaults, and parser coverage.POST /api/sources/githubplus delete hardening for GitHub source ownership and in-flight syncs.signet sources add githubCLI support and API docs.Type
feat— new user-facing feature (bumps minor)fix— bug fixrefactor— restructure without behavior changechore— build, deps, config, docsperf— performance improvementtest— test coveragePackages affected
@signet/core@signet/daemon@signet/cli/ dashboard@signet/sdk@signet/connector-*@signet/webpredictorScreenshots
N/A — no UI changes.
PR Readiness (MANDATORY)
INDEX.md+dependencies.yaml)Migration Notes (if applicable)
Rollback / compatibility: no database migration or persisted schema change. GitHub source entries are stored in existing
sources.json; rollback is removing the GitHub source config entry and reverting this commit. Rust daemon parity is N/A because this extends the TypeScript daemon source indexing path only.Testing
bun testpassesbun run typecheckpassesbun run lintpassesValidated with:
bunx biome check platform/core/src/sources-config.ts platform/core/src/sources-config.test.ts platform/daemon/src/github-source-bridge.ts platform/daemon/src/github-source-fetch.ts platform/daemon/src/github-source-graph.ts platform/daemon/src/routes/sources-routes.ts surfaces/cli/src/features/sources.ts surfaces/cli/src/commands/sources.tsbun test platform/core/src/sources-config.test.ts platform/daemon/src/github-source-embeddings.test.ts platform/daemon/src/github-source-graph.test.tsbun test platform/daemon/src/routes/sources-routes.test.tscd platform/core && bun run buildcd platform/daemon && bun run buildcd platform/daemon && bun run build:typesstill hits existing broad repo TypeScript/rootDir issues unrelated to this branch, so targeted tests plus production builds were used for local validation.AI disclosure
Assisted-bytags in commits)Notes
This replaces #727, which accumulated a long review/comment trail after the broad feature landed before the hardening was complete. This PR keeps the same feature goal but presents it as one reviewed implementation commit.