docs: Kimi-K2.6 ADE-Bench behavioral analysis + dbt skill improvements #807

anandgupta42 wants to merge 6 commits into
Conversation
Adds research/kimi-k26-ade-bench-2026-05-10/ with a blog-ready writeup of how
the Moonshot Kimi-K2.6 model behaves as a coding agent inside altimate-code's
agent loop, derived from 78 trial traces against ADE-Bench. Findings cover
tool-usage distribution, wall-clock anatomy (~89% model generation, ~5%
tools), prompt-cache amplification (85.8%), per-failure-class taxonomy, and
extended appendices (per-trial manifest, pass-rate by family, skill
invocation log, cost/runtime distribution, reproducibility command, glossary,
open questions).
Also extends two shipped skills with generic dbt-best-practice patterns
surfaced during the analysis (all benchmark-agnostic, applicable to any dbt
project):
- dbt-develop/SKILL.md
* stronger description with explicit invocation triggers
* new section on transformation-logic pitfalls: incremental high-water
marks (>= vs >), snapshot strategy selection, LEFT JOIN + COUNT(*)
phantom rows, type harmonization in COALESCE/CASE/UNION, date-spine
completeness, off-by-one window boundaries, uniqueness enforcement,
window-LIMIT tiebreakers
* deliverable-enumeration step in Validate phase + iron rule
* unit-test verification step + iron rule
- dbt-unit-tests/SKILL.md
* new iron rule requiring mock data to exercise every SQL construct's
failure mode (LEFT JOIN unmatched parents, NULLIF zero, CASE branches,
COALESCE all-null, window boundaries, date spines, etc.)
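To make one of those failure modes concrete: the LEFT JOIN + COUNT(*) phantom-row pitfall can be reproduced in a few lines with Python's `sqlite3`. This is an illustrative sketch only; the table and column names are invented and not taken from the benchmark or the skill files:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY);
    CREATE TABLE order_items (item_id INTEGER, order_id INTEGER);
    INSERT INTO orders VALUES (1), (2), (3);          -- order 3 has no items
    INSERT INTO order_items VALUES (10, 1), (11, 1), (12, 2);
""")

# COUNT(*) counts the joined row itself, so an unmatched parent reports 1 "phantom" item.
phantom = conn.execute("""
    SELECT o.order_id, COUNT(*) AS n
    FROM orders o LEFT JOIN order_items i USING (order_id)
    GROUP BY o.order_id ORDER BY o.order_id
""").fetchall()

# COUNT(i.item_id) skips the NULLs produced by unmatched rows, so order 3 reports 0.
correct = conn.execute("""
    SELECT o.order_id, COUNT(i.item_id) AS n
    FROM orders o LEFT JOIN order_items i USING (order_id)
    GROUP BY o.order_id ORDER BY o.order_id
""").fetchall()

print(phantom)  # [(1, 2), (2, 1), (3, 1)]  <- order 3 appears to have an item
print(correct)  # [(1, 2), (2, 1), (3, 0)]
```

The same counting rule applies in any warehouse SQL dialect, which is why the skill calls it out as a generic dbt pitfall rather than a benchmark quirk.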
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in your review settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough

This PR strengthens dbt skill docs with explicit correctness preconditions and unit-test requirements, adds a Kimi‑K2.6 ADE‑Bench benchmark (README + findings), integrates an Altimate Code ADE‑Bench agent with packaging/install scripts and ADE‑Bench patches, and enables session auto-loading of skills via frontmatter.

Changes:
- dbt Skills Documentation Enhancement
- Kimi‑K2.6 ADE‑Bench Evaluation Report & Agent
- Session Prompt Auto-load via Skill Frontmatter
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Harness
    participant AltimateCodeAgent
    participant CLI as altimate-code CLI
    participant LogFile
    participant Parser as AltimateCodeParser
    Harness->>AltimateCodeAgent: perform_task(task_prompt, env)
    AltimateCodeAgent->>CLI: run --format json --yolo [--model] (copy local tarball if present)
    CLI->>LogFile: emit JSON event stream
    AltimateCodeAgent->>LogFile: read log file
    Parser->>LogFile: parse events (step_finish/tool_start/tool_end)
    Parser->>AltimateCodeAgent: metrics (runtime_ms, tokens, cost, success)
    AltimateCodeAgent->>Harness: return AgentResult (formatted log, metrics, tools_used)
```
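The parse step in the diagram above could be sketched as follows. The event names (`step_finish`, `tool_start`) come from the diagram, but the JSON field names (`duration_ms`, `tokens`, `cost_usd`, `tool`) are assumptions for illustration, not the PR's actual log schema:

```python
import json

def parse_log(lines):
    """Fold a JSON event stream (one object per line) into summary metrics."""
    metrics = {"runtime_ms": 0, "tokens": 0, "cost": 0.0, "tools_used": []}
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole parse
        kind = event.get("type")
        if kind == "step_finish":
            metrics["runtime_ms"] += event.get("duration_ms", 0)
            metrics["tokens"] += event.get("tokens", 0)
            metrics["cost"] += event.get("cost_usd", 0.0)
        elif kind == "tool_start":
            metrics["tools_used"].append(event.get("tool"))
    return metrics

log = [
    '{"type": "step_finish", "duration_ms": 1200, "tokens": 450, "cost_usd": 0.5}',
    '{"type": "tool_start", "tool": "sql_execute"}',
    '{"type": "tool_end", "tool": "sql_execute"}',
    'not json',
    '{"type": "step_finish", "duration_ms": 800, "tokens": 300, "cost_usd": 0.25}',
]
metrics = parse_log(log)
print(metrics)
```

Tolerating malformed lines (rather than raising) matches the benchmark-tooling context discussed in the review comments below, where a parse failure should degrade to partial metrics, not a crashed trial.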
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
🧹 Nitpick comments (2)
research/kimi-k26-ade-bench-2026-05-10/findings.md (2)
209-219: 💤 Low value — Minor: Add language identifier to code block.

Static analysis (markdownlint) flags this fenced code block as missing a language specifier.

Suggested fix

````diff
-```
+```text
 [completed] Explore project structure and source models
 [completed] Query sample data to understand part_types and author_types
 [in_progress] Create intercom__conversation_metrics.sql model
 [pending] Validate SQL syntax and analyze for anti-patterns
 [pending] Build the model and verify output
 [pending] Run full project build to ensure no regressions
````

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `research/kimi-k26-ade-bench-2026-05-10/findings.md` around lines 209-219: the fenced checklist is missing a language identifier, which triggers markdownlint; update the triple-backtick fence surrounding the checklist (the block that lists the six steps including "Create intercom__conversation_metrics.sql model" and the status lines) to include a language tag such as `text` so the code block is properly annotated for markdownlint and renderers.

87-93: 💤 Low value — Minor: Add language identifier to code block.

Static analysis (markdownlint) flags this fenced code block as missing a language specifier. Adding `text` or an appropriate identifier improves rendering consistency.

Suggested fix

````diff
-```
+```text
 [pending] Add position_descriptions to f1_dataset.yml sources
 [pending] Create src_<model>.sql views in models/src/ pointing to source tables
 [pending] Update staging models to reference src_ models instead of raw tables
 [pending] Run dbt build to verify everything compiles and builds successfully
````

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `research/kimi-k26-ade-bench-2026-05-10/findings.md` around lines 87-93: the fenced code block is missing a language identifier, which triggers markdownlint; update the opening triple-backtick for the block that contains the four "[pending] ..." lines to include a language specifier such as `text`, and keep the block contents unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

Inline comments, in `research/kimi-k26-ade-bench-2026-05-10/findings.md`:

- Line 276: The line contains a branding leak: replace the phrase "beyond OpenCode's base set" with a neutral, non-branded alternative (e.g., "beyond the base toolset" or "beyond the project's base toolset"); update the sentence "altimate-code ships dbt-specific tools beyond OpenCode's base set." to a reworded version such as "altimate-code ships dbt-specific tools beyond the base toolset." to remove the product name while preserving meaning.
- Line 5: The line containing "Harness: altimate-code (a fork of OpenCode wrapping the model in a coding-agent loop...)" leaks the OpenCode product name; remove or reword that parenthetical. Replace "a fork of OpenCode" with a neutral phrase such as "an internal fork of a coding-agent framework" or simply "a forked coding-agent wrapper" and keep the rest of the Harness description intact (refer to the Harness: altimate-code and model id `openrouter/moonshotai/kimi-k2.6-20260420` to locate the exact sentence to edit).
- Line 30: The phrase "standard OpenCode toolset" leaks branding; update the text in the findings entry that mentions OpenCode (the sentence listing tools: `bash`, `read`, `write`, `edit`, `glob`, `grep`, `todowrite`) to remove the product name and use a neutral term such as "standard code toolset" or "standard toolset", ensuring the rest of the tool list and altimate-specific tools (`project_scan`, `sql_analyze`, `sql_execute`, etc.) remain unchanged.

Nitpick comments, in `research/kimi-k26-ade-bench-2026-05-10/findings.md`:

- Around lines 209-219: The fenced checklist is missing a language identifier, which triggers markdownlint; update the triple-backtick fence surrounding the checklist (the block that lists the six steps including "Create intercom__conversation_metrics.sql model" and the status lines) to include a language tag such as `text`.
- Around lines 87-93: The fenced code block is missing a language identifier; update the opening triple-backtick for the block that contains the four "[pending] ..." lines to include a language specifier such as `text`, modifying only the opening fence and keeping the block contents unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

- [ ] Push a commit to this branch (recommended)
- [ ] Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: `5425a1b0-ef0d-4535-b5f1-7894fc31c513`

📥 Commits

Reviewing files that changed from the base of the PR and between c859b57ec46925a7a3c1bcd735c5afa1f365c029 and e7e1d9227ee9409bed1d05da21980a815f5e77f9.

📒 Files selected for processing (4)

- `.opencode/skills/dbt-develop/SKILL.md`
- `.opencode/skills/dbt-unit-tests/SKILL.md`
- `research/kimi-k26-ade-bench-2026-05-10/README.md`
- `research/kimi-k26-ade-bench-2026-05-10/findings.md`
*Notes from running the Moonshot Kimi-K2.6 model (via OpenRouter) inside altimate-code's dbt-aware agent loop on the ADE-Bench analytics/data-engineering benchmark.*

Date: 2026-05-10. Model id: `openrouter/moonshotai/kimi-k2.6-20260420`. Harness: altimate-code (a fork of OpenCode wrapping the model in a coding-agent loop with extra dbt/SQL/warehouse tools).
Critical: Branding leak detected.
Pipeline failure indicates "OpenCode (product name)" appears in this line. The phrase "a fork of OpenCode" must be removed or reworded to comply with branding guidelines.
Suggested fix

```diff
-Date: 2026-05-10. Model id: `openrouter/moonshotai/kimi-k2.6-20260420`. Harness: altimate-code (a fork of OpenCode wrapping the model in a coding-agent loop with extra dbt/SQL/warehouse tools).
+Date: 2026-05-10. Model id: `openrouter/moonshotai/kimi-k2.6-20260420`. Harness: altimate-code (wrapping the model in a coding-agent loop with extra dbt/SQL/warehouse tools).
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🧰 Tools
🪛 GitHub Actions: CI / 5_Marker Guard.txt
[error] 5-5: Branding audit found leak (OpenCode (product name)). Line 5: "OpenCode (product name)" with model id openrouter/moonshotai/kimi-k2.6-...
🪛 GitHub Actions: CI / Marker Guard
[error] 5-5: Branding audit leak found: "OpenCode (product name)". Context: "Date: 2026-05-10. Model id: openrouter/moonshotai/kimi-k2.6-20260420. Harne..."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@research/kimi-k26-ade-bench-2026-05-10/findings.md` at line 5, The line
containing "Harness: altimate-code (a fork of OpenCode wrapping the model in a
coding-agent loop...)" leaks the OpenCode product name; remove or reword that
parenthetical. Replace "a fork of OpenCode" with a neutral phrase such as "an
internal fork of a coding-agent framework" or simply "a forked coding-agent
wrapper" and keep the rest of the Harness description intact (refer to the
Harness: altimate-code and model id `openrouter/moonshotai/kimi-k2.6-20260420`
to locate the exact sentence to edit).
Each trial:

1. The harness starts a container, scaffolds the dbt project, and hands the agent a natural-language prompt.
2. altimate-code spins up its agent loop. The model is Kimi-K2.6 routed through OpenRouter using altimate-code's OpenAI-compatible provider. The agent has the standard OpenCode toolset (`bash`, `read`, `write`, `edit`, `glob`, `grep`, `todowrite`) plus altimate-specific tools (`project_scan`, `sql_analyze`, `sql_execute`, `warehouse_*`, `dbt_manifest`, `dbt_profiles`, `dbt_lineage`, `altimate_core_validate`, `altimate_memory_*`, `schema_*`, `lineage_check`, `skill`, `tool_lookup`).
Critical: Branding leak detected.
Pipeline failure indicates "OpenCode (product name)" appears in this line. The phrase "standard OpenCode toolset" must be reworded.
Suggested fix

```diff
-2. altimate-code spins up its agent loop. The model is Kimi-K2.6 routed through OpenRouter using altimate-code's OpenAI-compatible provider. The agent has the standard OpenCode toolset (`bash`, `read`, `write`, `edit`, `glob`, `grep`, `todowrite`) plus altimate-specific tools (`project_scan`, `sql_analyze`, `sql_execute`, `warehouse_*`, `dbt_manifest`, `dbt_profiles`, `dbt_lineage`, `altimate_core_validate`, `altimate_memory_*`, `schema_*`, `lineage_check`, `skill`, `tool_lookup`).
+2. altimate-code spins up its agent loop. The model is Kimi-K2.6 routed through OpenRouter using altimate-code's OpenAI-compatible provider. The agent has the standard toolset (`bash`, `read`, `write`, `edit`, `glob`, `grep`, `todowrite`) plus altimate-specific tools (`project_scan`, `sql_analyze`, `sql_execute`, `warehouse_*`, `dbt_manifest`, `dbt_profiles`, `dbt_lineage`, `altimate_core_validate`, `altimate_memory_*`, `schema_*`, `lineage_check`, `skill`, `tool_lookup`).
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🧰 Tools
🪛 GitHub Actions: CI / 5_Marker Guard.txt
[error] 30-30: Branding audit found leak (OpenCode (product name)). Line 30 references altimate-code and model routing.
🪛 GitHub Actions: CI / Marker Guard
[error] 30-30: Branding audit leak found: "OpenCode (product name)". Context: "2. altimate-code spins up its agent loop. The model is Kimi-K2.6 routed throu..."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@research/kimi-k26-ade-bench-2026-05-10/findings.md` at line 30, The phrase
"standard OpenCode toolset" leaks branding; update the text in the findings
entry that mentions OpenCode (the sentence listing tools: `bash`, `read`,
`write`, `edit`, `glob`, `grep`, `todowrite`) to remove the product name and use
a neutral term such as "standard code toolset" or "standard toolset" (or similar
wording), ensuring the rest of the tool list and altimate-specific tools
(`project_scan`, `sql_analyze`, `sql_execute`, etc.) remain unchanged.
## 6. Where the custom tools helped (or didn't)

altimate-code ships dbt-specific tools beyond OpenCode's base set. Pass-rate correlations:
Critical: Branding leak detected.
Pipeline failure indicates "OpenCode (product name)" appears in this line. The phrase "beyond OpenCode's base set" must be reworded.
Suggested fix

```diff
-altimate-code ships dbt-specific tools beyond OpenCode's base set. Pass-rate correlations:
+altimate-code ships dbt-specific tools beyond the base set. Pass-rate correlations:
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🧰 Tools
🪛 GitHub Actions: CI / 5_Marker Guard.txt
[error] 276-276: Branding audit found leak (OpenCode (product name)). Line 276 mentions altimate-code shipping dbt-specific tools beyond OpenCode.
🪛 GitHub Actions: CI / Marker Guard
[error] 276-276: Branding audit leak found: "OpenCode (product name)". Context: "altimate-code ships dbt-specific tools beyond OpenCode's base set. Pass-rate ..."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@research/kimi-k26-ade-bench-2026-05-10/findings.md` at line 276, The line
contains a branding leak: replace the phrase "beyond OpenCode's base set" in the
findings text with a neutral, non-branded alternative (e.g., "beyond the base
toolset" or "beyond the project's base toolset"); update the sentence
"altimate-code ships dbt-specific tools beyond OpenCode's base set." to a
reworded version such as "altimate-code ships dbt-specific tools beyond the base
toolset." to remove the product name while preserving meaning.
2 issues found across 4 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="research/kimi-k26-ade-bench-2026-05-10/findings.md">
<violation number="1" location="research/kimi-k26-ade-bench-2026-05-10/findings.md:111">
P3: The step-gap interval label is inconsistent with the glossary definition and can mislead readers about what was measured.</violation>
<violation number="2" location="research/kimi-k26-ade-bench-2026-05-10/findings.md:236">
P2: The `f1011` taxonomy note inverts pass/fail status for `check_option_b` and contradicts the appendix data.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| **Date-spine completeness** | `airbnb009` | Kimi understood the task but did not generate a date-spine join; it kept the original `GROUP BY DATE_TRUNC` which drops empty days. dbt_utils was installed; Kimi just didn't reach for it. |
| **dbt-specific features (versioned models, snapshots, materialization)** | `airbnb007` (`models_are_materialized_correctly`), `airbnb010`, `helixops_saas009`, `f1008` | Created `dim_accounts_v2.sql` instead of using dbt's `versions:` keyword. Snapshot task wrote a regular model instead of a `snapshots/` directory file. |
| **Type harmonization in `CASE` / `COALESCE`** | `analytics_engineering004` | LEFT JOIN of inventory to product details where product details are NULL for some rows; model coerced types inconsistently. |
| **Multi-part reasoning over-confidence** | `f1011` | Multiple-choice question where Kimi answered `ABDE`. Only `check_option_b` passed; Kimi rationalized E with apparent confidence, but the gold answer set differed. |
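The date-spine failure mode in the first row above is easy to demonstrate outside of dbt: aggregating by event date silently drops days with no rows, while joining counts onto a generated spine keeps them as explicit zeros. A pure-Python sketch, with dates and counts invented purely for illustration:

```python
from datetime import date, timedelta
from collections import Counter

events = [date(2026, 5, 1), date(2026, 5, 1), date(2026, 5, 3)]  # no events on May 2

# GROUP BY-style aggregation: days without events simply disappear from the output
grouped = dict(Counter(events))

# Date-spine approach: generate every day in range, then left-join counts onto it
start, end = min(events), max(events)
spine = [start + timedelta(days=i) for i in range((end - start).days + 1)]
with_spine = {d: grouped.get(d, 0) for d in spine}

print(sorted(grouped))     # May 2 is missing entirely
print(sorted(with_spine))  # May 1, 2, 3 all present, May 2 with a count of 0
```

In a dbt project the spine would typically come from `dbt_utils.date_spine` rather than hand-rolled generation, which is exactly the macro the finding notes was installed but unused.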
P2: The f1011 taxonomy note inverts pass/fail status for check_option_b and contradicts the appendix data.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At research/kimi-k26-ade-bench-2026-05-10/findings.md, line 236:
<comment>The `f1011` taxonomy note inverts pass/fail status for `check_option_b` and contradicts the appendix data.</comment>
<file context>
@@ -0,0 +1,571 @@
+| **Date-spine completeness** | `airbnb009` | Kimi understood the task but did not generate a date-spine join; it kept the original `GROUP BY DATE_TRUNC` which drops empty days. dbt_utils was installed; Kimi just didn't reach for it. |
+| **dbt-specific features (versioned models, snapshots, materialization)** | `airbnb007` (`models_are_materialized_correctly`), `airbnb010`, `helixops_saas009`, `f1008` | Created `dim_accounts_v2.sql` instead of using dbt's `versions:` keyword. Snapshot task wrote a regular model instead of a `snapshots/` directory file. |
+| **Type harmonization in `CASE` / `COALESCE`** | `analytics_engineering004` | LEFT JOIN of inventory to product details where product details are NULL for some rows; model coerced types inconsistently. |
+| **Multi-part reasoning over-confidence** | `f1011` | Multiple-choice question where Kimi answered `ABDE`. Only `check_option_b` passed; Kimi rationalized E with apparent confidence, but the gold answer set differed. |
+| **Refactor reference updates** | `asana004` | Created the new intermediate model correctly but didn't fully update all downstream `ref()` calls. `check_task_references` failed. |
+| **Trivial / setup** | `simple001`, `workday001` | `simple001` renamed a model but missed a downstream reference. `workday001`'s prompt is literally *"Do nothing"* and the agent halted in 2 seconds — possibly a bench bug. |
</file context>
| Phase | Total time | Share of wall |
|---|---:|---:|
| Step duration (`step_start → step_finish`: model generation + tool dispatch) | 22,745 s | 66.1% |
| Step-to-step gaps (`step_start → next step_start`) | 30,672 s | 89.2% |
P3: The step-gap interval label is inconsistent with the glossary definition and can mislead readers about what was measured.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At research/kimi-k26-ade-bench-2026-05-10/findings.md, line 111:
<comment>The step-gap interval label is inconsistent with the glossary definition and can mislead readers about what was measured.</comment>
<file context>
@@ -0,0 +1,571 @@
+| Phase | Total time | Share of wall |
+|---|---:|---:|
+| Step duration (`step_start → step_finish`: model generation + tool dispatch) | 22,745 s | 66.1% |
+| Step-to-step gaps (`step_start → next step_start`) | 30,672 s | 89.2% |
+| Tool execution (sum of all individual `tool_use` durations) | 1,690 s | 4.9% |
+| Total runtime | 34,402 s | 100% |
</file context>
Suggested fix

```diff
-| Step-to-step gaps (`step_start → next step_start`) | 30,672 s | 89.2% |
+| Step-to-step gaps (`step_finish → next step_start`) | 30,672 s | 89.2% |
```
Adds the source-code + scripts + 4 small patches needed to plug altimate-code into upstream ade-bench. Lets anyone reproduce the 81.3% pass rate described in research/kimi-k26-ade-bench-2026-05-10/ without trusting the pre-aggregated numbers.

What's included:

- benchmark/ade-bench/README.md — full reproduction guide (prereqs, Docker memory, env-var knobs, step-by-step commands, troubleshooting)
- benchmark/ade-bench/altimate_code_agent/ — drop-in agent module (AltimateCodeAgent class, JSON event parser, log formatter, install script that runs inside the trial container, tarball builder)
- benchmark/ade-bench/patches/ — 4 small patches against upstream dbt-labs/ade-bench (register AgentName.ALTIMATE_CODE, wire it into the AgentFactory, export from installed_agents/__init__.py, route the existing shared/config/AGENTS.md baseline file the same way Codex receives it — pure parity, no benchmark-specific content)

Explicitly NOT in this folder:

- Trace files / per-trial agent.log / results.json (regenerable)
- The 130 MB built tarball (build-local-tarball.sh recreates it)
- Seed DuckDB databases (downloaded from dbt-labs/ade-bench releases)
- Per-task ground-truth seeds + test SQL (those live in upstream ade-bench and are never sent to the agent at run time)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🧹 Nitpick comments (4)

benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py (2)

225-225: 💤 Low value — Remove unnecessary f-string prefix.

The f-prefix is not needed since there are no format placeholders in this string.

🧹 Proposed fix

```diff
-    command = f"echo 'AGENT RESPONSE: ' && altimate-code run --format json --yolo"
+    command = "echo 'AGENT RESPONSE: ' && altimate-code run --format json --yolo"
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py` at line 225, the string assigned to the variable `command` uses an unnecessary f-string; replace the f-prefixed string with a plain string literal (`command = "echo 'AGENT RESPONSE: ' && altimate-code run --format json --yolo"`) so there are no unused format prefixes.
58-59: ⚡ Quick win — Consider logging parse errors for debugging.

The bare `except: pass` silently swallows all parsing errors, making it difficult to debug malformed log files during benchmark development. While silent failure is acceptable for tooling, adding a minimal error indicator would improve troubleshooting.

🔍 Proposed improvement

```diff
-        except Exception:
-            pass
+        except Exception as e:
+            # Return partial results; log parse errors are non-fatal in benchmark context
+            import sys
+            print(f"Warning: log parse error: {e}", file=sys.stderr)
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py` around lines 58-59, the bare "except: pass" in the parsing block silently swallows errors; change it to "except Exception as e" and log a minimal error message including the exception (e.g., using logging.getLogger(__name__).warning or .exception) with context like "Failed to parse log entry" so malformed inputs are visible during debugging; ensure the module has a logger configured (import logging and getLogger) before using it.

benchmark/ade-bench/README.md (1)
9-22: ⚡ Quick win — Add language identifier to the fenced code block.

The code block showing the directory structure would benefit from a language identifier for proper syntax highlighting.

📝 Proposed fix

````diff
-```
+```text
 benchmark/ade-bench/
 ├── README.md        ← you are here
````

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `benchmark/ade-bench/README.md` around lines 9-22, update the fenced code block to include a language identifier for proper highlighting: change the opening triple backticks that start the directory-tree block to use `text` so the tree (the block containing benchmark/ade-bench/ and the listed files like altimate_code_agent/ and patches/) is rendered with correct formatting.

benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh (1)
83-87: ⚡ Quick win — Prefer `find` over `ls` for discovering the tarball.

The current approach using `ls` works but is sensitive to locale and could behave unexpectedly if multiple tarballs exist. A `find`-based approach provides better control and predictability.

♻️ Proposed refactor using find

```diff
-TARBALL="$(ls -1 "$STAGE"/altimate-code-*.tgz | head -1)"
+TARBALL="$(find "$STAGE" -maxdepth 1 -name 'altimate-code-*.tgz' -print -quit)"
 if [[ -z "$TARBALL" ]]; then
   echo "pack failed: no tarball produced" >&2
   exit 1
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh` around lines 83-87, replace the fragile ls-based discovery of the tarball by using find: instead of assigning TARBALL via ls on "$STAGE", run a find rooted at "$STAGE" with -maxdepth 1 -type f -name "altimate-code-*.tgz" -print -quit to reliably pick the first match, then check if TARBALL is empty and exit with the same error handling; update references to TARBALL and keep the existing error message/exit behavior unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py`:
- Line 225: The string assigned to variable "command" in altimate_code_agent.py
is using an unnecessary f-string; replace the f-prefixed string in the
assignment to command (currently: command = f"...") with a plain string literal
(command = "echo 'AGENT RESPONSE: ' && altimate-code run --format json --yolo")
so there are no unused format prefixes.
- Around line 58-59: The bare "except: pass" in the parsing block silently
swallows errors; change it to "except Exception as e" and log a minimal error
message including the exception (e.g., using logging.getLogger(__name__).warning
or .exception) with context like "Failed to parse log entry" so malformed inputs
are visible during debugging; ensure the module has a logger configured (import
logging and getLogger) before using it.
In `@benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh`:
- Around line 83-87: Replace the fragile ls-based discovery of the tarball by
using find: instead of assigning TARBALL via ls on "$STAGE", run a find rooted
at "$STAGE" with -maxdepth 1 -type f -name "altimate-code-*.tgz" -print -quit to
reliably pick the first match, then check if TARBALL is empty and exit with the
same error handling; update references to TARBALL and keep the existing error
message/exit behavior unchanged.
In `@benchmark/ade-bench/README.md`:
- Around line 9-22: Update the fenced code block in README.md to include a
language identifier for proper highlighting: change the opening triple backticks
that currently start the directory-tree block to use "text" (i.e., ```text) so
the tree shown (the block containing benchmark/ade-bench/ and the listed files
like altimate_code_agent/ and patches/) is rendered with correct formatting.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 778af701-c01c-4a00-96d9-848f6ea6aded
📒 Files selected for processing (9)
- benchmark/ade-bench/README.md
- benchmark/ade-bench/altimate_code_agent/__init__.py
- benchmark/ade-bench/altimate_code_agent/altimate-code-setup.sh
- benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py
- benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh
- benchmark/ade-bench/patches/01-agent_name.py.patch
- benchmark/ade-bench/patches/02-agent_factory.py.patch
- benchmark/ade-bench/patches/03-installed_agents_init.py.patch
- benchmark/ade-bench/patches/04-agent_setup.py.patch
✅ Files skipped from review due to trivial changes (1)
- benchmark/ade-bench/patches/03-installed_agents_init.py.patch
3 issues found across 9 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="benchmark/ade-bench/altimate_code_agent/altimate-code-setup.sh">
<violation number="1" location="benchmark/ade-bench/altimate_code_agent/altimate-code-setup.sh:29">
P2: Avoid `@latest` in benchmark setup fallback; it makes runs non-reproducible and can silently change agent behavior.</violation>
</file>
<file name="benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh">
<violation number="1" location="benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh:11">
P1: `REPO_ROOT` is computed with too many `..` segments, so package paths resolve outside the repository and the tarball build fails.</violation>
</file>
<file name="benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py">
<violation number="1" location="benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py:228">
P1: Shell command construction does not quote `self._model_name`, which allows command injection or malformed execution when model IDs contain shell metacharacters.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```shell
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../../../../.." && pwd)"
```
P1: REPO_ROOT is computed with too many .. segments, so package paths resolve outside the repository and the tarball build fails.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh, line 11:
<comment>`REPO_ROOT` is computed with too many `..` segments, so package paths resolve outside the repository and the tarball build fails.</comment>
<file context>
@@ -0,0 +1,90 @@
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../../../../../.." && pwd)"
+PKG_DIR="$REPO_ROOT/packages/opencode"
+DBT_TOOLS_DIR="$REPO_ROOT/packages/dbt-tools"
</file context>
```diff
-REPO_ROOT="$(cd "$SCRIPT_DIR/../../../../../.." && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
```
```python
command = f"echo 'AGENT RESPONSE: ' && altimate-code run --format json --yolo"

if self._model_name:
    command += f" --model {self._model_name}"
```
P1: Shell command construction does not quote self._model_name, which allows command injection or malformed execution when model IDs contain shell metacharacters.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py, line 228:
<comment>Shell command construction does not quote `self._model_name`, which allows command injection or malformed execution when model IDs contain shell metacharacters.</comment>
<file context>
@@ -0,0 +1,264 @@
+ command = f"echo 'AGENT RESPONSE: ' && altimate-code run --format json --yolo"
+
+ if self._model_name:
+ command += f" --model {self._model_name}"
+ command += f" --max-turns 80 {escaped_prompt}"
+
</file context>
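A hedged sketch of the quoting fix using `shlex.quote`; the function name and surrounding structure are illustrative stand-ins for the reviewed code, not the project's actual implementation:

```python
import shlex
from typing import Optional

def build_run_command(model_name: Optional[str], escaped_prompt: str) -> str:
    # Illustrative reconstruction of the reviewed command builder.
    command = "echo 'AGENT RESPONSE: ' && altimate-code run --format json --yolo"
    if model_name:
        # shlex.quote neutralizes shell metacharacters in the model ID,
        # so an ID like "kimi; rm -rf /" stays a single argument.
        command += f" --model {shlex.quote(model_name)}"
    command += f" --max-turns 80 {escaped_prompt}"
    return command

print(build_run_command("kimi; rm -rf /", "'do the task'"))
```

With quoting in place, a metacharacter-laden model ID is passed through as one literal argument instead of being interpreted by the shell.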
```shell
  chmod 755 "$PKG_BIN_DIR/.altimate-code" "$PKG_BIN_DIR/.altimate"
else
  echo "Local tarball not staged; falling back to latest published"
  npm install -g --no-audit --no-fund @altimateai/altimate-code@latest
```
P2: Avoid @latest in benchmark setup fallback; it makes runs non-reproducible and can silently change agent behavior.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At benchmark/ade-bench/altimate_code_agent/altimate-code-setup.sh, line 29:
<comment>Avoid `@latest` in benchmark setup fallback; it makes runs non-reproducible and can silently change agent behavior.</comment>
<file context>
@@ -0,0 +1,106 @@
+ chmod 755 "$PKG_BIN_DIR/.altimate-code" "$PKG_BIN_DIR/.altimate"
+else
+ echo "Local tarball not staged; falling back to latest published"
+ npm install -g --no-audit --no-fund @altimateai/altimate-code@latest
+fi
+
</file context>
Two related changes, both shipped to every altimate-code user.
(1) `feat(skill)`: add `alwaysApply: bool` and `applyPaths: string|string[]`
frontmatter to skill metadata, mirroring Cursor's "Always Apply" and
"Auto Attached" rule modes. When a skill is `alwaysApply: true` or has
`applyPaths` matching at least one file under the worktree, its body
is inlined into the system prompt at session start under an
`<auto_loaded_skill>` block — the model no longer needs to invoke the
Skill tool to access that guidance.
Motivation: benchmark traces show the agent invokes the `Skill` tool
in <1% of tool calls, even after the skill description is rewritten
to be imperative. Many failures occur on patterns the relevant skill
already documents but the agent never loads. Auto-loading puts the
body deterministically in context for projects where the skill
applies.
Files:
• packages/opencode/src/skill/skill.ts — Info schema + both load
paths (filesystem + binary-embedded) pluck the new fields
• packages/opencode/src/session/system.ts — auto-inline matched
skill bodies after the existing available_skills XML block
• .opencode/skills/dbt-develop/SKILL.md — frontmatter now declares
`applyPaths: [dbt_project.yml, **/dbt_project.yml]`, so dbt
projects auto-load this skill's body (~270 lines of dbt
best-practice patterns) at session start
The existing skill-tool-invocation path is unchanged; auto-load is
additive. Skills without `alwaysApply` / `applyPaths` continue to
require explicit invocation. Prompt caching amortizes the extra
tokens across the long agent loop.
(2) `docs(skill)`: three new generic dbt pitfall sections in
`dbt-develop/SKILL.md`, all benchmark-agnostic best practices
surfaced during failure-trace analysis:
• String concatenation with `NULL` operands — `||` / `CONCAT`
propagate `NULL`; wrap with `COALESCE` or use `CONCAT_WS`.
Catches an invisible row-dropper in surrogate-key generation and
derived columns.
• dbt model versioning (dbt 1.8+) — when introducing a v2 of an
existing model, use dbt's `versions:` block in `_models.yml` with
`defined_in:`, not a sibling `_v2.sql` file. Otherwise downstream
lineage and `{{ ref(model, v=2) }}` resolution break.
• Strengthened the existing window-rank + `LIMIT` section to call
out determinism explicitly, including the `QUALIFY ROW_NUMBER()
OVER (... ORDER BY metric, id)` form and the "if you can't think
of a tiebreaker, you don't have a unique key yet" framing.
All three patterns are documented in well-known dbt style guides
and would benefit any real altimate-code user — they are not
benchmark-targeted tweaks.
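Under dbt 1.8+ model-versioning semantics, the `versions:` block the second bullet describes looks roughly like this (the model name is a placeholder):

```yaml
# models/_models.yml
models:
  - name: dim_customers        # placeholder model name
    latest_version: 2
    versions:
      - v: 1
        defined_in: dim_customers   # v1 keeps the original file name
      - v: 2                        # v2 resolves to dim_customers_v2.sql by default
```

Downstream models then pin a version with `{{ ref('dim_customers', v=1) }}`, and unqualified refs resolve to `latest_version`.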
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/opencode/src/session/system.ts (1)
74-104: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Keep auto-loaded skills outside the LLM selector.
`collectAutoLoadedSkills(filtered)` makes `alwaysApply`/`applyPaths` contingent on `selectSkillsWithLLM(...)`. When fingerprint selection is enabled, an omitted skill never auto-loads, which breaks the new "always apply / auto attached" contract.
Suggested fix
```diff
  let filtered: Skill.Info[]
  if (cfg.experimental?.env_fingerprint_skill_selection === true) {
    filtered = await selectSkillsWithLLM(list, Fingerprint.get())
  } else {
    filtered = list
  }
- // Sort by name for stable, deterministic output across calls.
- filtered = [...filtered].sort((a, b) => a.name.localeCompare(b.name))
+ const autoLoaded = await collectAutoLoadedSkills(list)
+ const visible = [...new Map([...filtered, ...autoLoaded].map((skill) => [skill.name, skill])).values()]
+   .sort((a, b) => a.name.localeCompare(b.name))
@@
- Skill.fmt(filtered, { verbose: true }),
+ Skill.fmt(visible, { verbose: true }),
@@
- const autoLoaded = await collectAutoLoadedSkills(filtered)
  if (autoLoaded.length > 0) {
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/opencode/src/session/system.ts` around lines 74 - 104, The auto-load logic is currently run against the LLM-filtered "filtered" list, which makes collectAutoLoadedSkills(filtered) miss skills excluded by selectSkillsWithLLM; change the flow so collectAutoLoadedSkills runs against the unfiltered skill list (the original "list") and use that result for the auto-loaded block, while still using selectSkillsWithLLM(list, Fingerprint.get()) -> filtered for presentation (Skill.fmt) and sorting; update references to filtered only for display and keep collectAutoLoadedSkills(list) (or a separate variable like autoLoadedFromAll) to determine alwaysApply/applyPaths behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.opencode/skills/dbt-develop/SKILL.md:
- Around line 270-272: Update the documentation guidance about CONCAT_WS: remove
the blanket claim that CONCAT_WS skips NULLs in Snowflake and BigQuery and
instead state explicit, dialect-safe advice — note that Snowflake's CONCAT_WS
propagates NULLs, BigQuery lacks CONCAT_WS (use ARRAY_TO_STRING for
NULL-omitting behavior), and recommend using COALESCE on operands or validating
the adapter-specific NULL semantics before relying on any concat function
(mention CONCAT_WS, ARRAY_TO_STRING, COALESCE by name to help locate the
reference).
In `@packages/opencode/src/session/system.ts`:
- Around line 157-168: The helper anyMatchInWorktree currently swallows
Glob.scan errors via .catch(() => []), preventing the caller's warning path from
seeing scan failures; remove that inline catch so await Glob.scan(g, { ... })
can throw (or replace it with a catch that rethrows the original error) and let
the upstream warning/logging handle it; search for the function
anyMatchInWorktree and the Glob.scan call to update the error handling
accordingly.
---
Outside diff comments:
In `@packages/opencode/src/session/system.ts`:
- Around line 74-104: The auto-load logic is currently run against the
LLM-filtered "filtered" list, which makes collectAutoLoadedSkills(filtered) miss
skills excluded by selectSkillsWithLLM; change the flow so
collectAutoLoadedSkills runs against the unfiltered skill list (the original
"list") and use that result for the auto-loaded block, while still using
selectSkillsWithLLM(list, Fingerprint.get()) -> filtered for presentation
(Skill.fmt) and sorting; update references to filtered only for display and keep
collectAutoLoadedSkills(list) (or a separate variable like autoLoadedFromAll) to
determine alwaysApply/applyPaths behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: cb5b75c6-4bf4-4c38-adb7-0222c2d920b6
📒 Files selected for processing (3)
- .opencode/skills/dbt-develop/SKILL.md
- packages/opencode/src/session/system.ts
- packages/opencode/src/skill/skill.ts
> Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
> skips `NULL` operands instead of propagating them, which is usually safer
> than a static placeholder.
🧩 Analysis chain
🌐 Web query:
`Verify SQL dialect docs:
- Snowflake CONCAT_WS NULL behavior
- BigQuery GoogleSQL support for CONCAT_WS vs CONCAT/ARRAY_TO_STRING`
💡 Result:
1. Snowflake `CONCAT_WS` NULL behavior: Snowflake's `CONCAT_WS` does not skip NULL values. If any argument to `CONCAT_WS` is NULL, the function returns NULL. [1]
2. BigQuery (GoogleSQL) support, `CONCAT_WS` vs `CONCAT`/`ARRAY_TO_STRING`: BigQuery GoogleSQL does not provide a `CONCAT_WS` function. [2] BigQuery's `CONCAT` returns NULL if any input argument is NULL. [3] To get CONCAT_WS-like "join with delimiter while omitting NULLs" behavior, BigQuery's `ARRAY_TO_STRING` is the documented alternative: it omits NULL array elements when `null_text` is not supplied, so it does not emit the delimiter for NULLs. [4] Example (conceptual): `ARRAY_TO_STRING(['foo', NULL, 'bar'], '#')` produces `'foo#bar'` because NULLs and their preceding delimiter are omitted when `null_text` is not used. [4]

So: Snowflake `CONCAT_WS` returns NULL if any argument is NULL; BigQuery has no `CONCAT_WS`, and to emulate "skip NULLs with delimiter" you typically use `ARRAY_TO_STRING` (NULLs omitted unless you provide `null_text`). [1][3][4]
Citations:
- 1: https://docs.snowflake.com/en/sql-reference/functions/concat_ws
- 2: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators
- 3: https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions
- 4: https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/array_functions
Fix inaccurate CONCAT_WS NULL handling guidance in documentation.
Lines 270–272 claim `CONCAT_WS()` is supported in Snowflake and BigQuery while skipping NULL operands. However:

- Snowflake `CONCAT_WS` propagates `NULL` (returns `NULL` if any argument is `NULL`)
- BigQuery does not provide `CONCAT_WS`; use `ARRAY_TO_STRING` instead for NULL-omitting behavior

This misguidance risks silent NULL propagation bugs in generated SQL. Replace with explicit dialect-safe guidance recommending `COALESCE` for operands or verification of adapter-specific NULL semantics before relying on any concat function.
Suggested doc fix
```diff
-Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
-skips `NULL` operands instead of propagating them, which is usually safer
-than a static placeholder.
+Use dialect-safe null handling explicitly. In many engines, string concat
+propagates `NULL` unless you `COALESCE` each operand first.
+If you choose `CONCAT_WS`, verify your adapter's NULL semantics in docs
+before relying on it.
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.opencode/skills/dbt-develop/SKILL.md around lines 270 - 272, Update the
documentation guidance about CONCAT_WS: remove the blanket claim that CONCAT_WS
skips NULLs in Snowflake and BigQuery and instead state explicit, dialect-safe
advice — note that Snowflake's CONCAT_WS propagates NULLs, BigQuery lacks
CONCAT_WS (use ARRAY_TO_STRING for NULL-omitting behavior), and recommend using
COALESCE on operands or validating the adapter-specific NULL semantics before
relying on any concat function (mention CONCAT_WS, ARRAY_TO_STRING, COALESCE by
name to help locate the reference).
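The NULL-propagation pitfall itself is easy to demonstrate with SQLite from the Python standard library; SQLite shares the standard `||` semantics here, though behavior in Snowflake/BigQuery should still be confirmed against their docs:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# `||` propagates NULL: a single NULL operand nulls the whole result.
null_concat = con.execute("SELECT 'US' || '-' || NULL").fetchone()[0]

# COALESCE each operand first so the derived column survives.
safe_concat = con.execute(
    "SELECT 'US' || '-' || COALESCE(NULL, 'UNKNOWN')"
).fetchone()[0]

print(null_concat)  # None
print(safe_concat)  # US-UNKNOWN
```

In a surrogate-key expression, that silent `NULL` becomes an invisible row-dropper on the next inner join, which is exactly the failure mode the skill section warns about.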
```typescript
async function anyMatchInWorktree(globs: string[]): Promise<boolean> {
  // Search from worktree root so a skill that wants `dbt_project.yml`
  // catches the file no matter how deep the user's cwd is.
  const root = Instance.worktree
  for (const g of globs) {
    const matches = await Glob.scan(g, {
      cwd: root,
      absolute: true,
      include: "file",
      dot: false,
      symlink: false,
    }).catch(() => [] as string[])
```
Let Glob.scan failures reach the warning path.
The inline .catch(() => []) turns invalid glob / scan errors into a silent “no match”, so the warning on Lines 144-146 never fires and applyPaths failures are invisible.
Suggested fix
```diff
 for (const g of globs) {
   const matches = await Glob.scan(g, {
     cwd: root,
     absolute: true,
     include: "file",
     dot: false,
     symlink: false,
-  }).catch(() => [] as string[])
+  })
   if (matches.length > 0) return true
 }
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```typescript
async function anyMatchInWorktree(globs: string[]): Promise<boolean> {
  // Search from worktree root so a skill that wants `dbt_project.yml`
  // catches the file no matter how deep the user's cwd is.
  const root = Instance.worktree
  for (const g of globs) {
    const matches = await Glob.scan(g, {
      cwd: root,
      absolute: true,
      include: "file",
      dot: false,
      symlink: false,
    })
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/opencode/src/session/system.ts` around lines 157 - 168, The helper
anyMatchInWorktree currently swallows Glob.scan errors via .catch(() => []),
preventing the caller's warning path from seeing scan failures; remove that
inline catch so await Glob.scan(g, { ... }) can throw (or replace it with a
catch that rethrows the original error) and let the upstream warning/logging
handle it; search for the function anyMatchInWorktree and the Glob.scan call to
update the error handling accordingly.
2 issues found across 3 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".opencode/skills/dbt-develop/SKILL.md">
<violation number="1" location=".opencode/skills/dbt-develop/SKILL.md:270">
P2: `CONCAT_WS` support/behavior is documented incorrectly: BigQuery does not support `CONCAT_WS`, and Snowflake `CONCAT_WS` does not skip NULLs. This guidance can produce failing or incorrect SQL in both dialects.</violation>
</file>
<file name="packages/opencode/src/session/system.ts">
<violation number="1" location="packages/opencode/src/session/system.ts:168">
P2: `Glob.scan` errors are swallowed, so `applyPaths` scan failures are silently ignored instead of being logged by the caller.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
> Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
> skips `NULL` operands instead of propagating them, which is usually safer
P2: CONCAT_WS support/behavior is documented incorrectly: BigQuery does not support CONCAT_WS, and Snowflake CONCAT_WS does not skip NULLs. This guidance can produce failing or incorrect SQL in both dialects.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .opencode/skills/dbt-develop/SKILL.md, line 270:
<comment>`CONCAT_WS` support/behavior is documented incorrectly: BigQuery does not support `CONCAT_WS`, and Snowflake `CONCAT_WS` does not skip NULLs. This guidance can produce failing or incorrect SQL in both dialects.</comment>
<file context>
@@ -252,6 +255,44 @@ CASE WHEN cond THEN CAST('0' AS NUMERIC) ELSE CAST(0 AS NUMERIC) END
+-- Right: explicit placeholder
+COALESCE(region, 'UNKNOWN') || '-' || COALESCE(segment, 'UNKNOWN') AS geo_segment
+```
+Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
+skips `NULL` operands instead of propagating them, which is usually safer
+than a static placeholder.
</file context>
```diff
-Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
-skips `NULL` operands instead of propagating them, which is usually safer
+Use dialect-specific NULL-safe concatenation patterns. In BigQuery, use
+`ARRAY_TO_STRING([...], '-')` to skip `NULL`s; in Snowflake, `CONCAT_WS`
+still returns `NULL` when any argument is `NULL`, so wrap operands with
+`COALESCE(...)`.
```
```typescript
      include: "file",
      dot: false,
      symlink: false,
    }).catch(() => [] as string[])
```
P2: Glob.scan errors are swallowed, so applyPaths scan failures are silently ignored instead of being logged by the caller.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/opencode/src/session/system.ts, line 168:
<comment>`Glob.scan` errors are swallowed, so `applyPaths` scan failures are silently ignored instead of being logged by the caller.</comment>
<file context>
@@ -78,14 +82,93 @@ export namespace SystemPrompt {
+ include: "file",
+ dot: false,
+ symlink: false,
+ }).catch(() => [] as string[])
+ if (matches.length > 0) return true
+ }
</file context>
```diff
-    }).catch(() => [] as string[])
+    })
```
Adds reference for the new auto-load mechanism to docs/docs/configure/skills.md:
- Lists the two new frontmatter fields in the Frontmatter Fields table
- New "Auto-loading skills" section explaining the lazy-load default, how `alwaysApply` and `applyPaths` change it, a worked example, a "when to use" table, and an honest section on context-size implications + prompt-cache amortization

Pure documentation update — no code change in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/docs/configure/skills.md`:
- Around line 68-72: The fenced code block showing the <auto_loaded_skill>
element lacks a language identifier causing lint warnings; update the markdown
code fence to include a language tag (use "xml") so the block becomes ```xml ...
``` around the <auto_loaded_skill name="<skill-name>"> ... </auto_loaded_skill>
snippet to enable proper syntax highlighting and satisfy the linter.
- Around line 48-50: The example in applyPaths lists both "dbt_project.yml" and
"**/dbt_project.yml", which are redundant because a bare filename already
matches at any depth; update the docs by removing the "**/dbt_project.yml" entry
or add a short clarifying sentence explaining why both are shown (e.g., that
both patterns are equivalent and the second is optional/for explicitness).
Ensure the change references the applyPaths example and the "dbt_project.yml"
filename so readers understand the intended behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 85dd32e2-0eb5-436e-a2f1-b942c9209597
📒 Files selected for processing (1)
docs/docs/configure/skills.md
```yaml
applyPaths:
  - "dbt_project.yml" # matches if any dbt_project.yml exists in the worktree
  - "**/dbt_project.yml"
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Description: Find the glob-matching logic for applyPaths to verify recursive behavior
# Search for applyPaths glob matching implementation
rg -n -C5 'applyPaths.*glob|glob.*applyPaths' --type=ts
# Look for minimatch or glob library usage in session/system context
rg -n -C3 'minimatch|micromatch|glob.*match' packages/opencode/src/session/system.ts
# Find where applyPaths is processed
ast-grep --pattern 'applyPaths'
```
Repository: AltimateAI/altimate-code
Length of output: 3091
🏁 Script executed:
```shell
rg -n "normalizeApplyPaths|anyMatchInWorktree" --type=ts -A 10
```
Repository: AltimateAI/altimate-code
Length of output: 2534
🏁 Script executed:
```shell
rg -n "import.*Glob|from.*Glob" packages/opencode/src/session/system.ts --type=ts
```
Repository: AltimateAI/altimate-code
Length of output: 106
🏁 Script executed:
```shell
cat -n packages/opencode/src/util/glob.ts
```
Repository: AltimateAI/altimate-code
Length of output: 1257
Clarify or remove the redundant glob pattern.
The example shows both "dbt_project.yml" and "**/dbt_project.yml". A bare filename already matches files at any depth in the worktree (as stated in the codebase comment: "a skill that wants dbt_project.yml catches the file no matter how deep the user's cwd is"). The second pattern is functionally identical and may confuse users. Either explain why both are shown or remove the redundant one.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/docs/configure/skills.md` around lines 48 - 50, The example in
applyPaths lists both "dbt_project.yml" and "**/dbt_project.yml", which are
redundant because a bare filename already matches at any depth; update the docs
by removing the "**/dbt_project.yml" entry or add a short clarifying sentence
explaining why both are shown (e.g., that both patterns are equivalent and the
second is optional/for explicitness). Ensure the change references the
applyPaths example and the "dbt_project.yml" filename so readers understand the
intended behavior.
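Whether a bare filename matches at depth depends entirely on the glob engine. Under Python's stdlib `fnmatch` rules (shown here purely for illustration), a bare filename is root-only; the project's `Glob.scan` wraps Bun's `Glob`, whose semantics may differ, which is exactly why the two reviews disagree on this point:

```python
import fnmatch

paths = ["dbt_project.yml", "apps/analytics/dbt_project.yml"]

# Bare filename: the pattern has no directory component, so under
# fnmatch-style matching it only hits the root-level file.
root_only = [p for p in paths if fnmatch.fnmatch(p, "dbt_project.yml")]

# "**/" prefix: requires at least one path separator before the filename,
# so it hits the nested file (fnmatch's `*` also crosses `/`).
nested = [p for p in paths if fnmatch.fnmatch(p, "**/dbt_project.yml")]

print(root_only)  # ['dbt_project.yml']
print(nested)     # ['apps/analytics/dbt_project.yml']
```

Given that divergence across engines, documenting both patterns (or verifying the engine's actual behavior and keeping one) is the safer fix.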
````text
```
<auto_loaded_skill name="<skill-name>">
... full skill body ...
</auto_loaded_skill>
```
````
Add language identifier to code block.
The code block is missing a language specifier, which prevents proper syntax highlighting and triggers linting warnings.
📝 Proposed fix
````diff
-```
+```xml
 <auto_loaded_skill name="<skill-name>">
 ... full skill body ...
 </auto_loaded_skill>
 ```
````
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 68-68: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/docs/configure/skills.md` around lines 68 - 72, The fenced code block
showing the <auto_loaded_skill> element lacks a language identifier causing lint
warnings; update the markdown code fence to include a language tag (use "xml")
so the block becomes ```xml ... ``` around the <auto_loaded_skill
name="<skill-name>"> ... </auto_loaded_skill> snippet to enable proper syntax
highlighting and satisfy the linter.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="docs/docs/configure/skills.md">
<violation number="1" location="docs/docs/configure/skills.md:49">
P3: The `applyPaths` example comment is inaccurate: `"dbt_project.yml"` does not match anywhere in the worktree, only at the root.</violation>
</file>
```yaml
  - "dbt_project.yml" # matches if any dbt_project.yml exists in the worktree
  - "**/dbt_project.yml"
```
P3: The applyPaths example comment is inaccurate: "dbt_project.yml" does not match anywhere in the worktree, only at the root.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/docs/configure/skills.md, line 49:
<comment>The `applyPaths` example comment is inaccurate: `"dbt_project.yml"` does not match anywhere in the worktree, only at the root.</comment>
<file context>
@@ -28,7 +28,75 @@ Focus on the query: $ARGUMENTS
+---
+name: dbt-develop
+applyPaths:
+ - "dbt_project.yml" # matches if any dbt_project.yml exists in the worktree
+ - "**/dbt_project.yml"
+description: ...
</file context>
```diff
-  - "dbt_project.yml" # matches if any dbt_project.yml exists in the worktree
-  - "**/dbt_project.yml"
+  - "dbt_project.yml" # matches only at the worktree root
+  - "**/dbt_project.yml" # matches anywhere under the worktree
```
Two changes informed by trace analysis of the benchmark run with the
initial auto-load mechanism. With the auto-loaded body present in the
system prompt, 6 of 8 sampled failing trials never referenced any of
its guidance keywords (date spine, tiebreaker, deliverable, etc.) —
the model was treating the auto-loaded section as background reference
rather than binding directive. These two changes address the framing.
(1) `feat(system-prompt)`: move auto-loaded skill bodies BEFORE the
lazy-loaded `<available_skills>` XML block in the skills section.
Previously the order was:
1. "Use the skill tool to load a skill..." preamble
2. <available_skills> XML (long, descriptions only)
3. <auto_loaded_skill> body (binding guidance)
Now:
1. <auto_loaded_skill> body (binding guidance — read FIRST)
2. "Skills provide specialized instructions..." preamble
3. <available_skills> XML (lazy-loaded skills the agent can opt into)
Framing the auto-loaded body as "rules of the road" at the start
rather than supplementary documentation at the end. Pure ordering
change in `SystemPrompt.skills()` parts array — no schema or API
change. Applies to any skill using `applyPaths` or `alwaysApply`.
File: packages/opencode/src/session/system.ts
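The reordering can be sketched as below; the function shape and preamble strings approximate the described `SystemPrompt.skills()` change and are not the project's literal code:

```typescript
// Illustrative only: auto-loaded bodies go first, then the lazy-load
// preamble, then the opt-in catalog.
function skillsSection(autoLoadedBodies: string[], availableSkillsXml: string): string {
  const parts: string[] = [];
  // 1. Binding guidance first: auto-loaded skill bodies ("rules of the road").
  for (const body of autoLoadedBodies) {
    parts.push(body);
  }
  // 2. Then the preamble and the lazy-loaded skill catalog.
  parts.push("Skills provide specialized instructions. Use the skill tool to load one.");
  parts.push(availableSkillsXml);
  return parts.join("\n\n");
}

const prompt = skillsSection(
  ['<auto_loaded_skill name="dbt-develop">...</auto_loaded_skill>'],
  "<available_skills>...</available_skills>",
);
console.log(prompt.indexOf("auto_loaded_skill") < prompt.indexOf("available_skills"));
```

The change is purely positional: the same parts are emitted, only their order in the joined prompt differs.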
(2) `docs(skill)`: add a "Pre-completion checklist" section (§5) to
dbt-develop that the agent is told to emit with `[x]/[ ]` marks
before declaring the task done.
Each item is a yes/no question against patterns the skill already
documents (LEFT JOIN cardinality, date-spine completeness,
window-rank tiebreaker, type harmonization in COALESCE/CASE/UNION,
string-concat NULL handling, uniqueness enforcement, incremental
high-water mark, snapshot strategy, dbt model versioning v2,
unit-test verification).
The forcing function: the agent must produce the checklist text in
its final message. Unchecked items without a stated "n/a" reason
mean the task is not done. Forces the model to slow down at the
end and verify the patterns against the SQL it just wrote, rather
than silently skip the verification phase.
All items are generic dbt patterns applicable to any project — no
benchmark-specific test names, no solution-seed values, no
grading-rubric hints.
File: .opencode/skills/dbt-develop/SKILL.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
👋 This PR was automatically closed by our quality checks. Common reasons:
If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you. |
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.opencode/skills/dbt-develop/SKILL.md:
- Around line 165-167: Replace the non-recursive "ls models/" check with a
recursive file-discovery command so nested model files aren't missed; update the
SKILL.md checklist to use a recursive listing (e.g., recursive ls or find)
targeting model files (reference the current "ls models/" line) and ensure the
new command filters for model file types so deliverable verification includes
files in subdirectories.
📒 Files selected for processing (2)
- .opencode/skills/dbt-develop/SKILL.md
- packages/opencode/src/session/system.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/opencode/src/session/system.ts
```
ls models/          # confirm every requested file exists
altimate-dbt info   # confirm every requested model is in the project
```
Use recursive file discovery instead of `ls models/` for deliverable verification.
`ls models/` only shows top-level entries, so nested model files can be missed during checklist validation. Prefer a recursive check command.
Suggested doc tweak
-ls models/ # confirm every requested file exists
+find models -type f -name "*.sql"  # confirm every requested model file exists (including nested dirs)
…result
The "emit a [x]/[ ] checklist before declaring done" addition to
dbt-develop (§5, shipped two commits ago) was measured negative on
the post-A+B benchmark re-run:
- Checklist appeared in 6 of 14 still-failing trial outputs.
- Zero of those 6 flipped to PASS.
- In multiple traces, the agent self-marked `[x] LEFT JOIN
cardinality correct` while the underlying SQL still had the
exact phantom-row bug the item warned against.
The framing trained the model to perform verification theater
rather than actually re-read its SQL. The two flips attributed
earlier to "A+B" (helixops_saas007, helixops_saas009) trace back
to the placement reorder (A) — the checklist (B) contributed
nothing measurable, and adds 50+ lines of system-prompt content
for no benefit.
This commit:
(1) Removes §5 from `.opencode/skills/dbt-develop/SKILL.md`.
The other sections (Plan → Discover → Write → Validate,
Common Pitfalls in Transformation Logic, Iron Rules) stay
intact. The placement reorder in `system.ts` and the
`applyPaths`/`alwaysApply` frontmatter mechanism stay.
(2) Adds a "What we tried that didn't work" section to
research/kimi-k26-ade-bench-2026-05-10/findings.md so the
negative result is preserved as institutional knowledge.
The broader principle — "soft self-verification (model
promises it checked X) is unreliable on this model class;
hard verification (compile/test failures) still works" — is
worth keeping around.
(3) Updates the findings TL;DR with both the original 81.3%
headline and the post-second-wave 85.3% best-of-runs number,
with the caveat that the body of the post analyzes the
first-wave traces.
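That principle lends itself to a tiny sketch. The types and names here are hypothetical — the agent loop's actual gating API is not part of this PR:

```typescript
// Hard verification: completion is gated on observable signals (exit codes),
// never on the model's own attestation. Shapes and names are illustrative.
type Gate = { done: boolean; reason?: string };

function verifyHard(compileExit: number, testExit: number): Gate {
  if (compileExit !== 0) return { done: false, reason: "compile failed" };
  if (testExit !== 0) return { done: false, reason: "tests failed" };
  return { done: true }; // only observed success counts
}
```

A self-marked `[x]` never enters this function: the only inputs are signals the model cannot fabricate.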
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
❌ Tests — Failures Detected
TypeScript — 15 failure(s)
Next Step: Please address the failing cases above and re-run verification.
Summary
A multi-part PR from a benchmarking session evaluating Moonshot Kimi-K2.6 (via OpenRouter) on ADE-Bench through altimate-code's agent loop. Headline: 61 / 75 = 81.3% pass rate, $14.91 total, ~9.6 hours wall.
The PR splits into four logical groups, each shipping standalone value:
1. Research / blog-ready writeup
- `research/kimi-k26-ade-bench-2026-05-10/findings.md` (~570 lines) — behavioral profile of Kimi-K2.6 as a coding agent. Wall-clock anatomy (~89% model generation, ~5% tools), prompt-cache amplification (85.8% cache hit, 6.86× median ratio), per-failure-class taxonomy, tool-correlation analysis, honest comparison context.
- `research/kimi-k26-ade-bench-2026-05-10/README.md` — folder index.

2. Reproduction scaffolding (`benchmark/ade-bench/`)

Everything needed to plug altimate-code into upstream `dbt-labs/ade-bench` and reproduce the 81.3% number. Deliberately excludes traces / built tarball / seed data — those regenerate. Includes:

- `altimate_code_agent/` — drop-in module (agent class, JSON parser, in-container install script, linux/x64+arm64 tarball builder)
- `patches/` — 4 small patches against upstream ade-bench (registers `AgentName.ALTIMATE_CODE`, wires factory + imports, routes `shared/config/AGENTS.md` to altimate the same way Codex receives it)
- `README.md` — full prereqs, step-by-step setup, env-var knob reference, troubleshooting

3. Shipped skill improvements
Additive, generic dbt patterns surfaced during failure-trace analysis. All applicable to any real dbt project — no benchmark-specific content.
- `.opencode/skills/dbt-develop/SKILL.md`: incremental high-water marks (`>=` vs `>`), snapshot strategy selection, `LEFT JOIN + COUNT(*)` phantom rows, type harmonization in `COALESCE`/`CASE`/`UNION`, date-spine completeness, off-by-one window boundaries, uniqueness enforcement, window-rank + `LIMIT` determinism; `NULL` operands — `||`/`CONCAT` propagate NULL, so wrap with `COALESCE` or use `CONCAT_WS`; model versioning via a `versions:` block with `defined_in:`, not sibling `_v2.sql` files
- `.opencode/skills/dbt-unit-tests/SKILL.md`: mock data must exercise every SQL construct's failure mode (`LEFT JOIN` unmatched parents, `NULLIF` zero, `CASE` branches, `COALESCE` all-null, window boundaries, date spines, etc.)
alwaysApply/applyPaths) — new featureBenchmark trace analysis showed the agent invokes the
Skilltool in <1% of all tool calls, so skill content the agent already has access to often never reaches its context. This adds Cursor-/Claude-Code-style auto-attachment to altimate-code's skill system.API: two optional skill-frontmatter fields:
Wire-up: at session start, after the existing
<available_skills>block,SystemPrompt.skills()runs each skill'sapplyPathsglob viaGlob.scan({ cwd: Instance.worktree }). Matched skills are appended to the system prompt under:Backwards compatible: skills without either field are unaffected (description-only in
<available_skills>, lazy-loaded via theSkilltool exactly as before).Files:
packages/opencode/src/skill/skill.ts— schema extension + parse plumbing (filesystem + binary-embedded paths)packages/opencode/src/session/system.ts— auto-inline logic with helper functions.opencode/skills/dbt-develop/SKILL.md— frontmatter now declaresapplyPaths: ["dbt_project.yml", "**/dbt_project.yml"]docs/docs/configure/skills.md— documents the new fields, includes a "when to use" table and an honest section on context-size implicationsContext-size impact (verified via trace inspection of running benchmark trials):
/root/.local/share/altimate-code/traces/*.jsonconfirm the<auto_loaded_skill>block ships in the system-prompt spanVerification: trace inspection on actual benchmark containers confirms the body lands in the system prompt only when
dbt_project.ymlexists in the worktree.Test Plan
bun run typecheckclean on the auto-load implementationbun run script/build.ts --targets=linuxrecompiles linux/x64 + linux/arm64 binaries;grep -ac auto_loaded_skill <binary>returns 4 on both arches/root/.local/share/altimate-code/traces/*.json— confirmed<auto_loaded_skill name="dbt-develop">is present in the system-prompt span whendbt_project.ymlexistsChecklist
docs/docs/configure/skills.mdcovers the new frontmatter fields and the auto-loading section🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation