Skip to content

docs: Kimi-K2.6 ADE-Bench behavioral analysis + dbt skill improvements#807

Open
anandgupta42 wants to merge 6 commits into
mainfrom
research/kimi-k26-ade-bench
Open

docs: Kimi-K2.6 ADE-Bench behavioral analysis + dbt skill improvements#807
anandgupta42 wants to merge 6 commits into
mainfrom
research/kimi-k26-ade-bench

Conversation

@anandgupta42
Copy link
Copy Markdown
Contributor

@anandgupta42 anandgupta42 commented May 11, 2026

PINEAPPLE

Summary

A multi-part PR from a benchmarking session evaluating Moonshot Kimi-K2.6 (via OpenRouter) on ADE-Bench through altimate-code's agent loop. Headline: 61 / 75 = 81.3% pass rate, $14.91 total, ~9.6 hours wall.

The PR splits into four logical groups, each shipping standalone value:

1. Research / blog-ready writeup

  • research/kimi-k26-ade-bench-2026-05-10/findings.md (~570 lines) — behavioral profile of Kimi-K2.6 as a coding agent. Wall-clock anatomy (~89% model generation, ~5% tools), prompt-cache amplification (85.8% cache hit, 6.86× median ratio), per-failure-class taxonomy, tool-correlation analysis, honest comparison context.
  • Full appendices: per-trial manifest, pass-rate by family, every skill invocation, cost/runtime distribution, reproducibility command line, glossary, open questions, file index for blog illustration.
  • research/kimi-k26-ade-bench-2026-05-10/README.md — folder index.

2. Reproduction scaffolding (benchmark/ade-bench/)

Everything needed to plug altimate-code into upstream dbt-labs/ade-bench and reproduce the 81.3% number. Deliberately excludes traces / built tarball / seed data — those regenerate. Includes:

  • altimate_code_agent/ — drop-in module (agent class, JSON parser, in-container install script, linux/x64+arm64 tarball builder)
  • patches/ — 4 small patches against upstream ade-bench (registers AgentName.ALTIMATE_CODE, wires factory + imports, routes shared/config/AGENTS.md to altimate the same way Codex receives it)
  • README.md — full prereqs, step-by-step setup, env-var knob reference, troubleshooting

3. Shipped skill improvements

Additive, generic dbt patterns surfaced during failure-trace analysis. All applicable to any real dbt project — no benchmark-specific content.

  • .opencode/skills/dbt-develop/SKILL.md:
    • Imperative description with explicit invocation triggers
    • "Common Pitfalls in Transformation Logic" section: incremental high-water mark >=, snapshot strategy selection, LEFT JOIN + COUNT(*) phantom rows, type harmonization in COALESCE/CASE/UNION, date-spine completeness, off-by-one window boundaries, uniqueness enforcement, window-rank+LIMIT determinism
    • String concatenation with NULL operands||/CONCAT propagate NULL; wrap with COALESCE or use CONCAT_WS
    • dbt model versioning (1.8+) — use versions: block with defined_in:, not sibling _v2.sql files
    • Deliverable-enumeration step + iron rule
    • Unit-test verification step + iron rule
  • .opencode/skills/dbt-unit-tests/SKILL.md:
    • New iron rule requiring mock data to exercise every SQL construct's failure mode (LEFT JOIN unmatched parents, NULLIF zero, CASE branches, COALESCE all-null, window boundaries, date spines, etc.)

4. Auto-load skill mechanism (alwaysApply / applyPaths) — new feature

Benchmark trace analysis showed the agent invokes the Skill tool in <1% of all tool calls, so skill content the agent already has access to often never reaches its context. This adds Cursor-/Claude-Code-style auto-attachment to altimate-code's skill system.

API: two optional skill-frontmatter fields:

applyPaths: "dbt_project.yml"      # or array; auto-load when match exists in worktree
# or
alwaysApply: true                  # unconditional auto-load

Wire-up: at session start, after the existing <available_skills> block, SystemPrompt.skills() runs each skill's applyPaths glob via Glob.scan({ cwd: Instance.worktree }). Matched skills are appended to the system prompt under:

<auto_loaded_skill name="...">
... full body ...
</auto_loaded_skill>

Backwards compatible: skills without either field are unaffected (description-only in <available_skills>, lazy-loaded via the Skill tool exactly as before).

Files:

  • packages/opencode/src/skill/skill.ts — schema extension + parse plumbing (filesystem + binary-embedded paths)
  • packages/opencode/src/session/system.ts — auto-inline logic with helper functions
  • .opencode/skills/dbt-develop/SKILL.md — frontmatter now declares applyPaths: ["dbt_project.yml", "**/dbt_project.yml"]
  • docs/docs/configure/skills.md — documents the new fields, includes a "when to use" table and an honest section on context-size implications

Context-size impact (verified via trace inspection of running benchmark trials):

  • Non-dbt sessions: 0 tokens added (glob doesn't match, no auto-load)
  • dbt sessions: ~5K tokens added to system prompt (the dbt-develop body)
  • Real cost amortizes to ~$0.02 per session thanks to 85.8% prompt-cache hit rate
  • Trace files at /root/.local/share/altimate-code/traces/*.json confirm the <auto_loaded_skill> block ships in the system-prompt span

Verification: trace inspection on actual benchmark containers confirms the body lands in the system prompt only when dbt_project.yml exists in the worktree.

Test Plan

  • Full ADE-Bench sweep (75 trials) with these changes → 61 / 75 = 81.3% pass rate
  • bun run typecheck clean on the auto-load implementation
  • bun run script/build.ts --targets=linux recompiles linux/x64 + linux/arm64 binaries; grep -ac auto_loaded_skill <binary> returns 4 on both arches
  • In-container verification: ran a smoke-test session in a benchmark trial container, inspected /root/.local/share/altimate-code/traces/*.json — confirmed <auto_loaded_skill name="dbt-develop"> is present in the system-prompt span when dbt_project.yml exists
  • Re-audited all skill changes for benchmark-leaking phrasing (one slip caught & fixed: "leading cause of equality-test failures" → "leading cause of silent-correctness bugs"). No test names, no solution seeds, no grading-rubric hints.
  • Trace-level audit by 5 parallel sub-agents confirmed the failure patterns these changes address are recurring real-project issues, not benchmark-specific.
  • Reproduction guide tested end-to-end: clone ade-bench → drop in agent module → apply patches → build tarball → run.

Checklist

  • Tests added/updated — N/A (no executable code in skills; the new auto-load logic is reachable via the existing skills loading + system-prompt construction paths and exercised by the production agent loop)
  • Documentation updated — docs/docs/configure/skills.md covers the new frontmatter fields and the auto-loading section
  • CHANGELOG updated — N/A (additive product improvement; release notes will pull from commit messages)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Altimate Code agent added to ADE‑Bench with local build/install tooling and top-level agent availability.
    • Session prompts can auto-inline applicable skills using new alwaysApply/applyPaths metadata.
  • Documentation

    • dbt skill guidance revamped: mandatory-first-step note, expanded failure-mode guidance, explicit plan/validate/pre-completion checklists, and required unit-test verification.
    • Added ADE‑Bench README and a detailed benchmark findings report.

Review Change Stack

…ments

Adds research/kimi-k26-ade-bench-2026-05-10/ with a blog-ready writeup of how
the Moonshot Kimi-K2.6 model behaves as a coding agent inside altimate-code's
agent loop, derived from 78 trial traces against ADE-Bench. Findings cover
tool-usage distribution, wall-clock anatomy (~89% model generation, ~5%
tools), prompt-cache amplification (85.8%), per-failure-class taxonomy, and
extended appendices (per-trial manifest, pass-rate by family, skill
invocation log, cost/runtime distribution, reproducibility command, glossary,
open questions).

Also extends two shipped skills with generic dbt-best-practice patterns
surfaced during the analysis (all benchmark-agnostic, applicable to any dbt
project):

- dbt-develop/SKILL.md
  * stronger description with explicit invocation triggers
  * new section on transformation-logic pitfalls: incremental high-water
    marks (>= vs >), snapshot strategy selection, LEFT JOIN + COUNT(*)
    phantom rows, type harmonization in COALESCE/CASE/UNION, date-spine
    completeness, off-by-one window boundaries, uniqueness enforcement,
    window-LIMIT tiebreakers
  * deliverable-enumeration step in Validate phase + iron rule
  * unit-test verification step + iron rule
- dbt-unit-tests/SKILL.md
  * new iron rule requiring mock data to exercise every SQL construct's
    failure mode (LEFT JOIN unmatched parents, NULLIF zero, CASE branches,
    COALESCE all-null, window boundaries, date spines, etc.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR strengthens dbt skill docs with explicit correctness preconditions and unit-test requirements, adds a Kimi‑K2.6 ADE‑Bench benchmark (README + findings), integrates an Altimate Code ADE‑Bench agent with packaging/install scripts and ADE‑Bench patches, and enables session auto-loading of skills via frontmatter.

Changes

dbt Skills Documentation Enhancement

Layer / File(s) Summary
Skill Description and Preconditions
.opencode/skills/dbt-develop/SKILL.md
Expanded skill description with "invoke first" precondition and failure-mode checklist covering incremental marks, snapshots, joins, counts, type harmonization, date spines, window boundaries, and deterministic ranking.
Plan Checklist and Enumeration
.opencode/skills/dbt-develop/SKILL.md
Plan step now requires enumerating every requested deliverable (models, columns, tests, config) as a checklist for later validation.
Validate Step with Unit Test Requirement
.opencode/skills/dbt-develop/SKILL.md
Validate step mandates using dbt-unit-tests for non-trivial transformations and walking the plan checklist to verify SQL file presence, manifest entries, expected columns, and materialization/config.
Pre-completion Checklist & Iron Rules
.opencode/skills/dbt-develop/SKILL.md
Pre-completion checklist added; Iron Rules extended to require unit-test verification and explicit deliverable check-off.
Common Pitfalls Expanded
.opencode/skills/dbt-develop/SKILL.md
Common Pitfalls expanded for incremental/snapshot boundaries, date arithmetic and spine completeness, type harmonization, NULL-sensitive concatenation, model versioning, uniqueness, deterministic top‑N, and COUNT(*)/LEFT JOIN warning.
Unit Tests Mock Data Coverage
.opencode/skills/dbt-unit-tests/SKILL.md
Iron Rules now require mock data that triggers failure modes for every SQL construct with a checklist of universal edge cases (joins, NULL semantics, CASE logic, division, windows, date spines, aggregations, incremental merges).

Kimi‑K2.6 ADE‑Bench Evaluation Report & Agent

Layer / File(s) Summary
Benchmark Summary
research/kimi-k26-ade-bench-2026-05-10/README.md
New README summarizing Kimi-K2.6 ADE‑Bench run (pass rates, cost, wall-clock) with pointers to findings and trace locations.
Findings Overview and Methodology
research/kimi-k26-ade-bench-2026-05-10/findings.md
Detailed findings: run identity, headline metrics, methodology, behavioral profile, failure taxonomy, reasoning/token accounting, and appendices with reproduction steps and trace indices.
Behavioral Profile & Failure Analysis
research/kimi-k26-ade-bench-2026-05-10/findings.md
Behavioral analysis covering tool-call distribution, step/turn stats, wall-clock breakdown, cost distribution, iteration patterns after dbt failures, and semantic failure taxonomy.
ADE‑Bench Repro README
benchmark/ade-bench/README.md
Reproduction README with folder structure, prerequisites, end-to-end commands, knobs, troubleshooting, and pointers to findings.
Agent Package Export
benchmark/ade-bench/altimate_code_agent/__init__.py
Re-exports AltimateCodeAgent as the package top-level symbol.
altimate-code Install/Setup Script
benchmark/ade-bench/altimate_code_agent/altimate-code-setup.sh
Installs altimate-code (prefers local tarball), selects arch-specific binaries, prints version, and conditionally writes provider config for Azure/OpenRouter.
AltimateCodeAgent Implementation
benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py
Adds AltimateCodeLogFormatter, AltimateCodeParser, and AltimateCodeAgent to run altimate-code CLI in JSON mode, parse event streams for metrics, format logs, and extract non-core tools used.
Local Tarball Builder
benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh
Script to stage and pack a minimal local altimate-code-local.tgz tarball for reproduction runs; validates native binaries and dbt-tools artifacts.
ADE‑Bench Patches
benchmark/ade-bench/patches/*
Patches to add AgentName.ALTIMATE_CODE, register AltimateCodeAgent in the factory, export it in installed_agents.__init__, and configure AGENTS.md for ALTIMATE_CODE in setup_agent_config.

Session Prompt Auto-load via Skill Frontmatter

Layer / File(s) Summary
Skill.Info schema & parsing
packages/opencode/src/skill/skill.ts
Adds optional frontmatter fields alwaysApply and applyPaths to Skill.Info and propagates them for filesystem and builtin skills.
SystemPrompt.skills auto-load
packages/opencode/src/session/system.ts, docs/docs/configure/skills.md
SystemPrompt.skills now can auto-inline matched skills' full content wrapped in <auto_loaded_skill ...> blocks by scanning the worktree with glob patterns and honoring alwaysApply; docs updated to document alwaysApply/applyPaths behavior.

Sequence Diagram(s)

sequenceDiagram
  participant Harness
  participant AltimateCodeAgent
  participant "altimate-code CLI"
  participant LogFile
  participant Parser as AltimateCodeParser
  Harness->>AltimateCodeAgent: perform_task(task_prompt, env)
  AltimateCodeAgent->>"altimate-code CLI": run --format json --yolo [--model] (copy local tarball if present)
  "altimate-code CLI"->>LogFile: emit JSON event stream
  AltimateCodeAgent->>LogFile: read log file
  Parser->>LogFile: parse events (step_finish/tool_start/tool_end)
  Parser->>AltimateCodeAgent: metrics (runtime_ms, tokens, cost, success)
  AltimateCodeAgent->>Harness: return AgentResult (formatted log, metrics, tools_used)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

contributor

Poem

🐰 Hops through docs with careful care,
Checklists guard against a silent snare,
Tests that mock each edge and pair,
Repro scripts pack the agent to share,
Kimi's findings told with research flair.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: documenting Kimi-K2.6 benchmark results and improving dbt skills with new auto-load mechanisms.
Description check ✅ Passed The description is comprehensive, well-structured with clear sections (Summary, Test Plan, Checklist), includes the required PINEAPPLE marker, and documents all major changes and their rationale.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch research/kimi-k26-ade-bench

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
research/kimi-k26-ade-bench-2026-05-10/findings.md (2)

209-219: 💤 Low value

Minor: Add language identifier to code block.

Static analysis (markdownlint) flags this fenced code block as missing a language specifier.

Suggested fix
-```
+```text
 [completed] Explore project structure and source models
 [completed] Query sample data to understand part_types and author_types
 [in_progress] Create intercom__conversation_metrics.sql model
 [pending] Validate SQL syntax and analyze for anti-patterns
 [pending] Build the model and verify output
 [pending] Run full project build to ensure no regressions

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @research/kimi-k26-ade-bench-2026-05-10/findings.md around lines 209 - 219,
The fenced checklist in findings.md is missing a language identifier, which
triggers markdownlint; update the triple-backtick fence surrounding the
checklist (the block that lists the six steps including "Create
intercom__conversation_metrics.sql model" and the status lines) to include a
language tag such as text (e.g., change totext) so the code block is
properly annotated for markdownlint and renderers.


</details>

---

`87-93`: _💤 Low value_

**Minor: Add language identifier to code block.**

Static analysis (markdownlint) flags this fenced code block as missing a language specifier. Adding `text` or an appropriate identifier improves rendering consistency.




<details>
<summary>Suggested fix</summary>

```diff
-  ```
+  ```text
   [pending] Add position_descriptions to f1_dataset.yml sources
   [pending] Create src_<model>.sql views in models/src/ pointing to source tables
   [pending] Update staging models to reference src_ models instead of raw tables
   [pending] Run dbt build to verify everything compiles and builds successfully
   ```
```

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @research/kimi-k26-ade-bench-2026-05-10/findings.md around lines 87 - 93, The
fenced code block in findings.md is missing a language identifier which triggers
markdownlint; update the opening triple-backtick for the block that contains the
four "[pending] ..." lines to include a language specifier (e.g., change totext or another appropriate identifier) so the block reads ```text and
notifies renderers/linting tools of the content type; ensure you only modify the
opening fence and keep the block contents unchanged.


</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @research/kimi-k26-ade-bench-2026-05-10/findings.md:

  • Line 276: The line contains a branding leak: replace the phrase "beyond
    OpenCode's base set" in the findings text with a neutral, non-branded
    alternative (e.g., "beyond the base toolset" or "beyond the project's base
    toolset"); update the sentence "altimate-code ships dbt-specific tools beyond
    OpenCode's base set." to a reworded version such as "altimate-code ships
    dbt-specific tools beyond the base toolset." to remove the product name while
    preserving meaning.
  • Line 5: The line containing "Harness: altimate-code (a fork of OpenCode
    wrapping the model in a coding-agent loop...)" leaks the OpenCode product name;
    remove or reword that parenthetical. Replace "a fork of OpenCode" with a neutral
    phrase such as "an internal fork of a coding-agent framework" or simply "a
    forked coding-agent wrapper" and keep the rest of the Harness description intact
    (refer to the Harness: altimate-code and model id
    openrouter/moonshotai/kimi-k2.6-20260420 to locate the exact sentence to
    edit).
  • Line 30: The phrase "standard OpenCode toolset" leaks branding; update the
    text in the findings entry that mentions OpenCode (the sentence listing tools:
    bash, read, write, edit, glob, grep, todowrite) to remove the
    product name and use a neutral term such as "standard code toolset" or "standard
    toolset" (or similar wording), ensuring the rest of the tool list and
    altimate-specific tools (project_scan, sql_analyze, sql_execute, etc.)
    remain unchanged.

Nitpick comments:
In @research/kimi-k26-ade-bench-2026-05-10/findings.md:

  • Around line 209-219: The fenced checklist in findings.md is missing a language
    identifier, which triggers markdownlint; update the triple-backtick fence
    surrounding the checklist (the block that lists the six steps including "Create
    intercom__conversation_metrics.sql model" and the status lines) to include a
    language tag such as text (e.g., change totext) so the code block is
    properly annotated for markdownlint and renderers.
  • Around line 87-93: The fenced code block in findings.md is missing a language
    identifier which triggers markdownlint; update the opening triple-backtick for
    the block that contains the four "[pending] ..." lines to include a language
    specifier (e.g., change totext or another appropriate identifier) so the
    block reads ```text and notifies renderers/linting tools of the content type;
    ensure you only modify the opening fence and keep the block contents unchanged.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended)
- [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Repository UI

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `5425a1b0-ef0d-4535-b5f1-7894fc31c513`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between c859b57ec46925a7a3c1bcd735c5afa1f365c029 and e7e1d9227ee9409bed1d05da21980a815f5e77f9.

</details>

<details>
<summary>📒 Files selected for processing (4)</summary>

* `.opencode/skills/dbt-develop/SKILL.md`
* `.opencode/skills/dbt-unit-tests/SKILL.md`
* `research/kimi-k26-ade-bench-2026-05-10/README.md`
* `research/kimi-k26-ade-bench-2026-05-10/findings.md`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->


*Notes from running the Moonshot Kimi-K2.6 model (via OpenRouter) inside altimate-code's dbt-aware agent loop on the ADE-Bench analytics/data-engineering benchmark.*

Date: 2026-05-10. Model id: `openrouter/moonshotai/kimi-k2.6-20260420`. Harness: altimate-code (a fork of OpenCode wrapping the model in a coding-agent loop with extra dbt/SQL/warehouse tools).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Branding leak detected.

Pipeline failure indicates "OpenCode (product name)" appears in this line. The phrase "a fork of OpenCode" must be removed or reworded to comply with branding guidelines.

Suggested fix
-Date: 2026-05-10. Model id: `openrouter/moonshotai/kimi-k2.6-20260420`. Harness: altimate-code (a fork of OpenCode wrapping the model in a coding-agent loop with extra dbt/SQL/warehouse tools).
+Date: 2026-05-10. Model id: `openrouter/moonshotai/kimi-k2.6-20260420`. Harness: altimate-code (wrapping the model in a coding-agent loop with extra dbt/SQL/warehouse tools).
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Date: 2026-05-10. Model id: `openrouter/moonshotai/kimi-k2.6-20260420`. Harness: altimate-code (a fork of OpenCode wrapping the model in a coding-agent loop with extra dbt/SQL/warehouse tools).
Date: 2026-05-10. Model id: `openrouter/moonshotai/kimi-k2.6-20260420`. Harness: altimate-code (wrapping the model in a coding-agent loop with extra dbt/SQL/warehouse tools).
🧰 Tools
🪛 GitHub Actions: CI / 5_Marker Guard.txt

[error] 5-5: Branding audit found leak (OpenCode (product name)). Line 5: "OpenCode (product name)" with model id openrouter/moonshotai/kimi-k2.6-...

🪛 GitHub Actions: CI / Marker Guard

[error] 5-5: Branding audit leak found: "OpenCode (product name)". Context: "Date: 2026-05-10. Model id: openrouter/moonshotai/kimi-k2.6-20260420. Harne..."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@research/kimi-k26-ade-bench-2026-05-10/findings.md` at line 5, The line
containing "Harness: altimate-code (a fork of OpenCode wrapping the model in a
coding-agent loop...)" leaks the OpenCode product name; remove or reword that
parenthetical. Replace "a fork of OpenCode" with a neutral phrase such as "an
internal fork of a coding-agent framework" or simply "a forked coding-agent
wrapper" and keep the rest of the Harness description intact (refer to the
Harness: altimate-code and model id `openrouter/moonshotai/kimi-k2.6-20260420`
to locate the exact sentence to edit).

Each trial:

1. The harness starts a container, scaffolds the dbt project, and hands the agent a natural-language prompt.
2. altimate-code spins up its agent loop. The model is Kimi-K2.6 routed through OpenRouter using altimate-code's OpenAI-compatible provider. The agent has the standard OpenCode toolset (`bash`, `read`, `write`, `edit`, `glob`, `grep`, `todowrite`) plus altimate-specific tools (`project_scan`, `sql_analyze`, `sql_execute`, `warehouse_*`, `dbt_manifest`, `dbt_profiles`, `dbt_lineage`, `altimate_core_validate`, `altimate_memory_*`, `schema_*`, `lineage_check`, `skill`, `tool_lookup`).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Branding leak detected.

Pipeline failure indicates "OpenCode (product name)" appears in this line. The phrase "standard OpenCode toolset" must be reworded.

Suggested fix
-2. altimate-code spins up its agent loop. The model is Kimi-K2.6 routed through OpenRouter using altimate-code's OpenAI-compatible provider. The agent has the standard OpenCode toolset (`bash`, `read`, `write`, `edit`, `glob`, `grep`, `todowrite`) plus altimate-specific tools (`project_scan`, `sql_analyze`, `sql_execute`, `warehouse_*`, `dbt_manifest`, `dbt_profiles`, `dbt_lineage`, `altimate_core_validate`, `altimate_memory_*`, `schema_*`, `lineage_check`, `skill`, `tool_lookup`).
+2. altimate-code spins up its agent loop. The model is Kimi-K2.6 routed through OpenRouter using altimate-code's OpenAI-compatible provider. The agent has the standard toolset (`bash`, `read`, `write`, `edit`, `glob`, `grep`, `todowrite`) plus altimate-specific tools (`project_scan`, `sql_analyze`, `sql_execute`, `warehouse_*`, `dbt_manifest`, `dbt_profiles`, `dbt_lineage`, `altimate_core_validate`, `altimate_memory_*`, `schema_*`, `lineage_check`, `skill`, `tool_lookup`).
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
2. altimate-code spins up its agent loop. The model is Kimi-K2.6 routed through OpenRouter using altimate-code's OpenAI-compatible provider. The agent has the standard OpenCode toolset (`bash`, `read`, `write`, `edit`, `glob`, `grep`, `todowrite`) plus altimate-specific tools (`project_scan`, `sql_analyze`, `sql_execute`, `warehouse_*`, `dbt_manifest`, `dbt_profiles`, `dbt_lineage`, `altimate_core_validate`, `altimate_memory_*`, `schema_*`, `lineage_check`, `skill`, `tool_lookup`).
2. altimate-code spins up its agent loop. The model is Kimi-K2.6 routed through OpenRouter using altimate-code's OpenAI-compatible provider. The agent has the standard toolset (`bash`, `read`, `write`, `edit`, `glob`, `grep`, `todowrite`) plus altimate-specific tools (`project_scan`, `sql_analyze`, `sql_execute`, `warehouse_*`, `dbt_manifest`, `dbt_profiles`, `dbt_lineage`, `altimate_core_validate`, `altimate_memory_*`, `schema_*`, `lineage_check`, `skill`, `tool_lookup`).
🧰 Tools
🪛 GitHub Actions: CI / 5_Marker Guard.txt

[error] 30-30: Branding audit found leak (OpenCode (product name)). Line 30 references altimate-code and model routing.

🪛 GitHub Actions: CI / Marker Guard

[error] 30-30: Branding audit leak found: "OpenCode (product name)". Context: "2. altimate-code spins up its agent loop. The model is Kimi-K2.6 routed throu..."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@research/kimi-k26-ade-bench-2026-05-10/findings.md` at line 30, The phrase
"standard OpenCode toolset" leaks branding; update the text in the findings
entry that mentions OpenCode (the sentence listing tools: `bash`, `read`,
`write`, `edit`, `glob`, `grep`, `todowrite`) to remove the product name and use
a neutral term such as "standard code toolset" or "standard toolset" (or similar
wording), ensuring the rest of the tool list and altimate-specific tools
(`project_scan`, `sql_analyze`, `sql_execute`, etc.) remain unchanged.


## 6. Where the custom tools helped (or didn't)

altimate-code ships dbt-specific tools beyond OpenCode's base set. Pass-rate correlations:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Branding leak detected.

Pipeline failure indicates "OpenCode (product name)" appears in this line. The phrase "beyond OpenCode's base set" must be reworded.

Suggested fix
-altimate-code ships dbt-specific tools beyond OpenCode's base set. Pass-rate correlations:
+altimate-code ships dbt-specific tools beyond the base set. Pass-rate correlations:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
altimate-code ships dbt-specific tools beyond OpenCode's base set. Pass-rate correlations:
altimate-code ships dbt-specific tools beyond the base set. Pass-rate correlations:
🧰 Tools
🪛 GitHub Actions: CI / 5_Marker Guard.txt

[error] 276-276: Branding audit found leak (OpenCode (product name)). Line 276 mentions altimate-code shipping dbt-specific tools beyond OpenCode.

🪛 GitHub Actions: CI / Marker Guard

[error] 276-276: Branding audit leak found: "OpenCode (product name)". Context: "altimate-code ships dbt-specific tools beyond OpenCode's base set. Pass-rate ..."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@research/kimi-k26-ade-bench-2026-05-10/findings.md` at line 276, The line
contains a branding leak: replace the phrase "beyond OpenCode's base set" in the
findings text with a neutral, non-branded alternative (e.g., "beyond the base
toolset" or "beyond the project's base toolset"); update the sentence
"altimate-code ships dbt-specific tools beyond OpenCode's base set." to a
reworded version such as "altimate-code ships dbt-specific tools beyond the base
toolset." to remove the product name while preserving meaning.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 4 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="research/kimi-k26-ade-bench-2026-05-10/findings.md">

<violation number="1" location="research/kimi-k26-ade-bench-2026-05-10/findings.md:111">
P3: The step-gap interval label is inconsistent with the glossary definition and can mislead readers about what was measured.</violation>

<violation number="2" location="research/kimi-k26-ade-bench-2026-05-10/findings.md:236">
P2: The `f1011` taxonomy note inverts pass/fail status for `check_option_b` and contradicts the appendix data.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

| **Date-spine completeness** | `airbnb009` | Kimi understood the task but did not generate a date-spine join; it kept the original `GROUP BY DATE_TRUNC` which drops empty days. dbt_utils was installed; Kimi just didn't reach for it. |
| **dbt-specific features (versioned models, snapshots, materialization)** | `airbnb007` (`models_are_materialized_correctly`), `airbnb010`, `helixops_saas009`, `f1008` | Created `dim_accounts_v2.sql` instead of using dbt's `versions:` keyword. Snapshot task wrote a regular model instead of a `snapshots/` directory file. |
| **Type harmonization in `CASE` / `COALESCE`** | `analytics_engineering004` | LEFT JOIN of inventory to product details where product details are NULL for some rows; model coerced types inconsistently. |
| **Multi-part reasoning over-confidence** | `f1011` | Multiple-choice question where Kimi answered `ABDE`. Only `check_option_b` passed; Kimi rationalized E with apparent confidence, but the gold answer set differed. |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The f1011 taxonomy note inverts pass/fail status for check_option_b and contradicts the appendix data.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At research/kimi-k26-ade-bench-2026-05-10/findings.md, line 236:

<comment>The `f1011` taxonomy note inverts pass/fail status for `check_option_b` and contradicts the appendix data.</comment>

<file context>
@@ -0,0 +1,571 @@
+| **Date-spine completeness** | `airbnb009` | Kimi understood the task but did not generate a date-spine join; it kept the original `GROUP BY DATE_TRUNC` which drops empty days. dbt_utils was installed; Kimi just didn't reach for it. |
+| **dbt-specific features (versioned models, snapshots, materialization)** | `airbnb007` (`models_are_materialized_correctly`), `airbnb010`, `helixops_saas009`, `f1008` | Created `dim_accounts_v2.sql` instead of using dbt's `versions:` keyword. Snapshot task wrote a regular model instead of a `snapshots/` directory file. |
+| **Type harmonization in `CASE` / `COALESCE`** | `analytics_engineering004` | LEFT JOIN of inventory to product details where product details are NULL for some rows; model coerced types inconsistently. |
+| **Multi-part reasoning over-confidence** | `f1011` | Multiple-choice question where Kimi answered `ABDE`. Only `check_option_b` passed; Kimi rationalized E with apparent confidence, but the gold answer set differed. |
+| **Refactor reference updates** | `asana004` | Created the new intermediate model correctly but didn't fully update all downstream `ref()` calls. `check_task_references` failed. |
+| **Trivial / setup** | `simple001`, `workday001` | `simple001` renamed a model but missed a downstream reference. `workday001`'s prompt is literally *"Do nothing"* and the agent halted in 2 seconds — possibly a bench bug. |
</file context>

| Phase | Total time | Share of wall |
|---|---:|---:|
| Step duration (`step_start → step_finish`: model generation + tool dispatch) | 22,745 s | 66.1% |
| Step-to-step gaps (`step_start → next step_start`) | 30,672 s | 89.2% |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: The step-gap interval label is inconsistent with the glossary definition and can mislead readers about what was measured.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At research/kimi-k26-ade-bench-2026-05-10/findings.md, line 111:

<comment>The step-gap interval label is inconsistent with the glossary definition and can mislead readers about what was measured.</comment>

<file context>
@@ -0,0 +1,571 @@
+| Phase | Total time | Share of wall |
+|---|---:|---:|
+| Step duration (`step_start → step_finish`: model generation + tool dispatch) | 22,745 s | 66.1% |
+| Step-to-step gaps (`step_start → next step_start`) | 30,672 s | 89.2% |
+| Tool execution (sum of all individual `tool_use` durations) | 1,690 s | 4.9% |
+| Total runtime | 34,402 s | 100% |
</file context>
Suggested change
| Step-to-step gaps (`step_start → next step_start`) | 30,672 s | 89.2% |
| Step-to-step gaps (`step_finish → next step_start`) | 30,672 s | 89.2% |

Adds the source-code + scripts + 4 small patches needed to plug
altimate-code into upstream ade-bench. Lets anyone reproduce the
81.3% pass rate described in research/kimi-k26-ade-bench-2026-05-10/
without trusting the pre-aggregated numbers.

What's included:
- benchmark/ade-bench/README.md — full reproduction guide (prereqs,
  Docker memory, env-var knobs, step-by-step commands, troubleshooting)
- benchmark/ade-bench/altimate_code_agent/ — drop-in agent module
  (AltimateCodeAgent class, JSON event parser, log formatter, install
  script that runs inside the trial container, tarball builder)
- benchmark/ade-bench/patches/ — 4 small patches against upstream
  dbt-labs/ade-bench (register AgentName.ALTIMATE_CODE, wire it into
  the AgentFactory, export from installed_agents/__init__.py, route
  the existing shared/config/AGENTS.md baseline file the same way
  Codex receives it — pure parity, no benchmark-specific content)

Explicitly NOT in this folder:
- Trace files / per-trial agent.log / results.json (regenerable)
- The 130 MB built tarball (build-local-tarball.sh recreates it)
- Seed DuckDB databases (downloaded from dbt-labs/ade-bench releases)
- Per-task ground-truth seeds + test SQL (those live in upstream
  ade-bench and are never sent to the agent at run time)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (4)
benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py (2)

225-225: 💤 Low value

Remove unnecessary f-string prefix.

The f-prefix is not needed since there are no format placeholders in this string.

🧹 Proposed fix
-        command = f"echo 'AGENT RESPONSE: ' && altimate-code run --format json --yolo"
+        command = "echo 'AGENT RESPONSE: ' && altimate-code run --format json --yolo"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py` at line 225,
The string assigned to variable "command" in altimate_code_agent.py is using an
unnecessary f-string; replace the f-prefixed string in the assignment to command
(currently: command = f"...") with a plain string literal (command = "echo
'AGENT RESPONSE: ' && altimate-code run --format json --yolo") so there are no
unused format prefixes.

58-59: ⚡ Quick win

Consider logging parse errors for debugging.

The bare except: pass silently swallows all parsing errors, making it difficult to debug malformed log files during benchmark development. While silent failure is acceptable for tooling, adding a minimal error indicator would improve troubleshooting.

🔍 Proposed improvement
-        except Exception:
-            pass
+        except Exception as e:
+            # Return partial results; log parse errors are non-fatal in benchmark context
+            import sys
+            print(f"Warning: log parse error: {e}", file=sys.stderr)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py` around lines
58 - 59, The bare "except: pass" in the parsing block silently swallows errors;
change it to "except Exception as e" and log a minimal error message including
the exception (e.g., using logging.getLogger(__name__).warning or .exception)
with context like "Failed to parse log entry" so malformed inputs are visible
during debugging; ensure the module has a logger configured (import logging and
getLogger) before using it.
benchmark/ade-bench/README.md (1)

9-22: ⚡ Quick win

Add language identifier to the fenced code block.

The code block showing the directory structure would benefit from a language identifier for proper syntax highlighting.

📝 Proposed fix
-```
+```text
 benchmark/ade-bench/
 ├── README.md                              ← you are here
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/ade-bench/README.md` around lines 9 - 22, Update the fenced code
block in README.md to include a language identifier for proper highlighting:
change the opening triple backticks that currently start the directory-tree
block to use "text" (i.e., ```text) so the tree shown (the block containing
benchmark/ade-bench/ and the listed files like altimate_code_agent/ and
patches/) is rendered with correct formatting.
benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh (1)

83-87: ⚡ Quick win

Prefer find over ls for discovering the tarball.

The current approach using ls works but is sensitive to locale and could behave unexpectedly if multiple tarballs exist. A find-based approach provides better control and predictability.

♻️ Proposed refactor using find
-TARBALL="$(ls -1 "$STAGE"/altimate-code-*.tgz | head -1)"
+TARBALL="$(find "$STAGE" -maxdepth 1 -name 'altimate-code-*.tgz' -print -quit)"
 if [[ -z "$TARBALL" ]]; then
   echo "pack failed: no tarball produced" >&2
   exit 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh` around lines
83 - 87, Replace the fragile ls-based discovery of the tarball by using find:
instead of assigning TARBALL via ls on "$STAGE", run a find rooted at "$STAGE"
with -maxdepth 1 -type f -name "altimate-code-*.tgz" -print -quit to reliably
pick the first match, then check if TARBALL is empty and exit with the same
error handling; update references to TARBALL and keep the existing error
message/exit behavior unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py`:
- Line 225: The string assigned to variable "command" in altimate_code_agent.py
is using an unnecessary f-string; replace the f-prefixed string in the
assignment to command (currently: command = f"...") with a plain string literal
(command = "echo 'AGENT RESPONSE: ' && altimate-code run --format json --yolo")
so there are no unused format prefixes.
- Around line 58-59: The bare "except: pass" in the parsing block silently
swallows errors; change it to "except Exception as e" and log a minimal error
message including the exception (e.g., using logging.getLogger(__name__).warning
or .exception) with context like "Failed to parse log entry" so malformed inputs
are visible during debugging; ensure the module has a logger configured (import
logging and getLogger) before using it.

In `@benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh`:
- Around line 83-87: Replace the fragile ls-based discovery of the tarball by
using find: instead of assigning TARBALL via ls on "$STAGE", run a find rooted
at "$STAGE" with -maxdepth 1 -type f -name "altimate-code-*.tgz" -print -quit to
reliably pick the first match, then check if TARBALL is empty and exit with the
same error handling; update references to TARBALL and keep the existing error
message/exit behavior unchanged.

In `@benchmark/ade-bench/README.md`:
- Around line 9-22: Update the fenced code block in README.md to include a
language identifier for proper highlighting: change the opening triple backticks
that currently start the directory-tree block to use "text" (i.e., ```text) so
the tree shown (the block containing benchmark/ade-bench/ and the listed files
like altimate_code_agent/ and patches/) is rendered with correct formatting.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 778af701-c01c-4a00-96d9-848f6ea6aded

📥 Commits

Reviewing files that changed from the base of the PR and between e7e1d92 and df9a3d5.

📒 Files selected for processing (9)
  • benchmark/ade-bench/README.md
  • benchmark/ade-bench/altimate_code_agent/__init__.py
  • benchmark/ade-bench/altimate_code_agent/altimate-code-setup.sh
  • benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py
  • benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh
  • benchmark/ade-bench/patches/01-agent_name.py.patch
  • benchmark/ade-bench/patches/02-agent_factory.py.patch
  • benchmark/ade-bench/patches/03-installed_agents_init.py.patch
  • benchmark/ade-bench/patches/04-agent_setup.py.patch
✅ Files skipped from review due to trivial changes (1)
  • benchmark/ade-bench/patches/03-installed_agents_init.py.patch

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 9 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="benchmark/ade-bench/altimate_code_agent/altimate-code-setup.sh">

<violation number="1" location="benchmark/ade-bench/altimate_code_agent/altimate-code-setup.sh:29">
P2: Avoid `@latest` in benchmark setup fallback; it makes runs non-reproducible and can silently change agent behavior.</violation>
</file>

<file name="benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh">

<violation number="1" location="benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh:11">
P1: `REPO_ROOT` is computed with too many `..` segments, so package paths resolve outside the repository and the tarball build fails.</violation>
</file>

<file name="benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py">

<violation number="1" location="benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py:228">
P1: Shell command construction does not quote `self._model_name`, which allows command injection or malformed execution when model IDs contain shell metacharacters.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../../../../.." && pwd)"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: REPO_ROOT is computed with too many .. segments, so package paths resolve outside the repository and the tarball build fails.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At benchmark/ade-bench/altimate_code_agent/build-local-tarball.sh, line 11:

<comment>`REPO_ROOT` is computed with too many `..` segments, so package paths resolve outside the repository and the tarball build fails.</comment>

<file context>
@@ -0,0 +1,90 @@
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../../../../../.." && pwd)"
+PKG_DIR="$REPO_ROOT/packages/opencode"
+DBT_TOOLS_DIR="$REPO_ROOT/packages/dbt-tools"
</file context>
Suggested change
REPO_ROOT="$(cd "$SCRIPT_DIR/../../../../../.." && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"

command = f"echo 'AGENT RESPONSE: ' && altimate-code run --format json --yolo"

if self._model_name:
command += f" --model {self._model_name}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Shell command construction does not quote self._model_name, which allows command injection or malformed execution when model IDs contain shell metacharacters.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At benchmark/ade-bench/altimate_code_agent/altimate_code_agent.py, line 228:

<comment>Shell command construction does not quote `self._model_name`, which allows command injection or malformed execution when model IDs contain shell metacharacters.</comment>

<file context>
@@ -0,0 +1,264 @@
+        command = f"echo 'AGENT RESPONSE: ' && altimate-code run --format json --yolo"
+
+        if self._model_name:
+            command += f" --model {self._model_name}"
+        command += f" --max-turns 80 {escaped_prompt}"
+
</file context>

chmod 755 "$PKG_BIN_DIR/.altimate-code" "$PKG_BIN_DIR/.altimate"
else
echo "Local tarball not staged; falling back to latest published"
npm install -g --no-audit --no-fund @altimateai/altimate-code@latest
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Avoid @latest in benchmark setup fallback; it makes runs non-reproducible and can silently change agent behavior.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At benchmark/ade-bench/altimate_code_agent/altimate-code-setup.sh, line 29:

<comment>Avoid `@latest` in benchmark setup fallback; it makes runs non-reproducible and can silently change agent behavior.</comment>

<file context>
@@ -0,0 +1,106 @@
+  chmod 755 "$PKG_BIN_DIR/.altimate-code" "$PKG_BIN_DIR/.altimate"
+else
+  echo "Local tarball not staged; falling back to latest published"
+  npm install -g --no-audit --no-fund @altimateai/altimate-code@latest
+fi
+
</file context>

…itfalls

Two related changes, both shipped to every altimate-code user.

(1) `feat(skill)`: add `alwaysApply: bool` and `applyPaths: string|string[]`
    frontmatter to skill metadata, mirroring Cursor's "Always Apply" and
    "Auto Attached" rule modes. When a skill is `alwaysApply: true` or has
    `applyPaths` matching at least one file under the worktree, its body
    is inlined into the system prompt at session start under an
    `<auto_loaded_skill>` block — the model no longer needs to invoke the
    Skill tool to access that guidance.

    Motivation: benchmark traces show the agent invokes the `Skill` tool
    in <1% of tool calls, even after the skill description is rewritten
    to be imperative. Many failures occur on patterns the relevant skill
    already documents but the agent never loads. Auto-loading puts the
    body deterministically in context for projects where the skill
    applies.

    Files:
      • packages/opencode/src/skill/skill.ts — Info schema + both load
        paths (filesystem + binary-embedded) pluck the new fields
      • packages/opencode/src/session/system.ts — auto-inline matched
        skill bodies after the existing available_skills XML block
      • .opencode/skills/dbt-develop/SKILL.md — frontmatter now declares
        `applyPaths: [dbt_project.yml, **/dbt_project.yml]`, so dbt
        projects auto-load this skill's body (~270 lines of dbt
        best-practice patterns) at session start

    The existing skill-tool-invocation path is unchanged; auto-load is
    additive. Skills without `alwaysApply` / `applyPaths` continue to
    require explicit invocation. Prompt caching amortizes the extra
    tokens across the long agent loop.

(2) `docs(skill)`: three new generic dbt pitfall sections in
    `dbt-develop/SKILL.md`, all benchmark-agnostic best practices
    surfaced during failure-trace analysis:

    • String concatenation with `NULL` operands — `||` / `CONCAT`
      propagate `NULL`; wrap with `COALESCE` or use `CONCAT_WS`.
      Catches an invisible row-dropper in surrogate-key generation and
      derived columns.
    • dbt model versioning (dbt 1.8+) — when introducing a v2 of an
      existing model, use dbt's `versions:` block in `_models.yml` with
      `defined_in:`, not a sibling `_v2.sql` file. Otherwise downstream
      lineage and `{{ ref(model, v=2) }}` resolution break.
    • Strengthened the existing window-rank + `LIMIT` section to call
      out determinism explicitly, including the `QUALIFY ROW_NUMBER()
      OVER (... ORDER BY metric, id)` form and the "if you can't think
      of a tiebreaker, you don't have a unique key yet" framing.

    All three patterns are documented in well-known dbt style guides
    and would benefit any real altimate-code user — they are not
    benchmark-targeted tweaks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/opencode/src/session/system.ts (1)

74-104: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep auto-loaded skills outside the LLM selector.

collectAutoLoadedSkills(filtered) makes alwaysApply / applyPaths contingent on selectSkillsWithLLM(...). When fingerprint selection is enabled, an omitted skill never auto-loads, which breaks the new “always apply / auto attached” contract.

Suggested fix
     let filtered: Skill.Info[]
     if (cfg.experimental?.env_fingerprint_skill_selection === true) {
       filtered = await selectSkillsWithLLM(list, Fingerprint.get())
     } else {
       filtered = list
     }
-    // Sort by name for stable, deterministic output across calls.
-    filtered = [...filtered].sort((a, b) => a.name.localeCompare(b.name))
+    const autoLoaded = await collectAutoLoadedSkills(list)
+    const visible = [...new Map([...filtered, ...autoLoaded].map((skill) => [skill.name, skill])).values()]
+      .sort((a, b) => a.name.localeCompare(b.name))
@@
-      Skill.fmt(filtered, { verbose: true }),
+      Skill.fmt(visible, { verbose: true }),
@@
-    const autoLoaded = await collectAutoLoadedSkills(filtered)
     if (autoLoaded.length > 0) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/opencode/src/session/system.ts` around lines 74 - 104, The auto-load
logic is currently run against the LLM-filtered "filtered" list, which makes
collectAutoLoadedSkills(filtered) miss skills excluded by selectSkillsWithLLM;
change the flow so collectAutoLoadedSkills runs against the unfiltered skill
list (the original "list") and use that result for the auto-loaded block, while
still using selectSkillsWithLLM(list, Fingerprint.get()) -> filtered for
presentation (Skill.fmt) and sorting; update references to filtered only for
display and keep collectAutoLoadedSkills(list) (or a separate variable like
autoLoadedFromAll) to determine alwaysApply/applyPaths behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.opencode/skills/dbt-develop/SKILL.md:
- Around line 270-272: Update the documentation guidance about CONCAT_WS: remove
the blanket claim that CONCAT_WS skips NULLs in Snowflake and BigQuery and
instead state explicit, dialect-safe advice — note that Snowflake's CONCAT_WS
propagates NULLs, BigQuery lacks CONCAT_WS (use ARRAY_TO_STRING for
NULL-omitting behavior), and recommend using COALESCE on operands or validating
the adapter-specific NULL semantics before relying on any concat function
(mention CONCAT_WS, ARRAY_TO_STRING, COALESCE by name to help locate the
reference).

In `@packages/opencode/src/session/system.ts`:
- Around line 157-168: The helper anyMatchInWorktree currently swallows
Glob.scan errors via .catch(() => []), preventing the caller's warning path from
seeing scan failures; remove that inline catch so await Glob.scan(g, { ... })
can throw (or replace it with a catch that rethrows the original error) and let
the upstream warning/logging handle it; search for the function
anyMatchInWorktree and the Glob.scan call to update the error handling
accordingly.

---

Outside diff comments:
In `@packages/opencode/src/session/system.ts`:
- Around line 74-104: The auto-load logic is currently run against the
LLM-filtered "filtered" list, which makes collectAutoLoadedSkills(filtered) miss
skills excluded by selectSkillsWithLLM; change the flow so
collectAutoLoadedSkills runs against the unfiltered skill list (the original
"list") and use that result for the auto-loaded block, while still using
selectSkillsWithLLM(list, Fingerprint.get()) -> filtered for presentation
(Skill.fmt) and sorting; update references to filtered only for display and keep
collectAutoLoadedSkills(list) (or a separate variable like autoLoadedFromAll) to
determine alwaysApply/applyPaths behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: cb5b75c6-4bf4-4c38-adb7-0222c2d920b6

📥 Commits

Reviewing files that changed from the base of the PR and between df9a3d5 and d8a1add.

📒 Files selected for processing (3)
  • .opencode/skills/dbt-develop/SKILL.md
  • packages/opencode/src/session/system.ts
  • packages/opencode/src/skill/skill.ts

Comment on lines +270 to +272
Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
skips `NULL` operands instead of propagating them, which is usually safer
than a static placeholder.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

`Verify SQL dialect docs:

  1. Snowflake CONCAT_WS NULL behavior
  2. BigQuery GoogleSQL support for CONCAT_WS vs CONCAT/ARRAY_TO_STRING`

💡 Result:

  1. Snowflake CONCAT_WS NULL behavior Snowflake’s CONCAT_WS does not skip NULL values. If any argument to CONCAT_WS is NULL, the function returns NULL. [1] 2) BigQuery (GoogleSQL) support: CONCAT_WS vs CONCAT/ARRAY_TO_STRING BigQuery GoogleSQL does not provide a CONCAT_WS function. [2] BigQuery’s CONCAT returns NULL if any input argument is NULL. [3] To get CONCAT_WS-like “join with delimiter while omitting NULLs”, BigQuery’s ARRAY_TO_STRING is the documented alternative: ARRAY_TO_STRING omits NULL array elements when null_text is not supplied (so it does not emit the delimiter for NULLs). [4] Example (conceptual): ARRAY_TO_STRING(['foo', NULL, 'bar'], '#') produces 'foo#bar' because NULLs are omitted and their preceding delimiter is omitted when null_text is not used. [4] So: Snowflake CONCAT_WS = NULL in any position => NULL result; BigQuery has no CONCAT_WS, and to emulate “skip NULLs with delimiter” you typically use ARRAY_TO_STRING (where NULLs are omitted unless you provide null_text). [1][3][4]

Citations:


Fix inaccurate CONCAT_WS NULL handling guidance in documentation.

Lines 270–272 claim CONCAT_WS() is supported in Snowflake and BigQuery while skipping NULL operands. However:

  • Snowflake CONCAT_WS propagates NULL (returns NULL if any argument is NULL)
  • BigQuery does not provide CONCAT_WS; use ARRAY_TO_STRING instead for NULL-omitting behavior

This misguidance risks silent NULL propagation bugs in generated SQL. Replace with explicit dialect-safe guidance recommending COALESCE for operands or verification of adapter-specific NULL semantics before relying on any concat function.

Suggested doc fix
-Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
-skips `NULL` operands instead of propagating them, which is usually safer
-than a static placeholder.
+Use dialect-safe null handling explicitly. In many engines, string concat
+propagates `NULL` unless you `COALESCE` each operand first.
+If you choose `CONCAT_WS`, verify your adapter's NULL semantics in docs
+before relying on it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.opencode/skills/dbt-develop/SKILL.md around lines 270 - 272, Update the
documentation guidance about CONCAT_WS: remove the blanket claim that CONCAT_WS
skips NULLs in Snowflake and BigQuery and instead state explicit, dialect-safe
advice — note that Snowflake's CONCAT_WS propagates NULLs, BigQuery lacks
CONCAT_WS (use ARRAY_TO_STRING for NULL-omitting behavior), and recommend using
COALESCE on operands or validating the adapter-specific NULL semantics before
relying on any concat function (mention CONCAT_WS, ARRAY_TO_STRING, COALESCE by
name to help locate the reference).

Comment on lines +157 to +168
async function anyMatchInWorktree(globs: string[]): Promise<boolean> {
// Search from worktree root so a skill that wants `dbt_project.yml`
// catches the file no matter how deep the user's cwd is.
const root = Instance.worktree
for (const g of globs) {
const matches = await Glob.scan(g, {
cwd: root,
absolute: true,
include: "file",
dot: false,
symlink: false,
}).catch(() => [] as string[])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Let Glob.scan failures reach the warning path.

The inline .catch(() => []) turns invalid glob / scan errors into a silent “no match”, so the warning on Lines 144-146 never fires and applyPaths failures are invisible.

Suggested fix
     for (const g of globs) {
-      const matches = await Glob.scan(g, {
+      const matches = await Glob.scan(g, {
         cwd: root,
         absolute: true,
         include: "file",
         dot: false,
         symlink: false,
-      }).catch(() => [] as string[])
+      })
       if (matches.length > 0) return true
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async function anyMatchInWorktree(globs: string[]): Promise<boolean> {
// Search from worktree root so a skill that wants `dbt_project.yml`
// catches the file no matter how deep the user's cwd is.
const root = Instance.worktree
for (const g of globs) {
const matches = await Glob.scan(g, {
cwd: root,
absolute: true,
include: "file",
dot: false,
symlink: false,
}).catch(() => [] as string[])
async function anyMatchInWorktree(globs: string[]): Promise<boolean> {
// Search from worktree root so a skill that wants `dbt_project.yml`
// catches the file no matter how deep the user's cwd is.
const root = Instance.worktree
for (const g of globs) {
const matches = await Glob.scan(g, {
cwd: root,
absolute: true,
include: "file",
dot: false,
symlink: false,
})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/opencode/src/session/system.ts` around lines 157 - 168, The helper
anyMatchInWorktree currently swallows Glob.scan errors via .catch(() => []),
preventing the caller's warning path from seeing scan failures; remove that
inline catch so await Glob.scan(g, { ... }) can throw (or replace it with a
catch that rethrows the original error) and let the upstream warning/logging
handle it; search for the function anyMatchInWorktree and the Glob.scan call to
update the error handling accordingly.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name=".opencode/skills/dbt-develop/SKILL.md">

<violation number="1" location=".opencode/skills/dbt-develop/SKILL.md:270">
P2: `CONCAT_WS` support/behavior is documented incorrectly: BigQuery does not support `CONCAT_WS`, and Snowflake `CONCAT_WS` does not skip NULLs. This guidance can produce failing or incorrect SQL in both dialects.</violation>
</file>

<file name="packages/opencode/src/session/system.ts">

<violation number="1" location="packages/opencode/src/session/system.ts:168">
P2: `Glob.scan` errors are swallowed, so `applyPaths` scan failures are silently ignored instead of being logged by the caller.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Comment on lines +270 to +271
Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
skips `NULL` operands instead of propagating them, which is usually safer
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: CONCAT_WS support/behavior is documented incorrectly: BigQuery does not support CONCAT_WS, and Snowflake CONCAT_WS does not skip NULLs. This guidance can produce failing or incorrect SQL in both dialects.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At .opencode/skills/dbt-develop/SKILL.md, line 270:

<comment>`CONCAT_WS` support/behavior is documented incorrectly: BigQuery does not support `CONCAT_WS`, and Snowflake `CONCAT_WS` does not skip NULLs. This guidance can produce failing or incorrect SQL in both dialects.</comment>

<file context>
@@ -252,6 +255,44 @@ CASE WHEN cond THEN CAST('0' AS NUMERIC) ELSE CAST(0 AS NUMERIC) END
+-- Right: explicit placeholder
+COALESCE(region, 'UNKNOWN') || '-' || COALESCE(segment, 'UNKNOWN') AS geo_segment
+```
+Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
+skips `NULL` operands instead of propagating them, which is usually safer
+than a static placeholder.
</file context>
Suggested change
Use `CONCAT_WS()` if your dialect supports it (Snowflake, BigQuery) — it
skips `NULL` operands instead of propagating them, which is usually safer
Use dialect-specific NULL-safe concatenation patterns. In BigQuery, use `ARRAY_TO_STRING([...], '-')` to skip `NULL`s; in Snowflake, `CONCAT_WS` still returns `NULL` when any argument is `NULL`, so wrap operands with `COALESCE(...)`.

Tip: Review your code locally with the cubic CLI to iterate faster.

include: "file",
dot: false,
symlink: false,
}).catch(() => [] as string[])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Glob.scan errors are swallowed, so applyPaths scan failures are silently ignored instead of being logged by the caller.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/opencode/src/session/system.ts, line 168:

<comment>`Glob.scan` errors are swallowed, so `applyPaths` scan failures are silently ignored instead of being logged by the caller.</comment>

<file context>
@@ -78,14 +82,93 @@ export namespace SystemPrompt {
+        include: "file",
+        dot: false,
+        symlink: false,
+      }).catch(() => [] as string[])
+      if (matches.length > 0) return true
+    }
</file context>
Suggested change
}).catch(() => [] as string[])
})

Adds reference for the new auto-load mechanism to docs/docs/configure/skills.md:
- Lists the two new frontmatter fields in the Frontmatter Fields table
- New "Auto-loading skills" section explaining the lazy-load default,
  how `alwaysApply` and `applyPaths` change it, a worked example,
  a "when to use" table, and an honest section on context-size
  implications + prompt-cache amortization

Pure documentation update — no code change in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

👋 This PR was automatically closed by our quality checks.

Common reasons:

  • New GitHub account with limited contribution history
  • PR description doesn't meet our guidelines
  • Contribution appears to be AI-generated without meaningful review

If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you.

@github-actions
Copy link
Copy Markdown

👋 This PR was automatically closed by our quality checks.

Common reasons:

  • New GitHub account with limited contribution history
  • PR description doesn't meet our guidelines
  • Contribution appears to be AI-generated without meaningful review

If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/docs/configure/skills.md`:
- Around line 68-72: The fenced code block showing the <auto_loaded_skill>
element lacks a language identifier causing lint warnings; update the markdown
code fence to include a language tag (use "xml") so the block becomes ```xml ...
``` around the <auto_loaded_skill name="<skill-name>"> ... </auto_loaded_skill>
snippet to enable proper syntax highlighting and satisfy the linter.
- Around line 48-50: The example in applyPaths lists both "dbt_project.yml" and
"**/dbt_project.yml", which are redundant because a bare filename already
matches at any depth; update the docs by removing the "**/dbt_project.yml" entry
or add a short clarifying sentence explaining why both are shown (e.g., that
both patterns are equivalent and the second is optional/for explicitness).
Ensure the change references the applyPaths example and the "dbt_project.yml"
filename so readers understand the intended behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 85dd32e2-0eb5-436e-a2f1-b942c9209597

📥 Commits

Reviewing files that changed from the base of the PR and between d8a1add and 6107c3b.

📒 Files selected for processing (1)
  • docs/docs/configure/skills.md

Comment on lines +48 to +50
applyPaths:
- "dbt_project.yml" # matches if any dbt_project.yml exists in the worktree
- "**/dbt_project.yml"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Find the glob-matching logic for applyPaths to verify recursive behavior

# Search for applyPaths glob matching implementation
rg -n -C5 'applyPaths.*glob|glob.*applyPaths' --type=ts

# Look for minimatch or glob library usage in session/system context
rg -n -C3 'minimatch|micromatch|glob.*match' packages/opencode/src/session/system.ts

# Find where applyPaths is processed
ast-grep --pattern 'applyPaths'

Repository: AltimateAI/altimate-code

Length of output: 3091


🏁 Script executed:

rg -n "normalizeApplyPaths|anyMatchInWorktree" --type=ts -A 10

Repository: AltimateAI/altimate-code

Length of output: 2534


🏁 Script executed:

rg -n "import.*Glob|from.*Glob" packages/opencode/src/session/system.ts --type=ts

Repository: AltimateAI/altimate-code

Length of output: 106


🏁 Script executed:

cat -n packages/opencode/src/util/glob.ts

Repository: AltimateAI/altimate-code

Length of output: 1257


Clarify or remove the redundant glob pattern.

The example shows both "dbt_project.yml" and "**/dbt_project.yml". A bare filename already matches files at any depth in the worktree (as stated in the codebase comment: "a skill that wants dbt_project.yml catches the file no matter how deep the user's cwd is"). The second pattern is functionally identical and may confuse users. Either explain why both are shown or remove the redundant one.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/docs/configure/skills.md` around lines 48 - 50, The example in
applyPaths lists both "dbt_project.yml" and "**/dbt_project.yml", which are
redundant because a bare filename already matches at any depth; update the docs
by removing the "**/dbt_project.yml" entry or add a short clarifying sentence
explaining why both are shown (e.g., that both patterns are equivalent and the
second is optional/for explicitness). Ensure the change references the
applyPaths example and the "dbt_project.yml" filename so readers understand the
intended behavior.

Comment on lines +68 to +72
```
<auto_loaded_skill name="<skill-name>">
... full skill body ...
</auto_loaded_skill>
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifier to code block.

The code block is missing a language specifier, which prevents proper syntax highlighting and triggers linting warnings.

📝 Proposed fix
-```
+```xml
 <auto_loaded_skill name="<skill-name>">
 ... full skill body ...
 </auto_loaded_skill>

</details>

<!-- suggestion_start -->

<details>
<summary>📝 Committable suggestion</summary>

> ‼️ **IMPORTANT**
> Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 68-68: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/docs/configure/skills.md` around lines 68 - 72, The fenced code block
showing the <auto_loaded_skill> element lacks a language identifier causing lint
warnings; update the markdown code fence to include a language tag (use "xml")
so the block becomes ```xml ... ``` around the <auto_loaded_skill
name="<skill-name>"> ... </auto_loaded_skill> snippet to enable proper syntax
highlighting and satisfy the linter.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="docs/docs/configure/skills.md">

<violation number="1" location="docs/docs/configure/skills.md:49">
P3: The `applyPaths` example comment is inaccurate: `"dbt_project.yml"` does not match anywhere in the worktree, only at the root.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Comment on lines +49 to +50
- "dbt_project.yml" # matches if any dbt_project.yml exists in the worktree
- "**/dbt_project.yml"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: The applyPaths example comment is inaccurate: "dbt_project.yml" does not match anywhere in the worktree, only at the root.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/docs/configure/skills.md, line 49:

<comment>The `applyPaths` example comment is inaccurate: `"dbt_project.yml"` does not match anywhere in the worktree, only at the root.</comment>

<file context>
@@ -28,7 +28,75 @@ Focus on the query: $ARGUMENTS
+---
+name: dbt-develop
+applyPaths:
+  - "dbt_project.yml"        # matches if any dbt_project.yml exists in the worktree
+  - "**/dbt_project.yml"
+description: ...
</file context>
Suggested change
- "dbt_project.yml" # matches if any dbt_project.yml exists in the worktree
- "**/dbt_project.yml"
- "dbt_project.yml" # matches only at the worktree root
- "**/dbt_project.yml" # matches anywhere under the worktree

Tip: Review your code locally with the cubic CLI to iterate faster.

Two changes informed by trace analysis of the benchmark run with the
initial auto-load mechanism. With the auto-loaded body present in the
system prompt, 6 of 8 sampled failing trials never referenced any of
its guidance keywords (date spine, tiebreaker, deliverable, etc.) —
the model was treating the auto-loaded section as background reference
rather than binding directive. These two changes address the framing.

(1) `feat(system-prompt)`: move auto-loaded skill bodies BEFORE the
    lazy-loaded `<available_skills>` XML block in the skills section.

    Previously the order was:
      1. "Use the skill tool to load a skill..." preamble
      2. <available_skills> XML (long, descriptions only)
      3. <auto_loaded_skill> body (binding guidance)

    Now:
      1. <auto_loaded_skill> body (binding guidance — read FIRST)
      2. "Skills provide specialized instructions..." preamble
      3. <available_skills> XML (lazy-loaded skills the agent can opt into)

    Framing the auto-loaded body as "rules of the road" at the start
    rather than supplementary documentation at the end. Pure ordering
    change in `SystemPrompt.skills()` parts array — no schema or API
    change. Applies to any skill using `applyPaths` or `alwaysApply`.

    File: packages/opencode/src/session/system.ts

(2) `docs(skill)`: add a "Pre-completion checklist" section (§5) to
    dbt-develop that the agent is told to emit with `[x]/[ ]` marks
    before declaring the task done.

    Each item is a yes/no question against patterns the skill already
    documents (LEFT JOIN cardinality, date-spine completeness,
    window-rank tiebreaker, type harmonization in COALESCE/CASE/UNION,
    string-concat NULL handling, uniqueness enforcement, incremental
    high-water mark, snapshot strategy, dbt model versioning v2,
    unit-test verification).

    The forcing function: the agent must produce the checklist text in
    its final message. Unchecked items without a stated "n/a" reason
    mean the task is not done. Forces the model to slow down at the
    end and verify the patterns against the SQL it just wrote, rather
    than silently skip the verification phase.

    All items are generic dbt patterns applicable to any project — no
    benchmark-specific test names, no solution-seed values, no
    grading-rubric hints.

    File: .opencode/skills/dbt-develop/SKILL.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

👋 This PR was automatically closed by our quality checks.

Common reasons:

  • New GitHub account with limited contribution history
  • PR description doesn't meet our guidelines
  • Contribution appears to be AI-generated without meaningful review

If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you.

1 similar comment
@github-actions
Copy link
Copy Markdown

👋 This PR was automatically closed by our quality checks.

Common reasons:

  • New GitHub account with limited contribution history
  • PR description doesn't meet our guidelines
  • Contribution appears to be AI-generated without meaningful review

If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.opencode/skills/dbt-develop/SKILL.md:
- Around line 165-167: Replace the non-recursive "ls models/" check with a
recursive file-discovery command so nested model files aren't missed; update the
SKILL.md checklist to use a recursive listing (e.g., recursive ls or find)
targeting model files (reference the current "ls models/" line) and ensure the
new command filters for model file types so deliverable verification includes
files in subdirectories.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c8271a3c-2e25-4585-8a12-9658d74c1430

📥 Commits

Reviewing files that changed from the base of the PR and between 6107c3b and c647876.

📒 Files selected for processing (2)
  • .opencode/skills/dbt-develop/SKILL.md
  • packages/opencode/src/session/system.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/opencode/src/session/system.ts

Comment on lines +165 to +167
ls models/ # confirm every requested file exists
altimate-dbt info # confirm every requested model is in the project
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use recursive file discovery instead of ls models/ for deliverable verification.

ls models/ only shows top-level entries, so nested model files can be missed during checklist validation. Prefer a recursive check command.

Suggested doc tweak
-ls models/                                                   # confirm every requested file exists
+find models -type f -name "*.sql"                           # confirm every requested model file exists (including nested dirs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.opencode/skills/dbt-develop/SKILL.md around lines 165 - 167, Replace the
non-recursive "ls models/" check with a recursive file-discovery command so
nested model files aren't missed; update the SKILL.md checklist to use a
recursive listing (e.g., recursive ls or find) targeting model files (reference
the current "ls models/" line) and ensure the new command filters for model file
types so deliverable verification includes files in subdirectories.

…result

The "emit a [x]/[ ] checklist before declaring done" addition to
dbt-develop (§5, shipped two commits ago) was measured negative on
the post-A+B benchmark re-run:

  - Checklist appeared in 6 of 14 still-failing trial outputs.
  - Zero of those 6 flipped to PASS.
  - In multiple traces, the agent self-marked `[x] LEFT JOIN
    cardinality correct` while the underlying SQL still had the
    exact phantom-row bug the item warned against.

The framing trained the model to perform verification theater
rather than actually re-read its SQL. The two flips attributed
earlier to "A+B" (helixops_saas007, helixops_saas009) trace back
to the placement reorder (A) — the checklist (B) contributed
nothing measurable, and adds 50+ lines of system-prompt content
for no benefit.

This commit:

(1) Removes §5 from `.opencode/skills/dbt-develop/SKILL.md`.
    The other sections (Plan → Discover → Write → Validate,
    Common Pitfalls in Transformation Logic, Iron Rules) stay
    intact. The placement reorder in `system.ts` and the
    `applyPaths`/`alwaysApply` frontmatter mechanism stay.

(2) Adds a "What we tried that didn't work" section to
    research/kimi-k26-ade-bench-2026-05-10/findings.md so the
    negative result is preserved as institutional knowledge.
    The broader principle — "soft self-verification (model
    promises it checked X) is unreliable on this model class;
    hard verification (compile/test failures) still works" — is
    worth keeping around.

(3) Updates the findings TL;DR with both the original 81.3%
    headline and the post-second-wave 85.3% best-of-runs number,
    with the caveat that the body of the post analyzes the
    first-wave traces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

👋 This PR was automatically closed by our quality checks.

Common reasons:

  • New GitHub account with limited contribution history
  • PR description doesn't meet our guidelines
  • Contribution appears to be AI-generated without meaningful review

If you believe this was a mistake, please open an issue explaining your intended contribution and a maintainer will help you.

@dev-punia-altimate
Copy link
Copy Markdown

❌ Tests — Failures Detected

TypeScript — 15 failure(s)

  • baseline [0.49ms]
  • baseline [0.40ms]
  • baseline [0.12ms]
  • baseline [0.30ms]
  • connection_refused [0.28ms]
  • timeout [0.07ms]
  • permission_denied [0.07ms]
  • parse_error [0.05ms]
  • oom [0.06ms]
  • network_error [0.06ms]
  • auth_failure [0.05ms]
  • rate_limit [0.05ms]
  • internal_error [0.07ms]
  • empty_error [0.06ms]
  • connection_refused [0.08ms]

Next Step

Please address the failing cases above and re-run verification.

cc @anandgupta42

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants