Development Guide

This file is for contributors working on skills and eval harnesses.

Prerequisites

Run from repo root: graphistry-skills/
CLIs on PATH: codex, claude, jq
Auth configured for each runtime (~/.codex, ~/.claude)

Optional (for OTel trace capture + inspection):

Local OTel stack running (from sibling repo workflow): ./dc-otel
Health check: ./bin/otel/status

Repo Map

Skills: .agents/skills/<skill>/SKILL.md
Journeys: evals/journeys/*.json
Runner: bin/agent.sh
Core eval engine: scripts/agent_eval_loop.py
Checked-in benchmark artifacts (public-safe): benchmarks/data/*/combined_metrics.json and benchmarks/reports/*
Release workflow (maintainers): RELEASE.md

Skill Scope Conventions

User-facing/published skills are the pygraphistry* set.
Internal maintainer skills live under .agents/skills/internal/ (for example: .agents/skills/internal/plan, .agents/skills/internal/eval-otel, .agents/skills/internal/benchmarks) and are flagged with metadata.internal: true.
Keep internal maintainer skills out of default end-user install snippets.

Validate Skills Before Sweeps

python3 scripts/ci/validate_skills.py
./bin/evals/codex-skills-smoke.sh
./bin/evals/claude-skills-smoke.sh

Native Skill Env Setup (Manual)

./scripts/evals/setup_codex_skill_env.sh --env-dir evals/env/codex
./scripts/evals/setup_claude_skill_env.sh --env-dir evals/env/claude

Notes:

Codex native skills are loaded from .codex/skills/ under runtime CWD.
Claude native skills are loaded from .claude/skills/ under runtime CWD.

Fast Iteration (Subset Cases)

Use this while editing skill text.

./bin/agent.sh \
  --codex --claude \
  --journeys pygraphistry_persona_journeys_v1 \
  --case-ids persona_novice_fraud_table_to_viz_algo,persona_connector_analyst_workflow \
  --skills-mode both \
  --skills-delivery native \
  --max-workers 2 \
  --failfast

GFQL Expansion Evals

Run the GFQL-specific eval suites (deterministic):

./bin/agent.sh \
  --claude \
  --journeys pygraphistry_gfql_cypher_v1,pygraphistry_gfql_let_dag_v1,pygraphistry_gfql_backward_fixes_v1,pygraphistry_gfql_row_pipeline_v1 \
  --skills-mode both \
  --skills-delivery native \
  --max-workers 2 \
  --out "$OUT"

Run functional execution evals (code is actually executed with pygraphistry):

# Step 1: Generate responses
./bin/agent.sh \
  --claude \
  --journeys pygraphistry_gfql_functional_v1 \
  --skills-mode on \
  --skills-delivery native \
  --max-workers 2 \
  --out "$OUT"

# Step 2: Execute generated code and validate results
cd ~/Work/pygraphistry && PYTHONPATH="$PWD" python3 \
  /path/to/graphistry-skills/scripts/evals/gfql_functional_check.py \
  --rows "$OUT/rows.jsonl" \
  --journey-dir /path/to/graphistry-skills/evals/journeys

The functional checker extracts Python code blocks from responses, runs them with pygraphistry, and validates: no exceptions, expected output strings, correct result shapes.

Grading Modes (Deterministic / Oracle / Hybrid)

Default eval scoring is deterministic checks from each journey case.

Oracle grading scaffold is also available:

OUT="/tmp/graphistry_skills_oracle_smoke_$(date +%Y%m%d-%H%M%S)"
./bin/agent.sh \
  --codex \
  --journeys runtime_smoke \
  --case-ids echo_token \
  --skills-mode off \
  --grading oracle \
  --oracle-harness codex \
  --out "$OUT"

Hybrid grading (deterministic + oracle conjunction) is enabled via --grading hybrid.

Sweep Recipes

1) Persona sweep, baseline vs skills (codex + claude)

OUT="/tmp/graphistry_skills_persona_$(date +%Y%m%d-%H%M%S)"
./bin/agent.sh \
  --codex --claude \
  --journeys pygraphistry_persona_journeys_v1 \
  --skills-mode both \
  --skills-delivery native \
  --max-workers 2 \
  --failfast \
  --out "$OUT"

2) Model matrix sweep (Codex family)

OUT="/tmp/graphistry_skills_codex_model_matrix_$(date +%Y%m%d-%H%M%S)"
./bin/agent.sh \
  --codex \
  --journeys pygraphistry_persona_journeys_v1 \
  --case-ids persona_novice_fraud_table_to_viz_algo,persona_advanced_coloring_with_gfql_slices,persona_connector_analyst_workflow \
  --skills-mode both \
  --skills-delivery native \
  --codex-models gpt-5,gpt-5-codex,gpt-5.3-codex,gpt-5.3-codex-spark \
  --max-workers 2 \
  --failfast \
  --out "$OUT"

3) Model matrix sweep (Claude tiers)

OUT="/tmp/graphistry_skills_claude_model_matrix_$(date +%Y%m%d-%H%M%S)"
./bin/agent.sh \
  --claude \
  --journeys pygraphistry_persona_journeys_v1 \
  --case-ids persona_novice_fraud_table_to_viz_algo,persona_advanced_coloring_with_gfql_slices,persona_connector_analyst_workflow \
  --skills-mode both \
  --skills-delivery native \
  --claude-models sonnet,opus \
  --max-workers 2 \
  --failfast \
  --out "$OUT"

4) Combined cross-runtime matrix (Codex + Claude)

OUT="/tmp/graphistry_skills_cross_runtime_matrix_$(date +%Y%m%d-%H%M%S)"
./bin/agent.sh \
  --codex --claude \
  --journeys pygraphistry_persona_journeys_v1 \
  --case-ids persona_novice_fraud_table_to_viz_algo,persona_advanced_coloring_with_gfql_slices,persona_connector_analyst_workflow \
  --skills-mode both \
  --skills-delivery native \
  --codex-models gpt-5,gpt-5-codex,gpt-5.3-codex,gpt-5.3-codex-spark \
  --claude-models sonnet,opus \
  --max-workers 2 \
  --failfast \
  --out "$OUT"

5) Codex effort A/B (high vs medium, full journeys)

Run with the same matrix and only change AGENT_CODEX_REASONING_EFFORT.

# high
OUT="/tmp/graphistry_skills_codex_full_effort_high_$(date +%Y%m%d-%H%M%S)"
AGENT_EVAL_NATIVE_SKILLS_MOUNT_MODE=copy \
AGENT_EVAL_NATIVE_DOCS_MODE=web-only \
AGENT_CODEX_REASONING_EFFORT=high \
./bin/agent.sh \
  --codex \
  --journeys all \
  --skills-mode both \
  --skills-profile pygraphistry_core \
  --skills-delivery native \
  --codex-models gpt-5.3-codex \
  --timeout-s 240 \
  --max-workers 2 \
  --out "$OUT"

# medium
OUT="/tmp/graphistry_skills_codex_full_effort_medium_$(date +%Y%m%d-%H%M%S)"
AGENT_EVAL_NATIVE_SKILLS_MOUNT_MODE=copy \
AGENT_EVAL_NATIVE_DOCS_MODE=web-only \
AGENT_CODEX_REASONING_EFFORT=medium \
./bin/agent.sh \
  --codex \
  --journeys all \
  --skills-mode both \
  --skills-profile pygraphistry_core \
  --skills-delivery native \
  --codex-models gpt-5.3-codex \
  --timeout-s 240 \
  --max-workers 2 \
  --out "$OUT"

OTel-Enabled Sweeps

Enable OTel on any sweep:

OUT="/tmp/graphistry_skills_otel_$(date +%Y%m%d-%H%M%S)"
./bin/agent.sh \
  --codex --claude \
  --journeys pygraphistry_persona_journeys_v1 \
  --skills-mode both \
  --skills-delivery native \
  --max-workers 2 \
  --failfast \
  --otel \
  --out "$OUT"

Verify traces were recorded:

TRACE_ID="$(tail -n 1 "$OUT/rows.jsonl" | jq -r '.trace_id')"
./bin/otel/cmds/trace2tree "$TRACE_ID"

Notes:

rows.jsonl stores per-row trace_id and traceparent.
otel_ids.json and report.md are emitted when run finalization completes.

Minimal Result Slices

All rows:

jq -r '[.case_id,.harness,.model,.skills_enabled,.pass_bool,.latency_ms] | @tsv' "$OUT/rows.jsonl" | column -t

Summary:

jq -s '
  {
    total: length,
    pass: (map(select(.pass_bool)) | length),
    off_total: (map(select(.skills_enabled|not)) | length),
    off_pass: (map(select((.skills_enabled|not) and .pass_bool)) | length),
    on_total: (map(select(.skills_enabled)) | length),
    on_pass: (map(select(.skills_enabled and .pass_bool)) | length)
  }' "$OUT/rows.jsonl"

By model:

jq -s '
  group_by(.model) |
  map({
    model: .[0].model,
    off: ((map(select((.skills_enabled|not) and .pass_bool)) | length|tostring) + "/" + (map(select(.skills_enabled|not)) | length|tostring)),
    on: ((map(select(.skills_enabled and .pass_bool)) | length|tostring) + "/" + (map(select(.skills_enabled)) | length|tostring))
  })' "$OUT/rows.jsonl"

Generate a Report

Create a markdown + JSON report from one run:

python3 scripts/benchmarks/make_report.py \
  --rows "$OUT/rows.jsonl" \
  --title "Graphistry Skills Eval Report" \
  --out-md "$OUT/report.md" \
  --out-json "$OUT/report.json"

Create one combined report from multiple run outputs:

python3 scripts/benchmarks/make_report.py \
  --rows /tmp/graphistry_skills_persona_YYYYMMDD-HHMMSS/rows.jsonl \
  --rows /tmp/graphistry_skills_codex_model_matrix_YYYYMMDD-HHMMSS/rows.jsonl \
  --rows /tmp/graphistry_skills_claude_model_matrix_YYYYMMDD-HHMMSS/rows.jsonl \
  --title "Graphistry Skills Combined Report" \
  --out-md benchmarks/reports/$(date +%Y-%m-%d)-local-sweep.md \
  --out-json benchmarks/data/$(date +%Y-%m-%d)-local-sweep/combined_metrics.json

Coverage Audit (Scenario Breadth)

Run a coverage scan across all journeys:

python3 scripts/benchmarks/scenario_coverage_audit.py \
  --journey-dir evals/journeys \
  --out-md benchmarks/reports/$(date +%Y-%m-%d)-scenario-coverage.md \
  --out-json benchmarks/data/$(date +%Y-%m-%d)-scenario-coverage.json

Use this before adding new journeys to close zero-bucket or severely imbalanced dimensions.

Baseline Isolation Notes

For Codex native-mode evals, agent_eval_loop.py creates mode-scoped native env dirs and mode-scoped CODEX_HOME under each run directory. This avoids baseline contamination from globally installed skills when comparing skills-mode off vs on.

For docs-mode control in native mode, use strict mount mode:

AGENT_EVAL_NATIVE_SKILLS_MOUNT_MODE=copy \
AGENT_EVAL_NATIVE_DOCS_MODE=toc \
./scripts/agent_eval_loop.py ...

and rerun with AGENT_EVAL_NATIVE_DOCS_MODE=web-only.

Why:

copy mode isolates skill files under the run env.
TOC mode in copy removes local docs mirror subtree if present.
manifest.json now records native_skills_mount_mode, native_docs_mode, and native_docs_ref.

Note:

Local docs mirror tooling is intentionally not part of the default shipped workflow in this repo revision.
Reintroduce mirror experiments only in a dedicated branch with clear KPI evidence.

Publishing Benchmarks

Suggested process:

Run clean sweeps into /tmp/....
Keep raw run artifacts local/private (rows.jsonl, manifest.json, traces, logs).
Generate public-safe outputs:

python3 scripts/benchmarks/make_report.py \
  --public-safe \
  --rows /tmp/<run>/rows.jsonl \
  --title "<report title>" \
  --out-md benchmarks/reports/<date-tag>.md \
  --out-json benchmarks/data/<date-tag>/combined_metrics.json

Generate README snippet:

python3 scripts/benchmarks/readme_snippet.py \
  --rows "/tmp/<run>/rows.jsonl" \
  --title "Fresh eval sweep with isolated baseline"

Update README.md Evals section with the generated snippet.
Update benchmarks/README.md with the new pack reference.
Check in sanitized report, combined_metrics.json, and README updates.

Releasing Versions

For semver bump + publish steps (release branch, changelog cut, PR merge, tag, GitHub release), use:

RELEASE.md (maintainer release guide)
.agents/skills/internal/release/SKILL.md (internal release skill)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Development Guide

Prerequisites

Repo Map

Skill Scope Conventions

Validate Skills Before Sweeps

Native Skill Env Setup (Manual)

Fast Iteration (Subset Cases)

GFQL Expansion Evals

Grading Modes (Deterministic / Oracle / Hybrid)

Sweep Recipes

1) Persona sweep, baseline vs skills (codex + claude)

2) Model matrix sweep (Codex family)

3) Model matrix sweep (Claude tiers)

4) Combined cross-runtime matrix (Codex + Claude)

5) Codex effort A/B (high vs medium, full journeys)

OTel-Enabled Sweeps

Minimal Result Slices

Generate a Report

Coverage Audit (Scenario Breadth)

Baseline Isolation Notes

Publishing Benchmarks

Releasing Versions

FilesExpand file tree

DEVELOP.md

Latest commit

History

DEVELOP.md

File metadata and controls

Development Guide

Prerequisites

Repo Map

Skill Scope Conventions

Validate Skills Before Sweeps

Native Skill Env Setup (Manual)

Fast Iteration (Subset Cases)

GFQL Expansion Evals

Grading Modes (Deterministic / Oracle / Hybrid)

Sweep Recipes

1) Persona sweep, baseline vs skills (codex + claude)

2) Model matrix sweep (Codex family)

3) Model matrix sweep (Claude tiers)

4) Combined cross-runtime matrix (Codex + Claude)

5) Codex effort A/B (high vs medium, full journeys)

OTel-Enabled Sweeps

Minimal Result Slices

Generate a Report

Coverage Audit (Scenario Breadth)

Baseline Isolation Notes

Publishing Benchmarks

Releasing Versions