This file is for contributors working on skills and eval harnesses.
- Run from the repo root: `graphistry-skills/`
- CLIs on `PATH`: `codex`, `claude`, `jq`
- Auth configured for each runtime (`~/.codex`, `~/.claude`)
Optional (for OTel trace capture + inspection):

- Local OTel stack running (from the sibling repo workflow): `./dc-otel`
- Health check: `./bin/otel/status`
- Skills: `.agents/skills/<skill>/SKILL.md`
- Journeys: `evals/journeys/*.json`
- Runner: `bin/agent.sh`
- Core eval engine: `scripts/agent_eval_loop.py`
- Checked-in benchmark artifacts (public-safe): `benchmarks/data/*/combined_metrics.json` and `benchmarks/reports/*`
- Release workflow (maintainers): `RELEASE.md`
- User-facing/published skills are the `pygraphistry*` set.
- Internal maintainer skills live under `.agents/skills/internal/` (for example: `.agents/skills/internal/plan`, `.agents/skills/internal/eval-otel`, `.agents/skills/internal/benchmarks`) and are flagged with `metadata.internal: true`.
- Keep internal maintainer skills out of default end-user install snippets.
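The `metadata.internal: true` flag makes this filtering mechanical. A minimal sketch, assuming the skill frontmatter has already been parsed into dicts (the skill names and metadata shape here are illustrative, not the repo's actual loader):

```python
# Sketch: keep internal maintainer skills out of end-user install snippets
# by honoring the metadata.internal flag. Parsed-frontmatter shape assumed.

def public_skills(skills: dict) -> list:
    """Return names of skills safe to show in end-user install snippets."""
    return sorted(
        name
        for name, meta in skills.items()
        if not meta.get("metadata", {}).get("internal", False)
    )

skills = {
    "pygraphistry-core": {"metadata": {}},
    "internal/plan": {"metadata": {"internal": True}},
    "internal/eval-otel": {"metadata": {"internal": True}},
}
print(public_skills(skills))  # → ['pygraphistry-core']
```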
```bash
python3 scripts/ci/validate_skills.py
```

```bash
./bin/evals/codex-skills-smoke.sh
./bin/evals/claude-skills-smoke.sh
```

```bash
./scripts/evals/setup_codex_skill_env.sh --env-dir evals/env/codex
./scripts/evals/setup_claude_skill_env.sh --env-dir evals/env/claude
```

Notes:
- Codex native skills are loaded from `.codex/skills/` under the runtime CWD.
- Claude native skills are loaded from `.claude/skills/` under the runtime CWD.
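When debugging skill delivery, it can help to confirm what is actually staged under the runtime CWD. A hypothetical helper (the directory names come from the notes above; `staged_native_skills` is not a repo function):

```python
# Sketch: list skill dirs staged under .codex/skills/ and .claude/skills/
# relative to a given working directory. Illustrative helper only.
import tempfile
from pathlib import Path

def staged_native_skills(cwd: Path) -> dict:
    """Map each runtime skills dir to the skill names found under it."""
    out = {}
    for runtime_dir in (".codex/skills", ".claude/skills"):
        root = cwd / runtime_dir
        out[runtime_dir] = (
            sorted(p.name for p in root.iterdir() if p.is_dir())
            if root.is_dir()
            else []
        )
    return out

# Demo against a throwaway directory:
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / ".codex" / "skills" / "pygraphistry-core").mkdir(parents=True)
    print(staged_native_skills(Path(tmp)))
    # → {'.codex/skills': ['pygraphistry-core'], '.claude/skills': []}
```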
Use this while editing skill text.
```bash
./bin/agent.sh \
  --codex --claude \
  --journeys pygraphistry_persona_journeys_v1 \
  --case-ids persona_novice_fraud_table_to_viz_algo,persona_connector_analyst_workflow \
  --skills-mode both \
  --skills-delivery native \
  --max-workers 2 \
  --failfast
```

Run the GFQL-specific eval suites (deterministic):
```bash
./bin/agent.sh \
  --claude \
  --journeys pygraphistry_gfql_cypher_v1,pygraphistry_gfql_let_dag_v1,pygraphistry_gfql_backward_fixes_v1,pygraphistry_gfql_row_pipeline_v1 \
  --skills-mode both \
  --skills-delivery native \
  --max-workers 2 \
  --out "$OUT"
```

Run functional execution evals (code is actually executed with pygraphistry):
```bash
# Step 1: Generate responses
./bin/agent.sh \
  --claude \
  --journeys pygraphistry_gfql_functional_v1 \
  --skills-mode on \
  --skills-delivery native \
  --max-workers 2 \
  --out "$OUT"
```
```bash
# Step 2: Execute generated code and validate results
cd ~/Work/pygraphistry && PYTHONPATH="$PWD" python3 \
  /path/to/graphistry-skills/scripts/evals/gfql_functional_check.py \
  --rows "$OUT/rows.jsonl" \
  --journey-dir /path/to/graphistry-skills/evals/journeys
```

The functional checker extracts Python code blocks from responses, runs them with pygraphistry, and validates: no exceptions, expected output strings, and correct result shapes.
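The flow the checker implements can be sketched roughly like this (a toy re-implementation; the real logic lives in `scripts/evals/gfql_functional_check.py` and differs in detail):

```python
# Sketch: extract fenced Python blocks from a model response, run them,
# and check that no exception is raised and expected strings appear in
# stdout. Toy version of the functional-check flow described above.
import contextlib
import io
import re

CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_blocks(response: str, expected: list) -> bool:
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            for block in CODE_BLOCK.findall(response):
                exec(block, {})  # any exception fails the case
    except Exception:
        return False
    out = buf.getvalue()
    return all(s in out for s in expected)  # expected output strings present

response = "Here you go:\n```python\nprint(2 + 2)\n```\n"
print(run_blocks(response, ["4"]))  # → True
```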
Default eval scoring uses the deterministic checks defined in each journey case. An oracle grading scaffold is also available:
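Conceptually, a deterministic check is a predicate over the response text. A sketch, assuming simple `must_contain` / `must_not_contain` check shapes (the actual journey JSON schema may differ):

```python
# Sketch: score a response against deterministic journey-case checks.
# Check field names are assumptions, not the real journey schema.

def deterministic_pass(response: str, case: dict) -> bool:
    has_all = all(s in response for s in case.get("must_contain", []))
    has_bad = any(s in response for s in case.get("must_not_contain", []))
    return has_all and not has_bad

case = {"must_contain": ["g.plot("], "must_not_contain": ["networkx"]}
answer = "g = graphistry.edges(df, 'src', 'dst')\ng.plot()"
print(deterministic_pass(answer, case))  # → True
```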
```bash
OUT="/tmp/graphistry_skills_oracle_smoke_$(date +%Y%m%d-%H%M%S)"
./bin/agent.sh \
  --codex \
  --journeys runtime_smoke \
  --case-ids echo_token \
  --skills-mode off \
  --grading oracle \
  --oracle-harness codex \
  --out "$OUT"
```

Hybrid grading (deterministic + oracle conjunction) is enabled via `--grading hybrid`.
```bash
OUT="/tmp/graphistry_skills_persona_$(date +%Y%m%d-%H%M%S)"
./bin/agent.sh \
  --codex --claude \
  --journeys pygraphistry_persona_journeys_v1 \
  --skills-mode both \
  --skills-delivery native \
  --max-workers 2 \
  --failfast \
  --out "$OUT"
```

```bash
OUT="/tmp/graphistry_skills_codex_model_matrix_$(date +%Y%m%d-%H%M%S)"
./bin/agent.sh \
  --codex \
  --journeys pygraphistry_persona_journeys_v1 \
  --case-ids persona_novice_fraud_table_to_viz_algo,persona_advanced_coloring_with_gfql_slices,persona_connector_analyst_workflow \
  --skills-mode both \
  --skills-delivery native \
  --codex-models gpt-5,gpt-5-codex,gpt-5.3-codex,gpt-5.3-codex-spark \
  --max-workers 2 \
  --failfast \
  --out "$OUT"
```

```bash
OUT="/tmp/graphistry_skills_claude_model_matrix_$(date +%Y%m%d-%H%M%S)"
./bin/agent.sh \
  --claude \
  --journeys pygraphistry_persona_journeys_v1 \
  --case-ids persona_novice_fraud_table_to_viz_algo,persona_advanced_coloring_with_gfql_slices,persona_connector_analyst_workflow \
  --skills-mode both \
  --skills-delivery native \
  --claude-models sonnet,opus \
  --max-workers 2 \
  --failfast \
  --out "$OUT"
```

```bash
OUT="/tmp/graphistry_skills_cross_runtime_matrix_$(date +%Y%m%d-%H%M%S)"
./bin/agent.sh \
  --codex --claude \
  --journeys pygraphistry_persona_journeys_v1 \
  --case-ids persona_novice_fraud_table_to_viz_algo,persona_advanced_coloring_with_gfql_slices,persona_connector_analyst_workflow \
  --skills-mode both \
  --skills-delivery native \
  --codex-models gpt-5,gpt-5-codex,gpt-5.3-codex,gpt-5.3-codex-spark \
  --claude-models sonnet,opus \
  --max-workers 2 \
  --failfast \
  --out "$OUT"
```

Run with the same matrix and only change `AGENT_CODEX_REASONING_EFFORT`:
```bash
# high
OUT="/tmp/graphistry_skills_codex_full_effort_high_$(date +%Y%m%d-%H%M%S)"
AGENT_EVAL_NATIVE_SKILLS_MOUNT_MODE=copy \
AGENT_EVAL_NATIVE_DOCS_MODE=web-only \
AGENT_CODEX_REASONING_EFFORT=high \
./bin/agent.sh \
  --codex \
  --journeys all \
  --skills-mode both \
  --skills-profile pygraphistry_core \
  --skills-delivery native \
  --codex-models gpt-5.3-codex \
  --timeout-s 240 \
  --max-workers 2 \
  --out "$OUT"
```

```bash
# medium
OUT="/tmp/graphistry_skills_codex_full_effort_medium_$(date +%Y%m%d-%H%M%S)"
AGENT_EVAL_NATIVE_SKILLS_MOUNT_MODE=copy \
AGENT_EVAL_NATIVE_DOCS_MODE=web-only \
AGENT_CODEX_REASONING_EFFORT=medium \
./bin/agent.sh \
  --codex \
  --journeys all \
  --skills-mode both \
  --skills-profile pygraphistry_core \
  --skills-delivery native \
  --codex-models gpt-5.3-codex \
  --timeout-s 240 \
  --max-workers 2 \
  --out "$OUT"
```

Enable OTel on any sweep:
```bash
OUT="/tmp/graphistry_skills_otel_$(date +%Y%m%d-%H%M%S)"
./bin/agent.sh \
  --codex --claude \
  --journeys pygraphistry_persona_journeys_v1 \
  --skills-mode both \
  --skills-delivery native \
  --max-workers 2 \
  --failfast \
  --otel \
  --out "$OUT"
```

Verify traces were recorded:

```bash
TRACE_ID="$(tail -n 1 "$OUT/rows.jsonl" | jq -r '.trace_id')"
./bin/otel/cmds/trace2tree "$TRACE_ID"
```

Notes:

- `rows.jsonl` stores a per-row `trace_id` and `traceparent`.
- `otel_ids.json` and `report.md` are emitted when run finalization completes.
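Since `rows.jsonl` is plain JSON Lines, the same last-row lookup the `tail`/`jq` pipeline performs can be done in a few lines of Python (the sample rows below are fabricated for illustration):

```python
# Sketch: read the newest row from a rows.jsonl payload and return its
# trace_id, mirroring `tail -n 1 ... | jq -r '.trace_id'`. Sample data only.
import json

def last_trace_id(jsonl_text: str) -> str:
    last_row = json.loads(jsonl_text.strip().splitlines()[-1])
    return last_row["trace_id"]

rows = "\n".join(json.dumps(r) for r in [
    {"case_id": "persona_novice", "trace_id": "0af7651916cd43dd8448eb211c80319c"},
    {"case_id": "persona_connector", "trace_id": "1bf7651916cd43dd8448eb211c80319d"},
])
print(last_trace_id(rows))  # → 1bf7651916cd43dd8448eb211c80319d
```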
All rows:

```bash
jq -r '[.case_id,.harness,.model,.skills_enabled,.pass_bool,.latency_ms] | @tsv' "$OUT/rows.jsonl" | column -t
```

Summary:

```bash
jq -s '
{
  total: length,
  pass: (map(select(.pass_bool)) | length),
  off_total: (map(select(.skills_enabled|not)) | length),
  off_pass: (map(select((.skills_enabled|not) and .pass_bool)) | length),
  on_total: (map(select(.skills_enabled)) | length),
  on_pass: (map(select(.skills_enabled and .pass_bool)) | length)
}' "$OUT/rows.jsonl"
```

By model:

```bash
jq -s '
group_by(.model) |
map({
  model: .[0].model,
  off: ((map(select((.skills_enabled|not) and .pass_bool)) | length|tostring) + "/" + (map(select(.skills_enabled|not)) | length|tostring)),
  on: ((map(select(.skills_enabled and .pass_bool)) | length|tostring) + "/" + (map(select(.skills_enabled)) | length|tostring))
})' "$OUT/rows.jsonl"
```

Create a markdown + JSON report from one run:
```bash
python3 scripts/benchmarks/make_report.py \
  --rows "$OUT/rows.jsonl" \
  --title "Graphistry Skills Eval Report" \
  --out-md "$OUT/report.md" \
  --out-json "$OUT/report.json"
```

Create one combined report from multiple run outputs:

```bash
python3 scripts/benchmarks/make_report.py \
  --rows /tmp/graphistry_skills_persona_YYYYMMDD-HHMMSS/rows.jsonl \
  --rows /tmp/graphistry_skills_codex_model_matrix_YYYYMMDD-HHMMSS/rows.jsonl \
  --rows /tmp/graphistry_skills_claude_model_matrix_YYYYMMDD-HHMMSS/rows.jsonl \
  --title "Graphistry Skills Combined Report" \
  --out-md benchmarks/reports/$(date +%Y-%m-%d)-local-sweep.md \
  --out-json benchmarks/data/$(date +%Y-%m-%d)-local-sweep/combined_metrics.json
```

Run a coverage scan across all journeys:
```bash
python3 scripts/benchmarks/scenario_coverage_audit.py \
  --journey-dir evals/journeys \
  --out-md benchmarks/reports/$(date +%Y-%m-%d)-scenario-coverage.md \
  --out-json benchmarks/data/$(date +%Y-%m-%d)-scenario-coverage.json
```

Use this before adding new journeys to close zero-bucket or severely imbalanced dimensions.
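The zero-bucket idea can be illustrated with a small sketch: bucket cases along one dimension and flag buckets that are empty or underrepresented. The `persona` dimension, case records, and threshold below are assumptions, not the audit script's actual logic:

```python
# Sketch: find coverage gaps along one journey dimension. A bucket is a
# gap if it has no cases (zero bucket) or holds less than min_share of
# the total (severe imbalance). Illustrative only.
from collections import Counter

def coverage_gaps(cases, dimension, expected_values, min_share=0.1):
    counts = Counter(c.get(dimension) for c in cases)
    total = sum(counts.values()) or 1
    return sorted(
        v for v in expected_values
        if counts[v] == 0 or counts[v] / total < min_share
    )

cases = [{"persona": "novice"}] * 8 + [{"persona": "advanced"}]
print(coverage_gaps(cases, "persona", ["novice", "advanced", "connector"]))
# → ['connector']
```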
For Codex native-mode evals, `agent_eval_loop.py` creates mode-scoped native env dirs and a mode-scoped `CODEX_HOME` under each run directory. This avoids baseline contamination from globally installed skills when comparing `--skills-mode off` vs `on`.
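The isolation scheme can be pictured as a per-mode directory layout; the exact paths below are an assumption, not the ones `agent_eval_loop.py` actually creates:

```python
# Sketch: give each skills mode its own env dir and CODEX_HOME under the
# run directory, so an "off" baseline cannot see globally installed
# skills. Path layout is illustrative.
from pathlib import Path

def mode_scoped_paths(run_dir: Path, skills_mode: str) -> dict:
    scope = run_dir / f"native-{skills_mode}"
    return {
        "env_dir": scope / "env",            # native skills env for this mode
        "codex_home": scope / "codex-home",  # exported as CODEX_HOME
    }

paths = mode_scoped_paths(Path("/tmp/run"), "off")
print(paths["codex_home"])  # → /tmp/run/native-off/codex-home
```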
For docs-mode control in native mode, use strict mount mode:

```bash
AGENT_EVAL_NATIVE_SKILLS_MOUNT_MODE=copy \
AGENT_EVAL_NATIVE_DOCS_MODE=toc \
./scripts/agent_eval_loop.py ...
```

and rerun with `AGENT_EVAL_NATIVE_DOCS_MODE=web-only`.

Why:

- `copy` mode isolates skill files under the run env.
- TOC mode in `copy` removes the local docs mirror subtree if present.
- `manifest.json` now records `native_skills_mount_mode`, `native_docs_mode`, and `native_docs_ref`.
Note:
- Local docs mirror tooling is intentionally not part of the default shipped workflow in this repo revision.
- Reintroduce mirror experiments only in a dedicated branch with clear KPI evidence.
Suggested process:

- Run clean sweeps into `/tmp/...`.
- Keep raw run artifacts local/private (`rows.jsonl`, `manifest.json`, traces, logs).
- Generate public-safe outputs:

  ```bash
  python3 scripts/benchmarks/make_report.py \
    --public-safe \
    --rows /tmp/<run>/rows.jsonl \
    --title "<report title>" \
    --out-md benchmarks/reports/<date-tag>.md \
    --out-json benchmarks/data/<date-tag>/combined_metrics.json
  ```

- Generate a README snippet:

  ```bash
  python3 scripts/benchmarks/readme_snippet.py \
    --rows "/tmp/<run>/rows.jsonl" \
    --title "Fresh eval sweep with isolated baseline"
  ```

- Update the `README.md` Evals section with the generated snippet.
- Update `benchmarks/README.md` with the new pack reference.
- Check in the sanitized report, `combined_metrics.json`, and README updates.
For semver bump + publish steps (release branch, changelog cut, PR merge, tag, GitHub release), use:

- `RELEASE.md` (maintainer release guide)
- `.agents/skills/internal/release/SKILL.md` (internal release skill)