eval harness: increase isolation to prevent cross-case and memory contamination #17
Summary
The eval harness provides solid per-case process isolation (fresh claude -p per case, mode-scoped workdirs, isolated CODEX_HOME) but leaves gaps that could contaminate results — especially baseline (skills-off) measurements.
Current isolation
| Layer | Isolated? | How |
|---|---|---|
| Conversation history | Yes | Each case is a fresh claude -p process |
| Skills files | Yes | Mode-scoped native_env/ dirs; OFF uses /tmp with no repo paths |
| Codex HOME | Yes | Per-run + per-worker cloned CODEX_HOME |
| Working directory | Partial | Mode-scoped, but shared across cases within a mode |
Contamination vectors
1. Claude persistent memory (HIGH)
~/.claude/memory/ and ~/.claude/projects/ are accessible to all cases. If a user has worked with graphistry-skills before, the model may recall skill content from memory even in skills-off mode.
Fix: Set CLAUDE_MEMORY_DIR (or equivalent) to a temp dir for eval runs, or use --memory-dir flag if available. Alternatively, run with a clean CLAUDE_HOME.
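A minimal sketch of the fallback approach, assuming the harness spawns claude via subprocess: point the child's HOME at an empty temp dir so ~/.claude/memory/ and ~/.claude/projects/ are simply not there. (A dedicated override like CLAUDE_MEMORY_DIR, if the CLI supports one, would be more surgical; this is the blunt version.)

```python
import os
import subprocess
import tempfile

def isolated_env() -> dict:
    """Copy the parent env but point HOME at an empty temp dir so the
    child claude process cannot see ~/.claude/memory/ or ~/.claude/projects/."""
    clean_home = tempfile.mkdtemp(prefix="eval_home_")
    env = dict(os.environ)
    env["HOME"] = clean_home
    return env

# Usage (sketch): spawn each per-case process with the scrubbed env, e.g.
#   subprocess.run(["claude", "-p", prompt], env=isolated_env(), ...)
```

Trade-off: overriding HOME also hides legitimate user config (auth tokens, settings), so the harness would need to copy whatever the CLI genuinely requires into the temp HOME — the same pattern already used for the cloned CODEX_HOME.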
2. Shared workdir within a mode (MEDIUM)
Cases within the same mode (e.g., all skills-on cases) share a native_env/ workdir. If a case uses tools that write files (e.g., creates a Python file), subsequent cases could discover those files.
Fix: Use per-case temp workdirs, or clean the workdir between cases.
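The per-case temp workdir option could look like this — a context manager (hypothetical helper, not in the harness today) that creates a fresh directory per case and deletes it afterwards, so no file one case writes survives into the next:

```python
import os
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def case_workdir(prefix: str = "eval_case_"):
    """Yield a fresh, empty workdir for one eval case and remove it
    afterwards, closing the filesystem side-channel between cases."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)

# Usage (sketch): run the case with cwd set to the throwaway dir, e.g.
#   with case_workdir() as wd:
#       subprocess.run(["claude", "-p", prompt], cwd=wd, ...)
```

Cleaning a shared workdir between cases is cheaper but riskier: a missed glob leaves residue, whereas a deleted directory cannot leak.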
3. CLAUDE.md / project files (MEDIUM)
If the harness workdir or any parent has a CLAUDE.md or .claude/ directory, those instructions load into every case — including skills-off baseline cases. This could inject skill-like guidance into baselines.
Fix: For skills-off mode, ensure the workdir path has no CLAUDE.md in any parent directory. The current /tmp/baseline_* approach helps but should be verified.
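The verification step could be a fail-fast check run before each baseline case — a sketch (hypothetical helper) that walks the workdir's ancestors and aborts if any carries project-level instructions:

```python
import tempfile
from pathlib import Path

def assert_no_claude_md(workdir) -> None:
    """Raise if workdir or any ancestor contains CLAUDE.md or .claude/,
    which would inject project instructions into a baseline case."""
    root = Path(workdir).resolve()
    for parent in (root, *root.parents):
        for name in ("CLAUDE.md", ".claude"):
            candidate = parent / name
            if candidate.exists():
                raise RuntimeError(f"baseline contamination risk: {candidate}")
```

Running this at harness startup for skills-off workdirs turns a silent contamination into a loud failure.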
4. Environment variables (LOW)
os.environ is copied to child processes. If env vars contain graphistry-related hints (e.g., GRAPHISTRY_SERVER), they're visible to all cases regardless of skill mode.
Fix: Scrub non-essential env vars for baseline runs, or use an explicit allowlist.
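The allowlist variant is a few lines — a sketch with a placeholder allowlist (the real harness would tune the set), building the child env from scratch instead of copying os.environ wholesale:

```python
import os

# Hypothetical allowlist; the real harness would decide what is essential.
SAFE_ENV = {"PATH", "HOME", "LANG", "TERM", "TMPDIR"}

def scrubbed_env(allow=SAFE_ENV) -> dict:
    """Build a child env from an explicit allowlist so hints like
    GRAPHISTRY_SERVER never reach baseline cases."""
    return {k: v for k, v in os.environ.items() if k in allow}

# Usage (sketch): pass to the per-case subprocess, e.g.
#   subprocess.run(["claude", "-p", prompt], env=scrubbed_env(), ...)
```

An allowlist is safer than a denylist here: a denylist has to anticipate every graphistry-related name, while an allowlist fails closed.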
5. Model training data (NOT FIXABLE)
The model may have seen PyGraphistry/GFQL patterns in training data. This is the fundamental baseline contamination risk — skills-off results reflect model knowledge, not zero knowledge.
Mitigation: Document this clearly in benchmark reports. Consider novel/obscure API patterns in eval cases to reduce training data overlap.
Proposed priority
- Claude memory isolation — highest impact, most likely to contaminate baselines
- Per-case workdir cleanup — prevents filesystem side-channel between cases
- CLAUDE.md verification — ensure no project-level instructions leak into baselines
- Env var scrubbing — low priority but easy to add