eval harness: increase isolation to prevent cross-case and memory contamination #17

@lmeyerov

Description

Summary

The eval harness already has solid per-case process isolation (a fresh claude -p process per case, mode-scoped workdirs, an isolated CODEX_HOME), but gaps remain that could contaminate results — especially the baseline (skills-off) measurements.

Current isolation

| Layer | Isolated? | How |
| --- | --- | --- |
| Conversation history | Yes | Each case is a fresh claude -p process |
| Skills files | Yes | Mode-scoped native_env/ dirs; OFF uses /tmp with no repo paths |
| Codex HOME | Yes | Per-run + per-worker cloned CODEX_HOME |
| Working directory | Partial | Mode-scoped, but shared across cases within a mode |

Contamination vectors

1. Claude persistent memory (HIGH)

~/.claude/memory/ and ~/.claude/projects/ are accessible to all cases. If a user has worked with graphistry-skills before, the model may recall skill content from memory even in skills-off mode.

Fix: Set CLAUDE_MEMORY_DIR (or equivalent) to a temp dir for eval runs, or use --memory-dir flag if available. Alternatively, run with a clean CLAUDE_HOME.
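A minimal sketch of the clean-HOME fallback, assuming a CLAUDE_MEMORY_DIR-style option may not exist: point the child process's HOME at a throwaway temp dir so ~/.claude/memory/ and ~/.claude/projects/ are simply absent. The helper names (build_clean_env, run_case) are hypothetical, not harness APIs.

```python
import os
import subprocess
import tempfile

def build_clean_env(base=None):
    """Copy the environment but repoint HOME at a fresh temp dir, so the
    child process sees an empty ~/.claude (no memory/ or projects/)."""
    env = dict(os.environ if base is None else base)
    env["HOME"] = tempfile.mkdtemp(prefix="eval_home_")
    return env

def run_case(prompt, workdir):
    # Fresh `claude -p` per case, as the harness already does, but with a
    # throwaway HOME so persistent memory cannot leak into any mode.
    return subprocess.run(
        ["claude", "-p", prompt],
        cwd=workdir,
        env=build_clean_env(),
        capture_output=True,
        text=True,
    )
```

Note this also hides legitimate user config (API credentials, settings), so those would need to be re-injected explicitly for the run.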

2. Shared workdir within a mode (MEDIUM)

Cases within the same mode (e.g., all skills-on cases) share a native_env/ workdir. If a case uses tools that write files (e.g., creates a Python file), subsequent cases could discover those files.

Fix: Use per-case temp workdirs, or clean the workdir between cases.

3. CLAUDE.md / project files (MEDIUM)

If the harness workdir or any parent has a CLAUDE.md or .claude/ directory, those instructions load into every case — including skills-off baseline cases. This could inject skill-like guidance into baselines.

Fix: For skills-off mode, ensure the workdir path has no CLAUDE.md in any parent directory. The current /tmp/baseline_* approach helps but should be verified.

4. Environment variables (LOW)

os.environ is copied to child processes. If env vars contain graphistry-related hints (e.g., GRAPHISTRY_SERVER), they're visible to all cases regardless of skill mode.

Fix: Scrub non-essential env vars for baseline runs, or use an explicit allowlist.
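The allowlist approach can be a one-line filter over the parent environment. The specific allowlist below is an assumption of a minimal working set, to be extended as the harness actually requires:

```python
# Assumed-minimal allowlist; extend with whatever the harness truly needs.
ENV_ALLOWLIST = {"PATH", "HOME", "LANG", "TERM", "TMPDIR", "SHELL"}

def scrubbed_env(base):
    """Keep only allowlisted vars so graphistry-related hints
    (e.g. GRAPHISTRY_SERVER) never reach baseline cases."""
    return {k: v for k, v in base.items() if k in ENV_ALLOWLIST}
```

An allowlist is safer than a denylist here: a denylist has to anticipate every hint-bearing variable name, while an allowlist fails closed.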

5. Model training data (NOT FIXABLE)

The model may have seen PyGraphistry/GFQL patterns in training data. This is the fundamental baseline contamination risk — skills-off results reflect model knowledge, not zero knowledge.

Mitigation: Document this clearly in benchmark reports. Consider novel/obscure API patterns in eval cases to reduce training data overlap.

Proposed priority

  1. Claude memory isolation — highest impact, most likely to contaminate baselines
  2. Per-case workdir cleanup — prevents filesystem side-channel between cases
  3. CLAUDE.md verification — ensure no project-level instructions leak into baselines
  4. Env var scrubbing — low priority but easy to add
