eval harness: increase isolation to prevent cross-case and memory contamination #17

@lmeyerov

Description

Summary

The eval harness already has solid per-case process isolation (a fresh claude -p process per case, mode-scoped workdirs, an isolated CODEX_HOME), but gaps remain that could contaminate results — especially the baseline (skills-off) measurements.

Current isolation

| Layer | Isolated? | How |
| --- | --- | --- |
| Conversation history | Yes | Each case is a fresh claude -p process |
| Skills files | Yes | Mode-scoped native_env/ dirs; OFF uses /tmp with no repo paths |
| Codex HOME | Yes | Per-run + per-worker cloned CODEX_HOME |
| Working directory | Partial | Mode-scoped, but shared across cases within a mode |

Contamination vectors

1. Claude persistent memory (HIGH)

~/.claude/memory/ and ~/.claude/projects/ are accessible to all cases. If a user has worked with graphistry-skills before, the model may recall skill content from memory even in skills-off mode.

Fix: Set CLAUDE_MEMORY_DIR (or equivalent) to a temp dir for eval runs, or use --memory-dir flag if available. Alternatively, run with a clean CLAUDE_HOME.
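A minimal sketch of the clean-HOME fallback, assuming a CLAUDE_MEMORY_DIR-style option may not exist: point the child process's HOME at a throwaway temp dir so ~/.claude/memory/ and ~/.claude/projects/ are simply absent. The helper names (build_clean_env, run_case) are hypothetical, not harness APIs.

```python
import os
import subprocess
import tempfile

def build_clean_env(base=None):
    """Copy the environment but repoint HOME at a fresh temp dir, so the
    child process sees an empty ~/.claude (no memory/ or projects/)."""
    env = dict(os.environ if base is None else base)
    env["HOME"] = tempfile.mkdtemp(prefix="eval_home_")
    return env

def run_case(prompt, workdir):
    # Fresh `claude -p` per case, as the harness already does, but with a
    # throwaway HOME so persistent memory cannot leak into any mode.
    return subprocess.run(
        ["claude", "-p", prompt],
        cwd=workdir,
        env=build_clean_env(),
        capture_output=True,
        text=True,
    )
```

Note this also hides legitimate user config (API credentials, settings), so those would need to be re-injected explicitly for the run.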

2. Shared workdir within a mode (MEDIUM)

Cases within the same mode (e.g., all skills-on cases) share a native_env/ workdir. If a case uses tools that write files (e.g., creates a Python file), subsequent cases could discover those files.

Fix: Use per-case temp workdirs, or clean the workdir between cases.

3. CLAUDE.md / project files (MEDIUM)

If the harness workdir or any parent has a CLAUDE.md or .claude/ directory, those instructions load into every case — including skills-off baseline cases. This could inject skill-like guidance into baselines.

Fix: For skills-off mode, ensure the workdir path has no CLAUDE.md in any parent directory. The current /tmp/baseline_* approach helps but should be verified.

4. Environment variables (LOW)

os.environ is copied to child processes. If env vars contain graphistry-related hints (e.g., GRAPHISTRY_SERVER), they're visible to all cases regardless of skill mode.

Fix: Scrub non-essential env vars for baseline runs, or use an explicit allowlist.
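The allowlist approach can be a one-line filter over the parent environment. The specific allowlist below is an assumption of a minimal working set, to be extended as the harness actually requires:

```python
# Assumed-minimal allowlist; extend with whatever the harness truly needs.
ENV_ALLOWLIST = {"PATH", "HOME", "LANG", "TERM", "TMPDIR", "SHELL"}

def scrubbed_env(base):
    """Keep only allowlisted vars so graphistry-related hints
    (e.g. GRAPHISTRY_SERVER) never reach baseline cases."""
    return {k: v for k, v in base.items() if k in ENV_ALLOWLIST}
```

An allowlist is safer than a denylist here: a denylist has to anticipate every hint-bearing variable name, while an allowlist fails closed.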

5. Model training data (NOT FIXABLE)

The model may have seen PyGraphistry/GFQL patterns in training data. This is the fundamental baseline contamination risk — skills-off results reflect model knowledge, not zero knowledge.

Mitigation: Document this clearly in benchmark reports. Consider novel/obscure API patterns in eval cases to reduce training data overlap.

Proposed priority

  1. Claude memory isolation — highest impact, most likely to contaminate baselines
  2. Per-case workdir cleanup — prevents filesystem side-channel between cases
  3. CLAUDE.md verification — ensure no project-level instructions leak into baselines
  4. Env var scrubbing — low priority but easy to add
