Evals: Evergreen / self-regenerating eval journeys #8
Summary
Design and implement a system for evergreen, self-regenerating eval journeys that stay current with API changes and documentation updates.
Problem
Current eval journeys are static JSON files that can drift from:
- API changes in PyGraphistry, graphistry-js, REST API
- Documentation updates on ReadTheDocs
- New features and deprecations
- Real-world usage patterns
Goals
- Eval cases automatically update when APIs change
- Journeys reflect current documentation state
- Reduce manual maintenance burden
- Catch skill drift early (before users hit issues)
Potential Approaches
1. Doc-driven generation
- Parse RTD/GitHub docs to extract code examples
- Auto-generate eval cases from docstrings and examples
- Detect when docs change and flag stale journeys
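A minimal sketch of the doc-driven idea, assuming the docs are available as Markdown text with fenced Python examples (reStructuredText sources on RTD would need a different extractor). The `source_hash` field on a journey is hypothetical and not part of the current journey format; it stands for a fingerprint of the doc example the journey was generated from.

```python
import hashlib
import re

# Build the fence marker programmatically so this snippet can itself live in a fenced block.
FENCE = "`" * 3
EXAMPLE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def extract_examples(doc_text: str) -> list[str]:
    """Return the body of every fenced Python example found in the doc text."""
    return [body.strip() for body in EXAMPLE_RE.findall(doc_text)]


def fingerprint(example: str) -> str:
    """Short, stable hash of an example, used to notice when the docs change."""
    return hashlib.sha256(example.encode("utf-8")).hexdigest()[:12]


def stale_journeys(doc_text: str, journeys: list[dict]) -> list[str]:
    """IDs of journeys whose recorded source_hash (hypothetical field) no longer
    matches any example currently in the docs."""
    current = {fingerprint(example) for example in extract_examples(doc_text)}
    return [j["id"] for j in journeys if j.get("source_hash") not in current]
```

A journey flagged here is only a candidate for review: a changed example does not necessarily mean the journey is wrong, only that its source moved.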
2. API schema-driven
- Use type hints, OpenAPI specs, or introspection
- Generate cases covering API surface area
- Track coverage gaps automatically
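One way the introspection variant could track coverage gaps, sketched with the standard library only. The stdlib `json` module is used purely as a stand-in target; a real run would pass the package under test. Matching names against journey text with a word-boundary regex is an assumption about how journeys reference the API and will miss aliased or indirect calls.

```python
import json  # stand-in target module for the example below
import re


def public_callables(module) -> set[str]:
    """Names of public callables, preferring the module's __all__ when it exists."""
    names = getattr(module, "__all__", None)
    if names is None:
        names = [n for n in dir(module) if not n.startswith("_")]
    return {n for n in names if callable(getattr(module, n, None))}


def coverage_gaps(module, journey_texts: list[str]) -> list[str]:
    """Public callables that no journey text mentions as a whole word."""
    blob = "\n".join(journey_texts)
    return sorted(
        name
        for name in public_callables(module)
        if not re.search(r"\b" + re.escape(name) + r"\b", blob)
    )


gaps = coverage_gaps(json, ["Serialize with json.dumps, then parse with json.loads."])
```

Here `gaps` lists the `json` callables the single example text never mentions, such as `dump` and `load`, while `dumps` and `loads` count as covered.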
3. Usage-driven
- Collect anonymized query patterns from real usage
- Generate journeys from common user intents
- Weight cases by frequency
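Weighting by frequency could be as simple as the sketch below. It assumes usage has already been reduced to a list of anonymized intent labels; how those labels are collected and anonymized is out of scope here, and the labels shown are made up for illustration.

```python
from collections import Counter


def intent_weights(intents: list[str]) -> dict[str, float]:
    """Map each intent label to its share of all observed intents."""
    counts = Counter(intents)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the result from most to least frequent intent.
    return {intent: n / total for intent, n in counts.most_common()}


# Illustrative labels only, not real usage data.
weights = intent_weights(["plot", "plot", "filter", "plot"])
```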
4. CI integration
- Nightly runs against latest docs/APIs
- Auto-PR when journeys need updates
- Block releases if evals regress
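The release gate could be sketched as a comparison of per-journey scores between a stored baseline and the latest run. The score dictionaries and the `tolerance` parameter are assumptions for illustration; this is not the output format of the existing eval harness.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Journeys whose current score fell more than `tolerance` below baseline.

    A journey missing from the current run is scored 0.0, so it is reported
    as a regression unless its baseline was already within tolerance of zero.
    """
    failed = []
    for journey, base_score in sorted(baseline.items()):
        score = current.get(journey, 0.0)
        if score < base_score - tolerance:
            failed.append(journey)
    return failed
```

A scheduled CI step could call this after the nightly run and exit non-zero when the returned list is non-empty, which is enough to block a release job that depends on it.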
Requirements
- Design doc outlining chosen approach
- Prototype implementation
- Integration with existing eval harness (agent_eval_loop.py)
- CI workflow for scheduled regeneration
- Documentation in DEVELOP.md
References
- Current journeys: evals/journeys/*.json
- Eval harness: scripts/agent_eval_loop.py
- Benchmark workflow: .agents/skills/benchmarks/SKILL.md
Notes
This is an open-ended design problem. Multiple valid approaches exist. Start with a design doc before implementing.
Help wanted! Community contributions welcome.