Evals: Evergreen / self-regenerating eval journeys #8

@lmeyerov

Summary

Design and implement a system for evergreen, self-regenerating eval journeys that stay current with API changes and documentation updates.

Problem

Current eval journeys are static JSON files that can drift from:

  • API changes in PyGraphistry, graphistry-js, and the REST API
  • Documentation updates on ReadTheDocs
  • New features and deprecations
  • Real-world usage patterns

Goals

  • Eval cases automatically update when APIs change
  • Journeys reflect current documentation state
  • Reduce manual maintenance burden
  • Catch skill drift early (before users hit issues)

Potential Approaches

1. Doc-driven generation

  • Parse RTD/GitHub docs to extract code examples
  • Auto-generate eval cases from docstrings and examples
  • Detect when docs change and flag stale journeys
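
As a rough illustration of the doc-driven idea, the sketch below pulls `>>>`-style examples out of a docstring or doc page with the standard library's `doctest` parser, fingerprints them, and flags a journey as stale when the fingerprint no longer matches. The `source_doc_hash` field is a hypothetical addition to the journey JSON, not part of the current schema, and real RTD pages would also need RST/Markdown code-block extraction.

```python
import doctest
import hashlib


def extract_examples(doc_text: str) -> list[str]:
    """Return the source of each `>>>` example found in doc_text."""
    parser = doctest.DocTestParser()
    return [ex.source.strip() for ex in parser.get_examples(doc_text)]


def fingerprint(examples: list[str]) -> str:
    """Stable hash over the ordered list of examples."""
    h = hashlib.sha256()
    for ex in examples:
        h.update(ex.encode("utf-8"))
        h.update(b"\0")  # separator so ["ab"] and ["a", "b"] hash differently
    return h.hexdigest()


def is_stale(journey: dict, doc_text: str) -> bool:
    """True when the doc examples changed since the journey was generated.

    Assumes a hypothetical `source_doc_hash` field stored in the journey.
    """
    return journey.get("source_doc_hash") != fingerprint(extract_examples(doc_text))
```

A journey generated from one version of a page would store the fingerprint at generation time. A nightly job could then re-fetch the page and call `is_stale` to decide whether to flag or regenerate the journey.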

2. API schema-driven

  • Use type hints, OpenAPI specs, or introspection
  • Generate cases covering API surface area
  • Track coverage gaps automatically
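
One possible shape for the introspection variant is sketched below: enumerate a module's public functions with `inspect`, then diff that surface against the APIs the journeys claim to exercise. The `apis` field on each journey is a hypothetical annotation, and the stdlib `json` module stands in for PyGraphistry so the sketch runs without extra dependencies.

```python
import inspect
import json  # stand-in target module; a real run would inspect graphistry
from types import ModuleType


def public_api(module: ModuleType) -> dict[str, str]:
    """Map each public module-level function to its signature string."""
    surface = {}
    for name, obj in inspect.getmembers(module, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private helpers
        surface[f"{module.__name__}.{name}"] = str(inspect.signature(obj))
    return surface


def coverage_gaps(surface: dict[str, str], journeys: list[dict]) -> list[str]:
    """Return API names that no journey lists in its (hypothetical) `apis` field."""
    covered = {api for journey in journeys for api in journey.get("apis", [])}
    return sorted(set(surface) - covered)


surface = public_api(json)
gaps = coverage_gaps(surface, [{"id": "serialize", "apis": ["json.dumps"]}])
```

Storing the signature strings alongside the journeys would also let a later run detect changed signatures, not just missing coverage. Class methods and OpenAPI-described REST endpoints would need their own enumerators.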

3. Usage-driven

  • Collect anonymized query patterns from real usage
  • Generate journeys from common user intents
  • Weight cases by frequency
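
Frequency weighting could be as simple as the sketch below, which assumes usage has already been reduced to a list of anonymized intent labels (the labels and the `min_count` threshold are illustrative, not an existing data source). Dropping rare intents doubles as a crude guard against surfacing a single user's query.

```python
from collections import Counter


def weight_intents(intent_log: list[str], min_count: int = 1) -> dict[str, float]:
    """Turn a log of intent labels into normalized sampling weights.

    Intents seen fewer than min_count times are dropped before normalizing.
    """
    counts = Counter(intent_log)
    kept = {intent: n for intent, n in counts.items() if n >= min_count}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {intent: n / total for intent, n in kept.items()}
```

The resulting weights could drive either how many journeys are generated per intent or how per-journey scores are aggregated into a headline number.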

4. CI integration

  • Nightly runs against latest docs/APIs
  • Auto-PR when journeys need updates
  • Block releases if evals regress
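
For the release gate, a minimal sketch might compare per-journey scores from the latest run against a stored baseline and return a non-zero exit code on regression. The score dictionaries, the tolerance, and the function names are assumptions; the actual output format of the eval harness is not specified here.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return journey ids whose score dropped by more than tolerance.

    A journey missing from the current run is treated as a regression.
    """
    regressed = []
    for journey_id, base_score in baseline.items():
        score = current.get(journey_id)
        if score is None or score < base_score - tolerance:
            regressed.append(journey_id)
    return sorted(regressed)


def gate_exit_code(baseline: dict[str, float], current: dict[str, float]) -> int:
    """Exit code for a CI step: 0 passes, 1 blocks the release."""
    return 1 if find_regressions(baseline, current) else 0
```

A scheduled workflow could call `sys.exit(gate_exit_code(...))` after the nightly run. The tolerance keeps small run-to-run noise from blocking releases.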

Requirements

  • Design doc outlining chosen approach
  • Prototype implementation
  • Integration with the existing eval harness (scripts/agent_eval_loop.py)
  • CI workflow for scheduled regeneration
  • Documentation in DEVELOP.md

References

  • Current journeys: evals/journeys/*.json
  • Eval harness: scripts/agent_eval_loop.py
  • Benchmark workflow: .agents/skills/benchmarks/SKILL.md

Notes

This is an open-ended design problem. Multiple valid approaches exist. Start with a design doc before implementing.


Help wanted! Community contributions welcome.

Labels

help wanted (Extra attention is needed)