Description
Hi Lighteval team, thanks for releasing and maintaining this framework.
Lighteval already does a great job of unifying many benchmarks and backends and making it easy to dig into sample-level results. I wanted to ask about a slightly different angle on evaluation and whether it fits your roadmap.
Over the last year I have been working on something I call a "long-horizon tension crash test" for LLMs. Instead of asking a single question and scoring the answer, the idea is to push a model through a long sequence of very high-tension questions and watch how its internal story slowly drifts or collapses.
Concretely, I maintain an open TXT pack called "WFGY 3.0 · Singularity Demo (BlackHole-131)". It is:
- a plain-text universe of 131 S-class questions across alignment, extreme physics, long-horizon civilization decisions, etc.
- designed so that any LLM that supports file input can read it directly
- already used as a stress lab to see where models start hallucinating or losing track of their own commitments
This behaves less like a standard QA dataset and more like a "crash-test dummy" for long-horizon reasoning under semantic tension.
My question is:
Do you see room in Lighteval for a dimension like this?
Some possibilities I can imagine:
- treating the TXT pack (or a subset) as a long-horizon benchmark and logging per-question behavior,
- defining simple metrics like "amount of self-contradiction over 50+ steps" or "time to collapse under high-tension prompts" (a rough sketch of one such metric follows this list),
- or using it as an optional add-on for people who care about stability under extreme scenarios rather than only accuracy.
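To make the second bullet more concrete, here is a minimal, framework-agnostic sketch of what a "self-contradiction over a long run" score could look like. This deliberately does not use Lighteval's actual metric API (I did not want to guess at it); the function name, the `contradicts` judge, and the collapse threshold are all assumptions for illustration, and the judge would in practice be an NLI model or an LLM-as-judge call.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class TensionRunResult:
    """Summary of one model walked through an ordered question sequence."""
    contradiction_count: int      # answer pairs judged mutually contradictory
    collapse_step: Optional[int]  # first step where contradictions pass the threshold, if any


def score_tension_run(
    answers: Sequence[str],
    contradicts: Callable[[str, str], bool],
    collapse_threshold: int = 3,
) -> TensionRunResult:
    """Compare each new answer against all earlier ones in the run,
    counting contradictions and recording when the run "collapses"."""
    contradiction_count = 0
    collapse_step: Optional[int] = None

    for step, answer in enumerate(answers):
        # Check the new answer against everything the model has committed to so far.
        for earlier in answers[:step]:
            if contradicts(earlier, answer):
                contradiction_count += 1
        if collapse_step is None and contradiction_count >= collapse_threshold:
            collapse_step = step

    return TensionRunResult(contradiction_count, collapse_step)
```

A per-run number like `contradiction_count` or `collapse_step` could then be exposed through whatever sample-level metric interface Lighteval expects; I would defer to you on how that wiring should actually look.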
Everything is MIT-licensed, lives in a public repo, and is meant to be reproducible and inspectable. I am not asking you to adopt it as a first-class benchmark today; I mostly want feedback on whether a long-horizon tension crash test makes sense inside Lighteval or should live entirely outside it.
If this sounds even slightly interesting, I am happy to:
- provide a minimal slice that is easy to wire into your pipeline, and
- share a few anonymized traces of how different models behave when running through it (a rough example of what one trace record could look like is sketched below).
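For the traces, I would imagine something like one JSON line per step. The field names below are purely illustrative (not a fixed format), just to show the kind of information an anonymized record could carry.

```python
import json

# Hypothetical shape of one anonymized trace record; field names are illustrative only.
trace_record = {
    "question_id": "BH-042",        # question index within the pack (made-up id)
    "step": 42,                     # position in the long-horizon run
    "answer_excerpt": "...",        # redacted/truncated model answer
    "contradicts_steps": [7, 19],   # earlier steps this answer conflicts with
    "running_contradictions": 5,    # cumulative contradiction count at this step
}
print(json.dumps(trace_record))
```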
Thanks for any thoughts or pointers on where this might fit.