Description
Hi Lighteval team, thanks for releasing and maintaining this framework.
Lighteval already does a great job of unifying many benchmarks and backends and making it easy to dig into sample-level results. I wanted to ask about a slightly different angle on evaluation and whether it fits your roadmap.
Over the last year I have been working on something I call a "long-horizon tension crash test" for LLMs. Instead of asking a single question and scoring the answer, the idea is to push a model through a long sequence of very high-tension questions and watch how its internal story slowly drifts or collapses.
Concretely, I maintain an open TXT pack called "WFGY 3.0 · Singularity Demo (BlackHole-131)". It is:
- a plain-text universe of 131 S-class questions across alignment, extreme physics, long-horizon civilization decisions, etc.
- designed so that any LLM that supports file input can read it directly
- already used as a stress lab to see where models start hallucinating or losing track of their own commitments
This behaves less like a standard QA dataset and more like a "crash-test dummy" for long-horizon reasoning under semantic tension.
My question is:
Do you see room in Lighteval for a dimension like this?
Some possibilities I can imagine:
- treating the TXT pack (or a subset) as a long-horizon benchmark and logging per-question behavior,
- defining simple metrics like "amount of self-contradiction over 50+ steps" or "time to collapse under high-tension prompts" (a rough sketch of one such metric follows this list),
- or using it as an optional add-on for people who care about stability under extreme scenarios rather than only accuracy.
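To make the second bullet more concrete, here is a minimal, framework-agnostic sketch of what a "self-contradiction over a long run" score could look like. This deliberately does not use Lighteval's actual metric API (I did not want to guess at it); the function name, the `contradicts` judge, and the collapse threshold are all assumptions for illustration, and the judge would in practice be an NLI model or an LLM-as-judge call.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class TensionRunResult:
    """Summary of one model walked through an ordered question sequence."""
    contradiction_count: int      # answer pairs judged mutually contradictory
    collapse_step: Optional[int]  # first step where contradictions pass the threshold, if any


def score_tension_run(
    answers: Sequence[str],
    contradicts: Callable[[str, str], bool],
    collapse_threshold: int = 3,
) -> TensionRunResult:
    """Compare each new answer against all earlier ones in the run,
    counting contradictions and recording when the run "collapses"."""
    contradiction_count = 0
    collapse_step: Optional[int] = None

    for step, answer in enumerate(answers):
        # Check the new answer against everything the model has committed to so far.
        for earlier in answers[:step]:
            if contradicts(earlier, answer):
                contradiction_count += 1
        if collapse_step is None and contradiction_count >= collapse_threshold:
            collapse_step = step

    return TensionRunResult(contradiction_count, collapse_step)
```

A per-run number like `contradiction_count` or `collapse_step` could then be exposed through whatever sample-level metric interface Lighteval expects; I would defer to you on how that wiring should actually look.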
Everything is MIT-licensed, lives in a public repo, and is meant to be reproducible and inspectable. I am not asking you to adopt it as a first-class benchmark today; I mostly want feedback on whether a long-horizon tension crash test makes sense inside Lighteval or should live entirely outside it.
If this sounds even slightly interesting, I am happy to:
- provide a minimal slice that is easy to wire into your pipeline, and
- share a few anonymized traces of how different models behave when running through it (a rough example of what one trace record could look like is sketched below).
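For the traces, I would imagine something like one JSON line per step. The field names below are purely illustrative (not a fixed format), just to show the kind of information an anonymized record could carry.

```python
import json

# Hypothetical shape of one anonymized trace record; field names are illustrative only.
trace_record = {
    "question_id": "BH-042",        # question index within the pack (made-up id)
    "step": 42,                     # position in the long-horizon run
    "answer_excerpt": "...",        # redacted/truncated model answer
    "contradicts_steps": [7, 19],   # earlier steps this answer conflicts with
    "running_contradictions": 5,    # cumulative contradiction count at this step
}
print(json.dumps(trace_record))
```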
Thanks for any thoughts or pointers on where this might fit.