Formal title: TwinBench: Benchmark for Persistent AI Systems
Version: 1.0
Status: Canonical public specification
TwinBench specifies a benchmark for evaluating persistent AI systems. Its purpose is to measure whether a system preserves useful continuity over time rather than performing well only within isolated sessions.
The benchmark covers systems that claim durable user memory, longitudinal task handling, identity stability, cross-context transfer, or personalization that improves with use.
The TwinBench object of evaluation is a system instance, not a model in isolation.
The evaluated object may include:
- foundation models
- memory services
- orchestration logic
- tool and channel adapters
- scheduling systems
- identity or profile layers
- retrieval and storage infrastructure
TwinBench evaluates the externally observable behavior of that full stack.
TwinBench is distinct from two common benchmark families:
- Model benchmarks measure model competence, often with fixed prompts and single-run scoring.
- Agent-task benchmarks measure task completion within a bounded workflow or episode.
TwinBench measures a third layer:
Persistent intelligence benchmarks measure whether the system remains coherent, useful, and identity-consistent across time, context changes, and repeated use.
TwinBench therefore should not be interpreted as a replacement for model or agent benchmarks. It complements them by measuring what they usually leave untested.
TwinBench v1 assumes that the evaluated system:
- can be engaged across multiple sessions or checkpoints
- can access or simulate durable state across those checkpoints
- exposes enough behavioral evidence for evaluators to inspect continuity claims
- can be tested with fixed scenario definitions and recorded dates
TwinBench does not assume any particular model family, API design, storage system, or product form factor.
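To make these assumptions concrete, here is a minimal Python sketch of the surface a system under test might expose. The class and method names (`SystemUnderTest`, `start_session`, `send`, `end_session`) are hypothetical illustrations, not part of the specification:

```python
from abc import ABC, abstractmethod


class SystemUnderTest(ABC):
    """Hypothetical surface for a TwinBench system instance.

    TwinBench prescribes no API; this sketch only mirrors the v1
    assumptions: multi-session engagement, durable state across
    checkpoints, and inspectable behavior at fixed, recorded dates.
    """

    @abstractmethod
    def start_session(self, checkpoint_id: str, date: str) -> None:
        """Begin a session at a fixed checkpoint with a recorded date."""

    @abstractmethod
    def send(self, message: str) -> str:
        """Exchange one turn; the returned text is the observable evidence."""

    @abstractmethod
    def end_session(self) -> None:
        """Close the session, persisting whatever durable state the stack keeps."""
```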
TwinBench v1 does not directly attempt to measure:
- frontier reasoning capability in isolation
- enterprise security certification
- benchmark-resistant social behavior
- production cost efficiency at large scale
- broad product quality outside persistence
- general intelligence claims
These areas may matter for deployment, but they are outside the benchmark scope of v1.
TwinBench v1 defines five metrics:
- MR: Memory Retention
- IC: Identity Consistency
- CCC: Cross-Context Coherence
- TC: Task Continuity
- PG: Personalization Gain
Metric definitions are normative in METRICS.md.
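For orientation only, the five metric codes can be written down as a small Python enumeration; the codes and names come from this specification, but the encoding itself is not normative:

```python
from enum import Enum


class Metric(Enum):
    """TwinBench v1 metric codes (definitions are normative in METRICS.md)."""
    MR = "Memory Retention"
    IC = "Identity Consistency"
    CCC = "Cross-Context Coherence"
    TC = "Task Continuity"
    PG = "Personalization Gain"
```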
TwinBench favors interpretable scoring over aggressive aggregation.
The v1 default scoring profile:
- scores each metric on a 0-100 scale
- uses equal metric weighting
- reports scenario coverage
- reports metric coverage
- requires caveats and evaluator notes alongside scores
TwinBench prefers explicit partial coverage to hidden extrapolation. If a system is only partially observed, the result should show that directly rather than imply complete evidence.
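A minimal Python sketch of this scoring profile, assuming the convention that the total is the equal-weighted mean over the metrics that were actually observed, with missing metrics surfaced through coverage rather than imputed; the function name and return shape are illustrative:

```python
METRICS = ["MR", "IC", "CCC", "TC", "PG"]


def aggregate(scores: dict[str, float]) -> dict:
    """Equal-weighted mean over observed 0-100 metric scores.

    Metrics absent from `scores` reduce metric coverage instead of
    being silently extrapolated, matching the partial-coverage rule.
    """
    observed = {m: scores[m] for m in METRICS if m in scores}
    total = sum(observed.values()) / len(observed) if observed else None
    return {
        "total": total,  # None when nothing was observed
        "metric_coverage": len(observed) / len(METRICS),
        "observed": observed,
    }
```

Under this sketch, a run that observed only MR and TC would report a total over those two metrics and a metric coverage of 0.4, making the partial evidence visible rather than implied away.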
TwinBench treats reproducibility as a graded property, not an absolute.
The benchmark is designed to improve reproducibility through:
- fixed scenario definitions
- explicit metric formulas
- dated evaluation artifacts
- disclosed caveats
- stable result schema
At the same time, persistent-system evaluation remains sensitive to environmental drift, delayed checkpoints, interface changes, and evaluator interpretation. v1 therefore requires transparent reporting rather than claiming a determinism that does not exist.
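One lightweight way to realize "fixed scenario definitions" and "dated evaluation artifacts" is to pin each scenario by content hash when a run starts. This sketch assumes JSON-serializable scenario definitions and is a suggestion, not a requirement of the spec:

```python
import hashlib
import json
from datetime import date


def artifact_stamp(scenario_def: dict) -> dict:
    """Pin a scenario definition and record when it was evaluated."""
    blob = json.dumps(scenario_def, sort_keys=True).encode("utf-8")
    return {
        "scenario_sha256": hashlib.sha256(blob).hexdigest(),
        "evaluation_date": date.today().isoformat(),
    }
```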
TwinBench v1 defines a minimal but stable public surface:
- one canonical specification
- five core metrics
- five reference scenarios
- a lightweight evaluation scaffold
- a public result template
- one reference example artifact
Later versions may add:
- stronger delayed-checkpoint controls
- richer evidence requirements
- evaluator agreement procedures
- contradiction-resolution scenarios
- multi-user persistence scenarios
TwinBench uses benchmark-level versioning.
- Patch changes may clarify wording, examples, or non-normative guidance.
- Minor changes may add optional scenarios or tooling without breaking result comparability.
- Major changes are required when metric definitions, scoring rules, or normative scenario behavior change in ways that break comparability.
The canonical benchmark version must always be recorded in published results.
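In practice this means results are directly comparable only when the benchmark's major version matches; a small sketch, assuming dotted version strings like "1.2":

```python
def comparable(version_a: str, version_b: str) -> bool:
    """Results compare only within the same major benchmark version."""
    return version_a.split(".")[0] == version_b.split(".")[0]


assert comparable("1.0", "1.2")       # minor change: still comparable
assert not comparable("1.2", "2.0")   # major change: comparability breaks
```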
A valid TwinBench result should include at minimum:
- benchmark name
- benchmark version
- system name
- system version
- evaluation date
- scenario-level observations
- per-metric scores
- total score
- coverage values
- evaluator notes
- caveats
- evidence references, if available
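For illustration, here is a minimal result record containing every field from the list above, written as a Python dict; all values are placeholders, and the authoritative layout is the public result template:

```python
result = {
    "benchmark_name": "TwinBench",
    "benchmark_version": "1.0",
    "system_name": "example-system",      # hypothetical system
    "system_version": "0.3.1",            # hypothetical version
    "evaluation_date": "2025-06-01",      # illustrative date
    "scenario_observations": [],          # scenario-level observations go here
    "metric_scores": {"MR": 72, "IC": 80, "CCC": 65, "TC": 70, "PG": 58},
    "total_score": 69.0,                  # equal-weighted mean of the five scores
    "coverage": {"scenario": 1.0, "metric": 1.0},
    "evaluator_notes": "Illustrative values only.",
    "caveats": ["All numbers are placeholders for schema illustration."],
    "evidence_references": [],            # optional, include if available
}
```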
TwinBench scores are evidence about persistent-system behavior. They are not universal rankings of intelligence, safety, or product quality.
High TwinBench scores suggest that a system behaves more like a coherent persistent intelligence layer. They do not establish superiority outside the benchmark scope.
TwinBench is vendor-neutral, but it is grounded in a clear systems thesis: persistence is not reducible to prompting. Durable continuity usually depends on runtime architecture, state management, orchestration, and identity handling. This makes TwinBench especially relevant to persistent AI runtimes while keeping the benchmark independent of any single implementation.