Refactor prompt building #709
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
    dataloader, desc="Greedy generation", position=1, leave=True, disable=self.disable_tqdm
):
    batch_inputs = batch_inputs.to(self.device)
    if self.torch_dtype is not None:
        batch_inputs = batch_inputs.to(self.torch_dtype)

    max_new_tokens = self.config.generation_size or batch_requests[0].generation_size
    do_sample = batch_requests[0].do_sample
```
whether we want to use sampling or greedy is left to the user
isn't there a default value for the param though?
the default value is True. Actually, after the discussion with @lewtun, this param is not needed anymore. It is controlled by the temperature arg (defaults to 0): if the user wants to use sampling, they have to set the temperature > 0
```diff
@@ -650,7 +629,7 @@ def _generate(
     max_new_tokens=max_new_tokens,
     pad_token_id=self.tokenizer.pad_token_id if self.tokenizer.pad_token_id else self.tokenizer.eos_token_id,
     eos_token_id=self.tokenizer.eos_token_id,
-    do_sample=do_sample,
+    do_sample=do_sample if generation_config.get("temperature", 1.0) > 0 else False,
```
do_sample will always be true, except if the user sets temp to 0
which is the default case when temp is not provided
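For reference, a minimal sketch of the behaviour agreed on in this thread: sampling is opted into via temperature, and leaving it unset (or at 0) keeps generation greedy. The default of 0 used below follows the discussion above; the diff itself uses 1.0 as the fallback.

```python
def resolve_do_sample(generation_config: dict) -> bool:
    """Sample only when the user explicitly sets temperature > 0; otherwise stay greedy."""
    return generation_config.get("temperature", 0.0) > 0

assert resolve_do_sample({}) is False                    # temp not provided -> greedy
assert resolve_do_sample({"temperature": 0.0}) is False  # explicit 0 -> greedy
assert resolve_do_sample({"temperature": 0.7}) is True   # temp > 0 -> sampling
```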
Pull Request Overview
This PR overhauls prompt-building and metrics handling across lighteval by removing legacy wrappers, unifying on `Doc` and `ModelResponse` types, and standardizing metric signatures to use `SamplingMethod`.
- Metrics now accept `(doc: Doc, model_response: ModelResponse)` and use `SamplingMethod` instead of the legacy `MetricCategory`/`MetricUseCase`.
- The global `apply_...` functions in `metrics/__init__.py` are consolidated into a single `apply_metric` that handles batched and per-sample metrics (see the sketch after this list).
- Data loading, logging, and documentation are updated to use the new `Doc` model, drop old request classes, and reflect the simplified API.
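A rough sketch of what such a consolidated `apply_metric` could look like. The `batched_compute` flag, `metric.name`, and the batched call signature are hypothetical stand-ins for illustration, not lighteval's actual attributes.

```python
def apply_metric(metrics: list, docs: list, model_responses: list) -> list[dict]:
    """Hypothetical sketch: run every metric over a batch of (doc, model_response) pairs."""
    outputs = [{} for _ in docs]
    for metric in metrics:
        if getattr(metric, "batched_compute", False):
            # Batched metrics see all docs/responses at once and return one score per sample.
            scores = metric.compute(docs=docs, model_responses=model_responses)
            for out, score in zip(outputs, scores):
                out[metric.name] = score
        else:
            # Per-sample metrics follow the new compute(doc, model_response) signature.
            for out, doc, response in zip(outputs, docs, model_responses):
                out[metric.name] = metric.compute(doc=doc, model_response=response)
    return outputs
```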
Reviewed Changes
Copilot reviewed 67 out of 67 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| src/lighteval/metrics/harness_compatibility/drop.py | Update DROP metric to use `Doc` and `ModelResponse` |
| src/lighteval/metrics/dynamic_metrics.py | Replace `MetricCategory` with `SamplingMethod` categories |
| src/lighteval/metrics/__init__.py | Consolidate apply functions into unified `apply_metric` |
| src/lighteval/main_endpoint.py | Minor docstring update |
| src/lighteval/logging/info_loggers.py | Rewrite `Detail` dataclass to hold `doc` and `model_response` |
| src/lighteval/logging/evaluation_tracker.py | Add `preview_outputs` using new `Detail` fields |
| src/lighteval/data.py | Swap legacy request types for `Doc` and update type hints |
| pyproject.toml | Broaden pytest version constraint |
| examples/model_configs/vllm_model_config.yaml | Add `is_async` parameter |
| examples/model_configs/transformers_vlm_model.yaml | Enable `use_fast_image_processor`, set `temperature: 0.0` |
| examples/model_configs/transformers_model.yaml | Set `temperature: 0.0` |
| examples/model_configs/sglang_model_config.yaml | Toggle `use_chat_template: True` |
| examples/custom_tasks_tests.py | Fix parameter name from `metric` to `metrics` |
| docs/source/saving-and-reading-results.mdx | Update detail file columns to `__doc__`, `__model_response__`, `__metric__` |
| docs/source/quicktour.mdx | Refresh backend list with new endpoint names |
| docs/source/package_reference/tasks.mdx | Remove old request classes, add `Doc` |
| docs/source/package_reference/models.mdx | Revise to "Model Configs", update model config entries |
| docs/source/adding-a-new-metric.mdx | Show updated metric signature using `Doc`/`ModelResponse` |
| docs/source/adding-a-custom-task.mdx | Rename `metric` → `metrics`, update parameter names |
| docs/source/_toctree.yml | Rename "Models and ModelConfigs" to "Model Configs" |
Comments suppressed due to low confidence (4)
src/lighteval/data.py:258
- This sorting criterion uses the character length of `doc.query` instead of the token length. It can misorder batches. Consider using the tokenized context length (e.g., `len(doc.tokenized_context)`).
  Flagged line: `-(len(query) + gen_length),`

src/lighteval/logging/info_loggers.py:211
- The attribute `model_response.input` may not exist on `ModelResponse`. Verify the correct property (e.g., `model_response.input_tokens` or `model_response.text`).
  Flagged line: `pprint(model_response.input)`

src/lighteval/metrics/__init__.py:31
- [nitpick] The nested loops over `metrics` and `docs` in `apply_metric` are hard to follow. Consider separating batched and per-sample flows into clear helper functions to improve readability.
  Flagged line: `for metric in metrics:`

docs/source/adding-a-custom-task.mdx:56
- The placeholder `{GENERATIVE,LOGPROBS}` is invalid syntax. Recommend specifying one method, e.g. `SamplingMethod.GENERATIVE`, or documenting how to pass multiple (see the sketch below).
  Flagged line: `category=SamplingMethod.{GENERATIVE,LOGPROBS},`
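On that last point, a self-contained illustration of picking one concrete value. The enum below is a stand-in written for this sketch, not lighteval's real `SamplingMethod` import, and the surrounding field names are hypothetical.

```python
from enum import Enum, auto

class SamplingMethod(Enum):  # stand-in for illustration only, not the real lighteval enum
    GENERATIVE = auto()
    LOGPROBS = auto()

# One concrete value instead of the invalid {GENERATIVE,LOGPROBS} placeholder:
task_kwargs = dict(
    name="my_custom_task",              # hypothetical field names
    category=SamplingMethod.GENERATIVE, # pick a single method here
)
print(task_kwargs["category"])  # SamplingMethod.GENERATIVE
```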
Good to review?
Much neater, very nice refacto of a lot of old hanging code!
Lots of very cool simplifications!
Some questions:
- you removed a range of return types in function signatures and I'm not clear why
- how can one now select a custom batch size? (`override_bs` param before)
- what do you get on `aime24` atm? (I'm still not sure how you manage evals where you get both sampling generative and greedy generative metrics)
Co-authored-by: Clémentine Fourrier <[email protected]>
…ace/lighteval into nathan-refactor-prompt-building
skimmed, looks like most important things are fixed - again super good job
…ace/lighteval into nathan-refactor-prompt-building
What does this PR do?
This PR gives the prompt-building logic in lighteval a much-needed spring cleaning.
The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥
Highlights
- Metrics are now categorized by `SamplingMethod` (generative or loglikelihood). Say goodbye to `use_case` and all those old request types.
- Models take `Doc` directly, no more unnecessary `request` wrappers that were bloating the code.
- Every model returns a unified `ModelResponse` type, whether generative or loglikelihood. This means simpler logging and metric computation.
- Metrics now implement `compute(doc: Doc, model_response: ModelResponse)` (a minimal sketch follows below).
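A rough sketch of the new metric signature. The `Doc` and `ModelResponse` classes below are simplified stand-ins defined for this example, not the real lighteval types.

```python
from dataclasses import dataclass

@dataclass
class Doc:            # stand-in: the real Doc carries many more fields
    query: str
    gold: str

@dataclass
class ModelResponse:  # stand-in: the real ModelResponse covers generative and loglikelihood outputs
    text: str

def exact_match(doc: Doc, model_response: ModelResponse) -> float:
    """Per-sample metric following compute(doc: Doc, model_response: ModelResponse)."""
    return float(model_response.text.strip() == doc.gold.strip())

print(exact_match(Doc(query="2+2=", gold="4"), ModelResponse(text="4")))  # 1.0
```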
Why?