Refactor prompt building #709
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
    dataloader, desc="Greedy generation", position=1, leave=True, disable=self.disable_tqdm
):
    batch_inputs = batch_inputs.to(self.device)
    if self.torch_dtype is not None:
        batch_inputs = batch_inputs.to(self.torch_dtype)

    max_new_tokens = self.config.generation_size or batch_requests[0].generation_size
    do_sample = batch_requests[0].do_sample
```
whether we want to use sampling or greedy is left to the user
isn't there a default value for the param though?
the default value is True. Actually, after the discussion with @lewtun, this param is not needed anymore. It is controlled by the temperature arg (defaults to 0): if the user wants to use sampling, they have to set the temperature > 0
```diff
@@ -650,7 +629,7 @@ def _generate(
     max_new_tokens=max_new_tokens,
     pad_token_id=self.tokenizer.pad_token_id if self.tokenizer.pad_token_id else self.tokenizer.eos_token_id,
     eos_token_id=self.tokenizer.eos_token_id,
-    do_sample=do_sample,
+    do_sample=do_sample if generation_config.get("temperature", 1.0) > 0 else False,
```
do_sample will always be true, except if the user sets temp to 0
which is the default case when temp is not provided
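For reference, a minimal sketch of the behaviour agreed on in this thread: sampling is opted into via temperature, and leaving it unset (or at 0) keeps generation greedy. The default of 0 used below follows the discussion above; the diff itself uses 1.0 as the fallback.

```python
def resolve_do_sample(generation_config: dict) -> bool:
    """Sample only when the user explicitly sets temperature > 0; otherwise stay greedy."""
    return generation_config.get("temperature", 0.0) > 0

assert resolve_do_sample({}) is False                    # temp not provided -> greedy
assert resolve_do_sample({"temperature": 0.0}) is False  # explicit 0 -> greedy
assert resolve_do_sample({"temperature": 0.7}) is True   # temp > 0 -> sampling
```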
Pull Request Overview
This PR overhauls prompt-building and metrics handling across lighteval by removing legacy wrappers, unifying on `Doc` and `ModelResponse` types, and standardizing metric signatures to use `SamplingMethod`.
- Metrics now accept `(doc: Doc, model_response: ModelResponse)` and use `SamplingMethod` instead of the legacy `MetricCategory`/`MetricUseCase`.
- The global `apply_...` functions in `metrics/__init__.py` are consolidated into a single `apply_metric` that handles batched and per-sample metrics (see the sketch after this list).
- Data loading, logging, and documentation are updated to use the new `Doc` model, drop old request classes, and reflect the simplified API.
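A rough sketch of what such a consolidated `apply_metric` could look like. The `batched_compute` flag, `metric.name`, and the batched call signature are hypothetical stand-ins for illustration, not lighteval's actual attributes.

```python
def apply_metric(metrics: list, docs: list, model_responses: list) -> list[dict]:
    """Hypothetical sketch: run every metric over a batch of (doc, model_response) pairs."""
    outputs = [{} for _ in docs]
    for metric in metrics:
        if getattr(metric, "batched_compute", False):
            # Batched metrics see all docs/responses at once and return one score per sample.
            scores = metric.compute(docs=docs, model_responses=model_responses)
            for out, score in zip(outputs, scores):
                out[metric.name] = score
        else:
            # Per-sample metrics follow the new compute(doc, model_response) signature.
            for out, doc, response in zip(outputs, docs, model_responses):
                out[metric.name] = metric.compute(doc=doc, model_response=response)
    return outputs
```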
Reviewed Changes
Copilot reviewed 67 out of 67 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| src/lighteval/metrics/harness_compatibility/drop.py | Update DROP metric to use `Doc` and `ModelResponse` |
| src/lighteval/metrics/dynamic_metrics.py | Replace `MetricCategory` with `SamplingMethod` categories |
| src/lighteval/metrics/__init__.py | Consolidate apply functions into unified `apply_metric` |
| src/lighteval/main_endpoint.py | Minor docstring update |
| src/lighteval/logging/info_loggers.py | Rewrite `Detail` dataclass to hold `doc` and `model_response` |
| src/lighteval/logging/evaluation_tracker.py | Add `preview_outputs` using new `Detail` fields |
| src/lighteval/data.py | Swap legacy request types for `Doc` and update type hints |
| pyproject.toml | Broaden pytest version constraint |
| examples/model_configs/vllm_model_config.yaml | Add `is_async` parameter |
| examples/model_configs/transformers_vlm_model.yaml | Enable `use_fast_image_processor`, set `temperature: 0.0` |
| examples/model_configs/transformers_model.yaml | Set `temperature: 0.0` |
| examples/model_configs/sglang_model_config.yaml | Toggle `use_chat_template: True` |
| examples/custom_tasks_tests.py | Fix parameter name from `metric` to `metrics` |
| docs/source/saving-and-reading-results.mdx | Update detail file columns to `__doc__`, `__model_response__`, `__metric__` |
| docs/source/quicktour.mdx | Refresh backend list with new endpoint names |
| docs/source/package_reference/tasks.mdx | Remove old request classes, add `Doc` |
| docs/source/package_reference/models.mdx | Revise to "Model Configs", update model config entries |
| docs/source/adding-a-new-metric.mdx | Show updated metric signature using `Doc`/`ModelResponse` |
| docs/source/adding-a-custom-task.mdx | Rename `metric` → `metrics`, update parameter names |
| docs/source/_toctree.yml | Rename "Models and ModelConfigs" to "Model Configs" |
Comments suppressed due to low confidence (4)
src/lighteval/data.py:258
- This sorting criterion uses the character length of `doc.query` instead of the token length. It can misorder batches. Consider using the tokenized context length (e.g., `len(doc.tokenized_context)`).
  Flagged line: `-(len(query) + gen_length),`

src/lighteval/logging/info_loggers.py:211
- The attribute `model_response.input` may not exist on `ModelResponse`. Verify the correct property (e.g., `model_response.input_tokens` or `model_response.text`).
  Flagged line: `pprint(model_response.input)`

src/lighteval/metrics/__init__.py:31
- [nitpick] The nested loops over `metrics` and `docs` in `apply_metric` are hard to follow. Consider separating batched and per-sample flows into clear helper functions to improve readability.
  Flagged line: `for metric in metrics:`

docs/source/adding-a-custom-task.mdx:56
- The placeholder `{GENERATIVE,LOGPROBS}` is invalid syntax. Recommend specifying one method, e.g. `SamplingMethod.GENERATIVE`, or documenting how to pass multiple (see the sketch below).
  Flagged line: `category=SamplingMethod.{GENERATIVE,LOGPROBS},`
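On that last point, a self-contained illustration of picking one concrete value. The enum below is a stand-in written for this sketch, not lighteval's real `SamplingMethod` import, and the surrounding field names are hypothetical.

```python
from enum import Enum, auto

class SamplingMethod(Enum):  # stand-in for illustration only, not the real lighteval enum
    GENERATIVE = auto()
    LOGPROBS = auto()

# One concrete value instead of the invalid {GENERATIVE,LOGPROBS} placeholder:
task_kwargs = dict(
    name="my_custom_task",              # hypothetical field names
    category=SamplingMethod.GENERATIVE, # pick a single method here
)
print(task_kwargs["category"])  # SamplingMethod.GENERATIVE
```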
Good to review?
Much neater, very nice refacto of a lot of old hanging code!
Lots of very cool simplifications!
Some questions:
- you removed a range of return types in function signatures and I'm not clear why
- how can one now select a custom batch size? (`override_bs` param before)
- what do you get on `aime24` atm? (I'm still not sure how you manage evals where you get both sampling generative and greedy generative metrics)
Co-authored-by: Clémentine Fourrier <[email protected]>
…ace/lighteval into nathan-refactor-prompt-building
skimmed, looks like most important things are fixed - again super good job
…ace/lighteval into nathan-refactor-prompt-building
What does this PR do?
This PR gives the prompt-building logic in lighteval a much-needed spring cleaning.
The main goal: ditch legacy bloat, make things less painful for users and contributors, and unlock support for more complex benchmarks 🔥
Highlights
- Metrics are now categorized by `SamplingMethod` (generative or loglikelihood). Say goodbye to `use_case` and all those old request types.
- Models take `Doc` directly, no more unnecessary `request` wrappers that were bloating the code.
- Every model returns a unified `ModelResponse` type, whether generative or loglikelihood. This means simpler logging and metric computation.
- Metrics now implement `compute(doc: Doc, model_response: ModelResponse)` (a minimal sketch follows below).
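A rough sketch of the new metric signature. The `Doc` and `ModelResponse` classes below are simplified stand-ins defined for this example, not the real lighteval types.

```python
from dataclasses import dataclass

@dataclass
class Doc:            # stand-in: the real Doc carries many more fields
    query: str
    gold: str

@dataclass
class ModelResponse:  # stand-in: the real ModelResponse covers generative and loglikelihood outputs
    text: str

def exact_match(doc: Doc, model_response: ModelResponse) -> float:
    """Per-sample metric following compute(doc: Doc, model_response: ModelResponse)."""
    return float(model_response.text.strip() == doc.gold.strip())

print(exact_match(Doc(query="2+2=", gold="4"), ModelResponse(text="4")))  # 1.0
```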
Why?