Describe the bug
I'm trying to benchmark models on the CFFormulation for belebele.
When doing so, lighteval crashes during the logprob token-based normalization phase with an IndexError (see traceback below).
Putting some prints in the offending function, I notice that the length of choices_tokens (3) does not match the expected length of 4, which is both the number of answer choices and the length of choices_text and choices_logprob.
choices_text=[' 4', ' 27', ' 6', ' 34']
choices_logprob=[-2.90625, -5.65625, -3.03125, -5.4375]
choices_tokens=[[236743, 236812, -1], [236743, 236778, 236832], [236743, 236825, -1]]
I tried to trace it back to its origin, but it seems to be a problem that occurs during generation itself.
Traceback:
───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/main_ac │
│ celerate.py:147 in accelerate │
│ │
│ 144 │ │ model_config=model_config, │
│ 145 │ ) │
│ 146 │ │
│ ❱ 147 │ pipeline.evaluate() │
│ 148 │ │
│ 149 │ pipeline.show_results() │
│ 150 │
│ │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/pipelin │
│ e.py:291 in evaluate │
│ │
│ 288 │ │ │
│ 289 │ │ if self.is_main_process(): │
│ 290 │ │ │ self._post_process_outputs(outputs) │
│ ❱ 291 │ │ │ self._compute_metrics(outputs) │
│ 292 │ │ │ │
│ 293 │ │ │ self.evaluation_tracker.general_config_logger.log_end_time │
│ 294 │ │ │ self.evaluation_tracker.metrics_logger.aggregate( │
│ │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/pipelin │
│ e.py:391 in _compute_metrics │
│ │
│ 388 │ │ │ │ docs = [doc for doc, _ in samples] │
│ 389 │ │ │ │ responses = [response for _, response in samples] │
│ 390 │ │ │ │ │
│ ❱ 391 │ │ │ │ outputs = apply_metric( │
│ 392 │ │ │ │ │ docs=docs, │
│ 393 │ │ │ │ │ responses=responses, │
│ 394 │ │ │ │ │ metrics=metric_category_metrics, │
│ │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/metrics │
│ /__init__.py:54 in apply_metric │
│ │
│ 51 │ │ # Add non-batched metric results for this sample │
│ 52 │ │ for metric in non_batched_metrics: │
│ 53 │ │ │ output.update( │
│ ❱ 54 │ │ │ │ metric.compute_sample( │
│ 55 │ │ │ │ │ model_response=responses[i], │
│ 56 │ │ │ │ │ doc=docs[i], │
│ 57 │ │ │ │ ) │
│ │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/metrics │
│ /utils/metric_utils.py:59 in compute_sample │
│ │
│ 56 │ │ │
│ 57 │ │ if isinstance(self, MetricGrouping): │
│ 58 │ │ │ return sample_level_fn(**kwargs) │
│ ❱ 59 │ │ return {self.metric_name: sample_level_fn(**kwargs)} │
│ 60 │ │
│ 61 │ def get_corpus_aggregations(self) -> dict: │
│ 62 │ │ if isinstance(self, MetricGrouping): │
│ │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/metrics │
│ /metrics_sample.py:282 in compute │
│ │
│ 279 │ │ choices_tokens = model_response.output_tokens[:n_choices] │
│ 280 │ │ │
│ 281 │ │ normalized_log_probs = ( │
│ ❱ 282 │ │ │ normalize_log_probs( │
│ 283 │ │ │ │ self.logprob_normalization, │
│ 284 │ │ │ │ choices_logprobs, │
│ 285 │ │ │ │ unconditioned_logprobs, │
│ │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/metrics │
│ /normalizations.py:527 in normalize_log_probs │
│ │
│ 524 │ │ case LogProbTokenNorm(): │
│ 525 │ │ │ assert choices_tokens is not None, "choices_tokens must be │
│ 526 │ │ │ normalized_log_probs = [ │
│ ❱ 527 │ │ │ │ choices_logprob[ix] / len(choices_tokens[ix]) for ix i │
│ 528 │ │ │ ] │
│ 529 │ │ case LogProbPMINorm(): │
│ 530 │ │ │ assert unconditioned_logprob is not None, "unconditioned_l │
╰──────────────────────────────────────────────────────────────────────────────╯
IndexError: list index out of range
To Reproduce
$ lighteval accelerate "model_name=HPLT/hplt2c_nld_checkpoints" "belebele_nld_Latn_cf|5" --load-tasks-multilingual
Expected behavior
Lighteval completes the benchmark successfully, as it does for "belebele_nld_Latn_mcq" and for LogProbCharNorm in "belebele_nld_Latn_cf".
Version info
Linux, lighteval version 0.13.0 and main, python 3.13.