
[BUG] LogProbTokenNorm crashes with IndexError in CFFormulation for belebele #1170

@Rijgersberg


Describe the bug

I'm trying to benchmark models on the CFFormulation for belebele.

When doing so, lighteval crashes during the logprob token-based normalization phase with an IndexError (see traceback below).

Putting some prints in the offending function, I notice that the length of choices_tokens (3) does not match the expected length of 4, which is the number of answer choices and the length of both choices_text and choices_logprob.

choices_text=[' 4', ' 27', ' 6', ' 34']                   
choices_logprob=[-2.90625, -5.65625, -3.03125, -5.4375]     
choices_tokens=[[236743, 236812, -1], [236743, 236778, 236832], [236743, 236825, -1]] 
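The mismatch above is enough to trigger the crash on its own. Here is a minimal sketch of the failing normalization step, using the printed values (the real implementation lives in lighteval/metrics/normalizations.py; this is just an illustration of the indexing logic):

```python
# Hypothetical data mirroring the printed values above.
choices_logprob = [-2.90625, -5.65625, -3.03125, -5.4375]  # 4 entries
choices_tokens = [
    [236743, 236812, -1],
    [236743, 236778, 236832],
    [236743, 236825, -1],
]  # only 3 entries -- the tokens for one choice are missing

# LogProbTokenNorm divides each choice's logprob by its token count.
# The index ix runs over choices_logprob (length 4), so it overruns
# choices_tokens (length 3).
try:
    normalized = [
        choices_logprob[ix] / len(choices_tokens[ix])
        for ix in range(len(choices_logprob))
    ]
except IndexError as exc:
    normalized = None
    error = exc  # "list index out of range", as in the traceback below
```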

I tried to trace it back to its origin, and it appears the mismatch arises during generation itself rather than in the metric code.

Traceback:

───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/main_ac │
│ celerate.py:147 in accelerate                                                │
│                                                                              │
│   144 │   │   model_config=model_config,                                     │
│   145 │   )                                                                  │
│   146 │                                                                      │
│ ❱ 147 │   pipeline.evaluate()                                                │
│   148 │                                                                      │
│   149 │   pipeline.show_results()                                            │
│   150                                                                        │
│                                                                              │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/pipelin │
│ e.py:291 in evaluate                                                         │
│                                                                              │
│   288 │   │                                                                  │
│   289 │   │   if self.is_main_process():                                     │
│   290 │   │   │   self._post_process_outputs(outputs)                        │
│ ❱ 291 │   │   │   self._compute_metrics(outputs)                             │
│   292 │   │   │                                                              │
│   293 │   │   │   self.evaluation_tracker.general_config_logger.log_end_time │
│   294 │   │   │   self.evaluation_tracker.metrics_logger.aggregate(          │
│                                                                              │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/pipelin │
│ e.py:391 in _compute_metrics                                                 │
│                                                                              │
│   388 │   │   │   │   docs = [doc for doc, _ in samples]                     │
│   389 │   │   │   │   responses = [response for _, response in samples]      │
│   390 │   │   │   │                                                          │
│ ❱ 391 │   │   │   │   outputs = apply_metric(                                │
│   392 │   │   │   │   │   docs=docs,                                         │
│   393 │   │   │   │   │   responses=responses,                               │
│   394 │   │   │   │   │   metrics=metric_category_metrics,                   │
│                                                                              │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/metrics │
│ /__init__.py:54 in apply_metric                                              │
│                                                                              │
│   51 │   │   # Add non-batched metric results for this sample                │
│   52 │   │   for metric in non_batched_metrics:                              │
│   53 │   │   │   output.update(                                              │
│ ❱ 54 │   │   │   │   metric.compute_sample(                                  │
│   55 │   │   │   │   │   model_response=responses[i],                        │
│   56 │   │   │   │   │   doc=docs[i],                                        │
│   57 │   │   │   │   )                                                       │
│                                                                              │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/metrics │
│ /utils/metric_utils.py:59 in compute_sample                                  │
│                                                                              │
│    56 │   │                                                                  │
│    57 │   │   if isinstance(self, MetricGrouping):                           │
│    58 │   │   │   return sample_level_fn(**kwargs)                           │
│ ❱  59 │   │   return {self.metric_name: sample_level_fn(**kwargs)}           │
│    60 │                                                                      │
│    61 │   def get_corpus_aggregations(self) -> dict:                         │
│    62 │   │   if isinstance(self, MetricGrouping):                           │
│                                                                              │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/metrics │
│ /metrics_sample.py:282 in compute                                            │
│                                                                              │
│    279 │   │   choices_tokens = model_response.output_tokens[:n_choices]     │
│    280 │   │                                                                 │
│    281 │   │   normalized_log_probs = (                                      │
│ ❱  282 │   │   │   normalize_log_probs(                                      │
│    283 │   │   │   │   self.logprob_normalization,                           │
│    284 │   │   │   │   choices_logprobs,                                     │
│    285 │   │   │   │   unconditioned_logprobs,                               │
│                                                                              │
│ /home/edwinr/venvs/benchmarks/lib/python3.13/site-packages/lighteval/metrics │
│ /normalizations.py:527 in normalize_log_probs                                │
│                                                                              │
│   524 │   │   case LogProbTokenNorm():                                       │
│   525 │   │   │   assert choices_tokens is not None, "choices_tokens must be │
│   526 │   │   │   normalized_log_probs = [                                   │
│ ❱ 527 │   │   │   │   choices_logprob[ix] / len(choices_tokens[ix]) for ix i │
│   528 │   │   │   ]                                                          │
│   529 │   │   case LogProbPMINorm():                                         │
│   530 │   │   │   assert unconditioned_logprob is not None, "unconditioned_l │
╰──────────────────────────────────────────────────────────────────────────────╯
IndexError: list index out of range
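For what it's worth, the bare IndexError could be turned into a descriptive error by validating lengths before normalizing. A hypothetical guard, not the library's actual code:

```python
def token_norm(choices_logprob, choices_tokens):
    """Divide each choice's logprob by its token count (what LogProbTokenNorm does).

    Hypothetical guard: raise a descriptive ValueError instead of a bare
    IndexError when the model response is missing tokens for some choice.
    """
    if len(choices_tokens) != len(choices_logprob):
        raise ValueError(
            f"Expected token lists for {len(choices_logprob)} choices, "
            f"got {len(choices_tokens)}"
        )
    return [lp / len(toks) for lp, toks in zip(choices_logprob, choices_tokens)]
```

This would not fix the root cause (the missing tokens during generation), but it would make the failure mode obvious at the call site.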

To Reproduce

$ lighteval accelerate "model_name=HPLT/hplt2c_nld_checkpoints" "belebele_nld_Latn_cf|5" --load-tasks-multilingual

Expected behavior

Lighteval completes the benchmark successfully, as it does for "belebele_nld_Latn_mcq" and for LogProbCharNorm in "belebele_nld_Latn_cf".

Version info

Linux, lighteval version 0.13.0 and main, python 3.13.
