[BUG] ArrowInvalid error while saving the experiment tracker #660

## Describe the bug
I encountered an `ArrowInvalid` error while saving the experiment tracker.
Most of the evaluation completes, but the error occurs when the results are saved.
The error output is as follows:
```
[2025-04-06 18:27:14,942] [    INFO]: Saving experiment tracker (evaluation_tracker.py:180)
|    Task     |Version|Metric|Value |   |Stderr|
|-------------|------:|------|-----:|---|-----:|
|all          |       |em    |0.0102|±  |0.0020|
|             |       |qem   |0.0110|±  |0.0021|
|             |       |pem   |0.1925|±  |0.0078|
|             |       |pqem  |0.3937|±  |0.0097|
|             |       |acc   |0.2578|±  |0.0087|
|helm:med_qa:0|      0|em    |0.0102|±  |0.0020|
|             |       |qem   |0.0110|±  |0.0021|
|             |       |pem   |0.1925|±  |0.0078|
|             |       |pqem  |0.3937|±  |0.0097|
|             |       |acc   |0.2578|±  |0.0087|
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/lc/.conda/envs/lighteval/lib/python3.11/site-packages/lighteval/main_v │
│ llm.py:163 in vllm                                                           │
│                                                                              │
│   160 │                                                                      │
│   161 │   results = pipeline.get_results()                                   │
│   162 │                                                                      │
│ ❱ 163 │   pipeline.save_and_push_results()                                   │
│   164 │                                                                      │
│   165 │   return results                                                     │
│   166                                                                        │
│                                                                              │
│ /home/lc/.conda/envs/lighteval/lib/python3.11/site-packages/lighteval/pipeli │
│ ne.py:536 in save_and_push_results                                           │
│                                                                              │
│   533 │   def save_and_push_results(self):                                   │
│   534 │   │   logger.info("--- SAVING AND PUSHING RESULTS ---")              │
│   535 │   │   if self.is_main_process():                                     │
│ ❱ 536 │   │   │   self.evaluation_tracker.save()                             │
│   537 │                                                                      │
│   538 │   def _init_final_dict(self):                                        │
│   539 │   │   if self.is_main_process():                                     │
│                                                                              │
│ /home/lc/.conda/envs/lighteval/lib/python3.11/site-packages/lighteval/loggin │
│ g/evaluation_tracker.py:201 in save                                          │
│                                                                              │
│   198 │   │   details_datasets: dict[str, Dataset] = {}                      │
│   199 │   │   for task_name, task_details in self.details_logger.details.ite │
│   200 │   │   │   # Create a dataset from the dictionary - we force cast to  │
│ ❱ 201 │   │   │   dataset = Dataset.from_list([asdict(detail) for detail in  │
│   202 │   │   │                                                              │
│   203 │   │   │   # We don't keep 'id' around if it's there                  │
│   204 │   │   │   column_names = dataset.column_names                        │
│                                                                              │
│ /home/lc/.conda/envs/lighteval/lib/python3.11/site-packages/datasets/arrow_d │
│ ataset.py:986 in from_list                                                   │
│                                                                              │
│    983 │   │   """                                                           │
│    984 │   │   # for simplicity and consistency wrt OptimizedTypedSequence w │
│    985 │   │   mapping = {k: [r.get(k) for r in mapping] for k in mapping[0] │
│ ❱  986 │   │   return cls.from_dict(mapping, features, info, split)          │
│    987 │                                                                     │
│    988 │   @staticmethod                                                     │
│    989 │   def from_csv(                                                     │
│                                                                              │
│ /home/lc/.conda/envs/lighteval/lib/python3.11/site-packages/datasets/arrow_d │
│ ataset.py:940 in from_dict                                                   │
│                                                                              │
│    937 │   │   │   │   )                                                     │
│    938 │   │   │   arrow_typed_mapping[col] = data                           │
│    939 │   │   mapping = arrow_typed_mapping                                 │
│ ❱  940 │   │   pa_table = InMemoryTable.from_pydict(mapping=mapping)         │
│    941 │   │   if info is None:                                              │
│    942 │   │   │   info = DatasetInfo()                                      │
│    943 │   │   info.features = features                                      │
│                                                                              │
│ /home/lc/.conda/envs/lighteval/lib/python3.11/site-packages/datasets/table.p │
│ y:758 in from_pydict                                                         │
│                                                                              │
│    755 │   │   Returns:                                                      │
│    756 │   │   │   `datasets.table.Table`                                    │
│    757 │   │   """                                                           │
│ ❱  758 │   │   return cls(pa.Table.from_pydict(*args, **kwargs))             │
│    759 │                                                                     │
│    760 │   @classmethod                                                      │
│    761 │   def from_pylist(cls, mapping, *args, **kwargs):                   │
│                                                                              │
│ in pyarrow.lib._Tabular.from_pydict:1968                                     │
│                                                                              │
│ in pyarrow.lib._from_pydict:6337                                             │
│                                                                              │
│ in pyarrow.lib.asarray:402                                                   │
│                                                                              │
│ in pyarrow.lib.array:252                                                     │
│                                                                              │
│ in pyarrow.lib._handle_arrow_array_protocol:114                              │
│                                                                              │
│ /home/lc/.conda/envs/lighteval/lib/python3.11/site-packages/datasets/arrow_w │
│ riter.py:229 in __arrow_array__                                              │
│                                                                              │
│   226 │   │   │   │   out = list_of_np_array_to_pyarrow_listarray(data)      │
│   227 │   │   │   else:                                                      │
│   228 │   │   │   │   trying_cast_to_python_objects = True                   │
│ ❱ 229 │   │   │   │   out = pa.array(cast_to_python_objects(data, only_1d_fo │
│   230 │   │   │   # use smaller integer precisions if possible               │
│   231 │   │   │   if self.trying_int_optimization:                           │
│   232 │   │   │   │   if pa.types.is_int64(out.type):                        │
│                                                                              │
│ in pyarrow.lib.array:372                                                     │
│                                                                              │
│ in pyarrow.lib._sequence_to_array:42                                         │
│                                                                              │
│ in pyarrow.lib.pyarrow_internal_check_status:155                             │
│                                                                              │
│ in pyarrow.lib.check_status:92                                               │
╰──────────────────────────────────────────────────────────────────────────────╯
ArrowInvalid: cannot mix list and non-list, non-null values

```
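The message suggests that at least one field of the per-sample details holds a list in some records and a non-list value (for example a string) in others, which Arrow cannot store in a single column. Below is a minimal diagnostic sketch for locating such a field. It is not part of lighteval: `find_mixed_columns` is a hypothetical helper, the sample records and their field names are made up, and it only inspects top-level values. In practice the records would be the `asdict(detail)` dictionaries built in `evaluation_tracker.py`.

```python
from collections import defaultdict


def find_mixed_columns(records):
    """Return {column: sorted type names} for columns that mix
    list-like and non-list values (None is ignored)."""
    kinds = defaultdict(set)
    for record in records:
        for key, value in record.items():
            if value is None:
                continue  # nulls are compatible with any Arrow type
            kinds[key].add(type(value).__name__)

    list_like = {"list", "tuple"}
    return {
        key: sorted(names)
        for key, names in kinds.items()
        if names & list_like and names - list_like
    }


# Made-up records standing in for [asdict(detail) for detail in task_details]
records = [
    {"example": "q1", "predictions": ["A"], "gold": "A"},
    {"example": "q2", "predictions": "B", "gold": "B"},
    {"example": "q3", "predictions": None, "gold": "C"},
]

mixed = find_mixed_columns(records)
print(mixed)  # {'predictions': ['list', 'str']}
```

Any column reported here is a candidate for the field that triggers the failure in `Dataset.from_list`. A mix nested inside lists or dicts would need a recursive variant of the same check.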
## To Reproduce
I ran the following command to evaluate Qwen2.5-0.5B-Instruct on the med_qa benchmark and got the error above.
```bash
NAMESPACE=Qwen
MODEL_NAME=Qwen2.5-0.5B-Instruct
MODEL=$NAMESPACE/$MODEL_NAME
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=2048,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:2048,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=data/evals/$MODEL

TASK=med_qa
LOG_FILE=logs/evals/${TASK}_${MODEL_NAME}.log
CUDA_VISIBLE_DEVICES=0 nohup lighteval vllm $MODEL_ARGS "helm|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR \
    > ${LOG_FILE} 2>&1 &
```

## Expected behavior
The evaluation results should be saved successfully.

## Version info
- Ubuntu 24.04
- lighteval 0.8.1
- CUDA 12.4
- vllm 0.8.3
- torch 2.6.0


