
Revert the task detail serialization to be compliant with PyArrow #715

Closed

Conversation

alvin319
Contributor

With the changes introduced in #660, we need to revert to the previous implementation of task detail serialization so that the details remain compatible with PyArrow.
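For context, the previous implementation this reverts to essentially stringifies the detail fields whose types vary across request types, so that PyArrow can infer a single schema. A minimal sketch of that idea, with an assumed helper name and assumed field handling (not the actual lighteval code):

```python
import json

# Illustrative helper (not lighteval's actual code): JSON-encode any detail
# field whose type can differ between loglikelihood and generative requests,
# so every row exposes a single, PyArrow-friendly string type.
def serialize_detail(detail: dict) -> dict:
    return {
        key: json.dumps(value) if isinstance(value, (list, dict, tuple)) else value
        for key, value in detail.items()
    }
```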

Running the same command, everything now works again:

> NAMESPACE=Qwen MODEL_NAME=Qwen2.5-0.5B-Instruct MODEL=$NAMESPACE/$MODEL_NAME MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,generation_parameters={max_new_tokens:2048,temperature:0.6,top_p:0.95}" OUTPUT_DIR=data/evals/$MODEL TASK=med_qa LOG_FILE=logs/evals/${TASK}_${MODEL_NAME}.log CUDA_VISIBLE_DEVICES=0 uv run lighteval accelerate $MODEL_ARGS "helm|$TASK|0|0" --use-chat-template --output-dir $OUTPUT_DIR

[2025-05-11 04:29:42,035] [    INFO]: PyTorch version 2.5.1+cu121 available. (config.py:54)
[2025-05-11 04:29:46,965] [    INFO]: Test gather tensor (parallelism.py:133)
[2025-05-11 04:29:47,106] [    INFO]: gathered_tensor tensor([0], device='cuda:0'), should be [0] (parallelism.py:136)
[2025-05-11 04:29:47,107] [    INFO]: --- LOADING MODEL --- (pipeline.py:187)
[2025-05-11 04:29:47,486] [    INFO]: Tokenizer truncation and padding size set to the left side. (transformers_model.py:435)
[2025-05-11 04:29:47,486] [    INFO]: We are not in a distributed setting. Setting model_parallel to False. (transformers_model.py:330)
[2025-05-11 04:29:47,486] [    INFO]: Model parallel was set to False, max memory set to None and device map to None (transformers_model.py:359)
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
[2025-05-11 04:29:47,741] [    INFO]: Using Data Parallelism, putting model on device cuda (transformers_model.py:203)
[2025-05-11 04:29:47,982] [    INFO]: --- INIT SEEDS --- (pipeline.py:258)
[2025-05-11 04:29:47,982] [    INFO]: --- LOADING TASKS --- (pipeline.py:212)
[2025-05-11 04:29:47,982] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`. (registry.py:137)
[2025-05-11 04:29:47,985] [    INFO]: bigbio/med_qa med_qa_en_source (lighteval_task.py:187)
[2025-05-11 04:29:50,053] [    INFO]: --- RUNNING MODEL --- (pipeline.py:462)
[2025-05-11 04:29:50,053] [    INFO]: Running RequestType.GREEDY_UNTIL requests (pipeline.py:466)
[2025-05-11 04:29:51,526] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:237)
Splits:   0%|          | 0/1 [00:00<?, ?it/s]
[2025-05-11 04:29:51,528] [    INFO]: Detecting largest batch size with max_input_length=994 (transformers_model.py:489)
[2025-05-11 04:30:12,344] [    INFO]: Determined largest batch size: 16 (transformers_model.py:502)
[2025-05-11 04:30:12,533] [ WARNING]: /home/ubuntu/datlitgpt_proto/text/sanitized_evals/.venv/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:631: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. (warnings.py:110)
[2025-05-11 04:30:12,533] [ WARNING]: /home/ubuntu/datlitgpt_proto/text/sanitized_evals/.venv/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:636: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.95` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. (warnings.py:110)
[2025-05-11 04:30:12,533] [ WARNING]: /home/ubuntu/datlitgpt_proto/text/sanitized_evals/.venv/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:653: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `20` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`. (warnings.py:110)
Splits: 100%|██████████| 1/1 [01:04<00:00, 64.97s/it]
[2025-05-11 04:30:56,506] [    INFO]: Running RequestType.LOGLIKELIHOOD requests (pipeline.py:466)
[2025-05-11 04:31:10,840] [    INFO]: Detecting largest batch size with max_input_length=989 (transformers_model.py:489)
[2025-05-11 04:31:31,529] [    INFO]: Determined largest batch size: 16 (transformers_model.py:502)
100%|██████████| 199/199 [01:26<00:00,  2.30it/s]
[2025-05-11 04:32:58,206] [    INFO]: Detecting largest batch size with max_input_length=307 (transformers_model.py:489)
[2025-05-11 04:33:03,162] [    INFO]: Determined largest batch size: 32 (transformers_model.py:502)
100%|██████████| 100/100 [00:28<00:00,  3.49it/s]
[2025-05-11 04:33:31,851] [    INFO]: Detecting largest batch size with max_input_length=253 (transformers_model.py:489)
[2025-05-11 04:33:39,982] [    INFO]: Determined largest batch size: 64 (transformers_model.py:502)
100%|██████████| 50/50 [00:23<00:00,  2.15it/s]
[2025-05-11 04:34:03,262] [    INFO]: Detecting largest batch size with max_input_length=210 (transformers_model.py:489)
[2025-05-11 04:34:11,097] [    INFO]: Determined largest batch size: 64 (transformers_model.py:502)
100%|██████████| 50/50 [00:19<00:00,  2.56it/s]
[2025-05-11 04:34:30,750] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:498)
[2025-05-11 04:34:42,656] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:540)
|    Task     |Version|Metric|Value |   |Stderr|
|-------------|------:|------|-----:|---|-----:|
|all          |       |em    |0.0000|±  |0.0000|
|             |       |qem   |0.0000|±  |0.0000|
|             |       |pem   |0.1470|±  |0.0070|
|             |       |pqem  |0.4012|±  |0.0097|
|             |       |acc   |0.2597|±  |0.0087|
|helm:med_qa:0|      0|em    |0.0000|±  |0.0000|
|             |       |qem   |0.0000|±  |0.0000|
|             |       |pem   |0.1470|±  |0.0070|
|             |       |pqem  |0.4012|±  |0.0097|
|             |       |acc   |0.2597|±  |0.0087|

[2025-05-11 04:34:42,685] [    INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:530)
[2025-05-11 04:34:42,685] [    INFO]: Saving experiment tracker (evaluation_tracker.py:196)
[2025-05-11 04:34:59,340] [    INFO]: Saving results to /home/ubuntu/datlitgpt_proto/text/sanitized_evals/data/evals/Qwen/Qwen2.5-0.5B-Instruct/results/Qwen/Qwen2.5-0.5B-Instruct/results_2025-05-11T04-34-42.685574.json (evaluation_tracker.py:265)

@alvin319
Contributor Author

@NathanHB, please review this whenever you get the chance, as it is blocking the critical path of running evaluations with LightEval in general.

@NathanHB
Member

Hi! Thanks for the PR. Indeed, it's annoying that we get issues with the logging. I just took a look, and this happens because the task has both loglikelihood and generative metrics. Those generate details with different schemas (for fields predictions, input_tokens, cont_tokens).
I would rather not convert everything to a string, as that makes inspecting the details fairly unreliable.
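For reference, the clash is easy to reproduce outside lighteval: PyArrow cannot infer a single column type when one row stores a scalar and another stores a list for the same field. The field names below are illustrative, not the exact detail keys:

```python
import pyarrow as pa

# Two "detail" rows whose fields have conflicting types, as happens when a
# task logs both generative and loglikelihood details into the same table.
# Field names are illustrative, not lighteval's actual detail keys.
generative_row = {"predictions": "B", "cont_tokens": [345, 12]}
loglikelihood_row = {"predictions": ["A", "B", "C", "D"], "cont_tokens": [[345], [12]]}

# Raises pyarrow.lib.ArrowInvalid: the rows cannot share one inferred schema.
pa.Table.from_pylist([generative_row, loglikelihood_row])
```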

What you could do: choose whether you want loglikelihood or generative evals for your use case and stick to that; this should remove the error.

For example, removing loglikelihood (the best choice for instruct models):

|    Task     |Version|Metric|Value|   |Stderr|
|-------------|------:|------|----:|---|-----:|
|all          |       |em    |  0.0|±  |0.0000|
|             |       |qem   |  0.0|±  |0.0000|
|             |       |pem   |  0.0|±  |0.0000|
|             |       |pqem  |  0.6|±  |0.1633|
|helm:med_qa:0|      0|em    |  0.0|±  |0.0000|
|             |       |qem   |  0.0|±  |0.0000|
|             |       |pem   |  0.0|±  |0.0000|
|             |       |pqem  |  0.6|±  |0.1633|

@alvin319
Contributor Author

Thanks for the reply @NathanHB! If that's the case, then a lot of public tasks that share both sets of metrics will not be runnable unless some refactoring happens to separate the metric use. Moreover, it seems counter-intuitive not to be able to use different metrics to evaluate model performance solely because of this serialization behavior, as some of my use cases do involve a mixed set of logprob-based and generative metrics to evaluate models across scales. I also think a follow-up is to update the existing documentation to explicitly describe this behavior.

That being said, can we consider redesigning how the schema is defined upstream?

@NathanHB
Member

I agree that we should not have issues running both types of metrics. We have wanted to revamp the details logging system for a while but never had the bandwidth.

The schema should indeed be the same for all details, with maybe optional fields.
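Something along these lines, sketched with assumed field names (not an actual proposal from the codebase): declare one explicit schema with nullable fields and let whichever request type does not populate a field leave it null:

```python
import pyarrow as pa

# Sketch of a single shared schema with optional (nullable) fields.
# Field names and types are assumptions for illustration only.
detail_schema = pa.schema([
    pa.field("example", pa.string()),
    pa.field("generative_prediction", pa.string(), nullable=True),
    pa.field("loglikelihood_choices", pa.list_(pa.string()), nullable=True),
    pa.field("cont_tokens", pa.list_(pa.int64()), nullable=True),
])

rows = [
    {"example": "q1", "generative_prediction": "B",
     "loglikelihood_choices": None, "cont_tokens": [345, 12]},
    {"example": "q2", "generative_prediction": None,
     "loglikelihood_choices": ["A", "B", "C", "D"], "cont_tokens": [7, 8]},
]

# Both kinds of details now fit one table without schema inference errors.
table = pa.Table.from_pylist(rows, schema=detail_schema)
```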

@alvin319
Contributor Author

> I agree that we should not have issues running both types of metrics. We have wanted to revamp the details logging system for a while but never had the bandwidth.
>
> The schema should indeed be the same for all details, with maybe optional fields.

Can you give us some pointers on where the schema lives?

@clefourrier
Member

In this file: https://github.com/huggingface/lighteval/blob/main/src/lighteval/logging/evaluation_tracker.py

@alvin319
Contributor Author

alvin319 commented Jun 2, 2025

I currently don't have the bandwidth to address this issue in a timely manner, so I'm closing this PR for now.

@alvin319 alvin319 closed this Jun 2, 2025
Development

Successfully merging this pull request may close these issues.

[BUG] encounter an ArrowInvalid error while saving experiment tracker