
Revert the task detail serialization to be compliant with PyArrow #715

Closed

Conversation

alvin319
Contributor

With the changes introduced in #660, we need to revert to the previous implementation of task detail serialization so that the details remain compatible with PyArrow.
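For context, the previous implementation this reverts to essentially stringifies the detail fields whose types vary across request types, so that PyArrow can infer a single schema. A minimal sketch of that idea, with an assumed helper name and assumed field handling (not the actual lighteval code):

```python
import json

# Illustrative helper (not lighteval's actual code): JSON-encode any detail
# field whose type can differ between loglikelihood and generative requests,
# so every row exposes a single, PyArrow-friendly string type.
def serialize_detail(detail: dict) -> dict:
    return {
        key: json.dumps(value) if isinstance(value, (list, dict, tuple)) else value
        for key, value in detail.items()
    }
```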

Running the same command, everything now works again:

> NAMESPACE=Qwen MODEL_NAME=Qwen2.5-0.5B-Instruct MODEL=$NAMESPACE/$MODEL_NAME MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,generation_parameters={max_new_tokens:2048,temperature:0.6,top_p:0.95}" OUTPUT_DIR=data/evals/$MODEL TASK=med_qa LOG_FILE=logs/evals/${TASK}_${MODEL_NAME}.log CUDA_VISIBLE_DEVICES=0 uv run lighteval accelerate $MODEL_ARGS "helm|$TASK|0|0" --use-chat-template --output-dir $OUTPUT_DIR

[2025-05-11 04:29:42,035] [    INFO]: PyTorch version 2.5.1+cu121 available. (config.py:54)
[2025-05-11 04:29:46,965] [    INFO]: Test gather tensor (parallelism.py:133)
[2025-05-11 04:29:47,106] [    INFO]: gathered_tensor tensor([0], device='cuda:0'), should be [0] (parallelism.py:136)
[2025-05-11 04:29:47,107] [    INFO]: --- LOADING MODEL --- (pipeline.py:187)
[2025-05-11 04:29:47,486] [    INFO]: Tokenizer truncation and padding size set to the left side. (transformers_model.py:435)
[2025-05-11 04:29:47,486] [    INFO]: We are not in a distributed setting. Setting model_parallel to False. (transformers_model.py:330)
[2025-05-11 04:29:47,486] [    INFO]: Model parallel was set to False, max memory set to None and device map to None (transformers_model.py:359)
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
[2025-05-11 04:29:47,741] [    INFO]: Using Data Parallelism, putting model on device cuda (transformers_model.py:203)
[2025-05-11 04:29:47,982] [    INFO]: --- INIT SEEDS --- (pipeline.py:258)
[2025-05-11 04:29:47,982] [    INFO]: --- LOADING TASKS --- (pipeline.py:212)
[2025-05-11 04:29:47,982] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`. (registry.py:137)
[2025-05-11 04:29:47,985] [    INFO]: bigbio/med_qa med_qa_en_source (lighteval_task.py:187)
[2025-05-11 04:29:50,053] [    INFO]: --- RUNNING MODEL --- (pipeline.py:462)
[2025-05-11 04:29:50,053] [    INFO]: Running RequestType.GREEDY_UNTIL requests (pipeline.py:466)
[2025-05-11 04:29:51,526] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:237)
Splits:   0%|          | 0/1 [00:00<?, ?it/s]
[2025-05-11 04:29:51,528] [    INFO]: Detecting largest batch size with max_input_length=994 (transformers_model.py:489)
[2025-05-11 04:30:12,344] [    INFO]: Determined largest batch size: 16 (transformers_model.py:502)
[2025-05-11 04:30:12,533] [ WARNING]: /home/ubuntu/datlitgpt_proto/text/sanitized_evals/.venv/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:631: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. (warnings.py:110)
[2025-05-11 04:30:12,533] [ WARNING]: /home/ubuntu/datlitgpt_proto/text/sanitized_evals/.venv/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:636: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.95` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. (warnings.py:110)
[2025-05-11 04:30:12,533] [ WARNING]: /home/ubuntu/datlitgpt_proto/text/sanitized_evals/.venv/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:653: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `20` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`. (warnings.py:110)
Splits: 100%|██████████| 1/1 [01:04<00:00, 64.97s/it]
[2025-05-11 04:30:56,506] [    INFO]: Running RequestType.LOGLIKELIHOOD requests (pipeline.py:466)
[2025-05-11 04:31:10,840] [    INFO]: Detecting largest batch size with max_input_length=989 (transformers_model.py:489)
[2025-05-11 04:31:31,529] [    INFO]: Determined largest batch size: 16 (transformers_model.py:502)
100%|██████████| 199/199 [01:26<00:00,  2.30it/s]
[2025-05-11 04:32:58,206] [    INFO]: Detecting largest batch size with max_input_length=307 (transformers_model.py:489)
[2025-05-11 04:33:03,162] [    INFO]: Determined largest batch size: 32 (transformers_model.py:502)
100%|██████████| 100/100 [00:28<00:00,  3.49it/s]
[2025-05-11 04:33:31,851] [    INFO]: Detecting largest batch size with max_input_length=253 (transformers_model.py:489)
[2025-05-11 04:33:39,982] [    INFO]: Determined largest batch size: 64 (transformers_model.py:502)
100%|██████████| 50/50 [00:23<00:00,  2.15it/s]
[2025-05-11 04:34:03,262] [    INFO]: Detecting largest batch size with max_input_length=210 (transformers_model.py:489)
[2025-05-11 04:34:11,097] [    INFO]: Determined largest batch size: 64 (transformers_model.py:502)
100%|██████████| 50/50 [00:19<00:00,  2.56it/s]
[2025-05-11 04:34:30,750] [    INFO]: --- COMPUTING METRICS --- (pipeline.py:498)
[2025-05-11 04:34:42,656] [    INFO]: --- DISPLAYING RESULTS --- (pipeline.py:540)
|    Task     |Version|Metric|Value |   |Stderr|
|-------------|------:|------|-----:|---|-----:|
|all          |       |em    |0.0000|±  |0.0000|
|             |       |qem   |0.0000|±  |0.0000|
|             |       |pem   |0.1470|±  |0.0070|
|             |       |pqem  |0.4012|±  |0.0097|
|             |       |acc   |0.2597|±  |0.0087|
|helm:med_qa:0|      0|em    |0.0000|±  |0.0000|
|             |       |qem   |0.0000|±  |0.0000|
|             |       |pem   |0.1470|±  |0.0070|
|             |       |pqem  |0.4012|±  |0.0097|
|             |       |acc   |0.2597|±  |0.0087|

[2025-05-11 04:34:42,685] [    INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:530)
[2025-05-11 04:34:42,685] [    INFO]: Saving experiment tracker (evaluation_tracker.py:196)
[2025-05-11 04:34:59,340] [    INFO]: Saving results to /home/ubuntu/datlitgpt_proto/text/sanitized_evals/data/evals/Qwen/Qwen2.5-0.5B-Instruct/results/Qwen/Qwen2.5-0.5B-Instruct/results_2025-05-11T04-34-42.685574.json (evaluation_tracker.py:265)

@alvin319
Contributor Author

@NathanHB, please review this whenever you get the chance, as it is blocking the critical path of running evaluations with LightEval in general.

@NathanHB
Member

Hi! Thanks for the PR. Indeed, it's annoying that we get issues with the logging. I just took a look, and this happens because the task has both loglikelihood and generative metrics. Those generate details with different schemas (for fields predictions, input_tokens, cont_tokens).
I would rather not convert everything to a string, as that makes inspecting the details fairly unreliable.
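For reference, the clash is easy to reproduce outside lighteval: PyArrow cannot infer a single column type when one row stores a scalar and another stores a list for the same field. The field names below are illustrative, not the exact detail keys:

```python
import pyarrow as pa

# Two "detail" rows whose fields have conflicting types, as happens when a
# task logs both generative and loglikelihood details into the same table.
# Field names are illustrative, not lighteval's actual detail keys.
generative_row = {"predictions": "B", "cont_tokens": [345, 12]}
loglikelihood_row = {"predictions": ["A", "B", "C", "D"], "cont_tokens": [[345], [12]]}

# Raises pyarrow.lib.ArrowInvalid: the rows cannot share one inferred schema.
pa.Table.from_pylist([generative_row, loglikelihood_row])
```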

What you could do: choose whether you want loglikelihood or generative evals for your use case and stick to that; this should remove the error.

For example, removing loglikelihood (the best choice for instruct models):

|    Task     |Version|Metric|Value|   |Stderr|
|-------------|------:|------|----:|---|-----:|
|all          |       |em    |  0.0|±  |0.0000|
|             |       |qem   |  0.0|±  |0.0000|
|             |       |pem   |  0.0|±  |0.0000|
|             |       |pqem  |  0.6|±  |0.1633|
|helm:med_qa:0|      0|em    |  0.0|±  |0.0000|
|             |       |qem   |  0.0|±  |0.0000|
|             |       |pem   |  0.0|±  |0.0000|
|             |       |pqem  |  0.6|±  |0.1633|

@alvin319
Contributor Author

Thanks for the reply @NathanHB! If that's the case, then a lot of public tasks that share both sets of metrics will not be runnable unless some refactoring happens to separate the metric use. Moreover, it seems counter-intuitive not to be able to use different metrics to evaluate model performance solely because of this serialization behavior, as some of my use cases do involve a mixed set of logprob-based and generative metrics to evaluate models across scales. I also think a follow-up is to update the existing documentation to explicitly describe this behavior.

That being said, can we consider redesigning how the schema is defined upstream?

@NathanHB
Member

I agree that we should not have issues running both types of metrics. We have wanted to revamp the details logging system for a while but never had the bandwidth.

The schema should indeed be the same for all details, with maybe optional fields.
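Something along these lines, sketched with assumed field names (not an actual proposal from the codebase): declare one explicit schema with nullable fields and let whichever request type does not populate a field leave it null:

```python
import pyarrow as pa

# Sketch of a single shared schema with optional (nullable) fields.
# Field names and types are assumptions for illustration only.
detail_schema = pa.schema([
    pa.field("example", pa.string()),
    pa.field("generative_prediction", pa.string(), nullable=True),
    pa.field("loglikelihood_choices", pa.list_(pa.string()), nullable=True),
    pa.field("cont_tokens", pa.list_(pa.int64()), nullable=True),
])

rows = [
    {"example": "q1", "generative_prediction": "B",
     "loglikelihood_choices": None, "cont_tokens": [345, 12]},
    {"example": "q2", "generative_prediction": None,
     "loglikelihood_choices": ["A", "B", "C", "D"], "cont_tokens": [7, 8]},
]

# Both kinds of details now fit one table without schema inference errors.
table = pa.Table.from_pylist(rows, schema=detail_schema)
```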

@alvin319
Contributor Author

> I agree that we should not have issues running both types of metrics. We have wanted to revamp the details logging system for a while but never had the bandwidth.
>
> The schema should indeed be the same for all details, with maybe optional fields.

Can you give us some pointers on where the schema lives?

@clefourrier
Member

In this file: https://github.com/huggingface/lighteval/blob/main/src/lighteval/logging/evaluation_tracker.py

@alvin319
Contributor Author

alvin319 commented Jun 2, 2025

I currently don't have the bandwidth to address this issue in a timely manner, so I'm closing this PR for now.

@alvin319 alvin319 closed this Jun 2, 2025
Development

Successfully merging this pull request may close these issues.

[BUG] encounter an ArrowInvalid error while saving experiment tracker