What happened?
When using the LLM hyperparameter optimization API in Katib, which depends on the Trainer SDK v1.9.0, I encountered multiple errors while running the example in the user guide. The errors stem from version mismatches and code issues in both the storage-initializer and the trainer.
Errors in the storage-initializer container:
2025-03-27T21:53:28Z INFO Downloading model
2025-03-27T21:53:28Z INFO ----------------------------------------
/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/app/storage_initializer/storage.py", line 50, in <module>
model_factory(args.model_provider, args.model_provider_parameters)
File "/app/storage_initializer/storage.py", line 12, in model_factory
hf.download_model_and_tokenizer()
File "/app/storage_initializer/hugging_face.py", line 68, in download_model_and_tokenizer
transformer_type_class.from_pretrained(
File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 521, in from_pretrained
config, kwargs = AutoConfig.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1135, in from_pretrained
return config_class.from_dict(config_dict, **unused_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/transformers/configuration_utils.py", line 763, in from_dict
config = cls(**config_dict)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/configuration_llama.py", line 160, in __init__
self._rope_scaling_validation()
File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/configuration_llama.py", line 180, in _rope_scaling_validation
raise ValueError(
ValueError: `rope_scaling` must be a dictionary with with two fields, `type` and `factor`, got {'factor': 32.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
2025-03-27T21:58:37Z INFO Downloading model
2025-03-27T21:58:37Z INFO ----------------------------------------
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/app/storage_initializer/storage.py", line 50, in <module>
model_factory(args.model_provider, args.model_provider_parameters)
File "/app/storage_initializer/storage.py", line 12, in model_factory
hf.download_model_and_tokenizer()
File "/app/storage_initializer/hugging_face.py", line 74, in download_model_and_tokenizer
transformers.AutoTokenizer.from_pretrained(
File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 916, in from_pretrained
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2255, in from_pretrained
raise EnvironmentError(
OSError: Can't load tokenizer for 'meta-llama/Llama-3.2-1B'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'meta-llama/Llama-3.2-1B' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.
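The first traceback looks like the known incompatibility between older transformers releases and Llama 3.x checkpoints: the bundled `_rope_scaling_validation` only accepts the legacy two-field `{"type", "factor"}` format, while `meta-llama/Llama-3.2-1B` ships the extended `llama3` rope-scaling block shown in the error. A minimal sketch of that mismatch (a pure-Python approximation of the old check, not the actual transformers source):

```python
# Approximation of the pre-upgrade `_rope_scaling_validation` from
# configuration_llama.py; the real implementation lives in transformers
# and may differ in detail.

def legacy_rope_scaling_validation(rope_scaling):
    """Accept only the legacy two-field {'type', 'factor'} format."""
    if rope_scaling is None:
        return
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(
            "`rope_scaling` must be a dictionary with two fields, "
            f"`type` and `factor`, got {rope_scaling}"
        )

# The rope_scaling block Llama-3.2-1B ships (copied from the traceback above):
llama3_rope = {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3",
}

try:
    legacy_rope_scaling_validation(llama3_rope)
except ValueError:
    print("rejected")  # the storage-initializer fails at this point
```

Upgrading transformers in the storage-initializer image to a release that understands the `llama3` rope type should let the config load. The second traceback (the tokenizer `OSError`) may just be fallout from the earlier failed download leaving an incomplete cache, or it may indicate missing Hugging Face credentials for the gated meta-llama repo; I have not confirmed which.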
Errors in the pytorch container:
(test-llm) (base) helen@Hezhis-MacBook-Air trainer % kubectl logs llm-experiment-75sn7nzs-master-0 -n kubeflow -c metrics-logger-and-collector
I0327 23:08:34.638268 150 main.go:400] Trial Name: llm-experiment-75sn7nzs
I0327 23:08:39.453069 150 main.go:143] 2025-03-27T23:08:39Z INFO Starting HuggingFace LLM Trainer
I0327 23:08:39.527767 150 main.go:143] /usr/local/lib/python3.11/site-packages/accelerate/state.py:261: UserWarning: OMP_NUM_THREADS/MKL_NUM_THREADS unset, we set it at 8 to improve oob performance.
I0327 23:08:39.527816 150 main.go:143] warnings.warn(
I0327 23:08:39.544034 150 main.go:143] /usr/local/lib/python3.11/site-packages/transformers/training_args.py:2007: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
I0327 23:08:39.544054 150 main.go:143] warnings.warn(
I0327 23:08:39.545602 150 main.go:143] 2025-03-27T23:08:39Z INFO Setup model and tokenizer
I0327 23:09:18.927359 150 main.go:143] 2025-03-27T23:09:18Z INFO Preprocess dataset
I0327 23:09:18.927463 150 main.go:143] 2025-03-27T23:09:18Z INFO Load and preprocess dataset
I0327 23:09:18.943631 150 main.go:143] 2025-03-27T23:09:18Z INFO Dataset specification: Dataset({
I0327 23:09:18.943646 150 main.go:143] features: ['text', 'label'],
I0327 23:09:18.943653 150 main.go:143] num_rows: 8
I0327 23:09:18.943659 150 main.go:143] })
I0327 23:09:18.943667 150 main.go:143] 2025-03-27T23:09:18Z INFO ----------------------------------------
I0327 23:09:18.943669 150 main.go:143] 2025-03-27T23:09:18Z INFO Tokenize dataset
Map: 0%| | 0/8 [00:00<?, ? examples/s]
Map: 0%| | 0/8 [00:00<?, ? examples/s]
I0327 23:09:19.059331 150 main.go:143] [rank0]: Traceback (most recent call last):
I0327 23:09:19.059344 150 main.go:143] [rank0]: File "/app/hf_llm_training.py", line 195, in <module>
I0327 23:09:19.059360 150 main.go:143] [rank0]: train_data, eval_data = load_and_preprocess_data(
I0327 23:09:19.059365 150 main.go:143] [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059370 150 main.go:143] [rank0]: File "/app/hf_llm_training.py", line 77, in load_and_preprocess_data
I0327 23:09:19.059371 150 main.go:143] [rank0]: dataset = dataset.map(
I0327 23:09:19.059381 150 main.go:143] [rank0]: ^^^^^^^^^^^^
I0327 23:09:19.059382 150 main.go:143] [rank0]: File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
I0327 23:09:19.059390 150 main.go:143] [rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
I0327 23:09:19.059398 150 main.go:143] [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059400 150 main.go:143] [rank0]: File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
I0327 23:09:19.059401 150 main.go:143] [rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
I0327 23:09:19.059403 150 main.go:143] [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059404 150 main.go:143] [rank0]: File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3167, in map
I0327 23:09:19.059406 150 main.go:143] [rank0]: for rank, done, content in Dataset._map_single(**dataset_kwargs):
I0327 23:09:19.059407 150 main.go:143] [rank0]: File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3558, in _map_single
I0327 23:09:19.059409 150 main.go:143] [rank0]: batch = apply_function_on_filtered_inputs(
I0327 23:09:19.059412 150 main.go:143] [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059414 150 main.go:143] [rank0]: File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3427, in apply_function_on_filtered_inputs
I0327 23:09:19.059415 150 main.go:143] [rank0]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
I0327 23:09:19.059417 150 main.go:143] [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059418 150 main.go:143] [rank0]: File "/app/hf_llm_training.py", line 78, in <lambda>
I0327 23:09:19.059420 150 main.go:143] [rank0]: lambda x: tokenizer(x["text"], padding="max_length", truncation=True),
I0327 23:09:19.059421 150 main.go:143] [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059424 150 main.go:143] [rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3055, in __call__
I0327 23:09:19.059428 150 main.go:143] [rank0]: encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
I0327 23:09:19.059479 150 main.go:143] [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059521 150 main.go:143] [rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3142, in _call_one
I0327 23:09:19.059548 150 main.go:143] [rank0]: return self.batch_encode_plus(
I0327 23:09:19.059553 150 main.go:143] [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059575 150 main.go:143] [rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3329, in batch_encode_plus
I0327 23:09:19.059593 150 main.go:143] [rank0]: padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
I0327 23:09:19.059599 150 main.go:143] [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059600 150 main.go:143] [rank0]: File "/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2959, in _get_padding_truncation_strategies
I0327 23:09:19.059616 150 main.go:143] [rank0]: raise ValueError(
I0327 23:09:19.059618 150 main.go:143] [rank0]: ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
I0327 23:09:19.925487 150 main.go:143] E0327 23:09:19.912000 149 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 169) of binary: /usr/local/bin/python3.11
I0327 23:09:19.926437 150 main.go:143] Traceback (most recent call last):
I0327 23:09:19.926447 150 main.go:143] File "/usr/local/bin/torchrun", line 8, in <module>
I0327 23:09:19.926700 150 main.go:143] sys.exit(main())
I0327 23:09:19.926762 150 main.go:143] ^^^^^^
I0327 23:09:19.926773 150 main.go:143] File "/usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
I0327 23:09:19.927195 150 main.go:143] return f(*args, **kwargs)
I0327 23:09:19.927212 150 main.go:143] ^^^^^^^^^^
I0327 23:09:19.927213 150 main.go:143] ^^
I0327 23:09:19.927228 150 main.go:143] File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
I0327 23:09:19.929019 150 main.go:143] run(args)
I0327 23:09:19.929033 150 main.go:143] File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
I0327 23:09:19.929286 150 main.go:143] elastic_launch(
I0327 23:09:19.929296 150 main.go:143] File "/usr/local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
I0327 23:09:19.929539 150 main.go:143] return launch_agent(self._config, self._entrypoint, list(args))
I0327 23:09:19.929590 150 main.go:143] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.929595 150 main.go:143] ^^^
I0327 23:09:19.929600 150 main.go:143]
I0327 23:09:19.929601 150 main.go:143] File "/usr/local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
I0327 23:09:19.929677 150 main.go:143] raise ChildFailedError(
I0327 23:09:19.929686 150 main.go:143] torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I0327 23:09:19.929689 150 main.go:143] ============================================================
I0327 23:09:19.929690 150 main.go:143] hf_llm_training.py FAILED
I0327 23:09:19.929695 150 main.go:143] ------------------------------------------------------------
I0327 23:09:19.929697 150 main.go:143] Failures:
I0327 23:09:19.929701 150 main.go:143] <NO_OTHER_FAILURES>
I0327 23:09:19.929706 150 main.go:143] ------------------------------------------------------------
I0327 23:09:19.929725 150 main.go:143] Root Cause (first observed failure):
I0327 23:09:19.929729 150 main.go:143] [0]:
I0327 23:09:19.929731 150 main.go:143] time : 2025-03-27_23:09:19
I0327 23:09:19.929738 150 main.go:143] host : llm-experiment-75sn7nzs-master-0
I0327 23:09:19.929742 150 main.go:143] rank : 0 (local_rank: 0)
I0327 23:09:19.929742 150 main.go:143] exitcode : 1 (pid: 169)
I0327 23:09:19.929747 150 main.go:143] error_file: <N/A>
I0327 23:09:19.929750 150 main.go:143] traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I0327 23:09:19.929754 150 main.go:143] ============================================================
F0327 23:09:20.710260 150 main.go:425] Failed to wait for worker container: training container is failed. Unable to read file /var/log/katib/143.pid for pid 143, error: open /var/log/katib/143.pid: no such file or directory
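The trainer failure is the one the error message itself explains: Llama tokenizers ship without a `pad_token`, so the `tokenizer(x["text"], padding="max_length", ...)` call in hf_llm_training.py raises before training starts. A minimal stand-in (a pure-Python stub, not the real HF tokenizer) showing the guard the script would need:

```python
# Pure-Python stand-in for the failing call; the real tokenizer comes from
# transformers, this stub only mimics the pad-token behaviour seen in the log.

class TokenizerStub:
    """Mimics a HF tokenizer that has an EOS token but no pad token."""

    def __init__(self):
        self.pad_token = None
        self.eos_token = "</s>"

    def __call__(self, text, padding=False, truncation=False):
        if padding and self.pad_token is None:
            raise ValueError(
                "Asking to pad but the tokenizer does not have a padding token."
            )
        return {"input_ids": [[0] for _ in text]}

tokenizer = TokenizerStub()

# Reproduces the trainer failure:
try:
    tokenizer(["some text"], padding="max_length", truncation=True)
except ValueError as err:
    print(f"failed: {err}")

# The fix the error message itself suggests: reuse EOS as the pad token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(["some text"], padding="max_length", truncation=True)
print("ok" if "input_ids" in batch else "missing")  # prints "ok"
```

With the real tokenizer the equivalent one-liner would be `tokenizer.pad_token = tokenizer.eos_token` (or `tokenizer.add_special_tokens({"pad_token": "[PAD]"})`) before the `dataset.map` call in hf_llm_training.py.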
What did you expect to happen?
The example completes successfully.
Environment
Kubernetes version:
$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0
Kubeflow Trainer version:
$ kubectl get pods -n kubeflow -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"
I installed the Training Operator control plane at v1.9.0.
Kubeflow Python SDK version:
$ pip show kubeflow
I installed the Trainer SDK separately; `pip show kubeflow-training` reports:
Name: kubeflow-training
Version: 1.9.0
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /opt/homebrew/anaconda3/envs/test-llm/lib/python3.12/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by: kubeflow-katib
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.