
[SDK] Errors Running Katib LLM Hyperparameter Optimization Example Due to Issues with Trainer SDK v1.9.0 #2575

Open
helenxie-bit opened this issue Mar 29, 2025 · 2 comments

@helenxie-bit
Contributor

What happened?

When using the LLM hyperparameter optimization API in Katib—which depends on the Trainer SDK v1.9.0—I encountered multiple errors while running the example in the user guide. These errors stem from version mismatches and code issues in both the storage-initializer and trainer.

  1. Errors in the storage-initializer container:
2025-03-27T21:53:28Z INFO     Downloading model
2025-03-27T21:53:28Z INFO     ----------------------------------------
/usr/local/lib/python3.11/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/storage_initializer/storage.py", line 50, in <module>
    model_factory(args.model_provider, args.model_provider_parameters)
  File "/app/storage_initializer/storage.py", line 12, in model_factory
    hf.download_model_and_tokenizer()
  File "/app/storage_initializer/hugging_face.py", line 68, in download_model_and_tokenizer
    transformer_type_class.from_pretrained(
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 521, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1135, in from_pretrained
    return config_class.from_dict(config_dict, **unused_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/configuration_utils.py", line 763, in from_dict
    config = cls(**config_dict)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/configuration_llama.py", line 160, in __init__
    self._rope_scaling_validation()
  File "/usr/local/lib/python3.11/site-packages/transformers/models/llama/configuration_llama.py", line 180, in _rope_scaling_validation
    raise ValueError(
ValueError: `rope_scaling` must be a dictionary with with two fields, `type` and `factor`, got {'factor': 32.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
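The `rope_scaling` failure looks like a transformers version mismatch: Llama 3.x checkpoints ship a five-field llama3-style `rope_scaling` block in `config.json`, while the transformers release pinned in the storage-initializer image (pre-4.43) still validates it against the old two-field `{type, factor}` schema. The sketch below is a simplified re-implementation of that legacy check for illustration only, not the library code:

```python
# Simplified stand-in for the old Llama `_rope_scaling_validation`
# (transformers < 4.43), to show why a Llama 3.x config is rejected.
def validate_rope_scaling_legacy(rope_scaling):
    if rope_scaling is None:
        return
    # The legacy check requires exactly the two fields `type` and `factor`.
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(
            "`rope_scaling` must be a dictionary with two fields, "
            f"`type` and `factor`, got {rope_scaling}"
        )
    if rope_scaling.get("type") is None or rope_scaling.get("factor") is None:
        raise ValueError("`rope_scaling` requires `type` and `factor` fields")

# The rope_scaling block from the traceback above (as shipped by Llama-3.2-1B):
llama3_rope = {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3",
}

try:
    validate_rope_scaling_legacy(llama3_rope)
except ValueError as e:
    print("rejected:", e)
```

Under this reading, upgrading transformers in the storage-initializer image to a release that understands `rope_type: llama3` should make the download succeed.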
2025-03-27T21:58:37Z INFO     Downloading model
2025-03-27T21:58:37Z INFO     ----------------------------------------
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/storage_initializer/storage.py", line 50, in <module>
    model_factory(args.model_provider, args.model_provider_parameters)
  File "/app/storage_initializer/storage.py", line 12, in model_factory
    hf.download_model_and_tokenizer()
  File "/app/storage_initializer/hugging_face.py", line 74, in download_model_and_tokenizer
    transformers.AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 916, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2255, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'meta-llama/Llama-3.2-1B'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'meta-llama/Llama-3.2-1B' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.
  2. Errors in the pytorch container:
(test-llm) (base) helen@Hezhis-MacBook-Air trainer % kubectl logs llm-experiment-75sn7nzs-master-0 -n kubeflow -c metrics-logger-and-collector
I0327 23:08:34.638268     150 main.go:400] Trial Name: llm-experiment-75sn7nzs
I0327 23:08:39.453069     150 main.go:143] 2025-03-27T23:08:39Z INFO     Starting HuggingFace LLM Trainer
I0327 23:08:39.527767     150 main.go:143] /usr/local/lib/python3.11/site-packages/accelerate/state.py:261: UserWarning: OMP_NUM_THREADS/MKL_NUM_THREADS unset, we set it at 8 to improve oob performance.
I0327 23:08:39.527816     150 main.go:143]   warnings.warn(
I0327 23:08:39.544034     150 main.go:143] /usr/local/lib/python3.11/site-packages/transformers/training_args.py:2007: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
I0327 23:08:39.544054     150 main.go:143]   warnings.warn(
I0327 23:08:39.545602     150 main.go:143] 2025-03-27T23:08:39Z INFO     Setup model and tokenizer
I0327 23:09:18.927359     150 main.go:143] 2025-03-27T23:09:18Z INFO     Preprocess dataset
I0327 23:09:18.927463     150 main.go:143] 2025-03-27T23:09:18Z INFO     Load and preprocess dataset
I0327 23:09:18.943631     150 main.go:143] 2025-03-27T23:09:18Z INFO     Dataset specification: Dataset({
I0327 23:09:18.943646     150 main.go:143]     features: ['text', 'label'],
I0327 23:09:18.943653     150 main.go:143]     num_rows: 8
I0327 23:09:18.943659     150 main.go:143] })
I0327 23:09:18.943667     150 main.go:143] 2025-03-27T23:09:18Z INFO     ----------------------------------------
I0327 23:09:18.943669     150 main.go:143] 2025-03-27T23:09:18Z INFO     Tokenize dataset
Map:   0%|          | 0/8 [00:00<?, ? examples/s]
Map:   0%|          | 0/8 [00:00<?, ? examples/s]
I0327 23:09:19.059331     150 main.go:143] [rank0]: Traceback (most recent call last):
I0327 23:09:19.059344     150 main.go:143] [rank0]:   File "/app/hf_llm_training.py", line 195, in <module>
I0327 23:09:19.059360     150 main.go:143] [rank0]:     train_data, eval_data = load_and_preprocess_data(
I0327 23:09:19.059365     150 main.go:143] [rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059370     150 main.go:143] [rank0]:   File "/app/hf_llm_training.py", line 77, in load_and_preprocess_data
I0327 23:09:19.059371     150 main.go:143] [rank0]:     dataset = dataset.map(
I0327 23:09:19.059381     150 main.go:143] [rank0]:               ^^^^^^^^^^^^
I0327 23:09:19.059382     150 main.go:143] [rank0]:   File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
I0327 23:09:19.059390     150 main.go:143] [rank0]:     out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
I0327 23:09:19.059398     150 main.go:143] [rank0]:                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059400     150 main.go:143] [rank0]:   File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
I0327 23:09:19.059401     150 main.go:143] [rank0]:     out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
I0327 23:09:19.059403     150 main.go:143] [rank0]:                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059404     150 main.go:143] [rank0]:   File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3167, in map
I0327 23:09:19.059406     150 main.go:143] [rank0]:     for rank, done, content in Dataset._map_single(**dataset_kwargs):
I0327 23:09:19.059407     150 main.go:143] [rank0]:   File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3558, in _map_single
I0327 23:09:19.059409     150 main.go:143] [rank0]:     batch = apply_function_on_filtered_inputs(
I0327 23:09:19.059412     150 main.go:143] [rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059414     150 main.go:143] [rank0]:   File "/usr/local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3427, in apply_function_on_filtered_inputs
I0327 23:09:19.059415     150 main.go:143] [rank0]:     processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
I0327 23:09:19.059417     150 main.go:143] [rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059418     150 main.go:143] [rank0]:   File "/app/hf_llm_training.py", line 78, in <lambda>
I0327 23:09:19.059420     150 main.go:143] [rank0]:     lambda x: tokenizer(x["text"], padding="max_length", truncation=True),
I0327 23:09:19.059421     150 main.go:143] [rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059424     150 main.go:143] [rank0]:   File "/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3055, in __call__
I0327 23:09:19.059428     150 main.go:143] [rank0]:     encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
I0327 23:09:19.059479     150 main.go:143] [rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059521     150 main.go:143] [rank0]:   File "/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3142, in _call_one
I0327 23:09:19.059548     150 main.go:143] [rank0]:     return self.batch_encode_plus(
I0327 23:09:19.059553     150 main.go:143] [rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059575     150 main.go:143] [rank0]:   File "/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3329, in batch_encode_plus
I0327 23:09:19.059593     150 main.go:143] [rank0]:     padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
I0327 23:09:19.059599     150 main.go:143] [rank0]:                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.059600     150 main.go:143] [rank0]:   File "/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2959, in _get_padding_truncation_strategies
I0327 23:09:19.059616     150 main.go:143] [rank0]:     raise ValueError(
I0327 23:09:19.059618     150 main.go:143] [rank0]: ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
I0327 23:09:19.925487     150 main.go:143] E0327 23:09:19.912000 149 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 169) of binary: /usr/local/bin/python3.11
I0327 23:09:19.926437     150 main.go:143] Traceback (most recent call last):
I0327 23:09:19.926447     150 main.go:143]   File "/usr/local/bin/torchrun", line 8, in <module>
I0327 23:09:19.926700     150 main.go:143]     sys.exit(main())
I0327 23:09:19.926762     150 main.go:143]              ^^^^^^
I0327 23:09:19.926773     150 main.go:143]   File "/usr/local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
I0327 23:09:19.927195     150 main.go:143]     return f(*args, **kwargs)
I0327 23:09:19.927212     150 main.go:143]            ^^^^^^^^^^
I0327 23:09:19.927213     150 main.go:143] ^^
I0327 23:09:19.927228     150 main.go:143]   File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
I0327 23:09:19.929019     150 main.go:143]     run(args)
I0327 23:09:19.929033     150 main.go:143]   File "/usr/local/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
I0327 23:09:19.929286     150 main.go:143]     elastic_launch(
I0327 23:09:19.929296     150 main.go:143]   File "/usr/local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
I0327 23:09:19.929539     150 main.go:143]     return launch_agent(self._config, self._entrypoint, list(args))
I0327 23:09:19.929590     150 main.go:143]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I0327 23:09:19.929595     150 main.go:143] ^^^
I0327 23:09:19.929600     150 main.go:143] 
I0327 23:09:19.929601     150 main.go:143]   File "/usr/local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
I0327 23:09:19.929677     150 main.go:143]     raise ChildFailedError(
I0327 23:09:19.929686     150 main.go:143] torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
I0327 23:09:19.929689     150 main.go:143] ============================================================
I0327 23:09:19.929690     150 main.go:143] hf_llm_training.py FAILED
I0327 23:09:19.929695     150 main.go:143] ------------------------------------------------------------
I0327 23:09:19.929697     150 main.go:143] Failures:
I0327 23:09:19.929701     150 main.go:143]   <NO_OTHER_FAILURES>
I0327 23:09:19.929706     150 main.go:143] ------------------------------------------------------------
I0327 23:09:19.929725     150 main.go:143] Root Cause (first observed failure):
I0327 23:09:19.929729     150 main.go:143] [0]:
I0327 23:09:19.929731     150 main.go:143]   time      : 2025-03-27_23:09:19
I0327 23:09:19.929738     150 main.go:143]   host      : llm-experiment-75sn7nzs-master-0
I0327 23:09:19.929742     150 main.go:143]   rank      : 0 (local_rank: 0)
I0327 23:09:19.929742     150 main.go:143]   exitcode  : 1 (pid: 169)
I0327 23:09:19.929747     150 main.go:143]   error_file: <N/A>
I0327 23:09:19.929750     150 main.go:143]   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I0327 23:09:19.929754     150 main.go:143] ============================================================
F0327 23:09:20.710260     150 main.go:425] Failed to wait for worker container: training container is failed. Unable to read file /var/log/katib/143.pid for pid 143, error: open /var/log/katib/143.pid: no such file or directory
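The pad-token failure in the trainer is the well-known Llama quirk: the tokenizer ships without a `pad_token`, so `tokenizer(..., padding="max_length")` in `hf_llm_training.py` raises. A minimal sketch of the usual fix (reusing EOS as PAD before mapping the dataset); `_StubTokenizer` is only a stand-in so the snippet runs without transformers installed, and `ensure_pad_token` is a hypothetical helper name, not part of the trainer code:

```python
def ensure_pad_token(tokenizer):
    """Reuse the EOS token as PAD when the tokenizer ships without one,
    as Llama tokenizers do. Call before dataset.map(lambda x: tokenizer(...))."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

class _StubTokenizer:
    """Stand-in with the two attributes the fix touches (illustration only)."""
    pad_token = None
    eos_token = "</s>"

tok = ensure_pad_token(_StubTokenizer())
print(tok.pad_token)  # the EOS string now doubles as the PAD token
```

The error message itself suggests the same two options: point `pad_token` at `eos_token`, or register a new `[PAD]` special token via `add_special_tokens`.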

What did you expect to happen?

I expected the example to complete successfully: the storage-initializer downloads the model and tokenizer, and the trainer runs without errors.

Environment

Kubernetes version:

$ kubectl version

Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0

Kubeflow Trainer version:

$ kubectl get pods -n kubeflow -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"

I installed the Training Operator control plane: v1.9.0

Kubeflow Python SDK version:

$ pip show kubeflow
I installed the Trainer SDK separately; `pip show kubeflow-training` reports:
Name: kubeflow-training
Version: 1.9.0
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /opt/homebrew/anaconda3/envs/test-llm/lib/python3.12/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by: kubeflow-katib

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

@helenxie-bit
Contributor Author

/assign

@helenxie-bit
Contributor Author

/remove-label lifecycle/needs-triage
