OSError: sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification #72

Open
starlitsky2010 opened this issue Jun 8, 2023 · 1 comment

@starlitsky2010

Hi NVIDIA,

Slurm appears to be set up correctly:

root@user:~/container_tools/NeMo-Megatron-Launcher/launcher_scripts# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
dgx* up infinite 1 idle user

First, I downloaded the Pile dataset offline.
Then I placed only 00.jsonl.zst in the launcher_scripts/data/bpe directory:

~/container_tools/NeMo-Megatron-Launcher/launcher_scripts/data/bpe# ls
00.jsonl.zst  merges.txt  SHA256SUMS.txt  test.jsonl.zst  val.jsonl.zst  vocab.json

But when I run "python3 main.py", the following error occurs:

main.py:57: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path="conf", config_name="config")
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
=================== save_dir /root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/data/bpe file_name vocab.json
=================== save_dir /root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/data/bpe file_name vocab.json
File /root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/data/bpe/vocab.json already exists, skipping download.
=================== save_dir /root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/data/bpe file_name merges.txt
=================== save_dir /root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/data/bpe file_name merges.txt
File /root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/data/bpe/merges.txt already exists, skipping download.
Job nemo-megatron-download_gpt3_pile submission file created at '/root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/results/download_gpt3_pile/download/nemo-megatron-download_gpt3_pile_submission.sh'
sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification
Error executing job with overrides: []
subprocess.CalledProcessError: Command '['sbatch', '/root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/results/download_gpt3_pile/download/nemo-megatron-download_gpt3_pile_submission.sh']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "main.py", line 87, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/hydra/main.py", line 90, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "main.py", line 75, in main
    job_id = stage.run()
  File "/root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/nemo_launcher/core/data_stages.py", line 70, in run
    job_id = launcher.launch(command_groups=command_groups)
  File "/root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/nemo_launcher/core/launchers.py", line 58, in launch
    job_id = self._launcher.launch(command_groups)
  File "/root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/nemo_launcher/core/launchers.py", line 96, in launch
    job_id = self._submit_command(submission_file_path)
  File "/root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/nemo_launcher/core/launchers.py", line 393, in _submit_command
    output = job_utils.CommandFunction(command_list, verbose=False)()  # explicit errors
  File "/root/container_tools/NeMo-Megatron-Launcher/launcher_scripts/nemo_launcher/utils/job_utils.py", line 124, in __call__
    raise OSError(stderr) from subprocess_error
OSError: sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification

Thanks
Aaron

@ethanhe42
Member

If your Slurm cluster is not configured with generic resources (GRES), try setting gpus_per_node: null in the cluster config:
https://github.com/NVIDIA/NeMo-Megatron-Launcher/blob/master/launcher_scripts/conf/cluster/bcm.yaml#L5
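
Before editing the config, it may help to confirm whether the node advertises any GRES at all. The following is a minimal sketch, not part of the launcher: it assumes that `scontrol show node` prints a `Gres=` field per node and that an unconfigured node shows `Gres=(null)` or an empty value. The function names `node_has_gres` and `query_slurm` are hypothetical helpers introduced here for illustration.

```python
import re
import subprocess


def node_has_gres(scontrol_output: str) -> bool:
    """Return True if any node record lists a non-empty Gres= value.

    Assumption: unconfigured nodes report "Gres=(null)" or an empty "Gres=".
    """
    for value in re.findall(r"\bGres=(\S+)", scontrol_output):
        if value != "(null)":
            return True
    return False


def query_slurm() -> str:
    """Run `scontrol show node` and return its stdout (requires Slurm on PATH)."""
    result = subprocess.run(
        ["scontrol", "show", "node"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a machine with Slurm installed, `node_has_gres(query_slurm())` returning `False` would be consistent with the error above, since a GPU request in the generated submission script presumably cannot be matched against a node without GRES. In that case, setting gpus_per_node: null as suggested is the likely fix.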
