multi GPU error #100
My CUDA version is 11.7.
Downgrading Python to 3.8.5 still produces a similar error.
code:
This out-of-range problem does indeed come from an incorrect weight configuration: the [PAD] special token introduces an extra token id. However, the problem only shows up with my custom data, which still leaves it unexplained.
Hi, what does your custom dataset look like? Is the current problem that token ids go out of range when you use the custom dataset? Does removing [PAD] solve it for you?
The custom data is ordinary multivariate time-series data. The current problem is that the weights I downloaded from ModelScope are missing the eos token, which makes training on the custom data fail, yet ETTh2 trains without issue. After downloading the weights from the address provided in your project, training works normally. I suspect an out-of-range index: printing the minimum and maximum token ids gives 0 and 32000 (so 32001 distinct ids), while the vocabulary only has 32000 entries. But if it were an out-of-range index it should fail in both cases, and ETTh2 does not fail, so I'm confused. I haven't tried removing the [PAD] token yet; I'll try it in a follow-up experiment and report back.
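For context, a minimal sketch of the check being described here, assuming a Hugging Face LLaMA checkpoint (the local path and the decision to add [PAD] below are illustrative assumptions, not the project's actual configuration): compare the tokenizer's size against the number of rows in the embedding table, and resize the embeddings if [PAD] was added on top of the base 32000-entry vocabulary.

```python
# Sketch: verify that every token id fits inside the LLM's embedding table.
# Assumes a Hugging Face LLaMA checkpoint; the path below is a placeholder.
import torch
from transformers import LlamaModel, LlamaTokenizer

ckpt = "path/to/llama-7b"  # placeholder, not the weight address discussed above
tokenizer = LlamaTokenizer.from_pretrained(ckpt)
model = LlamaModel.from_pretrained(ckpt, torch_dtype=torch.bfloat16)

# Adding [PAD] as a new special token assigns it id 32000, one past the last
# valid row of a 32000-entry embedding table (valid ids are 0..31999).
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

num_embeddings = model.get_input_embeddings().num_embeddings
print("tokenizer size:", len(tokenizer), "embedding rows:", num_embeddings)

if len(tokenizer) > num_embeddings:
    # Either grow the embedding table to match the tokenizer ...
    model.resize_token_embeddings(len(tokenizer))
    # ... or skip adding [PAD] and reuse an existing token (e.g. eos) for padding.
```

If the ids printed for a batch of prompts stay strictly below the number of embedding rows, the `srcIndex < srcSelectDimSize` assert quoted later in this thread is unlikely to come from the tokenizer.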
Hi, I ran it on both A100 and V100 and got the same error as in your screenshot. I'm also using the ModelScope LLaMA weights. Where is the "weight address provided in your project" you mentioned? I couldn't find it and would like to try it. Thanks.
1 similar comment
You can find the required weights via the Hugging Face address provided in the author's source code; the ModelScope weights do exhibit the problem described above.
Which source file contains the "Hugging Face address provided in the author's source code"? I couldn't find it just now. Could you share the link? Thanks.
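For reference, loading the backbone from the Hugging Face Hub rather than ModelScope generally follows the pattern below. The repo id is only an illustrative placeholder, since the exact address referenced in the project's source code is not quoted in this thread.

```python
# Sketch: load a LLaMA checkpoint from the Hugging Face Hub.
# "huggyllama/llama-7b" is an illustrative placeholder repo id, not necessarily
# the address used by the project.
from transformers import LlamaConfig, LlamaModel, LlamaTokenizer

repo_id = "huggyllama/llama-7b"  # placeholder

config = LlamaConfig.from_pretrained(repo_id)
tokenizer = LlamaTokenizer.from_pretrained(repo_id)
llm = LlamaModel.from_pretrained(repo_id, config=config)

# Quick sanity check that the checkpoint ships a usable eos token and that the
# vocabulary size matches the embedding table.
print("vocab_size:", config.vocab_size, "eos_token:", tokenizer.eos_token)
print("embedding rows:", llm.get_input_embeddings().num_embeddings)
```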
I tried running the LLaMA model on two A6000 (96 GB) cards and on two GV100 cards, but CUDA raises an error.
Single-card BERT runs fine, but as soon as I switch to two cards the error appears in the source_embeddings forward pass:
source_embeddings = self.mapping_layer(self.word_embeddings.permute(1, 0)).permute(1, 0)
The error output is below (a minimal debugging sketch follows the traceback). Has anyone run into a similar situation? Any help would be appreciated!
[2024-06-01 22:11:10,870] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_machines` was set to a value of `1`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2024-06-01 22:11:13,842] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-01 22:11:14,481] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-01 22:11:15,874] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-01 22:11:16,787] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-01 22:11:16,787] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Loading checkpoint shards: 100%|████████████████████████████████████| 33/33 [00:13<00:00, 2.48it/s]
Loading checkpoint shards: 100%|████████████████████████████████████| 33/33 [00:13<00:00, 2.45it/s]
[2024-06-01 22:12:03,981] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.0, git-hash=unknown, git-branch=unknown
[2024-06-01 22:12:15,085] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-06-01 22:12:15,086] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-06-01 22:12:15,086] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-06-01 22:12:15,087] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = Adam
[2024-06-01 22:12:15,087] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=Adam type=<class 'torch.optim.adam.Adam'>
[2024-06-01 22:12:15,087] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-06-01 22:12:15,087] [INFO] [stage_1_and_2.py:143:__init__] Reduce bucket size 200000000
[2024-06-01 22:12:15,087] [INFO] [stage_1_and_2.py:144:__init__] Allgather bucket size 200000000
[2024-06-01 22:12:15,087] [INFO] [stage_1_and_2.py:145:__init__] CPU Offload: False
[2024-06-01 22:12:15,087] [INFO] [stage_1_and_2.py:146:__init__] Round robin gradient partitioning: False
0it [00:00, ?it/s][2024-06-01 22:12:15,837] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
[2024-06-01 22:12:15,837] [INFO] [utils.py:792:see_memory_usage] MA 12.51 GB Max_MA 12.55 GB CA 12.55 GB Max_CA 13 GB
[2024-06-01 22:12:15,838] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.52 GB, percent = 1.9%
[2024-06-01 22:12:15,934] [INFO] [utils.py:791:see_memory_usage] After initializing optimizer states
[2024-06-01 22:12:15,935] [INFO] [utils.py:792:see_memory_usage] MA 12.68 GB Max_MA 12.94 GB CA 12.98 GB Max_CA 13 GB
[2024-06-01 22:12:15,935] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.52 GB, percent = 1.9%
[2024-06-01 22:12:15,935] [INFO] [stage_1_and_2.py:533:__init__] optimizer state initialized
[2024-06-01 22:12:16,027] [INFO] [utils.py:791:see_memory_usage] After initializing ZeRO optimizer
[2024-06-01 22:12:16,027] [INFO] [utils.py:792:see_memory_usage] MA 12.68 GB Max_MA 12.68 GB CA 12.98 GB Max_CA 13 GB
[2024-06-01 22:12:16,027] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.52 GB, percent = 1.9%
[2024-06-01 22:12:16,028] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = Adam
[2024-06-01 22:12:16,028] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-06-01 22:12:16,028] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-06-01 22:12:16,028] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0003999999999999993], mom=[(0.95, 0.999)]
[2024-06-01 22:12:16,028] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] amp_enabled .................. False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] amp_params ................... False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] bfloat16_enabled ............. True
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7bd3dbb9f6d0>
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] communication_data_type ...... None
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] dataloader_drop_last ......... False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] disable_allgather ............ False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] dump_state ................... False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] elasticity_enabled ........... False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] fp16_auto_cast ............... None
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] fp16_enabled ................. False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] global_rank .................. 0
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] grad_accum_dtype ............. None
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] gradient_accumulation_steps .. 1
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] gradient_clipping ............ 0.0
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] graph_harvesting ............. False
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-06-01 22:12:16,029] [INFO] [config.py:988:print] initial_dynamic_scale ........ 1
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] load_universal_checkpoint .... False
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] loss_scale ................... 1.0
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] memory_breakdown ............. False
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] mics_shard_size .............. -1
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] optimizer_name ............... None
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] optimizer_params ............. None
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] pld_enabled .................. False
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] pld_params ................... False
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] prescale_gradients ........... False
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] scheduler_name ............... None
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] scheduler_params ............. None
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] sparse_attention ............. None
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] steps_per_print .............. inf
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] train_batch_size ............. 48
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 24
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] use_node_local_storage ....... False
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] weight_quantization_config ... None
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] world_size ................... 2
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] zero_allow_untested_optimizer True
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=200000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=200000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] zero_enabled ................. True
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
[2024-06-01 22:12:16,030] [INFO] [config.py:988:print] zero_optimization_stage ...... 2
[2024-06-01 22:12:16,030] [INFO] [config.py:974:print_user_config] json = {
"bf16": {
"enabled": true,
"auto_cast": true
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 2.000000e+08,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2.000000e+08,
"contiguous_gradients": true,
"sub_group_size": 1.000000e+09
},
"gradient_accumulation_steps": 1,
"train_batch_size": 48,
"train_micro_batch_size_per_gpu": 24,
"steps_per_print": inf,
"wall_clock_breakdown": false,
"fp16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
0it [00:00, ?it/s]
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [183,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [183,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [221,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
0it [00:00, ?it/s]
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [523,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [523,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
0it [00:00, ?it/s]
Traceback (most recent call last):
File "/media/lenovo/DATA/zth/Time-LLM-main/run_main.py", line 211, in
outputs = model(batch_x, batch_x_mark, dec_inp, batch_y_mark)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1842, in forward
loss = self.module(*inputs, **kwargs)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/media/lenovo/DATA/zth/Time-LLM-main/models/TimeLLM.py", line 197, in forward
dec_out = self.forecast(x_enc, x_mark_enc, x_dec, x_mark_dec)
File "/media/lenovo/DATA/zth/Time-LLM-main/models/TimeLLM.py", line 238, in forecast
source_embeddings = self.mapping_layer(self.word_embeddings.permute(1, 0)).permute(1, 0)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Traceback (most recent call last):
File "/media/lenovo/DATA/zth/Time-LLM-main/run_main.py", line 211, in
outputs = model(batch_x, batch_x_mark, dec_inp, batch_y_mark)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1842, in forward
loss = self.module(*inputs, **kwargs)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/media/lenovo/DATA/zth/Time-LLM-main/models/TimeLLM.py", line 197, in forward
dec_out = self.forecast(x_enc, x_mark_enc, x_dec, x_mark_dec)
File "/media/lenovo/DATA/zth/Time-LLM-main/models/TimeLLM.py", line 238, in forecast
source_embeddings = self.mapping_layer(self.word_embeddings.permute(1, 0)).permute(1, 0)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x70cf3ddaf4d7 in /home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x70cf3dd7936b in /home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x70cf52d42b58 in /home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x70cee4777450 in /home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x70cee477aa28 in /home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x227 (0x70cee477bf77 in /home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdc253 (0x70cf3d2dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x94ac3 (0x70cf5d494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x126850 (0x70cf5d526850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7bd4877af4d7 in /home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7bd48777936b in /home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7bd491003b58 in /home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7bd419177450 in /home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7bd41917aa28 in /home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x227 (0x7bd41917bf77 in /home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdc253 (0x7bd471cdc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x94ac3 (0x7bd491e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x126850 (0x7bd491f26850 in /lib/x86_64-linux-gnu/libc.so.6)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 5552) of binary: /home/lenovo/anaconda3/envs/timellm/bin/python
Traceback (most recent call last):
File "/home/lenovo/anaconda3/envs/timellm/bin/accelerate", line 8, in
sys.exit(main())
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 932, in launch_command
multi_gpu_launcher(args)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher
distrib_run.run(args)
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lenovo/anaconda3/envs/timellm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run_main.py FAILED
Failures:
[1]:
time : 2024-06-01_22:12:17
host : lenovo-ThinkStation-P7
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 5553)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 5553
Root Cause (first observed failure):
[0]:
time : 2024-06-01_22:12:17
host : lenovo-ThinkStation-P7
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 5552)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 5552
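As mentioned in the issue description above, here is a minimal debugging sketch for this kind of failure: make CUDA kernel launches synchronous (the log itself suggests CUDA_LAUNCH_BLOCKING=1) and validate the prompt token ids before they reach the embedding lookup. The helper and variable names below are assumptions for illustration, not code from the repository.

```python
# Sketch: turn the asynchronous device-side assert into a readable Python error.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch

def check_token_ids(prompt_ids: torch.Tensor, embedding: torch.nn.Embedding) -> None:
    """Fail with a clear message instead of `srcIndex < srcSelectDimSize`."""
    min_id, max_id = int(prompt_ids.min()), int(prompt_ids.max())
    if min_id < 0 or max_id >= embedding.num_embeddings:
        raise ValueError(
            f"token ids span [{min_id}, {max_id}] but the embedding table "
            f"only has {embedding.num_embeddings} rows"
        )

# Example usage (hypothetical placement inside the model's forecast(), just
# before the prompt embedding lookup that precedes the failing mapping_layer call):
# check_token_ids(prompt_ids, self.llm_model.get_input_embeddings())
```

With synchronous launches the traceback points at the kernel that actually fails, rather than the later cuBLAS call that typically only surfaces the earlier assert.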