Description
Hello! Thanks for open-sourcing the code.
I have been trying to fine-tune the DeepSeek-R1-Distill-Qwen-1.5B model on six 24GB 3090 GPUs by running
accelerate launch spin/run_spin.py configs/config.yaml
but the program suddenly stops as soon as training starts, which really confuses me, since I have already set the batch size to 1 for every process (a quick sanity check of those settings is below). Could you advise how I should modify the config? Thank you very much.
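For reference, here is a quick sanity check that the batch settings are actually taking effect, using the usual HF effective-batch-size formulas (num_processes=6 is my assumption, one process per 3090); the numbers match the training log pasted below:

```python
import math

# Values taken from the SPINConfig dump and training log below;
# num_processes=6 is assumed (one process per GPU).
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
num_processes = 6
num_examples = 800
num_epochs = 6

total_batch = per_device_train_batch_size * num_processes * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / total_batch)
print(total_batch)                   # 6   -> matches "Total train batch size ... = 6"
print(steps_per_epoch * num_epochs)  # 804 -> matches "Total optimization steps = 804"
```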
[WARNING|logging.py:329] 2025-02-12 10:42:26,998 >> The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use attn_implementation="flash_attention_2" instead.
[WARNING|modeling_utils.py:1506] 2025-02-12 10:42:26,999 >> You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
[WARNING|modeling_utils.py:1519] 2025-02-12 10:42:26,999 >> You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
[WARNING|modeling_utils.py:1510] 2025-02-12 10:42:27,005 >> Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. No dtype was provided, you should run training or inference using Automatic Mixed-Precision via the with torch.autocast(device_type='torch_device'): decorator.
[2025-02-12 10:42:27,615] [INFO] [partition_parameters.py:347:__exit__] finished initializing model - num_params = 339, num_elems = 1.78B
/data/SPIN/spin/alignment/trainer.py:164: UserWarning: You passed a ref model_id to the SPINTrainer. This will automatically create an AutoModelForCausalLM
[INFO|trainer.py:1721] 2025-02-12 10:42:42,040 >> ***** Running training *****
[INFO|trainer.py:1722] 2025-02-12 10:42:42,040 >> Num examples = 800
[INFO|trainer.py:1723] 2025-02-12 10:42:42,040 >> Num Epochs = 6
[INFO|trainer.py:1724] 2025-02-12 10:42:42,040 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1727] 2025-02-12 10:42:42,040 >> Total train batch size (w. parallel, distributed & accumulation) = 6
[INFO|trainer.py:1728] 2025-02-12 10:42:42,040 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1729] 2025-02-12 10:42:42,040 >> Total optimization steps = 804
[INFO|trainer.py:1730] 2025-02-12 10:42:42,041 >> Number of trainable parameters = 1,777,088,000
0%| | 0/804 [00:00<?, ?it/s]/data1/ /anaconda/envs/SPIN/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
losses: tensor([0.6931], device='cuda:0', grad_fn=)
policy_real_logps: tensor([-726.2833], device='cuda:0', grad_fn=)
policy_generated_logps: tensor([-611.3895], device='cuda:0', grad_fn=)
opponent_real_logps: tensor([-726.2833], device='cuda:0')
opponent_generated_logps: tensor([-611.3895], device='cuda:0')
logits: tensor([0.], device='cuda:0', grad_fn=)
real_rewards: tensor([0.], device='cuda:0')
losses: tensor([0.6931], device='cuda:3', grad_fn=)
generated_rewards: tensor([0.], device='cuda:0')
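On the Flash Attention warnings at the top of the log: they seem to ask for an explicit dtype and the non-deprecated attention flag at load time. Here is a minimal sketch of what I think that would look like (I have not checked where run_spin.py actually builds the model, so the call site and kwargs here are illustrative only):

```python
import torch
from transformers import AutoModelForCausalLM

# Load with an explicit dtype and the non-deprecated flash-attention flag,
# as the warnings suggest (path and kwargs are illustrative, not the exact
# run_spin.py code).
model = AutoModelForCausalLM.from_pretrained(
    "/data/SPIN/DeepSeek-R1-Distill-Qwen-1.5B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# The torch.utils.checkpoint warning in the log would presumably go away by
# passing gradient_checkpointing_kwargs={"use_reentrant": False} through the
# training config, but I have not verified this.
```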
Model parameters
ModelArguments(base_model_revision=None, model_name_or_path='/data/SPIN/DeepSeek-R1-Distill-Qwen-1.5B', model_revision='main', model_code_revision=None, torch_dtype=None, trust_remote_code=False, use_flash_attention_2=True, use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False)
Data parameters
DataArguments(chat_template=None, dataset_mixer={'/data/SPIN/spin/synthetic': 1.0}, dataset_splits=['train', 'test'], max_train_samples=None, max_eval_samples=None, preprocessing_num_workers=12, truncation_side=None)
Training/evaluation parameters
SPINConfig(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
beta=0.1,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=100,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=zephyr-7b-spin,
hub_model_revision=main,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-07,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=outputs/runs/Feb12_10-42-22_ubuntu,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_length=1024,
max_prompt_length=512,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=6.0,
optim=rmsprop,
optim_args=None,
output_dir=outputs/iter0-ckpt,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=outputs,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=epoch,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.0,
)