Describe the bug
As in the title.
This is the resource graph: the machine has 512 GB of RAM. GPU memory has been fine throughout, holding steady at around 50 GB, but host memory keeps growing. Is there a memory leak somewhere?
Your hardware and system info
CUDA 12.6
Ubuntu 22.04
swift 3.0.0
GPU: A800 80 GB
torch 2.5.1
Additional context
To add to the above, the launch command is:
`source /maindata/data/shared/public/liang.hu/conda/bin/activate
conda activate swift
NNODES=$WORLD_SIZE NODE_RANK=$RANK MASTER_ADDR=$MASTER_ADDR NPROC_PER_NODE=8 MAX_PIXELS=602112 VIDEO_MAX_PIXELS=602112 NFRAMES=8 swift sft --model /maindata/data/shared/public/chunli.peng/ckpt/Qwen2-VL-7B-Instruct/ --train_type full --torch_dtype bfloat16 --per_device_train_batch_size 1 --gradient_accumulation_steps 8 --dataset /maindata/data/shared/public/liang.hu/infer1/train_qwen2vl_sft_swift.jsonl --output_dir /maindata/data/shared/public/liang.hu/infer1/qwen2_sft/test --num_train_epochs 1 --save_strategy 'no' --eval_strategy 'no' --logging_steps 1 --warmup_ratio 0.05 --report_to wandb --gradient_checkpointing true --freeze_vit true --deepspeed zero2 --attn_impl flash_attn`
The data format is:
I seem to be running into this problem too. Did you find a way to solve it?
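Yes, although the fix is crude: just provision more RAM. As long as the run finishes before memory runs out, it does not count as an OOM, if you know what I mean, hahaha.
It looks like Alibaba Cloud can now offer up to 2 TB of memory on a single node, so take a look and provision more.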
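For anyone who wants to confirm the steady host-memory growth reported above before buying more RAM, here is a minimal sketch in Python. It is not part of swift: `read_rss_mb` and `slope_per_step` are hypothetical helper names, the `/proc` read assumes Linux (the reporter is on Ubuntu 22.04), and it measures a single process only, so with `NPROC_PER_NODE=8` each rank and each dataloader worker would need its own sample.

```python
import os


def read_rss_mb(pid="self"):
    """Return the resident set size of a process in MiB (Linux /proc only)."""
    with open(f"/proc/{pid}/statm") as f:
        # The second field of statm is the number of resident pages.
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20


def slope_per_step(samples):
    """Least-squares slope of the samples against their index.

    A slope that stays clearly positive over many logging steps suggests
    that memory is growing rather than just fluctuating.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

Calling `read_rss_mb()` once per logging step, collecting the values in a list, and passing that list to `slope_per_step` gives a rough growth rate in MiB per step. This only shows whether memory grows, not which component is responsible.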