qwen2-vl-7b runs out of host memory (RAM, not GPU memory): memory is never reclaimed #2757

hl0737 · 2024-12-24T13:02:43Z

Describe the bug
What the bug is and how to reproduce it, preferably with screenshots.

As the title says.

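This is the resource usage graph. The machine has 512GB of RAM. GPU memory has been fine the whole time, stable at around 50GB, but host memory keeps growing. Is there a memory leak somewhere?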
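A minimal sketch for confirming the growth from inside a training process, assuming Linux (the report is on Ubuntu 22.04). `rss_bytes` and `RssTracker` are hypothetical helper names used only for illustration; they are not part of swift or torch. The idea is to sample the resident set size every few steps and check whether it keeps climbing while GPU memory stays flat.

```python
import resource
import sys


def rss_bytes() -> int:
    """Return the resident set size of this process in bytes."""
    try:
        # Linux: the VmRSS field of /proc/self/status is reported in kB.
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) * 1024
    except OSError:
        pass
    # Fallback: peak (not current) RSS. ru_maxrss is kB on Linux, bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == "darwin" else peak * 1024


class RssTracker:
    """Hypothetical helper: records RSS per step and reports average growth."""

    def __init__(self) -> None:
        self.samples = []  # list of (step, rss_in_bytes)

    def record(self, step: int) -> None:
        self.samples.append((step, rss_bytes()))

    def growth_per_step(self) -> float:
        """Average RSS growth in bytes per step between first and last sample."""
        if len(self.samples) < 2:
            return 0.0
        (s0, m0), (s1, m1) = self.samples[0], self.samples[-1]
        if s1 == s0:
            return 0.0
        return (m1 - m0) / (s1 - s0)
```

With `NPROC_PER_NODE=8` each rank is a separate process with its own RSS, so the samples would need to be taken (and summed) per rank. A growth rate that stays positive over many steps would point at a host-side leak rather than a one-time allocation.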

Your hardware and system info
Write your system info here, such as CUDA version, OS, GPU model, and torch version.

CUDA 12.6
Ubuntu 22.04
swift 3.0.0
GPU: A800 80GB
torch 2.5.1

Additional context
Add any other context about the problem here.

hl0737 · 2024-12-24T23:00:25Z

For reference, the launch command is:

`source /maindata/data/shared/public/liang.hu/conda/bin/activate
conda activate swift

NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
MASTER_ADDR=$MASTER_ADDR \
NPROC_PER_NODE=8 \
MAX_PIXELS=602112 \
VIDEO_MAX_PIXELS=602112 \
NFRAMES=8 \
swift sft \
    --model /maindata/data/shared/public/chunli.peng/ckpt/Qwen2-VL-7B-Instruct/ \
    --train_type full \
    --torch_dtype bfloat16 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --dataset /maindata/data/shared/public/liang.hu/infer1/train_qwen2vl_sft_swift.jsonl \
    --output_dir /maindata/data/shared/public/liang.hu/infer1/qwen2_sft/test \
    --num_train_epochs 1 \
    --save_strategy 'no' \
    --eval_strategy 'no' \
    --logging_steps 1 \
    --warmup_ratio 0.05 \
    --report_to wandb \
    --gradient_checkpointing true \
    --freeze_vit true \
    --deepspeed zero2 \
    --attn_impl flash_attn`

The data format is:

TimeLessLing · 2024-12-30T18:01:21Z

I seem to be hitting this problem too. Did you find a way to solve it?

hl0737 · 2025-01-03T12:22:46Z

> I seem to be hitting this problem too. Did you find a way to solve it?

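Yes, though the fix is crude: just provision more RAM. As long as the job finishes before memory runs out, it doesn't count as an OOM, if you know what I mean, hahaha.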

It looks like a single Alibaba Cloud node can now be configured with up to 2TB of RAM, so see if you can provision more.
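To judge whether this workaround is enough for a given run, a rough estimate of how many steps fit before host memory is exhausted can help. This is only a back-of-the-envelope sketch that assumes linear growth; `steps_until_oom` is a hypothetical name, and the usage and growth figures below are made-up illustrations, not measurements from this issue (only the 512GB total comes from the report).

```python
def steps_until_oom(total_gb: float, used_gb: float, growth_gb_per_step: float) -> float:
    """Estimate the remaining training steps before host RAM runs out.

    Assumes memory grows linearly by growth_gb_per_step each step.
    Returns infinity if memory is not growing.
    """
    if growth_gb_per_step <= 0:
        return float("inf")
    headroom = max(total_gb - used_gb, 0.0)
    return headroom / growth_gb_per_step


# Illustrative numbers only: 512GB node, 112GB in use, 0.25GB leaked per step.
remaining = steps_until_oom(total_gb=512, used_gb=112, growth_gb_per_step=0.25)
print(remaining)  # 1600.0
```

If the estimate is smaller than the number of steps left in the epoch, more RAM would be needed for the run to finish under this workaround.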
