Description
When using dyno gputrace to profile PyTorch training jobs, the tool only successfully profiles CPU-based launcher processes but fails to capture the actual GPU training processes that are consuming GPU resources.
Current Behavior
- Successfully profiled: PIDs 6, 90, 355 (PyFlyte executors and Accelerate launcher)
- Failed to profile: PIDs 389, 390 (actual GPU training processes running training.py)
Expected Behavior
dyno gputrace should successfully profile all matched processes, including GPU-bound PyTorch training processes.
Steps to Reproduce
# Start distributed training job on H100 GPUs
accelerate launch --config_file accelerate_fsdp.conf ... training.py
# Start dynolog
sudo dynolog --flagfile=/etc/dynolog.gflags &
dyno gputrace \
--log-file /shared/user/profiling/dynolog/4/libkeneto.pt.json \
--profile-memory \
--record-shapes \
--with-stacks \
--with-flops \
--with-modules \
--duration-ms 10000 \
--process-limit 5
ACTIVITIES_LOG_FILE=/shared/user/profiling/dynolog/4/libkeneto.pt.json
PROFILE_START_TIME=0
ACTIVITIES_DURATION_MSECS=10000
PROFILE_REPORT_INPUT_SHAPES=true
PROFILE_PROFILE_MEMORY=true
PROFILE_WITH_STACK=true
PROFILE_WITH_FLOPS=true
PROFILE_WITH_MODULES=true
response length = 165
response = {"activityProfilersBusy":0,"activityProfilersTriggered":[6,90,355,389,390],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[6,90,355,389,390]}
Matched 5 processes
Trace output files will be written to:
/shared/user/profiling/dynolog/4/libkeneto.pt_6.json
/shared/user/profiling/dynolog/4/libkeneto.pt_90.json
/shared/user/profiling/dynolog/4/libkeneto.pt_355.json
/shared/user/profiling/dynolog/4/libkeneto.pt_389.json
/shared/user/profiling/dynolog/4/libkeneto.pt_390.json
However, after some time, only 3 trace files have been written, all of them for the uninteresting CPU launcher processes:
jobuser [ /shared/user/profiling/dynolog/4 ]$ ls -al
total 1810
drwxrwxr-x 2 jobuser jobuser 4096 Aug 11 05:56 .
drwxrwxr-x 6 jobuser jobuser 4096 Aug 11 05:54 ..
-rw-r--r-- 1 jobuser jobuser 1805724 Aug 11 05:56 libkeneto.pt_355.json
-rw-r--r-- 1 jobuser jobuser 12166 Aug 11 05:56 libkeneto.pt_6.json
-rw-r--r-- 1 jobuser jobuser 15462 Aug 11 05:56 libkeneto.pt_90.json
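To make the gap easier to check on repeated runs, here is a minimal sketch that compares the PIDs listed under activityProfilersTriggered in the response against the trace files actually on disk. The helper names (expected_trace_files, missing_traces) are hypothetical, and the <stem>_<pid>.json naming is only inferred from the paths printed above, not taken from the dynolog source.

```python
import json
from pathlib import Path


def expected_trace_files(log_file: str, pids: list[int]) -> dict[int, Path]:
    """Map each PID to its per-process trace path.

    Assumption: the naming follows the paths printed by dyno gputrace above,
    i.e. "<stem>_<pid><suffix>" next to the requested log file.
    """
    base = Path(log_file)
    return {pid: base.with_name(f"{base.stem}_{pid}{base.suffix}") for pid in pids}


def missing_traces(response: str, log_file: str) -> list[int]:
    """Return the triggered PIDs whose trace file does not exist yet."""
    triggered = json.loads(response)["activityProfilersTriggered"]
    expected = expected_trace_files(log_file, triggered)
    return [pid for pid, path in expected.items() if not path.exists()]


if __name__ == "__main__":
    # Response string and log path copied from the run above.
    response = (
        '{"activityProfilersBusy":0,'
        '"activityProfilersTriggered":[6,90,355,389,390],'
        '"eventProfilersBusy":0,"eventProfilersTriggered":[],'
        '"processesMatched":[6,90,355,389,390]}'
    )
    log_file = "/shared/user/profiling/dynolog/4/libkeneto.pt.json"
    print("PIDs without a trace file:", missing_traces(response, log_file))
```

Run on the node after the profiling window has elapsed, this should list PIDs 389 and 390 for the directory listing shown above.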
Process Details
- PID 6: pyflyte-fast-execute (CPU launcher) ✅ Profiled
- PID 90: pyflyte-execute (CPU launcher) ✅ Profiled
- PID 355: accelerate launch (CPU launcher) ✅ Profiled
- PID 389: training.py (GPU training process) ❌ Not profiled
- PID 390: training.py (GPU training process) ❌ Not profiled
Any ideas why this is not profiling the actual training processes?