
Bug: Reapers get stuck #893

Open
haozturk opened this issue Feb 19, 2025 · 4 comments · Fixed by dmwm/rucio-flux#372

haozturk commented Feb 19, 2025

Bug Description

We noticed that the reapers are not deleting much on various disk and tape RSEs. Looking at the logs of some reaper pods, we see that the latest log entry is from hours, and in some cases days, ago. For instance, the tier0 reaper was stuck for a week. Panos restarted it, and it has now been stale again for 12 hours:

$ k logs tier0-reaper-daemons-reaper-55b6c49d77-jmvcq     | tail -n 1
{"process": {"pid": 9}, "@timestamp": "2025-02-18T21:25:01.041Z", "log": {"logger": "root", "level": "WARNING"}, "message": "[30/34]: Deletion NOTFOUND of cms:/store/backfill/1/express/Tier0_REPLAY_2025/StreamExpress/ALCAPROMPT/PromptCalibProdSiPixel-Express-v10140233/000/382/686/00000/690604bb-f146-40d1-bb20-b94920bda107.root as davs://eoscms.cern.ch:443/eos/cms/tier0/store/backfill/1/express/Tier0_REPLAY_2025/StreamExpress/ALCAPROMPT/PromptCalibProdSiPixel-Express-v10140233/000/382/686/00000/690604bb-f146-40d1-bb20-b94920bda107.root on T0_CH_CERN_Disk in 0.35 seconds", "error": {"type": null, "message": null, "stack_trace": null}}

The JINR reaper I deployed yesterday has also been stale for 10 hours now:

$ k logs daemons-reaper-jinr-667d5c9bb5-xmfpn        | tail -n 1
{"log": {"level": "WARNING", "logger": "root"}, "@timestamp": "2025-02-19T00:10:58.822Z", "message": "[59/60]: Deletion NOTFOUND of cms:/store/hidata/HIRun2018A/HIMinimumBias3/MINIAOD/PbPb18_MiniAODv1-v1/230000/bbd754be-46a8-4b91-b023-8fdbcbe9e4d3.root as davs://se-wbdv-mss.jinr-t1.ru:2880/pnfs/jinr-t1.ru/data/cms/store/hidata/HIRun2018A/HIMinimumBias3/MINIAOD/PbPb18_MiniAODv1-v1/230000/bbd754be-46a8-4b91-b023-8fdbcbe9e4d3.root on T1_RU_JINR_Tape in 0.49 seconds", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}}

The regular reapers show similar behavior:

$ k logs daemons-reaper-fc985b484-xxdcd    | tail -n 1
{"log": {"logger": "root", "level": "WARNING"}, "message": "[51/60]: Deletion NOTFOUND of cms:/store/mc/Run3Summer23DRPremix/DYto2L-4Jets_MLL-4to50_HT-70to100_TuneCP5_13p6TeV_madgraphMLM-pythia8/AODSIM/130X_mcRun3_2023_realistic_v15-v3/2550000/00955e4c-786e-471f-a15d-374ae7cb3181.root as davs://xrootd-vanderbilt.sites.opensciencegrid.org:1094/store/mc/Run3Summer23DRPremix/DYto2L-4Jets_MLL-4to50_HT-70to100_TuneCP5_13p6TeV_madgraphMLM-pythia8/AODSIM/130X_mcRun3_2023_realistic_v15-v3/2550000/00955e4c-786e-471f-a15d-374ae7cb3181.root on T2_US_Vanderbilt in 1.44 seconds", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}, "@timestamp": "2025-02-19T00:08:05.325Z"}

The time at which I'm writing this issue is 2025-02-19T09:16:44.991Z.
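
For reference, this staleness check can be scripted by comparing each reaper pod's last log timestamp with the current time; a rough sketch along these lines (using the same k alias for kubectl and relying on the reaper pods having "reaper" in their name):

for pod in $(k get pods -o name | grep reaper); do
  # last JSON log line of each pod; "@timestamp" is the field shown in the logs above
  last=$(k logs "$pod" --tail=1 2>/dev/null | jq -r '."@timestamp" // empty')
  echo "$pod  last log: ${last:-unknown}  now: $(date -u +%FT%TZ)"
done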

Reproduction Steps

No response

Expected Behavior

Reapers should keep deleting as long as there are replicas to delete, which is the case at the moment.

Possible Solution

No response

Related Issues

@ericvaandering @Panos512 @eachristgr fyi

haozturk commented Feb 19, 2025

The process is in a sleep state. I'd like to find out what it's waiting for, but the debugging tools I need (strace, lsof, netstat, etc.) aren't available in the pod. If I can get access to the node, I'll try to debug further. Checking with Aroosha for access.

$ k exec -it daemons-reaper-jinr-6cdd8595dc-88jz6  -- /bin/bash
[root@daemons-reaper-jinr-6cdd8595dc-88jz6 tmp]# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4192  3116 ?        Ss   09:25   0:00 /bin/bash /cms-entrypoint.sh
root           9  7.8  3.2 18289296 491468 ?     Sl   09:25   3:40 /usr/bin/python3 /usr/local/bin/rucio-reaper --threads 4 --include-rses=T1_RU_JINR_Tape --chunk-size 4000 --sleep-t
memcach+      18  0.0  0.0 420200 10540 ?        Sl   09:25   0:00 /usr/bin/memcached -u memcached -p 11211 -m 64 -c 1024
root          68  0.0  0.0   4836  3816 pts/0    Ss   10:10   0:00 /bin/bash
root          87  0.0  0.0   7552  3196 pts/0    R+   10:11   0:00 ps aux
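
Even without those tools, procfs inside the container gives a rough idea of what the process is doing; for example, something along these lines:

$ grep -E 'State|Threads' /proc/9/status   # overall state and thread count
$ cat /proc/9/wchan; echo                  # kernel function the main thread is sleeping in
$ ls /proc/9/fd | wc -l                    # number of open file descriptors (poor man's lsof)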

I noticed this recent change in the reaper. Perhaps it's related:

@dciangot

@haozturk about debugging, did you try using kubectl debug to attach a container to the pod's Linux namespaces?

https://kubernetes.io/docs/reference/kubectl/generated/kubectl_debug/
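
For example, something along these lines (pod and container names taken from this issue; the debug image is just one option):

$ kubectl debug -it daemons-reaper-jinr-6cdd8595dc-88jz6 \
    --image=nicolaka/netshoot --target=rucio-daemons -- /bin/bash

The ephemeral container shares the target container's process namespace, so the reaper is visible as PID 9 and strace/lsof from the debug image can be pointed at it (attaching may additionally need the SYS_PTRACE capability).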

haozturk commented Feb 19, 2025

I managed to get access to the node. I see that all threads of the process are sleeping:

[core@cmsrucioprod2-yfehlu7mkan2-node-0 ~]$ ps aux | grep reaper
root          57  0.0  0.0      0     0 ?        S     2023   0:00 [oom_reaper]
root     3257266  2.4  3.1 18290572 473700 ?     Sl   09:18   3:57 /usr/bin/python3 /usr/local/bin/rucio-reaper --threads 4 --include-rses=T0_CH_CERN_Disk|T2_CH_CERN --chunk-size 4000 --sleep-time 120
root     3257794 10.2  3.9 18342284 589156 ?     Sl   09:19  16:40 /usr/bin/python3 /usr/local/bin/rucio-reaper --threads 4 --include-rses=reaper=True --chunk-size 4000 --sleep-time 120
root     3264749  2.6  3.2 18289296 483436 ?     Sl   09:25   4:04 /usr/bin/python3 /usr/local/bin/rucio-reaper --threads 4 --include-rses=T1_RU_JINR_Tape --chunk-size 4000 --sleep-time 120
core     3410510  0.0  0.0   3428  1508 pts/1    S+   12:01   0:00 grep --color=auto reaper
$ cat /proc/3264749/task/*/status | grep -E 'Pid|State'
State:	S (sleeping)
Pid:	3264749
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3264848
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3264849
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3264853
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3264854
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3264855
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3264856
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3265467
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3265468
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3265469
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3265470
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3265471
PPid:	3264735
TracerPid:	0

The kernel stack of the main thread is as follows:

$ sudo cat /proc/3264749/stack
[<0>] futex_wait_queue+0x70/0xd0
[<0>] futex_wait+0x15a/0x220
[<0>] do_futex+0x106/0x1b0
[<0>] __x64_sys_futex+0x63/0x190
[<0>] do_syscall_64+0x5b/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

I'm trying to find out whether another thread is holding the lock and whether it is also sleeping. If that is the case, it would suggest a deadlock.
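
Two things that should help here (a sketch; py-spy would need to be dropped onto the node and run as root):

# kernel stacks of every thread, not just the main one
$ for t in /proc/3264749/task/*; do echo "== $t"; sudo cat "$t/stack"; done

# Python-level stacks, to see where each thread is blocked (e.g. on a lock acquire)
$ sudo py-spy dump --pid 3264749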

haozturk commented Feb 20, 2025

The patch didn't fix it. Reopening the issue:

[haozturk@lxplus919 ~]$ k logs tier0-reaper-daemons-reaper-64584976f8-tv27p   | tail -n 1
Defaulted container "rucio-daemons" out of: rucio-daemons, debugger-9wzws (ephem)
{"log": {"level": "DEBUG", "logger": "root"}, "@timestamp": "2025-02-19T18:51:13.306Z", "process": {"pid": 9}, "message": "[9/12]: gfal.Default: closing protocol connection", "error": {"type": null, "message": null, "stack_trace": null}}
[haozturk@lxplus919 ~]$ date
Thu Feb 20 12:46:38 PM CET 2025

Restarting this pod, since T2_CH_CERN occupancy is at a critical level. There are still other reaper pods that are stuck, and we can use those for debugging.
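
For the record, the restart itself is just a pod deletion so that the Deployment recreates it, e.g.:

$ k delete pod tier0-reaper-daemons-reaper-64584976f8-tv27p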

haozturk reopened this Feb 20, 2025