
Bug: Reapers get stuck #893

Open
haozturk opened this issue Feb 19, 2025 · 4 comments · Fixed by dmwm/rucio-flux#372

haozturk commented Feb 19, 2025

Bug Description

We noticed that the reapers are not deleting much on various disk and tape RSEs. Looking at the logs of some reaper pods, we see that the latest log entry is from hours, and in some cases days, ago. For instance, the tier0 reaper was stuck for a week. Panos restarted it, and it has now been stale again for 12 hours:

$ k logs tier0-reaper-daemons-reaper-55b6c49d77-jmvcq     | tail -n 1
{"process": {"pid": 9}, "@timestamp": "2025-02-18T21:25:01.041Z", "log": {"logger": "root", "level": "WARNING"}, "message": "[30/34]: Deletion NOTFOUND of cms:/store/backfill/1/express/Tier0_REPLAY_2025/StreamExpress/ALCAPROMPT/PromptCalibProdSiPixel-Express-v10140233/000/382/686/00000/690604bb-f146-40d1-bb20-b94920bda107.root as davs://eoscms.cern.ch:443/eos/cms/tier0/store/backfill/1/express/Tier0_REPLAY_2025/StreamExpress/ALCAPROMPT/PromptCalibProdSiPixel-Express-v10140233/000/382/686/00000/690604bb-f146-40d1-bb20-b94920bda107.root on T0_CH_CERN_Disk in 0.35 seconds", "error": {"type": null, "message": null, "stack_trace": null}}

The JINR reaper I deployed yesterday has also been stale for 10 hours now:

$ k logs daemons-reaper-jinr-667d5c9bb5-xmfpn        | tail -n 1
{"log": {"level": "WARNING", "logger": "root"}, "@timestamp": "2025-02-19T00:10:58.822Z", "message": "[59/60]: Deletion NOTFOUND of cms:/store/hidata/HIRun2018A/HIMinimumBias3/MINIAOD/PbPb18_MiniAODv1-v1/230000/bbd754be-46a8-4b91-b023-8fdbcbe9e4d3.root as davs://se-wbdv-mss.jinr-t1.ru:2880/pnfs/jinr-t1.ru/data/cms/store/hidata/HIRun2018A/HIMinimumBias3/MINIAOD/PbPb18_MiniAODv1-v1/230000/bbd754be-46a8-4b91-b023-8fdbcbe9e4d3.root on T1_RU_JINR_Tape in 0.49 seconds", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}}

The regular reapers show similar behavior:

$ k logs daemons-reaper-fc985b484-xxdcd    | tail -n 1
{"log": {"logger": "root", "level": "WARNING"}, "message": "[51/60]: Deletion NOTFOUND of cms:/store/mc/Run3Summer23DRPremix/DYto2L-4Jets_MLL-4to50_HT-70to100_TuneCP5_13p6TeV_madgraphMLM-pythia8/AODSIM/130X_mcRun3_2023_realistic_v15-v3/2550000/00955e4c-786e-471f-a15d-374ae7cb3181.root as davs://xrootd-vanderbilt.sites.opensciencegrid.org:1094/store/mc/Run3Summer23DRPremix/DYto2L-4Jets_MLL-4to50_HT-70to100_TuneCP5_13p6TeV_madgraphMLM-pythia8/AODSIM/130X_mcRun3_2023_realistic_v15-v3/2550000/00955e4c-786e-471f-a15d-374ae7cb3181.root on T2_US_Vanderbilt in 1.44 seconds", "error": {"type": null, "message": null, "stack_trace": null}, "process": {"pid": 9}, "@timestamp": "2025-02-19T00:08:05.325Z"}

The time at which I'm writing this issue is 2025-02-19T09:16:44.991Z.
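
For reference, this staleness check can be scripted by comparing each reaper pod's last log timestamp with the current time; a rough sketch along these lines (using the same k alias for kubectl and relying on the reaper pods having "reaper" in their name):

for pod in $(k get pods -o name | grep reaper); do
  # last JSON log line of each pod; "@timestamp" is the field shown in the logs above
  last=$(k logs "$pod" --tail=1 2>/dev/null | jq -r '."@timestamp" // empty')
  echo "$pod  last log: ${last:-unknown}  now: $(date -u +%FT%TZ)"
done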

Reproduction Steps

No response

Expected Behavior

Reapers should keep deleting as long as there are replicas to delete, which is the case at the moment.

Possible Solution

No response

Related Issues

@ericvaandering @Panos512 @eachristgr fyi

haozturk commented Feb 19, 2025

The process is in a sleep state. I'd like to find out what it's waiting for, but the debugging tools I need (strace, lsof, netstat, etc.) aren't available in the pod. If I can get access to the node, I'll try to debug further. Checking with Aroosha for access.

$ k exec -it daemons-reaper-jinr-6cdd8595dc-88jz6  -- /bin/bash
[root@daemons-reaper-jinr-6cdd8595dc-88jz6 tmp]# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4192  3116 ?        Ss   09:25   0:00 /bin/bash /cms-entrypoint.sh
root           9  7.8  3.2 18289296 491468 ?     Sl   09:25   3:40 /usr/bin/python3 /usr/local/bin/rucio-reaper --threads 4 --include-rses=T1_RU_JINR_Tape --chunk-size 4000 --sleep-t
memcach+      18  0.0  0.0 420200 10540 ?        Sl   09:25   0:00 /usr/bin/memcached -u memcached -p 11211 -m 64 -c 1024
root          68  0.0  0.0   4836  3816 pts/0    Ss   10:10   0:00 /bin/bash
root          87  0.0  0.0   7552  3196 pts/0    R+   10:11   0:00 ps aux
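
Even without those tools, procfs inside the container gives a rough idea of what the process is doing; for example, something along these lines:

$ grep -E 'State|Threads' /proc/9/status   # overall state and thread count
$ cat /proc/9/wchan; echo                  # kernel function the main thread is sleeping in
$ ls /proc/9/fd | wc -l                    # number of open file descriptors (poor man's lsof)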

I noticed this recent change in the reaper. Perhaps it's related:

@dciangot

@haozturk about debugging, did you try using kubectl debug to attach a container to the pod's Linux namespaces?

https://kubernetes.io/docs/reference/kubectl/generated/kubectl_debug/
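
For example, something along these lines (pod and container names taken from this issue; the debug image is just one option):

$ kubectl debug -it daemons-reaper-jinr-6cdd8595dc-88jz6 \
    --image=nicolaka/netshoot --target=rucio-daemons -- /bin/bash

The ephemeral container shares the target container's process namespace, so the reaper is visible as PID 9 and strace/lsof from the debug image can be pointed at it (attaching may additionally need the SYS_PTRACE capability).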

haozturk commented Feb 19, 2025

I managed to get access to the node. I see that all threads of the process are sleeping:

[core@cmsrucioprod2-yfehlu7mkan2-node-0 ~]$ ps aux | grep reaper
root          57  0.0  0.0      0     0 ?        S     2023   0:00 [oom_reaper]
root     3257266  2.4  3.1 18290572 473700 ?     Sl   09:18   3:57 /usr/bin/python3 /usr/local/bin/rucio-reaper --threads 4 --include-rses=T0_CH_CERN_Disk|T2_CH_CERN --chunk-size 4000 --sleep-time 120
root     3257794 10.2  3.9 18342284 589156 ?     Sl   09:19  16:40 /usr/bin/python3 /usr/local/bin/rucio-reaper --threads 4 --include-rses=reaper=True --chunk-size 4000 --sleep-time 120
root     3264749  2.6  3.2 18289296 483436 ?     Sl   09:25   4:04 /usr/bin/python3 /usr/local/bin/rucio-reaper --threads 4 --include-rses=T1_RU_JINR_Tape --chunk-size 4000 --sleep-time 120
core     3410510  0.0  0.0   3428  1508 pts/1    S+   12:01   0:00 grep --color=auto reaper
$ cat /proc/3264749/task/*/status | grep -E 'Pid|State'
State:	S (sleeping)
Pid:	3264749
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3264848
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3264849
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3264853
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3264854
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3264855
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3264856
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3265467
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3265468
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3265469
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3265470
PPid:	3264735
TracerPid:	0
State:	S (sleeping)
Pid:	3265471
PPid:	3264735
TracerPid:	0

The kernel stack of the main thread is as follows:

$ sudo cat /proc/3264749/stack
[<0>] futex_wait_queue+0x70/0xd0
[<0>] futex_wait+0x15a/0x220
[<0>] do_futex+0x106/0x1b0
[<0>] __x64_sys_futex+0x63/0x190
[<0>] do_syscall_64+0x5b/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

I'm trying to find out whether another thread is holding the lock and whether it is also sleeping. If that is the case, it would suggest a deadlock.
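
Two things that should help here (a sketch; py-spy would need to be dropped onto the node and run as root):

# kernel stacks of every thread, not just the main one
$ for t in /proc/3264749/task/*; do echo "== $t"; sudo cat "$t/stack"; done

# Python-level stacks, to see where each thread is blocked (e.g. on a lock acquire)
$ sudo py-spy dump --pid 3264749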

haozturk commented Feb 20, 2025

The patch didn't fix it. Reopening the issue:

[haozturk@lxplus919 ~]$ k logs tier0-reaper-daemons-reaper-64584976f8-tv27p   | tail -n 1
Defaulted container "rucio-daemons" out of: rucio-daemons, debugger-9wzws (ephem)
{"log": {"level": "DEBUG", "logger": "root"}, "@timestamp": "2025-02-19T18:51:13.306Z", "process": {"pid": 9}, "message": "[9/12]: gfal.Default: closing protocol connection", "error": {"type": null, "message": null, "stack_trace": null}}
[haozturk@lxplus919 ~]$ date
Thu Feb 20 12:46:38 PM CET 2025

Restarting this pod, since T2_CH_CERN occupancy is at a critical level. There are still other reaper pods that are stuck, and we can use those for debugging.
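
For the record, the restart itself is just a pod deletion so that the Deployment recreates it, e.g.:

$ k delete pod tier0-reaper-daemons-reaper-64584976f8-tv27p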

haozturk reopened this Feb 20, 2025