Bug: Reapers get stuck #893
Comments
The process is in sleep state. I'd like to find out what it's waiting for, but the debugging tools I need (strace, lsof, netstat, etc.) aren't available in the pod. If I can get access to the node, I'll try to debug further. Checking with Aroosha for access.
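In the meantime, /proc already gives a rough per-thread picture without strace or lsof. A minimal sketch (it only assumes the reaper's PID is known, e.g. from `ps`, and that /proc is readable from the pod or the node; nothing here is Rucio-specific):

```python
#!/usr/bin/env python3
"""Rough per-thread view of a sleeping process using only /proc.

Sketch only: pass the reaper's PID on the command line.
"""
import sys
from pathlib import Path


def thread_info(pid: int):
    """Yield (tid, state, wchan) for every thread of the given process."""
    tasks = sorted(Path(f"/proc/{pid}/task").iterdir(), key=lambda p: int(p.name))
    for task in tasks:
        state = "?"
        for line in (task / "status").read_text().splitlines():
            if line.startswith("State:"):
                state = line.split(":", 1)[1].strip()
                break
        # wchan names the kernel function the thread is sleeping in, if exposed.
        wchan = (task / "wchan").read_text().strip() or "-"
        yield task.name, state, wchan


if __name__ == "__main__":
    pid = int(sys.argv[1])
    for tid, state, wchan in thread_info(pid):
        print(f"tid={tid:>8}  state={state:<24}  wchan={wchan}")
```

`State:` at least distinguishes interruptible sleep (S) from uninterruptible sleep (D), and `wchan` hints at whether a thread is parked in a futex/poll wait.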
I noticed this change in the reaper recently. Perhaps it's related:
@haozturk About debugging, did you try `kubectl debug` (https://kubernetes.io/docs/reference/kubectl/generated/kubectl_debug/)?
I managed to get access to the node. I see that all the threads of the process are sleeping:
The stack trace is as follows:
I'm trying to find out whether a thread is holding the lock and whether that thread is also sleeping. If that's the case, it suggests a deadlock.
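To check that from inside the interpreter, something along these lines could be wired into the daemon. This is only a sketch using the standard library; the SIGUSR1 wiring is my assumption, not something the reaper currently does:

```python
import signal
import sys
import threading
import traceback


def dump_thread_stacks(signum, frame):
    """Print the Python stack of every live thread to stderr.

    If the thread that currently holds the lock is itself parked in an
    acquire() or a socket/DB wait, its stack will show it, which is the
    deadlock scenario described above.
    """
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, stack in sys._current_frames().items():
        print(f"--- thread {names.get(ident, '?')} (ident={ident}) ---",
              file=sys.stderr)
        traceback.print_stack(stack, file=sys.stderr)


# Illustrative wiring: dump on demand with `kill -USR1 <reaper pid>`
# from a debug shell, without stopping the process.
signal.signal(signal.SIGUSR1, dump_thread_stacks)
```

Alternatively, `py-spy dump --pid <PID>` should give the same per-thread picture without modifying the code, assuming py-spy can be installed in a debug container on the node.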
The patch didn't fix it. Reopening the issue:
Restarting this pod, since T2_CH_CERN occupancy is at critical levels. There are still other reaper pods that are stuck, and we can use them for debugging.
Bug Description
We realized that the reapers are not deleting much on various disks and tapes. Looking at the logs of some reaper pods, we see that the latest log entry is from hours, and in some cases days, ago. For instance, the tier0 reaper was stuck for a week. Panos restarted it, and it has looked stale again for 12 hours:
The JINR reaper I deployed yesterday has also been stale for 10 hours now:
Regular reapers have similar behavior:
The time at which I'm writing this issue is:
2025-02-19T09:16:44.991Z
Reproduction Steps
No response
Expected Behavior
Reapers should keep deleting as long as there are replicas to delete, which is currently the case.
Possible Solution
No response
Related Issues
@ericvaandering @Panos512 @eachristgr fyi