Describe the bug
In an effort to reduce disk space usage, we have started deleting container-level wma_prod rules created by the agents. While identifying these rules, I found around 12.3 PB worth of Rucio rules locking data on disk.
These numbers cover only container-level rules. The total is very high, but please take it with a grain of salt: I am counting by rules rather than by actual replicas, and there is overlap between the copies locked by the container-level rule and those locked by the block-level rules.
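The caveat about counting rules rather than replicas can be made concrete with a small sketch. It assumes a hypothetical, simplified rule record (DID, RSE, and the blocks it locks with their sizes). This is not the Rucio client's data model and not how the 12.3 PB figure was computed; the dataset and RSE names are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Rule:
    """Hypothetical, minimal view of a Rucio rule (not the Rucio client's data model)."""
    did: str                # container or block name the rule applies to
    rse: str                # RSE holding the locked copy
    blocks: Dict[str, int]  # blocks locked by this rule -> size in bytes


def bytes_by_rule(rules: List[Rule]) -> int:
    """Sum sizes rule by rule; a block locked by several rules is counted several times."""
    return sum(sum(rule.blocks.values()) for rule in rules)


def bytes_by_replica(rules: List[Rule]) -> int:
    """Count each (block, RSE) replica once, however many rules lock it."""
    unique: Dict[Tuple[str, str], int] = {}
    for rule in rules:
        for block, size in rule.blocks.items():
            unique[(block, rule.rse)] = size
    return sum(unique.values())


# Example: a container rule and a block rule locking the same replica at one RSE.
container = Rule("/A/B/TIER", "T1_XX_Disk", {"/A/B/TIER#1": 100, "/A/B/TIER#2": 200})
block_rule = Rule("/A/B/TIER#1", "T1_XX_Disk", {"/A/B/TIER#1": 100})
print(bytes_by_rule([container, block_rule]))     # 400: block #1 counted twice
print(bytes_by_replica([container, block_rule]))  # 300: each replica counted once
```

Summing per rule inflates the total whenever container-level and block-level rules lock the same replica, which is why the figure above is an upper bound.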
Labels:
Deleteable -> container rule can be deleted as all block rules are okay.
Deleteable-ALL -> container rule can be deleted as all block rules are okay AND the tape rule is okay.
STUCK -> container rule CANNOT be deleted as not all block rules are okay.
BUG -> container rule CANNOT be deleted as the number of block rules does not equal the total number of blocks.
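The label definitions above can be read as a small decision procedure. The sketch below is one possible encoding, assuming the inputs are the states of the dataset's block-level rules, its total block count, and the state of its tape rule (if any). The precedence (BUG, then STUCK, then Deleteable-ALL/Deleteable) is an assumption, since the labels themselves do not specify an order.

```python
from typing import List, Optional

# Rucio rule state meaning the rule is fully satisfied.
OK = "OK"


def classify_container_rule(block_rule_states: List[str], n_blocks: int,
                            tape_rule_state: Optional[str] = None) -> str:
    """Assign one of the labels to a container-level wma_prod rule.

    block_rule_states: states of the dataset's block-level rules (hypothetical input)
    n_blocks: total number of blocks in the dataset
    tape_rule_state: state of the tape rule, or None if there is none
    """
    # BUG: the number of block rules does not match the number of blocks.
    if len(block_rule_states) != n_blocks:
        return "BUG"
    # STUCK: at least one block rule is not OK.
    if any(state != OK for state in block_rule_states):
        return "STUCK"
    # Deleteable-ALL: all block rules are OK and the tape rule is OK as well.
    if tape_rule_state == OK:
        return "Deleteable-ALL"
    # Deleteable: all block rules are OK, tape rule missing or not yet OK.
    return "Deleteable"


print(classify_container_rule(["OK", "OK"], 2, "OK"))  # Deleteable-ALL
print(classify_container_rule(["OK", "STUCK"], 2))     # STUCK
```

Checking the block count first keeps a dataset with missing block rules from being reported as deletable just because the rules that do exist are all OK.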
Nevertheless, this number should ideally be close to zero, as MSRuleCleaner should clean up these rules before the workflows are archived.
How to reproduce it
Here is the list of all such datasets and the CSV used to produce the plot: dataset_with_archived_wfs.csv
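Per-label totals like those in the plot could be computed from such a CSV along these lines. This is only a sketch: the column names `label` and `bytes` are hypothetical and may not match dataset_with_archived_wfs.csv, and petabytes are taken as decimal (10^15 bytes).

```python
import csv
import io
from collections import defaultdict
from typing import Dict

PB = 10**15  # decimal petabyte (assumption)


def totals_by_label(csv_text: str) -> Dict[str, float]:
    """Sum locked bytes per label and convert to PB.

    Assumes hypothetical columns 'label' and 'bytes'.
    """
    totals: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["label"]] += int(row["bytes"])
    return {label: total / PB for label, total in totals.items()}


# Example with made-up rows.
sample = (
    "dataset,label,bytes\n"
    "/A/B/TIER,STUCK,2000000000000000\n"
    "/D/E/TIER,BUG,500000000000000\n"
    "/G/H/TIER,STUCK,1000000000000000\n"
)
print(totals_by_label(sample))  # {'STUCK': 3.0, 'BUG': 0.5}
```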
Expected behavior
No wma_prod rules for workflows that are archived.
Additional context and error message
I will try to find the block level rules as well.
Hi @hassan11196, let me try to understand the problem you are reporting here. Are you saying that you have found workflows in archived status that still have existing wma_prod rules locking their output datasets? If so, that suggests that:
a) either MSRuleCleaner is not cleaning up all of the wma_prod rules;
b) or that the agent is creating rules after a workflow gets archived.
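One way to tell case a from case b for a given rule is to compare when the rule was created with when the workflow was archived. The sketch below assumes the workflow's status history is available as a list of transitions with a `Status` and an `UpdateTime` (epoch seconds), similar in spirit to ReqMgr2's request transition history, and that archived statuses end in `archived`. Treat both as assumptions to verify against the real data.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def archived_at(transitions: List[Dict]) -> Optional[datetime]:
    """Return the time of the earliest archived-status transition, or None if never archived."""
    times = [t["UpdateTime"] for t in transitions if t["Status"].endswith("archived")]
    if not times:
        return None
    return datetime.fromtimestamp(min(times), tz=timezone.utc)


def diagnose(rule_created_at: datetime, archived_time: Optional[datetime]) -> str:
    """Classify a leftover wma_prod rule as case a or case b."""
    if archived_time is None:
        return "workflow not archived"
    if rule_created_at > archived_time:
        # The rule appeared only after archival, which matches case b.
        return "case b: rule created after archival"
    # The rule already existed before archival but survived, which matches case a.
    return "case a: rule predates archival and was not cleaned up"


# Example with made-up timestamps.
history = [{"Status": "completed", "UpdateTime": 1_000},
           {"Status": "normal-archived", "UpdateTime": 2_000}]
created = datetime.fromtimestamp(2_500, tz=timezone.utc)
print(diagnose(created, archived_at(history)))  # case b: rule created after archival
```

Both timestamps must be timezone-aware (or both naive) for the comparison to work in Python.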
If that is correct, I can confirm that Amanda identified this problem (case b above) last month, and we are tracking it in this ticket: #12246
At least in the case we investigated, it was caused by the extremely slow JobAccountant polling cycle we had over Christmas, which was itself caused by a misconfiguration on the Oracle side.
Hi @amaltaro
I suspected case a and was not aware that case b was a possibility, but case b does make more sense.
I can still go through the MSRuleCleaner logs to verify that it is not case a (if you have any tips, please share them), and then we can close this issue, since case b is already tracked in #12246.
Impact of the bug
MSRuleCleaner
FYI @anpicci @amaltaro