Normal Archived workflows with wma_prod rules #12267

hassan11196 · 2025-02-19T16:57:12Z

Impact of the bug
MSRuleCleaner

Describe the bug
In an effort to reduce disk usage, we have started to delete container-level wma_prod rules created by the agents. In the process of finding these rules, I found around 12.3 PB worth of Rucio rules locking data on Disk.
These numbers are only for container-level rules. It is a very high number, but please take it with a grain of salt: I am counting by rules rather than by actual replicas, so data locked by both a container-level rule and a block-level rule is counted twice.
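To illustrate why summing per rule overstates the locked volume, here is a minimal sketch. It is not the script used to produce the numbers above; the `rules` layout, the block names, and the RSE name `SITE_X_Disk` are hypothetical and exist only for this example.

```python
def locked_bytes(rules, block_sizes):
    """Return (naive, deduplicated) byte totals for a list of rules.

    rules:       list of dicts with hypothetical keys "rse" and "blocks"
    block_sizes: mapping of block name -> size in bytes
    """
    # Naive total: every rule contributes the full size of the blocks it locks,
    # so a block covered by two rules on the same RSE is counted twice.
    naive = sum(block_sizes[block] for rule in rules for block in rule["blocks"])

    # Deduplicated total: count each (block, RSE) pair only once.
    unique_replicas = {(block, rule["rse"]) for rule in rules for block in rule["blocks"]}
    deduplicated = sum(block_sizes[block] for block, _rse in unique_replicas)
    return naive, deduplicated


# Hypothetical data: one container-level rule and one block-level rule
# that both lock blockA on the same RSE.
sizes = {"blockA": 10, "blockB": 20}
example_rules = [
    {"rse": "SITE_X_Disk", "blocks": ["blockA", "blockB"]},  # container-level rule
    {"rse": "SITE_X_Disk", "blocks": ["blockA"]},            # block-level rule
]
naive_total, dedup_total = locked_bytes(example_rules, sizes)
```

In this toy case the per-rule sum is larger than the per-replica sum because blockA is locked twice on the same RSE, which is the kind of overlap mentioned above.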

Labels:

Deleteable -> container rule can be deleted, as all block rules are okay.
Deleteable-ALL -> container rule can be deleted, as all block rules are okay AND the tape rule is okay.
STUCK -> container rule CANNOT be deleted, as NOT all block rules are okay.
BUG -> container rule CANNOT be deleted, as the number of block rules does not equal the total number of blocks.
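The labels above can be read as a small decision procedure. The following is only a sketch of that reading, not the code that produced the CSV: the function name, its arguments, the use of the string "OK" for a healthy rule state, and the order in which the checks are applied (BUG first, then STUCK) are all assumptions.

```python
def classify_container_rule(block_rule_states, total_blocks, tape_rule_state=None):
    """Assign one of the labels listed above to a container-level rule.

    block_rule_states: states of the block-level rules found, e.g. ["OK", "REPLICATING"]
    total_blocks:      number of blocks in the container
    tape_rule_state:   state of the tape rule, or None if there is none (assumption)
    """
    # BUG: the number of block rules does not match the number of blocks.
    if len(block_rule_states) != total_blocks:
        return "BUG"
    # STUCK: at least one block rule is not okay.
    if any(state != "OK" for state in block_rule_states):
        return "STUCK"
    # Deleteable-ALL: all block rules are okay AND the tape rule is okay.
    if tape_rule_state == "OK":
        return "Deleteable-ALL"
    # Deleteable: all block rules are okay.
    return "Deleteable"


label = classify_container_rule(["OK", "OK"], total_blocks=2, tape_rule_state="OK")
```

Checking BUG before STUCK is a design choice in this sketch: a mismatch in the number of block rules makes the per-rule states less meaningful, so it is reported first.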

Nevertheless, this number should ideally be near zero, as MSRuleCleaner should be cleaning up these rules before the workflows are archived.

How to reproduce it
Here is the list of all such datasets, which is also the CSV used to produce this plot:
dataset_with_archived_wfs.csv

Expected behavior
No wma_prod rules for workflows that are archived.

Additional context and error message
I will try to find the block level rules as well.

FYI @anpicci @amaltaro


amaltaro · 2025-02-19T20:41:16Z

Hi @hassan11196, let me try to understand the problem that you are reporting here. Are you saying that you have found workflows sitting in archived status that still have wma_prod rules locking their output datasets? If so, that suggests that:
a) either MSRuleCleaner is not cleaning up all of the wma_prod rules;
b) or the agent is creating rules after a workflow gets archived.

If that is correct, I can confirm that Amanda identified this problem (case b above) last month and we are tracking it in this ticket: #12246

At least for the case that we were investigating, it was caused by the ultra-slow JobAccountant polling cycle that we had over Christmas, which in turn came from a misconfiguration on the Oracle side.

hassan11196 · 2025-02-20T14:09:29Z

Hi @amaltaro
I suspected case a and was not aware that case b was a possibility, but case b does make more sense.

I can still go through the MSRuleCleaner logs to verify that it's not case a (if you have any tips, please do share), and then we can close this issue, since we already have the case b issue.

Thank you Alan.
