
Rules created by wma_prod after workflow archival #12246

Open

anehnis opened this issue Jan 30, 2025 · 2 comments

Comments


anehnis commented Jan 30, 2025

Impact of the bug
Rules are created for the wmaprod account that will never be cleaned up by MSRuleCleaner.

Describe the bug
An instance was seen with a workflow that was aborted on Dec 24. The agent (vocms0254) had a backlog of ~10 days; it parsed the merge jobs on Dec 31, which then triggered DBS3Upload and RucioInjector. Rules were created for wmaprod and they will not be cleaned up, as MSRuleCleaner has already archived the workflow.

Describe the solution you'd like
After a given stage of the workflow, the agent should no longer inject data into DBS and Rucio, as it can cause other types of problems, for instance:

  • adding more statistics to samples already announced to the CMS collaboration. This should not be a problem in itself, but it means adding statistics after P&R have verified the output data and its consistency between Rucio and DBS;
  • it could lead to block rules that will never be removed from Rucio (deletion is performed by MSRuleCleaner).

Describe alternatives you've considered
There are multiple ways to address this issue, here are some:

  1. ensure that the agent does not accumulate any backlog (including backlog caused by components being down) --> likely impossible to achieve;
  2. prevent WMAgent from creating blocks in DBS and Rucio for workflows that no longer need them (probably any archived, closed-out, announced, rejected or aborted workflows);
  3. even after MSRuleCleaner archives a workflow, keep querying Rucio for potential wma_prod rules that should be deleted;
  4. similar to 3), create a thread/mechanism that keeps looking for wma_prod rules over a longer period of time (how long is good enough?).
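For alternative 2), a minimal sketch of the status check could look as follows. The status list and the helper name (`should_inject`) are illustrative assumptions, not the actual WMCore API:

```python
# Sketch of alternative 2): skip DBS/Rucio injection for workflows whose
# status makes the output data irrelevant. The exact set of "terminal"
# statuses below is an assumption based on the ticket description.

TERMINAL_STATUSES = {
    "aborted", "aborted-completed", "rejected",
    "closed-out", "announced",
    "normal-archived", "aborted-archived", "rejected-archived",
}

def should_inject(workflow_status):
    """Return True if output data for this workflow still needs injection."""
    return workflow_status not in TERMINAL_STATUSES
```

The components would consult this check before creating any block in DBS or any rule in Rucio.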

Additional context
Included are some logs that @amaltaro was able to find for this example: time_travel_rules_debug.txt


amaltaro commented Feb 3, 2025

@anehnis thank you for creating this ticket. I've done some refactoring to the original description above.

About the output data flow, we have a summary of it documented in this section: https://cms-wmcore.docs.cern.ch/training/data_flow/#output-data-flow

Regarding the four options described above, I am slightly inclined towards option 2), thus ensuring that DBS3Upload and RucioInjector do not inject data for workflows that are no longer "relevant". It is still not clear to me what the most efficient way is to answer the question: "do I still need this output data?". The few possibilities I see are:
a) fetch information from wmstatsserver (if it is not there, we know for sure that it is not needed -- the work has already been archived);
b) fetch information from the local workqueue, using one of the views that list output data (if the data is not there, then the workqueue elements have been deleted). Actually, this is a BAD idea, because we still want to inject data for completed workflows, and those will likely no longer have local workqueue elements;
c) query ReqMgr2, but that is likely too inefficient.
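Possibility a) could be sketched roughly as below. The lookup function is injected so that the actual wmstatsserver client and endpoint (assumptions here) can be swapped in later:

```python
# Sketch of possibility a): decide whether output data is still needed by
# checking whether the workflow is still known to wmstatsserver. The
# fetch_request_info callable is a stand-in for a real wmstatsserver lookup.

def output_still_needed(workflow, fetch_request_info):
    """
    fetch_request_info(workflow) should return the wmstats document for the
    workflow, or None if it is not found.

    If the workflow is absent from wmstatsserver, its request data has
    already been archived, so the output is no longer needed.
    """
    info = fetch_request_info(workflow)
    return info is not None
```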

If we find such a case - a file and/or block that needs to be inserted into Rucio (and DBS) - we should skip the injection into the external service and mark it as completed on the component side. For DBS, this would be done through block.status = 'InDBS', here: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L780

For RucioInjector, it looks like we would have to call self.setBlockRules.execute(binds), which updates the database with a rule id for a given block name, here: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/RucioInjector/RucioInjectorPoller.py#L271

One of the questions I have is: can we update this with an invalid/null rule id? Should we default it to a fake rule id?
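Putting both markers together, one possible answer to the fake-rule-id question is a recognizable sentinel value. The dict-based block representation and field names below are simplified stand-ins for the actual DBSBuffer schema:

```python
# Sketch of closing out a skipped block on the component side without
# contacting DBS or Rucio. A sentinel rule id makes such blocks easy to
# identify later; it is an illustrative choice, not the WMCore schema.

SKIPPED_RULE_ID = "skipped-archived-workflow"  # sentinel, not a real Rucio rule

def mark_block_skipped(block):
    """Mark a block as handled locally, without injecting it anywhere."""
    block["status"] = "InDBS"           # same terminal status a real upload sets
    block["rule_id"] = SKIPPED_RULE_ID  # satisfies the rule-id column
    return block
```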

NOTE: I feel that we also have to refactor how the polling cycle of these components works. Either we:
a) ask an external service before injecting any given block
b) or we update a cache of no-longer-needed (or still-needed) blocks every X minutes/hours.
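Option b) could be a small TTL cache in front of the external-service lookup, so each poller only re-asks about a workflow every X minutes/hours. The cache policy and refresh interval below are assumptions:

```python
# Sketch of option b): cache the "is this workflow's output still needed?"
# answer for a configurable TTL, to avoid hitting the external service on
# every polling cycle.

import time

class RelevanceCache:
    def __init__(self, lookup, ttl=3600):
        self.lookup = lookup  # callable: workflow -> bool (still needed?)
        self.ttl = ttl        # seconds before an entry is refreshed
        self._cache = {}      # workflow -> (still_needed, timestamp)

    def still_needed(self, workflow):
        entry = self._cache.get(workflow)
        now = time.time()
        if entry is None or now - entry[1] > self.ttl:
            entry = (self.lookup(workflow), now)
            self._cache[workflow] = entry
        return entry[0]
```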

@amaltaro amaltaro changed the title Rules created by wmaprod after workflow archival Rules created by wma_prod after workflow archival Feb 11, 2025
@amaltaro

@anehnis as we discussed today, there is a second option to deal with this, which would also resolve a very long-standing issue described in this ticket: #8148. In short, we would couple workflow completion with data injection into DBS and Rucio.

In other words, whenever the agent identifies a workflow ready to be completed, it would ensure that both the DBS3Upload and RucioInjector components expedite data injection for that workflow. As we discussed, components don't talk to each other directly, only through database information. So this development would involve identifying ready-to-complete workflows and prioritizing their output data in those components.

Let me record here what we just discussed.

Challenges

  1. how to ensure that ReqMgr2 only moves a workflow to completed once all its data has been injected?
    To do this, we need to make sure agents don't delete (or mark as Done?) workqueue elements
  2. how to force RucioInjector to process the output of a given workflow, instead of simply loading everything?
  3. similarly to 2) above, but for DBS3Upload

Questions (and some draft answers)

  1. when does a workflow move to the completed status?
    A: when all its workqueue elements are in a final status (they could be spread over distributed agents); or perhaps when the agents have deleted those WQEs.
  2. How does the agent know when a workflow is about to be completed?
    A: it could be via workqueue elements, but we probably want to look into the relational DB (status of subscriptions?)
  3. How can we change DBS3Upload to only look for data that belongs to a given workflow?
    It boils down to the loadBlocks() method here: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L319
  4. Similarly for RucioInjector, how can we expedite data injection for a given workflow?
    We might have to fork this getUninjected DAO such that we can look only into the output data that we need to expedite,
    here: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/RucioInjector/RucioInjectorPoller.py#L165
  5. How would this feature change impact the system and/or teams?
    If one of the agents is not behaving well, it could take many hours (days?) to get the workflow into completed status.
  6. How heavy (or slow) is the Rucio getUninjected DAO? If not too heavy, we could keep it and then filter data for the expedite process.
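Regarding question 6: if the getUninjected DAO turns out to be cheap enough, one option is to keep it as-is and filter its result for the workflows being expedited. The block-dict shape below is an assumption, not the actual DAO output format:

```python
# Sketch for question 6: keep the existing getUninjected DAO and post-filter
# its output to the workflows whose completion we want to expedite.

def filter_for_expedite(uninjected_blocks, expedite_workflows):
    """Keep only uninjected blocks belonging to workflows ready to complete."""
    expedite = set(expedite_workflows)
    return [blk for blk in uninjected_blocks if blk["workflow"] in expedite]
```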

Developments

  1. we might have to break the components' polling cycles down into 2 modes:
    a) data that is supposed to be completed (expedited)
    b) standard data to be considered
  2. Or, from a different angle: how can we enforce that components only load output data belonging to a given workflow?
  3. [Ops] Alan needs to deploy a testbed agent and stop RucioInjector and DBS3Upload, for playing/learning purposes.
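Development item 1 could be sketched as a two-pass polling cycle, with all hooks (`find_expedite_workflows`, the two loaders, `inject`) passed in as hypothetical stand-ins for the component internals:

```python
# Sketch of a two-mode polling cycle: first an "expedite" pass for workflows
# that are about to complete, then the usual bulk pass. Every callable here
# is a placeholder for component/DAO logic that does not exist yet.

def poll_cycle(find_expedite_workflows, load_blocks_for,
               load_standard_blocks, inject):
    # Mode a): expedite output of workflows that are ready to complete
    for wf in find_expedite_workflows():
        for block in load_blocks_for(wf):
            inject(block)
    # Mode b): the standard bulk injection path
    for block in load_standard_blocks():
        inject(block)
```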

Please add anything that I might have missed from our discussion.

Projects
Status: ToDo

2 participants