
output from automatic split probe jobs is not discarded #8938

Open
belforte opened this issue Feb 19, 2025 · 16 comments


@belforte
Member

see https://cmsweb.cern.ch/crabserver/ui/task/250217_154906%3Acerminar_crab_DoublePhoton_FlatPt-1To100_PU200

While the transfer tab there and crab getoutput --dump only show processing and tail jobs, the user showed me that the probe jobs' output was also present in the destination directory.

One possibility is that this is due to direct stageout from the WN when the job ran at the same site as the destination.

It may even be that it has been like this for a while... need to check.

@belforte
Member Author

belforte commented Feb 19, 2025

indeed from one probe job stdout


======== Stageout at Mon Feb 17 16:17:02 GMT 2025 STARTING ========
====== Mon Feb 17 16:17:03 2025: cmscp.py STARTING.
The user has not specified to transfer the log files. No log files stageout (nor log files metadata upload) will be performed.
Stageout policy: local, remote
SRC site T2_CH_CERN has davs ? True. Has gsiftp ? False.
DST site T2_CH_CERN has davs ? True. Has gsiftp ? False.
====== Mon Feb 17 16:17:03 2025: Starting job report validation.
Job report seems ok (it has the expected structure).
Retrieved payload exit code ('jobExitCode') = 0 from job report.
Retrieved job wrapper exit code ('exitCode') = 0 from job report.
====== Mon Feb 17 16:17:03 2025: Finished job report validation (status 0).
Job execution site is the same as destination site. Changing stageout policy.
New stageout policy: remote, local
====== Mon Feb 17 16:17:03 2025: Starting to check if user output files exist.
Output file inputs140X.root exists.
====== Mon Feb 17 16:17:03 2025: Finished to check if user output files exist (status 0).
====== Mon Feb 17 16:17:03 2025: Starting to check if user output files are in job report.
Output file inputs140X.root found in job report.
====== Mon Feb 17 16:17:03 2025: Finished to check if user output files are in job report (status 0).
====== Mon Feb 17 16:17:03 2025: Starting initialization of stageout manager for local stageouts.
       -----> Stageout manager log start
INFO:root:StageOutMgr::__init__()
INFO:root:==== Stageout configuration start ====
INFO:root:
There are 1 stage out definitions.
INFO:root:
Stage out to : T2_CH_CERN using: gfal2

[...]

Copying 355941281 bytes file:///srv/inputs140X.root => davs://eoscms.cern.ch:443/eos/cms/store/group/cmst3/group/l1tr/cerminar/l1teg/fpinputs/DoublePhoton_FlatPt-1To100-gun/DoublePhoton_FlatPt-1To100_PU200_142Xv0/250217_154906/0000/inputs140X_0-1.root

@belforte
Member Author

CRAB_localOutputFiles is passed to the job via the VARS command in the DAG. Here's the template from the master branch:

VARS Job{count} My.CRAB_localOutputFiles="\\"{localOutputFiles}\\""

For the good task, the RunJobs.dag file has this line:

VARS Job0-1 My.CRAB_localOutputFiles="\"\""

The full spec for one probe-job node is:

JOB Job0-1 Job.0-1.submit
SCRIPT  PRE  Job0-1 dag_bootstrap.sh PREJOB $RETRY 0-1 250219_162741:belforte_crab_20250219_172737 crab-preprod-tw01.cern.ch probe
SCRIPT DEFER 4 1800 POST Job0-1 dag_bootstrap.sh POSTJOB $JOBID $RETURN $RETRY $MAX_RETRIES 250219_162741:belforte_crab_20250219_172737 0-1 /store/temp/user/belforte.4bba3c2d14d54a938291c268ff4eba8b575b197f/GenericTTbar/crab_20250219_172737/250219_162741/0000 /store/user/belforte/GenericTTbar/crab_20250219_172737/250219_162741/0000 cmsRun_0-1.log.tar.gz probe 
#PRE_SKIP Job0-1 3
RETRY Job0-1 2 UNLESS-EXIT 2
VARS Job0-1 count="0-1"
# following 3 classAds could possibly be moved to Job.submit but as they are job-dependent
# would need to be done in the PreJob... doing it here is a bit ugly, but simpler
VARS Job0-1 My.CRAB_localOutputFiles="\"\""
VARS Job0-1 My.CRAB_DataBlock="\"/GenericTTbar/HC-CMSSW_9_2_6_91X_mcRun1_realistic_v2-v2/AODSIM#35197562-76e3-11e7-a0c8-02163e00d7b3\""
VARS Job0-1 My.CRAB_Destination="\"davs://eoscms.cern.ch:443/eos/cms/store/user/belforte/GenericTTbar/crab_20250219_172737/250219_162741/0000/log/cmsRun_0-1.log.tar.gz\""
ABORT-DAG-ON Job0-1 3

@belforte
Member Author

But I find the same line

VARS Job0-1 My.CRAB_localOutputFiles="\"\""

also when using crabtaskworker:v3.250215-stable, which was the same TW used by the problematic task reported above.

I am waiting for those jobs to run (at CERN) to see what's in the job stdout and ads...

@belforte
Member Author

I need to make sure that I use a PSet which does produce output files :-(

@belforte
Member Author

belforte commented Feb 19, 2025

I have initially focused on the CRAB_localOutputFiles classAd. If it is empty, it causes cmscp.py to do nothing. But it is set with the output file name in the DAG file for both v3.250109 (current prod TW) and v3.250215 (used in the bad task from the initial report).

But my jobs haven't run at CERN yet, so I can't see whether local stageout was performed. pff...

@belforte
Member Author

belforte commented Feb 19, 2025

One sure thing is that output transfer for probe jobs is disabled in the PostJob, i.e. no ASO:

if 'CRAB_TransferOutputs' not in self.job_ad:
    msg = "Job's HTCondor ClassAd is missing attribute CRAB_TransferOutputs."
    msg += " Will assume CRAB_TransferOutputs = True."
    self.logger.warning(msg)
    self.transfer_outputs = 1
else:
    self.transfer_outputs = int(self.job_ad['CRAB_TransferOutputs'])
if self.stage == 'probe':
    self.transfer_logs = 0
    self.transfer_outputs = 0

@belforte
Member Author

belforte commented Feb 19, 2025

From a quick look at cmscp.py, the relevant classAd should be CRAB_TransferOutputs.

Things get interesting. Looking at how it is set in the submit files for probes:

  • PRODUCTION
belforte@vocms059/SPOOL_DIR> grep v3 TaskWorker/__init__.py 
__version__ = "v3.250109.patch1" #Automatically added during build process
belforte@vocms059/SPOOL_DIR> grep CRAB_TransferOutputs *0-*submit
Job.0-1.submit:+CRAB_TransferOutputs = 0
Job.0-2.submit:+CRAB_TransferOutputs = 0
Job.0-3.submit:+CRAB_TransferOutputs = 0
Job.0-4.submit:+CRAB_TransferOutputs = 0
Job.0-5.submit:+CRAB_TransferOutputs = 0
belforte@vocms059/SPOOL_DIR>
belforte@vocms059/SPOOL_DIR> grep CRAB_TransferOutputs Job.submit
belforte@vocms059/SPOOL_DIR> 
  • NEW
belforte@vocms0194/SPOOL_DIR> grep v3 TaskWorker/__init__.py 
__version__ = "v3.250215" #Automatically added during build process
belforte@vocms0194/SPOOL_DIR> grep CRAB_TransferOutputs *0-*submit
Job.0-1.submit:My.CRAB_TransferOutputs = 0
Job.0-1.submit:My.CRAB_TransferOutputs = 1
Job.0-2.submit:My.CRAB_TransferOutputs = 0
Job.0-2.submit:My.CRAB_TransferOutputs = 1
Job.0-3.submit:My.CRAB_TransferOutputs = 0
Job.0-3.submit:My.CRAB_TransferOutputs = 1
Job.0-4.submit:My.CRAB_TransferOutputs = 0
Job.0-4.submit:My.CRAB_TransferOutputs = 1
Job.0-5.submit:My.CRAB_TransferOutputs = 0
Job.0-5.submit:My.CRAB_TransferOutputs = 1
belforte@vocms0194/SPOOL_DIR> 
belforte@vocms0194/SPOOL_DIR> grep CRAB_TransferOutputs Job.submit
My.CRAB_TransferOutputs = 1
belforte@vocms0194/SPOOL_DIR> 

It seems that the new code inserts My.CRAB_TransferOutputs = 1 in the common Job.submit template, while before it was left to the PreJob:

saveoutputs = 0 if self.stage == 'probe' else self.task_ad.lookup('CRAB_TransferOutputs')
new_submit_text += 'My.CRAB_TransferOutputs = {0}\n+CRAB_SaveLogsFlag = {1}\n'.format(saveoutputs, savelogs)
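The per-job override logic quoted above can be condensed into a minimal sketch (hypothetical helper name; it mirrors the two PreJob lines, including the probe special-casing of the logs flag seen in the PostJob snippet earlier):

```python
def prejob_transfer_lines(stage, task_transfer_outputs, task_save_logs):
    """Build the per-job submit-file lines the PreJob appends.

    Probe jobs must never transfer user output (or logs), regardless
    of the task-level CRAB_TransferOutputs setting.
    """
    saveoutputs = 0 if stage == 'probe' else int(task_transfer_outputs)
    savelogs = 0 if stage == 'probe' else int(task_save_logs)
    return ('My.CRAB_TransferOutputs = {0}\n'
            '+CRAB_SaveLogsFlag = {1}\n'.format(saveoutputs, savelogs))
```

For a probe job this always emits My.CRAB_TransferOutputs = 0, so the bug is not here: the problem is that the template's = 1 line ends up taking precedence.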

@belforte
Member Author

belforte commented Feb 19, 2025

bloody mess.

IIUC, in the code before the "refactoring", the CRAB_TransferOutputs ad was not put in Job.submit by the DagmanCreator, but passed as an ad to the dag bootstrap by the DagmanSubmitter, to be later used in the PreJob.

One clear goal of the refactoring was to make all such obscure information passing explicit. I.e. I could move

jobSubmit['My.CRAB_TransferOutputs'] = transferOutputs

to after the Job.submit file was created, after this comment
# add maxidle, maxpost and faillimit to the object passed to DagmanSubmitter
# first two to be used in the DAG submission and the latter in the PostJob

Another option could be to enforce that PreJob overrides existing ad in Job.submit by changing

## Finally add (copy) all the content of the generic Job.submit file.
with open("Job.submit", 'r', encoding='utf-8') as fd:
new_submit_text += fd.read()

so that new_submit_text is appended after the existing file content, rather than the other way around. I have no idea why the current code is like that.
But so far I have hesitated to change code w/o a full understanding; maybe there were reasons for the current implementation?

@belforte
Member Author

OK. I have submitted an auto-split task to v3.250215 with no site whitelist, and indeed all jobs performed local stageout.
https://cmsweb-test2.cern.ch/crabserver/ui/task/250219_212948%3Abelforte_crab_20250219_222943

belforte@lxplus802/TC3> crab status -d ./crab_20250219_222943  --long 
Rucio client intialized for account belforte
CRAB project directory:		/afs/cern.ch/work/b/belforte/CRAB3/TC3/crab_20250219_222943
Task name:			250219_212948:belforte_crab_20250219_222943
Grid scheduler - Task Worker:	[email protected] - crab-dev-tw01

[...]

 Job State        Most Recent Site        Runtime   Mem (MB)      CPU %    Retries   Restarts      Waste       Exit Code
 0-1 no output    T1_US_FNAL              0:23:37       1494         60          0          0    0:00:08               0
 0-2 no output    T1_FR_CCIN2P3           0:16:01       1196         90          0          0    0:00:08               0
 0-3 no output    T1_FR_CCIN2P3           0:16:12       1207         90          0          0    0:00:09               0
 0-4 no output    T1_FR_CCIN2P3           0:16:24       1444         86          0          0    0:00:09               0
 0-5 no output    T1_RU_JINR              0:18:35       1086         79          0          0    0:00:09               0

Probe 0-1 ran at FNAL, where local stageout failed, and the file was indeed pushed to CERN:

belforte@lxplus802/~> ls /eos/cms/store/user/belforte/GenericTTbar/crab_20250219_222943/250219_212948/0000
kk_0-1.root
belforte@lxplus802/~> 

@belforte
Member Author

belforte commented Feb 19, 2025

same task submitted to prod tw (v3.250109.patch1)
https://cmsweb.cern.ch/crabserver/ui/task/250219_220503%3Abelforte_crab_20250219_230459

Indeed, now:

belforte@vocms059/250219_220503:belforte_crab_20250219_230459> condor_q -con crab_reqname==\"250219_220503:belforte_crab_20250219_230459\" -af jobuniverse jobstatus CRAB_TransferOutputs
7 2 1
12 2 undefined
5 2 0
5 2 0
5 2 0
5 2 0
5 2 0
belforte@vocms059/250219_220503:belforte_crab_20250219_230459> 

@belforte
Member Author

[...]

== JOB AD: CRAB_TransferOutputs = 0

[...]

======== Stageout at Wed Feb 19 22:25:21 GMT 2025 STARTING ========
====== Wed Feb 19 22:25:22 2025: cmscp.py STARTING.
The user has not specified to transfer the log files. No log files stageout (nor log files metadata upload) will be performed.
The user has specified to not transfer the output files. No output files or logs stageout (nor output files metadata upload) will be performed.
Stageout wrapper has no work to do. Finishing here.
Setting stageout wrapper exit info to {'exit_code': 0, 'exit_acronym': 'OK', 'exit_msg': 'OK'}.
Cannot retrieve the job exit code from the job report (does None exist?).
====== Wed Feb 19 22:25:22 2025: cmscp.py FINISHING (status 0).

real	0m0.864s
user	0m0.425s
sys	0m0.171s
======== Stageout at Wed Feb 19 22:25:22 GMT 2025 FINISHING (short status 0) ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Wed Feb 19 22:25:22 GMT 2025 on ccwcondor0678 with (short) status 0 ========
[...]

== JOB AD: CRAB_TransferOutputs = 1

[...]

======== Stageout at Wed Feb 19 21:47:58 GMT 2025 STARTING ========
====== Wed Feb 19 21:47:58 2025: cmscp.py STARTING.
The user has not specified to transfer the log files. No log files stageout (nor log files metadata upload) will be performed.
Stageout policy: local, remote
SRC site T1_US_FNAL has davs ? True. Has gsiftp ? False.
DST site T2_CH_CERN has davs ? True. Has gsiftp ? False.
====== Wed Feb 19 21:47:59 2025: Starting job report validation.
Job report seems ok (it has the expected structure).
Retrieved payload exit code ('jobExitCode') = 0 from job report.
Retrieved job wrapper exit code ('exitCode') = 0 from job report.
====== Wed Feb 19 21:47:59 2025: Finished job report validation (status 0).
====== Wed Feb 19 21:47:59 2025: Starting to check if user output files exist.
Output file kk.root exists.
====== Wed Feb 19 21:47:59 2025: Finished to check if user output files exist (status 0).
====== Wed Feb 19 21:47:59 2025: Starting to check if user output files are in job report.
Output file kk.root found in job report.
====== Wed Feb 19 21:47:59 2025: Finished to check if user output files are in job report (status 0).
====== Wed Feb 19 21:47:59 2025: Starting initialization of stageout manager for local stageouts.

CONCLUSION
The relevant classAd is indeed CRAB_TransferOutputs, which is not set to 0 for probe jobs in the latest code.

@belforte
Member Author

OTOH.. since CRAB_TransferOutputs is used in the job wrapper, why should it not be present in Job.submit? The PreJob is there exactly to customize it for each job, and it already has code to take care of that:

saveoutputs = 0 if self.stage == 'probe' else self.task_ad.lookup('CRAB_TransferOutputs')

!!!!!!!!!!!!!

@belforte
Member Author

So I will go for

the PreJob adds its stuff AFTER the common Job.submit template, reversing line 354 here:

## Finally add (copy) all the content of the generic Job.submit file.
with open("Job.submit", 'r', encoding='utf-8') as fd:
new_submit_text += fd.read()

belforte added a commit to belforte/CRABServer that referenced this issue Feb 20, 2025
@belforte
Member Author

Rats, it did not work!
The lines in Job.0-*.submit are in the expected order:

belforte@vocms059/cluster10126689.proc0.subproc0> grep CRAB_TransferOutputs *submit
Job.0-1.submit:My.CRAB_TransferOutputs = 1
Job.0-1.submit:My.CRAB_TransferOutputs = 0
Job.0-2.submit:My.CRAB_TransferOutputs = 1
Job.0-2.submit:My.CRAB_TransferOutputs = 0
Job.0-3.submit:My.CRAB_TransferOutputs = 1
Job.0-3.submit:My.CRAB_TransferOutputs = 0
Job.0-4.submit:My.CRAB_TransferOutputs = 1
Job.0-4.submit:My.CRAB_TransferOutputs = 0
Job.0-5.submit:My.CRAB_TransferOutputs = 1
Job.0-5.submit:My.CRAB_TransferOutputs = 0
Job.submit:My.CRAB_TransferOutputs = 1
belforte@vocms059/cluster10126689.proc0.subproc0> 

yet the submitted jobs have the wrong value:

belforte@vocms059/cluster10126689.proc0.subproc0> condor_q -con crab_reqname==\"250220_123754:belforte_crab_20250220_133751\" -af CRAB_TransferOutputs 
1
undefined
1
1
1
1
1
belforte@vocms059/cluster10126689.proc0.subproc0> 

@belforte
Member Author

Maybe lines added to the JDL file after the queue statement are ignored?
Yes, sigh:

belforte@vocms059/SPOOL_DIR> condor_q -con crab_reqname==\"250220_123754:belforte_crab_20250220_133751\" -af DESIRED_SITES
undefined
undefined
undefined
undefined
undefined
undefined
undefined
belforte@vocms059/SPOOL_DIR> 
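A sketch of why the appended lines were dropped (an assumption based on the observed behaviour above: condor_submit stops reading assignments at the queue statement, so even DESIRED_SITES from the original per-job text ends up undefined once it lands after queue):

```python
def effective_ads(submit_text):
    """Toy parser for a submit description: collect 'key = value'
    assignments, but ignore everything after the 'queue' statement,
    as condor_submit appears to do with the lines appended here."""
    ads = {}
    for line in submit_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith('queue'):
            break  # nothing after 'queue' reaches the job ad
        key, sep, value = stripped.partition('=')
        if sep and not stripped.startswith('#'):
            ads[key.strip()] = value.strip()
    return ads
```

So the per-job My.CRAB_TransferOutputs = 0 (and DESIRED_SITES), appended after queue, never make it into the job ad, matching the condor_q output above.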

@belforte
Member Author

belforte commented Feb 20, 2025

Need a bit more care in the PreJob to make sure that queue\n stays as the last line.
I lean toward reading Job.submit into an htcondor.Submit() object, manipulating it, and writing it out as Job.x.submit, avoiding all that line manipulation.

I.e. I will also take this chance to change new_submit_text into a newJobSubmit object and complete the transition to the modern HTCondor API in the PreJob as well.
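At the text level, the constraint boils down to something like this minimal sketch (hypothetical helper; the plan above is to use htcondor.Submit objects instead of line manipulation): drop the template's trailing queue, append the per-job overrides, then re-append queue as the last line.

```python
def write_job_submit(generic_text, per_job_lines):
    """Rebuild a per-job submit description: take the generic
    Job.submit content, drop its 'queue' statement, append the
    per-job override lines, and put 'queue' back as the last line
    so the overrides are actually read by condor_submit."""
    body = [line for line in generic_text.splitlines()
            if not line.strip().lower().startswith('queue')]
    body.extend(per_job_lines)
    body.append('queue')
    return '\n'.join(body) + '\n'
```

This way the per-job assignments come last (so they override the template) while still sitting before queue (so they are not ignored).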

belforte added a commit to belforte/CRABServer that referenced this issue Feb 21, 2025
belforte added a commit to belforte/CRABServer that referenced this issue Feb 21, 2025
belforte added a commit to belforte/CRABServer that referenced this issue Feb 21, 2025
belforte added a commit to belforte/CRABServer that referenced this issue Feb 21, 2025