FJR crash on bad storage.json #12255

Open
stlammel opened this issue Feb 10, 2025 · 14 comments

@stlammel

The framework job report (FJR) crashes when a site has a fallback reference to a site
with a "broken" SITECONF/storage.json.
I assume FJR accesses the fallback site's storage.json for fallback stage-out information.
In the HammerCloud example attached, input data were read locally and there was no
fallback and no stage-out.
This seems to be Python, so it would be good to put the read/JSON decoding into a try and
ignore errors if the information is not needed (or skip the read/decode altogether if not needed). Given
that SITECONF is on CVMFS, I would catch-and-continue on any read/decoding errors
in general and only raise an error in case needed information is missing.
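
A minimal sketch of the catch-and-continue idea described above. This is an illustration, not WMCore's actual code: `load_storage_json` and `require_storage_json` are hypothetical helper names, and the sketch assumes only that the file is plain JSON.

```python
import json
import logging


def load_storage_json(path):
    """Return the parsed storage.json at `path`, or None if it is unusable.

    Hypothetical helper for illustration; not part of WMCore.
    """
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError is a subclass of ValueError, so read errors
        # and decoding errors are both caught here and only logged.
        logging.warning("Ignoring unusable storage.json %s: %s", path, exc)
        return None


def require_storage_json(path):
    """Raise only when the information is actually needed and unavailable."""
    data = load_storage_json(path)
    if data is None:
        raise RuntimeError(f"storage.json needed but unusable: {path}")
    return data
```

With this split, code that merely scans the configuration would call the tolerant loader, while code that is about to use a site's storage description would call the strict one.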

Impact of the bug
All jobs at sites that use a given storage.json fail when that file is bad.

Describe the bug
see above

How to reproduce it
corrupt JSON syntax

Expected behavior
see above

Additional context and error message
https://cmsweb.cern.ch/scheddmon/0197/sciaba/250208_103551:sciaba_crab_HC-205-T2_US_Florida-110860-20250208113502/job_out.105.1.txt

Sanitize FJR

Job Exit Code from FrameworkJobReport.xml: 0
CONDITION FOR REPORTING READ BRANCHES WAS FALSE
==== Job Exit Code from FrameworkJobReport.xml and Application exit code: 0 ====
==== Checksum computation STARTING at Sat Feb 8 20:15:29 2025 UTC ====
==== Checksum FINISHED at Sat Feb 8 20:15:30 2025 UTC ====
== FileName: b.root - FileAdler32: 3c6aa3b7- FileSize: 97.70085906982422.3f MBytes
== Adding PSet Hash for filename: b.root
==== PSet Hash computation STARTING at Sat Feb 8 20:15:30 2025 UTC ====
==== PSet Hash computation FINISHED at Sat Feb 8 20:15:31 2025 UTC ====
== edmProvDump pset hash d3610cbfc09efdad6393a7de796d6d85
WARNING: Unable to parse WMCore's jobReport.json; FJR will not be useful.
Traceback (most recent call last):
File "CMSRunAnalysis.py", line 240, in handleException
report = json.load(fh)
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

ERROR: Exceptional exit at Sat Feb 8 20:15:31 2025 UTC 50115: Exception while handling the job report.
ERROR: Traceback follows:
Traceback (most recent call last):
File "/srv/WMCore.zip/WMCore/Storage/RucioFileCatalog.py", line 238, in rseName
jsElements = json.load(jsonFile)
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 42 column 15 (char 1370)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "CMSRunAnalysis.py", line 852, in <module>
slCfg = SiteLocalConfig.loadSiteLocalConfig()
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 53, in loadSiteLocalConfig
config = SiteLocalConfig(actualPath)
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 100, in __init__
self.read()
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 143, in read
nodeResult = nodeReader(node)
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 208, in nodeReader
processor.send((report, node))
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 224, in processNode
target.send((report, child))
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 248, in processSite
targets['stage-out'].send((report, subnode))
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 310, in processStageOut
localReport['phedex-node'] = rseName(report["siteName"], subSiteName, aStorageSite, aVolume)
File "/srv/WMCore.zip/WMCore/Storage/RucioFileCatalog.py", line 242, in rseName
raise RuntimeError(msg)
RuntimeError: RucioFileCatalog.py:rseName() Error reading storage.json: /cvmfs/cms.cern.ch/SITECONF/T2_US_MIT/storage.json
Expecting value: line 42 column 15 (char 1370)

ERROR: Failed to record execution site name in the FJR from the site-local-config.xml
Traceback (most recent call last):
File "/srv/WMCore.zip/WMCore/Storage/RucioFileCatalog.py", line 238, in rseName
jsElements = json.load(jsonFile)
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 42 column 15 (char 1370)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "CMSRunAnalysis.py", line 274, in handleException
sLCfg = SiteLocalConfig.loadSiteLocalConfig()
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 53, in loadSiteLocalConfig
config = SiteLocalConfig(actualPath)
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 100, in __init__
self.read()
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 143, in read
nodeResult = nodeReader(node)
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 208, in nodeReader
processor.send((report, node))
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 224, in processNode
target.send((report, child))
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 248, in processSite
targets['stage-out'].send((report, subnode))
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 310, in processStageOut
localReport['phedex-node'] = rseName(report["siteName"], subSiteName, aStorageSite, aVolume)
File "/srv/WMCore.zip/WMCore/Storage/RucioFileCatalog.py", line 242, in rseName
raise RuntimeError(msg)
RuntimeError: RucioFileCatalog.py:rseName() Error reading storage.json: /cvmfs/cms.cern.ch/SITECONF/T2_US_MIT/storage.json
Expecting value: line 42 column 15 (char 1370)

+ jobrc=195
+ set +x
== The job had an exit code of 195
======== CMSRunAnalysis.py FINISHING at Sat Feb 8 20:15:31 GMT 2025 ========

@anpicci
Contributor

anpicci commented Feb 11, 2025

One thing to investigate on the WM side is whether we can run SiteLocalConfig over only the site that is intended to be used, rather than over all the sites. For example, for this failed job only Florida is meant to be used, but SiteLocalConfig fails when checking the MIT storage.json.

@amaltaro
Contributor

@stlammel thank you for creating this issue.

> This seems to be Python, so it would be good to put the read/JSON decoding into a try and
> ignore errors if the information is not needed (or skip the read/decode altogether if not needed). Given
> that SITECONF is on CVMFS, I would catch-and-continue on any read/decoding errors
> in general and only raise an error in case needed information is missing.

I do not think this is possible. If a JSON document is broken, we cannot partially read it; the whole operation is doomed to fail. With that said, is there anything else that you think should be changed in WMCore? A friendlier error message? Something else?
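
A small illustration of the all-or-nothing behaviour, using only the Python standard library. The two JSON strings are made-up examples, not the real storage.json schema; the second one carries a single extra comma.

```python
import json

valid = '{"site": "T2_US_MIT", "volume": "MIT_HADOOP"}'
broken = '{"site": "T2_US_MIT", "volume": "MIT_HADOOP",}'  # extra comma

print(json.loads(valid)["site"])

try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    # The decoder reports where it gave up but returns no partial result,
    # even though everything before the extra comma was well formed.
    print(f"decode failed at line {exc.lineno}, column {exc.colno}: {exc.msg}")
```

The exact wording of the error message varies between Python versions, so code reacting to it should rely on the exception type rather than on the text.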

For the record, here is how one can reproduce this error:

  1. copy the T2_US_MIT storage.json and site-local-config.xml locally (as we do not have CVMFS from inside the WMAgent docker containers)
  2. add an extra comma to the storage.json such that it is broken. They would be located under:
(WMAgent-2.3.9.2) [xxx@vocms0xxx:current]$ ls logs/storage.json 
logs/storage.json
(WMAgent-2.3.9.2) [xxx@vocms0xxx:current]$ ls logs/JobConfig/site-local-config.xml 
logs/JobConfig/site-local-config.xml
  3. from inside a WMAgent docker container, open a python interpreter and run:
>>> from WMCore.Storage.SiteLocalConfig import loadSiteLocalConfig
>>> import os
>>> os.environ['SITECONFIG_PATH'] = '/data/srv/wmagent/current/logs'
>>> sLCfg = loadSiteLocalConfig()
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/Storage/RucioFileCatalog.py", line 238, in rseName
    jsElements = json.load(jsonFile)
  File "/usr/local/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/local/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 45 column 24 (char 1418)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/WMCore/Storage/SiteLocalConfig.py", line 53, in loadSiteLocalConfig
    config = SiteLocalConfig(actualPath)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Storage/SiteLocalConfig.py", line 100, in __init__
    self.read()
  File "/usr/local/lib/python3.8/site-packages/WMCore/Storage/SiteLocalConfig.py", line 143, in read
    nodeResult = nodeReader(node)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Storage/SiteLocalConfig.py", line 208, in nodeReader
    processor.send((report, node))
  File "/usr/local/lib/python3.8/site-packages/WMCore/Storage/SiteLocalConfig.py", line 224, in processNode
    target.send((report, child))
  File "/usr/local/lib/python3.8/site-packages/WMCore/Storage/SiteLocalConfig.py", line 248, in processSite
    targets['stage-out'].send((report, subnode))
  File "/usr/local/lib/python3.8/site-packages/WMCore/Storage/SiteLocalConfig.py", line 310, in processStageOut
    localReport['phedex-node'] = rseName(report["siteName"], subSiteName, aStorageSite, aVolume)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Storage/RucioFileCatalog.py", line 242, in rseName
    raise RuntimeError(msg)
RuntimeError: RucioFileCatalog.py:rseName() Error reading storage.json: /data/srv/wmagent/2.3.9/logs/storage.json
Expecting property name enclosed in double quotes: line 45 column 24 (char 1418)
>>> 

@stlammel
Author

Hallo Alan,
I am not suggesting reading the JSON partially, but reading only the applicable storage.json files.
Thanks,
cheers, Stephan

@amaltaro
Contributor

Thank you for the confirmation.

I am afraid those logs are not clear enough. I cannot understand why it loaded the T2_US_MIT SITECONF instead of T2_US_Florida. There is no mention at all of the Florida SITECONF in the logs, and the only SITECONF reported in the logs is

RuntimeError: RucioFileCatalog.py:rseName() Error reading storage.json: /cvmfs/cms.cern.ch/SITECONF/T2_US_MIT/storage.json

From what I can tell, even if fallback stage-out has to be used, the job would still only need to load the local site's SITECONF. Perhaps there is something that I am missing that @nhduongvn would be able to clarify?

@stlammel
Author

Hallo Alan,
I suspect that the storage.json of T2_US_Florida was read, and that successful reads
do not result in any printout.
Thanks,
cheers, Stephan

@nhduongvn
Collaborator

nhduongvn commented Feb 12, 2025

OK, now I think I understand what is happening here. When site-local-config.xml is loaded (read), all of the stage-outs defined in the <stage-out> block, for example
https://gitlab.cern.ch/SITECONF/T2_US_Florida/-/blob/master/JobConfig/site-local-config.xml?ref_type=heads#L31
are read/parsed by processStageOut() here:


As you can see in the site-local-config.xml above, these are all the stage-outs listed:
<method volume="Florida_Lustre" protocol="WebDAV"/>
<!-- method site="T2_US_MIT" volume="MIT_HADOOP" protocol="WebDAV"/ -->
<method site="T2_US_Nebraska" volume="Nebraska_CEPH" protocol="WebDAV" command="gfal2"/>
Presumably T2_US_MIT was not commented out at the time of the crash (someone might have turned it off because of the crash), so its information was read/parsed in processStageOut(), and when it tried to read the rseName:
localReport['phedex-node'] = rseName(report["siteName"], subSiteName, aStorageSite, aVolume)

it visited the broken storage.json at T2_US_MIT, and this triggered the crash.

In summary, the storage.json of every stage-out listed in the <stage-out> block is visited when site-local-config.xml is parsed, regardless of whether it is used in the actual stage-out, since at parse time we do not yet know which one will be used.

Do you have suggestions for solutions? (I am thinking about it too. Maybe we could delay rseName until the stage-out is actually used, or skip the Rucio storage element matching check so that rseName is not needed, which would basically retire the phedex-node.)
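
A rough sketch of what delaying the lookup could look like, under stated assumptions: `StageOutEntry` is a hypothetical class, and `lookup` stands in for whatever resolves the RSE name from a storage.json. Neither is taken from WMCore.

```python
from functools import cached_property


class StageOutEntry:
    """One <method> entry whose RSE name is resolved on first use.

    Hypothetical class for illustration; not part of WMCore.
    """

    def __init__(self, site, volume, protocol, lookup):
        self.site = site
        self.volume = volume
        self.protocol = protocol
        self._lookup = lookup  # hypothetical callable(site, volume) -> RSE name

    @cached_property
    def rse_name(self):
        # Runs only when a stage-out to this entry is actually attempted,
        # so a broken storage.json of an unused entry is never opened.
        return self._lookup(self.site, self.volume)
```

Parsing site-local-config.xml would then only build the entries; a broken storage.json would surface as an exception from `entry.rse_name` at the moment that entry is tried, and not before.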

Stephan is right. The T2_US_Florida storage.json is fine, so there is no printout (unless we want to add a printout of which storage.json is being read).

@nhduongvn
Collaborator

nhduongvn commented Feb 12, 2025

How about wrapping exception handling around rseName, printing a warning if an exception occurs, setting localReport['phedex-node'] to None, and moving on? This is the simplest solution, I think.
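
A minimal sketch of such a wrapper. `rse_name_or_none` is a hypothetical name; the only thing taken from the tracebacks above is that rseName raises RuntimeError when storage.json cannot be read.

```python
import logging


def rse_name_or_none(lookup, *args):
    """Call `lookup(*args)`; on RuntimeError log a warning and return None.

    Hypothetical wrapper for illustration; `lookup` would be rseName.
    """
    try:
        return lookup(*args)
    except RuntimeError as exc:
        logging.warning("Could not resolve RSE name for %s: %s", args, exc)
        return None
```

In processStageOut this might be used as `localReport['phedex-node'] = rse_name_or_none(rseName, report["siteName"], subSiteName, aStorageSite, aVolume)`, assuming every later consumer of `phedex-node` can cope with a None value.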

@stlammel
Author

Hallo Duong,
i don't know what is needed by FJR. Maybe loading JobConfig/site-local-config.xml is an overkill / too general approach? But yes, protecting loading of JobConfig/site-local-config.xml from bad storage.json files makes sense to me.
Thanks,
cheers, Stephan

@amaltaro
Contributor

I fear that setting phedex-node=None can cause unexpected and/or hidden issues downstream.
A potentially better compromise would be:
a) if the JSON file is broken for local stage out, fail the job right away
b) if a broken JSON is found for a remote stage out, then we should completely discard that remote stage out configuration.

What do you think?
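
The a)/b) proposal above could be sketched as follows. `filter_stage_outs` and the `is_json_valid` callable are hypothetical names introduced for illustration, and the sketch assumes a stage-out counts as local when its site equals the job's own site.

```python
def filter_stage_outs(local_site, methods, is_json_valid):
    """Apply the a)/b) policy to a list of (site, volume) stage-out methods.

    `is_json_valid(site)` is a hypothetical callable reporting whether that
    site's storage.json can be decoded.
    """
    kept = []
    for site, volume in methods:
        if is_json_valid(site):
            kept.append((site, volume))
        elif site == local_site:
            # a) broken JSON for the local stage-out: fail right away
            raise RuntimeError(f"broken storage.json for local site {site}")
        # b) broken JSON for a remote stage-out: discard this entry
    return kept
```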

@stlammel
Author

Hallo Alan,
i don't see the reason to fail the job: If local stage-out is broken, cmsRun went to the fallback, so the primary stage-out information is irrelevant. FJR should just do the same, no?
Thanks,
cheers, Stephan

@amaltaro
Contributor

Yes, I think you are right - which also proves my lack of understanding of the SITECONF schema.

Taking as example the file pointed out by Duong above:

    <stage-out>
       <method volume="Florida_Lustre" protocol="WebDAV"/>
       <!-- method site="T2_US_MIT" volume="MIT_HADOOP" protocol="WebDAV"/ -->
       <method site="T2_US_Nebraska" volume="Nebraska_CEPH" protocol="WebDAV" command="gfal2"/>
    </stage-out>

how do I know which one is the "local" stage out? Is it the first element in this stage-out node?
If I recall it correctly, lack of site name in this stage-out section defaults to the site name defined at the top of the site-local-config.xml, right? Which means, this is the way the runtime job knows where to load the storage.json file from.
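
To make the defaulting rule concrete, here is a small sketch that parses the snippet above with the standard library. `stage_out_sites` is a hypothetical helper, and the rule it applies (a missing `site` attribute means the local site) is the recollection stated above, not something verified against WMCore.

```python
import xml.etree.ElementTree as ET

STAGE_OUT_XML = """
<stage-out>
   <method volume="Florida_Lustre" protocol="WebDAV"/>
   <!-- method site="T2_US_MIT" volume="MIT_HADOOP" protocol="WebDAV"/ -->
   <method site="T2_US_Nebraska" volume="Nebraska_CEPH" protocol="WebDAV" command="gfal2"/>
</stage-out>
"""


def stage_out_sites(xml_text, local_site):
    """Return (site, volume) per <method>, defaulting `site` to the local site."""
    root = ET.fromstring(xml_text)
    return [
        (method.get("site", local_site), method.get("volume"))
        for method in root.findall("method")
    ]


print(stage_out_sites(STAGE_OUT_XML, "T2_US_Florida"))
```

The commented-out MIT entry is skipped by the parser, so this prints `[('T2_US_Florida', 'Florida_Lustre'), ('T2_US_Nebraska', 'Nebraska_CEPH')]`. Each site name then determines which SITECONF directory the corresponding storage.json is loaded from.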

Back to the local vs remote stage out. If the Florida storage.json is broken (not JSON compliant), should we:
a) assume Nebraska_CEPH as being the local stage out
b) not have any local stage out and just play with the remote/fallback one
c) or else?

Knowing the expected behavior will help us understand how to resolve this in the WMAgent runtime.

@nhduongvn
Collaborator

nhduongvn commented Feb 18, 2025

I am not sure whether it is important to distinguish between local and fallback stage-out downstream (in the WMAgent runtime?). Even though in my code refactoring I kept local and fallback, to reuse as much old code as possible and maintain compatibility elsewhere, the roles of local and fallback are not that important for stage-out, because in the end they are tried one by one starting from the first. There are cases, such as subsites, where the local site does not have its own storage, so all stage-outs are fallback/remote. Therefore, unless the difference between local and fallback/remote is important somewhere downstream, I would suggest that we check the validity of the storage.json of every stage-out in processStageOut and simply skip a stage-out if its corresponding storage.json is broken (together with a printout for debugging purposes).

However, if we always expect that the first stage-out (which is local in most cases) must succeed, we should fail the job if its storage.json is broken.

Could you let me know which option sounds better to you?

  1. Fail jobs if the storage.json of the first (local) stage-out is broken
  2. Skip any stage-out whose storage.json is broken

@nhduongvn
Collaborator

I read Alan's post more carefully. If you think that the local and fallback roles are important for resolving issues in the WMAgent runtime, I can try option 1 from my post above (fail jobs if the storage.json of the first (local) stage-out is broken). Do we want to proceed in this direction?

@stlammel
Author

Hallo Alan,
yes, as Duong wrote there is no local/fallback anymore but an ordered list that is
being tried. The first one that works will/should be taken. Any "broken JSON" ones
can/should be skipped like failed stage-out/copy. If none succeed, fail the job.
The default in case of no site entry is the current site.
Thanks,
- Stephan
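
The ordered-list behaviour described in this comment could be sketched as below. `first_working_stage_out` and the `attempt` callable are hypothetical names; the sketch assumes that a broken storage.json surfaces as an exception from `attempt`, just like a failed copy would.

```python
def first_working_stage_out(methods, attempt):
    """Try each stage-out in order and return the first one that succeeds.

    `attempt(method)` is a hypothetical callable that raises on any failure,
    including an undecodable storage.json.
    """
    errors = []
    for method in methods:
        try:
            attempt(method)
            return method
        except Exception as exc:  # broken JSON is treated like a failed copy
            errors.append(f"{method}: {exc}")
    # Nothing in the ordered list worked, so the job has to fail.
    raise RuntimeError("all stage-outs failed: " + "; ".join(errors))
```

Collecting the per-entry errors keeps the final failure message useful for debugging, since it names every stage-out that was skipped and why.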

4 participants