FJR crash on bad storage.json #12255
Comments
One thing to investigate on the WM side is whether we can run SiteLocalConfig over only the site that is intended to be used, rather than over all the sites. For example, for this failed job only Florida is meant to be used, but SiteLocalConfig fails when checking the MIT config JSON.
@stlammel thank you for creating this issue.
I do not think this is possible. If a JSON document is broken, we cannot partially read it; the whole operation is doomed to fail. With that said, is there anything else that you think should be changed in WMCore? A friendlier error message, or something else? For the record, here is how one can reproduce this error:
Hello Alan,
Thank you for the confirmation. I am afraid those logs are not clear enough. I cannot understand why it loaded the T2_US_MIT SITECONF instead of T2_US_Florida. There is no mention at all of the Florida SITECONF in the logs, and the only SITECONF reported in the logs is
From what I can tell, even if fallback stage-out has to be used, the job would still only need to load the local site SITECONF. Perhaps there is something I am missing that @nhduongvn would be able to clarify?
Hello Alan,
OK, now I think I understand what is happening here. When site-local-config.xml is loaded (read), all of the stage-outs defined in
As you can see in the site-local-config.xml above, these are all the stage-outs listed:
<method volume="Florida_Lustre" protocol="WebDAV"/>
<!-- method site="T2_US_MIT" volume="MIT_HADOOP" protocol="WebDAV"/ -->
<method site="T2_US_Nebraska" volume="Nebraska_CEPH" protocol="WebDAV" command="gfal2"/>
Presumably the T2_US_MIT entry was not commented out when the job ran (someone might have turned it off afterwards because of the crash), so its information was read and parsed in processStageOut(), and when the code tried to read the rseName:
it visited the broken storage.json of T2_US_MIT, and this triggered the crash.
In summary, all the storage.json of all stage out listed in
Do you have suggestions for solutions? (I am thinking about it too: maybe delay rseName until the stage-out is actually used, or skip the Rucio storage element matching check so that rseName is not needed. Basically retire the
Stephan is right. The T2_US_Florida storage.json is fine, so there is no printout (unless we want to add a printout of which storage.json is being read).
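The "delay rseName" idea could look roughly like the sketch below. It is not WMCore code: the class `LazyStageOut`, its attributes, and the assumed storage.json layout (a list of objects with "volume" and "rse" keys) are hypothetical and only illustrate deferring the read until the value is actually needed.

```python
import json
from functools import cached_property


class LazyStageOut:
    """Hypothetical stand-in for one <method .../> entry of site-local-config.xml."""

    def __init__(self, site, volume, storage_json_path):
        # Nothing is read here, so a broken storage.json cannot break config loading.
        self.site = site
        self.volume = volume
        self.storage_json_path = storage_json_path

    @cached_property
    def rse(self):
        # The file is opened on first access only; a successful result is cached.
        with open(self.storage_json_path, encoding="utf-8") as handle:
            entries = json.load(handle)
        if not isinstance(entries, list):
            return None
        # Assumed layout: a list of objects carrying "volume" and "rse" keys.
        for entry in entries:
            if isinstance(entry, dict) and entry.get("volume") == self.volume:
                return entry.get("rse")
        return None
```

With this shape, a job that never stages out to a given site never opens that site's storage.json, and a decoding error only surfaces at the point where the stage-out is really attempted.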
How about wrapping exception handling around rseName, printing a warning if an exception occurs, and setting
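A minimal sketch of that wrapping, under stated assumptions: `safe_rse_name` is a hypothetical helper, not the real `rseName()` signature, and the storage.json layout (a list of objects with "volume" and "rse" keys) is assumed for illustration. Returning None on failure is one possible choice of fallback value.

```python
import json
import logging


def safe_rse_name(storage_json_path, volume):
    """Return the RSE for `volume`, or None if the file is unreadable or invalid."""
    try:
        with open(storage_json_path, encoding="utf-8") as handle:
            entries = json.load(handle)
    except (OSError, json.JSONDecodeError) as exc:
        # Warn and continue instead of aborting the whole site-local-config read.
        logging.warning("Ignoring unusable %s: %s", storage_json_path, exc)
        return None
    if not isinstance(entries, list):
        logging.warning("Unexpected top-level type in %s", storage_json_path)
        return None
    # Assumed layout: a list of objects carrying "volume" and "rse" keys.
    for entry in entries:
        if isinstance(entry, dict) and entry.get("volume") == volume:
            return entry.get("rse")
    return None
```

Callers would then have to treat None as "RSE unknown" rather than as a valid name, which is exactly the downstream question this approach raises.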
Hello Duong,
I fear that setting What do you think?
Hello Alan,
Yes, I think you are right - which also proves my lack of understanding of the SITECONF schema. Taking as example the file pointed out by Duong above:
how do I know which one is the "local" stage-out? Is it the first element in this
Back to the local vs remote stage-out. If the Florida storage.json is broken (not JSON compliant), should we:
The expected behavior will help us understand how to resolve this in the WMAgent runtime.
I am not sure whether it is important to distinguish between local and fallback stage-out downstream (in the WMAgent runtime?). In my code refactoring I still keep local and fallback, in order to reuse as much old code as possible and maintain compatibility elsewhere, but the roles of local and fallback are not that important for stage-out, because in the end they will be tried one by one, starting from the first one. There are cases, such as subsites, where local sites do not have their own storage, so all stage-outs are fallback/remote. Therefore, unless the difference between local and fallback/remote is important somewhere downstream, I would suggest that we check the validity of the storage.json of all stage-outs in However, if we always expect that the first stage-out (which is local in most cases) must succeed, we should fail the jobs if its storage.json is broken. Could you let me know which option sounds better to you?
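To make the two behaviours concrete, here is a rough sketch. The function `usable_stage_outs` and its `first_must_be_valid` flag are hypothetical names, and the "skip broken entries" branch is only one reading of the alternative discussed above; the input is simply the ordered list of storage.json paths, in the order the stage-outs would be tried.

```python
import json


def usable_stage_outs(storage_json_paths, first_must_be_valid=False):
    """Return the paths whose storage.json parses, preserving the original order."""
    usable = []
    for index, path in enumerate(storage_json_paths):
        try:
            with open(path, encoding="utf-8") as handle:
                json.load(handle)
        except (OSError, json.JSONDecodeError) as exc:
            if first_must_be_valid and index == 0:
                # Strict policy: a broken first (usually local) stage-out fails the job.
                raise RuntimeError(f"First stage-out has an unusable storage.json: {path}") from exc
            # Lenient policy: drop the broken entry and keep trying the others.
            continue
        usable.append(path)
    if not usable:
        raise RuntimeError("No stage-out has a usable storage.json")
    return usable
```

In the failing HammerCloud job, the lenient policy would have kept Florida and Nebraska and dropped MIT; the strict policy would also have passed, because the first entry (Florida) was fine.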
I read Alan's post more carefully. If you think that the local and fallback roles are important for resolving issues in the WMAgent runtime, I can try option 1 from my post above (fail jobs if the storage.json of the first (local) stage-out is broken). Do we want to proceed in this direction?
Hello Alan,
The framework job report crashes in case a site has a fallback reference to a site
with a "broken" SITECONF/storage.json.
I assume FJR accesses the fallback's site storage.json for fallback stage-out information.
In the HammerCloud example attached, input data were read locally and there was no
fallback/no stage out.
It seems to be Python, so it would be good to put the read/JSON decoding into a try and
ignore errors if the information is not needed (or skip the read/decode altogether if not needed). Given
that SITECONF is on CVMFS, I would catch-and-continue on any read/decoding errors
in general and only raise an error in case information is missing and needed.
Impact of the bug
All jobs at sites that use a given storage.json, whenever that file is bad.
Describe the bug
see above
How to reproduce it
corrupt JSON syntax
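For illustration only, a small standalone sketch of what corrupt JSON syntax produces. The file content is made up (it is not the actual T2_US_MIT storage.json), and `try_load` is a hypothetical helper; the sketch only shows the same class of "Expecting value" error that appears in the log below.

```python
import json
import os
import tempfile

# Hypothetical content: the value after "rse" is missing, so the parser finds a comma
# where it expects a value.
BROKEN = '{\n  "site": "T2_XX_Example",\n  "rse": ,\n}\n'


def try_load(path):
    """Return (data, error_message); exactly one of the two is None."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle), None
    except json.JSONDecodeError as exc:
        return None, f"{exc.msg}: line {exc.lineno} column {exc.colno} (char {exc.pos})"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "storage.json")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(BROKEN)
    data, error = try_load(path)

print(error)
```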
Expected behavior
see above
Additional context and error message
https://cmsweb.cern.ch/scheddmon/0197/sciaba/250208_103551:sciaba_crab_HC-205-T2_US_Florida-110860-20250208113502/job_out.105.1.txt
Sanitize FJR
Job Exit Code from FrameworkJobReport.xml: 0
CONDITION FOR REPORTING READ BRANCHES WAS FALSE
==== Job Exit Code from FrameworkJobReport.xml and Application exit code: 0 ====
==== Checksum computation STARTING at Sat Feb 8 20:15:29 2025 UTC ====
==== Checksum FINISHED at Sat Feb 8 20:15:30 2025 UTC ====
== FileName: b.root - FileAdler32: 3c6aa3b7- FileSize: 97.70085906982422.3f MBytes
== Adding PSet Hash for filename: b.root
==== PSet Hash computation STARTING at Sat Feb 8 20:15:30 2025 UTC ====
==== PSet Hash computation FINISHED at Sat Feb 8 20:15:31 2025 UTC ====
== edmProvDump pset hash d3610cbfc09efdad6393a7de796d6d85
WARNING: Unable to parse WMCore's jobReport.json; FJR will not be useful.
Traceback (most recent call last):
File "CMSRunAnalysis.py", line 240, in handleException
report = json.load(fh)
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ERROR: Exceptional exit at Sat Feb 8 20:15:31 2025 UTC 50115: Exception while handling the job report.
ERROR: Traceback follows:
Traceback (most recent call last):
File "/srv/WMCore.zip/WMCore/Storage/RucioFileCatalog.py", line 238, in rseName
jsElements = json.load(jsonFile)
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 42 column 15 (char 1370)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "CMSRunAnalysis.py", line 852, in <module>
slCfg = SiteLocalConfig.loadSiteLocalConfig()
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 53, in loadSiteLocalConfig
config = SiteLocalConfig(actualPath)
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 100, in __init__
self.read()
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 143, in read
nodeResult = nodeReader(node)
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 208, in nodeReader
processor.send((report, node))
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 224, in processNode
target.send((report, child))
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 248, in processSite
targets['stage-out'].send((report, subnode))
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 310, in processStageOut
localReport['phedex-node'] = rseName(report["siteName"], subSiteName, aStorageSite, aVolume)
File "/srv/WMCore.zip/WMCore/Storage/RucioFileCatalog.py", line 242, in rseName
raise RuntimeError(msg)
RuntimeError: RucioFileCatalog.py:rseName() Error reading storage.json: /cvmfs/cms.cern.ch/SITECONF/T2_US_MIT/storage.json
Expecting value: line 42 column 15 (char 1370)
ERROR: Failed to record execution site name in the FJR from the site-local-config.xml
Traceback (most recent call last):
File "/srv/WMCore.zip/WMCore/Storage/RucioFileCatalog.py", line 238, in rseName
jsElements = json.load(jsonFile)
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 42 column 15 (char 1370)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "CMSRunAnalysis.py", line 274, in handleException
sLCfg = SiteLocalConfig.loadSiteLocalConfig()
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 53, in loadSiteLocalConfig
config = SiteLocalConfig(actualPath)
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 100, in __init__
self.read()
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 143, in read
nodeResult = nodeReader(node)
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 208, in nodeReader
processor.send((report, node))
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 224, in processNode
target.send((report, child))
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 248, in processSite
targets['stage-out'].send((report, subnode))
File "/srv/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 310, in processStageOut
localReport['phedex-node'] = rseName(report["siteName"], subSiteName, aStorageSite, aVolume)
File "/srv/WMCore.zip/WMCore/Storage/RucioFileCatalog.py", line 242, in rseName
raise RuntimeError(msg)
RuntimeError: RucioFileCatalog.py:rseName() Error reading storage.json: /cvmfs/cms.cern.ch/SITECONF/T2_US_MIT/storage.json
Expecting value: line 42 column 15 (char 1370)
== The job had an exit code of 195
======== CMSRunAnalysis.py FINISHING at Sat Feb 8 20:15:31 GMT 2025 ========