fix: atomic JSON writes for pipeline outputs #43
Draft
joshbouncesecurity wants to merge 2 commits into knostic:master
Wrap the three final pipeline output writes (results.json, enhanced_dataset.json, results_verified.json) in a new atomic_write_json() helper that writes to a same-directory temp file, fsyncs, then os.replaces it onto the target. If a crash, power loss, or KeyboardInterrupt occurs mid-write, the previous output file is preserved intact rather than left truncated. Same-directory temp is required for os.replace to be atomic on Windows (cross-volume rename falls back to copy+delete, losing atomicity). Addresses item 20 of #16.
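The write sequence described above (same-directory temp file, fsync, then os.replace) can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the `indent` parameter, the `.tmp-` prefix, and the cleanup policy are guesses, and the real helper in utilities/atomic_io.py may differ.

```python
import json
import os
import tempfile


def atomic_write_json(path, data, *, indent=2):
    """Write JSON so that `path` holds either the old or the new content."""
    # Same-directory temp file keeps source and target on one filesystem,
    # which os.replace needs in order to swap the file in a single step.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, ensure_ascii=False, indent=indent)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        # BaseException also covers KeyboardInterrupt; the target is untouched
        # at this point, so only the temp file needs cleaning up.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A call site would then swap `json.dump(results, open(path, "w"))` for `atomic_write_json(path, results)`. A failed serialization leaves the earlier file byte-for-byte unchanged.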
tempfile.mkstemp creates files with mode 0600 (owner-only). After os.replace the target inherits those tightened bits, silently regressing the permissions a plain open(path, "w") would have produced under umask. Restore umask-derived 0666 & ~umask on POSIX (no-op on Windows). Adds a POSIX-gated test pinning the behaviour.
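One way to restore the umask-derived mode is sketched below. The function names `current_umask` and `restore_default_mode` are hypothetical and chosen only for illustration; the PR may apply the mode to the temp file before os.replace rather than to the target afterwards.

```python
import os
import tempfile


def current_umask():
    """Return the process umask.

    os.umask can only be read by setting it, so set a value and put the
    original back immediately. This is not thread-safe.
    """
    mask = os.umask(0)
    os.umask(mask)
    return mask


def restore_default_mode(path):
    """Give `path` the mode a plain open(path, "w") would produce: 0o666 & ~umask."""
    if os.name != "posix":
        return None  # no-op on Windows
    mode = 0o666 & ~current_umask()
    os.chmod(path, mode)
    return mode
```

With a typical umask of 0o022, a file created by tempfile.mkstemp starts at 0o600 and ends at 0o644 after `restore_default_mode`.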
Manual verification
Local test results
Test script (round-trip, non-ASCII, no-stray-tempfile, preserve-on-failure, implementation-shape):
from utilities.atomic_io import atomic_write_json
# 1. Round-trip: a dict written and read back compares equal (assert equality)
# 2. Non-ASCII payload {"jp": "<JP characters>"} written with ensure_ascii=False:
#    raw bytes contain UTF-8 sequences, no \uXXXX escapes
# 3. After a successful write, the target dir contains only out.json (no stray .tmp-*)
# 4. Trigger TypeError mid-write (non-serializable value): existing target is byte-for-byte
#    unchanged, no stray temp file left behind
# 5. inspect.getsource confirms the tempfile.mkstemp(dir=directory) + os.replace pattern
Summary
Wraps the three final pipeline output writes (results.json, enhanced_dataset.json, results_verified.json) in a new atomic_write_json() helper: temp file in the same directory + fsync + os.replace. If a crash or power loss occurs mid-write, the previous output file is preserved intact rather than being left truncated.
Upstream currently uses plain json.dump() with open(), which can corrupt multi-hour scan results on interrupt.
Same-directory temp is load-bearing on Windows: os.replace is only atomic when source and target sit on the same volume; cross-volume falls back to copy+delete and loses atomicity.
Addresses item 20 from #16 (does not close the issue).
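The failure mode of the plain json.dump() with open() pattern can be reproduced with a small script. This is a sketch with made-up data: a non-serializable value stands in for a crash or interrupt, and the file name and payload are illustrative only.

```python
import json
import os
import tempfile

directory = tempfile.mkdtemp()
path = os.path.join(directory, "results.json")

# A previous, complete run.
with open(path, "w", encoding="utf-8") as handle:
    json.dump({"findings": [1, 2, 3]}, handle)

# A later run that fails mid-write. open(..., "w") truncates the file before
# json.dump emits anything, so the earlier results are already gone.
try:
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"findings": [1, 2, 3], "bad": object()}, handle)
except TypeError:
    pass

with open(path, encoding="utf-8") as handle:
    leftover = handle.read()

# Check whether what is left on disk still parses as JSON.
try:
    json.loads(leftover)
    recoverable = True
except json.JSONDecodeError:
    recoverable = False

print("recoverable:", recoverable)
```

The second write destroys the first run's output even though it never completes, which is the case the atomic helper is meant to prevent.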
Test plan
[…] ensure_ascii=False Unicode).
[…] .tmp- files left on failure.