
fix: atomic JSON writes for pipeline outputs #43

Draft
joshbouncesecurity wants to merge 2 commits into knostic:master from joshbouncesecurity:fix/issue16-20-atomic-json

@joshbouncesecurity
Contributor

Summary

Wraps the three final pipeline output writes (results.json, enhanced_dataset.json, results_verified.json) in a new atomic_write_json() helper: write to a temp file in the same directory, fsync, then os.replace onto the target. If a crash or power loss occurs mid-write, the previous output file is preserved intact rather than being left truncated.

Upstream currently uses plain json.dump() with open(), which truncates the target as soon as it is opened for writing, so an interrupt mid-write can corrupt multi-hour scan results.

Same-directory temp is load-bearing on Windows: os.replace is only atomic when source and target sit on the same volume. A cross-volume os.replace can fail outright with OSError, and a copy+delete fallback (as shutil.move does) loses atomicity.

Addresses item 20 from #16 (does not close the issue).
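
For readers without the diff open, the pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the signature, the `ensure_ascii`/`indent` keyword defaults, and the `.tmp-` prefix are assumptions.

```python
import json
import os
import tempfile


def atomic_write_json(path, data, *, ensure_ascii=False, indent=2):
    """Write data as JSON so readers see the old file or the new one, never a partial one.

    Illustrative sketch only; parameter names and defaults are assumptions.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the target so os.replace stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, ensure_ascii=ensure_ascii, indent=indent)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # swap the finished file onto the target
    except BaseException:
        # Also covers KeyboardInterrupt: drop the temp file, leave the target alone.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Catching BaseException rather than Exception is a deliberate choice in this sketch so that Ctrl+C during serialization still cleans up the temp file before re-raising.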

Test plan

  • Helper round-trip preserves data (incl. ensure_ascii=False Unicode).
  • Mid-write failure leaves the existing file intact.
  • Mid-write failure does not create a target if none existed.
  • No stray .tmp- files left on failure.
  • Temp file is in the same directory as the target.
  • Existing tests pass (113 collected, 91 passed, 22 skipped — go/js parser binaries unavailable in this env).
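
The failure-path items in the test plan could be exercised along these lines. This is a hedged sketch rather than the contents of the PR's test file: `write_json_via_temp` is a hypothetical stand-in for the helper, and `check_failure_preserves_target` is a made-up name.

```python
import json
import os
import tempfile


def write_json_via_temp(path, data):
    """Hypothetical stand-in for the PR's helper: same-directory temp file + os.replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, ensure_ascii=False)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def check_failure_preserves_target(write):
    """Return True if a failed write leaves the old file intact and no temp debris."""
    with tempfile.TemporaryDirectory() as scan_dir:
        target = os.path.join(scan_dir, "results.json")
        write(target, {"status": "good"})
        with open(target, "rb") as handle:
            before = handle.read()
        try:
            # object() is not JSON-serializable, so json.dump raises part-way through.
            write(target, {"ok": 1, "bad": object()})
        except TypeError:
            pass
        else:
            return False  # the write was expected to fail
        with open(target, "rb") as handle:
            after = handle.read()
        return before == after and os.listdir(scan_dir) == ["results.json"]


print(check_failure_preserves_target(write_json_via_temp))
```

Passing a writer that does a plain open(path, "w") plus json.dump to the same check should return False, because the target is truncated before the serialization error surfaces.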

Wrap the three final pipeline output writes (results.json,
enhanced_dataset.json, results_verified.json) in a new
atomic_write_json() helper that writes to a same-directory temp file,
fsyncs, then os.replaces it onto the target. If a crash, power loss, or
KeyboardInterrupt occurs mid-write, the previous output file is
preserved intact rather than left truncated.

Same-directory temp is required for os.replace to be atomic on Windows
(a cross-volume os.replace can fail with OSError, and a copy+delete
fallback loses atomicity).

Addresses item 20 of #16.

tempfile.mkstemp creates files with mode 0600 (owner-only). After
os.replace the target inherits those tightened bits, silently
regressing the permissions a plain open(path, "w") would have
produced under umask. Restore umask-derived 0666 & ~umask on POSIX
(no-op on Windows). Adds a POSIX-gated test pinning the behaviour.
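
The permission fix described in this commit message can be illustrated as below. It is a sketch under assumptions: `current_umask` and `restore_default_mode` are hypothetical names, and the real helper may read the umask differently.

```python
import os
import stat
import tempfile


def current_umask():
    """Read the process umask. os.umask only sets-and-returns, so put it back at once."""
    mask = os.umask(0)
    os.umask(mask)
    return mask


def restore_default_mode(path):
    """Give path the mode a plain open(path, "w") would have produced; skip off POSIX."""
    if os.name != "posix":
        return
    os.chmod(path, 0o666 & ~current_umask())


fd, tmp_path = tempfile.mkstemp()
os.close(fd)
print(oct(stat.S_IMODE(os.stat(tmp_path).st_mode)))  # 0o600 on POSIX (mkstemp default)
restore_default_mode(tmp_path)
print(oct(stat.S_IMODE(os.stat(tmp_path).st_mode)))  # depends on umask, e.g. 0o644 under 022
os.unlink(tmp_path)
```

One caveat with this approach: the set-and-restore read of the umask briefly changes a process-wide setting, so it is not safe against other threads creating files at that instant.
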
@joshbouncesecurity
Contributor Author

Manual verification

  • Run a small openant scan to completion: results.json, enhanced_dataset.json, results_verified.json all exist and parse as valid JSON.
  • Atomicity smoke test: start a scan, interrupt the process (Ctrl+C) during the analyze stage. Re-inspect results.json: it should be either the previous good version or non-existent, never truncated or corrupt.
  • Permissions (POSIX): confirm output file mode matches umask (typically 0644), not 0600.
  • Same-dir temp: while a write is happening, a hidden *.tmp file appears in the SAME directory as the target (use watch ls -la <scan-dir> or lsof). Cross-device renames aren't atomic.
  • Unicode: a finding containing non-ASCII text (e.g., a Japanese string) round-trips through results_verified.json without escape sequences (ensure_ascii=False preserved).
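
The Unicode item can be sanity-checked without running a scan. The snippet below is a small illustration using only the standard json module; the sample finding is made up for the example.

```python
import json

# Hypothetical finding with non-ASCII text.
finding = {"message": "日本語のテキスト"}

escaped = json.dumps(finding)                        # default ensure_ascii=True
preserved = json.dumps(finding, ensure_ascii=False)  # what the pipeline is expected to keep

print("\\u" in escaped)    # True: characters become \uXXXX escapes
print("\\u" in preserved)  # False: characters are kept as-is

# Both forms decode to the same data; only the on-disk representation differs.
print(json.loads(escaped) == json.loads(preserved) == finding)  # True

# Written with encoding="utf-8", the preserved form holds raw UTF-8 bytes.
raw = preserved.encode("utf-8")
print(b"\\u" in raw)       # False
```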

@joshbouncesecurity
Contributor Author

Local test results

Exercised utilities.atomic_io.atomic_write_json directly on Windows from this branch.

Test script (round-trip, non-ASCII, no-stray-tempfile, preserve-on-failure, implementation-shape):

from utilities.atomic_io import atomic_write_json
# 1. Dict round-trips identically (assert equality)
# 2. Non-ASCII payload {"jp": "<JP characters>"} written with ensure_ascii=False:
#    raw bytes contain UTF-8 sequences, no \uXXXX escapes
# 3. After successful write, target dir contains only out.json (no stray .tmp-*)
# 4. Trigger TypeError mid-write (non-serializable value): existing target byte-for-byte
#    unchanged, no stray temp file left behind
# 5. inspect.getsource confirms tempfile.mkstemp(dir=directory) + os.replace pattern

Output:

OK round-trip
OK non-ASCII preserved as UTF-8 (no \uXXXX escapes)
Files in target dir: ['out.json']
OK no stray temp files
OK target preserved on failure, no stray temp file
OK helper uses tempfile.mkstemp(dir=...) + os.replace

Outcome:

  • Round-trip preserves data ✅
  • ensure_ascii=False non-ASCII round-trips as UTF-8 (no \uXXXX escapes) ✅
  • Mid-write failure leaves existing target intact, no stray .tmp-* files ✅
  • Same-directory temp + os.replace pattern verified by reading the helper source ✅
  • Did not run a real openant scan with a Ctrl+C mid-write, since that would have required a paid LLM call. The unit-level checks above plus the in-tree tests (tests/test_atomic_io.py per the diff) cover the same code path.
  • POSIX umask path: I'm on Windows so the os.name == "posix" chmod branch wasn't exercised; no os.chmod is invoked on Windows by design (the helper notes this).
