Bug Report: NPZ/NPY Reader Issues — CRC Overhead, Memory Copies, O_DIRECT Decode-All, eval() Safety
Repo: dlio_benchmark
Date: March 2026
Severity: Medium–High (performance regression for all NPZ filesystem/S3 readers; correctness concern for O_DIRECT; security concern in O_DIRECT parser)
Affects: NPZ and NPY readers across filesystem, S3-simple, and O_DIRECT paths
Background
A review of all NPZ/NPY reader implementations against an expected behavior table
(covering CRC verification, member materialization, allocation count, and I/O API)
revealed four distinct issues. The S3-iterable path (NPZReaderS3Iterable /
NPYReaderS3Iterable via _S3IterableMixin) is unaffected — it never calls
np.load() and never decodes numpy.
Related Bug
This report expands on issue #223 with additional detail and a proposed solution.
Issue 1 — np.load() CRC-32 Cannot Be Disabled; No Bypass Exists
Files affected
- dlio_benchmark/reader/npz_reader.py (NPZReader.open)
- dlio_benchmark/reader/npz_reader_s3.py (NPZReaderS3.open)
Description
Both NPZReader and NPZReaderS3 call np.load() to read NPZ files:

    # npz_reader.py
    return np.load(filename, allow_pickle=True)['x']

    # npz_reader_s3.py
    return np.load(io.BytesIO(data), allow_pickle=True)['x']

np.load() opens NPZ files through Python's zipfile module, which reads each
member via zipfile.ZipExtFile. That class updates a running CRC-32 on every
read() call and compares it with the stored value once the member has been
fully read. Neither np.load() nor zipfile exposes a parameter such as
verify_crc=False to turn this off.

Confirmed from zipfile.ZipExtFile source:

    def _update_crc(self, newdata):
        if self._expected_crc is None:
            return
        self._running_crc = crc32(newdata, self._running_crc)
        if self._eof and self._running_crc != self._expected_crc:
            raise BadZipFile("Bad CRC-32 for file %r" % self.name)

For a storage benchmark, CRC verification is CPU overhead on every file read
with no benefit: the files are synthetic, generated by the benchmark itself,
and the benchmark discards the decoded data immediately after reading
(DLIO yields self._args.resized_image, not the decoded bytes).
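The always-on check can be observed with the standard library alone. The sketch below is illustrative and not DLIO code; the helper names make_stored_zip and read_member are made up for this example. It builds a small uncompressed archive, flips one payload byte, and shows that zipfile rejects the member at read time.

```python
import io
import zipfile


def make_stored_zip(name: str, payload: bytes) -> bytes:
    """Build an in-memory ZIP archive with one uncompressed member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr(name, payload)
    return buf.getvalue()


def read_member(archive: bytes, name: str) -> bytes:
    """Read one member through zipfile, which verifies its CRC-32."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)


payload = b"A" * 1024
archive = make_stored_zip("x.npy", payload)

# Flip a single byte inside the stored payload.
corrupted = bytearray(archive)
corrupted[archive.index(payload)] ^= 0xFF

try:
    read_member(bytes(corrupted), "x.npy")
    crc_checked = False
except zipfile.BadZipFile:
    crc_checked = True  # zipfile detected the mismatch at end of member
```

The same check runs on every successful read as well; it is only visible here because the data was deliberately damaged.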
Expected behavior
An optimized buffered reader that manually parses the ZIP local-file header and
decompresses the member data without performing CRC verification — as the O_DIRECT
reader (npz_reader_odirect.py) already does via parse_npz() + parse_npy().
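As a rough illustration of what such a reader involves, the sketch below walks ZIP local-file headers by hand and returns one member without computing a CRC. It is a minimal sketch under stated assumptions, not the existing parse_npz(): the function names are hypothetical, entries written with a trailing data descriptor are rejected rather than handled, and ZIP64 sizes are read from the extra field because archives can store 0xFFFFFFFF in the 32-bit size fields.

```python
import io
import struct
import zipfile
import zlib

LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # 30-byte ZIP local file header
ZIP64_MARKER = 0xFFFFFFFF


def _zip64_sizes(extra, csize, usize):
    """Pull 64-bit sizes out of the ZIP64 extra field (header ID 0x0001)."""
    pos = 0
    while pos + 4 <= len(extra):
        field_id, field_len = struct.unpack_from("<HH", extra, pos)
        pos += 4
        if field_id == 0x0001:
            if usize == ZIP64_MARKER:
                (usize,) = struct.unpack_from("<Q", extra, pos)
                pos += 8
            if csize == ZIP64_MARKER:
                (csize,) = struct.unpack_from("<Q", extra, pos)
            return csize, usize
        pos += field_len
    raise ValueError("ZIP64 size field not found")


def read_member_no_crc(buf, target):
    """Return the bytes of member `target`; the stored CRC-32 is ignored."""
    view = memoryview(buf)
    pos = 0
    while pos + LOCAL_HEADER.size <= len(view):
        (sig, _ver, flags, method, _mtime, _mdate, _crc,
         csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(view, pos)
        if sig != b"PK\x03\x04":
            break  # no more local headers (central directory reached)
        if flags & 0x08:
            raise ValueError("data-descriptor entries are not handled here")
        pos += LOCAL_HEADER.size
        name = bytes(view[pos:pos + name_len]).decode("utf-8")
        pos += name_len
        extra = bytes(view[pos:pos + extra_len])
        pos += extra_len
        if ZIP64_MARKER in (csize, usize):
            csize, usize = _zip64_sizes(extra, csize, usize)
        if name == target:
            data = view[pos:pos + csize]
            if method == zipfile.ZIP_STORED:
                return data  # zero-copy slice of the input buffer
            if method == zipfile.ZIP_DEFLATED:
                return zlib.decompress(data, -15)  # raw deflate stream
            raise ValueError(f"unsupported compression method {method}")
        pos += csize  # skip other members without touching their data
    raise KeyError(target)


archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("x.npy", b"payload " * 100)
    zf.writestr("y.npy", b"other")
recovered = read_member_no_crc(archive.getvalue(), "y.npy")
```

Members that do not match are skipped by advancing past their compressed size, so they are neither decompressed nor checksummed.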
Workaround
None for the filesystem and S3-simple paths. The S3-iterable readers
(NPZReaderS3Iterable) already avoid this by not decoding numpy at all.
Issue 2 — npz_reader_odirect.py Decodes ALL Members, Not One
File affected
dlio_benchmark/reader/npz_reader_odirect.py (NPZReaderODIRECT.parse_npz)
Description
parse_npz() iterates the entire ZIP local-file stream, decoding and storing
every member it encounters before returning the requested one:

    def parse_npz(self, mem_view):
        files = {}
        pos = 0
        while pos < len(mem_view):
            local_header_signature = mem_view[pos:pos+4].tobytes()
            if local_header_signature != b'\x50\x4b\x03\x04':
                break
            # ... parse compressed_size, uncompressed_size, filename ...
            compressed_data = mem_view[pos:pos+compressed_size]
            pos += compressed_size
            files[filename] = self.parse_npy(uncompressed_data)  # ← decodes ALL
        return files  # caller picks ["x"] and discards the rest

For DLIO-generated NPZ files that contain exactly one member (x), this is
equivalent to reading one member. However:
- The code is misleadingly documented: the table accompanying this issue
  describes O_DIRECT as reading "exactly one (target)" member, which is only
  accidentally true.
- Any NPZ file with multiple members will have all of them decoded and allocated
  in memory simultaneously, then discarded, which wastes CPU and allocation.
- The loop should break early once the target member is found.
Expected behavior
parse_npz() should accept an optional only_key parameter (defaulting to the
value of DLIO_NPZ_KEY env var, then "x"). When set, the loop breaks
immediately after the target member is parsed:

    def parse_npz(self, mem_view, only_key=None):
        target = only_key or os.environ.get("DLIO_NPZ_KEY", "x")
        files = {}
        pos = 0
        while pos < len(mem_view):
            ...
            files[filename] = self.parse_npy(uncompressed_data)
            if filename == target:
                break  # ← exit early
        return files

Additional fragility
parse_npz() raises ValueError("Unexpected file in npz: {filename}") for any
ZIP entry that does not end in .npy. Standard NPZ files written by NumPy only
contain .npy entries, but this makes the parser unnecessarily brittle. Non-npy
entries should be skipped, not treated as errors.
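The two behaviors requested in this issue, target resolution with early exit and tolerance for non-.npy entries, can be sketched independently of the ZIP parsing. Everything here is hypothetical: resolve_target and load_target are illustrative names, and entries stands in for whatever (member name, raw bytes) pairs a lazy header scan would yield. The sketch assumes member names carry the .npy suffix while the key does not, as in archives written by np.savez. It also goes one step further than the early break shown above by not decoding members that precede the target.

```python
import os


def resolve_target(only_key=None, environ=None):
    """Explicit key first, then DLIO_NPZ_KEY, then the default "x"."""
    env = os.environ if environ is None else environ
    return only_key or env.get("DLIO_NPZ_KEY", "x")


def load_target(entries, decode, only_key=None, environ=None):
    """Decode only the target member from an iterable of (name, raw) pairs."""
    target = resolve_target(only_key, environ)
    for name, raw in entries:
        if not name.endswith(".npy"):
            continue  # skip foreign entries instead of raising
        if name[:-len(".npy")] != target:
            continue  # skip other arrays without decoding them
        return decode(raw)  # early exit: nothing after the target is visited
    raise KeyError(target)


decoded = []


def fake_decode(raw):
    """Stand-in for parse_npy() that records what it was asked to decode."""
    decoded.append(raw)
    return raw.upper()


entries = [("README.txt", b"notes"), ("y.npy", b"yy"),
           ("x.npy", b"xx"), ("z.npy", b"zz")]
result = load_target(entries, fake_decode, environ={})
```

Passing environ explicitly keeps the example independent of the process environment; a real reader would normally rely on the os.environ default.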
Issue 3 — npz_reader_s3.py Allocates Three Copies of Each File
File affected
dlio_benchmark/reader/npz_reader_s3.py (NPZReaderS3.open)
Description
NPZReaderS3.open() has up to three sequential allocations for every file read:

    def open(self, filename):
        data = self.storage.get_data(filename, None)   # copy 1: bytes from S3
        image = io.BytesIO(data)                       # copy 2: BytesIO internal buffer
        return np.load(image, allow_pickle=True)['x']  # copy 3: decompressed ndarray

io.BytesIO(data) generally copies data into a new internal byte buffer rather
than wrapping the existing one. CPython 3.5 and later share the buffer
copy-on-write when data is exactly a bytes object, so whether this step copies
depends on the type storage.get_data() returns and should be confirmed.
In the worst case, peak RSS per file is approximately
3 × file_size_on_wire (plus zipfile's internal decompression workspace).
For a 150 MB file, that is about 450 MB of peak allocation per file per thread.
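Before choosing a replacement, it may help to check which wrappers actually copy on the interpreter in use. The sketch below uses only the standard library; the function names are illustrative. The first check is implementation-dependent and is reported rather than assumed, while the other two follow from the language semantics of bytearray and memoryview.

```python
import io


def bytesio_shares_buffer(data: bytes) -> bool:
    """True if BytesIO hands back the very same bytes object (no copy made)."""
    return io.BytesIO(data).getvalue() is data


def bytearray_is_copy(data) -> bool:
    """Mutate a bytearray built from `data` and confirm `data` is unaffected."""
    buf = bytearray(data)
    buf[0] ^= 0xFF
    return buf[0] != data[0]


def memoryview_is_zero_copy(data) -> bool:
    """A memoryview references the original object instead of copying it."""
    return memoryview(data).obj is data


data = bytes(range(256)) * 4
report = {
    "BytesIO shares bytes buffer": bytesio_shares_buffer(data),
    "bytearray(data) is a copy": bytearray_is_copy(data),
    "memoryview(data) is zero-copy": memoryview_is_zero_copy(data),
}
for label, outcome in report.items():
    print(f"{label}: {outcome}")
```

If the storage layer returns a bytearray or another buffer type instead of bytes, the first check no longer applies and the BytesIO step should be assumed to copy.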
Expected behavior
Use a lower-copy path equivalent to what npz_reader_odirect.py does:
replace io.BytesIO(data) + np.load() with bytearray(data) →
memoryview(buf) → parse_npz(mem_view)["x"]. This removes the BytesIO
layer and the CRC computation (Issue 1) at the same time. Note that
bytearray(data) is itself a full copy of data; it is only needed when the
resulting array must be writable, and memoryview(data) avoids it when a
read-only array is acceptable:

    def open(self, filename):
        data = self.storage.get_data(filename, None)
        buf = bytearray(data)                   # one allocation (same size as data)
        return parse_npz(memoryview(buf))["x"]  # zero-copy ndarray view

The same pattern applies to NPYReaderS3.open():

    # current (2 copies):
    data = self.storage.get_data(filename, None)
    return np.load(io.BytesIO(data), allow_pickle=True)

    # better (1 copy):
    data = self.storage.get_data(filename, None)
    return parse_npy(memoryview(bytearray(data)))

Issue 4 — eval() on File Content in parse_npy() (Security / Correctness)
Files affected
- dlio_benchmark/reader/npy_reader_odirect.py (NPYReaderODirect.parse_npy)
- dlio_benchmark/reader/npz_reader_odirect.py (inherits via NPYReaderODirect)
Description
The NPY header parser uses Python's eval() to parse the NPY file header:

    header_dict = eval(header.decode('latin1'))

The NPY header is a Python literal string (e.g. {'descr': '<f4', 'fortran_order': False, 'shape': (1, 224, 224), }) that NumPy's own loader also parses. For DLIO's use case, reading files generated by the benchmark itself, the content is always safe.

However, eval() on file content allows arbitrary code execution if the file
is ever sourced from an untrusted location. NumPy's own numpy.lib.format
module uses ast.literal_eval() for this parsing, which is safe:

    import ast
    header_dict = ast.literal_eval(header.decode('latin1'))

ast.literal_eval() only evaluates Python literals (dicts, tuples, strings,
ints, bools) and raises an exception (typically ValueError or SyntaxError)
on anything else.
Expected behavior
Replace eval(...) with ast.literal_eval(...) in parse_npy().
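A self-contained sketch of the safer header parse follows. It is not the existing parse_npy(); parse_npy_header and build_npy_v1 are hypothetical helpers written for this example, and only the header is parsed, not the array data. It follows the published NPY layout: a 6-byte magic string, two version bytes, a little-endian header length (2 bytes for version 1, 4 bytes for versions 2 and 3), then the literal dict.

```python
import ast
import struct

MAGIC = b"\x93NUMPY"


def parse_npy_header(buf):
    """Return (header_dict, data_offset) without ever calling eval()."""
    view = memoryview(buf)
    if bytes(view[:6]) != MAGIC:
        raise ValueError("not an NPY file")
    major = view[6]
    if major == 1:
        (header_len,) = struct.unpack_from("<H", view, 8)
        start = 10
    elif major in (2, 3):
        (header_len,) = struct.unpack_from("<I", view, 8)
        start = 12
    else:
        raise ValueError(f"unsupported NPY version {major}")
    encoding = "utf-8" if major == 3 else "latin1"
    text = bytes(view[start:start + header_len]).decode(encoding)
    header = ast.literal_eval(text)  # literals only; calls are rejected
    if not isinstance(header, dict):
        raise ValueError("NPY header is not a dict")
    return header, start + header_len


def build_npy_v1(header_text: str, payload: bytes = b"") -> bytes:
    """Test helper: wrap header text in a version 1.0 NPY preamble."""
    body = header_text.encode("latin1")
    pad = -(10 + len(body) + 1) % 64  # pad so the data starts 64-byte aligned
    body += b" " * pad + b"\n"
    return MAGIC + bytes([1, 0]) + struct.pack("<H", len(body)) + body + payload


good = build_npy_v1(
    "{'descr': '<f4', 'fortran_order': False, 'shape': (2, 3), }", bytes(24))
header, data_offset = parse_npy_header(good)

hostile = build_npy_v1("__import__('os').getcwd()")
try:
    parse_npy_header(hostile)
    rejected = False
except ValueError:
    rejected = True  # literal_eval refuses the function call
```

With eval(), the hostile header above would have executed the call; with ast.literal_eval() it is rejected before any code runs.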
Summary Table
| # | Issue | Files | Impact |
|---|---|---|---|
| 1 | np.load() always runs CRC-32 via zipfile; no bypass | npz_reader.py, npz_reader_s3.py | CPU overhead on every file read; no way to disable |
| 2 | parse_npz() decodes all ZIP members, not just the target | npz_reader_odirect.py | Wasted decode + allocation for multi-member files; misleading docs |
| 3 | NPZReaderS3.open() makes 3 copies (bytes + BytesIO + ndarray) | npz_reader_s3.py, npy_reader_s3.py | 3× peak memory per file; BytesIO is an avoidable copy |
| 4 | parse_npy() uses eval() on file header content | npy_reader_odirect.py | Potential arbitrary code execution on untrusted input; use ast.literal_eval() |
Proposed Fix Path
The O_DIRECT reader already has the right approach in parse_npz() + parse_npy().
Refactor those two methods into a shared module (e.g. _npz_parser.py) and use it
in all three paths:
| Reader | Current | After fix |
|---|---|---|
| NPZReader (filesystem) | np.load(filename); CRC on | open(f,'rb') → bytearray(f.read()) → parse_npz(mv)["x"]; CRC off |
| NPZReaderS3 (S3-simple) | np.load(BytesIO(data)); CRC on, 3 copies | bytearray(data) → parse_npz(mv)["x"]; CRC off, 1 copy |
| NPZReaderODIRECT (O_DIRECT) | parse_npz(mv)["x"]; decodes all, eval() | parse_npz(mv, only_key="x")["x"]; early exit, ast.literal_eval() |
| NPZReaderS3Iterable | no decode (byte count only) | no change needed |
The NPY readers (NPYReader, NPYReaderS3) follow the same pattern but without
the ZIP layer — only Issue 3 (BytesIO extra copy) and Issue 4 (eval()) apply.
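For the filesystem path, bytearray(f.read()) first materializes a bytes object and then copies it into the bytearray. One possible way to avoid that intermediate, sketched here with a hypothetical helper name, is to size the buffer up front and fill it with readinto(). This assumes the file size is stable while it is being read.

```python
import os
import tempfile


def read_file_into_buffer(path):
    """Read a whole file into one preallocated bytearray; return a memoryview."""
    size = os.path.getsize(path)
    buf = bytearray(size)
    view = memoryview(buf)
    filled = 0
    # buffering=0 gives a raw FileIO, so readinto() writes straight into buf.
    with open(path, "rb", buffering=0) as f:
        while filled < size:
            n = f.readinto(view[filled:])
            if not n:
                break  # file ended earlier than its reported size
            filled += n
    return view[:filled]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    with open(path, "wb") as f:
        f.write(b"abc" * 1000)
    view = read_file_into_buffer(path)
    roundtrip_ok = bytes(view) == b"abc" * 1000
```

The returned memoryview can be handed to a parse_npz()-style function directly, so the file contents exist in memory once.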
Reproduction
No special setup needed — these are code-path issues visible from static analysis:

    # Confirm CRC is always on in zipfile:
    python3 -c "
    import zipfile, inspect
    src = inspect.getsource(zipfile.ZipExtFile)
    for i, line in enumerate(src.split('\n')):
        if 'crc' in line.lower() or 'BadZip' in line:
            print(f'{i}: {line}')
    "

    # Show eval() in parse_npy:
    grep -n 'eval(' dlio_benchmark/dlio_benchmark/reader/npy_reader_odirect.py

    # Show BytesIO double-copy in S3 readers:
    grep -n 'BytesIO' dlio_benchmark/dlio_benchmark/reader/npz_reader_s3.py \
        dlio_benchmark/dlio_benchmark/reader/npy_reader_s3.py

    # Show parse_npz decodes all members (no break):
    grep -n 'break\|only_key\|DLIO_NPZ_KEY' dlio_benchmark/dlio_benchmark/reader/npz_reader_odirect.py
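To put a number on the CRC cost from Issue 1 on a given machine, a rough timing of zlib.crc32 over an in-memory buffer can serve as a starting point. This is an illustrative micro-benchmark with a made-up function name, not a measurement of np.load() itself, and the result will vary with CPU and zlib build.

```python
import time
import zlib


def crc32_throughput_mb_s(size_mb: int = 64, repeats: int = 3) -> float:
    """Best-of-N throughput of zlib.crc32 over a zero-filled buffer, in MB/s."""
    data = bytes(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        zlib.crc32(data)
        best = min(best, time.perf_counter() - start)
    return size_mb / best


if __name__ == "__main__":
    print(f"crc32: {crc32_throughput_mb_s():.0f} MB/s")
```

Dividing a dataset's file size by this figure gives an estimate of the per-file CPU time spent on checksumming alone.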