NPZ CRC Checksum overhead for File I/O - Expansion of issue #223 #286

@russfellows

Bug Report: NPZ/NPY Reader Issues — CRC Overhead, Memory Copies, O_DIRECT Decode-All, eval() Safety

Repo: dlio_benchmark
Date: March 2026
Severity: Medium–High (performance regression for all NPZ filesystem/S3 readers; correctness concern for O_DIRECT; security concern in O_DIRECT parser)
Affects: NPZ and NPY readers across filesystem, S3-simple, and O_DIRECT paths


Background

A review of all NPZ/NPY reader implementations against an expected behavior table
(covering CRC verification, member materialization, allocation count, and I/O API)
revealed four distinct issues. The S3-iterable path (NPZReaderS3Iterable /
NPYReaderS3Iterable via _S3IterableMixin) is unaffected — it never calls
np.load() and never decodes numpy.

Related Bug

This is an expansion of issue #223, with more detail and a proposed solution.

Issue 1 — np.load() CRC-32 Cannot Be Disabled; No Bypass Exists

Files affected

  • dlio_benchmark/reader/npz_reader.py (NPZReader.open)
  • dlio_benchmark/reader/npz_reader_s3.py (NPZReaderS3.open)

Description

Both NPZReader and NPZReaderS3 call np.load() to read NPZ files:

# npz_reader.py
return np.load(filename, allow_pickle=True)['x']

# npz_reader_s3.py
return np.load(io.BytesIO(data), allow_pickle=True)['x']

np.load() opens NPZ files via Python's zipfile module, which reads each member
through zipfile.ZipExtFile. That class updates a running CRC-32 on every read()
call and compares it with the stored value at end of member; there is no
verify_crc=False parameter in np.load() or in zipfile.

Confirmed from zipfile.ZipExtFile source:

def _update_crc(self, newdata):
    if self._expected_crc is None:
        return
    self._running_crc = crc32(newdata, self._running_crc)
    if self._eof and self._running_crc != self._expected_crc:
        raise BadZipFile("Bad CRC-32 for file %r" % self.name)
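
The effect of that check can be observed without NumPy. The following is a minimal sketch using only the standard library; the helper names make_stored_zip and read_member are illustrative and not part of DLIO. It writes a one-member archive, flips one payload byte, and records whether zipfile rejects the read.

```python
import io
import zipfile


def make_stored_zip(payload: bytes, name: str = "x.npy") -> bytes:
    """Build an in-memory ZIP with one uncompressed member (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr(name, payload)
    return buf.getvalue()


def read_member(raw: bytes, name: str = "x.npy") -> bytes:
    """Read one member through zipfile, which checks CRC-32 at end of member."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return zf.read(name)


payload = b"A" * 1024
raw = bytearray(make_stored_zip(payload))

# Corrupt one byte of the stored payload; headers and the recorded CRC stay intact.
raw[raw.find(payload)] ^= 0xFF

try:
    read_member(bytes(raw))
    crc_was_checked = False
except zipfile.BadZipFile:
    crc_was_checked = True

print("CRC mismatch detected:", crc_was_checked)
```

Because the only damage is inside the payload, a BadZipFile here can only come from the CRC comparison.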

For a storage benchmark, CRC verification adds pure CPU overhead on every file
read with zero benefit: the files are synthetic, generated by the benchmark itself,
and the benchmark discards the decoded data immediately after reading
(DLIO yields self._args.resized_image, not the decoded bytes).
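
How much CPU this costs depends on the host, so it is worth measuring rather than assuming. The sketch below is a rough estimate only: it assumes zlib.crc32 (which zipfile normally uses) is representative of the per-byte cost, and crc32_throughput_mib_s is an illustrative helper, not DLIO code.

```python
import time
import zlib


def crc32_throughput_mib_s(size_mib: int = 16, repeats: int = 3) -> float:
    """Best-of-N throughput of zlib.crc32 over a zero-filled buffer, in MiB/s."""
    buf = bytes(size_mib << 20)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        zlib.crc32(buf)
        best = min(best, time.perf_counter() - start)
    return size_mib / best


rate = crc32_throughput_mib_s()
print(f"crc32: {rate:.0f} MiB/s, about {150 / rate * 1000:.1f} ms per 150 MiB file")
```

Comparing that figure with the storage throughput under test indicates whether the CRC pass is significant for a given run.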

Expected behavior

An optimized buffered reader that manually parses the ZIP local-file header and
decompresses the member data without performing CRC verification — as the O_DIRECT
reader (npz_reader_odirect.py) already does via parse_npz() + parse_npy().

Workaround

None for the filesystem and S3-simple paths. The S3-iterable readers
(NPZReaderS3Iterable) already avoid this by not decoding numpy at all.


Issue 2 — npz_reader_odirect.py Decodes ALL Members, Not One

File affected

  • dlio_benchmark/reader/npz_reader_odirect.py (NPZReaderODIRECT.parse_npz)

Description

parse_npz() iterates the entire ZIP local-file stream, decoding and storing
every member it encounters before returning the requested one:

def parse_npz(self, mem_view):
    files = {}
    pos = 0
    while pos < len(mem_view):
        local_header_signature = mem_view[pos:pos+4].tobytes()
        if local_header_signature != b'\x50\x4b\x03\x04':
            break
        # ... parse compressed_size, uncompressed_size, filename ...
        compressed_data = mem_view[pos:pos+compressed_size]
        pos += compressed_size
        # ... decompress compressed_data into uncompressed_data ...
        files[filename] = self.parse_npy(uncompressed_data)   # ← decodes ALL
    return files   # caller picks ["x"] and discards the rest

For DLIO-generated NPZ files that contain exactly one member (x), this is
equivalent to reading one member. However:

  1. The behavior is misleadingly documented: the table accompanying this issue
    describes O_DIRECT as reading "exactly one (target)" member, which is true
    only because DLIO-generated files happen to contain a single member.
  2. Any NPZ file with multiple members will have all of them decoded and held
    in memory simultaneously, then discarded, wasting CPU and allocations.
  3. The loop should break early once the target member is found.

Expected behavior

parse_npz() should accept an optional only_key parameter (defaulting to the
value of DLIO_NPZ_KEY env var, then "x"). When set, the loop breaks
immediately after the target member is parsed:

import os

def parse_npz(self, mem_view, only_key=None):
    # Explicit argument wins, then the DLIO_NPZ_KEY env var, then "x".
    target = only_key or os.environ.get("DLIO_NPZ_KEY", "x")
    files = {}
    pos = 0
    while pos < len(mem_view):
        ...
        files[filename] = self.parse_npy(uncompressed_data)
        if filename == target:
            break   # ← exit early; later members are never decoded
    return files

Additional fragility

parse_npz() raises ValueError("Unexpected file in npz: {filename}") for any
ZIP entry that does not end in .npy. Standard NPZ files written by NumPy only
contain .npy entries, but this makes the parser unnecessarily brittle. Non-npy
entries should be skipped, not treated as errors.
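
Both points, early exit and skipping instead of raising, can be sketched in a standalone form. This is an illustration under stated assumptions, not the DLIO implementation. load_member and its decode callback (standing in for parse_npy()) are hypothetical names. Members are assumed to be stored as "<key>.npy". Entries that use a data descriptor or ZIP64 sizes are rejected because this sketch does not handle them.

```python
import io
import struct
import zipfile
import zlib

# Fixed 30-byte ZIP local file header: signature, version, flags, method,
# mod time, mod date, CRC-32, compressed size, uncompressed size,
# filename length, extra-field length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
LOCAL_SIG = b"PK\x03\x04"


def load_member(buf, target, decode=bytes):
    """Return decode(payload) for member `target`, or None if it is absent.

    Hypothetical helper: `decode` stands in for parse_npy(). Members other
    than the target are skipped without being decompressed or decoded.
    """
    view = memoryview(buf)
    pos = 0
    while pos + LOCAL_HEADER.size <= len(view):
        (sig, _ver, flags, method, _mtime, _mdate, _crc, csize, _usize,
         name_len, extra_len) = LOCAL_HEADER.unpack_from(view, pos)
        if sig != LOCAL_SIG:
            break  # end of local-file entries (central directory follows)
        if flags & 0x08 or csize == 0xFFFFFFFF:
            raise NotImplementedError("data descriptor / ZIP64 not handled here")
        name_start = pos + LOCAL_HEADER.size
        name = bytes(view[name_start:name_start + name_len]).decode("utf-8")
        data_start = name_start + name_len + extra_len
        pos = data_start + csize  # step to the next header before any decoding
        if not name.endswith(".npy") or name[:-4] != target:
            continue  # non-.npy entry or non-target member: skip, do not raise
        payload = view[data_start:data_start + csize]
        if method == zipfile.ZIP_DEFLATED:
            payload = zlib.decompress(payload, -zlib.MAX_WBITS)  # raw deflate
        elif method != zipfile.ZIP_STORED:
            raise NotImplementedError(f"compression method {method}")
        return decode(payload)  # early exit; the CRC field is never consulted
    return None


# Build a small multi-member archive to exercise the parser.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.npy", b"a" * 100)
    zf.writestr("notes.txt", b"not an array")
    zf.writestr("x.npy", b"x" * 100)
    zf.writestr("z.npy", b"z" * 100)

decoded_sizes = []


def tracking_decode(payload):
    """Record each decode call so the number of decoded members is visible."""
    decoded_sizes.append(len(payload))
    return bytes(payload)


result = load_member(archive.getvalue(), "x", tracking_decode)
print(result == b"x" * 100, decoded_sizes)
```

Checking the name before decompressing means members stored ahead of the target cost only a header parse, which goes one step further than breaking after the target has been decoded.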


Issue 3 — npz_reader_s3.py Allocates Three Copies of Each File

File affected

  • dlio_benchmark/reader/npz_reader_s3.py (NPZReaderS3.open)

Description

NPZReaderS3.open() has three sequential allocations for every file read:

def open(self, filename):
    data = self.storage.get_data(filename, None)  # copy 1: bytes from S3
    image = io.BytesIO(data)                       # copy 2: BytesIO internal buffer
    return np.load(image, allow_pickle=True)['x']  # copy 3: decompressed ndarray

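io.BytesIO(data) is counted here as a copy of data into a new internal byte
buffer rather than a wrapper around the existing one. Whether that copy
materializes depends on the type get_data() returns: CPython can share an exact
bytes object copy-on-write, while bytearray and memoryview inputs are always
copied. In the worst case, peak RSS per file is approximately
3 × file_size_on_wire for uncompressed members (plus zipfile's internal
decompression workspace).

For a file size of 150 MB this is up to 450 MB of peak allocation per file per thread.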
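
The cost of each wrapper can be measured directly instead of inferred. The sketch below uses tracemalloc from the standard library; allocated_by is an illustrative helper, the 8 MiB buffer is a stand-in for one object returned by get_data(), and the result for io.BytesIO depends on the interpreter and on the input type, so it is printed, not assumed.

```python
import io
import tracemalloc


def allocated_by(factory) -> int:
    """Bytes newly allocated and still live right after calling factory()."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    obj = factory()  # keep a reference so the allocation stays live
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del obj
    return after - before


SIZE = 8 << 20  # 8 MiB stand-in for one object fetched from storage
data = bytes(SIZE)

costs = {
    "memoryview(data)": allocated_by(lambda: memoryview(data)),
    "bytearray(data)": allocated_by(lambda: bytearray(data)),
    "io.BytesIO(data)": allocated_by(lambda: io.BytesIO(data)),
}
for label, cost in costs.items():
    print(f"{label:18s} {cost / SIZE:5.2f} x input size")
```

memoryview(data) wraps the existing buffer, while bytearray(data) always allocates a full copy. Running the same measurement on the type actually returned by the storage layer shows which of the three counted copies are real.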

Expected behavior

Use a zero-copy path equivalent to what npz_reader_odirect.py does:
replace io.BytesIO(data) + np.load() with bytearray(data) →
memoryview(buf) → parse_npz(mem_view)["x"]. This eliminates the BytesIO
copy and the CRC computation (Issue 1) simultaneously:

def open(self, filename):
    data = self.storage.get_data(filename, None)
    buf = bytearray(data)                          # one allocation (same size as data)
    return parse_npz(memoryview(buf))["x"]         # zero-copy ndarray view

The same pattern applies to NPYReaderS3.open():

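# current (data plus the BytesIO buffer, CRC on):
data = self.storage.get_data(filename, None)
return np.load(io.BytesIO(data), allow_pickle=True)

# better (data plus one bytearray copy, CRC off; memoryview(data) alone would
# avoid that copy if a read-only array is acceptable):
data = self.storage.get_data(filename, None)
return parse_npy(memoryview(bytearray(data)))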

Issue 4 — eval() on File Content in parse_npy() (Security / Correctness)

Files affected

  • dlio_benchmark/reader/npy_reader_odirect.py (NPYReaderODirect.parse_npy)
  • dlio_benchmark/reader/npz_reader_odirect.py (inherits via NPYReaderODirect)

Description

The NPY header parser uses Python's eval() to parse the NPY file header:

header_dict = eval(header.decode('latin1'))

The NPY header is a Python literal string (e.g. {'descr': '<f4', 'fortran_order': False, 'shape': (1, 224, 224), }) that NumPy's own loader also parses as a Python literal. For DLIO's use case, reading files generated by the benchmark itself, the content is always safe.

However, calling eval() on file content allows arbitrary code execution if the
file is ever sourced from an untrusted location. NumPy's own numpy.lib.format
module uses ast.literal_eval() for this parsing, which is safe:

import ast
header_dict = ast.literal_eval(header.decode('latin1'))

ast.literal_eval() only evaluates Python literals (dicts, tuples, strings,
ints, bools) and raises an exception (ValueError, or SyntaxError for malformed
text) instead of executing anything else.

Expected behavior

Replace eval(...) with ast.literal_eval(...) in parse_npy().
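
A minimal, standard-library sketch of such a header parser follows. It is an illustration, not the DLIO parse_npy(): parse_npy_header and build_npy_header are hypothetical names, the header is decoded as latin1 as in the line quoted above, and only the preamble is handled (no array data).

```python
import ast
import struct

MAGIC = b"\x93NUMPY"


def build_npy_header(header_text: str) -> bytes:
    """Assemble a version 1.0 NPY preamble around header_text (test helper)."""
    body = header_text.encode("latin1") + b"\n"
    return MAGIC + bytes([1, 0]) + struct.pack("<H", len(body)) + body


def parse_npy_header(buf):
    """Return (header_dict, data_offset) using ast.literal_eval, never eval()."""
    view = memoryview(buf)
    if bytes(view[:6]) != MAGIC:
        raise ValueError("not an NPY file")
    major = view[6]
    if major == 1:
        (header_len,) = struct.unpack_from("<H", view, 8)  # 2-byte length
        start = 10
    else:  # format versions 2.x and 3.x use a 4-byte length
        (header_len,) = struct.unpack_from("<I", view, 8)
        start = 12
    text = bytes(view[start:start + header_len]).decode("latin1")
    header = ast.literal_eval(text)  # literals only; calls are rejected
    if not isinstance(header, dict):
        raise ValueError("NPY header is not a dict literal")
    return header, start + header_len


good = build_npy_header(
    "{'descr': '<f4', 'fortran_order': False, 'shape': (1, 224, 224), }"
)
header, offset = parse_npy_header(good)

# A header that would run code under eval() is refused by literal_eval().
hostile = build_npy_header("__import__('os').getcwd()")
try:
    parse_npy_header(hostile)
    rejected = False
except (ValueError, SyntaxError):
    rejected = True

print(header["shape"], offset == len(good), rejected)
```

The returned offset marks where the array bytes begin, so the caller can build the array view from the same buffer without re-reading the header.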


Summary Table

| # | Issue | Files | Impact |
|---|-------|-------|--------|
| 1 | np.load() always runs CRC-32 via zipfile; no bypass | npz_reader.py, npz_reader_s3.py | CPU overhead on every file read; no way to disable |
| 2 | parse_npz() decodes all ZIP members, not just the target | npz_reader_odirect.py | Wasted decode + allocation for multi-member files; misleading docs |
| 3 | NPZReaderS3.open() makes 3 copies (bytes + BytesIO + ndarray) | npz_reader_s3.py, npy_reader_s3.py | 3× peak memory per file; BytesIO is an avoidable copy |
| 4 | parse_npy() uses eval() on file header content | npy_reader_odirect.py | Potential arbitrary code execution on untrusted input; use ast.literal_eval() |

Proposed Fix Path

The O_DIRECT reader already has the right approach in parse_npz() + parse_npy().
Refactor those two methods into a shared module (e.g. _npz_parser.py) and use it
in all three paths:

| Reader | Current | After fix |
|--------|---------|-----------|
| NPZReader (filesystem) | np.load(filename); CRC on | open(f,'rb') → bytearray(f.read()) → parse_npz(mv)["x"]; CRC off |
| NPZReaderS3 (S3-simple) | np.load(BytesIO(data)); CRC on, 3 copies | bytearray(data) → parse_npz(mv)["x"]; CRC off, 1 copy |
| NPZReaderODIRECT (O_DIRECT) | parse_npz(mv)["x"]; decodes all, eval() | parse_npz(mv, only_key="x")["x"]; early exit, ast.literal_eval() |
| NPZReaderS3Iterable | no decode (byte count only) | no change needed |

The NPY readers (NPYReader, NPYReaderS3) follow the same pattern but without
the ZIP layer — only Issue 3 (BytesIO extra copy) and Issue 4 (eval()) apply.


Reproduction

No special setup needed — these are code-path issues visible from static analysis:

# Confirm CRC is always on in zipfile:
python3 -c "
import zipfile, inspect
src = inspect.getsource(zipfile.ZipExtFile)
for i, line in enumerate(src.split('\n')):
    if 'crc' in line.lower() or 'BadZip' in line:
        print(f'{i}: {line}')
"

# Show eval() in parse_npy:
grep -n 'eval(' dlio_benchmark/dlio_benchmark/reader/npy_reader_odirect.py

# Show BytesIO double-copy in S3 readers:
grep -n 'BytesIO' dlio_benchmark/dlio_benchmark/reader/npz_reader_s3.py \
                  dlio_benchmark/dlio_benchmark/reader/npy_reader_s3.py

# Show parse_npz decodes all members (only the signature-check break; no only_key or DLIO_NPZ_KEY):
grep -n 'break\|only_key\|DLIO_NPZ_KEY' dlio_benchmark/dlio_benchmark/reader/npz_reader_odirect.py
