NPZ CRC Checksum overhead for File I/O - Expansion of issue #223 (#286)

# Bug Report: NPZ/NPY Reader Issues — CRC Overhead, Memory Copies, O_DIRECT Decode-All, eval() Safety

**Repo:** `dlio_benchmark`  
**Date:** March 2026  
**Severity:** Medium–High (performance regression for all NPZ filesystem/S3 readers; correctness concern for O_DIRECT; security concern in O_DIRECT parser)  
**Affects:** NPZ and NPY readers across filesystem, S3-simple, and O_DIRECT paths

---

## Background

A review of all NPZ/NPY reader implementations against an expected behavior table
(covering CRC verification, member materialization, allocation count, and I/O API)
revealed four distinct issues.  The S3-iterable path (`NPZReaderS3Iterable` /
`NPYReaderS3Iterable` via `_S3IterableMixin`) is unaffected — it never calls
`np.load()` and never decodes numpy.

## Related Bug

This report expands on issue #223 with additional detail and a proposed solution.

---

## Issue 1 — `np.load()` CRC-32 Cannot Be Disabled; No Bypass Exists

### Files affected
- `dlio_benchmark/reader/npz_reader.py` (`NPZReader.open`)
- `dlio_benchmark/reader/npz_reader_s3.py` (`NPZReaderS3.open`)

### Description

Both `NPZReader` and `NPZReaderS3` call `np.load()` to read NPZ files:

```python
# npz_reader.py
return np.load(filename, allow_pickle=True)['x']

# npz_reader_s3.py
return np.load(io.BytesIO(data), allow_pickle=True)['x']
```

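`np.load()` opens NPZ files through Python's `zipfile` module, which reads each
member via `zipfile.ZipExtFile`.  That class **always** updates a running CRC-32
on every `read()` call and compares it with the stored value once the end of the
member is reached. Neither `np.load()` nor `zipfile` exposes a public option
(such as a `verify_crc=False` parameter) to skip this.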

Confirmed from `zipfile.ZipExtFile` source:

```python
def _update_crc(self, newdata):
    if self._expected_crc is None:
        return
    self._running_crc = crc32(newdata, self._running_crc)
    if self._eof and self._running_crc != self._expected_crc:
        raise BadZipFile("Bad CRC-32 for file %r" % self.name)
```

For a storage benchmark, CRC verification adds pure CPU overhead on every file
read with zero benefit: the files are synthetic, generated by the benchmark itself,
and the benchmark discards the decoded data immediately after reading
(DLIO yields `self._args.resized_image`, not the decoded bytes).
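
The two claims above (the check runs on read, and its cost scales with the payload) can be observed with the standard library alone. The following is a minimal sketch, not DLIO code: the helper names `make_stored_zip`, `read_member`, and `time_crc32` and the 1 MiB synthetic payload are assumptions made for illustration. Timings depend on the machine, so none are quoted.

```python
import io
import time
import zipfile
import zlib


def make_stored_zip(payload: bytes, name: str = "x.npy") -> bytes:
    """Write a single uncompressed (stored) member into an in-memory ZIP."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr(name, payload)
    return out.getvalue()


def read_member(archive: bytes, name: str = "x.npy") -> bytes:
    """Read one member through zipfile, which verifies the CRC-32 at EOF."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)


def time_crc32(payload: bytes, repeats: int = 5) -> float:
    """Best-of-N wall time, in seconds, for one CRC-32 pass over payload."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        zlib.crc32(payload)
        best = min(best, time.perf_counter() - start)
    return best


payload = bytes(range(256)) * 4096            # 1 MiB of synthetic data
archive = make_stored_zip(payload)

# Flip the first payload byte; the headers and central directory stay intact.
corrupted = bytearray(archive)
corrupted[archive.index(payload[:64])] ^= 0xFF

try:
    read_member(bytes(corrupted))
    crc_was_checked = False
except zipfile.BadZipFile:
    crc_was_checked = True

print("CRC enforced on read:", crc_was_checked)
print(f"crc32 over {len(payload)} bytes: {time_crc32(payload):.6f} s")
```

Scaling the payload to a realistic sample size gives a rough per-file estimate of the CPU time spent on verification.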

### Expected behavior
The filesystem and S3-simple paths should use an optimized buffered reader that
manually parses the ZIP local-file header and extracts the member data without
performing CRC verification, as the O_DIRECT reader (`npz_reader_odirect.py`)
already does via `parse_npz()` + `parse_npy()`.
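
A minimal sketch of such a reader, assuming the whole archive is already in memory. This is not the existing `parse_npz()`: the function name `read_member_no_crc` and the `_zip64_sizes` helper are hypothetical, entries that use a trailing data descriptor are rejected rather than handled, and only the stored and deflate methods are supported. ZIP64-style local headers are handled because some writers emit them even for small members.

```python
import io
import struct
import zipfile
import zlib

# Fixed 30-byte ZIP local-file header (little-endian).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
ZIP64_SENTINEL = 0xFFFFFFFF


def _zip64_sizes(extra: bytes):
    """Return (uncompressed, compressed) sizes from a ZIP64 extra field."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        if header_id == 0x0001 and size >= 16:
            return struct.unpack_from("<QQ", extra, pos + 4)
        pos += 4 + size
    raise ValueError("ZIP64 sizes expected but not found")


def read_member_no_crc(view: memoryview, target: str) -> memoryview:
    """Return the bytes of member `target` without verifying its CRC-32."""
    pos = 0
    while pos + LOCAL_HEADER.size <= len(view):
        (sig, _ver, flags, method, _time, _date, _crc,
         csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(view, pos)
        if sig != b"PK\x03\x04":
            break                          # reached the central directory
        if flags & 0x08:
            raise ValueError("data-descriptor entries are out of scope here")
        pos += LOCAL_HEADER.size
        name = bytes(view[pos:pos + name_len]).decode("utf-8")
        extra = bytes(view[pos + name_len:pos + name_len + extra_len])
        pos += name_len + extra_len
        if csize == ZIP64_SENTINEL or usize == ZIP64_SENTINEL:
            usize, csize = _zip64_sizes(extra)
        raw = view[pos:pos + csize]
        pos += csize
        if name != target:
            continue                       # skip without decompressing
        if method == 0:                    # stored: zero-copy slice of the input
            return raw
        if method == 8:                    # deflate: raw stream, no zlib header
            return memoryview(zlib.decompress(raw, wbits=-15))
        raise ValueError(f"unsupported compression method {method}")
    raise KeyError(target)


# Demo: one stored and one deflated member, built with the standard library.
out = io.BytesIO()
with zipfile.ZipFile(out, "w") as zf:
    zf.writestr("meta.npy", b"M" * 32)
    zf.writestr("x.npy", b"payload " * 64, compress_type=zipfile.ZIP_DEFLATED)
archive = bytearray(out.getvalue())
archive[archive.index(b"M" * 32)] = ord("?")   # corrupt a byte; stored CRC is now stale

view = memoryview(archive)
print(bytes(read_member_no_crc(view, "meta.npy"))[:4])   # returned despite the bad CRC
print(len(read_member_no_crc(view, "x.npy")))
```

The CRC field is read from the header but never compared, which is the whole point of the bypass. A benchmark that wants an occasional integrity check could still compare `zlib.crc32(raw)` against that field on a sampled subset of files.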

### Workaround
None for the filesystem and S3-simple paths.  The S3-iterable readers
(`NPZReaderS3Iterable`) already avoid this by not decoding numpy at all.

---

## Issue 2 — `npz_reader_odirect.py` Decodes ALL Members, Not One

### File affected
- `dlio_benchmark/reader/npz_reader_odirect.py` (`NPZReaderODIRECT.parse_npz`)

### Description

`parse_npz()` iterates the entire ZIP local-file stream, decoding and storing
**every** member it encounters before returning the requested one:

```python
def parse_npz(self, mem_view):
    files = {}
    pos = 0
    while pos < len(mem_view):
        local_header_signature = mem_view[pos:pos+4].tobytes()
        if local_header_signature != b'\x50\x4b\x03\x04':
            break
        # ... parse compressed_size, uncompressed_size, filename ...
        compressed_data = mem_view[pos:pos+compressed_size]
        pos += compressed_size
        files[filename] = self.parse_npy(uncompressed_data)   # ← decodes ALL
    return files   # caller picks ["x"] and discards the rest
```

For DLIO-generated NPZ files that contain exactly one member (`x`), this is
equivalent to reading one member.  However:

1. The code is misleadingly documented — the table accompanying this issue
   describes O_DIRECT as reading "exactly one (target)" member, which is only
   accidentally true.
2. Any NPZ file with multiple members will have all of them decoded and allocated
   in memory simultaneously, then discarded — wasted CPU and allocation.
3. The loop should break early once the target member is found.

### Expected behavior
`parse_npz()` should accept an optional `only_key` parameter (defaulting to the
value of `DLIO_NPZ_KEY` env var, then `"x"`).  When set, the loop breaks
immediately after the target member is parsed:

```python
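import os

def parse_npz(self, mem_view, only_key=None):
    target = only_key or os.environ.get("DLIO_NPZ_KEY", "x")
    files = {}
    pos = 0
    while pos < len(mem_view):
        ...  # parse the header and advance pos past this member, as today
        if filename != target:
            continue   # skip members that were not requested
        files[filename] = self.parse_npy(uncompressed_data)
        break   # ← exit early once the target is decoded
    return files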
```

### Additional fragility
`parse_npz()` raises `ValueError("Unexpected file in npz: {filename}")` for any
ZIP entry that does not end in `.npy`.  Standard NPZ files written by NumPy only
contain `.npy` entries, but this makes the parser unnecessarily brittle.  Non-npy
entries should be skipped, not treated as errors.
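
The early exit and the more tolerant handling of non-`.npy` entries can be expressed independently of the header parsing. A minimal sketch, assuming the ZIP walk yields `(filename, raw_bytes)` pairs and that `decode` stands in for `parse_npy()`; `select_member` is a hypothetical helper, not an existing DLIO function.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def select_member(
    entries: Iterable[Tuple[str, bytes]],
    target: str,
    decode: Callable[[bytes], T],
) -> T:
    """Decode only the member stored as `<target>.npy`."""
    for filename, raw in entries:
        if not filename.endswith(".npy"):
            continue            # tolerate foreign entries instead of raising
        if filename[:-len(".npy")] != target:
            continue            # not requested: skip without decoding
        return decode(raw)      # first match wins; later members are never read
    raise KeyError(f"{target!r} not found in archive")


decoded = []


def counting_decode(raw: bytes) -> bytes:
    """Stand-in decoder that records what it was asked to decode."""
    decoded.append(raw)
    return raw.upper()


entries = iter([
    ("README.txt", b"notes"),
    ("y.npy", b"other"),
    ("x.npy", b"wanted"),
    ("z.npy", b"never reached"),
])
result = select_member(entries, "x", counting_decode)
remaining = list(entries)       # entries the loop never touched

print(result, len(decoded), remaining)
```

Because `entries` is consumed lazily, the member after the target is never visited, and the decoder runs exactly once.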

---

## Issue 3 — `npz_reader_s3.py` Allocates Three Copies of Each File

### File affected
- `dlio_benchmark/reader/npz_reader_s3.py` (`NPZReaderS3.open`)

### Description

`NPZReaderS3.open()` has three sequential allocations for every file read:

```python
def open(self, filename):
    data = self.storage.get_data(filename, None)  # copy 1: bytes from S3
    image = io.BytesIO(data)                       # copy 2: BytesIO internal buffer
    return np.load(image, allow_pickle=True)['x']  # copy 3: decompressed ndarray
```

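`io.BytesIO(data)` does not simply wrap an arbitrary buffer.  When `data` is a
`bytearray` or `memoryview`, it is copied into a new internal buffer; when it is
an exact `bytes` object, current CPython releases share the buffer until the
first write, so whether copy 2 actually occurs depends on what `get_data()`
returns and should be confirmed by measurement.  In the worst case, peak RSS per
file is approximately `3 × file_size_on_wire` (plus zipfile's internal
decompression workspace), assuming the member is stored uncompressed so that the
decoded array is the same size as the file.

For a file size of 150 MB, that worst case is 450 MB of peak allocation per file per thread.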
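
The cost of each wrapper can be measured directly rather than estimated. A minimal sketch using the standard library's `tracemalloc`; `extra_bytes` is a hypothetical helper and the 8 MiB buffer is a stand-in for an object fetched from storage. No figures are quoted here because the result for `io.BytesIO` depends on the Python implementation and on the type of `data`.

```python
import io
import tracemalloc


def extra_bytes(make_buffer, data) -> int:
    """Bytes newly allocated by make_buffer(data), as seen by tracemalloc."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    buf = make_buffer(data)              # keep a reference while measuring
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del buf
    return after - before


data = bytes(8 * 1024 * 1024)            # 8 MiB stand-in for an S3 payload

for label, factory in [
    ("io.BytesIO(data)", io.BytesIO),
    ("bytearray(data)", bytearray),
    ("memoryview(data)", memoryview),
]:
    print(f"{label:18s} +{extra_bytes(factory, data):>10,d} bytes")
```

Running the same loop with `data` replaced by the actual return value of `self.storage.get_data(...)` shows which of the three buffers are real allocations in a given deployment.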

### Expected behavior
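Use a path equivalent to what `npz_reader_odirect.py` does: replace
`io.BytesIO(data)` + `np.load()` with `bytearray(data)` → `memoryview(buf)` →
`parse_npz(mem_view)["x"]`, so that the returned ndarray is a view over `buf`
rather than a separate allocation.  Note that `bytearray(data)` is itself one
copy of `data`; if a read-only array is acceptable, `memoryview(data)` avoids
that copy as well.  This removes the BytesIO wrapper and the CRC computation
(Issue 1) simultaneously: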

```python
def open(self, filename):
    data = self.storage.get_data(filename, None)
    buf = bytearray(data)                          # one allocation (same size as data)
    return parse_npz(memoryview(buf))["x"]         # zero-copy ndarray view
```

The same pattern applies to `NPYReaderS3.open()`:
```python
# current (2 copies):
data = self.storage.get_data(filename, None)
return np.load(io.BytesIO(data), allow_pickle=True)

# better (1 copy):
data = self.storage.get_data(filename, None)
return parse_npy(memoryview(bytearray(data)))
```
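
Whether the resulting view is writable depends on the buffer it sits on. The sketch below uses plain `memoryview` objects from the standard library to show the trade-off; the 4-byte `HEADER` is a made-up stand-in for a real NPY header. An ndarray created over a buffer follows the same buffer-protocol rule, so an array built over immutable `bytes` is read-only.

```python
import struct

HEADER = b"HDR!"                                   # 4-byte stand-in for an NPY header
data = HEADER + struct.pack("4f", 1.0, 2.0, 3.0, 4.0)

# View directly over the immutable bytes: no copy, but read-only.
ro_view = memoryview(data)[len(HEADER):].cast("f")

# View over a bytearray copy: one extra allocation, but writable.
buf = bytearray(data)
rw_view = memoryview(buf)[len(HEADER):].cast("f")
rw_view[0] = 42.0                                  # writes through to buf

print(ro_view.readonly, rw_view.readonly)
print(struct.unpack_from("f", buf, len(HEADER))[0])
```

If downstream code never mutates the sample, the read-only view saves the `bytearray` copy; otherwise the single copy is the price of a writable array.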

---

## Issue 4 — `eval()` on File Content in `parse_npy()` (Security / Correctness)

### Files affected
- `dlio_benchmark/reader/npy_reader_odirect.py` (`NPYReaderODirect.parse_npy`)
- `dlio_benchmark/reader/npz_reader_odirect.py` (inherits via `NPYReaderODirect`)

### Description

The NPY header parser uses Python's `eval()` to parse the NPY file header:

```python
header_dict = eval(header.decode('latin1'))
```

The NPY header is a Python literal string (e.g. `{'descr': '<f4', 'fortran_order': False, 'shape': (1, 224, 224), }`) that NumPy's own loader also parses.  For DLIO's use case, reading files generated by the benchmark itself, the content is always safe.

However, `eval()` on binary file content is arbitrary code execution if the file
is ever sourced from an untrusted location.  NumPy's own `numpy.lib.format`
module uses `ast.literal_eval()` for this parsing, which is safe:

```python
import ast
header_dict = ast.literal_eval(header.decode('latin1'))
```

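`ast.literal_eval()` only evaluates Python literals (strings, bytes, numbers,
tuples, lists, dicts, sets, booleans, and `None`).  It raises `ValueError` for
any other expression, such as a function call, and `SyntaxError` for text that
is not valid Python, instead of executing it.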

### Expected behavior
Replace `eval(...)` with `ast.literal_eval(...)` in `parse_npy()`.
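
A minimal sketch of the header parse with `ast.literal_eval()`, using only the standard library. `parse_npy_header` and `make_npy` are hypothetical names for illustration and do not mirror the existing `parse_npy()` signature; the layout follows the published NPY format (magic string, version bytes, little-endian header length, then the header text).

```python
import ast
import struct

MAGIC = b"\x93NUMPY"


def parse_npy_header(buf: bytes):
    """Return (header_dict, data_offset) for an NPY v1.x/2.x/3.x buffer."""
    if buf[:6] != MAGIC:
        raise ValueError("not an NPY file")
    major = buf[6]
    if major == 1:
        (hlen,) = struct.unpack_from("<H", buf, 8)   # 2-byte header length
        start = 10
    elif major in (2, 3):
        (hlen,) = struct.unpack_from("<I", buf, 8)   # 4-byte header length
        start = 12
    else:
        raise ValueError(f"unsupported NPY version {major}")
    encoding = "utf-8" if major == 3 else "latin1"
    text = bytes(buf[start:start + hlen]).decode(encoding)
    header = ast.literal_eval(text)      # literals only; never calls functions
    if not isinstance(header, dict):
        raise ValueError("NPY header is not a dict")
    return header, start + hlen


def make_npy(header_text: str) -> bytes:
    """Build a minimal v1.0 NPY prefix around header_text (demo helper)."""
    body = header_text.encode("latin1") + b"\n"
    return MAGIC + bytes([1, 0]) + struct.pack("<H", len(body)) + body


good = make_npy("{'descr': '<f4', 'fortran_order': False, 'shape': (1, 224, 224), }")
header, offset = parse_npy_header(good)

# A header that would run code under eval() is rejected instead.
hostile = make_npy("__import__('os').getcwd()")
try:
    parse_npy_header(hostile)
    rejected = False
except ValueError:
    rejected = True

print(header["shape"], offset, rejected)
```

The extra `isinstance` check guards against headers that are valid literals but not the expected dict, which `eval()` would also have accepted silently.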

---

## Summary Table

| # | Issue | Files | Impact |
|---|-------|-------|--------|
| 1 | `np.load()` always runs CRC-32 via `zipfile`; no bypass | `npz_reader.py`, `npz_reader_s3.py` | CPU overhead on every file read; no way to disable |
| 2 | `parse_npz()` decodes all ZIP members, not just the target | `npz_reader_odirect.py` | Wasted decode + allocation for multi-member files; misleading docs |
| 3 | `NPZReaderS3.open()` makes 3 copies (bytes + BytesIO + ndarray) | `npz_reader_s3.py`, `npy_reader_s3.py` | 3× peak memory per file; BytesIO is an avoidable copy |
| 4 | `parse_npy()` uses `eval()` on file header content | `npy_reader_odirect.py` | Potential arbitrary code execution on untrusted input; use `ast.literal_eval()` |

---

## Proposed Fix Path

The O_DIRECT reader already has the right approach in `parse_npz()` + `parse_npy()`.
Refactor those two methods into a shared module (e.g. `_npz_parser.py`) and use it
in all three paths:

| Reader | Current | After fix |
|--------|---------|-----------|
| `NPZReader` (filesystem) | `np.load(filename)` — CRC on | `open(f,'rb')` → `bytearray(f.read())` → `parse_npz(mv)["x"]` — CRC off |
| `NPZReaderS3` (S3-simple) | `np.load(BytesIO(data))` — CRC on, 3 copies | `bytearray(data)` → `parse_npz(mv)["x"]` — CRC off, 1 copy |
| `NPZReaderODIRECT` (O_DIRECT) | `parse_npz(mv)["x"]` — decodes all, `eval()` | `parse_npz(mv, only_key="x")["x"]` — early exit, `ast.literal_eval()` |
| `NPZReaderS3Iterable` | no decode (byte count only) | no change needed |

The NPY readers (`NPYReader`, `NPYReaderS3`) follow the same pattern but without
the ZIP layer — only Issue 3 (`BytesIO` extra copy) and Issue 4 (`eval()`) apply.

---

## Reproduction

No special setup needed — these are code-path issues visible from static analysis:

```bash
# Confirm CRC is always on in zipfile:
python3 -c "
import zipfile, inspect
src = inspect.getsource(zipfile.ZipExtFile)
for i, line in enumerate(src.split('\n')):
    if 'crc' in line.lower() or 'BadZip' in line:
        print(f'{i}: {line}')
"

# Show eval() in parse_npy:
grep -n 'eval(' dlio_benchmark/dlio_benchmark/reader/npy_reader_odirect.py

# Show BytesIO double-copy in S3 readers:
grep -n 'BytesIO' dlio_benchmark/dlio_benchmark/reader/npz_reader_s3.py \
                  dlio_benchmark/dlio_benchmark/reader/npy_reader_s3.py

# Show parse_npz decodes all members (no break):
grep -n 'break\|only_key\|DLIO_NPZ_KEY' dlio_benchmark/dlio_benchmark/reader/npz_reader_odirect.py
```
