
Building a dataset with large variable size arrays results in error ArrowInvalid: Value X too large to fit in C integer type #7821

Description

@kkoutini

Describe the bug

I used map to store raw audio waveforms of variable lengths in a column of a dataset. The map call fails with ArrowInvalid: Value X too large to fit in C integer type.

Traceback (most recent call last):
  File "...lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "...lib/python3.12/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3526, in _map_single
    writer.write_batch(batch)
  File "...lib/python3.12/site-packages/datasets/arrow_writer.py", line 605, in write_batch
    arrays.append(pa.array(typed_sequence))
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 252, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 114, in pyarrow.lib._handle_arrow_array_protocol
  File "...lib/python3.12/site-packages/datasets/arrow_writer.py", line 225, in __arrow_array__
    out = list_of_np_array_to_pyarrow_listarray(data)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...lib/python3.12/site-packages/datasets/features/features.py", line 1538, in list_of_np_array_to_pyarrow_listarray
    return list_of_pa_arrays_to_pyarrow_listarray(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...lib/python3.12/site-packages/datasets/features/features.py", line 1530, in list_of_pa_arrays_to_pyarrow_listarray
    offsets = pa.array(offsets, type=pa.int32())
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 362, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 87, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Value 2148479376 too large to fit in C integer type

Steps to reproduce the bug

Calling map with a function that returns a column of long, variable-length 1-D numpy arrays.

Example:

# %%
import logging
import datasets
import pandas as pd
import numpy as np
# %%


def process_batch(batch, rank):
    res = []
    for _ in batch["id"]:
        # Each array has 2**30 elements, so two arrays in one written batch
        # push the cumulative list offsets past the int32 maximum.
        res.append(np.zeros(2**30, dtype=np.uint16))

    return {"audio": res}


if __name__ == "__main__":
    df = pd.DataFrame(
        {
            "id": list(range(400)),
        }
    )

    ds = datasets.Dataset.from_pandas(df)
    try:
        from multiprocess import set_start_method

        set_start_method("spawn")
    except RuntimeError:
        print("Spawn method already set, continuing...")
    mapped_ds = ds.map(
        process_batch,
        batched=True,
        batch_size=2,
        with_rank=True,
        num_proc=2,
        cache_file_name="path_to_cache/tmp.arrow",
        writer_batch_size=200,
        remove_columns=ds.column_names,
        # disable_nullable=True,
    )

Expected behavior

I think the offsets should be promoted to pa.int64() when needed, rather than always forced to pa.int32() in datasets/features/features.py:

offsets = pa.array(offsets, type=pa.int32())

Environment info

  • datasets version: 3.3.1
  • Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
  • Python version: 3.12.9
  • huggingface_hub version: 0.29.0
  • PyArrow version: 19.0.1
  • Pandas version: 2.2.3
  • fsspec version: 2024.12.0
