Description
Describe the bug
I used map to store raw audio waveforms of variable lengths in a column of a dataset. The map call fails with ArrowInvalid: Value X too large to fit in C integer type.
Traceback (most recent call last):
  File "...lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "...lib/python3.12/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3526, in _map_single
    writer.write_batch(batch)
  File "...lib/python3.12/site-packages/datasets/arrow_writer.py", line 605, in write_batch
    arrays.append(pa.array(typed_sequence))
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 252, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 114, in pyarrow.lib._handle_arrow_array_protocol
  File "...lib/python3.12/site-packages/datasets/arrow_writer.py", line 225, in __arrow_array__
    out = list_of_np_array_to_pyarrow_listarray(data)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...lib/python3.12/site-packages/datasets/features/features.py", line 1538, in list_of_np_array_to_pyarrow_listarray
    return list_of_pa_arrays_to_pyarrow_listarray(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...lib/python3.12/site-packages/datasets/features/features.py", line 1530, in list_of_pa_arrays_to_pyarrow_listarray
    offsets = pa.array(offsets, type=pa.int32())
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 362, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 87, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Value 2148479376 too large to fit in C integer type
Steps to reproduce the bug
Call map on a dataset with a function that returns a column of long 1D NumPy arrays of variable length.
Example:
# %%
import logging
import datasets
import pandas as pd
import numpy as np

# %%
def process_batch(batch, rank):
    res = []
    for _ in batch["id"]:
        res.append(np.zeros((2**30)).astype(np.uint16))
    return {"audio": res}


if __name__ == "__main__":
    df = pd.DataFrame(
        {
            "id": list(range(400)),
        }
    )
    ds = datasets.Dataset.from_pandas(df)
    try:
        from multiprocess import set_start_method

        set_start_method("spawn")
    except RuntimeError:
        print("Spawn method already set, continuing...")
    mapped_ds = ds.map(
        process_batch,
        batched=True,
        batch_size=2,
        with_rank=True,
        num_proc=2,
        cache_file_name="path_to_cache/tmp.arrow",
        writer_batch_size=200,
        remove_columns=ds.column_names,
        # disable_nullable=True,
    )
Expected behavior
I think the offsets should use pa.int64() when needed instead of being forced to pa.int32(), in datasets/src/datasets/features/features.py, line 1535 at commit 3e13d30:
offsets = pa.array(offsets, type=pa.int32())
Environment info
- datasets version: 3.3.1
- Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
- Python version: 3.12.9
- huggingface_hub version: 0.29.0
- PyArrow version: 19.0.1
- Pandas version: 2.2.3
- fsspec version: 2024.12.0