
Building a dataset with large variable size arrays results in error ArrowInvalid: Value X too large to fit in C integer type #7821

Description

@kkoutini

Describe the bug

I used map to store raw audio waveforms of variable lengths in a column of a dataset. The map call fails with ArrowInvalid: Value X too large to fit in C integer type.

Traceback (most recent call last):
  File "...lib/python3.12/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "...lib/python3.12/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...lib/python3.12/site-packages/datasets/arrow_dataset.py", line 3526, in _map_single
    writer.write_batch(batch)
  File "...lib/python3.12/site-packages/datasets/arrow_writer.py", line 605, in write_batch
    arrays.append(pa.array(typed_sequence))
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 252, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 114, in pyarrow.lib._handle_arrow_array_protocol
  File "...lib/python3.12/site-packages/datasets/arrow_writer.py", line 225, in __arrow_array__
    out = list_of_np_array_to_pyarrow_listarray(data)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...lib/python3.12/site-packages/datasets/features/features.py", line 1538, in list_of_np_array_to_pyarrow_listarray
    return list_of_pa_arrays_to_pyarrow_listarray(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...lib/python3.12/site-packages/datasets/features/features.py", line 1530, in list_of_pa_arrays_to_pyarrow_listarray
    offsets = pa.array(offsets, type=pa.int32())
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 362, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 87, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Value 2148479376 too large to fit in C integer type

Steps to reproduce the bug

Calling map with a function that returns a column of long, variable-length 1-D numpy arrays.

Example:

# %%
import logging
import datasets
import pandas as pd
import numpy as np
# %%


def process_batch(batch, rank):
    res = []
    for _ in batch["id"]:
        # Each array has 2**30 elements, so two arrays in one written batch
        # push the cumulative list offsets past the int32 maximum.
        res.append(np.zeros(2**30, dtype=np.uint16))

    return {"audio": res}


if __name__ == "__main__":
    df = pd.DataFrame(
        {
            "id": list(range(400)),
        }
    )

    ds = datasets.Dataset.from_pandas(df)
    try:
        from multiprocess import set_start_method

        set_start_method("spawn")
    except RuntimeError:
        print("Spawn method already set, continuing...")
    mapped_ds = ds.map(
        process_batch,
        batched=True,
        batch_size=2,
        with_rank=True,
        num_proc=2,
        cache_file_name="path_to_cache/tmp.arrow",
        writer_batch_size=200,
        remove_columns=ds.column_names,
        # disable_nullable=True,
    )

Expected behavior

I think the offsets should be promoted to pa.int64() when needed, rather than always forced to pa.int32() in datasets/features/features.py:

offsets = pa.array(offsets, type=pa.int32())

Environment info

  • datasets version: 3.3.1
  • Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
  • Python version: 3.12.9
  • huggingface_hub version: 0.29.0
  • PyArrow version: 19.0.1
  • Pandas version: 2.2.3
  • fsspec version: 2024.12.0
