
[C++][Python] Large strings cause ArrowInvalid: offset overflow while concatenating arrays #33049


Description


When working with medium-sized datasets that have very long strings, Arrow fails when trying to operate on the strings. The root cause is the combine_chunks function.

Here is a minimal reproducible example:

import numpy as np
import pyarrow as pa

# Create a large string
x = str(np.random.randint(low=0,high=1000, size=(30000,)).tolist())
t = pa.chunked_array([x]*20_000)
# Combine the chunks into a single string array - fails
combined = t.combine_chunks()

I get the following error:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
/var/folders/x6/00594j4s2yv3swcn98bn8gxr0000gn/T/ipykernel_95780/4128956270.py in <module>
----> 1 z=t.combine_chunks()

~/.venv/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.ChunkedArray.combine_chunks()

~/.venv/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()

~/Documents/Github/dataquality/.venv/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/.venv/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: offset overflow while concatenating arrays

With smaller strings or smaller arrays this works fine.

x = str(np.random.randint(low=0,high=1000, size=(10,)).tolist())
t = pa.chunked_array([x]*1000)
combined = t.combine_chunks()
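
My reading of this (an assumption on my part, not stated in the error itself) is that Arrow's string type stores its offsets as 32-bit integers, so a single contiguous string array can hold at most about 2 GiB of character data, and the failing example needs roughly 3 GB once all chunks are concatenated. A quick back-of-the-envelope check:

import numpy as np

# Rough size estimate for the failing example above: the combined array
# would need len(x) * 20_000 bytes of character data behind 32-bit offsets.
x = str(np.random.randint(low=0, high=1000, size=(30_000,)).tolist())
total_bytes = len(x) * 20_000
print(total_bytes, total_bytes > 2**31 - 1)  # ~3 GB, past the int32 offset limit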

The first (failing) example takes a few minutes to run. If you'd like a faster example for experimentation, you can use vaex to generate the chunked array much more quickly. The following throws the identical error and runs in about one second.

import vaex
import numpy as np

n = 50_000
x = str(np.random.randint(low=0,high=1000, size=(30_000,)).tolist())
df = vaex.from_arrays(
    id=list(range(n)),
    y=np.random.randint(low=0,high=1000,size=n)
)
df["text"] = vaex.vconstant(x, len(df))
# text_chunk_array is now a pyarrow.lib.ChunkedArray
text_chunk_array = df.text.values
x = text_chunk_array.combine_chunks() 
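
A possible workaround (a minimal sketch, assuming the combined data itself fits in memory) is to cast the chunked array to Arrow's large_string type, which uses 64-bit offsets, before combining:

import pyarrow as pa

# Cast to large_string (64-bit offsets) so the combined array is not
# limited to ~2 GiB of character data, then combine.
# t is the chunked array from the first example (or text_chunk_array above).
combined = t.cast(pa.large_string()).combine_chunks()

The trade-off is a larger offsets buffer, and downstream code that expects the plain string type may need to handle large_string as well.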

Reporter: Ben Epstein


Note: This issue was originally created as ARROW-17828. Please see the migration documentation for further details.
