flowchart TD
subgraph ARROW
subgraph ARROW-CPP
A(arrow array)
B(arrow table)
A --> B
end
subgraph ARROW-PYTHON
E(arrow array)
F(arrow table)
E --> F
end
ARROW-CPP --> ARROW-PYTHON
end
subgraph DATAFRAME
subgraph DATAFRAME-CPP
C(dataframe)
D(distributed dataframe)
C --> D
end
subgraph DATAFRAME-PYTHON
G(dataframe)
H(distributed dataframe)
G --> H
end
DATAFRAME-CPP --> DATAFRAME-PYTHON
end
I(database)
J(sql)
ARROW --> DATAFRAME
ARROW --> I
DATAFRAME --> J
I --> J
| library | standalone | distributed |
|---|---|---|
| numpy | x | |
| cupy | x | |
| vaex | x | |
| cudf | x | |
| pandas | x | |
| modin | x | x |
| dask | x | x |
| mars | x | x |
| xorbits | x | x |
- code
import pandas as pd
print("=========== internal ===========")
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': ["5", "6"]}
df = pd.DataFrame(data=d)
print(df)
print()
mgr = df._mgr
for block in mgr.blocks:
    print(block.values)
print(mgr.arrays)
print(mgr.column_arrays)
print("================================")
print()
print("=========== internal ===========")
d = {'col1': [1, 3, "5"], 'col2': [2, 4, "6"]}
df = pd.DataFrame(data=d)
print(df)
print()
mgr = df._mgr
for block in mgr.blocks:
    print(block.values)
print(mgr.arrays)
print(mgr.column_arrays)
print("================================")
print()
- output
=========== internal ===========
col1 col2 col3
0 1 3 5
1 2 4 6
[[1 2]
[3 4]]
[['5' '6']]
[array([[1, 2],
[3, 4]], dtype=int64), array([['5', '6']], dtype=object)]
[array([1, 2], dtype=int64), array([3, 4], dtype=int64), array(['5', '6'], dtype=object)]
================================
=========== internal ===========
col1 col2
0 1 2
1 3 4
2 5 6
[[1 3 '5']
[2 4 '6']]
[array([[1, 3, '5'],
[2, 4, '6']], dtype=object)]
[array([1, 3, '5'], dtype=object), array([2, 4, '6'], dtype=object)]
================================
- conclusion
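pandas will
- infer a dtype for each column (a column that mixes ints and strings falls back to object)
- consolidate columns that share a dtype into one block, which is a 2-D np.ndarray with one row per column
- assemble the blocks into the block manager (df._mgr)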
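
The steps above can be sketched with public pandas and NumPy calls only. This is a rough illustration of the idea, not how pandas implements it internally: `sketch_blocks` is a hypothetical helper, and the example uses a float column in place of the string column so the dtype names do not depend on the pandas version.

```python
import numpy as np
import pandas as pd


def sketch_blocks(df: pd.DataFrame) -> dict:
    """Hypothetical helper: group columns by dtype, then stack each group
    into one 2-D array with one row per column (mimicking a block)."""
    groups = {}
    for name, dtype in df.dtypes.items():
        # Step 1: pandas has already inferred a dtype per column.
        groups.setdefault(str(dtype), []).append(name)
    # Step 2: columns sharing a dtype are stacked into a single 2-D array.
    return {
        dtype: (names, np.vstack([df[name].to_numpy() for name in names]))
        for dtype, names in groups.items()
    }


df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [0.5, 1.5]})
blocks = sketch_blocks(df)
for dtype, (names, values) in blocks.items():
    print(dtype, names, values.shape)
```

The two integer columns end up in one array of shape (2, 2), matching the layout of the first block printed in the output section, while the remaining column gets its own array of shape (1, 2).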
Arrow only supports conversion between a pandas DataFrame and a pyarrow Table (pyarrow.Table.from_pandas and Table.to_pandas).