cpp

overview

flowchart TD
  subgraph ARROW
    subgraph ARROW-CPP
      A(arrow array)
      B(arrow table)
      A --> B
    end

    subgraph ARROW-PYTHON
      E(arrow array)
      F(arrow table)
      E --> F
    end

    ARROW-CPP --> ARROW-PYTHON
  end

  subgraph DATAFRAME
    subgraph DATAFRAME-CPP
      C(dataframe)
      D(distributed dataframe)
      C --> D
    end

    subgraph DATAFRAME-PYTHON
      G(dataframe)
      H(distributed dataframe)
      G --> H
    end

    DATAFRAME-CPP --> DATAFRAME-PYTHON
  end

  I(database)
  J(sql)

  ARROW --> DATAFRAME
  ARROW --> I
  DATAFRAME --> J
  I --> J

open source

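| library | standalone | distributed |
| ------- | ---------- | ----------- |
| numpy   | x          |             |
| cupy    | x          |             |
| vaex    | x          |             |
| cudf    | x          |             |
| pandas  | x          |             |
| modin   | x          | x           |
| dask    | x          | x           |
| mars    | x          | x           |
| xorbits | x          | x           |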

pandas

internal structure

  • code
import pandas as pd

print("=========== internal ===========")
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': ["5", "6"]}
df = pd.DataFrame(data=d)
print(df)
print()

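# _mgr is the private block manager that stores the DataFrame's data
mgr = df._mgr
# each block holds all columns of one dtype as a single 2-D array
for block in mgr.blocks:
    print(block.values)
# arrays: one array per block; column_arrays: one 1-D array per column
print(mgr.arrays)
print(mgr.column_arrays)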
print("================================")
print()

print("=========== internal ===========")
d = {'col1': [1, 3, "5"], 'col2': [2, 4, "6"]}
df = pd.DataFrame(data=d)
print(df)
print()
# both columns mix ints and strings, so both are object dtype and share one block
mgr = df._mgr
for block in mgr.blocks:
    print(block.values)
print(mgr.arrays)
print(mgr.column_arrays)
print("================================")
print()
  • output
=========== internal ===========
   col1  col2 col3
0     1     3    5
1     2     4    6

[[1 2]
 [3 4]]
[['5' '6']]
[array([[1, 2],
       [3, 4]], dtype=int64), array([['5', '6']], dtype=object)]
[array([1, 2], dtype=int64), array([3, 4], dtype=int64), array(['5', '6'], dtype=object)]
================================

=========== internal ===========
  col1 col2
0    1    2
1    3    4
2    5    6

[[1 3 '5']
 [2 4 '6']]
[array([[1, 3, '5'],
       [2, 4, '6']], dtype=object)]
[array([1, 3, '5'], dtype=object), array([2, 4, '6'], dtype=object)]
================================
  • conclusion

When constructing a DataFrame, pandas will

  1. infer a dtype for each column
  2. consolidate columns of the same dtype into a single block, which is backed by a 2-D np.ndarray
  3. assemble the blocks into the block manager (df._mgr)
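
The three steps above can be checked directly on a small frame. This is only a sketch: `_mgr` and `blocks` are private pandas internals (the same ones used in the code above), so attribute names and block layout may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": ["5", "6"]})

# Step 1: one dtype is inferred per column.
print(df.dtypes)

# Steps 2 and 3: columns with the same dtype share a block,
# and the manager holds the list of blocks.
# NOTE: private API, assumed to behave as in the output shown above.
blocks = df._mgr.blocks
for block in blocks:
    print(block.dtype, block.values.shape)

# The two integer columns are expected to live in a single block.
int_blocks = [b for b in blocks if b.dtype == "int64"]
print(len(blocks), len(int_blocks))
```

The integer block stores its data as (columns, rows), which is why `col1` and `col2` appear as the rows of the 2-D array in the output above.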

arrow

Arrow only supports conversion between a pandas DataFrame and a pyarrow Table.

open source

reference