feat(python): Link against PyArrow arrow libraries to resolve manylinux limitations #836

@yangxk1

Description

As discussed in issue #804, the system-level libarrow.so provided in standard manylinux environments (or installed via system package managers) is often incomplete or lacks necessary components for our use case.

A more robust solution is to link graphar.so directly against the libarrow.so bundled in the pyarrow Python package. This ensures we use a full-featured Arrow library that matches the one already present in the user's Python environment.
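As a rough starting point, pyarrow already exposes helpers that report where its bundled headers and libraries live: `pyarrow.get_include()`, `pyarrow.get_library_dirs()` and `pyarrow.get_libraries()`. The sketch below turns those values into compiler and linker flags. The helper names `link_flags` and `pyarrow_link_flags` are hypothetical and not part of GraphAr's build; how the flags would be passed to our CMake setup is left open.

```python
from typing import List


def link_flags(include_dir: str, library_dirs: List[str], libraries: List[str]) -> List[str]:
    """Build -I/-L/-l flags from an include dir, library dirs and library names."""
    flags = [f"-I{include_dir}"]
    flags += [f"-L{d}" for d in library_dirs]
    flags += [f"-l{lib}" for lib in libraries]
    return flags


def pyarrow_link_flags() -> List[str]:
    """Query the installed pyarrow for its bundled headers and libraries."""
    import pyarrow as pa  # imported lazily so link_flags() works without pyarrow

    return link_flags(pa.get_include(), pa.get_library_dirs(), pa.get_libraries())


if __name__ == "__main__":
    # Illustrative paths only; real values come from pyarrow_link_flags().
    print(link_flags("/site-packages/pyarrow/include",
                     ["/site-packages/pyarrow"],
                     ["arrow", "arrow_python"]))
```

Note that pyarrow wheels ship versioned shared objects, so a plain `-larrow` may not resolve until `pyarrow.create_library_symlinks()` has been run once in the build environment.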

However, adopting this approach introduces several significant build and runtime challenges described below.

The dependency relationship is illustrated as follows:

graph TD
    A[pyarrow bundled libarrow.so] --> B[pyarrow.whl]
    A --> C[graphar.so <br> C++ Core]
    C --> D[graphar.whl <br> Python Binding]
    B -.-> D
    
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#bbf,stroke:#333,stroke-width:2px

Key Challenges

1. ABI Compatibility (The "Segfault" Risk)

The C++ ABI (Application Binary Interface) of Apache Arrow is not guaranteed to be stable across major versions.

  • Risk: If graphar.so is built against the libarrow.so from pyarrow v14.0.0, but the user updates to pyarrow v15.0.0 at runtime, changes in class memory layouts or function signatures could cause immediate Segmentation Faults.
  • Difficulty: We need to determine a strategy to manage version constraints effectively, ensuring the build-time Arrow version is ABI-compatible with the runtime Arrow version.
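One possible strategy, sketched below, is to pin the runtime pyarrow to the major version used at build time and to fail loudly at import time if the two disagree. This assumes ABI breaks occur only at major version boundaries, which would need to be verified against Arrow's release policy. The function names are hypothetical, and the build-time version would have to be recorded by the build system.

```python
def parse_major(version: str) -> int:
    """Return the major component of a version string such as '14.0.2'."""
    return int(version.split(".")[0])


def pyarrow_requirement(build_version: str) -> str:
    """Requirement string that keeps pyarrow within the build-time major version."""
    major = parse_major(build_version)
    return f"pyarrow>={build_version},<{major + 1}"


def check_runtime(build_version: str, runtime_version: str) -> None:
    """Raise ImportError if the runtime major version differs from the build-time one."""
    if parse_major(build_version) != parse_major(runtime_version):
        raise ImportError(
            f"graphar was built against pyarrow {build_version} "
            f"but pyarrow {runtime_version} is installed"
        )


if __name__ == "__main__":
    print(pyarrow_requirement("14.0.0"))
    check_runtime("14.0.0", "14.0.2")  # same major version, passes silently
```

Such a check could run in the package's `__init__.py` before the extension module is loaded, with `pyarrow.__version__` as the runtime value, so that a mismatch surfaces as a readable error instead of a crash.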

2. Runtime Linkage (RPATH Resolution)

Unlike system libraries located in /usr/lib, the target libarrow.so lives inside the Python environment's site-packages/pyarrow directory.

  • Challenge: Standard linkers will not find this library by default. graphar.so must be configured (likely via RPATH) to dynamically locate libarrow.so relative to its own location at runtime, without forcing users to manually manipulate LD_LIBRARY_PATH.
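To illustrate the relative lookup, the sketch below computes an `$ORIGIN`-based RPATH entry from the directory holding the extension module to the directory holding pyarrow's libraries. The installed layout (a `graphar` package directory next to `pyarrow` in site-packages) and the function name are assumptions for illustration.

```python
import posixpath


def origin_relative_rpath(extension_dir: str, arrow_lib_dir: str) -> str:
    """Express arrow_lib_dir relative to extension_dir as an $ORIGIN RPATH entry."""
    rel = posixpath.relpath(arrow_lib_dir, start=extension_dir)
    return "$ORIGIN" if rel == "." else f"$ORIGIN/{rel}"


if __name__ == "__main__":
    site = "/venv/lib/python3.11/site-packages"  # illustrative path
    print(origin_relative_rpath(f"{site}/graphar", f"{site}/pyarrow"))
```

The resulting entry could be set at link time via the linker's `-rpath` option or applied afterwards with `patchelf --set-rpath`. Because it is relative to the extension's own location, it would hold for any environment where both packages are installed into the same site-packages directory.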

3. The "Two Arrows" Problem (ODR Violation)

If this linking is not handled correctly (e.g., if GraphAr accidentally links to a static Arrow or a different system Arrow), we risk having two different copies of Arrow code in the process memory.

  • Consequence: This would violate the One Definition Rule (ODR). Passing objects (like pyarrow.Table) between GraphAr and PyArrow would lead to undefined behavior, data corruption, or crashes.
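A cheap diagnostic for this on Linux is to inspect which shared objects the process has mapped. The sketch below parses the contents of `/proc/self/maps` and lists distinct `libarrow.so*` paths; more than one entry suggests two dynamic copies are loaded. The function names are hypothetical, and this approach cannot detect an Arrow that was statically linked into graphar.so.

```python
import os
from typing import List


def arrow_libs_in_maps(maps_text: str) -> List[str]:
    """Return the distinct libarrow.so* paths found in /proc/<pid>/maps content."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, then an optional pathname.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if os.path.basename(path).startswith("libarrow.so"):
                paths.add(path)
    return sorted(paths)


def loaded_arrow_libs() -> List[str]:
    """Inspect the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return arrow_libs_in_maps(f.read())


if __name__ == "__main__":
    sample = (
        "7f00-7f01 r-xp 00000000 08:01 11 /a/pyarrow/libarrow.so.1400\n"
        "7f04-7f05 r-xp 00000000 08:01 12 /usr/lib/libarrow.so.1300\n"
    )
    print(arrow_libs_in_maps(sample))
```

Calling `loaded_arrow_libs()` after importing both pyarrow and graphar in a test could serve as a regression guard: the test would assert that exactly one path is returned.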

Objective

We need to design a build strategy that successfully links against the pyarrow-bundled libraries while solving the RPATH and ABI compatibility issues.

Component(s)

Python, Developer Tools

Labels

enhancement, help wanted
