Skip to content

Performance Improvement: Replace Py4J-based Implementation with Native PyArrow #49

@chenghuichen

Description

@chenghuichen

Current Implementation and Issues

Currently, paimon-python leverages Py4J to reuse Java's read/write capabilities, with data serialization between Java and Python processes handled through ArrowUtils.serializeToIpc. This implementation has several performance bottlenecks:

  1. Process Communication Overhead: The Py4J bridge requires inter-process communication (IPC) between Java and Python processes, introducing significant latency.
  2. Serialization/Deserialization Cost: Each data transfer requires serialization to Arrow IPC format and subsequent deserialization, which is computationally expensive.
  3. Memory Management Complexity: The current implementation requires careful management of memory allocators and resources across process boundaries.

Proposed Solution

We propose to refactor paimon-python to use native PyArrow implementations for read/write operations. This would:

  1. Eliminate Process Communication: Remove the need for Py4J bridge and IPC, allowing direct memory access.
  2. Reduce Serialization Overhead: Enable zero-copy data transfer between Python and native code.
  3. Simplify Memory Management: Leverage PyArrow's built-in memory management capabilities.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions