Performance Improvement: Replace Py4J-based Implementation with Native PyArrow

## Current Implementation and Issues

Currently, paimon-python leverages Py4J to reuse Java's read/write capabilities, with data serialization between Java and Python processes handled through `ArrowUtils.serializeToIpc`. This implementation has several performance bottlenecks:

1. **Process Communication Overhead**: The Py4J bridge requires inter-process communication (IPC) between Java and Python processes, introducing significant latency.
2. **Serialization/Deserialization Cost**: Each data transfer requires serialization to Arrow IPC format and subsequent deserialization, which is computationally expensive.
3. **Memory Management Complexity**: The current implementation requires careful management of memory allocators and resources across process boundaries.

## Proposed Solution

We propose to refactor paimon-python to use native PyArrow implementations for read/write operations. This would:

1. **Eliminate Process Communication**: Remove the need for Py4J bridge and IPC, allowing direct memory access.
2. **Reduce Serialization Overhead**: Enable zero-copy data transfer between Python and native code.
3. **Simplify Memory Management**: Leverage PyArrow's built-in memory management capabilities.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Performance Improvement: Replace Py4J-based Implementation with Native PyArrow #49

Current Implementation and Issues

Proposed Solution

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Performance Improvement: Replace Py4J-based Implementation with Native PyArrow #49

Description

Current Implementation and Issues

Proposed Solution

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions