Description
Current Implementation and Issues
Currently, paimon-python leverages Py4J to reuse Java's read/write capabilities, with data serialization between Java and Python processes handled through ArrowUtils.serializeToIpc. This implementation has several performance bottlenecks:
- Process Communication Overhead: The Py4J bridge requires inter-process communication (IPC) between Java and Python processes, introducing significant latency.
- Serialization/Deserialization Cost: Each data transfer requires serialization to Arrow IPC format and subsequent deserialization, which is computationally expensive.
- Memory Management Complexity: The current implementation requires careful management of memory allocators and resources across process boundaries.
Proposed Solution
We propose to refactor paimon-python to use native PyArrow implementations for read/write operations. This would:
- Eliminate Process Communication: Remove the need for the Py4J bridge and IPC, allowing direct memory access.
- Reduce Serialization Overhead: Enable zero-copy data transfer between Python and native code.
- Simplify Memory Management: Leverage PyArrow's built-in memory management capabilities.