[DRAFT][PYTHON] Improve Python UDF Arrow Serializer Performance #51225

asl3 · 2025-06-19T20:33:30Z

What changes were proposed in this pull request?

This PR removes the pandas <-> Arrow conversions in Arrow-optimized Python UDFs by using PyArrow directly.

Why are the changes needed?

The Python UDF Arrow serializer incurs significant overhead from converting Arrow batches into pandas Series and converting UDF results back into a pandas DataFrame.

We can instead convert Python objects directly into Arrow to avoid the expensive pandas conversion.
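The difference between the two paths can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper names `via_pandas` and `direct` are hypothetical, and only the public `pyarrow.array` and `pyarrow.Array.from_pandas` calls are used. The imports are kept inside the functions so the sketch loads even where pandas or PyArrow is not installed.

```python
def via_pandas(values, arrow_type):
    """Round-trip path: Python objects -> pandas.Series -> Arrow array.

    Hypothetical helper for illustration only.
    """
    import pandas as pd
    import pyarrow as pa

    series = pd.Series(values)  # extra pandas materialization step
    return pa.Array.from_pandas(series, type=arrow_type)


def direct(values, arrow_type):
    """Direct path: Python objects -> Arrow array, with no pandas step.

    Hypothetical helper for illustration only.
    """
    import pyarrow as pa

    return pa.array(values, type=arrow_type)
```

For a list such as `["a", None, "c"]` with `pa.string()` as the target type, both helpers should produce equal Arrow arrays; the direct version simply skips building the intermediate `pandas.Series`.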

Does this PR introduce any user-facing change?

Legacy type coercion (arrow batch eval):

    # +-----------------------------+--------------+------------------+--------------------+------+--------------------+-----------------------------+------------+----------------------+------------------+------------------+----------------------------+--------------------+--------------+  # noqa
    # |SQL Type \ Python Value(Type)|None(NoneType)|        True(bool)|              1(int)|a(str)|    1970-01-01(date)|1970-01-01 00:00:00(datetime)|  1.0(float)|array('i', [1])(array)|         [1](list)|       (1,)(tuple)|bytearray(b'ABC')(bytearray)|          1(Decimal)|{'a': 1}(dict)|  # noqa
    # +-----------------------------+--------------+------------------+--------------------+------+--------------------+-----------------------------+------------+----------------------+------------------+------------------+----------------------------+--------------------+--------------+  # noqa
    # |                      boolean|          None|              True|                True|     X|                   X|                            X|        True|                     X|                 X|                 X|                           X|                   X|             X|  # noqa
    # |                      tinyint|          None|                 1|                   1|     X|                   X|                            X|           1|                     X|                 X|                 X|                           X|                   1|             X|  # noqa
    # |                     smallint|          None|                 1|                   1|     X|                   X|                            X|           1|                     X|                 X|                 X|                           X|                   1|             X|  # noqa
    # |                          int|          None|                 1|                   1|     X|                   0|                            X|           1|                     X|                 X|                 X|                           X|                   1|             X|  # noqa
    # |                       bigint|          None|                 1|                   1|     X|                   X|                            0|           1|                     X|                 X|                 X|                           X|                   1|             X|  # noqa
    # |                       string|          None|            'True'|                 '1'|   'a'|        '1970-01-01'|         '1970-01-01 00:00...|       '1.0'|     "array('i', [1])"|             '[1]'|            '(1,)'|         "bytearray(b'ABC')"|                 '1'|    "{'a': 1}"|  # noqa
    # |                         date|          None|                 X|                   X|     X|datetime.date(197...|         datetime.date(197...|           X|                     X|                 X|                 X|                           X|datetime.date(197...|             X|  # noqa
    # |                    timestamp|          None|                 X|datetime.datetime...|     X|                   X|         datetime.datetime...|           X|                     X|                 X|                 X|                           X|datetime.datetime...|             X|  # noqa
    # |                        float|          None|               1.0|                 1.0|     X|                   X|                            X|         1.0|                     X|                 X|                 X|                           X|                 1.0|             X|  # noqa
    # |                       double|          None|               1.0|                 1.0|     X|                   X|                            X|         1.0|                     X|                 X|                 X|                           X|                 1.0|             X|  # noqa
    # |                       binary|          None|bytearray(b'\x00')|  bytearray(b'\x00')|     X|                   X|                            X|           X|  bytearray(b'\x01\...|bytearray(b'\x01')|bytearray(b'\x01')|           bytearray(b'ABC')|                   X|             X|  # noqa
    # |                decimal(10,0)|          None|                 X|                   X|     X|                   X|                            X|Decimal('1')|                     X|                 X|                 X|                           X|        Decimal('1')|             X|  # noqa
    # +-----------------------------+--------------+------------------+--------------------+------+--------------------+-----------------------------+------------+----------------------+------------------+------------------+----------------------------+--------------------+--------------+  # noqa

New type coercion (arrow batch eval):

    # +-----------------------------+--------------+------------------+--------------------+------+--------------------+-----------------------------+------------+----------------------+------------------+------------------+----------------------------+--------------------+--------------+  # noqa
    # |SQL Type \ Python Value(Type)|None(NoneType)|        True(bool)|              1(int)|a(str)|    1970-01-01(date)|1970-01-01 00:00:00(datetime)|  1.0(float)|array('i', [1])(array)|         [1](list)|       (1,)(tuple)|bytearray(b'ABC')(bytearray)|          1(Decimal)|{'a': 1}(dict)|  # noqa
    # +-----------------------------+--------------+------------------+--------------------+------+--------------------+-----------------------------+------------+----------------------+------------------+------------------+----------------------------+--------------------+--------------+  # noqa
    # |                      boolean|          None|              True|                True|     X|                   X|                            X|        True|                     X|                 X|                 X|                           X|                   X|             X|  # noqa
    # |                      tinyint|          None|                 1|                   1|     X|                   X|                            X|           1|                     X|                 X|                 X|                           X|                   1|             X|  # noqa
    # |                     smallint|          None|                 1|                   1|     X|                   X|                            X|           1|                     X|                 X|                 X|                           X|                   1|             X|  # noqa
    # |                          int|          None|                 1|                   1|     X|                   0|                            X|           1|                     X|                 X|                 X|                           X|                   1|             X|  # noqa
    # |                       bigint|          None|                 1|                   1|     X|                   X|                            0|           1|                     X|                 X|                 X|                           X|                   1|             X|  # noqa
    # |                       string|          None|            'True'|                 '1'|   'a'|        '1970-01-01'|         '1970-01-01 00:00...|       '1.0'|     "array('i', [1])"|             '[1]'|            '(1,)'|         "bytearray(b'ABC')"|                 '1'|    "{'a': 1}"|  # noqa
    # |                         date|          None|                 X|                   X|     X|datetime.date(197...|         datetime.date(197...|           X|                     X|                 X|                 X|                           X|datetime.date(197...|             X|  # noqa
    # |                    timestamp|          None|                 X|datetime.datetime...|     X|                   X|         datetime.datetime...|           X|                     X|                 X|                 X|                           X|datetime.datetime...|             X|  # noqa
    # |                        float|          None|               1.0|                 1.0|     X|                   X|                            X|         1.0|                     X|                 X|                 X|                           X|                 1.0|             X|  # noqa
    # |                       double|          None|               1.0|                 1.0|     X|                   X|                            X|         1.0|                     X|                 X|                 X|                           X|                 1.0|             X|  # noqa
    # |                       binary|          None|bytearray(b'\x00')|  bytearray(b'\x00')|     X|                   X|                            X|           X|  bytearray(b'\x01\...|bytearray(b'\x01')|bytearray(b'\x01')|           bytearray(b'ABC')|                   X|             X|  # noqa
    # |                decimal(10,0)|          None|                 X|                   X|     X|                   X|                            X|Decimal('1')|                     X|                 X|                 X|                           X|        Decimal('1')|             X|  # noqa
    # +-----------------------------+--------------+------------------+--------------------+------+--------------------+-----------------------------+------------+----------------------+------------------+------------------+----------------------------+--------------------+--------------+  # noqa

How was this patch tested?

Added tests for both the legacy and new code paths for Arrow batch eval.

Was this patch authored or co-authored using generative AI tooling?

No

HyukjinKwon

From a cursory look, this seems to make sense.

HyukjinKwon · 2025-06-23T02:53:23Z

python/pyspark/worker.py

-    arrow_return_type = to_arrow_type(
-        return_type, prefers_large_types=use_large_var_types(runner_conf)
-    )
+def wrap_arrow_array_iter_udf(f, return_type, runner_conf):	


let's get rid of those white spaces tho

zhengruifeng · 2025-06-25T07:06:17Z

python/pyspark/sql/tests/arrow/test_arrow_python_udf.py

+    def test_complex_input_types(self):
+        for pandas_conversion in [True, False]:
+            with self.subTest(pandas_conversion=pandas_conversion), self.sql_conf(
+                {"spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled": str(pandas_conversion).lower()}


It seems we can move the config setting into the setUpClass method
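A rough sketch of what that suggestion could look like, using only the standard `unittest` module. `FakeConf` and `LegacyConversionTests` are hypothetical stand-ins (a real PySpark test would set the value on the Spark session instead); only the config key is taken from the diff above.

```python
import unittest

CONF_KEY = "spark.sql.legacy.execution.pythonUDF.pandas.conversion.enabled"


class FakeConf:
    """Hypothetical stand-in for a session config; not a PySpark class."""

    def __init__(self):
        self._values = {}

    def set(self, key, value):
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)

    def unset(self, key):
        self._values.pop(key, None)


class LegacyConversionTests(unittest.TestCase):
    conf = FakeConf()

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # Set the flag once for the whole class instead of in every test.
        cls.conf.set(CONF_KEY, "true")

    @classmethod
    def tearDownClass(cls):
        # Restore the default so other test classes are unaffected.
        cls.conf.unset(CONF_KEY)
        super().tearDownClass()

    def test_conf_is_visible(self):
        self.assertEqual(self.conf.get(CONF_KEY), "true")
```

With this layout, one class per setting (legacy and non-legacy) can share the same test mixin, and the individual tests no longer need to loop over both values.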

zhengruifeng · 2025-06-25T07:07:03Z

python/pyspark/sql/tests/arrow/test_arrow_python_udf.py

+@unittest.skipIf(
+    not have_pandas or not have_pyarrow, pandas_requirement_message or pyarrow_requirement_message
+)
+class ArrowPythonUDFLegacyTestsMixin(BaseUDFTestsMixin):


TODO: also add parity tests in pyspark.sql.tests.connect.arrow.test_parity_arrow_python_udf

zhengruifeng · 2025-06-25T07:08:19Z

python/pyspark/sql/pandas/serializers.py

+                elif isinstance(packed, list):
+                    # multiple array UDFs in a projection
+                    arrs = [self._create_array(t[0], t[1], self._arrow_cast) for t in packed]
+                elif isinstance(packed, tuple) and len(packed) == 3:


It seems the conditions in the Arrow-optimized UDF are more complicated.
In what case will this branch be chosen?

asl3 added 12 commits June 18, 2025 23:07

wrap_arrow_udf and serializer

9dfe326

evaltype

779fea9

nit

a3cded6

test

33f6f67

tmp

8440f9c

skip variant tests

05d87c8

arrow batch serializer

1dd0ceb

refactor

ffa562a

rename

e075c0d

refactor

93b44ca

nit

b1f8965

spacing

8ef0726

github-actions bot added SQL DOCS CORE PYTHON labels Jun 19, 2025

asl3 changed the title ~~[DRAFT][PYTHON] Improve Python UDTF arrow serializer performance~~ [DRAFT][PYTHON] Improve Python UDF arrow serializer performance Jun 19, 2025

asl3 changed the title ~~[DRAFT][PYTHON] Improve Python UDF arrow serializer performance~~ [DRAFT][PYTHON] Improve Python UDF Arrow Serializer Performance Jun 19, 2025

asl3 added 7 commits June 22, 2025 11:52

fmt

b871ab1

update test

d9570a1

scalar arrow

8f40352

spacing

fc844c2

spacing

8f9420c

comment

915f919

sql scalar arrow iter udf

fc26618

HyukjinKwon marked this pull request as draft June 23, 2025 02:52

HyukjinKwon reviewed Jun 23, 2025

asl3 added 2 commits June 22, 2025 21:07

whitespace

03cac02

restore

81f977a

asl3 added 9 commits June 22, 2025 21:12

nit

3e0d81f

nit

a384c81

skip test

d8681dc

test errors

737acf0

SPARK-34545 test

6dd239d

cleanup

3e99d5d

remove skip tests

502f201

fmt

a8ab3e1

fmt

06469b1

zhengruifeng reviewed Jun 25, 2025

asl3 added 2 commits June 25, 2025 09:09

refactor legacy/non-legacy tests

80e34ec

tmp

9525c5d

zhengruifeng requested review from xinrong-meng June 26, 2025 04:17

asl3 added 8 commits June 26, 2025 10:22

tmp

b107f0f

nits

5fbde58

fmt

6c4f3b3

spacing

a898ed3

comment

e0898c4

fmt

e54f7be

fmt

42e46db

test

039733c
