[GLUTEN-8343][CH]Fix cast number to decimal and improve performance of it #8351

KevinyhZou · 2024-12-26T06:44:10Z

What changes were proposed in this pull request?

Fixes #8343 and improves performance (see #8351 (comment)).

How was this patch tested?

Tested by unit tests.

github-actions · 2024-12-26T06:44:28Z

#8343

github-actions · 2024-12-26T06:44:42Z

Run Gluten Clickhouse CI on x86

taiyang-li · 2025-01-06T08:08:11Z

cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

@@ -146,17 +156,18 @@ class FunctionCheckDecimalOverflow : public IFunction
    }

 private:
-    template <typename T, typename ToDataType>
+    template <typename FromDataType, typename ToDataType, typename ColVecType, typename T = FromDataType::FieldType>


Why not remove T from the template parameters and add "using T = typename FromDataType::FieldType;" in the function body below?
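A minimal sketch of the suggestion above, not the PR's actual code: the element type is derived with an alias inside the body instead of being exposed as a defaulted template parameter. DataTypeInt32Like and fieldTypeIsSigned are hypothetical names used only for illustration.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a data type class that exposes a FieldType alias.
struct DataTypeInt32Like
{
    using FieldType = std::int32_t;
};

// Only the data type is a template parameter. The field type is derived
// inside the body, so a caller cannot override it by accident.
template <typename FromDataType>
constexpr bool fieldTypeIsSigned()
{
    using T = typename FromDataType::FieldType;
    return std::is_signed_v<T>;
}
```

Keeping derived types out of the parameter list also shortens every explicit instantiation at the call sites.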

taiyang-li · 2025-01-06T08:08:59Z

cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

@@ -192,20 +203,61 @@ class FunctionCheckDecimalOverflow : public IFunction
            result_column = std::move(col_to);
    }

-    template <is_decimal FromFieldType, typename ToDataType>
+    template <typename FromDataType, typename ToDataType, typename FromFieldType = FromDataType::FieldType>


Remove FromFieldType to keep the template parameters simple.

taiyang-li · 2025-01-07T01:49:54Z

cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

+                    return true;
+                }
+            }
+            else if constexpr (IsDataTypeNumber<FromDataType>)


The if and else if branches above should be merged. Reminder: use using ColVecType = ColumnVectorOrDecimal<T>;

taiyang-li · 2025-01-07T01:54:21Z

cpp-ch/local-engine/Parser/ExpressionParser.cpp

-                args.emplace_back(addConstColumn(actions_dag, std::make_shared<DataTypeInt32>(), substrait_type.decimal().scale()));
-                result_node = toFunctionNode(actions_dag, "checkDecimalOverflowSparkOrNull", args);
+                int decimal_precision = substrait_type.decimal().precision();
+                if (decimal_precision != 0)


if (decimal_precision)

KevinyhZou · 2025-01-07T07:21:40Z

End-to-end performance test

Test SQL: select count(1) from test_tbl where cast(d as decimal(5,2)) > 1;
Data volume: 60000000
Before this PR:
2.18s, 2.245s, 2.259s
After this PR:
2.623s, 2.618s, 2.659s

Vanilla time:
15.936s, 15.011s, 16.411s

taiyang-li · 2025-01-07T08:17:51Z

cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

        else
-            return convertDecimalsImpl<DataTypeDecimal<Decimal256>, ToDataType>(decimal, precision_to, scale_from, scale_to, result);
+        {
+            if constexpr (std::is_same_v<FromFieldType, BFloat16>)


Remove the useless branch.

taiyang-li · 2025-01-07T08:18:43Z

cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

+        {
+            if constexpr (exception_mode == CheckExceptionMode::Null)
+                return false;
+            else


Remove the useless branch here and in any other places.

taiyang-li · 2025-01-07T08:23:42Z

cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

+
+    template <typename FromDataType, typename ToDataType>
+    requires(IsDataTypeNumber<FromDataType> && IsDataTypeDecimal<ToDataType>)
+    static bool convertNumberToDecimalImpl(


Mark this function ALWAYS_INLINE.

taiyang-li · 2025-01-07T08:24:59Z

cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

+                                           : static_cast<int>(std::log10(std::fabs(int_part))) + 1;
+        /// If the integer part's digits of the number is greater than (precision - scale), e.g. cast(55 as decimal(2, 1)),
+        /// then we should return NULL or throw exceptions.
+        if (int_part_digits > precision - scale)


The if and else could be merged: return int_part_digits <= precision - scale && tryConvertToDecimal<FromDataType, ToDataType>(value, scale, result);

taiyang-li · 2025-01-07T08:33:26Z

cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

+
+        int int_part_digits = int_part == 0 ? 1 :
+                              int_part > 0 ? static_cast<int>(std::log10(int_part)) + 1
+                                           : static_cast<int>(std::log10(std::fabs(int_part))) + 1;


I guess std::log10 and std::fabs are too heavy for this function. Maybe this is better:

auto casted_int_part = static_cast<ToDataType::FieldType>(int_part); bool in_range = casted_int_part >= min_value && casted_int_part <= max_value;

min_value/max_value are the minimum/maximum values that can be represented in precision - scale digits. They can be calculated outside the for loop, which removes the cost of std::log10 and std::fabs.
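The idea in this comment can be sketched as follows, under stated assumptions: intPartLimit and intPartFits are hypothetical helpers, the limit is held in a 64-bit integer (so precision - scale must be at most 18 here), and only the integer-part pre-check is covered, not the rounding done by the later decimal conversion.

```cpp
#include <cstdint>

// Computes 10^(precision - scale) once, outside the per-row loop.
// Assumption: precision - scale <= 18, so the result fits in int64_t.
constexpr std::int64_t intPartLimit(int precision, int scale)
{
    std::int64_t limit = 1;
    for (int i = 0; i < precision - scale; ++i)
        limit *= 10;
    return limit;
}

// Per-row check: two comparisons instead of std::log10 and std::fabs.
// The integer part of value has at most (precision - scale) digits
// exactly when |value| < limit. NaN fails both comparisons.
constexpr bool intPartFits(double value, std::int64_t limit)
{
    const double bound = static_cast<double>(limit);
    return value > -bound && value < bound;
}
```

For cast(55 as decimal(2, 1)) the limit is 10, so 55 is rejected without counting digits.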

taiyang-li · 2025-01-07T08:35:18Z

cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

+            if constexpr (std::is_same_v<FromFieldType, BFloat16>)
+                return tryConvertToDecimal<DataTypeFloat32, ToDataType>(static_cast<Float32>(value), scale, result);
+            else
+                return tryConvertToDecimal<FromDataType, ToDataType>(value, scale, result);


I'm curious: if (int_part_digits > precision - scale) is true, will tryConvertToDecimal return false?

taiyang-li · 2025-01-07T08:42:32Z

cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

            ToFieldType result;
-            bool success = convertToDecimalImpl<T, ToDataType>(datas[i], precision, scale_from, scale_to, result);
+            bool success = convertToDecimalImpl<FromDataType, ToDataType>(datas[i], precision, scale_from, scale_to, result);

            if (success)


Remove the if/else inside loops where possible:

vec_to[i] = static_cast<ToFieldType>(result); (*vec_null_map_to)[i] = success;
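A rough, self-contained sketch of the branch-free loop body suggested here. tryConvert, Converted, and convertAll are hypothetical stand-ins, not the function's real helpers, and the flag array is named valid on purpose: whether the real null map should store success or its negation depends on the null-map convention in use.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical converter: succeeds only for values in [-999, 999]
// and leaves result at 0 on failure.
constexpr bool tryConvert(std::int64_t value, std::int64_t & result)
{
    const bool ok = value >= -999 && value <= 999;
    result = ok ? value : 0;
    return ok;
}

template <std::size_t N>
struct Converted
{
    std::array<std::int64_t, N> values{};
    std::array<std::uint8_t, N> valid{};
};

template <std::size_t N>
constexpr Converted<N> convertAll(const std::array<std::int64_t, N> & input)
{
    Converted<N> out;
    for (std::size_t i = 0; i < N; ++i)
    {
        std::int64_t result = 0;
        const bool success = tryConvert(input[i], result);
        // Both stores are unconditional: no if/else on success in the loop.
        out.values[i] = result;
        out.valid[i] = success;
    }
    return out;
}
```

Unconditional stores give the compiler a simpler loop to optimize, at the cost of writing a placeholder value for failed rows.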

KevinyhZou · 2025-01-07T12:39:23Z

End-to-end performance test

Test SQL: select count(1) from test_tbl where cast(d as decimal(5,2)) > 1;
Data volume: 60000000
Before this PR:
2.18s, 2.245s, 2.259s
After this PR:
1.988s, 1.891s, 1.933s

Vanilla time:
15.936s, 15.011s, 16.411s

taiyang-li · 2025-01-07T15:31:36Z

cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

        UInt32 scale,
+        Int64 decimal_int_part_max,


Int64 is not wide enough to represent the min/max value. Consider precision = 38 and scale = 0.

Use NativeTypeToDataType::FieldType instead.
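To make the width concern concrete, here is a small sketch (powerOfTenFits is a hypothetical helper, not part of the PR) that checks whether 10^digits is representable in a given integer type without overflowing during the check itself.

```cpp
#include <cstdint>
#include <limits>

// Returns true when 10^digits fits in Int. The guard runs before each
// multiplication, so the computation never overflows.
template <typename Int>
constexpr bool powerOfTenFits(int digits)
{
    Int value = 1;
    for (int i = 0; i < digits; ++i)
    {
        if (value > std::numeric_limits<Int>::max() / 10)
            return false;
        value *= 10;
    }
    return true;
}
```

A signed 64-bit integer holds 10^18 but not 10^19, so a bound for 38 integer digits needs a field type of at least 128 bits.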

github-actions bot added the CLICKHOUSE label Dec 26, 2024

KevinyhZou force-pushed the fix_cast_number_to_decimal branch from 9daf8db to d92a3f8 December 27, 2024 07:23

taiyang-li reviewed Jan 6, 2025: cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

taiyang-li reviewed Jan 6, 2025: cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp (outdated)

taiyang-li reviewed Jan 6, 2025: cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp

KevinyhZou force-pushed the fix_cast_number_to_decimal branch from a7fc8fe to 8d94e26 January 6, 2025 09:48

taiyang-li reviewed Jan 7, 2025: ...c/test/scala/org/apache/gluten/execution/tpch/GlutenClickHouseTPCHSaltNullParquetSuite.scala (outdated)

taiyang-li reviewed Jan 7, 2025: cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp (outdated)

taiyang-li reviewed Jan 7, 2025: cpp-ch/local-engine/Functions/SparkFunctionCheckDecimalOverflow.cpp (outdated)

taiyang-li reviewed Jan 7, 2025

taiyang-li requested changes Jan 7, 2025

taiyang-li reviewed Jan 7, 2025

Commit db3a9e2: fix cast number to decimal

KevinyhZou force-pushed the fix_cast_number_to_decimal branch from f70615e to db3a9e2 January 9, 2025 02:30

Commit 5881bc4: simply code

Commit 06ec1c3: fix ci

KevinyhZou force-pushed the fix_cast_number_to_decimal branch from 0b64fdc to 06ec1c3 January 9, 2025 10:05

taiyang-li approved these changes Jan 10, 2025

taiyang-li merged commit 66e816f into apache:main Jan 10, 2025
8 checks passed

taiyang-li changed the title ~~[GLUTEN-8343][CH]Fix cast number to decimal~~ [GLUTEN-8343][CH]Fix cast number to decimal and improve performance of it Jan 10, 2025
