
feat: Add Spark from_json function #11709

Open · wants to merge 3 commits into main
Conversation

@zhli1142015 (Contributor) commented Dec 2, 2024:

Why this PR reimplements JSON parsing logic instead of reusing CAST(JSON):

Failure handling:
On failure, from_json returns NULL for the affected value. For instance, parsing the malformed input {"a 1} with a ROW schema results in {NULL}.
Root type restrictions:
Only ROW, ARRAY, and MAP types are allowed as root types.
Boolean handling:
Only true and false are valid boolean values. Numeric values and strings result in NULL.
Integral type handling:
Only integral values are valid for integral types. Floating-point values and strings produce NULL.
Float/double handling:
All numeric values are valid for float/double types. For strings, however, only special spellings such as "NaN" and "INF" are valid.
Array handling:
Spark accepts a JSON object as input for an array schema only when the array is the root type and its element type is a ROW.
Map handling:
Keys in a MAP can only be of VARCHAR type. For example, parsing {"3": 3} results in {"3": 3} rather than {3: 3}.
Row handling:
Spark supports partial output mode. However, it does not accept a JSON array as input when parsing a ROW.
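The rules above can be sketched as a minimal Python model. This is only an illustration of the described semantics, not Velox's actual C++ implementation; the helper names (`convert`, `from_json_row`) and the type-name strings are hypothetical.

```python
import json
import math

# Special string spellings accepted for float/double columns (per the rules above).
_SPECIAL_DOUBLES = {
    "NaN": math.nan,
    "+INF": math.inf, "-INF": -math.inf, "INF": math.inf,
    "+Infinity": math.inf, "-Infinity": -math.inf, "Infinity": math.inf,
}

def convert(value, target):
    """Coerce a parsed JSON value to `target`; None models SQL NULL on mismatch."""
    if target == "boolean":
        # Only JSON true/false are valid; numbers and strings yield NULL.
        return value if isinstance(value, bool) else None
    if target == "bigint":
        # Only integral values are valid; floats and strings yield NULL.
        return value if isinstance(value, int) and not isinstance(value, bool) else None
    if target == "double":
        # Any numeric value is valid; of strings, only the special spellings.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
        if isinstance(value, str):
            return _SPECIAL_DOUBLES.get(value)
        return None
    if target == "varchar":
        return value if isinstance(value, str) else None
    return None

def from_json_row(text, schema):
    """Parse `text` against a ROW schema given as {field: type}.
    Malformed input, or a JSON array where a ROW is expected, yields a row of NULLs."""
    try:
        obj = json.loads(text)
    except ValueError:
        obj = None
    if not isinstance(obj, dict):
        obj = None
    if obj is None:
        return {k: None for k in schema}
    return {k: convert(obj.get(k), t) for k, t in schema.items()}
```

For example, `convert(3, "bigint")` gives 3 while `convert(3.0, "bigint")` gives None, and the malformed `{"a 1}` produces a row of NULLs rather than an error.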

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 2, 2024
netlify bot commented Dec 2, 2024:

Deploy Preview for meta-velox canceled.

🔨 Latest commit: 48024c2
🔍 Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/677bf3d2dad88a0008c48867

@zhli1142015 (Contributor, Author) commented:

cc @rui-mo and @PHILO-HE , thanks.

@rui-mo (Collaborator) left a comment:

Thanks. Added some initial comments.

Review comments on velox/functions/sparksql/specialforms/CMakeLists.txt and velox/functions/sparksql/specialforms/FromJson.cpp (outdated/resolved).
@jinchengchenghh (Contributor) left a comment:

The JSON input is highly variable; how can we make sure the whole implementation matches Spark? Maybe we need to survey from_json usage in Spark and verify that the results are correct.

Review comments on velox/docs/functions/spark/json.rst and velox/functions/sparksql/specialforms/FromJson.cpp.
@zhli1142015 (Contributor, Author) commented:

The JSON input is highly variable; how can we make sure the whole implementation matches Spark? Maybe we need to survey from_json usage in Spark and verify that the results are correct.

The current implementation supports only Spark's default behavior, and we should fall back to Spark's implementation when specific unsupported cases arise. These include: user-provided options are non-empty; the schema contains unsupported types; the schema includes a column whose name matches spark.sql.columnNameOfCorruptRecord; or the configuration spark.sql.json.enablePartialResults is disabled.

The only existing unit tests in Spark related to this function are found in JsonExpressionsSuite and JsonFunctionsSuite. I have verified that these tests pass and added missing tests to ensure the current implementation aligns with Spark's behavior. For further details, please refer to the new unit tests included in this PR.
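The fallback conditions described above can be expressed as a small predicate. This is a hypothetical sketch: the function name, parameters, and supported-type list are illustrative, not Gluten's actual API (though "_corrupt_record" is Spark's default for spark.sql.columnNameOfCorruptRecord).

```python
# Illustrative set of schema types the Velox implementation could handle.
SUPPORTED_TYPES = {"boolean", "tinyint", "smallint", "integer", "bigint",
                   "float", "double", "varchar", "array", "map", "row"}

def can_offload_from_json(options, schema_types, schema_field_names,
                          corrupt_record_col="_corrupt_record",
                          enable_partial_results=True):
    """Return True only when Velox's from_json can match Spark's default behavior."""
    if options:                        # any user-provided option forces a fallback
        return False
    if not enable_partial_results:     # spark.sql.json.enablePartialResults must be on
        return False
    if corrupt_record_col in schema_field_names:
        return False                   # corrupt-record columns are not supported
    # every type appearing in the schema must be supported
    return all(t in SUPPORTED_TYPES for t in schema_types)
```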

@zhli1142015 force-pushed the add_from_json branch 2 times, most recently from a284e49 to 2762885 on December 10, 2024 at 07:57
@rui-mo (Collaborator) left a comment:

Thanks.

Review comments on velox/docs/functions/spark/json.rst and velox/functions/sparksql/specialforms/FromJson.cpp (outdated/resolved).
@rui-mo changed the title from "feat: Add from_json Spark function" to "feat: Add Spark from_json function" on Dec 10, 2024
@zhli1142015 zhli1142015 requested a review from rui-mo December 11, 2024 03:39
@rui-mo (Collaborator) left a comment:

Thanks for the update! Added some comments.

Review comments on velox/docs/functions/spark/json.rst and velox/functions/sparksql/specialforms/FromJson.cpp (outdated/resolved).
@zhli1142015 zhli1142015 requested a review from rui-mo December 12, 2024 10:56
@zhli1142015 force-pushed the add_from_json branch 3 times, most recently from c3696df to d5d801b on December 17, 2024 at 04:22
@rui-mo (Collaborator) left a comment:

Thanks for iterating!

Review comments on velox/docs/functions/spark/json.rst and velox/functions/sparksql/specialforms/FromJson.cpp (outdated/resolved).
@PHILO-HE (Contributor) left a comment:

Looks basically good.
Are nested complex types supported, e.g., an array element that is itself an array, struct, or map? It would be better to clarify this in the documentation and add some tests if they are lacking. Thanks!

Review comment on velox/docs/functions/spark/json.rst (outdated/resolved).
@zhli1142015 (Contributor, Author) commented:

The unit-test failure is not related to this PR:

E20241219 05:35:37.812153 14341 Exceptions.h:66] Line: /__w/velox/velox/velox/connectors/hive/storage_adapters/hdfs/tests/HdfsMiniCluster.cpp:92, Function:addFile, Expression:  Failed to add file to hdfs, exit code: 383, Source: RUNTIME, ErrorCode: INVALID_STATE
unknown file: Failure
C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Failed to add file to hdfs, exit code: 383
Retriable: False
Function: addFile
File: /__w/velox/velox/velox/connectors/hive/storage_adapters/hdfs/tests/HdfsMiniCluster.cpp
Line: 92
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox11filesystems4test15HdfsMiniCluster7addFileENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_
# 4  _ZN18HdfsFileSystemTest14SetUpTestSuiteEv
# 5  _ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_9TestSuiteEvEET0_PT_MS4_FS3_vEPKc
# 6  _ZN7testing9TestSuite3RunEv
# 7  _ZN7testing8internal12UnitTestImpl11RunAllTestsEv
# 8  _ZN7testing8UnitTest3RunEv
# 9  main
# 10 __libc_start_call_main
# 11 __libc_start_main
# 12 _start
" thrown in SetUpTestSuite().

@ayushi-agarwal commented Dec 20, 2024:

@zhli1142015 I was trying this patch on a workload with this filter condition:
Arguments: ((size(from_json(ArrayType(StringType,true), item_array#7302, Some(Etc/UTC)), true) > 0) AND isnotnull(from_json(ArrayType(StringType,true), item_array#7302, Some(Etc/UTC))))

This filter produces 0 rows when applied, whereas vanilla Spark produces 4 million rows.
Could you suggest what the issue might be?

I tried on a smaller dataset:

// Sample data
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("item_array", StringType, nullable = true)
))
val data = Seq(
  (1, """["item1", "item2"]"""),
  (2, null),
  (3, """[]"""),
  (4, """["item3", null, "item4"]"""),
  (5, null),
  (6, """["item5"]""")
)
val query = """
  SELECT id, from_json(item_array, 'array<string>')
  FROM items_parquet_no_implicits
  WHERE size(from_json(item_array, 'array<string>')) > 0
    AND isnotnull(from_json(item_array, 'array<string>'))
"""

But this gives the correct result with offloading.

@zhli1142015 (Contributor, Author) commented Dec 20, 2024:

@zhli1142015 I was trying this patch on a workload with this filter condition: Arguments: ((size(from_json(ArrayType(StringType,true), item_array#7302, Some(Etc/UTC)), true) > 0) AND isnotnull(from_json(ArrayType(StringType,true), item_array#7302, Some(Etc/UTC))))

This filter produces 0 rows when applied, whereas vanilla Spark produces 4 million rows. Could you suggest what the issue might be, and does it need handling in Gluten so that such expressions are not offloaded to Velox?

This patch currently supports the function only with default settings; for other cases, fallback handling is required in Gluten. Additionally, there is one known limitation: single quotes are not supported.
Please check whether this is related. If not, could you provide a simple reproduction case? I'd be happy to investigate further.
Thanks.
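The single-quote limitation can be reproduced with any strict JSON parser; Python's json module is used below purely as a stand-in for the parser Velox uses, to show how a single-quoted input turns into NULL.

```python
import json

def parse_or_null(text):
    """Strict JSON parsing: any syntax error (e.g. single quotes) yields NULL."""
    try:
        return json.loads(text)
    except ValueError:
        return None

print(parse_or_null('["item1", "item2"]'))  # ['item1', 'item2']
print(parse_or_null("['item5']"))           # None: single quotes are not valid JSON
```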

@ayushi-agarwal commented Dec 20, 2024:

@zhli1142015 I was trying this patch on a workload with this filter condition: Arguments: ((size(from_json(ArrayType(StringType,true), item_array#7302, Some(Etc/UTC)), true) > 0) AND isnotnull(from_json(ArrayType(StringType,true), item_array#7302, Some(Etc/UTC))))
This filter produces 0 rows when applied, whereas vanilla Spark produces 4 million rows. Could you suggest what the issue might be, and does it need handling in Gluten so that such expressions are not offloaded to Velox?

This patch currently supports the function only with default settings; for other cases, fallback handling is required in Gluten. Additionally, there is one known limitation: single quotes are not supported. Please check whether this is related. If not, could you provide a simple reproduction case? I'd be happy to investigate further. Thanks.

In the data above, I replaced the array items with single quotes for the row containing item5, and that row was not selected by the filter. But when I remove the filter and just run a select, it gives correct results. Will this single-quote issue occur only with a filter? Also, could you elaborate on what counts as default settings?

val query = """
  SELECT id, from_json(item_array, 'array<string>')
  FROM items_parquet_no_implicits
"""
// Sample data
val data = Seq(
  (1, """["item1", "item2"]"""),
  (2, null),
  (3, """[]"""),
  (4, """["item3", null, "item4"]"""),
  (5, null),
  (6, """['item5']""")
)

Output:
+---+---------------------+
| id|from_json(item_array)|
+---+---------------------+
|  4| [item3, null, item4]|
|  1|       [item1, item2]|
|  6|              [item5]|
|  3|                   []|
|  2|                 null|
|  5|                 null|
+---+---------------------+

@zhli1142015 (Contributor, Author) commented Dec 20, 2024:

The single-quote limitation comes from the JSON parser used in Velox: the parser reports an error when it encounters single quotes, so NULL is generated.
From the output above, I think you used show() to get the results, which triggers a fallback with cast. If you use collect(), it should produce consistent results:
[6,null]

By the way, the JSON standard requires double quotes and does not accept single quotes, so most parsers reject them. Is single-quote support required in your cases?
cc @PHILO-HE

@ayushi-agarwal commented:

Thank you @zhli1142015 for the clarification and the pointer. I was using show() earlier; with collect(), the results are consistent. I will get the actual data from the team and check with them whether it can be modified.

@zhli1142015 force-pushed the add_from_json branch 2 times, most recently from 43490eb to d6d3687 on December 24, 2024 at 02:30
@PHILO-HE (Contributor) commented:

@zhli1142015, could you create a Gluten PR to get Gluten CI's feedback in advance? That way we can discover unsupported cases from the Spark UTs. It is fine with me if some unsupported cases fall back to Spark, or we can simply document the unsupported cases in the Gluten docs if they are not easy to fix in this PR.

It requires changing the Velox branch to one with your patch applied; the change can be reverted once this PR is merged.
https://github.com/apache/incubator-gluten/blob/main/ep/build-velox/src/get_velox.sh#L20

@zhouyuan (Contributor) commented:

@zhli1142015, could you create a Gluten PR to get Gluten CI's feedback in advance? That way we can discover unsupported cases from the Spark UTs. It is fine with me if some unsupported cases fall back to Spark, or we can simply document them in the Gluten docs if they are not easy to fix in this PR.
It requires changing the Velox branch to one with your patch applied; the change can be reverted once this PR is merged. https://github.com/apache/incubator-gluten/blob/main/ep/build-velox/src/get_velox.sh#L20

I made a quick PR to test this:
apache/incubator-gluten#8318

@zhli1142015 (Contributor, Author) commented:

Thanks @zhouyuan.
Let me create one; there are certain cases where we need to fall back.

@ayushi-agarwal commented Dec 24, 2024:

@zhli1142015 Got the data from the team. This is the case where it was giving wrong results for row number 6:

val data = Seq(
  (6, """[{"holidayTag":"1"},{"holidayTag":"2"}]""")
)
spark.sql("select from_json(item_array, 'array<string>') AS itemArray from newT") // newT is a temp view created from the dataframe built from this data.

Do we plan to support this, or will it fall back for now?

@zhli1142015 (Contributor, Author) commented:

@zhli1142015 Got the data from the team. This is the case where it was giving wrong results for row number 6: val data = Seq((6, """[{"holidayTag":"1"},{"holidayTag":"2"}]""")) spark.sql("select from_json(item_array, 'array<string>') AS itemArray from newT") // newT is a temp view created from this data. Do we plan to support this, or will it fall back for now?

Thanks, this is actually a missed case; I have updated the logic to address it.
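Assuming the fix mirrors Spark's behavior of keeping the raw JSON text when a string-typed element encounters a non-string value (an assumption; the thread does not spell out the fix), the reported case can be sketched in Python as follows — the function names are illustrative only:

```python
import json

def element_as_string(value):
    """Render a non-string JSON value back to its raw JSON text for a string slot.
    (Assumed behavior: objects/numbers under an array<string> schema keep their JSON text.)"""
    return value if isinstance(value, str) else json.dumps(value, separators=(",", ":"))

def from_json_string_array(text):
    """Parse `text` against an array<string> schema; non-array input yields NULL."""
    try:
        arr = json.loads(text)
    except ValueError:
        return None
    if not isinstance(arr, list):
        return None
    # null elements stay null; other values are converted per element_as_string
    return [None if v is None else element_as_string(v) for v in arr]
```

With this model, the row-6 input `[{"holidayTag":"1"},{"holidayTag":"2"}]` yields an array of two JSON-text strings rather than NULLs.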

@ayushi-agarwal commented:

@zhli1142015 Got the data from the team. This is the case where it was giving wrong results for row number 6: val data = Seq((6, """[{"holidayTag":"1"},{"holidayTag":"2"}]""")) spark.sql("select from_json(item_array, 'array<string>') AS itemArray from newT") // newT is a temp view created from this data. Do we plan to support this, or will it fall back for now?

Thanks, this is actually a missed case; I have updated the logic to address it.

Thanks for the quick response and the update.

@rui-mo (Collaborator) left a comment:

Hi @zhli1142015, could you help document the limitations of the current implementation, as mentioned in #11709 (comment) and #11709 (comment)?

@zhli1142015 zhli1142015 requested a review from rui-mo December 27, 2024 08:05
@zhli1142015 (Contributor, Author) commented:

Hi @zhli1142015, could you help document the limitations of the current implementation, as mentioned in #11709 (comment) and #11709 (comment)?

Updated, thanks.

@zhli1142015 (Contributor, Author) commented:

Kind ping, @rui-mo and @PHILO-HE: do you have any more comments? Thanks.
