
feat: Add Spark from_json function #11709

Open · wants to merge 3 commits into main
Conversation

@zhli1142015 (Contributor) commented Dec 2, 2024:

Why this PR reimplements JSON parsing logic instead of reusing CAST(JSON):

Failure handling:
On failure, from_json returns NULL for the affected value. For instance, parsing the malformed input {"a 1} with a ROW schema results in {NULL}.
Root type restrictions:
Only ROW, ARRAY, and MAP types are allowed as root types.
Boolean handling:
Only true and false are valid boolean values. Numeric values and strings result in NULL.
Integral type handling:
Only integral values are valid for integral types. Floating-point values and strings produce NULL.
Float/double handling:
All numeric values are valid for float/double types. For strings, however, only special spellings such as "NaN" and "INF" are valid.
Array handling:
Spark accepts a JSON object as input for an array schema only when the array is the root type and its element type is a ROW.
Map handling:
Keys in a MAP can only be of VARCHAR type. For example, parsing {"3": 3} results in {"3": 3} rather than {3: 3}.
Row handling:
Spark supports partial output mode. However, it does not accept a JSON array as input when parsing a ROW.
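The rules above can be sketched as a minimal Python model. This is only an illustration of the described semantics, not Velox's actual C++ implementation; the helper names (`convert`, `from_json_row`) and the type-name strings are hypothetical.

```python
import json
import math

# Special string spellings accepted for float/double columns (per the rules above).
_SPECIAL_DOUBLES = {
    "NaN": math.nan,
    "+INF": math.inf, "-INF": -math.inf, "INF": math.inf,
    "+Infinity": math.inf, "-Infinity": -math.inf, "Infinity": math.inf,
}

def convert(value, target):
    """Coerce a parsed JSON value to `target`; None models SQL NULL on mismatch."""
    if target == "boolean":
        # Only JSON true/false are valid; numbers and strings yield NULL.
        return value if isinstance(value, bool) else None
    if target == "bigint":
        # Only integral values are valid; floats and strings yield NULL.
        return value if isinstance(value, int) and not isinstance(value, bool) else None
    if target == "double":
        # Any numeric value is valid; of strings, only the special spellings.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
        if isinstance(value, str):
            return _SPECIAL_DOUBLES.get(value)
        return None
    if target == "varchar":
        return value if isinstance(value, str) else None
    return None

def from_json_row(text, schema):
    """Parse `text` against a ROW schema given as {field: type}.
    Malformed input, or a JSON array where a ROW is expected, yields a row of NULLs."""
    try:
        obj = json.loads(text)
    except ValueError:
        obj = None
    if not isinstance(obj, dict):
        obj = None
    if obj is None:
        return {k: None for k in schema}
    return {k: convert(obj.get(k), t) for k, t in schema.items()}
```

For example, `convert(3, "bigint")` gives 3 while `convert(3.0, "bigint")` gives None, and the malformed `{"a 1}` produces a row of NULLs rather than an error.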

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 2, 2024
netlify bot commented Dec 2, 2024:

Deploy Preview for meta-velox canceled.

🔨 Latest commit: 48024c2
🔍 Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/677bf3d2dad88a0008c48867

@zhli1142015 (Contributor, Author) commented:

cc @rui-mo and @PHILO-HE , thanks.

@rui-mo (Collaborator) left a comment:

Thanks. Added some initial comments.

Review comments on velox/functions/sparksql/specialforms/CMakeLists.txt and velox/functions/sparksql/specialforms/FromJson.cpp (outdated/resolved).
@jinchengchenghh (Contributor) left a comment:

The JSON input is highly variable; how can we make sure the whole implementation matches Spark? Maybe we need to survey from_json usage in Spark and verify that the results are correct.

Review comments on velox/docs/functions/spark/json.rst and velox/functions/sparksql/specialforms/FromJson.cpp.
@zhli1142015 (Contributor, Author) commented:

The JSON input is highly variable; how can we make sure the whole implementation matches Spark? Maybe we need to survey from_json usage in Spark and verify that the results are correct.

The current implementation supports only Spark's default behavior, and we should fall back to Spark's implementation when specific unsupported cases arise. These include: user-provided options are non-empty; the schema contains unsupported types; the schema includes a column whose name matches spark.sql.columnNameOfCorruptRecord; or the configuration spark.sql.json.enablePartialResults is disabled.

The only existing unit tests in Spark related to this function are found in JsonExpressionsSuite and JsonFunctionsSuite. I have verified that these tests pass and added missing tests to ensure the current implementation aligns with Spark's behavior. For further details, please refer to the new unit tests included in this PR.
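The fallback conditions described above can be expressed as a small predicate. This is a hypothetical sketch: the function name, parameters, and supported-type list are illustrative, not Gluten's actual API (though "_corrupt_record" is Spark's default for spark.sql.columnNameOfCorruptRecord).

```python
# Illustrative set of schema types the Velox implementation could handle.
SUPPORTED_TYPES = {"boolean", "tinyint", "smallint", "integer", "bigint",
                   "float", "double", "varchar", "array", "map", "row"}

def can_offload_from_json(options, schema_types, schema_field_names,
                          corrupt_record_col="_corrupt_record",
                          enable_partial_results=True):
    """Return True only when Velox's from_json can match Spark's default behavior."""
    if options:                        # any user-provided option forces a fallback
        return False
    if not enable_partial_results:     # spark.sql.json.enablePartialResults must be on
        return False
    if corrupt_record_col in schema_field_names:
        return False                   # corrupt-record columns are not supported
    # every type appearing in the schema must be supported
    return all(t in SUPPORTED_TYPES for t in schema_types)
```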

@zhli1142015 force-pushed the add_from_json branch 2 times, most recently from a284e49 to 2762885 on December 10, 2024 at 07:57
@rui-mo (Collaborator) left a comment:

Thanks.

Review comments on velox/docs/functions/spark/json.rst and velox/functions/sparksql/specialforms/FromJson.cpp (outdated/resolved).
@rui-mo changed the title from "feat: Add from_json Spark function" to "feat: Add Spark from_json function" on Dec 10, 2024
@zhli1142015 zhli1142015 requested a review from rui-mo December 11, 2024 03:39
@rui-mo (Collaborator) left a comment:

Thanks for the update! Added some comments.

Review comments on velox/docs/functions/spark/json.rst and velox/functions/sparksql/specialforms/FromJson.cpp (outdated/resolved).
@zhli1142015 zhli1142015 requested a review from rui-mo December 12, 2024 10:56
@zhli1142015 force-pushed the add_from_json branch 3 times, most recently from c3696df to d5d801b on December 17, 2024 at 04:22
@rui-mo (Collaborator) left a comment:

Thanks for iterating!

Review comments on velox/docs/functions/spark/json.rst and velox/functions/sparksql/specialforms/FromJson.cpp (outdated/resolved).
@PHILO-HE (Contributor) left a comment:

Looks basically good.
Are nested complex types supported, e.g., an array element that is itself an array, struct, or map? It would be better to clarify this in the documentation and add some tests if they are lacking. Thanks!

Review comment on velox/docs/functions/spark/json.rst (outdated/resolved).
@zhli1142015 (Contributor, Author) commented:

The unit-test failure is not related to this PR:

E20241219 05:35:37.812153 14341 Exceptions.h:66] Line: /__w/velox/velox/velox/connectors/hive/storage_adapters/hdfs/tests/HdfsMiniCluster.cpp:92, Function:addFile, Expression:  Failed to add file to hdfs, exit code: 383, Source: RUNTIME, ErrorCode: INVALID_STATE
unknown file: Failure
C++ exception with description "Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Failed to add file to hdfs, exit code: 383
Retriable: False
Function: addFile
File: /__w/velox/velox/velox/connectors/hive/storage_adapters/hdfs/tests/HdfsMiniCluster.cpp
Line: 92
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox11filesystems4test15HdfsMiniCluster7addFileENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_
# 4  _ZN18HdfsFileSystemTest14SetUpTestSuiteEv
# 5  _ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_9TestSuiteEvEET0_PT_MS4_FS3_vEPKc
# 6  _ZN7testing9TestSuite3RunEv
# 7  _ZN7testing8internal12UnitTestImpl11RunAllTestsEv
# 8  _ZN7testing8UnitTest3RunEv
# 9  main
# 10 __libc_start_call_main
# 11 __libc_start_main
# 12 _start
" thrown in SetUpTestSuite().

@ayushi-agarwal commented Dec 20, 2024:

@zhli1142015 I was trying this patch on a workload with this filter condition:
Arguments: ((size(from_json(ArrayType(StringType,true), item_array#7302, Some(Etc/UTC)), true) > 0) AND isnotnull(from_json(ArrayType(StringType,true), item_array#7302, Some(Etc/UTC))))

This filter produces 0 rows when applied, whereas vanilla Spark produces 4 million rows.
Could you suggest what the issue might be?

I tried on a smaller dataset:

// Sample data
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("item_array", StringType, nullable = true)
))
val data = Seq(
  (1, """["item1", "item2"]"""),
  (2, null),
  (3, """[]"""),
  (4, """["item3", null, "item4"]"""),
  (5, null),
  (6, """["item5"]""")
)
val query = """
  SELECT id, from_json(item_array, 'array<string>')
  FROM items_parquet_no_implicits
  WHERE size(from_json(item_array, 'array<string>')) > 0
    AND isnotnull(from_json(item_array, 'array<string>'))
"""

But this gives the correct result with offloading.

@zhli1142015 (Contributor, Author) commented Dec 20, 2024:

@zhli1142015 I was trying this patch on a workload with this filter condition: Arguments: ((size(from_json(ArrayType(StringType,true), item_array#7302, Some(Etc/UTC)), true) > 0) AND isnotnull(from_json(ArrayType(StringType,true), item_array#7302, Some(Etc/UTC))))

This filter produces 0 rows when applied, whereas vanilla Spark produces 4 million rows. Could you suggest what the issue might be, and does it need handling in Gluten so that such expressions are not offloaded to Velox?

This patch currently supports the function only with default settings; for other cases, fallback handling is required in Gluten. Additionally, there is one known limitation: single quotes are not supported.
Please check whether this is related. If not, could you provide a simple reproduction case? I'd be happy to investigate further.
Thanks.
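The single-quote limitation can be reproduced with any strict JSON parser; Python's json module is used below purely as a stand-in for the parser Velox uses, to show how a single-quoted input turns into NULL.

```python
import json

def parse_or_null(text):
    """Strict JSON parsing: any syntax error (e.g. single quotes) yields NULL."""
    try:
        return json.loads(text)
    except ValueError:
        return None

print(parse_or_null('["item1", "item2"]'))  # ['item1', 'item2']
print(parse_or_null("['item5']"))           # None: single quotes are not valid JSON
```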

@ayushi-agarwal commented Dec 20, 2024:

@zhli1142015 I was trying this patch on a workload with this filter condition: Arguments: ((size(from_json(ArrayType(StringType,true), item_array#7302, Some(Etc/UTC)), true) > 0) AND isnotnull(from_json(ArrayType(StringType,true), item_array#7302, Some(Etc/UTC))))
This filter produces 0 rows when applied, whereas vanilla Spark produces 4 million rows. Could you suggest what the issue might be, and does it need handling in Gluten so that such expressions are not offloaded to Velox?

This patch currently supports the function only with default settings; for other cases, fallback handling is required in Gluten. Additionally, there is one known limitation: single quotes are not supported. Please check whether this is related. If not, could you provide a simple reproduction case? I'd be happy to investigate further. Thanks.

In the data above, I replaced the array items with single quotes for the row containing item5, and that row was not selected by the filter. But when I remove the filter and just run a select, it gives correct results. Will this single-quote issue occur only with a filter? Also, could you elaborate on what counts as default settings?

val query = """
  SELECT id, from_json(item_array, 'array<string>')
  FROM items_parquet_no_implicits
"""
// Sample data
val data = Seq(
  (1, """["item1", "item2"]"""),
  (2, null),
  (3, """[]"""),
  (4, """["item3", null, "item4"]"""),
  (5, null),
  (6, """['item5']""")
)

Output:
+---+---------------------+
| id|from_json(item_array)|
+---+---------------------+
|  4| [item3, null, item4]|
|  1|       [item1, item2]|
|  6|              [item5]|
|  3|                   []|
|  2|                 null|
|  5|                 null|
+---+---------------------+

@zhli1142015 (Contributor, Author) commented Dec 20, 2024:

The single-quote limitation comes from the JSON parser used in Velox: the parser reports an error when it encounters single quotes, so NULL is generated.
From the output above, I think you used show() to get the results, which triggers a fallback with cast. If you use collect(), it should produce consistent results:
[6,null]

By the way, the JSON standard requires double quotes and does not accept single quotes, so most parsers reject them. Is single-quote support required in your cases?
cc @PHILO-HE

@ayushi-agarwal commented:

Thank you @zhli1142015 for the clarification and the pointer. I was using show() earlier; with collect(), the results are consistent. I will get the actual data from the team and check with them whether it can be modified.

@zhli1142015 force-pushed the add_from_json branch 2 times, most recently from 43490eb to d6d3687 on December 24, 2024 at 02:30
@PHILO-HE (Contributor) commented:

@zhli1142015, could you create a Gluten PR to get Gluten CI's feedback in advance? That way we can discover unsupported cases from the Spark UTs. It is fine with me if some unsupported cases fall back to Spark, or we can simply document the unsupported cases in the Gluten docs if they are not easy to fix in this PR.

It requires changing the Velox branch to one with your patch applied; the change can be reverted once this PR is merged.
https://github.com/apache/incubator-gluten/blob/main/ep/build-velox/src/get_velox.sh#L20

@zhouyuan (Contributor) commented:

@zhli1142015, could you create a Gluten PR to get Gluten CI's feedback in advance? That way we can discover unsupported cases from the Spark UTs. It is fine with me if some unsupported cases fall back to Spark, or we can simply document them in the Gluten docs if they are not easy to fix in this PR.
It requires changing the Velox branch to one with your patch applied; the change can be reverted once this PR is merged. https://github.com/apache/incubator-gluten/blob/main/ep/build-velox/src/get_velox.sh#L20

I made a quick PR to test this:
apache/incubator-gluten#8318

@zhli1142015 (Contributor, Author) commented:

Thanks @zhouyuan.
Let me create one; there are certain cases where we need to fall back.

@ayushi-agarwal commented Dec 24, 2024:

@zhli1142015 Got the data from the team. This is the case where it was giving wrong results for row number 6:

val data = Seq(
  (6, """[{"holidayTag":"1"},{"holidayTag":"2"}]""")
)
spark.sql("select from_json(item_array, 'array<string>') AS itemArray from newT") // newT is a temp view created from the dataframe built from this data.

Do we plan to support this, or will it fall back for now?

@zhli1142015 (Contributor, Author) commented:

@zhli1142015 Got the data from the team. This is the case where it was giving wrong results for row number 6: val data = Seq((6, """[{"holidayTag":"1"},{"holidayTag":"2"}]""")) spark.sql("select from_json(item_array, 'array<string>') AS itemArray from newT") // newT is a temp view created from this data. Do we plan to support this, or will it fall back for now?

Thanks, this is actually a missed case; I have updated the logic to address it.
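Assuming the fix mirrors Spark's behavior of keeping the raw JSON text when a string-typed element encounters a non-string value (an assumption; the thread does not spell out the fix), the reported case can be sketched in Python as follows — the function names are illustrative only:

```python
import json

def element_as_string(value):
    """Render a non-string JSON value back to its raw JSON text for a string slot.
    (Assumed behavior: objects/numbers under an array<string> schema keep their JSON text.)"""
    return value if isinstance(value, str) else json.dumps(value, separators=(",", ":"))

def from_json_string_array(text):
    """Parse `text` against an array<string> schema; non-array input yields NULL."""
    try:
        arr = json.loads(text)
    except ValueError:
        return None
    if not isinstance(arr, list):
        return None
    # null elements stay null; other values are converted per element_as_string
    return [None if v is None else element_as_string(v) for v in arr]
```

With this model, the row-6 input `[{"holidayTag":"1"},{"holidayTag":"2"}]` yields an array of two JSON-text strings rather than NULLs.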

@ayushi-agarwal commented:

@zhli1142015 Got the data from the team. This is the case where it was giving wrong results for row number 6: val data = Seq((6, """[{"holidayTag":"1"},{"holidayTag":"2"}]""")) spark.sql("select from_json(item_array, 'array<string>') AS itemArray from newT") // newT is a temp view created from this data. Do we plan to support this, or will it fall back for now?

Thanks, this is actually a missed case; I have updated the logic to address it.

Thanks for the quick response and the update.

@rui-mo (Collaborator) left a comment:

Hi @zhli1142015, could you help document the limitations of the current implementation, as mentioned in #11709 (comment) and #11709 (comment)?

@zhli1142015 zhli1142015 requested a review from rui-mo December 27, 2024 08:05
@zhli1142015 (Contributor, Author) commented:

Hi @zhli1142015, could you help document the limitations of the current implementation, as mentioned in #11709 (comment) and #11709 (comment)?

Updated, thanks.

@zhli1142015 (Contributor, Author) commented:

Kind ping, @rui-mo and @PHILO-HE: do you have any more comments? Thanks.
