feat: Add Spark from_json function #11709
Conversation
Thanks. Added some initial comments.
The JSON input is arbitrary, so how can we make sure the implementation fully matches Spark? Maybe we need to search for from_json usages in Spark and make sure the results are correct.
The current implementation supports only Spark's default behavior, and we should fall back to Spark's implementation when specific unsupported cases arise. These include situations where user-provided options are non-empty, schemas contain unsupported types, or schemas include a column with the same name as … The only existing unit tests in Spark related to this function are found in …
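For illustration, a minimal spark-shell sketch of that split, using hypothetical data; allowComments is a real Spark JSON option, chosen arbitrarily here as an example of a non-empty user option:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val df = Seq("""{"a": 1}""").toDF("json")
val schema = StructType(Seq(StructField("a", IntegerType)))

// Default behavior: no user options, supported schema; eligible for the
// native implementation.
df.select(from_json($"json", schema)).show()

// Non-empty user options: one of the cases that should fall back to Spark.
df.select(from_json($"json", schema, Map("allowComments" -> "true"))).show()
```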
Thanks.
Thanks for the update! Added some comments.
Thanks for iterating!
Looks basically good.
Are nested complex types supported? E.g., an array element that is itself an array, struct, or map. It would be better to clarify this in the documentation and add tests if they are missing. Thanks!
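For reference, a minimal spark-shell sketch (hypothetical data) that probes the nested cases asked about above: an array whose elements are structs, and a map whose values are arrays:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.from_json

val df = Seq("""{"rows": [{"a": 1}, {"a": 2}], "m": {"k": [1, 2, 3]}}""").toDF("json")

// Nested complex types in the target schema: ARRAY<STRUCT<...>> and
// MAP<STRING, ARRAY<INT>>.
df.select(from_json($"json",
  "rows ARRAY<STRUCT<a: INT>>, m MAP<STRING, ARRAY<INT>>",
  Map.empty[String, String])).show(false)
```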
The UT failure is not related to this PR.
@zhli1142015 I was trying this patch on one of the workloads where there is this filter condition: … When applied, this filter produces 0 rows, whereas vanilla Spark produces 4 million rows. I tried it on a smaller dataset: …
This patch currently supports the function only with default settings; other cases require fallback handling in Gluten. Additionally, there is one known limitation: single quotes are not supported.
In the above data, I replaced the array items with single-quoted values for the row containing item5, and that row was not selected by the filter. But when I remove the filter and just run a select, it gives correct results. Will this single-quote issue show up only with a filter? Also, could you elaborate on what counts as default settings? Output: …
The single-quote limitation comes from the JSON parser used in Velox. BTW, the JSON standard requires double quotes and does not accept single quotes, so most parsers don't support them. Is this required in your cases?
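A minimal spark-shell sketch of the difference (hypothetical data): vanilla Spark's Jackson-based parser accepts single quotes by default (allowSingleQuotes is true), while, per the limitation described above, the Velox parser would be expected to yield NULL for the same input:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.from_json

// Single-quoted JSON: vanilla Spark parses this as {1}; a strict,
// standard-compliant parser (as in Velox) rejects it, producing NULL.
val df = Seq("{'a': 1}").toDF("json")
df.select(from_json($"json", "a INT", Map.empty[String, String])).show()
```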
Thank you @zhli1142015 for the clarification and pointer. I was using show() earlier; with collect(), the results are consistent. I will get the actual data from the team and check with them whether it can be modified.
@zhli1142015, could you create a Gluten PR to get Gluten CI's feedback in advance? That way, we can discover unsupported cases from the Spark UTs. It's OK with me if we make some unsupported cases fall back to Spark, or just document the unsupported cases in the Gluten docs if they are not easy to fix in this PR. It requires changing the Velox branch to one with your patch applied, and the change can be reverted once this PR is merged.
I made one quick PR to test this: …
Thanks @zhouyuan.
@zhli1142015 Got the data from the team; this is the case where it was giving wrong results for row number 6: …
Thanks, this was actually a missed case; I updated the logic to address it.
Thanks for the quick response and for updating it.
Hi @zhli1142015, could you help document the limitations of the current implementation as mentioned in #11709 (comment) and #11709 (comment)?
Updated, thanks.
Why I need to reimplement the JSON parsing logic instead of using CAST(JSON) (each behavior is illustrated in the sketch after this list):

Failure Handling: on failure, from_json returns NULL. For instance, parsing {"a 1} results in {NULL}.

Root Type Restrictions: only ROW, ARRAY, and MAP types are allowed as root types.

Boolean Handling: only true and false are valid boolean values; numeric values or strings result in NULL.

Integral Type Handling: only integral values are valid for integral types; floating-point values and strings produce NULL.

Float/Double Handling: all numeric values are valid for float/double types, but for strings only specific values such as "NaN" or "INF" are valid.

Array Handling: Spark allows a JSON object as input for an array schema only if the array is the root type and its element type is a ROW.

Map Handling: keys in a MAP can only be of VARCHAR type. For example, parsing {"3": 3} results in {"3": 3} rather than {3: 3}.

Row Handling: Spark supports partial output mode, but it does not allow an input JSON array when parsing a ROW.
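A minimal spark-shell sketch (hypothetical inputs) of several of the behaviors above; the commented results are what vanilla Spark produces and what the reimplementation aims to match:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.from_json

// Failure handling: malformed input yields {NULL}.
Seq("""{"a 1}""").toDF("j")
  .select(from_json($"j", "a INT", Map.empty[String, String])).show()

// Boolean handling: numeric 1 is not a valid boolean, so the field is NULL.
Seq("""{"b": 1}""").toDF("j")
  .select(from_json($"j", "b BOOLEAN", Map.empty[String, String])).show()

// Map handling: keys are always read as strings, so {"3": 3} parses as
// {"3" -> 3}.
Seq("""{"3": 3}""").toDF("j")
  .select(from_json($"j", "MAP<STRING, INT>", Map.empty[String, String])).show()
```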