fix: avro_to_arrow: Handle avro nested nullable struct (union) #7663
@Samrose-Ahmed -- is it possible to add a test to ensure this functionality is not broken in the future?
(Thank you for the contribution, BTW)
Of course I can add a test.
ff8bc73
to
8ecb471
Compare
I have added a test to verify this behavior.
I don't think the CI failures are related to your work (they are fixed in #7701).
Looks good to me -- thank you @Samrose-Ahmed
I took the liberty of merging up from main to get the CI to pass cleanly.
@Samrose-Ahmed Thank you for catching this issue! I should have fixed this in #7525.
r.put("col1", AvroValue::Union(0, Box::new(AvroValue::Null)));

let mut w = apache_avro::Writer::new(&schema, vec![]);
w.append(r).unwrap();
Would it be better to add one more record with a non-null `col1`,
to ensure the reader works in that case too?
@alamb @Samrose-Ahmed
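To illustrate why both cases matter, here is a minimal sketch of reading a string child out of a nullable record column, covering one null row and one non-null row. The `Value` enum, `unwrap_nullable`, and `read_child` are hypothetical stand-ins written for this example; they are not the `apache_avro` types or the code in this PR.

```rust
// Hypothetical stand-in for an Avro value (NOT the apache_avro type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Str(String),
    Record(Vec<(String, Value)>),
    // (index of the chosen branch in the union schema, wrapped value)
    Union(u32, Box<Value>),
}

/// Strip any union wrapper: `None` for a null, `Some(inner)` otherwise.
fn unwrap_nullable(v: &Value) -> Option<&Value> {
    match v {
        Value::Null => None,
        Value::Union(_, inner) => unwrap_nullable(inner),
        other => Some(other),
    }
}

/// Read the string child named `child` from each row of a nullable record
/// column. A null row (or a missing/non-string child) yields `None`.
fn read_child(rows: &[Value], child: &str) -> Vec<Option<String>> {
    rows.iter()
        .map(|row| match unwrap_nullable(row) {
            Some(Value::Record(fields)) => fields.iter().find_map(|(name, v)| {
                match (name.as_str() == child, unwrap_nullable(v)) {
                    (true, Some(Value::Str(s))) => Some(s.clone()),
                    _ => None,
                }
            }),
            _ => None,
        })
        .collect()
}
```

A test that only writes `Union(0, Null)` exercises the first match arm of `unwrap_nullable` but never the record branch, which is the gap the extra non-null record would close.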
Sure, I can do that.
@Samrose-Ahmed Or, do you want to change the test data too?
I would appreciate it if you could do it.
@Samrose-Ahmed
I've opened a PR to update the test data.
This PR proposes to update `nested_records.avro` to support nullable records. This change is necessary for [this PR](apache/datafusion#7663).

This change appends new fields `f3` and `f4` to the existing schema (the last two fields). `f3` is a nullable record, and `f4` is an array of nullable records.

```
{
  "name": "record1",
  "namespace": "ns1",
  "type": "record",
  "fields": [
    {
      "name": "f1",
      "type": {
        "name": "record2",
        "namespace": "ns2",
        "type": "record",
        "fields": [
          { "name": "f1_1", "type": "string" },
          { "name": "f1_2", "type": "int" },
          {
            "name": "f1_3",
            "type": {
              "name": "record3",
              "namespace": "ns3",
              "type": "record",
              "fields": [
                { "name": "f1_3_1", "type": "double" }
              ]
            }
          }
        ]
      }
    },
    {
      "name": "f2",
      "type": "array",
      "items": {
        "name": "record4",
        "namespace": "ns4",
        "type": "record",
        "fields": [
          { "name": "f2_1", "type": "boolean" },
          { "name": "f2_2", "type": "float" }
        ]
      }
    },
    {
      "name": "f3",
      "type": [
        "null",
        {
          "name": "record5",
          "namespace": "ns5",
          "type": "record",
          "fields": [
            { "name": "f3_1", "type": "string" }
          ]
        }
      ],
      "default": null
    },
    {
      "name": "f4",
      "type": "array",
      "items": [
        "null",
        {
          "name": "record6",
          "namespace": "ns6",
          "type": "record",
          "fields": [
            { "name": "f4_1", "type": "long" }
          ]
        }
      ]
    }
  ]
}
```

And the data represented in JSON is as follows.

```
{"f1":{"f1_1":"aaa","f1_2":10,"f1_3":{"f1_3_1":3.14}},"f2":[{"f2_1":true,"f2_2":1.2000000476837158},{"f2_1":true,"f2_2":2.200000047683716}],"f3":{"f3_1":"xyz"},"f4":[{"f4_1":200},null]}
{"f1":{"f1_1":"bbb","f1_2":20,"f1_3":{"f1_3_1":3.14}},"f2":[{"f2_1":false,"f2_2":10.199999809265137}],"f3":null,"f4":[null,{"f4_1":300}]}
```
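As a rough illustration of what a reader has to derive from these two rows, the sketch below models only the new `f3` and `f4` columns and computes the struct validity flags and list offsets that a columnar layout would need. The `Row` struct and the helper functions are hypothetical names for this example and do not use the arrow-rs or DataFusion APIs.

```rust
/// One row of the two new columns (hypothetical model, not an arrow-rs type).
struct Row {
    /// `f3.f3_1`, or `None` when the `f3` record itself is null.
    f3: Option<String>,
    /// `f4[i].f4_1`, or `None` when the i-th array element is null.
    f4: Vec<Option<i64>>,
}

/// The two records from the JSON dump, restricted to `f3` and `f4`.
fn sample_rows() -> Vec<Row> {
    vec![
        Row { f3: Some("xyz".to_string()), f4: vec![Some(200), None] },
        Row { f3: None, f4: vec![None, Some(300)] },
    ]
}

/// Validity of the `f3` struct column: one flag per row.
fn f3_validity(rows: &[Row]) -> Vec<bool> {
    rows.iter().map(|r| r.f3.is_some()).collect()
}

/// List offsets for `f4` plus the validity of its flattened child records.
fn f4_layout(rows: &[Row]) -> (Vec<i32>, Vec<bool>) {
    let mut offsets = vec![0i32];
    let mut validity = Vec::new();
    for r in rows {
        validity.extend(r.f4.iter().map(|e| e.is_some()));
        // Each offset marks where the next row's elements start.
        offsets.push(validity.len() as i32);
    }
    (offsets, validity)
}
```

With the sample data, the null sits in a different position in each row of `f4`, so a reader that mishandles the union wrapper in either position would produce a visibly wrong validity vector.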
apache/arrow-testing#94 is merged.
48a6b04
to
268598d
Compare
I've updated the PR. I've also fixed a correctness issue that this PR needs in the mapping between the Avro and Arrow schemas: the Avro Record name should not be used in Arrow (the code that was doing
Actually, this code is a bit weird. I found more issues with nested types... I will have to fix them.
It might make sense to get this PR in as it improves things, even if it doesn't fix everything completely, and then work on follow-ups. Also, note that I think @tustvold is planning to work on adding upstream support in arrow-rs (apache/arrow-rs#4886), which might automatically handle nested types, though that won't land in datafusion for a week or two.
Oh, that's great. Let me make another revision and see if there's some improvement we can get in.
268598d
to
ce6d0a2
Compare
I have made another revision. I had to edit some of the schema lookup logic to handle nested fields, but I didn't want to change the code too much.
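One way such a nested lookup can work is to key each field by its dotted path from the root, so that children of different records with the same name do not collide. The sketch below is an assumption about the general technique, not the code in this PR: `Schema`, `build_lookup`, and `collect` are hypothetical names, and the dotted-path key format is chosen for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal schema tree: a leaf type or a record with named fields.
enum Schema {
    Leaf,
    Record(Vec<(String, Schema)>),
}

/// Map each field's dotted path (e.g. "f1.f1_3.f1_3_1") to its position
/// within its parent record.
fn build_lookup(schema: &Schema) -> BTreeMap<String, usize> {
    let mut out = BTreeMap::new();
    collect(schema, None, &mut out);
    out
}

fn collect(schema: &Schema, parent: Option<&str>, out: &mut BTreeMap<String, usize>) {
    if let Schema::Record(fields) = schema {
        for (pos, (name, child)) in fields.iter().enumerate() {
            // Top-level fields keep their bare name; nested ones get a prefix.
            let path = match parent {
                Some(p) => format!("{}.{}", p, name),
                None => name.clone(),
            };
            out.insert(path.clone(), pos);
            // Recurse so grandchildren are keyed under the full path.
            collect(child, Some(path.as_str()), out);
        }
    }
}
```

Keying by path rather than by the Avro record name also means the lookup depends only on field names, which is consistent with the earlier remark that the record name should not leak into the Arrow schema.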
ce88ea4
to
cf34a25
Compare
This change looks good to me except one minor comment!
Oh, wait. Clippy raises some warnings.
@Samrose-Ahmed Could you fix them? Also, you can check the style locally by running
cf34a25
to
3afc8ef
Compare
@alamb Could you trigger the GA workflows?
Corrects handling of a nullable struct union. Signed-off-by: 🐼 Samrose Ahmed 🐼 <[email protected]>
3afc8ef
to
9269ae6
Compare
Thanks @Samrose-Ahmed and @sarutak
BTW @tustvold is working on porting / implementing this upstream: apache/arrow-rs#4888
@alamb Thank you for letting me know. I'll check it out.
In the interim, I think this PR makes DataFusion demonstrably better and increases test coverage, so I think we should merge it in. Thanks again to everyone who was involved.
@alamb
Absolutely -- I view what @tustvold is doing as complementary (even if somewhat duplicative). I would expect that DataFusion will continue to support avro for reading / processing, and that all end to end tests (e.g. Does that make sense @sarutak?
@alamb
…e#7663) Corrects handling of a nullable struct union. Signed-off-by: 🐼 Samrose Ahmed 🐼 <[email protected]>
Corrects handling of a nullable struct union.
Which issue does this PR close?
Closes #7662.
Rationale for this change
Bug fix
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?