Skip to content

Avro codec enhancements #6965

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 47 commits into
base: main
Choose a base branch
from
Open

Avro codec enhancements #6965

wants to merge 47 commits into from

Conversation

jecsand838
Copy link

@jecsand838 jecsand838 commented Jan 10, 2025

Which issue does this PR close?

Part of #4886

Rationale for this change

The primary objective of this PR is to enhance the arrow-rs Avro implementation by introducing full support for Avro data types, support for Avro aliases, and implementing a public facing experimental Avro Reader. These enhancements are crucial for several reasons:

1. Enhanced Data Interoperability:
By supporting these additional types, the Avro reader becomes more compatible with a wider range of Avro schemas. This ensures that users can ingest and process diverse datasets without encountering type-related limitations.

What changes are included in this PR?

Avro Codec + RecordDecoder

  1. Support for new types:
    • Lists
    • Fixed
    • Interval
    • Decimal
    • Map
    • Enum
    • UUID
  2. Extended namespace support to all supported types
  3. Added Alias support
  4. Extended support for local timestamps
  5. Added Unit tests
  6. Expanded nullability support
  7. Added module test cases with .avro files

Are there any user-facing changes?

  • N/A

jecsand838 and others added 17 commits December 28, 2024 12:44
1. Namespaces
2. Enums
3. Maps
4. Decimals
Signed-off-by: Connor Sanders <[email protected]>
* Implemented reader decoder for Avro Lists
* Cleaned up reader/record.rs and added comments for readability.

Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
- Fixed
- Interval

Signed-off-by: Connor Sanders <[email protected]>
Added Avro codec + decoder support for new types
Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 10, 2025
@tustvold
Copy link
Contributor

tustvold commented Jan 11, 2025

👋 thank you for working on this, it is very exciting to see arrow-avro getting some love. I think what would probably help is to break this up into smaller pieces that can be delivered separately. Whilst I accept this is more work for you, it makes reviewing the code much more practical, especially given our relatively limited review bandwidth. Many of the bullets in your PR description would warrant a separate PR IMO.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jecsand838 I triggered CI for this PR

I haven't had a chance to review the code yet, but I did start to look at the test coverage. What would you think about adding tests using the existing avro testing data in https://github.com/apache/arrow-testing/tree/master/data/avro (already a submodule in this repo)

Key tests in my mind would be:

  1. Read the avro testing files and verify the schema and data read (and leave comments for tests that don't pass)
  2. For the writer implement round trip tests: create a RecordBatch (es), write it to an .avro file and then read it back in and ensure the round tripped batches are equal

You might be able to find code / testing that you can reuse in the datafusion copy: https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs

I also wonder how/if this code is related to the avro rust reader/decoder in https://github.com/apache/avro-rs?

FYI @Jefffrey

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some review of this PR, I think I could likely find some additional time to review / help it along if/when it has end to end tests of reading existing avro files (that would give me confidence that the code being reviewed did the right thing functionally).

To make a specific proposal for splitting up this PRs' functionality as suggested by @tustvold -- #6965 (comment), one way to do so would be:

  1. First PR: Reader improvements / tests showing reading existing .avro files
  2. Second PR: Minimal Writer support (with tests showing round tripping for a few basic data types)
  3. Subsequent PRs: Additional PRs to support writing additional data types

The rationale for breaking it up this way would be to have better confidence in the read path to ensure round tripping also works as needed

Let me know what you think

And thanks again

@jecsand838
Copy link
Author

@alamb @tustvold Thank you both so much for getting to our PR so quickly! We'd be more than happy to break this PR up as advised and add those additional tests. Your recommendation makes a lot of sense.

I also wonder how/if this code is related to the avro rust reader/decoder in https://github.com/apache/avro-rs?

I'm sure there's functional overlap we can look into. We just attempted to extend the patterns @tustvold put in place. Definitely interested in hearing your thoughts on this however.

@alamb
Copy link
Contributor

alamb commented Jan 14, 2025

@alamb @tustvold Thank you both so much for getting to our PR so quickly! We'd be more than happy to break this PR up as advised and add those additional tests. Your recommendation makes a lot of sense.

I also wonder how/if this code is related to the avro rust reader/decoder in https://github.com/apache/avro-rs?

I'm sure there's functional overlap we can look into. We just attempted to extend the patterns @tustvold put in place. Definitely interested in hearing your thoughts on this however.

Sounds good . Ideally it would be great to avoid duplication, but if the existing avro-rs crate is row oriented, the amount of reusable code might be small as this decoder will need to be row oriented

Thanks for the changes -- I just started the CI checks and took a quick look through this PR. I think the only thing left now is to add some tests that read from the existing .avro files in arrow-testing -- then I imagine it will be ready for a more thorough review.

Thank you for working on this.

@svencowart
Copy link

@alamb, we're improving the reader and providing more Avro tests.

As I was working on this, I noticed the module signature for arrow's parquet reader and datafusion's arrow_array_reader.rs are based on a Builder pattern. To simplify adopting these arrow-avro changes into DataFusion, should we implement the module signature to something similar to what is already in DataFusion today? Or are you looking for something else? Also, do you prefer whether the public signature for the reader should be included in this PR or a PR following this PR that proves the reader changes? I'm happy to contribute the public module signature.

@jecsand838
Copy link
Author

Thank you for this, I wonder if there is some way we might break this up into smaller pieces. A single 5000 line diff is not something I can realistically review...

@tustvold I completely understand, my apologies about that!

Roughly 3500 lines of the diff are tests and ~1000 lines of that 3500 is test code in reader/mod.rs.

If I removed all the tests we added save for the ones in reader/mod.rs, I could probably get this diff down to ~2500 lines (with ~1000 lines of that still tests). I know that is still large, but would it be acceptable? We can definitely add those tests back in a future PR.

@tustvold
Copy link
Contributor

It isn't just the sheer size of the PR but also the sheer number of things it is doing. I think this needs to be split up into smaller focused PRs to make reviewing it practical.

@jecsand838
Copy link
Author

@tustvold @alamb I have removed all changes and functionality not related to Avro reader type support and .avro file tests. Based on the plan outlined above, this should strictly align with the requirements for PR 1.

To make a specific proposal for splitting up this PRs' functionality as suggested by @tustvold -- #6965 (comment), one way to do so would be:

First PR: Reader improvements / tests showing reading existing .avro files
Second PR: Minimal Writer support (with tests showing round tripping for a few basic data types)
Subsequent PRs: Additional PRs to support writing additional data types
The rationale for breaking it up this way would be to have better confidence in the read path to ensure round tripping also works as needed

The code remaining in this PR successfully reads all .avro test files in the repository, including the Impala files. If this approach is acceptable, I can create a follow-up PR for the Avro Reader, ReaderBuilder, etc., once this is merged.

If I further split the changes, the .avro file tests will no longer all pass. If that level of granularity is required, could you provide guidance on how to break down the PRs by .avro file?

CC: @svencowart

@jecsand838 jecsand838 requested a review from alamb February 15, 2025 23:40
@jecsand838 jecsand838 changed the title Avro codec enhancements + Avro Reader Avro codec enhancements Feb 15, 2025
@alamb
Copy link
Contributor

alamb commented Feb 16, 2025

Thanks @jecsand838 -- I am out on vacation this week so I may not have time to review this PR for a while. I have put it on my list however

@jecsand838
Copy link
Author

I am out on vacation this week so I may not have time to review this PR for a while. I have put it on my list however

@alamb Completely understand!

Also I pushed a fix for that lint error. All of the clippy tests should be passing now.

@klion26
Copy link
Member

klion26 commented Mar 7, 2025

Thanks for the contribution.

I have some question after apply this patch when using the following code

    #[test]
    fn test_contet() {
        let file = "/path/to/56f216f3-7e24-40b0-a76a-87a63a5bc254-m0.avro";
        let rb: RecordBatch = read_file(file, 1024, false);
        println!("{:?}", rb);
    }

encounter the exception, not sure is it the check too strict?

called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Inconsistent struct child length for field #3. Expected 1, got 0")
thread 'reader::test::test_contet' panicked at arrow-avro/src/reader/mod.rs:125:25:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Inconsistent struct child length for field #3. Expected 1, got 0")
stack backtrace:
   0: rust_begin_unwind
             at /rustc/ee612c45f00391aff71ec0c52b7fc08fae18c711/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_fmt
             at /rustc/ee612c45f00391aff71ec0c52b7fc08fae18c711/library/core/src/panicking.rs:76:14
   2: core::result::unwrap_failed
             at /rustc/ee612c45f00391aff71ec0c52b7fc08fae18c711/library/core/src/result.rs:1699:5
   3: core::result::Result<T,E>::unwrap
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/result.rs:1104:23
   4: arrow_avro::reader::test::read_file
             at ./src/reader/mod.rs:125:9
   5: arrow_avro::reader::test::test_contet
             at ./src/reader/mod.rs:131:31
   6: arrow_avro::reader::test::test_contet::{{closure}}

after comment the check logic in records.rs Decoder#flush for there will throw some other exception

called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Incorrect array length for StructArray field \"partition\", expected 1 got 0")
thread 'reader::test::test_contet' panicked at arrow-array/src/array/struct_array.rs:91:46:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Incorrect array length for StructArray field \"partition\", expected 1 got 0")
stack backtrace:
   0: rust_begin_unwind
             at /rustc/ee612c45f00391aff71ec0c52b7fc08fae18c711/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_fmt
             at /rustc/ee612c45f00391aff71ec0c52b7fc08fae18c711/library/core/src/panicking.rs:76:14
   2: core::result::unwrap_failed
             at /rustc/ee612c45f00391aff71ec0c52b7fc08fae18c711/library/core/src/result.rs:1699:5
   3: core::result::Result<T,E>::unwrap
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/result.rs:1104:23
   4: arrow_array::array::struct_array::StructArray::new
             at /Users/qiucongxian/arrow-rs/arrow-array/src/array/struct_array.rs:91:9
   5: arrow_avro::reader::record::Decoder::flush
             at ./src/reader/record.rs:450:28
   6: arrow_avro::reader::record::RecordDecoder::flush::{{closure}}
             at ./src/reader/record.rs:83:22
   7: core::iter::adapters::map::map_try_fold::{{closure}}
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/iter/adapters/map.rs:95:28
   8: core::iter::traits::iterator::Iterator::try_fold
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:2370:21
   9: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/iter/adapters/map.rs:121:9
  10: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::try_fold
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/iter/adapters/mod.rs:191:9
  11: core::iter::traits::iterator::Iterator::try_for_each
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:2431:9
  12: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/iter/adapters/mod.rs:174:14
  13: alloc::vec::Vec<T,A>::extend_desugared
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:3518:35
  14: <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/alloc/src/vec/spec_extend.rs:19:9
  15: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/alloc/src/vec/spec_from_iter_nested.rs:42:9
  16: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/alloc/src/vec/spec_from_iter.rs:34:9
  17: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:3410:9
  18: core::iter::traits::iterator::Iterator::collect
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:1971:9
  19: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter::{{closure}}
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/result.rs:1980:51
  20: core::iter::adapters::try_process
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/iter/adapters/mod.rs:160:17
  21: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/result.rs:1980:9
  22: core::iter::traits::iterator::Iterator::collect
             at /Users/qiucongxian/.rustup/toolchains/nightly-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:1971:9
  23: arrow_avro::reader::record::RecordDecoder::flush
             at ./src/reader/record.rs:80:22
  24: arrow_avro::reader::test::read_file

the file used to test is as following, please remove the ".txt" suffix

56f216f3-7e24-40b0-a76a-87a63a5bc254-m0.avro.txt

@jecsand838
Copy link
Author

jecsand838 commented Apr 7, 2025

Hi all, wanted to let you know I'm still working on this. I've just been extremely busy.

@klion26 I'm going to address those issues you called out towards the end of this week / early next week.

Also in the future I'll ensure the contributions are much smaller and focused. I do apologize again for how large and complicated this one was.

@nathaniel-elastiflow
Copy link

nathaniel-elastiflow commented Apr 17, 2025

@klion26 I've pushed a fix for this case. It seems Arrow applies an implicit valid field for the non-nullable partition, which has no explicit fields.

@klion26
Copy link
Member

klion26 commented Apr 21, 2025

@nathaniel-elastiflow thanks for the fix, it works for me now.

@alamb
Copy link
Contributor

alamb commented Apr 28, 2025

@klion26 is there any chance you can help review this PR?

@jecsand838 and @nathaniel-elastiflow do you think this PR is ready for review?

@alamb
Copy link
Contributor

alamb commented Apr 28, 2025

The chance I am going to be able to find time to review this PR in its current form (2000+ lines) is quite low. Maybe another reviewer will be able to

Is there any possibility to break it up into smaller parts for eaiser review? Specifically, it would be really great to get whatever parts are non breaking into their own PR first or cleanups / etc

Also see a new related PR here:

@jecsand838
Copy link
Author

@alamb

We think it's ready for review.

However we can break this up to make it easier to review. We do have a completed Avro-Reader already implemented on our fork which we were planning to introduce piece by piece.

I'll work with @nathaniel-elastiflow on creating PRs for each data-type first. That should decompose this down nicely.

@jecsand838
Copy link
Author

@alamb @tustvold We went ahead and made our first decomposed PR for Map types: #7451, which is ready for review.

@klion26
Copy link
Member

klion26 commented Apr 29, 2025

@klion26 is there any chance you can help review this PR?

@jecsand838 and @nathaniel-elastiflow do you think this PR is ready for review?

@alamb I'll try to help review this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants