-
Notifications
You must be signed in to change notification settings - Fork 1k
Regression Testing, Bug Fixes, and Public API Tightening for arrow-avro #8492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
56b7b23
to
e1e90e0
Compare
…nctions within `arrow-avro`; bug fixes; centralize nullability handling; enforce spec-compliant alias and default value behavior; and improve tests with canonical form validation.
e1e90e0
to
5bc743c
Compare
THanks @jecsand838 -- I'll try and get through this over the next few days. @nathaniel-d-ef any chance you can help review this PR too? |
fbb0cbe
to
efe3cba
Compare
.map(|(idx, wf)| (wf.name, idx)) | ||
.collect(); | ||
// Build writer lookup and ambiguous alias set. | ||
let (writer_lookup, ambiguous_writer_aliases) = Self::build_writer_lookup(writer_record); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Related to bug 1) Alias resolution + defaults: These changes makes alias matching spec‑compliant and handles defaults correctly when a reader field isn’t present in the writer schema.
-
full_name_set(..)
builds canonical full names for a type (name + namespace) and its aliases, and names_match(..) now compares these sets rather than raw strings — this allows both fully‑qualified and namespace‑relative aliases to match per the Avro spec. -
ensure_names_match(..)
is updated to pass both writer and reader namespaces/aliases to names_match(..). -
In
Maker::resolve_records(..)
we now pre-build a writer lookup that includes aliases, detect ambiguous writer aliases, and, in strict_mode, error out if a reader alias would map to multiple writer fields.
When a reader field has no writer match, we now require an explicit default or (if the reader union is ["null", T]
) synthesize default null
for the union‑first branch.
let mut nullable = self.nullability.is_some(); | ||
if !nullable { | ||
if let Codec::Union(children, _, _) = self.codec() { | ||
// If any encoded branch is `null`, mark field as nullable | ||
if children.iter().any(|c| matches!(c.codec(), Codec::Null)) { | ||
nullable = true; | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes along with the change on line 772 are related to Bug 4: Child nullability fix
We had cases where a list/map child field didn’t get marked nullable when the Avro type was a ["null", T]
union. The fix entailed:
AvroDataType::field_with_name(..)
now inspects union children and flags the field as nullable if any branch is null.- The map path (
Codec::Map
) is updated to usefield_with_name("value")
rather than recomputing nullability by hand. - With nullability now centralized in
field_with_name
, list"item"
(and other child fields built fromAvroDataType
) inherit the correct nullability whenever null is part of the union. - This removes divergent logic and fixes the child‑nullability mismatch observed in tests.
|
||
/// Compare two Avro schemas for equality (identical schemas). | ||
/// Returns true if the schemas have the same parsing canonical form (i.e., logically identical). | ||
pub fn compare_schemas(writer: &Schema, reader: &Schema) -> Result<bool, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was dead code
/// variants is the null type, and use this to derive arrow's notion of nullability | ||
#[derive(Debug, Copy, Clone, PartialEq, Default)] | ||
pub enum Nullability { | ||
pub(crate) enum Nullability { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made all of these pub(crate)
to tighten the public API. There's really no benefit for now for them to be used external to the crate. The idea is to have AvroSchema
act as our public schema interface.
pub(crate) attributes: Attributes<'a>, | ||
} | ||
|
||
fn deserialize_default<'de, D>(deserializer: D) -> Result<Option<Value>, D::Error> | ||
where | ||
D: serde::Deserializer<'de>, | ||
{ | ||
Value::deserialize(deserializer).map(Some) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the change on line 245, #[serde(deserialize_with = "deserialize_default", default)]
, are fixes for Bug 5: "null" (string) vs. null (JSON) defaults
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great candidate for a native method IMO, good solution
/// Alternative names (aliases) for this field (Avro spec: field-level aliases). | ||
/// Borrowed from input JSON where possible. | ||
#[serde(borrow, default)] | ||
pub(crate) aliases: Vec<&'a str>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the explicit alias
field on Field
here was part of fixing Bug 1: Alias resolution + defaults
/// Optional documentation for this field | ||
#[serde(borrow, default)] | ||
pub doc: Option<&'a str>, | ||
pub(crate) doc: Option<Cow<'a, str>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of Cow
was to fix Bug 3) Avro Unicode string content
|
||
/// Runtime plan for decoding reader-side `["null", T]` types. | ||
#[derive(Clone, Copy, Debug)] | ||
enum NullablePlan { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding NullablePlan
and the rest of the changes in this file are the fixes for Bug 2) Special‑case union resolution (writer not a union, reader is)
|
||
fn write_ocf_block(&mut self, batch: &RecordBatch, sync: &[u8; 16]) -> Result<(), ArrowError> { | ||
let mut buf = Vec::<u8>::with_capacity(1024); | ||
let mut buf = Vec::<u8>::with_capacity(self.capacity); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the fix for Bug 6) Capacity knob in the Writer not used.
Absolutely! I just left some comments that tie the changes to the bug fixes. Hopefully this helps with the reviewing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work squashing these bugs! Just a couple small suggestions.
} | ||
|
||
/// Returns an arrow [`Field`] with the given name | ||
pub fn field_with_name(&self, name: &str) -> Field { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pub(crate) here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think you have a point here. I'll tighten the public API in codec.rs
further. We can always make loosen it later if needed.
|
||
fn ensure_names_match( | ||
data_type: &str, | ||
writer_name: &str, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a little DRYer where the name, namespace, alias are abstracted. Not a big deal though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, any issues if I follow-up on this? Trying to keep this PR more focused, it's already pretty large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all
pub(crate) attributes: Attributes<'a>, | ||
} | ||
|
||
fn deserialize_default<'de, D>(deserializer: D) -> Result<Option<Value>, D::Error> | ||
where | ||
D: serde::Deserializer<'de>, | ||
{ | ||
Value::deserialize(deserializer).map(Some) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great candidate for a native method IMO, good solution
|
||
## Comprehensive E2E Coverage File | ||
|
||
**Purpose:** A single OCF that exercises **all decoder paths** used by `arrow-avro` with both **nested and non‑nested** shapes, including **dense unions** (null‑first, null‑second, multi‑branch), **aliases** (type and field), **default values**, **docs** and **namespaces**, and combinations thereof. It’s intended to validate the final `Reader` implementation and to stress schema‑resolution behavior in the tests under `arrow-avro/src/reader/mod.rs`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Epic 💪
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jecsand838 and @nathaniel-d-ef
I skimmed this PR (I did not have time to study it in detail) but I found it well documented, well tested and easy to understand
Echoing the sentiments of @nathaniel-d-ef
Epic 💪
Thanks again for this great work 🚀
} | ||
|
||
#[test] | ||
fn comprehensive_e2e_resolution_test() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is quite a test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100%. I'll probably come back around at some point and enhance the maintainability of these tests. I just wanted to throw in every edge case I could think of.
# Which issue does this PR close? - **Related to**: #4886 (“Add Avro Support”) **NOTE:** This PR is stacked on #8492 # Rationale for this change `arrow-avro` has seen significant development since `#![allow(unused)]` was temporarily added to `lib.rs`. Due to fast iteration on the code, this has led to unused methods and imports throughout the crate, which need to be cleaned up prior to `arrow-avro` becoming public. This PR simply removes `#![allow(unused)]` and cleans up the `arrow-avro` crate's code to comply. # What changes are included in this PR? Deleted the `#![allow(unused)]` in `lib.rs` and updated the crate's code as needed. This impacted almost every files of the crate, however the changes in this PR are 100% focused and isolated around only the work related to removing `#![allow(unused)]`. # Are these changes tested? The changes in this PR are covered by existing tests. No new functionality or behavior has been changed/added. This PR is simply clean up around removing `#![allow(unused)]` from `lib.rs`. # Are there any user-facing changes? N/A
# Which issue does this PR close? - Closes #8504 - Part of #4886 # Rationale for this change Due to the number of features `arrow-avro` has, a README.md file clearly explaining these features and when to use them would be useful for our users. # What changes are included in this PR? * New README.md file for arrow-avro * Minor change in `schema.rs` (made `Array` `pub(crate)`). This is a one liner that was missed in #8492 # Are these changes tested? These changes involve adding a README.md and tightening the public API in a missed spot. Nothing should break and all existing tests should pass. # Are there any user-facing changes? N/A since `arrow-avro` isn't public yet.
Which issue does this PR close?
Rationale for this change
NOTE: This PR contains over 2300 lines of test code. The actual production code diff is less than 800 LOC.
Before we publish
arrow-avro
, we want to "minimize its public API surface" and ship a well‑tested, spec‑compliant implementation. In the process of adding intensive regression tests and canonical‑form checks, we found several correctness gaps around alias handling, union resolution, Unicode/name validation, list child nullability, “null” string handling, and a mis-wiredWriter
capacity setting. This PR tightens the API and fixes those issues to align with the Avro spec (aliases and defaults, union resolution, names and Unicode, etc.).What changes are included in this PR?
Public API tightening
arrow-avro
so only intended entry points are public ahead of making the crate public.Bug fixes discovered via regression testing (All fixed)
type_id
; per spec, the reader must pick the first union branch that matches the writer’s schema.[A-Za-z_][A-Za-z0-9_]*
. Tests were added to accept valid Unicode string content while enforcing the ASCII identifier regex.ListArray
child item bugListArray
’s inner"item"
field. (By convention the inner field is named"item"
and nullability is explicit.) This aligns with Arrow’s builder/typing docs.null
null
from the string literal"null"
per the Avro defaults table.Writer
capacity knob wired uparrow-avro/src/writer/mod.rs
.)Are these changes tested?
Yes. This PR adds substantial regression coverage:
ListArray
inner field nullability behavior.Writer
with the capacity knob set.A new, comprehensive Avro fixture (
test/data/comprehensive_e2e.avro
) is included to drive end‑to‑end scenarios and edge cases,.Are there any user-facing changes?
N/A