jecsand838
Contributor

Which issue does this PR close?

Rationale for this change

NOTE: This PR contains over 2300 lines of test code. The actual production code diff is less than 800 LOC.

Before we publish arrow-avro, we want to "minimize its public API surface" and ship a well‑tested, spec‑compliant implementation. In the process of adding intensive regression tests and canonical‑form checks, we found several correctness gaps around alias handling, union resolution, Unicode/name validation, list child nullability, “null” string handling, and a mis-wired Writer capacity setting. This PR tightens the API and fixes those issues to align with the Avro spec (aliases and defaults, union resolution, names and Unicode, etc.).

What changes are included in this PR?

Public API tightening

  • Restrict visibility of numerous schema/codec types and functions within arrow-avro so only intended entry points are public ahead of making the crate public.

Bug fixes discovered via regression testing (All fixed)

  1. Alias bugs (aliases without defaults / restrictive namespaces)
    • Enforce spec‑compliant alias resolution: aliases may be fully‑qualified or relative to the reader’s namespace, and alias‑based rewrites still require reader defaults when the writer field is absent. This follows Avro’s alias rules and record‑field default behavior.
  2. Special‑case union resolution (writer not a union, reader is)
    • When the writer schema is not a union but the reader is, we no longer attempt to decode a union type_id; per spec, the reader must pick the first union branch that matches the writer’s schema.
  3. Valid Avro Unicode characters & name rules in Schema
    • Distinguish between Unicode strings (which may contain any valid UTF‑8) and identifiers (names/enum symbols) which must match [A-Za-z_][A-Za-z0-9_]*. Tests were added to accept valid Unicode string content while enforcing the ASCII identifier regex.
  4. Nullable ListArray child item bug
    • Correct mapping of Avro array item nullability to Arrow ListArray’s inner "item" field. (By convention the inner field is named "item" and nullability is explicit.) This aligns with Arrow’s builder/typing docs.
  5. “null” string vs. hard null
    • Fix default/value handling to differentiate JSON null from the string literal "null" per the Avro defaults table.
  6. Writer capacity knob wired up
    • Plumb the provided capacity through the writer implementation so the configured preallocation size is respected. (See changes under arrow-avro/src/writer/mod.rs.)
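
As an illustration of the identifier rule in item 3, here is a minimal sketch of a check for the pattern [A-Za-z_][A-Za-z0-9_]*. The helper name `is_valid_avro_name` is hypothetical and is not taken from the PR; the crate's actual validation may be structured differently.

```rust
/// Returns true if `s` matches the Avro name pattern `[A-Za-z_][A-Za-z0-9_]*`.
/// Hypothetical helper for illustration only; not the crate's actual function.
fn is_valid_avro_name(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // First character: ASCII letter or underscore.
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        // Empty input or any other leading character is rejected.
        _ => return false,
    }
    // Remaining characters: ASCII letters, digits, or underscore.
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}
```

Under this sketch, a name such as `user_id` is accepted, while `1abc` or a name containing non-ASCII letters is rejected. String content (docs, string defaults) would not go through this check at all, which is the distinction the fix draws.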

Are these changes tested?

Yes. This PR adds substantial regression coverage:

  • Canonical‑form checks for schemas.
  • Alias/namespace + default‑value resolution cases.
  • Reader‑union vs. writer‑non‑union decoding paths.
  • Unicode content vs. identifier name rules.
  • ListArray inner field nullability behavior.
  • Round‑trips exercising the Writer with the capacity knob set.

A new, comprehensive Avro fixture (test/data/comprehensive_e2e.avro) is included to drive end‑to‑end scenarios and edge cases.

Are there any user-facing changes?

N/A

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Sep 29, 2025
@jecsand838 jecsand838 force-pushed the code-cleanup-bug-fixes branch 2 times, most recently from 56b7b23 to e1e90e0 Compare September 29, 2025 09:01
…nctions within `arrow-avro`; bug fixes; centralize nullability handling; enforce spec-compliant alias and default value behavior; and improve tests with canonical form validation.
@alamb
Contributor

alamb commented Sep 29, 2025

Thanks @jecsand838 -- I'll try and get through this over the next few days. @nathaniel-d-ef any chance you can help review this PR too?

@jecsand838 jecsand838 force-pushed the code-cleanup-bug-fixes branch from fbb0cbe to efe3cba Compare September 29, 2025 19:52
.map(|(idx, wf)| (wf.name, idx))
.collect();
// Build writer lookup and ambiguous alias set.
let (writer_lookup, ambiguous_writer_aliases) = Self::build_writer_lookup(writer_record);
Contributor Author

(Related to bug 1) Alias resolution + defaults: These changes make alias matching spec‑compliant and handle defaults correctly when a reader field isn’t present in the writer schema.

  • full_name_set(..) builds canonical full names for a type (name + namespace) and its aliases, and names_match(..) now compares these sets rather than raw strings — this allows both fully‑qualified and namespace‑relative aliases to match per the Avro spec.

  • ensure_names_match(..) is updated to pass both writer and reader namespaces/aliases to names_match(..).

  • In Maker::resolve_records(..) we now pre-build a writer lookup that includes aliases, detect ambiguous writer aliases, and, in strict_mode, error out if a reader alias would map to multiple writer fields.

When a reader field has no writer match, we now require an explicit default or (if the reader union is ["null", T]) synthesize default null for the union‑first branch.
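
The set-based name comparison described above can be sketched roughly as follows. The function names mirror those mentioned in the comment, but the signatures and the "sets share any name" matching rule are assumptions made for illustration; the crate's real implementation may differ.

```rust
use std::collections::HashSet;

/// Qualify `name` with `namespace` unless it is already a dotted full name.
/// (Assumed behavior for this sketch.)
fn full_name(name: &str, namespace: Option<&str>) -> String {
    match namespace {
        Some(ns) if !ns.is_empty() && !name.contains('.') => format!("{}.{}", ns, name),
        _ => name.to_string(),
    }
}

/// Collect the full name of a type together with its aliases, each resolved
/// against the same namespace when not already fully qualified.
fn full_name_set(name: &str, namespace: Option<&str>, aliases: &[&str]) -> HashSet<String> {
    let mut set = HashSet::new();
    set.insert(full_name(name, namespace));
    for alias in aliases {
        set.insert(full_name(alias, namespace));
    }
    set
}

/// Assumed rule: two types match when their full-name sets share any entry.
fn names_match(writer: &HashSet<String>, reader: &HashSet<String>) -> bool {
    !writer.is_disjoint(reader)
}
```

With this sketch, a reader type `Person` in namespace `com.example` with alias `User` matches a writer type `com.example.User`, and so does a reader in another namespace that lists the fully qualified alias `com.example.User`. A relative alias `User` in a different namespace does not match.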

Comment on lines +212 to +220
let mut nullable = self.nullability.is_some();
if !nullable {
if let Codec::Union(children, _, _) = self.codec() {
// If any encoded branch is `null`, mark field as nullable
if children.iter().any(|c| matches!(c.codec(), Codec::Null)) {
nullable = true;
}
}
}
Contributor Author

These changes along with the change on line 772 are related to Bug 4: Child nullability fix

We had cases where a list/map child field didn’t get marked nullable when the Avro type was a ["null", T] union. The fix entailed:

  • AvroDataType::field_with_name(..) now inspects union children and flags the field as nullable if any branch is null.
  • The map path (Codec::Map) is updated to use field_with_name("value") rather than recomputing nullability by hand.
  • With nullability now centralized in field_with_name, list "item" (and other child fields built from AvroDataType) inherit the correct nullability whenever null is part of the union.
  • This removes divergent logic and fixes the child‑nullability mismatch observed in tests.
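
A simplified sketch of the centralization idea: one helper decides nullability, and every child field (list "item", map "value") is built through it. The `Codec` and `SimpleField` types below are reduced stand-ins invented for this example and are not the crate's real `Codec`, `AvroDataType`, or Arrow `Field` types.

```rust
/// Simplified stand-in for the crate's codec type (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
enum Codec {
    Null,
    Int32,
    Utf8,
    /// Union branches in declaration order.
    Union(Vec<Codec>),
}

/// Simplified stand-in for an Arrow field: just a name and a nullable flag.
#[derive(Debug, PartialEq)]
struct SimpleField {
    name: String,
    nullable: bool,
}

/// Single place where child nullability is decided in this sketch.
fn field_with_name(codec: &Codec, name: &str) -> SimpleField {
    let nullable = match codec {
        // A union is nullable if any branch is `null`, regardless of position.
        Codec::Union(branches) => branches.iter().any(|b| matches!(b, Codec::Null)),
        _ => false,
    };
    SimpleField { name: name.to_string(), nullable }
}
```

Because both the list and map paths call the same helper, a `["null", int]` item and a `[string, "null"]` value end up nullable by the same rule, which is the divergence the fix removes.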


/// Compare two Avro schemas for equality (identical schemas).
/// Returns true if the schemas have the same parsing canonical form (i.e., logically identical).
pub fn compare_schemas(writer: &Schema, reader: &Schema) -> Result<bool, ArrowError> {
Contributor Author

This was dead code

/// variants is the null type, and use this to derive arrow's notion of nullability
#[derive(Debug, Copy, Clone, PartialEq, Default)]
-pub enum Nullability {
+pub(crate) enum Nullability {
Contributor Author

I made all of these pub(crate) to tighten the public API. There's really no benefit for now for them to be used external to the crate. The idea is to have AvroSchema act as our public schema interface.

Comment on lines +222 to +229
pub(crate) attributes: Attributes<'a>,
}

fn deserialize_default<'de, D>(deserializer: D) -> Result<Option<Value>, D::Error>
where
D: serde::Deserializer<'de>,
{
Value::deserialize(deserializer).map(Some)
Contributor Author

This and the change on line 245, #[serde(deserialize_with = "deserialize_default", default)], are fixes for Bug 5: "null" (string) vs. null (JSON) defaults

Contributor

This is a great candidate for a native method IMO, good solution
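
For context on why this thread's deserializer wraps the value in `Some`: with serde, an `Option<T>` field normally deserializes JSON `null` to `None`, which makes an explicit `null` default indistinguishable from a missing one. The sketch below models the three cases the fix needs to keep apart, using a tiny hand-written `Json` enum as a stand-in for the `Value` type in the snippet. All type and function names here are hypothetical.

```rust
/// Minimal stand-in for a JSON value (hypothetical; only the two cases
/// relevant to this example are modelled).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    String(String),
}

/// How a field's `default` attribute is interpreted in this sketch.
#[derive(Debug, PartialEq)]
enum DefaultKind {
    /// The field has no `default` key at all.
    Missing,
    /// The field has `"default": null` (JSON null).
    ExplicitNull,
    /// The field has a string default, possibly the literal text "null".
    StringLiteral(String),
}

/// `None` means the key was absent; `Some(Json::Null)` means it was present
/// and null. Keeping these apart is the point of wrapping in `Some`.
fn classify_default(default: Option<&Json>) -> DefaultKind {
    match default {
        None => DefaultKind::Missing,
        Some(Json::Null) => DefaultKind::ExplicitNull,
        Some(Json::String(s)) => DefaultKind::StringLiteral(s.clone()),
    }
}
```

In this model, `"default": "null"` on a string field stays a four-character string, while `"default": null` is a real null and an absent key is neither.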

/// Alternative names (aliases) for this field (Avro spec: field-level aliases).
/// Borrowed from input JSON where possible.
#[serde(borrow, default)]
pub(crate) aliases: Vec<&'a str>,
Contributor Author

Adding the explicit alias field on Field here was part of fixing Bug 1: Alias resolution + defaults

/// Optional documentation for this field
#[serde(borrow, default)]
-pub doc: Option<&'a str>,
+pub(crate) doc: Option<Cow<'a, str>>,
Contributor Author

The use of Cow was to fix Bug 3) Avro Unicode string content


/// Runtime plan for decoding reader-side `["null", T]` types.
#[derive(Clone, Copy, Debug)]
enum NullablePlan {
Contributor Author

Adding NullablePlan and the rest of the changes in this file are the fixes for Bug 2) Special‑case union resolution (writer not a union, reader is)
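
A rough sketch of the resolution rule behind Bug 2: when the writer schema is not a union, no branch index exists in the encoded data, so the reader has to choose a branch from the schemas alone. The `AvroType` enum and function names below are invented for illustration and model only four primitive types; this is not the `NullablePlan` implementation from the PR.

```rust
/// Tiny model of Avro types, enough to show branch selection (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum AvroType {
    Null,
    Int,
    Long,
    String,
}

/// A writer type matches a reader branch if it is identical or promotable.
/// Among the types modelled here, the only promotion is int to long.
fn branch_matches(writer: &AvroType, reader: &AvroType) -> bool {
    writer == reader || matches!((writer, reader), (AvroType::Int, AvroType::Long))
}

/// Return the index of the first reader union branch that matches the
/// writer's non-union type. Nothing is read from the data stream here.
fn resolve_into_reader_union(writer: &AvroType, reader_branches: &[AvroType]) -> Option<usize> {
    reader_branches.iter().position(|b| branch_matches(writer, b))
}
```

For a writer `string` and a reader `["null", "string"]`, this selects branch 1 and every value decodes as a plain string. Returning `None` corresponds to the case where no branch matches and resolution should fail.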


fn write_ocf_block(&mut self, batch: &RecordBatch, sync: &[u8; 16]) -> Result<(), ArrowError> {
-let mut buf = Vec::<u8>::with_capacity(1024);
+let mut buf = Vec::<u8>::with_capacity(self.capacity);
Contributor Author

This was the fix for Bug 6) Capacity knob in the Writer not used.
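
The shape of this fix can be shown with a minimal, self-contained sketch: the configured capacity is stored on the writer and used when the per-block buffer is allocated, instead of a hard-coded constant. `BlockWriter` and its methods are hypothetical names for this example, not the crate's writer API.

```rust
/// Hypothetical writer that remembers the capacity it was configured with.
struct BlockWriter {
    capacity: usize,
}

impl BlockWriter {
    /// Store the caller-provided capacity rather than discarding it.
    fn with_capacity(capacity: usize) -> Self {
        Self { capacity }
    }

    /// Allocate the per-block staging buffer using the configured capacity.
    fn new_block_buffer(&self) -> Vec<u8> {
        Vec::<u8>::with_capacity(self.capacity)
    }
}
```

`Vec::with_capacity` guarantees at least the requested capacity, so a round-trip test can assert on `buf.capacity()` to confirm the setting is respected.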

@jecsand838
Contributor Author

> Thanks @jecsand838 -- I'll try and get through this over the next few days. @nathaniel-d-ef any chance you can help review this PR too?

Absolutely! I just left some comments that tie the changes to the bug fixes. Hopefully this helps with the reviewing.

Contributor

@nathaniel-d-ef nathaniel-d-ef left a comment

Good work squashing these bugs! Just a couple small suggestions.

}

/// Returns an arrow [`Field`] with the given name
pub fn field_with_name(&self, name: &str) -> Field {
Contributor

pub(crate) here?

Contributor Author

Actually, I think you have a point here. I'll tighten the public API in codec.rs further. We can always loosen it later if needed.


fn ensure_names_match(
data_type: &str,
writer_name: &str,
Contributor

This could be a little DRYer where the name, namespace, alias are abstracted. Not a big deal though.

Contributor Author

I agree. Any issues if I follow up on this in a separate PR? I'm trying to keep this PR focused, and it's already pretty large.

Contributor

Not at all


## Comprehensive E2E Coverage File

**Purpose:** A single OCF that exercises **all decoder paths** used by `arrow-avro` with both **nested and non‑nested** shapes, including **dense unions** (null‑first, null‑second, multi‑branch), **aliases** (type and field), **default values**, **docs** and **namespaces**, and combinations thereof. It’s intended to validate the final `Reader` implementation and to stress schema‑resolution behavior in the tests under `arrow-avro/src/reader/mod.rs`.
Contributor

Epic 💪

Contributor

@alamb alamb left a comment

Thank you @jecsand838 and @nathaniel-d-ef

I skimmed this PR (I did not have time to study it in detail) but I found it well documented, well tested and easy to understand

Echoing the sentiments of @nathaniel-d-ef

Epic 💪

Thanks again for this great work 🚀

}

#[test]
fn comprehensive_e2e_resolution_test() {
Contributor

this is quite a test

Contributor Author

100%. I'll probably come back around at some point and enhance the maintainability of these tests. I just wanted to throw in every edge case I could think of.

@alamb alamb merged commit 28aaee8 into apache:main Oct 4, 2025
23 checks passed
alamb pushed a commit that referenced this pull request Oct 6, 2025
# Which issue does this PR close?

- **Related to**: #4886 (“Add Avro Support”)

**NOTE:** This PR is stacked on
#8492

# Rationale for this change

`arrow-avro` has seen significant development since `#![allow(unused)]`
was temporarily added to `lib.rs`. Due to fast iteration on the code,
this has led to unused methods and imports throughout the crate, which
need to be cleaned up prior to `arrow-avro` becoming public.

This PR simply removes `#![allow(unused)]` and cleans up the
`arrow-avro` crate's code to comply.

# What changes are included in this PR?

Deleted the `#![allow(unused)]` in `lib.rs` and updated the crate's code
as needed. This impacted almost every file in the crate; however, the
changes in this PR are 100% focused and isolated around only the work
related to removing `#![allow(unused)]`.

# Are these changes tested?

The changes in this PR are covered by existing tests. No new
functionality or behavior has been changed/added. This PR is simply
clean up around removing `#![allow(unused)]` from `lib.rs`.

# Are there any user-facing changes?

N/A
mbrobbel pushed a commit that referenced this pull request Oct 6, 2025
# Which issue does this PR close?

- Closes #8504 
- Part of #4886 

# Rationale for this change

Due to the number of features `arrow-avro` has, a README.md file clearly
explaining these features and when to use them would be useful for our
users.

# What changes are included in this PR?

* New README.md file for arrow-avro
* Minor change in `schema.rs` (made `Array` `pub(crate)`). This is a
one-liner that was missed in #8492

# Are these changes tested?

These changes involve adding a README.md and tightening the public API
in a missed spot. Nothing should break and all existing tests should
pass.

# Are there any user-facing changes?

N/A since `arrow-avro` isn't public yet.
@jecsand838 jecsand838 deleted the code-cleanup-bug-fixes branch October 8, 2025 20:27