chore: fix clippy::large_enum_variant for DataFusionError #15861

Merged

rroelke
Contributor

@rroelke rroelke commented Apr 25, 2025

Which issue does this PR close?

Rationale for this change

Fixes clippy::large_enum_variant, which has been enabled by default on nightly. Not only does it flag a problem when compiling with features = ["avro"], it also propagates many lint warnings to downstream projects which use Result<T, DataFusionError> as a return type.

What changes are included in this PR?

The DataFusionError::AvroError(AvroError) variant is changed to DataFusionError::AvroError(Box<AvroError>).

Are these changes tested?

Regression testing will validate that error messages from Avro are preserved.

I have tested using the reproducer from #15860 to verify that this change removes the lint. To add a regression test to verify this would require changes to project configuration or CI. I will implement this if requested but didn't see the point in putting in up-front effort.

Are there any user-facing changes?

Users invoking the DataFusionError::AvroError constructor directly must update their code to either DataFusionError::AvroError(Box::new(my_avro_error)) or DataFusionError::from(my_avro_error).
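
As a rough sketch of what that migration looks like at a call site, the following uses simplified stand-ins rather than the real crates: the `AvroError` struct, the one-variant `DataFusionError` enum, and the `read_avro` helper are all hypothetical and exist only to make the example self-contained. It assumes a `From<AvroError>` conversion like the one mentioned above.

```rust
use std::fmt;

// Hypothetical stand-in for the Avro crate's error type.
#[derive(Debug)]
struct AvroError(String);

impl fmt::Display for AvroError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "avro error: {}", self.0)
    }
}

impl std::error::Error for AvroError {}

// Simplified stand-in for DataFusionError, showing only the boxed variant.
#[derive(Debug)]
enum DataFusionError {
    AvroError(Box<AvroError>),
}

// Assumed conversion: boxes the Avro error on the way in.
impl From<AvroError> for DataFusionError {
    fn from(e: AvroError) -> Self {
        DataFusionError::AvroError(Box::new(e))
    }
}

// Hypothetical reader that fails on request.
fn read_avro(fail: bool) -> Result<u32, AvroError> {
    if fail {
        Err(AvroError("bad block".to_string()))
    } else {
        Ok(7)
    }
}

// Option 1: construct the variant directly, boxing explicitly.
fn explicit(fail: bool) -> Result<u32, DataFusionError> {
    read_avro(fail).map_err(|e| DataFusionError::AvroError(Box::new(e)))
}

// Option 2: let `?` go through the From impl, so no Box appears at the call site.
fn with_question_mark(fail: bool) -> Result<u32, DataFusionError> {
    let n = read_avro(fail)?;
    Ok(n)
}

fn main() {
    println!("{:?}", explicit(true));
    println!("{:?}", with_question_mark(false));
}
```

Code that only ever converts with `?` or `from` should not need to change under this assumption; only direct uses of the variant constructor do.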

@github-actions github-actions bot added the common (Related to common crate) and datasource (Changes to the datasource crate) labels Apr 25, 2025
@rroelke
Contributor Author

rroelke commented Apr 25, 2025

From the guidelines this should also have the "api-change" label but I don't think I have permission to add it.

@@ -59,7 +59,7 @@ pub enum DataFusionError {
     ParquetError(ParquetError),
     /// Error when reading Avro data.
     #[cfg(feature = "avro")]
-    AvroError(AvroError),
+    AvroError(Box<AvroError>),
Contributor

wondering why now the error needs to be boxed? 🤔

Contributor Author

@rroelke rroelke Apr 25, 2025

I don't have the specific numbers but here's a verbose explanation of what the lint is trying to tell us about:

enum MyErrorType {
    Something,
    SomethingElse([u8; 4096])
}

We can see here that the size of MyErrorType is roughly 4096 bytes (the 4096-byte payload plus a discriminant).

fn try_thing() -> Result<usize, MyErrorType> {
    ...
}

Consequently the size of Result<T, MyErrorType> is always at least 4096 bytes, so Result<usize, MyErrorType> is a little over 4096 bytes even though the usize itself is only pointer-sized. To call try_thing, the caller has to reserve memory to hold a result of that size, and the usize is not returned in a register. Since most of the time we expect to see Ok rather than Err, this makes the common case a lot worse.

Boxing the large variants fixes the issue since it reduces the size of the variant to a pointer, which fits in a register.
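
The size difference can be checked directly with `std::mem::size_of`. This is a minimal sketch built on the toy enum above, not on DataFusion's types: `InlineError` and `BoxedError` are hypothetical names, and the exact numbers printed depend on the target and compiler, so only the rough magnitudes matter.

```rust
use std::mem::size_of;

// Hypothetical error type with a 4 KiB payload stored inline.
#[allow(dead_code)]
enum InlineError {
    Something,
    SomethingElse([u8; 4096]),
}

// Same shape, but the large payload lives behind a pointer on the heap.
#[allow(dead_code)]
enum BoxedError {
    Something,
    SomethingElse(Box<[u8; 4096]>),
}

// Size of the Result a caller must make room for in each case.
fn inline_result_size() -> usize {
    size_of::<Result<usize, InlineError>>()
}

fn boxed_result_size() -> usize {
    size_of::<Result<usize, BoxedError>>()
}

fn main() {
    println!("InlineError:                {} bytes", size_of::<InlineError>());
    println!("BoxedError:                 {} bytes", size_of::<BoxedError>());
    println!("Result<usize, InlineError>: {} bytes", inline_result_size());
    println!("Result<usize, BoxedError>:  {} bytes", boxed_result_size());
}
```

The trade-off is that constructing the boxed variant now costs a heap allocation, which is paid only on the error path.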

Contributor

Should we box other error type too?

Contributor Author

I'd favor that. There might be another error variant which is less than the clippy threshold size but is still large enough to prevent certain optimizations. If all the error variants were boxed then we could be confident that DataFusionError could be copied around in registers and then perhaps Result<T, DataFusionError> also could be for small T.

Contributor

I personally suggest avoiding additional API churn (aka boxing all variants) unless there is some particular problem we are trying to solve, or we can improve performance in some way we can measure.

@comphead comphead added the api change (Changes the API exposed to users of the crate) label Apr 26, 2025
@comphead
Contributor

comphead commented Apr 26, 2025

Thanks @rroelke
Started the flow, added api-change.
I got the idea about the size, but now it's on the heap vs. the stack, and we introduce a memory round trip through indirection. I was just wondering whether this boxing is intended to fix the lint issue or to improve performance? If it is the latter, it would be good to check the planner bench like #15796 (comment)

Btw we do box here: SchemaError(SchemaError, Box<Option<String>>), but that was because the error details were too large and sometimes led to stack overflows, so we moved them to the heap.

@comphead
Contributor

I just checked AvroError; it's actually a wrapper around the apache-avro internal error, and some variants can be pretty large. To avoid a possible stack overflow, it makes sense to move it to the heap like clippy suggested 🤔 Btw we don't have a good bench against Avro.

Let's go with the clippy suggestion here, although it has MaybeIncorrect applicability: https://rust-lang.github.io/rust-clippy/master/index.html#large_enum_variant

Contributor

@comphead comphead left a comment

lgtm, thanks @rroelke for the first contribution. Since this is a first contribution I need a second pair of eyes; @jayzhan211 would you mind helping with the review?

@jayzhan211
Contributor

https://rust-lang.github.io/rust-clippy/master/index.html#large_enum_variant

Based on the description, I think we should box when

  1. the variant is rarely used
  2. it improves the performance

I think this error is not rarely used, so I think it is better that we have numbers showing this helps performance.

@rroelke
Contributor Author

rroelke commented Apr 27, 2025

From the lint description:

Enum size is bounded by the largest variant. Having one large variant can penalize the memory layout of that enum.

That is to say, the presence of the large variant AvroError affects the whole layout of DataFusionError.

Transitively, the presence of the large variant AvroError affects the whole layout of Result<T, DataFusionError>. This affects nearly every function in the DataFusion API.

This related lint pull request elaborates more specifically:

  • A large Err-variant may force an equally large Result if Err is actually bigger than Ok.
  • There is a cost involved in large Result, as LLVM may choose to memcpy them around above a certain size.
  • We usually expect the Err variant to be seldomly used, but pay the cost every time.
  • Result returned from library code has a high chance of bubbling up the call stack, getting stuffed into MyLibError { IoError(std::io::Error), ParseError(parselib::Error), ...}, exacerbating the problem.

As applied here:

  1. every API which returns Result<T, DataFusionError> might pay a large memcpy cost
  2. a return of Err(DataFusionError::AvroError(...)) will bubble up the call stack in nearly all cases, such that (2a) downstream libraries wrapping DataFusionError in their own error types will also suffer this problem, and (2b) the end user request in application code will terminate
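
Point (2a) can be illustrated with a small sketch. All of the names here are hypothetical: `UpstreamInline` and `UpstreamBoxed` stand in for a library error with and without a boxed large variant, and `AppError` stands in for a downstream crate's wrapper. The 1 KiB payload is arbitrary and is not a measurement of AvroError.

```rust
use std::mem::size_of;

// Hypothetical upstream error with a large payload stored inline.
#[allow(dead_code)]
enum UpstreamInline {
    Other(u8),
    Avro([u8; 1024]),
}

// The same upstream error after boxing the large variant.
#[allow(dead_code)]
enum UpstreamBoxed {
    Other(u8),
    Avro(Box<[u8; 1024]>),
}

// Hypothetical downstream error that wraps the upstream error by value,
// generic here only so both upstream shapes can be compared.
#[allow(dead_code)]
enum AppError<E> {
    Io(std::io::Error),
    Upstream(E),
}

// Size of the Result every downstream function would return in each case.
fn downstream_result_size_inline() -> usize {
    size_of::<Result<(), AppError<UpstreamInline>>>()
}

fn downstream_result_size_boxed() -> usize {
    size_of::<Result<(), AppError<UpstreamBoxed>>>()
}

fn main() {
    println!("wrapping inline upstream: {} bytes", downstream_result_size_inline());
    println!("wrapping boxed upstream:  {} bytes", downstream_result_size_boxed());
}
```

The downstream wrapper cannot be smaller than its largest variant, so it inherits the upstream size unless the downstream crate adds its own Box.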

I think this error is not rarely used

Indeed, DataFusionError is used nearly everywhere, which is precisely the point. Whereas the DataFusionError::AvroError is only produced by the avro reader but it affects every place where DataFusionError can appear.

@jayzhan211
Contributor

Whereas the DataFusionError::AvroError is only produced by the avro reader but it affects every place where DataFusionError can appear.

How about we convert the error into a string and wrap it in another DataFusion error, so we can avoid this variant entirely?

@rroelke
Contributor Author

rroelke commented Apr 28, 2025

Whereas the DataFusionError::AvroError is only produced by the avro reader but it affects every place where DataFusionError can appear.

How about we convert the error into a string and wrap it in another DataFusion error, so we can avoid this variant entirely?

IMO we should not convert errors to strings at all; downstream library or application code should not be asked to parse error messages. I would be happy to submit a follow-up PR to add a new variant wrapping Box<dyn Error + 'static> or change Other to have that data instead.
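
To make the contrast with string conversion concrete, here is a minimal sketch of the boxed trait object approach. The `LibError::External` variant, `ReaderError`, and `failing_call` are hypothetical names for illustration and do not describe DataFusion's actual API; the `Send + Sync` bounds are an added assumption. The point is that the caller recovers the original typed error with `downcast_ref` instead of inspecting a message.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical typed error from some reader.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum ReaderError {
    UnexpectedEof,
    BadMagic,
}

impl fmt::Display for ReaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "reader error: {:?}", self)
    }
}

impl Error for ReaderError {}

// Hypothetical library error with a type-erased variant.
// The variant is pointer-sized regardless of what it wraps.
#[derive(Debug)]
enum LibError {
    External(Box<dyn Error + Send + Sync + 'static>),
}

// Hypothetical call that fails with a typed inner error.
fn failing_call() -> Result<(), LibError> {
    Err(LibError::External(Box::new(ReaderError::BadMagic)))
}

// Callers match on the concrete type rather than parsing a message.
fn is_bad_magic(err: &LibError) -> bool {
    match err {
        LibError::External(inner) => matches!(
            inner.downcast_ref::<ReaderError>(),
            Some(ReaderError::BadMagic)
        ),
    }
}

fn main() {
    if let Err(e) = failing_call() {
        println!("bad magic? {}", is_bad_magic(&e));
    }
}
```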

But I don't see a reason that this PR should pivot in that direction. Can you articulate your objection to the current diff more precisely please?

@alamb
Contributor

alamb commented Apr 28, 2025

Thank you @rroelke @comphead and @jayzhan211

@comphead comphead merged commit 2d27ce4 into apache:main Apr 29, 2025
27 checks passed
@comphead
Contributor

Thanks everyone.

Labels
api change: Changes the API exposed to users of the crate
common: Related to common crate
datasource: Changes to the datasource crate
Development

Successfully merging this pull request may close these issues.

chore: Rust lint clippy::large_enum_variant flags all uses of Result<T, DataFusionError> with features = ["avro"]
4 participants