@jhorstmann
Contributor

@jhorstmann jhorstmann commented Oct 16, 2025

Which issue does this PR close?

This is a small performance improvement for the thrift remodeling

Rationale for this change

Some of the frequently called methods in the thrift protocol implementation created ParquetError instances with a string message that had to be allocated and formatted. This formatting code, and probably also some drop glue, bloated these otherwise small methods and prevented inlining.

What changes are included in this PR?

Introduce a separate error type ThriftProtocolError that is smaller than ParquetError and does not contain any allocated data. The ReadThrift trait is not changed, since its custom implementations actually require the more expressive ParquetError.
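
As a rough illustration of the idea (not the actual definitions from this PR), the sketch below pairs a small `Copy` error enum for the hot path with a richer, string-carrying error. The names `ProtocolError`, `RichError`, `read_bool` and `read_flag` are hypothetical stand-ins, and the boolean byte values are only illustrative. Formatting and allocation happen solely in the `From` conversion, so the hot-path method builds a plain enum value on failure.

```rust
use std::fmt;

/// Hypothetical stand-in for a rich error type that owns a formatted message.
#[derive(Debug)]
pub enum RichError {
    General(String),
}

/// Hypothetical small protocol error: no heap data, `Copy`, no drop glue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProtocolError {
    Eof,
    InvalidBoolean(u8),
}

impl fmt::Display for ProtocolError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ProtocolError::Eof => write!(f, "unexpected end of input"),
            ProtocolError::InvalidBoolean(b) => write!(f, "invalid boolean byte {b}"),
        }
    }
}

impl From<ProtocolError> for RichError {
    // Formatting and allocation happen only here, off the hot path.
    fn from(e: ProtocolError) -> Self {
        RichError::General(e.to_string())
    }
}

/// Hot-path method: on failure it only constructs a small enum value.
#[inline]
pub fn read_bool(buf: &mut &[u8]) -> Result<bool, ProtocolError> {
    let data: &[u8] = *buf;
    let (&byte, rest) = data.split_first().ok_or(ProtocolError::Eof)?;
    *buf = rest;
    match byte {
        1 => Ok(true),
        2 => Ok(false),
        other => Err(ProtocolError::InvalidBoolean(other)),
    }
}

/// Caller that needs the richer error; `?` converts via `From`.
pub fn read_flag(mut buf: &[u8]) -> Result<bool, RichError> {
    Ok(read_bool(&mut buf)?)
}
```

Callers that return the richer error keep using `?` unchanged, which is one way a trait returning the larger error type could stay as it is while the low-level readers use the small one.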

Are these changes tested?

The success path is covered by existing tests. Testing the error paths would require crafting some actually malformed files, or using a fuzzer.
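
One cheap way to exercise some error paths without a fuzzer is to truncate a known-good encoding at every position and check that decoding fails cleanly. The sketch below does this for an unsigned LEB128 varint, the integer encoding used by the thrift compact protocol. `read_vlq`, `VarintError` and `all_prefixes_fail` are hypothetical names for illustration, not functions from the parquet crate.

```rust
/// Hypothetical small error type for the varint decoder below.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VarintError {
    Eof,
    Overflow,
}

/// Decodes an unsigned LEB128 varint and returns the value together with
/// the number of bytes consumed.
pub fn read_vlq(buf: &[u8]) -> Result<(u64, usize), VarintError> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate() {
        // A u64 needs at most 10 groups of 7 bits, and the 10th group
        // may only carry a single bit.
        if i == 9 && byte > 1 {
            return Err(VarintError::Overflow);
        }
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Ok((value, i + 1));
        }
    }
    // Ran out of input while the continuation bit was still set.
    Err(VarintError::Eof)
}

/// Returns true if every strict prefix of a valid single-varint encoding
/// fails with `Eof`, i.e. truncation is always detected.
pub fn all_prefixes_fail(encoded: &[u8]) -> bool {
    (0..encoded.len()).all(|n| read_vlq(&encoded[..n]) == Err(VarintError::Eof))
}
```

The same truncation loop could in principle be applied to a whole serialized footer, although that only covers end-of-input errors and not other kinds of malformed data.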

Are there any user-facing changes?

The ThriftProtocolError is crate-internal so there should be no api changes. Some error messages might differ slightly.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 16, 2025
@jhorstmann jhorstmann force-pushed the avoid-string-formatting-for-thrift-protocol-errors branch from e05f127 to 29083b7 Compare October 16, 2025 18:54
@jhorstmann jhorstmann force-pushed the avoid-string-formatting-for-thrift-protocol-errors branch from 29083b7 to ffd4191 Compare October 16, 2025 18:57
@jhorstmann
Contributor Author

@alamb @etseidl I might have found another small performance improvement in the new thrift code :)

Need to fix the clippy issues and format the error messages a bit nicer before this can be merged.

@etseidl
Contributor

etseidl commented Oct 16, 2025

Thanks @jhorstmann! It's amazing how much time is spent in innocuous snippets of code. I've found that the skipping code spends a significant amount of time in the FieldType impl of PartialEq, and I have a branch where I delay conversion to FieldType until all matching is complete.
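
For readers unfamiliar with the idea, here is a rough sketch of what matching on the raw type byte before any enum conversion could look like. It is not taken from that branch: `skip_scalar` and `SkipError` are hypothetical, only scalar types are handled, and the type ids follow the thrift compact protocol specification (1 and 2 for booleans, 3 for i8, 4 to 6 for varint-encoded integers, 7 for double, 8 to 12 for binary and containers).

```rust
/// Hypothetical error type for the skipping sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SkipError {
    Eof,
    InvalidFieldType(u8),
    Unsupported(u8),
}

/// Length of the varint at the start of `buf`.
fn varint_len(buf: &[u8]) -> Result<usize, SkipError> {
    buf.iter()
        .position(|b| b & 0x80 == 0)
        .map(|p| p + 1)
        .ok_or(SkipError::Eof)
}

/// Checks that at least `n` bytes are available.
fn need(buf: &[u8], n: usize) -> Result<usize, SkipError> {
    if buf.len() >= n {
        Ok(n)
    } else {
        Err(SkipError::Eof)
    }
}

/// Returns how many bytes a scalar value of the given raw compact-protocol
/// type id occupies, matching on the integer directly instead of first
/// converting it to an enum and comparing enum values.
pub fn skip_scalar(raw_type: u8, buf: &[u8]) -> Result<usize, SkipError> {
    match raw_type {
        1 | 2 => Ok(0),               // booleans live in the field header
        3 => need(buf, 1),            // i8
        4 | 5 | 6 => varint_len(buf), // i16, i32, i64 as zigzag varints
        7 => need(buf, 8),            // double
        // Binary, list, set, map and struct are left out of this sketch.
        8..=12 => Err(SkipError::Unsupported(raw_type)),
        other => Err(SkipError::InvalidFieldType(other)),
    }
}
```

A match on integer literals gives the compiler a simple dispatch on one byte, and the conversion to a typed enum can wait until a caller actually needs it.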

@etseidl
Contributor

etseidl commented Oct 16, 2025

group                             pq_err                                 thrift_err
-----                             ------                                 ----------
decode parquet metadata           1.14      5.6±0.06µs        ? ?/sec    1.00      4.9±0.03µs        ? ?/sec
decode parquet metadata (wide)    1.32     23.0±0.17ms        ? ?/sec    1.00     17.5±0.20ms        ? ?/sec
open(default)                     1.13      5.8±0.05µs        ? ?/sec    1.00      5.2±0.03µs        ? ?/sec
open(page index)                  1.02    104.0±1.02µs        ? ?/sec    1.00    102.5±0.47µs        ? ?/sec

> This is a small performance improvement

🤣 🤣 🤣

I think Andrew may have to redo the images for the blog post again 😂

@jhorstmann jhorstmann marked this pull request as ready for review October 16, 2025 22:27
Contributor

@etseidl etseidl left a comment

Thanks @jhorstmann, this is great.

@alamb
Contributor

alamb commented Oct 17, 2025

> I think Andrew may have to redo the images for the blog post again 😂

That is ok

There is also a report that the C++ thrift generated code is faster than this parser -- https://lists.apache.org/thread/skr7f2tf94q59cx390cq2sw8f1nps675

I haven't been able to reproduce that result yet

Contributor

@alamb alamb left a comment

Love it. It is a fascinating observation that constructing errors with strings (even when they aren't constructed) can slow our code down

I wonder how many other codepaths in arrow/parquet have the same property 🤔

@alamb alamb merged commit d49f017 into apache:main Oct 17, 2025
17 checks passed
@etseidl
Contributor

etseidl commented Oct 17, 2025

> There is also a report that the C++ thrift generated code is faster than this parser -- https://lists.apache.org/thread/skr7f2tf94q59cx390cq2sw8f1nps675
>
> I haven't been able to reproduce that result yet

🤔

So I did a quick sanity check on my workstation.

  // 'buf' contains bytes for footer
  for (int i=0; i < 1000; i++) {
    std::shared_ptr<TMemoryBuffer> strBuf(new TMemoryBuffer(buf, ender.footer_len));
    TCompactProtocol proto{strBuf};
    parquet::format::FileMetaData fmd;
    fmd.read(&proto);
  }

vs

  // 'meta_data' contains bytes for footer
  for _ in 0..1000 {
    ParquetMetaDataReader::decode_metadata(&meta_data).unwrap();
  }

c++ time: 64.004u 8.960s 1:13.06 99.8% 0+0k 0+0io 0pf+0w
rust time: 26.714u 0.019s 0:26.77 99.8% 0+0k 0+0io 0pf+0w

This is with thrift-cpp 0.23

@jhorstmann
Contributor Author

> Love it. It is a fascinating observation that constructing errors with strings (even when they aren't constructed) can slow our code down

It is, and I also don't really know how much of that is because of removing the error message formatting code, vs the smaller size of the error type. The former might just have influenced some heuristics around inlining.

I could imagine that C++ is actually smarter about eliminating redundant moves of such error structs.

> I wonder how many other codepaths in arrow/parquet have the same property 🤔

I think you'd need a really high ratio of error handling to "real" compute for that to have a measurable effect. Thrift parsing unfortunately needs error handling for nearly every byte; that shouldn't be the case for other places in the code base.

Hmm, maybe the checked arithmetic kernels would actually also benefit from a separate error type.
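
A minimal sketch of how that could look, assuming a hypothetical element-wise checked add (this is not the arrow kernel API, and no speedup has been measured for it): the inner loop reports only the failing index through a small `Copy` error, and the message is formatted once at the boundary.

```rust
/// Hypothetical small error: just the index where the overflow occurred.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct OverflowAt(pub usize);

/// Inner loop: the error path only constructs a `usize` wrapper.
fn add_inner(a: &[i32], b: &[i32], out: &mut Vec<i32>) -> Result<(), OverflowAt> {
    for (i, (&x, &y)) in a.iter().zip(b).enumerate() {
        out.push(x.checked_add(y).ok_or(OverflowAt(i))?);
    }
    Ok(())
}

/// Boundary function: formats the message once, outside the loop.
/// Elements beyond the shorter slice are ignored in this sketch.
pub fn add_checked_slices(a: &[i32], b: &[i32]) -> Result<Vec<i32>, String> {
    let mut out = Vec::with_capacity(a.len().min(b.len()));
    add_inner(a, b, &mut out).map_err(|OverflowAt(i)| {
        format!("overflow adding {} + {} at index {i}", a[i], b[i])
    })?;
    Ok(out)
}
```

Whether this helps in practice would depend on how much of the kernel's time goes to the error check compared to the arithmetic itself, as noted above.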

@alamb
Contributor

alamb commented Oct 17, 2025

> So I did a quick sanity check on my workstation.

Thank you. I will sleep better tonight.

I wasn't able to find any reproduction instructions for their results, but once they provide them I plan to profile it thoroughly to see what is going on.

samueleresca pushed a commit to samueleresca/arrow-rs that referenced this pull request Oct 18, 2025
…trings for error messages (apache#8636)

# Which issue does this PR close?

This is a small performance improvement for the thrift remodeling

- Part of apache#5853.
