-
Couldn't load subscription status.
- Fork 1k
Introduce a ThriftProtocolError to avoid allocating and formattings strings for error messages #8636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce a ThriftProtocolError to avoid allocating and formattings strings for error messages #8636
Conversation
e05f127 to
29083b7
Compare
…trings for error messages in hot methods
29083b7 to
ffd4191
Compare
|
Thanks @jhorstmann! It's amazing how much time is spent on innocuous snippets of code. I've found that the skipping code spends a significant amount of time in the |
🤣 🤣 🤣 I think Andrew may have to redo the images for the blog post again 😂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jhorstmann, this is great.
That is ok There is also a report that the C++ thrift generated code is faster than this parser -- https://lists.apache.org/thread/skr7f2tf94q59cx390cq2sw8f1nps675 I haven't been able to reproduce that result yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it. It is a fascinating observation that constructing errors with strings (even when they aren't constructed) can slow our code down
I wonder how many other codepaths in arrow/parquet have the same property 🤔
🤔 So I did a quick sanity check on my workstation. // 'buf' contains bytes for footer
for (int i=0; i < 1000; i++) {
std::shared_ptr<TMemoryBuffer> strBuf(new TMemoryBuffer(buf, ender.footer_len));
TCompactProtocol proto{strBuf};
parquet::format::FileMetaData fmd;
fmd.read(&proto);
}vs // 'meta_data' contains bytes for footer
for _ in 0..1000 {
ParquetMetaDataReader::decode_metadata(&meta_data).unwrap();
}c++ time: 64.004u 8.960s 1:13.06 99.8% 0+0k 0+0io 0pf+0w This is with thrift-cpp 0.23 |
It is, and I also don't really know how much of that is because of removing the error message formatting code, vs the smaller size of the error type. The former might just have influenced some heuristics around inlining. I could imagine that C++ is actually smarter about eliminating redundant moves of such error structs.
I think you'd need a really high ratio of error handling vs "real" compute for that to have a measurable effect. Thrift parsing unfortunately needs error handling for nearly every byte, that shouldn't the case for other places in the code base. Hmm, maybe the checked arithmetic kernels would actually also benefit from a separate error type. |
Thank you. I will sleep better tonight. I wasn't able to find any reproduction instructions for their results but when they are able to provide them I plan to give it a good profile to see what is going on |
…trings for error messages (apache#8636) # Which issue does this PR close? This is a small performance improvement for the thrift remodeling - Part of apache#5853. # Rationale for this change Some of the often-called methods in the thrift protocol implementation created `ParquetError` instances with a string message that had to be allocated and formatted. This formatting code and probably also some drop glue bloats these otherwise small methods and prevented inlining. # What changes are included in this PR? Introduce a separate error type `ThriftProtocolError` that is smaller than `ParquetError` and does not contain any allocated data. The `ReadThrift` trait is not changed, since its custom implementations actually require the more expressive `ParquetError`. # Are these changes tested? The success path is covered by existing tests. Testing the error paths would require crafting some actually malformed files, or using a fuzzer. # Are there any user-facing changes? The `ThriftProtocolError` is crate-internal so there should be no api changes. Some error messages might differ slightly.
Which issue does this PR close?
This is a small performance improvement for the thrift remodeling
Rationale for this change
Some of the often-called methods in the thrift protocol implementation created
ParquetErrorinstances with a string message that had to be allocated and formatted. This formatting code and probably also some drop glue bloats these otherwise small methods and prevented inlining.What changes are included in this PR?
Introduce a separate error type
ThriftProtocolErrorthat is smaller thanParquetErrorand does not contain any allocated data. TheReadThrifttrait is not changed, since its custom implementations actually require the more expressiveParquetError.Are these changes tested?
The success path is covered by existing tests. Testing the error paths would require crafting some actually malformed files, or using a fuzzer.
Are there any user-facing changes?
The
ThriftProtocolErroris crate-internal so there should be no api changes. Some error messages might differ slightly.