-
Notifications
You must be signed in to change notification settings - Fork 1.7k
simdutf_connector: in_tail: skip UTF-16/UTF-8 BOM #10328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Could you please take a look at this @cosmo0920? There are a few coding style issues but I'm more interested validating in the actual UTF stuff. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested the patch in my dev box and I got succeeded to convert from UTF-16-LE and UTF-16-BE to UTF-8. And I did use std::unique_ptr<char[]>
to proceed allocate/deallocate automatically in the simdutf module. Changed from it to flb_malloc style was not considered TBH. This could be fine and there's nothing memory leaks.
Ah, jemalloc headers are not found in CI tasks. Need to investigate. |
It's fine to me for this change. Could you proceed to point out minor issues on your side? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing, could you add the following lines into end of the file here?
if(FLB_JEMALLOC)
target_link_libraries(flb-simdutf-connector-static ${JEMALLOC_LIBRARIES})
endif()
It seems jemalloc related error could be caused by missing dependency of jemalloc. So, we have to mark jemalloc as one of the dependencies of simdutf-connector.
Done. Thanks for swift feedback and help with deciphering the build errors. |
I identified the weird compilation errors on Windows here: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left some change requests, please check all of the code for those issues I've pointed out, I tried not to add one note per incidence but noticed multiple occurrences of some of them (such as the missing exception handling).
plugins/in_tail/tail_file.c
Outdated
@@ -471,6 +471,12 @@ static int process_content(struct flb_tail_file *file, size_t *bytes) | |||
} | |||
else if (ret == FLB_UNICODE_CONVERT_NOP) { | |||
flb_plg_debug(ctx->ins, "nothing to convert encoding '%.*s'", end - data, data); | |||
/* Skip the UTF-8 BOM */ | |||
if ((end - data) >= 3 && (data[0] & 0xFF) == 0xEF && (data[1] & 0xFF) == 0xBB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this branch of the conditional the buffer has not changed and thus file->buf_len
is still valid which means the conditional should be written as :
if (file->buf_len >= 3 &&
(data[0] & 0xFF) == 0xEF &&
(data[1] & 0xFF) == 0xBB &&
(data[2] & 0xFF) == 0xBF) {
Additionally, is there any reason for us not to define data
as unsigned char *
so we can simplify this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems accurate, I just took the expression from the line above. Changing it where, data
originally comes from the flb_tail_file
struct. In this case I think an easer solution would be to use char constants instead, e.g. '\xFF'
which makes the integer promotion behave the same way for both the lhs and rhs of the comparison.
result = simdutf::validate_utf8_with_errors(output.get(), clen); | ||
if (result.error == simdutf::error_code::SUCCESS && converted > 0) { | ||
std::string result_string(output.get(), clen); | ||
*utf8_output = (char*)flb_malloc(clen + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the coding style guide for this and other issues.
In this case the missing spaces in the data type type and after the closing parenthesis of the cast.
aligned_input = (const char16_t *)input; | ||
} | ||
else { | ||
str16.resize(len / 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the C++ reference we are missing some exception handling here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check, it might be be a similar amount of work (and more consistent with the rest of the C codebase) to just use flb_malloc
here as well, since the simdutf
functions used are noexcept
-labelled.
- Do not copy input if data is already aligned. - Only allocate output once. Signed-off-by: Erik Cederberg <[email protected]>
When converting UTF-16 to UTF-8, ingore the BOM so that no UTF-8 BOM is written to the output. Signed-off-by: Erik Cederberg <[email protected]>
If unicode input data is not converted, check if there is a UTF-8 BOM present and skip it. Signed-off-by: Erik Cederberg <[email protected]>
Signed-off-by: Erik Cederberg <[email protected]>
This MR updates
simdutf_connector
to reduce the number of copies when converting UTF-16 to UTF-8 and to remove the UTF-16 BOM prior to conversion so that no UTF-8 BOM is present in the converted output.tail_file
is also updated to skip any encountered UTF-8 BOM if the unicode conversion returnsFLB_UNICODE_CONVERT_NOP
.Enter
[N/A]
in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-test
label to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.