simdutf_connector: in_tail: skip UTF-16/UTF-8 BOM #10328

Open: erikced wants to merge 4 commits into master
@erikced erikced commented May 13, 2025

This MR updates simdutf_connector to reduce the number of copies when converting UTF-16 to UTF-8 and to remove the UTF-16 BOM prior to conversion so that no UTF-8 BOM is present in the converted output. tail_file is also updated to skip any encountered UTF-8 BOM if the unicode conversion returns FLB_UNICODE_CONVERT_NOP.
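
The BOM handling described above can be sketched as a small classifier over the first bytes of a buffer. This is a minimal illustration, not the code from this PR: `detect_bom` and the `bom_kind` values are hypothetical names, and the sketch does not distinguish a UTF-32 LE BOM (FF FE 00 00) from a UTF-16 LE one.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch; not Fluent Bit identifiers. */
enum bom_kind {
    BOM_NONE = 0,
    BOM_UTF8,
    BOM_UTF16_LE,
    BOM_UTF16_BE
};

/*
 * Inspect the first bytes of buf and report which BOM, if any, is present.
 * On return, *skip holds the number of leading bytes that make up the BOM.
 */
static enum bom_kind detect_bom(const unsigned char *buf, size_t len,
                                size_t *skip)
{
    *skip = 0;

    /* UTF-8 BOM: EF BB BF */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *skip = 3;
        return BOM_UTF8;
    }
    /* UTF-16 little-endian BOM: FF FE */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *skip = 2;
        return BOM_UTF16_LE;
    }
    /* UTF-16 big-endian BOM: FE FF */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *skip = 2;
        return BOM_UTF16_BE;
    }
    return BOM_NONE;
}
```

A caller would advance the read pointer by `skip` bytes before converting or emitting the data, so the BOM never reaches the output record.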


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • [N/A] Example configuration file for the change
  • Debug log output from testing the change
Fluent Bit v4.0.2 | NIGHTLY_BUILD=0 - DO NOT USE IN PRODUCTION!
* Copyright (C) 2015-2025 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _             ___  _____ 
|  ___| |                | |   | ___ (_) |           /   ||  _  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __/ /| || |/' |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| ||  /| |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /\___  |\ |_/ /
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/     |_(_)___/ 


[2025/05/13 09:35:34] [ info] Configuration:
[2025/05/13 09:35:34] [ info]  flush time     | 10.000000 seconds
[2025/05/13 09:35:34] [ info]  grace          | 5 seconds
[2025/05/13 09:35:34] [ info]  daemon         | 0
[2025/05/13 09:35:34] [ info] ___________
[2025/05/13 09:35:34] [ info]  inputs:
[2025/05/13 09:35:34] [ info]      tail
[2025/05/13 09:35:34] [ info]      tail
[2025/05/13 09:35:34] [ info] ___________
[2025/05/13 09:35:34] [ info]  filters:
[2025/05/13 09:35:34] [ info] ___________
[2025/05/13 09:35:34] [ info]  outputs:
[2025/05/13 09:35:34] [ info]      stdout.0
[2025/05/13 09:35:34] [ info] ___________
[2025/05/13 09:35:34] [ info]  collectors:
[2025/05/13 09:35:34] [ info] [fluent bit] version=4.0.2, commit=3c8f9f27e3, pid=1
[2025/05/13 09:35:34] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2025/05/13 09:35:34] [ info] [storage] ver=1.5.3, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/05/13 09:35:34] [ info] [simd    ] disabled
[2025/05/13 09:35:34] [ info] [cmetrics] version=1.0.2
[2025/05/13 09:35:34] [ info] [ctraces ] version=0.6.6
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] initializing
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] storage_strategy='memory' (memory only)
[2025/05/13 09:35:34] [debug] [tail:mssql-tail-input] created event channels: read=25 write=26
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] adjusted buf_max_size to 128001
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] adjusted buf_chunk_size to 32769
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] flb_tail_fs_inotify_init() initializing inotify tail input
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] inotify watch fd=31
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] scanning path /tmp/ERRORLOG-LE
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] file will be read in POSIX_FADV_DONTNEED mode /tmp/ERRORLOG-LE
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] inode=3266809 with offset=0 appended as /tmp/ERRORLOG-LE
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] scan_glob add(): /tmp/ERRORLOG-LE, inode 3266809
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] 1 new files found on path '/tmp/ERRORLOG-LE'
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] initializing
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] storage_strategy='memory' (memory only)
[2025/05/13 09:35:34] [debug] [tail:mssql-tail-input] created event channels: read=33 write=34
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] adjusted buf_max_size to 128001
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] adjusted buf_chunk_size to 32769
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] flb_tail_fs_inotify_init() initializing inotify tail input
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] inotify watch fd=39
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] scanning path /tmp/ERRORLOG-BE
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] file will be read in POSIX_FADV_DONTNEED mode /tmp/ERRORLOG-BE
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] inode=4922164 with offset=0 appended as /tmp/ERRORLOG-BE
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] scan_glob add(): /tmp/ERRORLOG-BE, inode 4922164
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] 1 new files found on path '/tmp/ERRORLOG-BE'
[2025/05/13 09:35:34] [debug] [stdout:stdout.0] created event channels: read=41 write=42
[2025/05/13 09:35:34] [debug] [router] match rule tail.0:stdout.0
[2025/05/13 09:35:34] [debug] [router] match rule tail.1:stdout.0
[2025/05/13 09:35:34] [ info] [sp] stream processor started
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] [static files] processed 1.4K
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] [static files] processed 1.4K
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] inode=3266809 file=/tmp/ERRORLOG-LE ended, stop
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] inode=3266809 file=/tmp/ERRORLOG-LE promote to TAIL_EVENT
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] inotify_fs_add(): inode=3266809 watch_fd=1 name=/tmp/ERRORLOG-LE
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] [static files] processed 0b, done
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] inode=4922164 file=/tmp/ERRORLOG-BE ended, stop
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] inode=4922164 file=/tmp/ERRORLOG-BE promote to TAIL_EVENT
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] inotify_fs_add(): inode=4922164 watch_fd=1 name=/tmp/ERRORLOG-BE
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] [static files] processed 0b, done
[2025/05/13 09:35:34] [debug] [task] created task=0x7f480e636e60 id=0 OK
[2025/05/13 09:35:34] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[2025/05/13 09:35:34] [debug] [task] created task=0x7f480e636f00 id=1 OK
[2025/05/13 09:35:34] [debug] [output:stdout:stdout.0] task_id=1 assigned to thread #0
[2025/05/13 09:35:34] [ warn] [engine] service will shutdown in max 5 seconds
[2025/05/13 09:35:34] [debug] [engine] retry=0x5 for task 0 already scheduled to run, not re-scheduling it.
[2025/05/13 09:35:34] [debug] [engine] retry=0x5 for task 1 already scheduled to run, not re-scheduling it.
[2025/05/13 09:35:34] [ info] [input] pausing mssql-tail-input
[2025/05/13 09:35:34] [ info] [input] pausing mssql-tail-input
[2025/05/13 09:35:34] [ warn] [engine] service will shutdown in max 5 seconds
[2025/05/13 09:35:34] [debug] [engine] retry=0x5 for task 0 already scheduled to run, not re-scheduling it.
[2025/05/13 09:35:34] [debug] [engine] retry=0x5 for task 1 already scheduled to run, not re-scheduling it.
[2025/05/13 09:35:34] [ info] [input] pausing mssql-tail-input
[2025/05/13 09:35:34] [ info] [input] pausing mssql-tail-input
[0] mssql.errorlog.le: [[1747128934.238965455, {}], {"filename"=>"/tmp/ERRORLOG-LE", "log"=>"2025-03-29 03:24:27.73 Server      Microsoft SQL Server 2022 (RTM-CU9) (KB5030731) - 16.0.4085.2 (X64) "}]
[1] mssql.errorlog.le: [[1747128934.238969253, {}], {"filename"=>"/tmp/ERRORLOG-LE", "log"=>"	Sep 27 2023 12:05:43 "}]
[2] mssql.errorlog.le: [[1747128934.238970461, {}], {"filename"=>"/tmp/ERRORLOG-LE", "log"=>"	Copyright (C) 2022 Microsoft Corporation"}]
[3] mssql.errorlog.le: [[1747128934.238971867, {}], {"filename"=>"/tmp/ERRORLOG-LE", "log"=>"	Standard Edition (64-bit) on Windows Server 2022 Datacenter 10.0 <X64> (Build 20348: ) (Hypervisor)"}]
[4] mssql.errorlog.le: [[1747128934.238972427, {}], {"filename"=>"/tmp/ERRORLOG-LE", "log"=>""}]
[5] mssql.errorlog.le: [[1747128934.238974464, {}], {"filename"=>"/tmp/ERRORLOG-LE", "log"=>"2025-03-29 03:24:27.73 Server      UTC adjustment: 0:00"}]
[6] mssql.errorlog.le: [[1747128934.238976188, {}], {"filename"=>"/tmp/ERRORLOG-LE", "log"=>"2025-03-29 03:24:27.73 Server      (c) Microsoft Corporation."}]
[7] mssql.errorlog.le: [[1747128934.238977793, {}], {"filename"=>"/tmp/ERRORLOG-LE", "log"=>"2025-03-29 03:24:27.73 Server      All rights reserved."}]
[8] mssql.errorlog.le: [[1747128934.238979542, {}], {"filename"=>"/tmp/ERRORLOG-LE", "log"=>"2025-03-29 03:24:27.73 Server      Server process ID is 2948."}]
[9] mssql.errorlog.le: [[1747128934.238982076, {}], {"filename"=>"/tmp/ERRORLOG-LE", "log"=>"2025-03-29 03:24:27.74 Server      System Manufacturer: 'Microsoft Corporation', System Model: 'Virtual Machine'."}]
[10] mssql.errorlog.le: [[1747128934.238983865, {}], {"filename"=>"/tmp/ERRORLOG-LE", "log"=>"2025-03-29 03:24:27.74 Server      Authentication mode is MIXED."}]
[0] mssql.errorlog.be: [[1747128934.239409586, {}], {"filename"=>"/tmp/ERRORLOG-BE", "log"=>"2025-03-29 03:24:27.73 Server      Microsoft SQL Server 2022 (RTM-CU9) (KB5030731) - 16.0.4085.2 (X64) "}]
[1] mssql.errorlog.be: [[1747128934.239412160, {}], {"filename"=>"/tmp/ERRORLOG-BE", "log"=>"	Sep 27 2023 12:05:43 "}]
[2] mssql.errorlog.be: [[1747128934.239413323, {}], {"filename"=>"/tmp/ERRORLOG-BE", "log"=>"	Copyright (C) 2022 Microsoft Corporation"}]
[3] mssql.errorlog.be: [[1747128934.239414661, {}], {"filename"=>"/tmp/ERRORLOG-BE", "log"=>"	Standard Edition (64-bit) on Windows Server 2022 Datacenter 10.0 <X64> (Build 20348: ) (Hypervisor)"}]
[4] mssql.errorlog.be: [[1747128934.239415262, {}], {"filename"=>"/tmp/ERRORLOG-BE", "log"=>""}]
[5] mssql.errorlog.be: [[1747128934.239417324, {}], {"filename"=>"/tmp/ERRORLOG-BE", "log"=>"2025-03-29 03:24:27.73 Server      UTC adjustment: 0:00"}]
[6] mssql.errorlog.be: [[1747128934.239419050, {}], {"filename"=>"/tmp/ERRORLOG-BE", "log"=>"2025-03-29 03:24:27.73 Server      (c) Microsoft Corporation."}]
[7] mssql.errorlog.be: [[1747128934.239420616, {}], {"filename"=>"/tmp/ERRORLOG-BE", "log"=>"2025-03-29 03:24:27.73 Server      All rights reserved."}]
[8] mssql.errorlog.be: [[1747128934.239422207, {}], {"filename"=>"/tmp/ERRORLOG-BE", "log"=>"2025-03-29 03:24:27.73 Server      Server process ID is 2948."}]
[9] mssql.errorlog.be: [[1747128934.239424601, {}], {"filename"=>"/tmp/ERRORLOG-BE", "log"=>"2025-03-29 03:24:27.74 Server      System Manufacturer: 'Microsoft Corporation', System Model: 'Virtual Machine'."}]
[10] mssql.errorlog.be: [[1747128934.239426338, {}], {"filename"=>"/tmp/ERRORLOG-BE", "log"=>"2025-03-29 03:24:27.74 Server      Authentication mode is MIXED."}]
[2025/05/13 09:35:34] [ info] [output:stdout:stdout.0] worker #0 started
[2025/05/13 09:35:34] [debug] [out flush] cb_destroy coro_id=0
[2025/05/13 09:35:34] [debug] [out flush] cb_destroy coro_id=1
[2025/05/13 09:35:34] [debug] [task] destroy task=0x7f480e636e60 (task_id=0)
[2025/05/13 09:35:34] [debug] [task] destroy task=0x7f480e636f00 (task_id=1)
[2025/05/13 09:35:34] [ info] [engine] service has stopped (0 pending tasks)
[2025/05/13 09:35:34] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2025/05/13 09:35:34] [ info] [input] pausing mssql-tail-input
[2025/05/13 09:35:34] [ info] [output:stdout:stdout.0] thread worker #0 stopped
[2025/05/13 09:35:34] [ info] [input] pausing mssql-tail-input
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] inode=4922164 removing file name /tmp/ERRORLOG-BE
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] inotify_fs_remove(): inode=4922164 watch_fd=1
[2025/05/13 09:35:34] [debug] [input:tail:mssql-tail-input] inode=3266809 removing file name /tmp/ERRORLOG-LE
[2025/05/13 09:35:34] [ info] [input:tail:mssql-tail-input] inotify_fs_remove(): inode=3266809 watch_fd=1
  • Attached Valgrind output that shows no leaks or memory corruption was found
==1==
==1== HEAP SUMMARY:
==1==     in use at exit: 0 bytes in 0 blocks
==1==   total heap usage: 4,618 allocs, 4,618 frees, 1,749,754 bytes allocated
==1==
==1== All heap blocks were freed -- no leaks are possible
==1==
==1== For lists of detected and suppressed errors, rerun with: -s
==1== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [N/A] Documentation required for this feature

Backporting

  • [N/A] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@leonardo-albertovich
Collaborator

Could you please take a look at this @cosmo0920? There are a few coding style issues, but I'm more interested in validating the actual UTF stuff.

Contributor

@cosmo0920 cosmo0920 left a comment

I tested the patch on my dev box and successfully converted both UTF-16-LE and UTF-16-BE to UTF-8. I originally used std::unique_ptr<char[]> in the simdutf module so that allocation and deallocation happen automatically; switching to flb_malloc style was not something I had considered, TBH. This should be fine, and there are no memory leaks.

@cosmo0920
Contributor

Ah, jemalloc headers are not found in CI tasks. Need to investigate.

@cosmo0920
Contributor

Could you please take a look at this @cosmo0920? There are a few coding style issues, but I'm more interested in validating the actual UTF stuff.

This change looks fine to me. Could you go ahead and point out the minor issues on your side?

Contributor

@cosmo0920 cosmo0920 left a comment

One thing: could you add the following lines at the end of the file here?

if(FLB_JEMALLOC)
  target_link_libraries(flb-simdutf-connector-static ${JEMALLOC_LIBRARIES})
endif()

It seems the jemalloc-related error could be caused by a missing jemalloc dependency, so we have to mark jemalloc as one of the dependencies of simdutf-connector.

@erikced
Author

erikced commented May 13, 2025

One thing: could you add the following lines at the end of the file here?

if(FLB_JEMALLOC)
  target_link_libraries(flb-simdutf-connector-static ${JEMALLOC_LIBRARIES})
endif()

It seems the jemalloc-related error could be caused by a missing jemalloc dependency, so we have to mark jemalloc as one of the dependencies of simdutf-connector.

Done. Thanks for the swift feedback and the help with deciphering the build errors.

@cosmo0920
Contributor

cosmo0920 commented May 14, 2025

I identified the weird compilation errors on Windows here:
monkey/monkey#423
This could be caused by an old implementation that was correct at the time it was written, so we need to fix it in the monkey repo first.

Collaborator

@leonardo-albertovich leonardo-albertovich left a comment

I've left some change requests. Please check all of the code for the issues I've pointed out; I tried not to add one note per instance, but I noticed multiple occurrences of some of them (such as the missing exception handling).

@@ -471,6 +471,12 @@ static int process_content(struct flb_tail_file *file, size_t *bytes)
            }
            else if (ret == FLB_UNICODE_CONVERT_NOP) {
                flb_plg_debug(ctx->ins, "nothing to convert encoding '%.*s'", end - data, data);
                /* Skip the UTF-8 BOM */
                if ((end - data) >= 3 && (data[0] & 0xFF) == 0xEF && (data[1] & 0xFF) == 0xBB
Collaborator

I think in this branch of the conditional the buffer has not changed and thus file->buf_len is still valid, which means the conditional should be written as:

            if (file->buf_len >= 3 && 
                (data[0] & 0xFF) == 0xEF && 
                (data[1] & 0xFF) == 0xBB && 
                (data[2] & 0xFF) == 0xBF) {

Additionally, is there any reason for us not to define data as unsigned char * so we can simplify this?

Author

@erikced erikced May 14, 2025

That seems accurate; I just took the expression from the line above. Changing the type is less straightforward, since data originally comes from the flb_tail_file struct. In this case I think an easier solution would be to use char constants instead, e.g. '\xFF', which makes the integer promotion behave the same way for both the lhs and rhs of the comparison.
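
The options discussed in this thread can be compared side by side. The sketch below is illustrative only and uses hypothetical function names: the first variant masks each byte as in the diff, the second compares against char constants as suggested in the reply, and the third uses `memcmp`, which compares bytes as `unsigned char` regardless of whether plain `char` is signed.

```c
#include <stddef.h>
#include <string.h>

/* Variant A: mask each byte so a signed char cannot sign-extend. */
static int has_utf8_bom_masked(const char *data, size_t len)
{
    return len >= 3 &&
           (data[0] & 0xFF) == 0xEF &&
           (data[1] & 0xFF) == 0xBB &&
           (data[2] & 0xFF) == 0xBF;
}

/*
 * Variant B: compare against char constants. Both sides go through the
 * same char-to-int conversion, so the result does not depend on the
 * signedness of plain char.
 */
static int has_utf8_bom_char_constants(const char *data, size_t len)
{
    return len >= 3 &&
           data[0] == '\xEF' &&
           data[1] == '\xBB' &&
           data[2] == '\xBF';
}

/* Variant C: memcmp interprets each byte as unsigned char. */
static int has_utf8_bom_memcmp(const char *data, size_t len)
{
    return len >= 3 && memcmp(data, "\xEF\xBB\xBF", 3) == 0;
}
```

All three return the same result for the same input; the difference is only in how explicit the handling of `char` signedness is. A bare `data[0] == 0xEF` on a `char *` would be the problematic form, since it is never true on platforms where `char` is signed.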

    result = simdutf::validate_utf8_with_errors(output.get(), clen);
    if (result.error == simdutf::error_code::SUCCESS && converted > 0) {
        std::string result_string(output.get(), clen);
        *utf8_output = (char*)flb_malloc(clen + 1);
Collaborator

Please check the coding style guide for this and other issues.

In this case, the missing space in the data type and after the closing parenthesis of the cast.

        aligned_input = (const char16_t *)input;
    }
    else {
        str16.resize(len / 2);
Collaborator

According to the C++ reference we are missing some exception handling here.

Author

Check. It might be a similar amount of work (and more consistent with the rest of the C codebase) to just use flb_malloc here as well, since the simdutf functions used are noexcept-labelled.
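
A malloc-style version of the "copy only when unaligned" idea might look like the sketch below. It is an assumption-laden illustration rather than the PR's code: `utf16_view` is a hypothetical helper, plain `malloc` stands in for `flb_malloc`, and `uint16_t` stands in for `char16_t`. Failure is reported through a NULL return instead of an exception.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a pointer to len bytes of UTF-16 data that is safe to read as
 * uint16_t. If input is already suitably aligned it is returned as-is and
 * *owned is set to NULL; otherwise the bytes are copied into a malloc'ed
 * buffer that the caller must release with free(*owned).
 * Returns NULL only if the allocation fails.
 */
static const uint16_t *utf16_view(const char *input, size_t len, void **owned)
{
    void *copy;

    *owned = NULL;

    /* Already aligned: no copy needed. */
    if (((uintptr_t) input % sizeof(uint16_t)) == 0) {
        return (const uint16_t *) input;
    }

    /* Unaligned: malloc returns storage aligned for any object type. */
    copy = malloc(len > 0 ? len : 1);
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, input, len);
    *owned = copy;
    return (const uint16_t *) copy;
}
```

The caller checks the return value for NULL, runs the conversion on the returned pointer, and then calls `free(owned)`, which is a no-op when no copy was made.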

erikced added 4 commits May 15, 2025 12:57
- Do not copy input if data is already aligned.
- Only allocate output once.

Signed-off-by: Erik Cederberg <[email protected]>
When converting UTF-16 to UTF-8, ignore the BOM so that no UTF-8 BOM is
written to the output.

Signed-off-by: Erik Cederberg <[email protected]>
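
To make the first two commit messages concrete, here is a hand-written sketch of a UTF-16 to UTF-8 conversion that allocates the output once and drops a leading BOM. It is not the simdutf-based code from this PR: `utf16_to_utf8` is a hypothetical function, the input is assumed to be in host byte order already, and plain `malloc` stands in for `flb_malloc`. The single allocation relies on the fact that one UTF-16 code unit never produces more than 3 UTF-8 bytes.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Convert `units` UTF-16 code units (host byte order) into a NUL-terminated
 * UTF-8 string stored in one malloc'ed buffer. A leading BOM is skipped.
 * Returns NULL on allocation failure or on an unpaired surrogate.
 */
static char *utf16_to_utf8(const uint16_t *in, size_t units, size_t *out_len)
{
    size_t i = 0;
    size_t o = 0;
    uint32_t cp;
    unsigned char *out;

    /* Drop a leading BOM (U+FEFF) so it is not re-encoded as EF BB BF. */
    if (units > 0 && in[0] == 0xFEFF) {
        i = 1;
    }

    /* Worst case: 3 UTF-8 bytes per UTF-16 code unit, plus the NUL. */
    out = malloc(units * 3 + 1);
    if (out == NULL) {
        return NULL;
    }

    while (i < units) {
        cp = in[i++];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            if (i >= units || in[i] < 0xDC00 || in[i] > 0xDFFF) {
                free(out);
                return NULL;
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i++] - 0xDC00);
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Lone low surrogate. */
            free(out);
            return NULL;
        }

        if (cp < 0x80) {
            out[o++] = (unsigned char) cp;
        }
        else if (cp < 0x800) {
            out[o++] = (unsigned char) (0xC0 | (cp >> 6));
            out[o++] = (unsigned char) (0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000) {
            out[o++] = (unsigned char) (0xE0 | (cp >> 12));
            out[o++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char) (0x80 | (cp & 0x3F));
        }
        else {
            out[o++] = (unsigned char) (0xF0 | (cp >> 18));
            out[o++] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char) (0x80 | (cp & 0x3F));
        }
    }

    out[o] = '\0';
    *out_len = o;
    return (char *) out;
}
```

Sizing for the worst case trades some memory for a single allocation; a production version would also guard `units * 3 + 1` against overflow for very large inputs.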
If unicode input data is not converted, check if there is a UTF-8 BOM
present and skip it.

Signed-off-by: Erik Cederberg <[email protected]>