@cosmo0920 cosmo0920 commented Oct 7, 2024

On Windows, many programs write UTF-16LE. This is because "Unicode" on Windows typically means UTF-16LE with a BOM (Byte Order Mark).
In addition, there are many differences between UTF-16LE/UTF-16BE and UTF-8.
I added unit tests for the in_tail plugin that convert UTF-16LE/UTF-16BE to UTF-8, covering C (Chinese), J (Japanese), and subdivision-flag test cases. in_tail is covered because it is the main place where non-UTF-8 encodings are processed.
As a first step, we need to process the UTF-16LE and UTF-16BE encodings.
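
For readers unfamiliar with the conversion described above, here is a minimal sketch in Python of BOM-based byte-order detection followed by transcoding to UTF-8. It is only an illustration of the idea, not the simdutf-based C/C++ code in this PR; the function name `utf16_to_utf8` and the little-endian fallback are assumptions made for the example.

```python
import codecs


def utf16_to_utf8(raw: bytes, default: str = "utf-16-le") -> bytes:
    """Transcode UTF-16 bytes to UTF-8.

    The BOM, if present, selects the byte order and is dropped from the
    output. Without a BOM we fall back to `default` (an assumption made
    for this sketch, not behavior taken from the PR).
    """
    if raw.startswith(codecs.BOM_UTF16_LE):      # FF FE
        text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    elif raw.startswith(codecs.BOM_UTF16_BE):    # FE FF
        text = raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    else:
        text = raw.decode(default)
    return text.encode("utf-8")


# The same text, serialized in both byte orders with a leading BOM.
sample = "ログ😀"
le = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
be = codecs.BOM_UTF16_BE + sample.encode("utf-16-be")

utf8_from_le = utf16_to_utf8(le)
utf8_from_be = utf16_to_utf8(be)
```

Both inputs yield the same UTF-8 bytes even though their UTF-16 byte sequences differ. The emoji also shows one of the differences between the encodings: it takes a surrogate pair (4 bytes) in UTF-16 and a single 4-byte sequence in UTF-8, while each kana takes 2 bytes in UTF-16 and 3 bytes in UTF-8.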

Note that the simdutf library is written in C++, so we also provide an option (FLB_UNICODE_ENCODER) to turn this feature on or off.

Closes #9321


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
[SERVICE]
   flush           1
   log_level       trace

[INPUT]
   Name              tail
   Path              <path/to/non-UTF-8_encoded_file.log>
   Read_from_Head    True
   Unicode.Encoding  auto

[OUTPUT]
   Name  stdout
   Match *
  • Debug log output from testing the change
Fluent Bit v4.0.0
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

______ _                  _    ______ _ _             ___  _____ 
|  ___| |                | |   | ___ (_) |           /   ||  _  |
| |_  | |_   _  ___ _ __ | |_  | |_/ /_| |_  __   __/ /| || |/' |
|  _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / /_| ||  /| |
| |   | | |_| |  __/ | | | |_  | |_/ / | |_   \ V /\___  |\ |_/ /
\_|   |_|\__,_|\___|_| |_|\__| \____/|_|\__|   \_/     |_(_)___/ 


[2025/01/15 10:27:22] [ info] Configuration:
[2025/01/15 10:27:22] [ info]  flush time     | 1.000000 seconds
[2025/01/15 10:27:22] [ info]  grace          | 5 seconds
[2025/01/15 10:27:22] [ info]  daemon         | 0
[2025/01/15 10:27:22] [ info] ___________
[2025/01/15 10:27:22] [ info]  inputs:
[2025/01/15 10:27:22] [ info]      tail
[2025/01/15 10:27:22] [ info] ___________
[2025/01/15 10:27:22] [ info]  filters:
[2025/01/15 10:27:22] [ info] ___________
[2025/01/15 10:27:22] [ info]  outputs:
[2025/01/15 10:27:22] [ info]      stdout.0
[2025/01/15 10:27:22] [ info] ___________
[2025/01/15 10:27:22] [ info]  collectors:
[2025/01/15 10:27:22] [ info] [fluent bit] version=4.0.0, commit=6d00ba1fde, pid=1537587
[2025/01/15 10:27:22] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2025/01/15 10:27:22] [ info] [storage] ver=1.1.6, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/01/15 10:27:22] [ info] [simd    ] SSE2
[2025/01/15 10:27:22] [ info] [cmetrics] version=0.9.9
[2025/01/15 10:27:22] [ info] [ctraces ] version=0.5.7
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] initializing
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2025/01/15 10:27:22] [debug] [tail:tail.0] created event channels: read=25 write=26
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] adjusted buf_max_size to 4001
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] adjusted buf_chunk_size to 4001
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inotify watch fd=31
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inode=43170643 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log, inode 43170643
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log'
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inode=43170624 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log, inode 43170624
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log'
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] scanning path /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inode=43170625 with offset=0 appended as /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] scan_glob add(): /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log, inode 43170625
[2025/01/15 10:27:22] [ info] [output:stdout:stdout.0] worker #0 started
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] 1 new files found on path '/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log'
[2025/01/15 10:27:22] [debug] [stdout:stdout.0] created event channels: read=35 write=36
[2025/01/15 10:27:22] [ info] [sp] stream processor started
[2025/01/15 10:27:22] [trace] [input chunk] update output instances with new chunk size diff=123, records=1, input=tail.0
[2025/01/15 10:27:22] [trace] [input chunk] update output instances with new chunk size diff=109, records=1, input=tail.0
[2025/01/15 10:27:22] [trace] [input chunk] update output instances with new chunk size diff=196, records=1, input=tail.0
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] [static files] processed 290b
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inode=43170643 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log promote to TAIL_EVENT
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43170643 watch_fd=1 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inode=43170624 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log promote to TAIL_EVENT
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43170624 watch_fd=2 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] inode=43170625 file=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log promote to TAIL_EVENT
[2025/01/15 10:27:22] [ info] [input:tail:tail.0] inotify_fs_add(): inode=43170625 watch_fd=3 name=/media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2025/01/15 10:27:22] [debug] [input:tail:tail.0] [static files] processed 0b, done
[2025/01/15 10:27:22] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:22] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:23] [trace] [task 0x617ac40] created (id=0)
[2025/01/15 10:27:23] [debug] [task] created task=0x617ac40 id=0 OK
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:23] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] sqlerrorlog: [[1736904442.640693144, {}], {"log"=>"🏴󠁧󠁢󠁥󠁮󠁧󠁿🏴󠁧󠁢󠁳󠁣󠁴󠁿🏴󠁧󠁢󠁷󠁬󠁳󠁿"}]
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[1] sqlerrorlog: [[1736904442.666284429, {}], {"log"=>"用汉字在 Fluent Bit 中处理日志,就像是一个梦一样😀"}]
[2] sqlerrorlog: [[1736904442.668104080, {}], {"log"=>"にほんごテストログふぁいる。文字エンコーディングをUnicodeにできる!?☕😀⚪⚫🔴🔵🟠🟡🟢🟣🟤🇺🇸🇯🇵"}]
[2025/01/15 10:27:23] [debug] [out flush] cb_destroy coro_id=0
[2025/01/15 10:27:23] [trace] [coro] destroy coroutine=0x6180fa0 data=0x6180fc0
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:23] [trace] [engine] [task event] task_id=0 out_id=0 return=OK
[2025/01/15 10:27:23] [debug] [task] destroy task=0x617ac40 (task_id=0)
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:23] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:24] [trace] [sched] 0 timer coroutines destroyed
^C[2025/01/15 10:27:24] [engine] caught signal (SIGINT)
[2025/01/15 10:27:24] [trace] [engine] flush enqueued data
[2025/01/15 10:27:24] [ warn] [engine] service will shutdown in max 5 seconds
[2025/01/15 10:27:24] [ info] [input] pausing tail.0
[2025/01/15 10:27:24] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:24] [ info] [engine] service has stopped (0 pending tasks)
[2025/01/15 10:27:24] [ info] [input] pausing tail.0
[2025/01/15 10:27:24] [debug] [input:tail:tail.0] inode=43170643 removing file name /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_subdivision_flags.log
[2025/01/15 10:27:24] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2025/01/15 10:27:24] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=43170643 watch_fd=1
[2025/01/15 10:27:24] [trace] [sched] 0 timer coroutines destroyed
[2025/01/15 10:27:24] [debug] [input:tail:tail.0] inode=43170624 removing file name /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_c.log
[2025/01/15 10:27:24] [ info] [output:stdout:stdout.0] thread worker #0 stopped
[2025/01/15 10:27:24] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=43170624 watch_fd=2
[2025/01/15 10:27:24] [debug] [input:tail:tail.0] inode=43170625 removing file name /media/Data3/Gitrepo/fluent-bit/tests/runtime/data/tail/log/unicode_j.log
[2025/01/15 10:27:24] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=43170625 watch_fd=3
  • Attached Valgrind output that shows no leaks or memory corruption was found
==1537587== 
==1537587== HEAP SUMMARY:
==1537587==     in use at exit: 0 bytes in 0 blocks
==1537587==   total heap usage: 3,465 allocs, 3,465 frees, 1,062,937 bytes allocated
==1537587== 
==1537587== All heap blocks were freed -- no leaks are possible
==1537587== 
==1537587== For lists of detected and suppressed errors, rerun with: -s
==1537587== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
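
To try the example configuration against a file of your own, a UTF-16LE input with a BOM can be generated as in the sketch below. The helper name `write_utf16le_log`, the file name `sample_utf16le.log`, and the sample lines are hypothetical; the test data files used in this PR are not reproduced here.

```python
import codecs
import os
import tempfile


def write_utf16le_log(path: str, lines: list[str]) -> str:
    """Write `lines` as UTF-16LE with a leading BOM, one record per line."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF16_LE)  # FF FE marks little-endian byte order
        for line in lines:
            f.write((line + "\n").encode("utf-16-le"))
    return path


# Hypothetical sample records mixing ASCII, Japanese and an emoji.
sample_lines = ["plain ascii line", "日本語のログ", "emoji 😀"]
sample_path = write_utf16le_log(
    os.path.join(tempfile.gettempdir(), "sample_utf16le.log"),
    sample_lines,
)
```

Pointing `Path` in the `[INPUT]` section at the generated file should then let the tail input read it; the BOM makes the byte order of the file unambiguous.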

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

fluent/fluent-bit-docs#1471

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@cosmo0920 cosmo0920 force-pushed the cosmo0920-try-to-bundle-simdutf-amalgamation branch from d1b404a to 4053bbd Compare October 7, 2024 07:13
@cosmo0920 cosmo0920 force-pushed the cosmo0920-try-to-bundle-simdutf-amalgamation branch from 4053bbd to 2a515ea Compare October 7, 2024 07:17
cosmo0920 added 18 commits March 4, 2025 01:51
Signed-off-by: Hiroshi Hatake <[email protected]>
Signed-off-by: Hiroshi Hatake <[email protected]>
…s not fully support C++11

Signed-off-by: Hiroshi Hatake <[email protected]>
Also, wait relatively longer than the ordinary test cases do.
This is because these Unicode test cases need to read contents from the
filesystem.

Signed-off-by: Hiroshi Hatake <[email protected]>
@edsiper
Member

edsiper commented Mar 18, 2025

@cosmo0920 is this ready to go?

@cosmo0920
Contributor Author

Yes, it's ready to go. I've rebased off master recently.

@edsiper edsiper merged commit 5251a76 into master Mar 29, 2025
100 of 102 checks passed
@edsiper edsiper deleted the cosmo0920-try-to-bundle-simdutf-amalgamation branch March 29, 2025 15:17
Labels
docs-required ok-package-test Run PR packaging tests
Development

Successfully merging this pull request may close these issues.

Add support for reading files encoded in UTF-16 for Tail Input
6 participants