Description
Describe the bug
The official documentation states the following about encoding for the tail plugin:
encoding, from_encoding
type: string
default: nil (string encoding is ASCII-8BIT)
version: 0.14.0
Specifies the encoding of reading lines.
By default, in_tail emits string value as ASCII-8BIT encoding. These options change it:
- If encoding is specified, in_tail changes string to encoding. This uses Ruby's String#force_encoding.
- If encoding and from_encoding both are specified, in_tail tries to encode string from from_encoding to encoding. This uses Ruby's String#encode.
source: tail#encoding-from_encoding
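For reference, the two documented cases can be sketched in plain Ruby as follows. This is only an illustration of the documentation text, not code taken from Fluentd; the helper name apply_documented_encoding and its keyword arguments are hypothetical.

```ruby
# Hypothetical helper mirroring the documentation text quoted above.
# It is NOT Fluentd's implementation.
def apply_documented_encoding(line, encoding: nil, from_encoding: nil)
  if encoding && from_encoding
    # Both parameters set: transcode the bytes (String#encode).
    line.encode(encoding, from_encoding)
  elsif encoding
    # Only encoding set: keep the bytes, change the label (String#force_encoding).
    line.dup.force_encoding(encoding)
  else
    # Neither set: the line stays binary (ASCII-8BIT).
    line.b
  end
end

raw = "Z\xC3\xBCrich".b # UTF-8 bytes of "Zürich", tagged ASCII-8BIT
puts apply_documented_encoding(raw, encoding: "UTF-8")
```

Under this reading, setting only encoding to UTF-8 would relabel the bytes and print Zürich unchanged.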
I have been checking the Fluentd source code and found the following:
- Regarding the first bullet: I think the encoding parameter is not being used as the documentation states. I cannot find any call to String#force_encoding that uses the encoding parameter. On the other hand, I have found String#force_encoding called with the from_encoding parameter in a few places. I think line 992 might be wrong:
https://github.com/fluent/fluentd/blob/74db9477f445ef83384eca6da8d6c2049945d8cd/lib/fluent/plugin/in_tail.rb#L992
If the documentation is not wrong, String#force_encoding should use the encoding value, not the from_encoding value.
- Regarding the second bullet: it states that String#encode is used when the from_encoding parameter is set, but it seems String#encode is used by default if you set the encoding parameter to something other than ASCII-8BIT, because from_encoding defaults to ASCII-8BIT. For example, String#encode is used if you set the encoding parameter to UTF-8, but according to the documentation String#force_encoding, not String#encode, should be used when you set the encoding parameter.
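The difference between the two Ruby methods matters for the output reported below. The following standalone sketch shows it on the bytes of "Zürich". It uses only core Ruby String methods; the helper names are hypothetical, and the replace options in transcode_lenient are an assumption for illustration, not a statement about what Fluentd passes.

```ruby
# UTF-8 bytes of "Zürich" as read from a file, tagged as binary (ASCII-8BIT).
RAW = "Z\xC3\xBCrich".b

# String#force_encoding: keep the bytes, change only the encoding label.
def relabel(bytes, encoding)
  bytes.dup.force_encoding(encoding)
end

# String#encode, strict: bytes >= 0x80 have no defined mapping out of
# ASCII-8BIT, so this raises Encoding::UndefinedConversionError.
def transcode_strict(bytes, to, from)
  bytes.encode(to, from)
end

# String#encode with replacement (assumed options): each unmappable
# byte becomes U+FFFD when the target is UTF-8.
def transcode_lenient(bytes, to, from)
  bytes.encode(to, from, invalid: :replace, undef: :replace)
end

puts relabel(RAW, "UTF-8")                         # Zürich
puts transcode_lenient(RAW, "UTF-8", "ASCII-8BIT") # Z��rich
```

The second line has the same shape as the Z��rich value in the error log, which is consistent with a transcode from ASCII-8BIT rather than a relabel.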
To Reproduce
Start a Fluentd container with the grok parser plugin (fluent-plugin-grok-parser) installed, then run:
td-agent --config /home/td-agent/fluentd.conf
Expected behavior
2023-11-23 15:58:46.005162458 +0100 encoding: {"message":"Zürich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005176717 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}
Your Environment
- Fluentd version: 1.11.2
- TD Agent version: 1.11.2
- Operating system: Alma Linux 9
- Kernel version: Linux 5.14.0-284.30.1.el9_2.x86_64 x86_64
Your Configuration
# /home/td-agent/patterns.conf
CUSTOM_LOG_WORKS %{TIMESTAMP_ISO8601:timestamp} %{GREEDYDATA:message}
# HTTPDATE has ä character
# Source: https://github.com/fluent/fluent-plugin-grok-parser/blob/903dfe222984b90c4e1c1151530038d1f242157d/patterns/legacy/grok-patterns#L51
CUSTOM_LOG_FAILS %{HTTPDATE:timestamp} %{NUMBER:response}

# /tmp/encoding-test.log
2023-11-22 18:18:09.823+0100 Testing Zürich
2023-11-22 18:18:09.823+0100 Testing Geneva

# /home/td-agent/fluentd.conf
<source>
  @type tail
  path /tmp/encoding-test.log
  read_from_head true
  encoding UTF-8
  tag encoding
  <parse>
    @type grok
    grok_failure_key grokfailure
    custom_pattern_path /home/td-agent/patterns.conf
    <grok>
      pattern %{CUSTOM_LOG_FAILS:message}
    </grok>
    <grok>
      pattern %{CUSTOM_LOG_WORKS:message}
    </grok>
  </parse>
</source>
<match encoding>
  @type stdout
</match>
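The comment in patterns.conf notes that HTTPDATE contains an ä character. One possible reason this matters (an assumption, not something verified against the grok parser) is that in Ruby a regular expression containing non-ASCII characters has a fixed UTF-8 encoding and cannot be matched against a binary string that contains non-ASCII bytes. The sketch below shows that general Ruby behavior with the month fragment from the expanded pattern; try_match is a hypothetical helper.

```ruby
# Month fragment from the expanded HTTPDATE pattern; \u00E4 is "ä".
MONTH = Regexp.new("[Mm](?:a|\u00E4)?r(?:ch|z)?")

# Hypothetical helper: report how a match attempt ends.
def try_match(regexp, line)
  regexp.match?(line) ? :matched : :no_match
rescue Encoding::CompatibilityError
  # Raised for a UTF-8 regexp against a binary string with bytes >= 0x80.
  :incompatible_encoding
end

binary_line = "Z\xC3\xBCrich".b
p try_match(MONTH, binary_line)                             # :incompatible_encoding
p try_match(MONTH, binary_line.dup.force_encoding("UTF-8")) # :no_match
p try_match(MONTH, "Geneva".b)                              # :no_match
```

ASCII-only lines such as "Geneva" are unaffected either way, which fits the observation that only the Zürich line is altered.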
Your Error Log
[td-agent@dc60c1c5967e ~]$ /opt/td-agent/bin/fluentd --config /home/td-agent/fluentd.conf
2023-11-23 15:58:45 +0100 [info]: parsing config file is succeeded path="/home/td-agent/fluentd.conf"
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-elasticsearch' version '4.2.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-elasticsearch' version '4.1.1'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-grok-parser' version '2.6.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-kafka' version '0.14.1'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-prometheus' version '1.8.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-prometheus_pushgateway' version '0.0.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.3.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-s3' version '1.4.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-systemd' version '1.0.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-td' version '1.1.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-webhdfs' version '1.2.5'
2023-11-23 15:58:45 +0100 [info]: gem 'fluentd' version '1.11.2'
2023-11-23 15:58:45 +0100 [info]: Expanded the pattern %{CUSTOM_LOG_FAILS:message} into (?<message>(?<timestamp>(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))/(?:\b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y|i)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b)/(?:(?>\d\d){1,2}):(?:(?!<[0-9])(?:(?:2[0123]|[01]?[0-9])):(?:(?:[0-5][0-9]))(?::(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))(?![0-9])) (?:(?:[+-]?(?:[0-9]+)))) (?<response>(?:(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))))))
2023-11-23 15:58:45 +0100 [info]: Expanded the pattern %{CUSTOM_LOG_WORKS:message} into (?<message>(?<timestamp>(?:(?>\d\d){1,2})-(?:(?:0?[1-9]|1[0-2]))-(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))[T ](?:(?:2[0123]|[01]?[0-9])):?(?:(?:[0-5][0-9]))(?::?(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))?(?:(?:Z|[+-](?:(?:2[0123]|[01]?[0-9]))(?::?(?:(?:[0-5][0-9])))))?) (?<message>.*))
2023-11-23 15:58:45 +0100 [warn]: 'pos_file PATH' parameter is not set to a 'tail' source.
2023-11-23 15:58:45 +0100 [warn]: this parameter is highly recommended to save the position to resume tailing.
2023-11-23 15:58:45 +0100 [info]: using configuration file: <ROOT>
<source>
@type tail
path "/tmp/encoding-test.log"
tag "encoding"
read_from_head true
encoding "UTF-8"
<parse>
@type "grok"
grok_failure_key "grokfailure"
custom_pattern_path "/home/td-agent/patterns.conf"
unmatched_lines
<grok>
pattern "%{CUSTOM_LOG_FAILS:message}"
</grok>
<grok>
pattern "%{CUSTOM_LOG_WORKS:message}"
</grok>
</parse>
</source>
<match encoding>
@type stdout
</match>
</ROOT>
2023-11-23 15:58:45 +0100 [info]: starting fluentd-1.11.2 pid=715 ruby="2.7.1"
2023-11-23 15:58:45 +0100 [info]: spawn command to main: cmdline=["/opt/td-agent/bin/ruby", "-Eascii-8bit:ascii-8bit", "/opt/td-agent/bin/fluentd", "--config", "/home/td-agent/fluentd.conf", "--under-supervisor"]
2023-11-23 15:58:45 +0100 [info]: adding match pattern="encoding" type="stdout"
2023-11-23 15:58:45 +0100 [info]: adding source type="tail"
2023-11-23 15:58:46 +0100 [info]: #0 Expanded the pattern %{CUSTOM_LOG_FAILS:message} into (?<message>(?<timestamp>(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))/(?:\b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y|i)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b)/(?:(?>\d\d){1,2}):(?:(?!<[0-9])(?:(?:2[0123]|[01]?[0-9])):(?:(?:[0-5][0-9]))(?::(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))(?![0-9])) (?:(?:[+-]?(?:[0-9]+)))) (?<response>(?:(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))))))
2023-11-23 15:58:46 +0100 [info]: #0 Expanded the pattern %{CUSTOM_LOG_WORKS:message} into (?<message>(?<timestamp>(?:(?>\d\d){1,2})-(?:(?:0?[1-9]|1[0-2]))-(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))[T ](?:(?:2[0123]|[01]?[0-9])):?(?:(?:[0-5][0-9]))(?::?(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))?(?:(?:Z|[+-](?:(?:2[0123]|[01]?[0-9]))(?::?(?:(?:[0-5][0-9])))))?) (?<message>.*))
2023-11-23 15:58:46 +0100 [warn]: #0 'pos_file PATH' parameter is not set to a 'tail' source.
2023-11-23 15:58:46 +0100 [warn]: #0 this parameter is highly recommended to save the position to resume tailing.
2023-11-23 15:58:46 +0100 [info]: #0 starting fluentd worker pid=720 ppid=715 worker=0
2023-11-23 15:58:46 +0100 [info]: #0 following tail of /tmp/encoding-test.log
2023-11-23 15:58:46.005131856 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005146527 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005152826 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005157747 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005162458 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005176717 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46 +0100 [info]: #0 fluentd worker is now running worker=0
Additional details
If I set both encoding parameters to UTF-8, I get a warning in the Fluentd logs, but the special characters are rendered correctly.
I don't know whether this is the proper way to handle special characters, since it produces a warning. Shouldn't this warning be changed to info?
Configuration
@type tail
path "/tmp/encoding-test.log"
tag "encoding"
read_from_head true
from_encoding "UTF-8"
encoding "UTF-8"
Warning
2023-11-23 14:44:12 +0100 [warn]: #0 fluent/log.rb:348:warn: 'encoding' and 'from_encoding' are same encoding. No effect
Output
2023-11-23 14:44:12.044957269 +0100 encoding: {"message":"Zürich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 14:44:12.044962081 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}
Documentation not clear or wrong
Another possibility is that Fluentd works as intended but the documentation is unclear or wrong.