Is the encoding parameter being used as the documentation states? #483

Open
@gadiego92

Description

Describe the bug

The official documentation says the following about the encoding parameters of the tail plugin:

encoding, from_encoding

type    default                                version
string  nil (string encoding is ASCII-8BIT)    0.14.0

Specifies the encoding of reading lines.

By default, in_tail emits string value as ASCII-8BIT encoding.

These options change it:

  • If encoding is specified, in_tail changes string to encoding.
    This uses Ruby's String#force_encoding.

  • If encoding and from_encoding both are specified, in_tail tries to
    encode string from from_encoding to encoding. This uses Ruby's
    String#encode.

source: tail#encoding-from_encoding
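
For reference, here is a minimal Ruby sketch of how the two methods named in the documentation differ. It is not Fluentd code; the sample string and variable names are only illustrative. String#force_encoding relabels the existing bytes, while String#encode transcodes them and can fail or substitute replacement characters.

```ruby
# Bytes of "Zürich" as they would arrive from a UTF-8 file, tagged ASCII-8BIT.
raw = "Z\u00FCrich".b

# force_encoding only changes the encoding label; the bytes stay the same.
relabeled = raw.dup.force_encoding(Encoding::UTF_8)
puts relabeled.valid_encoding?   # the relabeled bytes are well-formed UTF-8

# encode transcodes. ASCII-8BIT bytes >= 0x80 have no mapping to UTF-8,
# so a plain conversion raises.
transcode_error =
  begin
    raw.encode(Encoding::UTF_8, Encoding::ASCII_8BIT)
    nil
  rescue Encoding::UndefinedConversionError => e
    e
  end
puts transcode_error.class

# With replacement options, every unmapped byte becomes U+FFFD.
replaced = raw.encode(Encoding::UTF_8, Encoding::ASCII_8BIT,
                      invalid: :replace, undef: :replace)
puts replaced.inspect
```

The two bytes of "ü" each turn into one replacement character in the last case, while the relabeled string keeps "ü" intact.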

I have been checking the Fluentd source code and found the following:

  1. Regarding the first bullet.
    I think the encoding parameter is not used the way the documentation states.
    I cannot find any call to String#force_encoding that uses the encoding parameter.
    On the other hand, I found String#force_encoding called with the from_encoding parameter in a few places.
    I think line 992 might be wrong:
    https://github.com/fluent/fluentd/blob/74db9477f445ef83384eca6da8d6c2049945d8cd/lib/fluent/plugin/in_tail.rb#L992
    If the documentation is correct, String#force_encoding should use the encoding value, not the from_encoding value.

  2. Regarding the second bullet.
    It states that String#encode is used when the from_encoding parameter is set. However, String#encode seems to be used whenever the encoding parameter is set to something other than ASCII-8BIT, because from_encoding defaults to ASCII-8BIT. For example, String#encode is used if you set encoding to UTF-8, although according to the documentation String#force_encoding should be used when only the encoding parameter is set.
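
The behavior described in the two points above can be modeled roughly as follows. This is a hypothetical sketch, not the in_tail source: the method name `convert_line`, the `DEFAULT_ENCODING` constant, and the use of replacement options during transcoding are all assumptions, chosen because they reproduce the output shown in the error log further down.

```ruby
# Hypothetical model of the behavior described above; NOT the in_tail source.
DEFAULT_ENCODING = Encoding::ASCII_8BIT

def convert_line(bytes, encoding: nil, from_encoding: nil)
  from = from_encoding || DEFAULT_ENCODING   # from_encoding falls back to ASCII-8BIT
  to   = encoding || DEFAULT_ENCODING

  # force_encoding is applied with from_encoding, not with encoding.
  line = bytes.dup.force_encoding(from)
  return line if from == to                  # same encoding: nothing to transcode

  # Assumption: unmappable bytes are replaced instead of raising.
  line.encode(to, from, invalid: :replace, undef: :replace)
end

raw = "Z\u00FCrich".b

only_encoding = convert_line(raw, encoding: Encoding::UTF_8)
both_utf8     = convert_line(raw, encoding: Encoding::UTF_8,
                                  from_encoding: Encoding::UTF_8)

puts only_encoding.inspect   # "ü" is lost: its two bytes are replaced
puts both_utf8.inspect       # "ü" survives: bytes are only relabeled
```

Under this model, setting only encoding to UTF-8 transcodes from ASCII-8BIT and damages non-ASCII bytes, while setting both parameters to UTF-8 merely relabels the bytes.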

To Reproduce

Start a Fluentd container with the grok parser plugin installed.

Then run the command:

td-agent --config /home/td-agent/fluentd.conf

Expected behavior

2023-11-23 15:58:46.005162458 +0100 encoding: {"message":"Zürich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005176717 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}

Your Environment

- Fluentd version: 1.11.2
- TD Agent version: 1.11.2
- Operating system: Alma Linux 9
- Kernel version: Linux 5.14.0-284.30.1.el9_2.x86_64 x86_64

Your Configuration

# /home/td-agent/patterns.conf

CUSTOM_LOG_WORKS %{TIMESTAMP_ISO8601:timestamp} %{GREEDYDATA:message}
# HTTPDATE has ä character
# Source: https://github.com/fluent/fluent-plugin-grok-parser/blob/903dfe222984b90c4e1c1151530038d1f242157d/patterns/legacy/grok-patterns#L51
CUSTOM_LOG_FAILS %{HTTPDATE:timestamp} %{NUMBER:response}

# /tmp/encoding-test.log

2023-11-22 18:18:09.823+0100 Testing Zürich
2023-11-22 18:18:09.823+0100 Testing Geneva

# /home/td-agent/fluentd.conf
<source>

  @type tail

  path /tmp/encoding-test.log
  read_from_head true
  encoding UTF-8
  tag encoding

  <parse>

    @type grok

    grok_failure_key grokfailure
    custom_pattern_path /home/td-agent/patterns.conf

    <grok>
       pattern %{CUSTOM_LOG_FAILS:message}
    </grok>

    <grok>
       pattern %{CUSTOM_LOG_WORKS:message}
    </grok>

  </parse>

</source>

<match encoding>

    @type stdout

</match>
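
The comment in patterns.conf notes that HTTPDATE contains an ä character. The Ruby sketch below shows why that may matter for string encodings. It is an illustration only, not grok parser code: the pattern fragment is taken from the expanded HTTPDATE month expression, and `try_match` is a hypothetical helper. A regexp containing a non-ASCII UTF-8 character cannot be matched against an ASCII-8BIT string that contains non-ASCII bytes.

```ruby
# Fragment of the month alternation, including the non-ASCII "ä".
pattern = Regexp.new("[Mm](?:a|\u00E4)?r(?:ch|z)?")

# Hypothetical helper: returns the match result, or the error class on failure.
def try_match(regexp, string)
  regexp.match?(string)
rescue Encoding::CompatibilityError => e
  e.class
end

utf8_line    = "Testing Z\u00FCrich"      # tagged UTF-8
binary_line  = "Testing Z\u00FCrich".b    # same bytes, tagged ASCII-8BIT
binary_ascii = "Testing Geneva".b         # ASCII-only bytes, tagged ASCII-8BIT

puts try_match(pattern, utf8_line).inspect     # no match, but no error
puts try_match(pattern, binary_line).inspect   # incompatible encodings
puts try_match(pattern, binary_ascii).inspect  # ASCII-only input is compatible
```

This would be consistent with only the "Zürich" line being affected, since the "Geneva" line is pure ASCII.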

Your Error Log

[td-agent@dc60c1c5967e ~]$ /opt/td-agent/bin/fluentd --config /home/td-agent/fluentd.conf
2023-11-23 15:58:45 +0100 [info]: parsing config file is succeeded path="/home/td-agent/fluentd.conf"
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-elasticsearch' version '4.2.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-elasticsearch' version '4.1.1'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-grok-parser' version '2.6.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-kafka' version '0.14.1'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-prometheus' version '1.8.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-prometheus_pushgateway' version '0.0.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.3.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-s3' version '1.4.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-systemd' version '1.0.2'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-td' version '1.1.0'
2023-11-23 15:58:45 +0100 [info]: gem 'fluent-plugin-webhdfs' version '1.2.5'
2023-11-23 15:58:45 +0100 [info]: gem 'fluentd' version '1.11.2'
2023-11-23 15:58:45 +0100 [info]: Expanded the pattern %{CUSTOM_LOG_FAILS:message} into (?<message>(?<timestamp>(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))/(?:\b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y|i)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b)/(?:(?>\d\d){1,2}):(?:(?!<[0-9])(?:(?:2[0123]|[01]?[0-9])):(?:(?:[0-5][0-9]))(?::(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))(?![0-9])) (?:(?:[+-]?(?:[0-9]+)))) (?<response>(?:(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))))))
2023-11-23 15:58:45 +0100 [info]: Expanded the pattern %{CUSTOM_LOG_WORKS:message} into (?<message>(?<timestamp>(?:(?>\d\d){1,2})-(?:(?:0?[1-9]|1[0-2]))-(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))[T ](?:(?:2[0123]|[01]?[0-9])):?(?:(?:[0-5][0-9]))(?::?(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))?(?:(?:Z|[+-](?:(?:2[0123]|[01]?[0-9]))(?::?(?:(?:[0-5][0-9])))))?) (?<message>.*))
2023-11-23 15:58:45 +0100 [warn]: 'pos_file PATH' parameter is not set to a 'tail' source.
2023-11-23 15:58:45 +0100 [warn]: this parameter is highly recommended to save the position to resume tailing.
2023-11-23 15:58:45 +0100 [info]: using configuration file: <ROOT>
  <source>
    @type tail
    path "/tmp/encoding-test.log"
    tag "encoding"
    read_from_head true
    encoding "UTF-8"
    <parse>
      @type "grok"
      grok_failure_key "grokfailure"
      custom_pattern_path "/home/td-agent/patterns.conf"
      unmatched_lines
      <grok>
        pattern "%{CUSTOM_LOG_FAILS:message}"
      </grok>
      <grok>
        pattern "%{CUSTOM_LOG_WORKS:message}"
      </grok>
    </parse>
  </source>
  <match encoding>
    @type stdout
  </match>
</ROOT>
2023-11-23 15:58:45 +0100 [info]: starting fluentd-1.11.2 pid=715 ruby="2.7.1"
2023-11-23 15:58:45 +0100 [info]: spawn command to main:  cmdline=["/opt/td-agent/bin/ruby", "-Eascii-8bit:ascii-8bit", "/opt/td-agent/bin/fluentd", "--config", "/home/td-agent/fluentd.conf", "--under-supervisor"]
2023-11-23 15:58:45 +0100 [info]: adding match pattern="encoding" type="stdout"
2023-11-23 15:58:45 +0100 [info]: adding source type="tail"
2023-11-23 15:58:46 +0100 [info]: #0 Expanded the pattern %{CUSTOM_LOG_FAILS:message} into (?<message>(?<timestamp>(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))/(?:\b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y|i)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b)/(?:(?>\d\d){1,2}):(?:(?!<[0-9])(?:(?:2[0123]|[01]?[0-9])):(?:(?:[0-5][0-9]))(?::(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))(?![0-9])) (?:(?:[+-]?(?:[0-9]+)))) (?<response>(?:(?:(?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))))))
2023-11-23 15:58:46 +0100 [info]: #0 Expanded the pattern %{CUSTOM_LOG_WORKS:message} into (?<message>(?<timestamp>(?:(?>\d\d){1,2})-(?:(?:0?[1-9]|1[0-2]))-(?:(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9]))[T ](?:(?:2[0123]|[01]?[0-9])):?(?:(?:[0-5][0-9]))(?::?(?:(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)))?(?:(?:Z|[+-](?:(?:2[0123]|[01]?[0-9]))(?::?(?:(?:[0-5][0-9])))))?) (?<message>.*))
2023-11-23 15:58:46 +0100 [warn]: #0 'pos_file PATH' parameter is not set to a 'tail' source.
2023-11-23 15:58:46 +0100 [warn]: #0 this parameter is highly recommended to save the position to resume tailing.
2023-11-23 15:58:46 +0100 [info]: #0 starting fluentd worker pid=720 ppid=715 worker=0
2023-11-23 15:58:46 +0100 [info]: #0 following tail of /tmp/encoding-test.log
2023-11-23 15:58:46.005131856 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005146527 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005152826 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005157747 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005162458 +0100 encoding: {"message":"Z��rich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46.005176717 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 15:58:46 +0100 [info]: #0 fluentd worker is now running worker=0

Additional details

If I set both encoding parameters to UTF-8, I get a warning in the Fluentd logs, but the special characters are rendered correctly.
I don't know whether this is the proper way to handle special characters, since it triggers a warning. Shouldn't this warning be changed to info?

Configuration

    @type tail
    path "/tmp/encoding-test.log"
    tag "encoding"
    read_from_head true
    from_encoding "UTF-8"
    encoding "UTF-8"

Warning

2023-11-23 14:44:12 +0100 [warn]: #0 fluent/log.rb:348:warn: 'encoding' and 'from_encoding' are same encoding. No effect

Output

2023-11-23 14:44:12.044957269 +0100 encoding: {"message":"Zürich","timestamp":"2023-11-22 18:18:09.823+0100"}
2023-11-23 14:44:12.044962081 +0100 encoding: {"message":"Geneva","timestamp":"2023-11-22 18:18:09.823+0100"}

Documentation not clear or wrong

Alternatively, Fluentd may be working as intended, and the documentation is either unclear or wrong.
