Handling non UTF-8 data. #1168

Closed
mashhurs opened this issue Mar 15, 2024 · 0 comments · Fixed by #1169

mashhurs commented Mar 15, 2024

Description

Current buggy behaviours:

  • when using HTTP compression (with compression_level > 0), the event gets rejected if it contains invalid (non UTF-8) byte sequences;
  • when not using HTTP compression (with compression_level = 0), the event gets accepted even though it contains invalid (non UTF-8) byte sequences. The reason is that the Manticore HTTP client replaces them under the hood (each invalid byte is replaced with 3 bytes, so 2 extra bytes appear) when it uses the Apache StringEntity.
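
The 1-byte to 3-byte growth described above can be illustrated with a minimal Ruby sketch. It assumes the replacement is the Unicode replacement character U+FFFD, which is what a 3-byte result suggests; it uses plain Ruby `String#scrub` rather than the actual Manticore/StringEntity code path, and `as_utf8` is a hypothetical helper introduced only for this example.

```ruby
# Hypothetical helper: tag raw bytes as UTF-8 without validating them.
def as_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

raw = as_utf8("\xAC")
puts raw.valid_encoding?   # false: 0xAC alone is not a valid UTF-8 sequence
puts raw.bytesize          # 1

# String#scrub swaps each invalid sequence for U+FFFD on UTF-8 strings
# (assumed here to mirror what happens on the uncompressed path).
scrubbed = raw.scrub
puts scrubbed.valid_encoding?                                # true
puts scrubbed.bytesize                                       # 3
puts scrubbed.bytes.map { |b| format("%02X", b) }.join(" ")  # EF BF BD
```

The payload becomes valid UTF-8, but it no longer carries the original byte, which is why the two modes end up sending different data.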

Logstash information:

Please include the following information:

  1. Logstash version (e.g. bin/logstash --version) - any, including main (v8.14) branch, es-output-v11.22.2
  2. Logstash installation source (e.g. built from source, with a package manager: DEB/RPM, expanded from tar or zip archive, docker) - any, including main (v8.14) branch, es-output-v11.22.2
  3. How is Logstash being run (e.g. as a service/service manager: systemd, upstart, etc. Via command line, docker/kubernetes)
  4. How was the Logstash Plugin installed - default, current es-output-v11.22.2

JVM (e.g. java -version):

If the affected version of Logstash is 7.9 (or earlier), or if it is NOT using the bundled JDK or using the 'no-jdk' version in 7.10 (or higher), please provide the following information:

  1. JVM version (java -version)
  2. JVM installation source (e.g. from the Operating System's package manager, from source, etc).
  3. Value of the JAVA_HOME environment variable if set.

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:

Steps to reproduce:

Please include a minimal but complete recreation of the problem,
including (e.g.) pipeline definition(s), settings, locale, etc. The easier
you make for us to reproduce it, the more likely that somebody will take the
time to look at it.

  1. Use the following pipeline config, saved as encoding_test.conf in the config folder
input { generator { count => 1 } }
filter { ruby { code => 'str = "\xAC"; event.set("message", str)' } }
output {
 elasticsearch {
   cloud_id => "cloud_id"
   cloud_auth => "elastic:{pwd}"
   http_compression => "${HTTP_COMPRESSION}"
 }
 stdout { }
}
  2. Run with HTTP compression enabled, HTTP_COMPRESSION=true bin/logstash -f config/encoding_test.conf, and observe that ES rejects the event because of the invalid UTF-8 payload
  3. Run with HTTP compression disabled, HTTP_COMPRESSION=false bin/logstash -f config/encoding_test.conf, and observe that ES indexes the event without issue.

Provide logs (if relevant):

# HTTP_COMPRESSION=true bin/logstash -f config/encoding_test.conf --enable-local-plugin-development

[2024-03-15T15:22:19,117][DEBUG][org.apache.http.impl.conn.PoolingHttpClientConnectionManager][main][999000c22ac1744372923039d3bee405a92df01b3dafcd64f0830a24ad60acc6] Connection released: [id: 0][route: {s}->https://host.elastic-cloud.com:443][total available: 1; route allocated: 1 of 100; total allocated: 1 of 1000]
[2024-03-15T15:22:19,119][ERROR][logstash.outputs.elasticsearch][main][999000c22ac1744372923039d3bee405a92df01b3dafcd64f0830a24ad60acc6] Encountered a retryable error (will retry with exponential backoff) {:code=>400, :url=>"https://host.elastic-cloud.com:443/_bulk?filter_path=errors,items.*.error,items.*.status", :content_length=>248, :body=>"{\"error\":{\"root_cause\":[{\"type\":\"parse_exception\",\"reason\":\"Failed to parse content to type\"}],\"type\":\"parse_exception\",\"reason\":\"Failed to parse content to type\",\"caused_by\":{\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0xac\\n at [Source: (byte[])\\\"{\\\"@version\\\":\\\"1\\\",\\\"host\\\":{\\\"name\\\":\\\"MacBook-Pro.local\\\"},\\\"@timestamp\\\":\\\"2024-03-15T22:22:18.892422Z\\\",\\\"message\\\":\\\"�\\\",\\\"event\\\":{\\\"original\\\":\\\"Hello world!\\\",\\\"sequence\\\":0},\\\"data_stream\\\":{\\\"type\\\":\\\"logs\\\",\\\"dataset\\\":\\\"generic\\\",\\\"namespace\\\":\\\"default\\\"}}\\\"; line: 1, column: 117]\"}},\"status\":400}"}
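
The "Invalid UTF-8 start byte 0xac" error indicates that the raw byte reached ES unchanged in compressed mode. A small sketch of why that is plausible: gzip is byte-preserving, so whatever goes in comes out on the receiving side. The `roundtrip_gzip` helper and the sample payload are hypothetical and only stand in for the real request body.

```ruby
require "zlib"

# Hypothetical helper: gzip a payload, then decompress it the way a
# receiver would, returning the bytes the receiver ends up parsing.
def roundtrip_gzip(payload)
  Zlib.gunzip(Zlib.gzip(payload))
end

# Sample body with the same invalid byte as in the reproduction config.
payload = "{\"message\":\"\xAC\"}".b
received = roundtrip_gzip(payload)

puts received.bytes.include?(0xAC)  # true: the byte survives compression
puts received.dup.force_encoding(Encoding::UTF_8).valid_encoding?  # false
```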

# HTTP_COMPRESSION=false bin/logstash -f config/encoding_test.conf
{
        "host" => {
        "name" => "MacBook-Pro.local"
    },
         "event" => {
        "original" => "Hello world!",
        "sequence" => 0
    },
      "@version" => "1",
       "message" => "\xAC",
    "@timestamp" => 2024-03-15T22:27:03.706976Z
}

Acceptance Criteria

Regardless of the HTTP compression mode, the behaviour should be the same: either reject or accept. Accepting the event is probably the better option, as it may benefit users in many ways. However, filtering out invalid byte sequences would be somewhat risky.
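
One possible way to get mode-independent behaviour is to normalize the payload once, before the compressed and uncompressed paths diverge. The sketch below is only an illustration under that assumption; it is not necessarily what the fix in #1169 does, and `normalize_utf8`, `body_for` and `decoded_bytes` are hypothetical names, not plugin APIs. Note that it replaces invalid sequences rather than filtering them out, and still alters the original data.

```ruby
require "zlib"

# Hypothetical: replace invalid sequences with U+FFFD up front.
def normalize_utf8(str)
  str.dup.force_encoding(Encoding::UTF_8).scrub
end

# Hypothetical: build the request body for either mode from the
# already-normalized payload.
def body_for(payload, compress:)
  normalized = normalize_utf8(payload)
  compress ? Zlib.gzip(normalized) : normalized
end

# Hypothetical: bytes the receiver parses after optional decompression.
def decoded_bytes(body, compressed:)
  (compressed ? Zlib.gunzip(body) : body).bytes
end

payload = "a\xACb"
gz    = decoded_bytes(body_for(payload, compress: true),  compressed: true)
plain = decoded_bytes(body_for(payload, compress: false), compressed: false)
puts gz == plain  # true: both modes deliver identical bytes
```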

@mashhurs mashhurs self-assigned this Mar 15, 2024
@mashhurs mashhurs changed the title from "Inconsistency of sending non UTF-8 data with HTTP compressed vs uncompressed modes." to "Handling non UTF-8 data." Mar 15, 2024