
When event JSON data contains non UTF-8 invalid bytes, replace with replacement characters. #1169

Merged: 8 commits, Mar 19, 2024

Conversation

@mashhurs (Contributor) commented on Mar 17, 2024

Description

Current buggy behaviours:

  • when using HTTP compression (compression_level > 0), the event gets rejected if it contains invalid non-UTF-8 byte sequences;
  • when not using HTTP compression (compression_level = 0), the event gets accepted even though it contains invalid non-UTF-8 byte sequences. The reason: the Manticore HTTP client replaces them under the hood because it uses Apache's StringEntity (each invalid byte becomes the 3-byte replacement character, so the 2 extra bytes per invalid byte can be seen in the Apache trace logs)

This PR introduces an immediate fix and opens a discussion for the longer-term, general case. Verified with Apache client trace logs that the bytes being sent do not change.

  • the immediate fix, which this PR covers:
    • if any invalid byte sequence is found in the event's JSON payload, replace it with the Unicode replacement character (\uFFFD); see the short illustration after this list. This follows common practice (it is how most current software, text editors for example, behaves) and has the benefit of preserving as much of the event's valid content as possible instead of throwing.
  • [SOLVED] for the long term (requires a discussion), see the comment: When event JSON data contains non UTF-8 invalid bytes, replace with replacement characters. #1169 (review)
    • the Manticore HTTP client uses either ByteArrayEntity or (Apache HttpCore's) StringEntity when a request body is given. Because StringEntity converts the payload, the original bytes change when they are not valid UTF-8. In my view, Manticore should not depend on any conversion, regardless of encoding, and should use ByteArrayEntity; it is unclear what feature or behaviour Manticore intended to provide with StringEntity.
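A minimal illustration of the replacement behaviour in plain Ruby (String#scrub substitutes U+FFFD for invalid byte sequences; this snippet is illustrative, not code from the PR):

str = "abc\xACdef".force_encoding(Encoding::UTF_8)
str.valid_encoding?   # => false (\xAC is not a valid UTF-8 sequence)
str.scrub             # => "abc\uFFFDdef" (invalid byte replaced with U+FFFD)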
# use the following config in config/encoding_test.conf
input { generator { count => 1 } }
filter { ruby { code => 'str = "\xAC"; event.set("message", str)' } }
output {
  elasticsearch {
    cloud_id => "cloud_id"
    cloud_auth => "elastic:{pwd}"
    http_compression => "${HTTP_COMPRESSION}"
  }
  stdout { }
}
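To reproduce, one can run Logstash against this file (assuming it is saved as config/encoding_test.conf inside a Logstash installation; the HTTP_COMPRESSION environment variable feeds the http_compression option above):

HTTP_COMPRESSION=true bin/logstash -f config/encoding_test.conf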

# BEFORE the fix: see https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/1168 logs
# AFTER the fix, the behavior is the same at any HTTP compression level (0~9): the data is indexed
[2024-03-18T09:06:31,891][DEBUG][org.apache.http.impl.conn.PoolingHttpClientConnectionManager][main][999000c22ac1744372923039d3bee405a92df01b3dafcd64f0830a24ad60acc6] Connection released: [id: 0][route: {s}->https://68f825b174ea43b5986baabce7534163.ca-central-1.aws.elastic-cloud.com:443][total available: 1; route allocated: 1 of 100; total allocated: 1 of 1000]
{
          "host" => {
        "name" => "localhost"
    },
      "@version" => "1",
    "@timestamp" => 2024-03-18T16:06:31.686611Z,
         "event" => {
        "sequence" => 0,
        "original" => "Hello world!"
    },
       "message" => "\xAC"
}

@mashhurs mashhurs linked an issue Mar 18, 2024 that may be closed by this pull request
@mashhurs mashhurs marked this pull request as draft March 18, 2024 15:51
@mashhurs mashhurs marked this pull request as ready for review March 18, 2024 20:06
@yaauie (Contributor) left a comment:

I think we disagree about the cause, and therefore disagree about next steps, but I agree that String#scrub! will prevent us from propagating invalid byte sequences to Elasticsearch regardless of whether compression is enabled.

The root of the issue is elastic/logstash#15833, in which Logstash's wrapper around JRJackson produces invalid UTF-8 when handling data-structures containing Strings that are not valid UTF-8, because while it escapes meaningful bytes properly it propagates the invalid UTF-8 byte sequences directly.

When this plugin calls through Manticore to make an uncompressed API request, Manticore passes our not-valid-UTF8 ruby-String through the JRuby bridge to invoke a java-constructor that expects a java-String, and the JRuby bridge implicitly scrubs the whole buffer. We get no such "for free" scrubbing when we build a binary compressed buffer ourselves using the ruby-native Zlib.
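(For illustration, a minimal sketch, mine rather than the review's, of why the Zlib path gives no implicit scrubbing: Ruby's Zlib round-trips bytes exactly, so an invalid byte survives compression untouched.)

require 'zlib'

raw = "\xAC".b                                           # an invalid UTF-8 byte, handled as binary
inflated = Zlib::Inflate.inflate(Zlib::Deflate.deflate(raw))
inflated == raw                                          # => true; no scrubbing happens on this route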

The solution proposed brings parity to the two routes by scrubbing each bulk action's JSON sequence before appending it to the buffer. We are "lucky" that scrubbing the generated invalid-JSON has the same net effect as producing clean JSON from a data-structure that has been deep-replaced with scrubbed components because JRJackson does correctly handle lower-ASCII control characters and the escaping of semantically-meaningful bytes in a string value.

Manticore is a thin layer on top of Apache Http Client, so it makes sense that Manticore uses Apache Http's StringEntity to represent string entities (which provides functionality for request headers indicating the size and semantic meaning of the request payload), and to use ByteArrayEntity to represent opaque binary sequences.

I would hesitate to take full ownership of encoding the literal bytes of the request payload (along with the requisite headers to indicate what those bytes mean), or to suggest that Manticore change their behaviour in general.


Scrubbing prior to appending to the buffer ensures that when the JSON generation produces invalid UTF-8 (and therefore invalid JSON), a scrubbed version is propagated to Elasticsearch regardless of whether we are sending through a compressed or uncompressed request.

When we fix the upstream issue in Logstash core, this scrub operation should effectively become a no-op.
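(A minimal sketch of the approach described above; the variable names are hypothetical and this is not the plugin's actual code.)

body = +""
bulk_action_jsons.each do |action_json|   # each serialized bulk action (one JSON line)
  body << action_json.scrub               # replace any invalid byte sequences with U+FFFD
  body << "\n"
end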


As a net, I am +1 to merging a fix to String#scrub! the JSON before appending it to the buffer. I do not believe we should add the logging (unless we substantially re-work it to improve signal:noise ratio to make it actionable), and I do not believe we should propagate the issue up the Manticore side.

Review threads (outdated, resolved): docs/index.asciidoc; lib/logstash/outputs/elasticsearch/http_client.rb (3 threads)
mashhurs and others added 2 commits March 18, 2024 15:04
Take out the logging since it does not help much, and add more context to the doc.

Co-authored-by: Ry Biesemeyer <[email protected]>
@yaauie (Contributor) left a comment:

Changes themselves look appropriate.

I've left a note about the changelog, and another about ruby-style in the specs; feel free to resolve as you see fit.

The Elastic Stack 7.17 specs were failing in a seemingly-unrelated way, so I kicked them.

Review threads (outdated, resolved): CHANGELOG.md; spec/integration/outputs/compressed_indexing_spec.rb
@jsvd (Member) commented on Mar 19, 2024

For future reference, the differentiated Manticore behaviour we see is this:

> string_entity = Java::org.apache.http.entity.StringEntity.new("\xAC")
=> #<Java::OrgApacheHttpEntity::StringEntity:0x2f930be7>
> Java::org.apache.http.util.EntityUtils.toString(string_entity)
=> "?"
> byte_array_entity = Java::org.apache.http.entity.ByteArrayEntity.new("\xAC".to_java_bytes)
=> #<Java::OrgApacheHttpEntity::ByteArrayEntity:0x27c243a3>
> Java::org.apache.http.util.EntityUtils.toString(byte_array_entity)
=> "¬"

Update the unit test to the suggested Ruby coding style and update the changelog.

Co-authored-by: Ry Biesemeyer <[email protected]>
@yaauie (Contributor) commented on Mar 19, 2024

Travis is failing in the 7.x integration tests because the logs for the job are too verbose, and Travis terminates the build when the logs exceed the job's maximum length:

The job exceeded the maximum log length, and has been terminated.

-- Job Output

I have run both such dockerized jobs locally and they are green (successful).

@mashhurs (Contributor, Author) commented on Mar 19, 2024

The following CI jobs failed, but I confirm they pass locally; the failures appear to be Travis-related, so they are not blockers.

2733.3 | INTEGRATION=true ELASTIC_STACK_VERSION=7.x | Linux | errored
2733.4 | INTEGRATION=true ELASTIC_STACK_VERSION=7.x SNAPSHOT=true LOG_LEVEL=info | Linux | errored

  • env vars
INTEGRATION=true
ELASTIC_STACK_VERSION=7.x
SNAPSHOT=true # without snapshot also fine
LOG_LEVEL=info
  • docker setup: ./.ci/docker-setup.sh
Fetching versions from https://raw.githubusercontent.com/elastic/logstash/master/ci/logstash_releases.json
"7.17.19-SNAPSHOT"
Translated 7.x to 7.17.19-SNAPSHOT
Testing against version: 7.17.19-SNAPSHOT
Pulling docker.elastic.co/logstash/logstash:7.17.19-SNAPSHOT
7.17.19-SNAPSHOT: Pulling from logstash/logstash
Digest: sha256:e1b7f8e923565d3b5307637d5a984d48ea82802cd110433ab03957e544711b56
Status: Image is up to date for docker.elastic.co/logstash/logstash:7.17.19-SNAPSHOT
docker.elastic.co/logstash/logstash:7.17.19-SNAPSHOT
Pulling docker.elastic.co/elasticsearch/elasticsearch:7.17.19-SNAPSHOT
...
 => => writing image sha256:3dcbb6d89a3a6f8e97439f40036897a00f9b43f3066b15d5327a6cbf15fac15f                                                                                        0.0s
[+] Building 8.0s (18/18) FINISHED 
  • docker run: ./.ci/docker-run.sh
w0w"  }
ci-logstash-1       |     increases number of successful inserted documents
ci-logstash-1       | 
ci-logstash-1       | Pending: (Failures listed here are expected and do not affect your suite's status)
ci-logstash-1       | 
ci-logstash-1       |   1) pool sniffer Simple sniff parsing with single node should return the correct sniff URL
ci-logstash-1       |      # No reason given
ci-logstash-1       |      # ./spec/integration/outputs/sniffer_spec.rb:37
ci-logstash-1       | 
ci-logstash-1       | 
ci-logstash-1       | Finished in 5 minutes 21 seconds (files took 4.22 seconds to load)
ci-logstash-1       | 136 examples, 0 failures, 1 pending
ci-logstash-1       | 
ci-logstash-1       | Randomized with seed 65243
ci-logstash-1       | 
ci-logstash-1 exited with code 0
Aborting on container exit...
[+] Running 2/2
 ✔ Container ci-elasticsearch-1  Stopped                                                                                                                                            0.3s 
 ✔ Container ci-logstash-1       Stopped  

Successfully merging this pull request may close this issue: Handling non UTF-8 data.
4 participants