
When event JSON data contains non UTF-8 invalid bytes, replace with replacement characters. #1169

Merged: 8 commits, Mar 19, 2024

Conversation

@mashhurs (Contributor) commented on Mar 17, 2024

Description

Current buggy behaviours:

  • when using HTTP compression (compression_level > 0), the event gets rejected if it contains invalid non-UTF-8 byte sequences;
  • when not using HTTP compression (compression_level = 0), the event gets accepted even though it contains invalid non-UTF-8 byte sequences. The reason: the Manticore HTTP client replaces them under the hood because it uses Apache's StringEntity (each invalid byte becomes the 3-byte replacement character, so the 2 extra bytes per invalid byte can be seen in the Apache trace logs)

This PR introduces an immediate fix and opens a discussion for the longer-term, general case. Verified with Apache client trace logs that the bytes being sent do not change.

  • the immediate fix, which this PR covers:
    • if any invalid byte sequence is found in the event's JSON payload, replace it with the Unicode replacement character (\uFFFD); see the short illustration after this list. This follows common practice (it is how most current software, text editors for example, behaves) and has the benefit of preserving as much of the event's valid content as possible instead of throwing.
  • [SOLVED] for the long term (requires a discussion), see the comment: When event JSON data contains non UTF-8 invalid bytes, replace with replacement characters. #1169 (review)
    • the Manticore HTTP client uses either ByteArrayEntity or (Apache HttpCore's) StringEntity when a request body is given. Because StringEntity converts the payload, the original bytes change when they are not valid UTF-8. In my view, Manticore should not depend on any conversion, regardless of encoding, and should use ByteArrayEntity; it is unclear what feature or behaviour Manticore intended to provide with StringEntity.
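A minimal illustration of the replacement behaviour in plain Ruby (String#scrub substitutes U+FFFD for invalid byte sequences; this snippet is illustrative, not code from the PR):

str = "abc\xACdef".force_encoding(Encoding::UTF_8)
str.valid_encoding?   # => false (\xAC is not a valid UTF-8 sequence)
str.scrub             # => "abc\uFFFDdef" (invalid byte replaced with U+FFFD)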
# use the following config in config/encoding_test.conf
input { generator { count => 1 } }
filter { ruby { code => 'str = "\xAC"; event.set("message", str)' } }
output {
  elasticsearch {
    cloud_id => "cloud_id"
    cloud_auth => "elastic:{pwd}"
    http_compression => "${HTTP_COMPRESSION}"
  }
  stdout { }
}
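To reproduce, one can run Logstash against this file (assuming it is saved as config/encoding_test.conf inside a Logstash installation; the HTTP_COMPRESSION environment variable feeds the http_compression option above):

HTTP_COMPRESSION=true bin/logstash -f config/encoding_test.conf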

# BEFORE the fix: see https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/1168 logs
# AFTER the fix, the behavior is the same at any HTTP compression level (0~9): the data is indexed
[2024-03-18T09:06:31,891][DEBUG][org.apache.http.impl.conn.PoolingHttpClientConnectionManager][main][999000c22ac1744372923039d3bee405a92df01b3dafcd64f0830a24ad60acc6] Connection released: [id: 0][route: {s}->https://68f825b174ea43b5986baabce7534163.ca-central-1.aws.elastic-cloud.com:443][total available: 1; route allocated: 1 of 100; total allocated: 1 of 1000]
{
          "host" => {
        "name" => "localhost"
    },
      "@version" => "1",
    "@timestamp" => 2024-03-18T16:06:31.686611Z,
         "event" => {
        "sequence" => 0,
        "original" => "Hello world!"
    },
       "message" => "\xAC"
}

@mashhurs mashhurs linked an issue Mar 18, 2024 that may be closed by this pull request
@mashhurs mashhurs marked this pull request as draft March 18, 2024 15:51
@mashhurs mashhurs marked this pull request as ready for review March 18, 2024 20:06
@yaauie (Contributor) left a comment:

I think we disagree about the cause, and therefore disagree about next steps, but I agree that String#scrub! will prevent us from propagating invalid byte sequences to Elasticsearch regardless of whether compression is enabled.

The root of the issue is elastic/logstash#15833, in which Logstash's wrapper around JRJackson produces invalid UTF-8 when handling data-structures containing Strings that are not valid UTF-8, because while it escapes meaningful bytes properly it propagates the invalid UTF-8 byte sequences directly.

When this plugin calls through Manticore to make an uncompressed API request, Manticore passes our not-valid-UTF8 ruby-String through the JRuby bridge to invoke a java-constructor that expects a java-String, and the JRuby bridge implicitly scrubs the whole buffer. We get no such "for free" scrubbing when we build a binary compressed buffer ourselves using the ruby-native Zlib.
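(For illustration, a minimal sketch, mine rather than the review's, of why the Zlib path gives no implicit scrubbing: Ruby's Zlib round-trips bytes exactly, so an invalid byte survives compression untouched.)

require 'zlib'

raw = "\xAC".b                                           # an invalid UTF-8 byte, handled as binary
inflated = Zlib::Inflate.inflate(Zlib::Deflate.deflate(raw))
inflated == raw                                          # => true; no scrubbing happens on this route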

The solution proposed brings parity to the two routes by scrubbing each bulk action's JSON sequence before appending it to the buffer. We are "lucky" that scrubbing the generated invalid-JSON has the same net effect as producing clean JSON from a data-structure that has been deep-replaced with scrubbed components because JRJackson does correctly handle lower-ASCII control characters and the escaping of semantically-meaningful bytes in a string value.

Manticore is a thin layer on top of Apache Http Client, so it makes sense that Manticore uses Apache Http's StringEntity to represent string entities (which provides functionality for request headers indicating the size and semantic meaning of the request payload), and to use ByteArrayEntity to represent opaque binary sequences.

I would hesitate to take full ownership of encoding the literal bytes of the request payload (along with the requisite headers to indicate what those bytes mean), or to suggest that Manticore change their behaviour in general.


Scrubbing prior to appending to the buffer ensures that when the JSON generation produces invalid UTF-8 (and therefore invalid JSON), a scrubbed version is propagated to Elasticsearch regardless of whether we are sending through a compressed or uncompressed request.

When we fix the upstream issue in Logstash core, this scrub operation should effectively become a no-op.
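(A minimal sketch of the approach described above; the variable names are hypothetical and this is not the plugin's actual code.)

body = +""
bulk_action_jsons.each do |action_json|   # each serialized bulk action (one JSON line)
  body << action_json.scrub               # replace any invalid byte sequences with U+FFFD
  body << "\n"
end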


As a net, I am +1 to merging a fix to String#scrub! the JSON before appending it to the buffer. I do not believe we should add the logging (unless we substantially re-work it to improve signal:noise ratio to make it actionable), and I do not believe we should propagate the issue up the Manticore side.

Review threads (outdated, resolved): docs/index.asciidoc; lib/logstash/outputs/elasticsearch/http_client.rb (3 threads)
mashhurs and others added 2 commits March 18, 2024 15:04
Take out the logging since it does not help much, and add more context to the doc.

Co-authored-by: Ry Biesemeyer <[email protected]>
@yaauie (Contributor) left a comment:

Changes themselves look appropriate.

I've left a note about the changelog, and another about ruby-style in the specs; feel free to resolve as you see fit.

The Elastic Stack 7.17 specs were failing in a seemingly-unrelated way, so I kicked them.

Review threads (outdated, resolved): CHANGELOG.md; spec/integration/outputs/compressed_indexing_spec.rb
@jsvd (Member) commented on Mar 19, 2024

For future reference, the differentiated Manticore behaviour we see is this:

> string_entity = Java::org.apache.http.entity.StringEntity.new("\xAC")
=> #<Java::OrgApacheHttpEntity::StringEntity:0x2f930be7>
> Java::org.apache.http.util.EntityUtils.toString(string_entity)
=> "?"
> byte_array_entity = Java::org.apache.http.entity.ByteArrayEntity.new("\xAC".to_java_bytes)
=> #<Java::OrgApacheHttpEntity::ByteArrayEntity:0x27c243a3>
> Java::org.apache.http.util.EntityUtils.toString(byte_array_entity)
=> "¬"

Update the unit test to the suggested Ruby coding style and update the changelog.

Co-authored-by: Ry Biesemeyer <[email protected]>
@yaauie (Contributor) commented on Mar 19, 2024

Travis is failing in the 7.x integration tests because the logs for the job are too verbose, and Travis terminates the build when the logs exceed the job's maximum length:

The job exceeded the maximum log length, and has been terminated.

-- Job Output

I have run both such dockerized jobs locally and they are green (successful).

@mashhurs (Contributor, Author) commented on Mar 19, 2024

The following CI jobs failed, but I confirm they pass locally; the failures appear to be Travis-related, so they are not blockers.

2733.3 | INTEGRATION=true ELASTIC_STACK_VERSION=7.x | Linux | errored
2733.4 | INTEGRATION=true ELASTIC_STACK_VERSION=7.x SNAPSHOT=true LOG_LEVEL=info | Linux | errored

  • env vars
INTEGRATION=true
ELASTIC_STACK_VERSION=7.x
SNAPSHOT=true # without snapshot also fine
LOG_LEVEL=info
  • docker setup: ./.ci/docker-setup.sh
Fetching versions from https://raw.githubusercontent.com/elastic/logstash/master/ci/logstash_releases.json
"7.17.19-SNAPSHOT"
Translated 7.x to 7.17.19-SNAPSHOT
Testing against version: 7.17.19-SNAPSHOT
Pulling docker.elastic.co/logstash/logstash:7.17.19-SNAPSHOT
7.17.19-SNAPSHOT: Pulling from logstash/logstash
Digest: sha256:e1b7f8e923565d3b5307637d5a984d48ea82802cd110433ab03957e544711b56
Status: Image is up to date for docker.elastic.co/logstash/logstash:7.17.19-SNAPSHOT
docker.elastic.co/logstash/logstash:7.17.19-SNAPSHOT
Pulling docker.elastic.co/elasticsearch/elasticsearch:7.17.19-SNAPSHOT
...
 => => writing image sha256:3dcbb6d89a3a6f8e97439f40036897a00f9b43f3066b15d5327a6cbf15fac15f                                                                                        0.0s
[+] Building 8.0s (18/18) FINISHED 
  • docker run: ./.ci/docker-run.sh
w0w"  }
ci-logstash-1       |     increases number of successful inserted documents
ci-logstash-1       | 
ci-logstash-1       | Pending: (Failures listed here are expected and do not affect your suite's status)
ci-logstash-1       | 
ci-logstash-1       |   1) pool sniffer Simple sniff parsing with single node should return the correct sniff URL
ci-logstash-1       |      # No reason given
ci-logstash-1       |      # ./spec/integration/outputs/sniffer_spec.rb:37
ci-logstash-1       | 
ci-logstash-1       | 
ci-logstash-1       | Finished in 5 minutes 21 seconds (files took 4.22 seconds to load)
ci-logstash-1       | 136 examples, 0 failures, 1 pending
ci-logstash-1       | 
ci-logstash-1       | Randomized with seed 65243
ci-logstash-1       | 
ci-logstash-1 exited with code 0
Aborting on container exit...
[+] Running 2/2
 ✔ Container ci-elasticsearch-1  Stopped                                                                                                                                            0.3s 
 ✔ Container ci-logstash-1       Stopped  

Successfully merging this pull request may close this issue: Handling non UTF-8 data.
4 participants