When event JSON data contains non UTF-8 invalid bytes, replace with replacement characters. #1169
Conversation
I think we disagree about the cause, and therefore about next steps, but I agree that `String#scrub!` will prevent us from propagating invalid byte sequences to Elasticsearch regardless of whether compression is enabled.
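For readers unfamiliar with it, a minimal sketch of `String#scrub!`'s effect (the sample string is illustrative):

```ruby
# "\xE9" is a lone Latin-1 byte, not a valid UTF-8 sequence.
json = "{\"message\":\"caf\xE9\"}".force_encoding(Encoding::UTF_8)
json.valid_encoding?  # => false

json.scrub!           # replaces invalid bytes with "\uFFFD" by default
json.valid_encoding?  # => true
json                  # => "{\"message\":\"caf\uFFFD\"}"
```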
The root of the issue is elastic/logstash#15833, in which Logstash's wrapper around JRJackson produces invalid UTF-8 when handling data structures containing Strings that are not valid UTF-8: while it escapes semantically-meaningful bytes properly, it propagates the invalid UTF-8 byte sequences through unchanged.
When this plugin calls through Manticore to make an uncompressed API request, Manticore passes our not-valid-UTF-8 Ruby `String` through the JRuby bridge to invoke a Java constructor that expects a `java.lang.String`, and the JRuby bridge implicitly scrubs the whole buffer. We get no such "for free" scrubbing when we build a binary compressed buffer ourselves using the Ruby-native `Zlib`.
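To illustrate the compressed path (a minimal sketch, not the plugin's actual code), nothing in the Ruby-native `Zlib` pipeline inspects or repairs the bytes it is handed:

```ruby
require 'zlib'
require 'stringio'

# Invalid UTF-8, exactly as generated upstream.
payload = "{\"message\":\"caf\xE9\"}".force_encoding(Encoding::UTF_8)

io = StringIO.new
gz = Zlib::GzipWriter.new(io)
gz.write(payload)   # bytes pass through verbatim; no implicit scrubbing
gz.close

compressed = io.string  # the invalid byte survives, gzipped, to Elasticsearch
```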
The proposed solution brings parity to the two routes by scrubbing each bulk action's JSON sequence before appending it to the buffer. We are "lucky" that scrubbing the generated invalid JSON has the same net effect as producing clean JSON from a data structure whose components have been deep-replaced with scrubbed equivalents, because JRJackson does correctly handle lower-ASCII control characters and the escaping of semantically-meaningful bytes in a string value.
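A hypothetical sketch of that shape (variable names are illustrative, not the plugin's actual internals):

```ruby
buffer = +""
actions = [
  "{\"index\":{}}",
  "{\"message\":\"caf\xE9\"}".force_encoding(Encoding::UTF_8), # invalid UTF-8
]
actions.each do |action_json|
  buffer << action_json.scrub << "\n"  # scrub each action before appending
end
```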
Manticore is a thin layer on top of Apache HttpClient, so it makes sense that Manticore uses Apache's `StringEntity` to represent string entities (which provides functionality for request headers indicating the size and semantic meaning of the request payload) and `ByteArrayEntity` to represent opaque binary sequences.
I would hesitate to take full ownership of encoding the literal bytes of the request payload (along with the requisite headers to indicate what those bytes mean), or to suggest that Manticore change its behaviour in general.
Scrubbing prior to appending to the buffer ensures that when the JSON generation produces invalid UTF-8 (and therefore invalid JSON), a scrubbed version is propagated to Elasticsearch regardless of whether we are sending through a compressed or uncompressed request.
When we fix the upstream issue in Logstash core, this scrub operation should effectively become a no-op.
On net, I am +1 on merging a fix that `String#scrub!`s the JSON before appending it to the buffer. I do not believe we should add the logging (unless we substantially rework it to improve its signal-to-noise ratio and make it actionable), and I do not believe we should push the issue up to the Manticore side.
Take out logging since it does not help much, and add more context to the docs. Co-authored-by: Ry Biesemeyer <[email protected]>
The changes themselves look appropriate.
I've left a note about the changelog, and another about Ruby style in the specs; feel free to resolve as you see fit.
The Elastic Stack 7.17 specs were failing in a seemingly-unrelated way, so I kicked them.
For future reference, the Manticore differentiated behaviour we see is this:
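Roughly, in hedged JRuby terms (a paraphrase of the behaviour described above, not Manticore's actual source):

```ruby
require 'java'
java_import org.apache.http.entity.StringEntity
java_import org.apache.http.entity.ByteArrayEntity

body = "{\"message\":\"caf\xE9\"}".force_encoding(Encoding::UTF_8) # invalid UTF-8

# Uncompressed path: the JRuby bridge converts the Ruby String into the
# java.lang.String this constructor expects, implicitly scrubbing invalid bytes.
text_entity = StringEntity.new(body)

# Compressed path: raw bytes are handed over as-is; nothing scrubs them.
binary_entity = ByteArrayEntity.new(body.to_java_bytes)
```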
Updated the unit test to the suggested Ruby coding style and updated the changelog. Co-authored-by: Ry Biesemeyer <[email protected]>
Travis is failing in the 7.x integration tests because the logs for the job are too verbose, and Travis terminates the build when the logs go over the job's maximum length.
I have run both such dockerized jobs locally and they are green (successful).
The following CI jobs failed, but I confirm they pass on my local machine; the failures appear to be Travis-related, so they are not blockers: 2733.3 | INTEGRATION=true ELASTIC_STACK_VERSION=7.x | Linux | errored
Description
Current buggy behaviours:
- When compression is enabled (`compression_level > 0`), the event gets rejected if it contains invalid non-UTF-8 byte sequences.
- When compression is disabled (`compression_level = 0`), the event gets accepted even though it contains invalid non-UTF-8 byte sequences. The reason: under the hood, the `manticore` HTTP client replaces them (each 1-byte invalid sequence becomes 3 bytes; the 2 extra bytes can be seen in Apache trace logs) when it uses Apache's `StringEntity`. A sketch of this byte expansion follows the list below.
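As a sketch of that byte expansion (illustrative; it assumes the replacement lands as U+FFFD, which is what the trace-log observation suggests):

```ruby
# U+FFFD encodes to three bytes in UTF-8, so each replaced 1-byte invalid
# sequence grows the payload by two bytes.
replaced = "\xE9".force_encoding(Encoding::UTF_8).scrub
replaced.bytes.map { |b| format("%02x", b) }  # => ["ef", "bf", "bd"]
```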
This PR introduces an immediate fix and opens a discussion for the general long-term use case. Verified with Apache client trace logs that the bytes sent do not change.
- Immediate fix: scrub invalid non-UTF-8 byte sequences, replacing them with the Unicode replacement character (`\uFFFD`). The idea follows best practice (this is how most current software behaves, editors for example) and has the benefit of utilizing as much of the event's valid content as possible instead of throwing the event away.
- For the long term (requires a discussion): see the comment "When event JSON data contains non UTF-8 invalid bytes, replace with replacement characters." #1169 (review). The `manticore` HTTP client has logic where, if a request body is given, it uses either `ByteArrayEntity` or (Apache's common core) `StringEntity`. Since `StringEntity`'s behaviour is to convert the payload, the original bytes will change if they are invalid UTF-8. From my point of view, `manticore` shouldn't perform any conversion regardless of encoding, and should use `ByteArrayEntity`. No idea what feature/behaviour `manticore` was going to provide with `StringEntity`. A hedged sketch of the `ByteArrayEntity` approach follows below.
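A hedged JRuby sketch of that suggestion (the URL and request wiring are illustrative assumptions, not Manticore's API):

```ruby
require 'java'
java_import org.apache.http.client.methods.HttpPost
java_import org.apache.http.entity.ByteArrayEntity

body = "{\"message\":\"caf\xE9\"}" # possibly-invalid UTF-8 bulk payload

# Hand Apache HttpClient the literal bytes so no implicit conversion can
# occur; the content-type header still tells the server what the bytes mean.
entity = ByteArrayEntity.new(body.to_java_bytes)
entity.set_content_type("application/json; charset=UTF-8")

request = HttpPost.new("http://localhost:9200/_bulk") # illustrative URL
request.set_entity(entity)
```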