is StringDeserializer the right default to work correctly with codec charset? #13

colinsurprenant · 2020-02-13T21:14:43Z

I recently reviewed an issue where the data in Kafka was encoded in ISO8859-1 and we could not decode it correctly using charset => "ISO8859-1" in the codec.

It appears that when using the org.apache.kafka.common.serialization.StringDeserializer (the default), the Kafka lib assumes the data is UTF-8, so the kafka input receives strings that have already been decoded with the wrong charset.

Per the Kafka docs (https://kafka.apache.org/10/javadoc/org/apache/kafka/common/serialization/StringDeserializer.html):

String encoding defaults to UTF8 and can be customized by setting the property key.deserializer.encoding, value.deserializer.encoding or deserializer.encoding. The first two take precedence over the last.
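
To illustrate the mismatch, here is a minimal sketch that does not use Kafka at all. It only assumes that a UTF-8 string deserializer behaves like `new String(bytes, UTF_8)`; the class and method names are hypothetical and the sample text "café" is made up for the demo.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Stand-in for what a UTF-8 string deserializer does with the raw record bytes.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Stand-in for a codec that receives the untouched bytes and applies its charset.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the last byte is 0xE9, which is not valid UTF-8 on its own.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String viaUtf8 = decodeAsUtf8(raw);
        String viaLatin1 = decodeAsLatin1(raw);

        // The UTF-8 path replaces the invalid byte, so the original character is lost.
        System.out.println("UTF-8 decode matches original:      " + viaUtf8.equals("caf\u00e9"));
        // The ISO-8859-1 path recovers the original text.
        System.out.println("ISO-8859-1 decode matches original: " + viaLatin1.equals("caf\u00e9"));
    }
}
```

Once the invalid byte has been replaced during the UTF-8 decode, the original byte value is gone from the string, so a later charset conversion has nothing correct to work from.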

I believe (not tested) that setting the property value.deserializer.encoding to ISO8859-1 would have worked.
OTOH, using the org.apache.kafka.common.serialization.ByteArrayDeserializer and setting charset => "ISO8859-1" in the codec worked correctly.
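
As a rough sketch of the two options at the Kafka consumer level, the properties could look like the following. The property names come from the Kafka docs quoted above; the class and method names are hypothetical, the first variant is untested (as noted), and how the plugin would pass these properties through is not shown here.

```java
import java.util.Properties;

public class ConsumerEncodingConfig {
    // Option 1 (untested): keep StringDeserializer but tell it the real encoding.
    static Properties stringDeserializerWithLatin1() {
        Properties props = new Properties();
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer.encoding", "ISO-8859-1");
        return props;
    }

    // Option 2: hand raw bytes downstream and let the codec's charset setting decode them.
    static Properties byteArrayDeserializer() {
        Properties props = new Properties();
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(stringDeserializerWithLatin1());
        System.out.println(byteArrayDeserializer());
    }
}
```

With the second option the charset is configured in one place only (the codec), which is the behavior the rest of this issue argues for.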

This leads me to think that we should probably use the ByteArrayDeserializer by default, so that the input is compatible out of the box with our codecs and their charset conversion.

In any case we should also have a note about this in the docs.


colinsurprenant added the bug and enhancement labels on Feb 13, 2020
