Allow for unescaped non-ascii characters (preferably utf8 encoded) #96

Open
sjoerd-vogel opened this issue Jun 18, 2018 · 0 comments

Currently all output appears to be escaped by org.apache.commons.lang.StringEscapeUtils::escapeJava, which is designed to escape strings for use in Java source code (i.e. a string escaped this way can be copy-pasted directly into a .java file). This includes encoding non-ASCII characters in a \u[codepoint] format, which the CSV reader of our choice did not expect. I propose adding an option to not escape the output in this way. If no double quotes or line breaks appear in the original string, unescaped output is perfectly fine when dealing with CSV files.
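To make the effect concrete without pulling in commons-lang, here is a minimal sketch that approximates only the non-ASCII part of that escaping. The class `UnicodeEscapeDemo` and the method `escapeNonAscii` are hypothetical names invented for this illustration, not the library's implementation, and the real escapeJava also handles quotes, backslashes and control characters.

```java
import java.util.Locale;

class UnicodeEscapeDemo {
    // Hypothetical helper: replaces every char above 0x7F with a backslash,
    // the letter 'u', and four uppercase hex digits. ASCII chars pass through.
    static String escapeNonAscii(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch > 0x7f) {
                String hex = Integer.toHexString(ch).toUpperCase(Locale.ROOT);
                sb.append("\\u");
                // Left-pad the hex digits to a width of four.
                for (int pad = hex.length(); pad < 4; pad++) {
                    sb.append('0');
                }
                sb.append(hex);
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The input contains one non-ASCII char (e with acute accent).
        System.out.println(escapeNonAscii("caf\u00e9"));
    }
}
```

A CSV reader that does not know about this convention would see the backslash sequence as literal text instead of the accented character, which is the mismatch reported here.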

Additionally, all instances of PrintStream are created with a single-argument constructor. A PrintStream constructed this way uses the platform default charset, and where that charset cannot represent a non-ASCII character, it apparently reduces the character to a question mark (?). To allow for UTF-8 output, these calls could simply be replaced by the three-parameter constructor, using the following substitution:

new PrintStream(param) -> new PrintStream(param, false, StandardCharsets.UTF_8.name());

where false is the autoflush setting, which matches the default of the single-parameter constructor.
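A minimal sketch of the substitution, assuming the stream wraps an OutputStream. The class `Utf8PrintStreamDemo` and the helper `printUtf8` are hypothetical names used only for this illustration, and a ByteArrayOutputStream stands in for whatever the project actually writes to. The three-parameter constructor declares UnsupportedEncodingException, so call sites need to handle it.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class Utf8PrintStreamDemo {
    // Hypothetical helper: prints text through a PrintStream with an explicit
    // UTF-8 charset and returns the raw bytes that reached the sink.
    static byte[] printUtf8(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Second argument is autoflush (false, as in the one-argument constructor).
        try (PrintStream out = new PrintStream(sink, false, StandardCharsets.UTF_8.name())) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Four chars, but the accented one takes two bytes in UTF-8.
        byte[] bytes = printUtf8("caf\u00e9");
        System.out.println(bytes.length);
    }
}
```

Naming the charset explicitly keeps the output independent of the platform default, which is the point of the proposed change.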

It would be even better to allow type-specific escapes (in the case of CSV: escape double quotes by doubling them), but this could be a separate effort.
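The CSV-specific escape mentioned above could look roughly like this. It is a sketch under stated assumptions: the class `CsvFieldEscaper` and its `escape` method are hypothetical, the separator is assumed to be a comma, and fields are only quoted when they contain a double quote, a separator or a line break.

```java
class CsvFieldEscaper {
    // Hypothetical helper: leaves simple fields untouched (including any
    // non-ASCII characters) and quotes fields that need it, doubling any
    // embedded double quotes.
    static String escape(String field) {
        boolean needsQuoting = field.indexOf('"') >= 0
                || field.indexOf(',') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuoting) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(escape("caf\u00e9"));
        System.out.println(escape("say \"hi\""));
    }
}
```

With this approach non-ASCII text is written as-is, so it only produces readable output when combined with a stream that encodes in UTF-8.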

I would be happy to create a merge-request.
