Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

augur curate passthru can add double quotations #1312

Closed
joverlee521 opened this issue Sep 12, 2023 · 6 comments · Fixed by #1493 or #1565
Closed

augur curate passthru can add double quotations #1312

joverlee521 opened this issue Sep 12, 2023 · 6 comments · Fixed by #1493 or #1565
Assignees
Labels
bug Something isn't working

Comments

@joverlee521
Copy link
Contributor

Current Behavior

When using the --metadata input, field values with double quotes in them can result in additional double quotes in the output.

Since the metadata is read through csv.DictReader, we can probably tweak this behavior through the csv.Dialect attributes

metadata_reader = csv.DictReader(handle, dialect=dialect)

Additional context

This was first observed in nextstrain/mpox#179

@joverlee521 joverlee521 added the bug Something isn't working label Sep 12, 2023
@tsibley
Copy link
Member

tsibley commented Sep 12, 2023

The escaped double quotes ("" for internal quotes) are more correct, no?

@joverlee521
Copy link
Contributor Author

The escaped double quotes ("" for internal quotes) are more correct, no?

Hmm, yes. I guess the issue comes up when running through augur curate multiple times.
The double quotes keep getting added on with each pass. This is an example of a string with internal quotes that goes through curate 3 times:

SRC VB "Vector", Molecular Biology of Genomes
"SRC VB ""Vector"", Molecular Biology of Genomes"
"SRC VB ""Vector"""", Molecular Biology of Genomes"""
"SRC VB ""Vector"""""""", Molecular Biology of Genomes"""""""

@tsibley
Copy link
Member

tsibley commented Sep 12, 2023

OH! Yeah, that's a misconfiguration of the parser/producer then.

@joverlee521 joverlee521 self-assigned this Jun 26, 2024
joverlee521 added a commit that referenced this issue Jun 27, 2024
Shows the undesired behavior with internal quotes described in
<#1312> comes from the
`write_records_to_tsv` function.
joverlee521 added a commit that referenced this issue Jun 27, 2024
According to TSV specs,¹ there are no restrictions on special characters
other than tabs are not allowed in a field. This is different from the
CSV specs,² which require double quotes around fields that contain
special characters.

Since this function only produces TSVs, follow the TSV specs and stop
adding quotes.

Resolves <#1312>

¹ <https://www.iana.org/assignments/media-types/text/tab-separated-values>
² <https://datatracker.ietf.org/doc/html/rfc4180#page-2>
joverlee521 added a commit that referenced this issue Jun 27, 2024
Shows the undesired behavior with internal quotes described in
<#1312> comes from the
`write_records_to_tsv` function.
joverlee521 added a commit that referenced this issue Jun 27, 2024
According to TSV specs,¹ there are no restrictions on special characters
other than tabs are not allowed in a field. This is different from the
CSV specs,² which require double quotes around fields that contain
special characters.

Since this function only produces TSVs, follow the TSV specs and stop
adding quotes.

Resolves <#1312>

¹ <https://www.iana.org/assignments/media-types/text/tab-separated-values>
² <https://datatracker.ietf.org/doc/html/rfc4180#page-2>
@tsibley
Copy link
Member

tsibley commented Aug 1, 2024

I think we want to revisit the change here, i.e. 915672e.

When I said:

OH! Yeah, that's a misconfiguration of the parser/producer then.

What I meant was, "we should fix the configuration of the parser/producer to not endlessly add additional quotes, but we should still use RFC-compliant CSV-like quoting in our TSVs rather than no quoting". With QUOTE_NONE, as added by this change, we've made it impossible for TSV field values to contain tabs or newlines; they'll error out.

Relatedly: #1563 (comment)

@tsibley tsibley mentioned this issue Aug 1, 2024
7 tasks
@tsibley
Copy link
Member

tsibley commented Aug 1, 2024

For example:

$ augur curate passthru --output-metadata - <<<'{"x":"\u0009"}'
x
Traceback (most recent call last):
  File "/home/tom/nextstrain/augur/augur/__init__.py", line 69, in run
    return args.__command__.run(args)
  File "/home/tom/nextstrain/augur/augur/curate/__init__.py", line 243, in run
    write_records_to_tsv(validated_output_records, args.output_metadata)
  File "/home/tom/nextstrain/augur/augur/io/metadata.py", line 556, in write_records_to_tsv
    tsv_writer.writerow(first_record)
  File "/home/tom/.conda/envs/augur/lib/python3.9/csv.py", line 154, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
_csv.Error: need to escape, but no escapechar set


An error occurred (see above) that has not been properly handled by Augur.
To report this, please open a new issue including the original command and the error above:
    <https://github.com/nextstrain/augur/issues/new/choose>

@joverlee521 joverlee521 reopened this Aug 1, 2024
@joverlee521
Copy link
Contributor Author

What I meant was, "we should fix the configuration of the parser/producer to not endlessly add additional quotes, but we should still use RFC-compliant CSV-like quoting in our TSVs rather than no quoting".

Ah gotcha. I was wondering why augur filter didn't have the same issue even though it uses QUOTE_MINIMAL and realized that setting to QUOTE_NONE fixed the bug but not the underlying issue. The reader is what adds the extra quotes, not the writer 🤦‍♀️

joverlee521 added a commit that referenced this issue Aug 1, 2024
Resolves <#1312>

We are expecting the CSV-like double quoting when there are internal
quotes. If the field value is already correctly double quoted, then
there should not be any additional quotes.
joverlee521 added a commit that referenced this issue Aug 1, 2024
Resolves <#1312>

We are expecting the CSV-like double quoting when there are internal
quotes. If the field value is already correctly double quoted, then
there should not be any additional quotes.
joverlee521 added a commit that referenced this issue Nov 12, 2024
Resolves <#1312>

We are expecting the CSV-like double quoting when there are internal
quotes. If the field value is already correctly double quoted, then
there should not be any additional quotes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
2 participants