Consider using UTF-8 when encoding is unspecified #200

Open
danstoner opened this issue Jan 11, 2022 · 0 comments
For example, attempting to ingest:

https://unhcollection.unh.edu/database/content/dwca/UNHC-UNHC_DwC-A.zip

The published Darwin Core Archive includes a meta.xml which has a blank encoding value:

encoding=""

The rest of that line looks like:

<core dateFormat="YYYY-MM-DD" encoding="" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">

The encoding value tells consumers of the occurrence file which character encoding to use when reading it.
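To make the blank-value case concrete, here is a minimal sketch of how a consumer might read that attribute. It is not the ingestion code in this repository; the function name `read_core_encoding` is hypothetical, and it assumes meta.xml uses the Darwin Core text namespace `http://rs.tdwg.org/dwc/text/` (with a tolerant lookup if the namespace is missing).

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace used by Darwin Core Archive meta.xml descriptors.
DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"


def read_core_encoding(meta_xml_text):
    """Return the core element's encoding attribute, or None if absent or blank."""
    root = ET.fromstring(meta_xml_text)
    core = root.find(f"{{{DWC_TEXT_NS}}}core")
    if core is None:
        # Tolerate a meta.xml that was written without the namespace.
        core = root.find("core")
    if core is None:
        raise ValueError("meta.xml has no <core> element")
    value = (core.get("encoding") or "").strip()
    # Treat encoding="" the same as a missing attribute.
    return value or None
```

With the `<core>` line quoted above, this sketch returns `None`, which is the situation the rest of this issue is about.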

The data provider has been unable to resolve the situation in over a year.

https://redmine.idigbio.org/issues/3002

Consider whether it is worth falling back to UTF-8 when the encoding is blank so the data can be ingested, or whether it still makes sense to hard fail, since a mismatched encoding could raise decode errors or silently corrupt non-ASCII characters.
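One possible middle ground is to fall back to UTF-8 but decode strictly, so that a mismatch still fails loudly instead of producing garbled text. The following is a sketch under that assumption, not a proposal for the exact implementation; `resolve_encoding`, `decode_occurrence_bytes`, and `FALLBACK_ENCODING` are hypothetical names, and it decodes the whole file in memory for simplicity.

```python
import codecs

FALLBACK_ENCODING = "utf-8"  # assumption: the fallback proposed in this issue


def resolve_encoding(declared):
    """Return (codec_name, was_fallback) for a declared encoding that may be blank."""
    if declared is None or not declared.strip():
        return FALLBACK_ENCODING, True
    name = declared.strip()
    codecs.lookup(name)  # raises LookupError for an unknown codec name
    return name, False


def decode_occurrence_bytes(raw, declared):
    """Decode strictly so a mismatched encoding fails instead of corrupting text."""
    encoding, was_fallback = resolve_encoding(declared)
    try:
        text = raw.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        source = "fallback" if was_fallback else "declared"
        raise ValueError(
            f"{source} encoding {encoding!r} does not match the data: {exc}"
        ) from exc
    return text, was_fallback
```

The returned `was_fallback` flag lets the caller log or report that the encoding was assumed rather than declared. Strict UTF-8 decoding rejects most files written in a single-byte encoding that contain non-ASCII characters, but it is not a guarantee: an ASCII-only file decodes cleanly under many encodings, so the check reduces the risk rather than eliminating it.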
