Consider using UTF-8 when encoding is unspecified #200

danstoner · 2022-01-11T12:10:12Z

For example, attempting to ingest:

https://unhcollection.unh.edu/database/content/dwca/UNHC-UNHC_DwC-A.zip

The published Darwin Core Archive includes a meta.xml which has a blank encoding value:

encoding=""

The rest of that line looks like:

<core dateFormat="YYYY-MM-DD" encoding="" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">

The encoding value tells consumers of the occurrence file how to decode it properly.
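To illustrate why the declared encoding matters, here is a minimal sketch (the sample string is made up, not taken from the UNH data) showing the same bytes decoding to different text depending on which encoding the consumer assumes:

```python
# The same bytes yield different text depending on the assumed encoding.
raw = "Müller".encode("utf-8")       # bytes as they might sit in the occurrence file

as_utf8 = raw.decode("utf-8")        # decoded with the matching encoding
as_latin1 = raw.decode("latin-1")    # decoded with a mismatched encoding

print(as_utf8)    # Müller
print(as_latin1)  # MÃ¼ller (garbled, and no error is raised)
```

Latin-1 accepts every byte value, so this kind of mismatch is silent rather than a hard failure.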

The data provider has been unable to resolve the situation in over a year.

https://redmine.idigbio.org/issues/3002

Consider whether it is worth defaulting to UTF-8 in this situation so the data can be ingested, or whether it still makes sense to hard fail, since there is a chance of "bad things" (for example, silently garbled text) if the encoding turns out to be mismatched.
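One possible middle ground, sketched below under stated assumptions, is to fall back to UTF-8 only when the `encoding` attribute is blank or missing, and then decode strictly so that invalid byte sequences still cause a failure. The names `DEFAULT_ENCODING`, `core_encoding`, and `decode_occurrences` are hypothetical and are not taken from the existing ingestion code:

```python
import xml.etree.ElementTree as ET

DEFAULT_ENCODING = "utf-8"  # assumption: the fallback proposed in this issue


def core_encoding(meta_xml: str) -> str:
    """Return the encoding declared on <core>, or the fallback if it is blank or missing."""
    root = ET.fromstring(meta_xml)
    for elem in root.iter():
        # Compare the local tag name so a namespaced <core> is matched too.
        if elem.tag.rsplit("}", 1)[-1] == "core":
            declared = (elem.get("encoding") or "").strip()
            return declared or DEFAULT_ENCODING
    raise ValueError("meta.xml has no <core> element")


def decode_occurrences(data: bytes, encoding: str) -> str:
    """Decode strictly: bytes that are invalid in `encoding` raise UnicodeDecodeError."""
    return data.decode(encoding, errors="strict")


# Hypothetical meta.xml fragment with a blank encoding, as in the archive above.
meta = '<archive><core encoding="" ignoreHeaderLines="1"/></archive>'
enc = core_encoding(meta)
text = decode_occurrences("id,name\n1,Müller\n".encode("utf-8"), enc)
```

Strict decoding only catches part of the problem: a Latin-1 file containing non-ASCII characters will usually fail to decode as UTF-8, but a pure-ASCII file decodes identically either way, and strictness offers no protection when a wrong single-byte encoding is explicitly declared.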

danstoner added the enhancement label Jan 11, 2022