Encoding issue with geospatial data (shapefile) #394

Open
2 tasks done
vlebert opened this issue Sep 10, 2024 · 5 comments

Comments

vlebert commented Sep 10, 2024

What happens?

When trying to import a shapefile encoded in CP1252, I get the following error:

InvalidInputException: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update

I tried various options in st_read, with no success.

My current workaround is to first convert the shapefile to GeoParquet with ogr2ogr, then import the GeoParquet file.

To Reproduce

CREATE TABLE t_adresse AS SELECT * FROM st_read('source/t_adresse.shp');

Note: the shapefile does have a .cpg file declaring the encoding.

Even forcing the encoding fails:

CREATE TABLE t_adresse AS SELECT * FROM st_read('source/t_adresse.shp', open_options=['ENCODING=WINDOWS-1252']);
CREATE TABLE t_adresse AS SELECT * FROM st_read('source/t_adresse.shp', open_options=['ENCODING=CP-1252']);
CREATE TABLE t_adresse AS SELECT * FROM st_read('source/t_adresse.shp', open_options=['ENCODING=CP1252']);

However, SELECT * FROM ST_READ('source/t_adresse.shp') does not give an error (in Python).

Current workaround:

ogr2ogr -f parquet t_adresse.parquet source/t_adresse.shp

CREATE TABLE t_adresse AS SELECT * FROM 't_adresse.parquet';

OS:

MacOS

DuckDB Version:

1.1

DuckDB Client:

Python

Hardware:

No response

Full Name:

Valérian LEBERT

Affiliation:

Digi-Studio

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

No - I cannot easily share my data sets due to their large size

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
Member

Maxxen commented Sep 10, 2024

Hi! Thanks for opening this issue! Unfortunately, DuckDB does not support encodings other than UTF-8. Even though st_read uses GDAL under the hood, I think the issue is that we don't bundle the (optional) iconv library that gives GDAL the capability to re-encode text.

We are trying to reduce the number of dependencies in the spatial extension, so it is unlikely that this use case will ever be supported.
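
To illustrate why CP1252 bytes fail a UTF-8 validity check: in CP1252 an accented letter such as é is the single byte 0xE9, whereas in UTF-8 that byte can only start a multi-byte sequence. The following is a minimal sketch using only the Python standard library. The sample string is made up for illustration and is not taken from the reported dataset, and `is_valid_utf8` is a hypothetical helper, not part of DuckDB or GDAL.

```python
# Hypothetical sample value; the real shapefile contents are not shown in this thread.
text = "Allée des Érables"

raw_cp1252 = text.encode("cp1252")  # bytes as a CP1252-encoded text field would hold them
raw_utf8 = text.encode("utf-8")     # bytes as a UTF-8-only engine expects them


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(raw_utf8))              # the UTF-8 bytes pass
print(is_valid_utf8(raw_cp1252))            # the CP1252 bytes do not
print(raw_cp1252.decode("cp1252") == text)  # but they round-trip with the right codec
```

Without a re-encoding step between the file and the engine, the CP1252 bytes reach a UTF-8 validity check unchanged, which is consistent with the error reported above.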

Author

vlebert commented Sep 10, 2024

That's bad news, as there are many shapefiles with exotic encodings in the wild :)

Could we at least get garbage text fields instead of a fatal error?

In many cases, the accented characters may be in columns that are not even used in the dataflow.

Currently:

  • we don't know which column causes the error
  • we can't import the tables

In other tools, encoding is often an issue, but it does not cause a critical error.
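
As a rough way to find out which column is responsible, one could scan the raw bytes of each text field and flag the columns that contain invalid UTF-8. The sketch below assumes the raw field values are already available as Python `bytes`, however they were obtained. The function name, the column names, and the sample values are all hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def columns_with_invalid_utf8(
    columns: Dict[str, Iterable[Optional[bytes]]]
) -> List[str]:
    """Return the names of columns holding at least one value that is not valid UTF-8."""
    bad = []
    for name, values in columns.items():
        for raw in values:
            if raw is None:  # NULL field, nothing to validate
                continue
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(name)
                break  # one bad value is enough to flag the column
    return bad


# Made-up sample data, not from the reported dataset.
sample = {
    "id_adresse": [b"A001", b"A002"],
    "nom_voie": ["Rue de l'Église".encode("cp1252"), b"Rue Victor Hugo"],
    "commune": [None, b"Lyon"],
}
print(columns_with_invalid_utf8(sample))  # → ['nom_voie']
```

Columns that come back clean are either pure ASCII or already valid UTF-8, so only the flagged ones need special handling.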

Member

Maxxen commented Sep 10, 2024

The spatial extension has its own experimental shapefile reader, st_readshp, to which you should be able to pass an optional encoding := 'blob' argument. It reads all string fields as DuckDB BLOBs, which you can then decode() into VARCHAR if they are valid UTF-8.

szarnyasg transferred this issue from duckdb/duckdb Sep 11, 2024
Author

vlebert commented Sep 11, 2024

Using the experimental st_readshp (without the extra argument), I could load the dataset. All text fields are loaded as BLOBs.
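
Once the text fields are available as raw bytes, they can be decoded on the client side. The sketch below assumes that BLOB values arrive in Python as `bytes` objects and that the source encoding is CP1252, as stated at the top of this issue. `decode_field` is a hypothetical helper, not part of DuckDB or the spatial extension.

```python
from typing import Optional


def decode_field(raw: Optional[bytes], fallback: str = "cp1252") -> Optional[str]:
    """Decode a raw text field: try UTF-8 first, then the assumed fallback encoding."""
    if raw is None:
        return None
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" maps bytes that are undefined in the fallback
        # code page to U+FFFD instead of raising an exception.
        return raw.decode(fallback, errors="replace")


# Made-up sample values for illustration.
print(decode_field("Valérian".encode("cp1252")))
print(decode_field("Valérian".encode("utf-8")))
```

Trying UTF-8 first is a heuristic: a CP1252 byte sequence that happens to be valid UTF-8 would be decoded as UTF-8. If the whole file is known to be CP1252, decoding with that codec directly avoids the ambiguity.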


rouault commented Sep 14, 2024

"I think the issue is that we don't bundle the (optional) iconv library that gives GDAL the capability to re-encode text."

OSGeo/gdal#10799 should improve that situation
