Support spatial / geo types #696
Comments
@quassy it is quite easy to add a new data type that will load data into a GEOGRAPHY column if the destination supports that. I'm not sure it will really solve your problem. From your description it looks like the real problem is on the Python side, where there's no good lib to represent geospatial data, convert across different coordinates etc.
Python has libraries to do that like geopandas, osgeo (GDAL), pyproj, shapely, geoparquet... They work for small datasets, but for EL pipelines they are often not performant enough. I'm not sure how dlt works on the inside, but geoparquet, binary (WKB) or geojson might be the best ways to represent spatial data. I would start with PostGIS as it's FLOSS, generally more popular and supports more geo types. You can self-host it, so it's also much easier to start developing and do tests. (BigQuery just scales much better.) Use cases are wide and I can only speak from experience in a small area. An example would be working with openly available geodata like the EU Inspire datasets, like conservation areas in Ireland. Using the above libraries the data can be loaded to PostGIS. But then there are tasks where OLAP databases (like BigQuery) are better suited, others where OLTP databases (like PostGIS) are better, and for others even flat file storage (as geojson, shp, WKT...), so you end up transferring data back and forth between systems.
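To make the binary (WKB) option mentioned above concrete, here is a minimal sketch of what a 2D point looks like in plain little-endian WKB, using only the standard library. The helper names `point_to_wkb` and `wkb_to_point` are made up for illustration and are not part of dlt or any geo library; SRID-carrying variants such as EWKB are deliberately left out.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (no SRID). Hypothetical helper."""
    # 1 byte : byte order flag (1 = little-endian)
    # 4 bytes: geometry type as uint32 (1 = Point)
    # 16 bytes: x and y as IEEE 754 doubles
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(data: bytes) -> tuple:
    """Decode a 2D WKB point back into an (x, y) tuple. Hypothetical helper."""
    # The first byte tells us how to read the rest of the payload.
    prefix = "<" if data[0] == 1 else ">"
    geom_type, x, y = struct.unpack(prefix + "Idd", data[1:21])
    if geom_type != 1:
        raise ValueError(f"not a 2D point, WKB geometry type is {geom_type}")
    return x, y


# Example coordinates (lon, lat) chosen only for illustration.
wkb = point_to_wkb(-6.2603, 53.3498)
print(len(wkb), wkb.hex())
print(wkb_to_point(wkb))
```

A payload like this can travel through a pipeline as an ordinary binary column, leaving it to the destination to interpret it as a geometry.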
I suggest we push as many operations to the target warehouse as possible. Libraries like gdal require library headers to be installed before a A simple in-memory representation like geojson (or geoarrow) should in theory suffice, support all sources, and even load to the destination as a json type as well. In the spirit of not dealing with the T in ELT, I don't think it's appropriate to do complex conversions, CRS projections etc. in dlt, as they require extensive specialist knowledge and depend too much on the individual use case.
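As a rough sketch of the "simple in-memory representation" idea, the snippet below keeps geometries as plain GeoJSON dicts and only serializes them (to a JSON string or to WKT) before loading. The function `geojson_to_wkt` and the sample row are hypothetical and not dlt API; the sketch handles 2D Point, LineString and Polygon only. Parsing is then left to the destination, for example with PostGIS `ST_GeomFromGeoJSON` or BigQuery `ST_GEOGFROMGEOJSON`.

```python
import json


def _coords(points) -> str:
    # Only the first two ordinates are used, so any Z value is dropped.
    return ", ".join(f"{p[0]} {p[1]}" for p in points)


def geojson_to_wkt(geom: dict) -> str:
    """Convert a 2D GeoJSON geometry dict to WKT. Hypothetical helper."""
    kind = geom["type"]
    coords = geom["coordinates"]
    if kind == "Point":
        return f"POINT ({coords[0]} {coords[1]})"
    if kind == "LineString":
        return f"LINESTRING ({_coords(coords)})"
    if kind == "Polygon":
        rings = ", ".join(f"({_coords(ring)})" for ring in coords)
        return f"POLYGON ({rings})"
    raise ValueError(f"unsupported geometry type: {kind}")


# Sample row, invented for illustration.
row = {
    "id": 1,
    "name": "example area",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]],
    },
}

# Option 1: ship the geometry as a JSON string and parse it in the warehouse.
as_json = {**row, "geometry": json.dumps(row["geometry"])}
# Option 2: ship it as WKT text.
as_wkt = {**row, "geometry": geojson_to_wkt(row["geometry"])}
print(as_wkt["geometry"])
```

Neither option needs compiled dependencies on the pipeline side, which is the point being made about gdal above.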
Re duplicate: #1101
Just to raise awareness for this topic again:
Our use case would be exporting spatial data from MS SQL Server, Postgres with the PostGIS extension, shapefiles, or geoparquet/geoarrow types. The challenge is that some data formats, especially the old-school ones, don't scale well and require batching.
We'd like to start with the simplest possible case:
I can understand wanting to only support PostGIS--it's the easiest to pick apart and there can be some overlap with DuckDB's spatial extension if you go that route.
Feature description
dlt should support geo types like shapes, geometries & geographies and different CRS as well as possible, to allow loading/transfer of such data in spatial databases like PostGIS (Postgres addon), SpatiaLite (SQLite addon), BigQuery (only 2D geographies), H2, Oracle Spatial...
Are you a dlt user?
I'd consider using dlt, but it's lacking a feature I need.
Use case
Natively, Postgres supports basic geometries, and PostGIS adds support for georeferenced geometries/geographies with different CRS (coordinate reference systems) & even 3D. BigQuery supports 2D geographies and only in CRS WGS84/EPSG:4326. As different database systems (and coordinate reference systems) fit different usages, data is often transferred and transformed between systems for certain workloads. Loading spatial data is quite cumbersome because of poor support from common libraries, different coordinate systems and expensive operations on geometries.
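To illustrate what moving between coordinate reference systems involves, here is a small standard-library sketch that converts a WGS84 lon/lat pair (EPSG:4326) to spherical Web Mercator (EPSG:3857) and back. The choice of EPSG:3857 is an assumption made for the example because its formulas are short; real workloads would normally rely on a library such as pyproj or on the database itself. The function names and the centimetre rounding step are illustrative only.

```python
import math

# EPSG:3857 treats the earth as a sphere with the WGS84 semi-major axis as radius.
R = 6378137.0


def lonlat_to_web_mercator(lon: float, lat: float) -> tuple:
    """Forward spherical Mercator projection, degrees to metres."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def web_mercator_to_lonlat(x: float, y: float) -> tuple:
    """Inverse spherical Mercator projection, metres to degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(2 * math.atan(math.exp(y / R)) - math.pi / 2)
    return lon, lat


# Example coordinates chosen only for illustration.
lon, lat = -6.2603, 53.3498
x, y = lonlat_to_web_mercator(lon, lat)

# Pretend the intermediate system stores metres at centimetre precision.
x_cm, y_cm = round(x, 2), round(y, 2)
lon2, lat2 = web_mercator_to_lonlat(x_cm, y_cm)

# The round trip comes back close to, but generally not exactly at, the input.
print(abs(lon2 - lon), abs(lat2 - lat))
```

The drift is tiny for a single round trip, but it is one reason a geometry that went through several systems may no longer compare equal to its original.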
Proposed solution
Support spatial data. In some directions data has to be converted to a certain CRS or to geojson/WKT because the target system might not support it otherwise. These conversions can be lossy, for example through small changes due to transformation and rounding of coordinates. Also, different DBMSs enforce validation ((counter-)clockwise polygons, self-intersections, touching points) differently, so sometimes not all geometries might be transferable.
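The polygon orientation point can be sketched with the shoelace formula: the sign of a ring's area tells whether it is wound clockwise or counter-clockwise. The helpers `signed_area` and `ensure_ccw` below are hypothetical names, and the sketch assumes planar coordinates and closed rings (first point equal to last). For reference, GeoJSON (RFC 7946) asks for counter-clockwise exterior rings; other systems may expect or tolerate something else, and orientation on a sphere has different semantics.

```python
def signed_area(ring) -> float:
    """Shoelace formula: positive for counter-clockwise rings, negative for clockwise."""
    total = 0.0
    # Walk consecutive vertex pairs; the ring is assumed to be closed.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def ensure_ccw(ring):
    """Return the ring in counter-clockwise order, reversing it if needed."""
    return ring if signed_area(ring) >= 0 else ring[::-1]


# A unit square wound clockwise (up, right, down, left).
clockwise_square = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]

print(signed_area(clockwise_square))
print(ensure_ccw(clockwise_square))
```

Self-intersections and touching points need more involved checks, which is a good argument for leaving validation to the target database.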
Related issues
No response