Geo extensions for DataFrame #875

Closed
AndreiKingsley opened this issue Sep 20, 2024 · 3 comments · Fixed by #909

@AndreiKingsley
Collaborator

Add a module for working with geodata in DataFrame.

Essentially every geo format stores a table of geometric objects: ordinary records with data that also carry a geometry, defined in one way or another. It is therefore natural to read geodata into a DataFrame: a simple, flat frame (possibly with nested columns) plus a geometry column. We also need to transform this data and run calculations on it (see the examples below). Working name: GeoDataFrame.

Formats

To start, let's support the two most popular data formats, GeoJSON and Shapefile (the rest will be easy to add if needed):

GeoJSON has higher priority because it is more modern and easier to share (unlike a shapefile, it is textual and comes as a single file). Nevertheless, shapefile is definitely worth supporting, as it is widely used (and if we use GeoTools, it will cost nothing).

Possible APIs:

GeoDataFrame.read("country.json"): GeoDataFrame
GeoDataFrame.read("country.geojson"): GeoDataFrame
GeoDataFrame.read("country.geo.json"): GeoDataFrame

GeoDataFrame.readGeoJSON("country.json"): GeoDataFrame
GeoDataFrame.readGeoJSON("country.geojson"): GeoDataFrame
GeoDataFrame.readGeoJSON("country.geo.json"): GeoDataFrame

GeoDataFrame.readJSON("country.json"): GeoDataFrame
GeoDataFrame.readJSON("country.geojson"): GeoDataFrame
GeoDataFrame.readJSON("country.geo.json"): GeoDataFrame

GeoDataFrame.read("country.shp"): GeoDataFrame
GeoDataFrame.readShapefile("country.shp"): GeoDataFrame

Reading result:

Reading the following GeoJSON should produce a DataFrame with the schema shown after it:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "MultiPolygon",
        "coordinates": [102.0, 0.5]
      },
      "properties": {
        "name": "Netherlands",
        "gdp": {
           "in_2021": 34121.1,
           "in_2022": 65463.1
        }
      }
    },
...
| name | gdp               | geometry |
|      |-------------------|          |
|      | in_2021 | in_2022 |          |
---------------------------------------

name: String
gdp:
    in_2021: Double
    in_2022: Double
geometry: Geometry
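The schema above could be mirrored by plain Kotlin types roughly as follows. This is only a sketch: `Geometry`, `Gdp`, `CountryRow` and `sampleRow` are hypothetical names, `Geometry` is a stand-in for a real geometry type, and `Double` is used for the GDP columns because the sample values are fractional.

```kotlin
// Hypothetical stand-in for a real geometry type (for example a JTS Geometry).
data class Geometry(val coordinates: List<Double>)

// Nested "gdp" column group taken from the feature's "properties".
data class Gdp(val in_2021: Double, val in_2022: Double)

// One row: the flattened properties of a feature plus its geometry.
data class CountryRow(val name: String, val gdp: Gdp, val geometry: Geometry)

// Builds the row that corresponds to the sample feature above.
fun sampleRow() = CountryRow(
    name = "Netherlands",
    gdp = Gdp(in_2021 = 34121.1, in_2022 = 65463.1),
    geometry = Geometry(listOf(102.0, 0.5)),
)

fun main() {
    val row = sampleRow()
    println("${row.name}: ${row.gdp.in_2021} -> ${row.gdp.in_2022}")
}
```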

Implementation:
Reading the most popular formats (including GeoJSON and shapefile) is supported by GeoTools. The internal representation in GeoTools is similar to a DataFrame, so converting it to a GeoDataFrame should not be difficult.

Geometry

Next, to work with geometry we need a type for it: Geometry. Here the only sensible choice seems to be the JTS library. It is the standard solution not only in Java but, through its ports, in C++ and Python as well. Moreover, GeoTools itself uses JTS, so if we read data via GeoTools we don't need any additional conversions.

Processing

Examples of possible operations on a GeoDataFrame:

  • Filter geometries in bounds
  • Move something (for example, move Iceland to plot a prettier map of Europe)
  • Transform coordinate system
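
The first two operations can be sketched with plain Kotlin collections. Everything below is hypothetical: `Point`, `Bounds`, `Place`, `filterInBounds` and `move` are illustration-only names, and the coordinates are rough (longitude, latitude) values. A real implementation would operate on JTS geometries, and a coordinate system transformation would need a proper geo library, so it is not shown.

```kotlin
// Hypothetical minimal types; a real implementation would use JTS geometries.
data class Point(val x: Double, val y: Double)

data class Bounds(val minX: Double, val minY: Double, val maxX: Double, val maxY: Double) {
    // Enables the `point in bounds` syntax.
    operator fun contains(p: Point) = p.x in minX..maxX && p.y in minY..maxY
}

data class Place(val name: String, val location: Point)

// Keep only the rows whose geometry lies inside the bounds.
fun List<Place>.filterInBounds(bounds: Bounds) = filter { it.location in bounds }

// Shift one named row by (dx, dy), leaving all other rows untouched.
fun List<Place>.move(name: String, dx: Double, dy: Double) = map {
    if (it.name == name) it.copy(location = Point(it.location.x + dx, it.location.y + dy)) else it
}

fun main() {
    // Rough, illustrative coordinates only.
    val places = listOf(Place("Iceland", Point(-19.0, 65.0)), Place("Netherlands", Point(5.3, 52.1)))
    val mapArea = Bounds(minX = -10.0, minY = 35.0, maxX = 30.0, maxY = 70.0)

    println(places.filterInBounds(mapArea).map { it.name })
    println(places.move("Iceland", dx = 12.0, dy = 0.0).filterInBounds(mapArea).map { it.name })
}
```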
@Jolanrensen
Collaborator

First of all, thanks! Good idea, certainly in combination with Kandy. I'll think about the proposal some more but I've already got a quick comment in terms of the proposed API:

For all formats, DataFrame follows the structure:

DataFrame.read`SOME TYPE`(input): DataFrame<*>

This is because DataFrame.read(input) should be able to guess any of the supported types, so, also for consistency, the geo readers should follow the same structure.

I think it's fine to make an exception for the return type, so you could return something like DataFrame<GeoDataRow>. Making a subtype of DataFrame will likely cause many bugs due to the way the library is designed, but with extension functions I think many ideas you have will be possible with a simple typealias.

This would adjust the proposed API to:

typealias GeoDataFrame = DataFrame<GeoDataRow>

fun DataFrame.readGeoJson(input): GeoDataFrame

fun DataFrame.readShapeFile(input): GeoDataFrame

and if you use the non-specific read function:

DataFrame.read(geoJsonInput).cast<GeoDataRow>()
// or
DataFrame.read(geoJsonInput).asGeoDataFrame()
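
To illustrate the typealias-plus-extension-functions idea without subclassing, here is a self-contained sketch. `Frame<T>` is only a stand-in for the library's `DataFrame<T>`, and `GeoFrame`, `asGeoFrame` and `geometries` are hypothetical names, not the actual DataFrame API.

```kotlin
// Stand-in for the library's DataFrame<T>; only here to keep the sketch self-contained.
class Frame<T>(val rows: List<Map<String, Any?>>) {
    // Mirrors the idea of cast<T>(): same data, different compile-time marker type.
    @Suppress("UNCHECKED_CAST")
    fun <R> cast(): Frame<R> = this as Frame<R>
}

// Hypothetical marker type for frames that carry a "geometry" column.
interface GeoDataRow

typealias GeoFrame = Frame<GeoDataRow>

// Verifies that every row has a geometry entry, then retags the frame.
fun Frame<*>.asGeoFrame(): GeoFrame {
    require(rows.all { "geometry" in it }) { "every row needs a geometry column" }
    return cast<GeoDataRow>()
}

// Geo-specific operations are extensions on the alias, so no subclass is needed.
fun GeoFrame.geometries(): List<Any?> = rows.map { it["geometry"] }

fun main() {
    val generic = Frame<Any>(listOf(mapOf("name" to "Netherlands", "geometry" to "POINT (102 0.5)")))
    // generic.geometries() would not compile: the extension exists only for GeoFrame.
    println(generic.asGeoFrame().geometries())
}
```

Because the geo operations are defined only for the aliased type, they stay invisible on ordinary frames until the frame is explicitly cast.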

@Jolanrensen Jolanrensen added the enhancement New feature or request label Sep 20, 2024
@zaleslaw zaleslaw added this to the 0.15.0 milestone Sep 20, 2024
@Jolanrensen
Collaborator

closed by #909
