
Readily available algorithms / overlap of interests #2

Open
Huite opened this issue Aug 2, 2024 · 4 comments
@Huite

Huite commented Aug 2, 2024

Hi, maybe relevant to your interests: I ended up porting celltree to numba because I rarely use C or C++ and I needed extra functionality (searching for polygon and line intersections, not just points).

To do so, I needed a bunch of algorithms. The small ones can be found here: https://github.com/Deltares/numba_celltree/blob/main/numba_celltree/geometry_utils.py

E.g. the same point in polygon algorithm: https://github.com/Deltares/numba_celltree/blob/9da5d941f27cae2b42498206b8625ff3ff42b9e3/numba_celltree/geometry_utils.py#L93

I also included a number of other algorithms:
https://github.com/Deltares/numba_celltree/tree/main/numba_celltree/algorithms

│ 1 │ barycentric_triangle.py   │
│ 2 │ barycentric_wachspress.py │
│ 3 │ cohen_sutherland.py       │
│ 4 │ cyrus_beck.py             │
│ 5 │ liang_barsky.py           │
│ 6 │ separating_axis.py        │
│ 7 │ sutherland_hodgman.py     │

I use the barycentric stuff for interpolation; cyrus_beck, liang_barsky, and sutherland_hodgman are for clipping; separating_axis identifies whether two (convex) polygons intersect.

It's all pretty lightweight; I defined some Point and Vector named tuples. Everything has a higher-level function which vectorizes it.
Test coverage is okay -- but I'm sure it's still missing a lot of (literal) edge cases.
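As a sketch of that layout (hypothetical names, and numba's @njit decoration omitted so it runs as plain Python): a scalar kernel operating on Point tuples, plus a wrapper that vectorizes it over an array of points:

```python
from collections import namedtuple

import numpy as np

# Hypothetical names; the actual library defines its own Point/Vector tuples.
Point = namedtuple("Point", ["x", "y"])

def point_in_polygon(p, poly):
    # Classic even-odd (crossing number) test; poly is a sequence of Points.
    inside = False
    n = len(poly)
    for i in range(n):
        a, b = poly[i], poly[(i + 1) % n]
        # Does edge (a, b) straddle the horizontal ray through p?
        if (a.y > p.y) != (b.y > p.y):
            x_cross = a.x + (p.y - a.y) * (b.x - a.x) / (b.y - a.y)
            if p.x < x_cross:
                inside = not inside
    return inside

def points_in_polygon(points, poly):
    # The "higher-level" wrapper: apply the scalar kernel to an (n, 2) array.
    return np.array([point_in_polygon(Point(*xy), poly) for xy in points])
```

With numba, the kernel would be jitted and the wrapper would loop in compiled code instead of a Python list comprehension.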

It uses numba instead of Cython, because I find it a lot easier to distribute and to debug (just switch JIT compiling off temporarily).
If you were okay with numba, you could just copy it. But porting to Cython should also be straightforward -- it depends on where you'd like to go with this project. (There's quite a lot of active development on the Python compilation front, see e.g. https://lpython.org/ or Mojo, etc. -- I might end up switching compilers.) Requirements are minimal: just numba and numpy.

Shared interests

It's worth mentioning: I think my interests are very similar. I have numpy arrays that represent spatial data and I'd like to operate on them efficiently; Shapely is generally too expensive. Currently, whenever I need something new I just add it to numba_celltree, because most of it goes through there. But ideally, we'd have a nice Python package that's generally available. I'm also open to renaming numba_celltree and accepting PRs, of course!
Ideally, I'd also like to support 3D geometry to add Ugrid3D unstructured grids support to numba_celltree. My gut feeling is that that's a lot more work than 2D, and it would be a shame to have it under-generalized or under-utilized. It's also possible that using vtk is just the much easier approach for 3D, but I haven't researched that very carefully.

Julia

Finally: whenever you're looking for readable implementations of computational geometry, it's worth checking out the Julia ecosystem:

https://github.com/JuliaGeometry
https://github.com/JuliaGeo/GeometryOps.jl

The Julia community, though small, seems to be moving forward a lot faster, because it's a straightforward choice to implement things in Julia and get good performance. In Python you have to choose between Cython, C, C++, Fortran, Rust, or some JIT compiler which supports a subset of Python -- all of which seems to generate so much friction, and scatters efforts so that things don't get off the ground. Calling Julia from Python isn't very attractive at the moment, but the Julia people seem to be making exciting progress on static compilation. That would probably be my preferred solution -- at that point I'd port the celltree stuff again, but then to Julia. Until then, porting from Julia to Python is often very straightforward.

@ChrisBarker-NOAA
Contributor

Thanks for reaching out!

I had seen your numba cell_tree code -- and intended to ask about merging those projects.

I'm still wary of relying on numba for "production" work (e.g. as part of a web service, or ...) -- it's great for interactive data analysis, and, as you point out, algorithm development / debugging. And maybe I'm overly cautious -- I committed to Cython long before numba existed :-)

And with conda-forge, at least I no longer have to expect my users to compile stuff :-)

All that to say that I'm not ready to commit to pure-numba, but I think they can co-exist.

My thought for the cell_tree2D project:

We put them in one repo, and then they can share all the tests, etc. I think that will help both projects, and keep the API consistent, which I think is a good thing.

And two options for packaging -- two packages:
cell_tree2D and cell_tree2D_numba -- or, even better, one package that auto-selects, depending on what's installed:

By default, on import it would look for the Cython version; if it's there, it would use it -- if not, it would use the numba version.

Alternatively, you could select which version you want to use:

import cell_tree2d

cell_tree2d.select("numba" | "cython" | "python")

And, now that you mention it -- we could merge cell_tree with geometry_utils.

I had thought they were kind of separate, but there really is a lot of intertwining -- cell_tree needs point_in_polygon (at least a simple version), and geometry could use a cell_tree for, e.g., fast multiple points in a complex polygon:

  • triangulate the polygon
  • build a cell_tree
  • do the point in polygon check with the cell_tree.
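The three steps above can be sketched roughly as follows. A fan triangulation (valid for convex polygons only) stands in for earcut, and a brute-force scan over the triangles stands in for the cell_tree lookup:

```python
import numpy as np

def fan_triangulate(poly):
    # Stand-in for earcut: fan triangulation, correct for convex polygons only.
    return [(0, i, i + 1) for i in range(1, len(poly) - 1)]

def point_in_triangle(p, a, b, c):
    # The sign of the cross product tells which side of each edge p lies on;
    # p is inside when all three signs agree (ties count as inside).
    def side(u, v):
        return (v[0] - u[0]) * (p[1] - u[1]) - (v[1] - u[1]) * (p[0] - u[0])
    s1, s2, s3 = side(a, b), side(b, c), side(c, a)
    return (s1 >= 0 and s2 >= 0 and s3 >= 0) or (s1 <= 0 and s2 <= 0 and s3 <= 0)

def points_in_polygon_via_triangles(points, poly):
    # Step 1: triangulate; step 2 would build a cell_tree over the triangles;
    # step 3: point-in-triangle tests (brute force here, no tree).
    tris = fan_triangulate(poly)
    out = np.zeros(len(points), dtype=bool)
    for k, p in enumerate(points):
        out[k] = any(
            point_in_triangle(p, poly[i], poly[j], poly[m]) for i, j, m in tris
        )
    return out
```

The payoff of the real cell_tree is in step 3: the tree narrows each point down to a handful of candidate triangles instead of scanning all of them.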

Speaking of which -- do you have a triangulation algorithm? We've been using "mapbox_earcut":

https://github.com/skogler/mapbox_earcut_python

which is a Python wrapper around C++ code, ported from JS, I think. A numba implementation might make sense.

Honestly -- I have no idea if it's the best way to go -- one of our team found it, and started using it -- it's worked so far :-)

As for Julia -- I had high hopes for it as a way to write compiled extensions for Python way back when -- it seemed there was a lot of cross-pollination between the communities -- but somehow that was never made as easy as I thought it could be -- oh well.

LPython looks promising -- I hadn't seen that before -- doesn't look ready for production work, but good to keep an eye on it.

All this to say:

I propose we join forces and create a single project for geometry utilities -- combo of python, numpy and cython (optionally calling C/C++) (for now).

We can use this project if you like -- I don't think it's had any real uptake at this point.

NOTE: I recently was working on another project that needed a fast point in polygon routine. I thought that would be a good chance to finalize, and make use of, this project.

But in that case: many points, small number of polygon vertices, a pure-numpy implementation was just as fast -- so why introduce the dependency?

https://github.com/asascience-open/xarray-subset-grid/blob/main/profiling/p_in_p_perf.py

(and the Shapely version was slower in all cases)

Thanks for the interest!

@Huite
Author

Huite commented Aug 2, 2024

I'm still wary of relying on numba for "production" work (e.g. as part of a web service, or ...) -- it's great for interactive data analysis, and, as you point out, algorithm development / debugging. And maybe I'm overly cautious -- I committed to Cython long before numba existed :-)
And with conda-forge, at least I no longer have to expect my users to compile stuff :-)
All that to say that I'm not ready to commit to pure-numba, but I think they can co-exist.

The web services argument is a good one, I honestly hadn't considered that! numba_celltree takes some time to compile (more than 20 s, I think), but then everything is cached. I force everything to 64-bit floats anyway, so there's no JIT'ing required afterwards. But at least one downstream user was getting a timeout on one of their CI tests due to compilation.
Then again, deploying scientific Python on a web service without any caching probably gives pretty bad results due to all the binary stuff that's required?

We put them in one repo, and then they can share all the tests, etc. I think that will help both projects, and keep the API consistent, which I think is a good thing.

I've added a few extra methods; they could be added to the C version as well, it's a little bit of work. It would also be a nice opportunity to make the C version run in parallel; it should be about twice as fast as the numba version, because numba lacks a concurrent data structure to append to in parallel -- so I search first to count, allocate, then search again and actually store the results...
Because most machines have more than two cores now, it's still a net gain, but it feels pretty lame.
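The count-allocate-fill pattern described above, sketched in plain Python/numpy (in numba_celltree the two passes would run inside jitted kernels; names here are illustrative):

```python
import numpy as np

def two_pass_collect(queries, predicate, candidates):
    # Pass 1: count hits per query, since we can't append concurrently.
    counts = np.zeros(len(queries), dtype=np.intp)
    for i, q in enumerate(queries):
        counts[i] = sum(predicate(q, c) for c in candidates)

    # Allocate exactly once, with CSR-style offsets per query.
    offsets = np.zeros(len(queries) + 1, dtype=np.intp)
    np.cumsum(counts, out=offsets[1:])
    out = np.empty(offsets[-1], dtype=np.intp)

    # Pass 2: repeat the search and store candidate indices this time.
    for i, q in enumerate(queries):
        k = offsets[i]
        for j, c in enumerate(candidates):
            if predicate(q, c):
                out[k] = j
                k += 1
    return offsets, out
```

Each of the two outer loops parallelizes trivially (every query writes to its own slice), which is why the pattern works under numba's prange despite doing the search twice.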

I'm a bit more experienced nowadays, so I think I'm probably more open to non-numba approaches now. I also had the (naive) hope that people might get involved more easily if it's "just Python", but "laypeople" aren't interested in computational geometry by a long shot...

Speaking of which -- do you have a triangulation algorithm? We've been using "mapbox_earcut":

I've been using mapbox_earcut too. It seems nice, robust, and extremely fast; I found that it actually combines pretty nicely with the cell tree. In our case, we have a workflow where we need to know the wetted area (of rivers/lakes/etc.) for each grid cell of a model. This requires millions of intersections, and shapely/GEOS are just too slow. But converting the polygons to triangles, storing them in a celltree, and then doing the polygon intersection seems to work pretty well!
earcut does generate a lot of sliver triangles though, so the resulting tree had some very deep branches and probably wasn't very well balanced.

The first alternative that comes to mind is Shewchuk's triangle library. I've wrapped the Python bindings in pandamesh for easier use with geopandas and such; I should add mapbox_earcut to it as well, now that I think of it. I didn't do a lot of testing, but I remember it being at least an order of magnitude slower (which makes sense on account of all the work it does); gmsh is again an order of magnitude slower than triangle.

EDIT: that reminds me, there's also this one: https://github.com/mourner/delaunator-rs
Same guy as earcut, but it will generate new nodes. As such, it'll probably produce a lot fewer slivers. I'm not sure it'll be very different from an unconstrained triangle run.
Python-Rust interop is pretty easy to set up; I should try generating some bindings and do a few test runs!
It would be a reasonable candidate for a numba port (much more so than earcut).

which is a Python wrapper around C++ code, ported from JS, I think. A numba implementation might make sense.

Yes; incidentally, I tried this mostly motivated by (perverse) curiosity:
Deltares/numba_celltree#16

It was not a good experience. Mapbox earcut uses a doubly linked list because it has to insert a lot of values. This is easy in Python (just store references), but not with numba. I could simulate it with a numpy structured array / vector, using integers instead of pointers, but it needed a lot more workarounds. (Funnily enough, I found that people have done the same in Rust, because Rust's borrow checker doesn't like the ownership ambiguity of linked lists.)
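The array-backed linked list workaround might look roughly like this (an illustrative sketch, not the actual numba_celltree code):

```python
import numpy as np

class NodeList:
    # A circular doubly linked list stored in flat arrays, with integer
    # indices standing in for pointers -- a layout numba can handle.
    def __init__(self, values):
        n = len(values)
        self.value = np.asarray(values, dtype=np.float64)
        self.prev = np.roll(np.arange(n), 1)   # prev of node 0 is n - 1
        self.next = np.roll(np.arange(n), -1)  # next of node n - 1 is 0

    def remove(self, i):
        # Unlink node i by rewiring its neighbours; the slot is simply
        # orphaned rather than freed, as there is no allocator.
        p, q = self.prev[i], self.next[i]
        self.next[p] = q
        self.prev[q] = p
```

Insertion works the same way in reverse, but requires preallocating (or growing) the arrays up front, which is where most of the extra bookkeeping comes from.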

I saw your comment on the mapbox_earcut repo PR about more maintainers; if you're compiling C(++) and generating Python bindings for celltree already anyway, you could consider moving the mapbox earcut code in as well. Including the triangle code + Python bindings would be an idea too, since neither the mapbox_earcut nor the triangle maintainer is very active anymore. I'm a maintainer on the conda recipe for the triangle bindings. Triangle has a weird custom license though, so it might make sense to isolate it for that reason.

A few marginal improvements for earcut came to mind:

  • A more or less vectorized operation to earcut many polygons. It doesn't seem like the Python overhead is all that big when earcutting millions of polygons, but it's nice not to have to worry about it at all.
  • In the same vein: parallel earcutting of many polygons. It's trivial in Python too with multiprocessing, but again I expect C with OpenMP to have less overhead.
  • The Julia earcut implementation I've seen has a shuffle argument. That's an interesting idea; apparently it may reduce the number of sliver triangles. The shuffle is random though; deterministic would be a lot nicer.

NOTE: I recently was working on another project that needed a fast point in polygon routine. I thought that would be a good chance to finalize, and make use of, this project.
But in that case: many points, small number of polygon vertices, a pure-numpy implementation was just as fast -- so why introduce the dependency?

I see it's processing all points per polygon side at once; makes sense. Can't vectorize it reasonably for multiple polygons though!

@ChrisBarker-NOAA
Contributor

We put them in one repo, and then they can share all the tests, etc. I think that will help both projects, and keep the API consistent, which I think is a good thing.

I've added a few extra methods; they could be added to the C version as well

Worst case, you get a NotImplementedError if an implementation doesn't have a given method.

Have you added any sanity checking? The C code can fail badly if the input data aren't consistent :-) -- even if it doesn't hard-crash, it doesn't give the user any idea why it didn't work :-(

So I'd like to have an optional check-the-input-for-correctness method of some sort.

I see it's processing all points per polygon side at once; makes sense. Can't vectorize it reasonably for multiple polygons though!

Exactly -- and it's putting the "points" loop in numpy, but the "vertices" loop is in Python -- so a really large polygon would be a lot slower.
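That structure -- the points loop in numpy, the vertex loop in Python -- can be sketched like this (a crossing-number version; the linked profiling script may differ in detail):

```python
import numpy as np

def points_in_polygon(points, poly):
    # points: (n, 2) array; poly: sequence of (x, y) vertices.
    x, y = points[:, 0], points[:, 1]
    inside = np.zeros(len(points), dtype=bool)
    n = len(poly)
    for i in range(n):  # Python loop over the (few) polygon edges
        ax, ay = poly[i]
        bx, by = poly[(i + 1) % n]
        # Vectorized over all points: does this edge straddle each ray?
        straddles = (ay > y) != (by > y)
        # Horizontal edges divide by zero; straddles is False there,
        # so the bogus x_cross values are masked out.
        with np.errstate(divide="ignore", invalid="ignore"):
            x_cross = ax + (y - ay) * (bx - ax) / (by - ay)
            crosses = straddles & (x < x_cross)
        inside ^= crosses
    return inside
```

Each edge costs one vectorized pass over all points, so the cost grows linearly with the vertex count -- fine for a handful of vertices, slow for a huge polygon.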

As far as I have found, Shewchuk's code is still by far the best (only) option out there for constrained Delaunay -- which we very much need. We actually have our own wrappers around it -- I thought we'd published that, but maybe we didn't because of the licensing issue -- that is painful. I tried reaching out to him about it some years ago, but never got a reply -- maybe one of these days when in Berkeley I could find him in person ...

One of the key things we found with that code was that it would call sys.exit() on some error conditions, which would crash Python :-( -- so we called it in a sub-process. Other than that, I don't know if our code has any advantages. I haven't tested the py-triangle code for that particular problem. (I think it mostly showed up if you did a "quality" triangulation.)
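The sub-process trick can be sketched like this; `risky_triangulate` is a hypothetical placeholder for the real triangle wrapper, not our actual code:

```python
import multiprocessing as mp

def risky_triangulate(polygon):
    # Placeholder for the real wrapper; the real one may call sys.exit()
    # or segfault on bad input. Here: triangle count of a simple polygon.
    return len(polygon) - 2

def _worker(conn, args):
    conn.send(risky_triangulate(*args))

def safe_triangulate(*args, timeout=60):
    # Run the risky call in a child process; if it dies (sys.exit, crash),
    # only the worker is lost and we can raise a normal Python error.
    parent, child = mp.Pipe()
    proc = mp.Process(target=_worker, args=(child, args))
    proc.start()
    proc.join(timeout)
    if proc.exitcode != 0 or not parent.poll():
        raise RuntimeError(f"triangulation failed (exit code {proc.exitcode})")
    return parent.recv()
```

The cost is process startup and pickling the arrays over the pipe, which is usually negligible next to a large triangulation.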

BTW: the https://github.com/asascience-open/xarray-subset-grid project may be of interest to Deltares -- it is yet ANOTHER attempt at a problem that's been addressed a number of times (including by a Deltares project, I think). We provided some funding to kick that off -- we are looking for a robust and universal solution that can hopefully continue to be extended. None of the existing codes did everything we need well. I'm not entirely sure why the developer decided to start from scratch, but he did -- and now it's in pretty useful shape.

So where to go from here?

It sounds like you're interested in more collaboration -- how about we put all this in one project? We can decide later to what extent it should be one package or multiple more focused packages.

We could either start a new one or use this one -- your choice?

And we could put in NOAA's org, or Deltares' org, or even start a new org for this.

@Huite
Author

Huite commented Aug 2, 2024

Worst case, you get a NotImplementedError if an implementation doesn't have a given method.

Fair enough.

Have you added any sanity checking? The C code can fail badly if the input data aren't consistent :-) -- even if it doesn't hard-crash, it doesn't give the user any idea why it didn't work :-(

Numba isn't memory safe and doesn't bounds-check by default either, so it can crash Python pretty easily. I haven't seen that happen with the celltree though. The whole thing feels pretty robust to me, since the tree is built from the bounding boxes of the input data. I guess it'll go wrong if you have overlapping faces, but does it really spin out of control?

I only ended up adding this: https://deltares.github.io/numba_celltree/api.html#numba_celltree.CellTree2d.validate_node_bounds
Because there's an error with the bounding planes: NOAA-ORR-ERD/cell_tree2d#13
But it's a manual check I ran for the bad answers and decided to keep around.

One of the key things we found with that code was that it would call sys.exit() on some error conditions, which would crash Python :-( -- so we called it in a sub-process -- other than that, I don't know if our code has any advantages. I haven't tested the py-triangle code for that particular problem. (I think it mostly showed up if you did a "quality" triangulation.)

The conda-recipe just uses this one: https://github.com/drufat/triangle
So it'll probably fail just as badly. The subprocess thing is a good idea.
I've had some people using pandamesh to drive triangle; I haven't gotten to writing good input checks yet. A lot of GIS vector operations tend to produce nasty input (touches, overlaps, hanging edges, etc.).

By the way, the MODFLOW folks are just writing the triangle input files and calling the executable: https://flopy.readthedocs.io/en/latest/source/flopy.utils.triangle.html

BTW: the https://github.com/asascience-open/xarray-subset-grid project may be of interest to Deltares -- it is yet ANOTHER attempt at a problem that's been addressed a number of times (including a Deltares project, I think). We provided some funding to kick that off -- we are looking for a robust and universal solution that can hopefully continue to be extended. None of the existing codes did everything we need well.

I still think the best long-term solution is via xarray with custom index objects. It requires the explicit indexes (which have been in progress for quite some time): https://github.com/pydata/xarray/projects/1
Xugrid currently wraps xarray objects -- this brings a lot of downsides, but once the explicit indexes are done, I should be able to move all functionality into a Ugrid Index and a .ugrid accessor. That will make things far more interoperable between packages. In principle, you could add more index types and then add an accessor which can deal with all of them, abstracting the underlying grid.
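As an illustration of the accessor idea (the name, method, and dimension are hypothetical -- this is not the actual xugrid API):

```python
import xarray as xr

# Registering an accessor attaches a ".ugrid" namespace to every Dataset,
# so grid-aware functionality rides along without wrapping the object.
@xr.register_dataset_accessor("ugrid")
class UgridAccessor:
    def __init__(self, ds):
        self._ds = ds

    @property
    def n_face(self):
        # A real accessor would consult the UGRID topology conventions;
        # here we just assume a "face" dimension for illustration.
        return self._ds.sizes.get("face", 0)
```

Combined with a custom index, the Dataset stays a plain xarray object that other packages can consume, which is the interoperability win described above.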

I should mention, some of my colleagues are also developing this C++ library:
https://github.com/Deltares/MeshKernel
With python bindings:
https://github.com/Deltares/MeshKernelPy
Its main use is aimed at a C# GUI though, so e.g. it doesn't even have a conda feedstock yet. Xugrid is being used quite a lot at Deltares now, so I've added some utilities for meshkernel as an optional (pypi) dependency.

So where to go from here?
It sounds like you're interested in more collaboration -- how about we put all this in one project? We can decide later to what extent it should be one package or multiple more focused packages.

Yes, I would prefer collaboration. I'm pretty happy with the mesh-related packages I've set up so far (numba_celltree, pandamesh, xugrid), and I've been fairly productive (if I say so myself), but I'm the primary maintainer for all of them.

Currently, from the perspective of "geometry operations on plain numpy arrays", numba_celltree provides what I need and I can easily extend it (I basically don't have to explain myself to anybody either...). However, we currently have two celltree implementations that are largely identical; it would make sense to merge those. I think consolidating stuff like mapbox_earcut would also be reasonable (as long as it takes plain numpy arrays), if we see a need.
We should set things up nicely: e.g. pixi for managing Python environments, maybe meson to build binaries, sphinx for the docs, CI, test coverage, ruff for linting and formatting, etc.

I think in the end we would have to make a decision on numba versus C/C++/Cython. I don't think it's tenable to have duplicate classes for everything; somebody compromising seems more reasonable (in terms of accepting numba / accepting ahead-of-time compilation). That might take some time.

I think the question is: how do we make it worthwhile to get started in the meantime?

For example, I need a celltree not for polygons, but for line segments (river networks): Deltares/numba_celltree#11
Recently, I've also needed to snap line geometries to a grid (although this uses a scipy KDTree as well):
https://deltares.github.io/xugrid/examples/vector_conversion.html#snap-to-grid
This would be a reasonable addition too, I think (again, just geometry in numpy array form).

I still need more robust barycentric interpolation weights as well.

Which currently-missing functionality is most pressing for you? We'd start benefiting immediately if there's some shared stake there.

(geometry_utils might not cut it by the way, since the name seems taken on pypi? https://pypi.org/project/geometry-utils/)

Having said that, I also don't have all that much time on my hands at the moment to get things off the ground...
