Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: how to handle nodata #31

Open
alexgleith opened this issue Nov 26, 2023 · 7 comments
Open

Discussion: how to handle nodata #31

alexgleith opened this issue Nov 26, 2023 · 7 comments

Comments

@alexgleith
Copy link
Contributor

We currently have a mix of ways of handling nodata annotation in xarrays.

Some code is using rio-xarray and some is using the convention of storing the nodata value on xr.atts["nodata".

We should ensure we consume and produce the nodata metadata the same, so we don't need multiple ways of checking, asserting, setting, and consuming nodata values.

@alexgleith alexgleith changed the title Discussion: how to handle Nodata Discussion: how to handle nodata Nov 26, 2023
@jessjaco
Copy link
Collaborator

Yeah it looks like odc looks for a nodata attribute and rioxarray looks for a .rio.nodata value so I'm inclined to support each. I will check if rioxarray respects the .nodata attribute.

I do want to clean up the loaders, they're kind of a mess right now. But part of that is pending this convo

@jessjaco
Copy link
Collaborator

jessjaco commented Jan 8, 2024

I'm trying to work through the current handling of nodata values in the OdcLoaderMixin, and understand how odc.stac.load handles things at the same time.

It looks like odc.stac.load sets the nodata attribute at the variable level, not the dataset level. This makes sense, as different variables can have different nodata values (see qa_pixel vs the other landsat bands). Which makes this assignment in the mixin redundant, I think:

if self.nodata is not None:
xr.attrs["nodata"] = self.nodata

But I think this may be used in places, so we should check before removing.

So if the metadata exists or the nodata argument is passed, then each variable will have a .nodata attribute. I don't know what happens if neither occurs.

I would like to maintain compatibility with rioxarray, which describes its approach to nodata here. They too keep track of nodata at the DataArray / variable level. Additionally, the .rio.nodata accessor will read .attrs['nodata'] so I think we will be ok maintaining that as the goto. My only concern (and something worth testing) is if .attrs['nodata'] is set and .rio.write_nodata is called with another value, does .attrs['nodata'] get overwritten, or are they at odds?

@jessjaco
Copy link
Collaborator

jessjaco commented Jan 8, 2024

Another thing to consider is xarray. Officially, xarray only supports nan as the nodata value.

An additional concern is propagation of .attrs['nodata'] (or other attributes). Even innocuous operations like da * 3 will erase the attribute (see https://docs.xarray.dev/en/stable/user-guide/computation.html#missing-values). At they link, they do suggest changing one of the global xarray options (keep_attrs) to override this behavior, though we have to decide if we want to override a default option as a matter of course! Also see rioxarray's discussion of this.

@alexgleith
Copy link
Contributor Author

There's a mix of compatibility in our current implementation!

I know that odc-stac handles nodata per-band, and you can pass in information to help (this kind of config stuff). We should just hardcode that for S-2 and Landsat probably, and then once it's done, we can just assume it's there.

Xarray is hard to influence... they do a few things that break the work we're doing. Some operations lose the CRS, for example. Being consistent with keep_attrs is a good idea...

Maybe it's worth coming up with a consolidated picture between us, and then checking in with others like Robbi and Kirill and influencing in the right place to encourage broader consistency! That's a longer term goal, but will benefit many.

the .rio.nodata accessor will read .attrs['nodata'] so I think we will be ok maintaining that as the goto

This sounds good to me.

@jessjaco
Copy link
Collaborator

jessjaco commented Jan 8, 2024

Right, I intend to take out a lot of our nodata handling, because odc.stac.load (and stackstac, if I ever update that interface) already handle it in a fairly robust manner. In fact, I am planning to take out of a lot of our wrappers around odc.stac.load, because they don't really help.

The only operation I'd like to keep is if data are loaded as a float type (potentially non-nan) float values are converted to nan. I may make that optional. Otherwise it's up to us to support propagation of non-nan nodata attribtues. I'll do my best, but I don't really work with non-float xarray objects, until I'm at or very close to writing.

@alexgleith
Copy link
Contributor Author

In fact, I am planning to take out of a lot of our wrappers around odc.stac.load

Good idea :-)

I don't really work with non-float xarray objects

I think we have to in some situations, unfortunately. I really don't want to write intermediate files, and as such, want to keep everything in memory and saving half of a big number of memory makes a significant difference! The GeoMAD for Sentinel-2 for example keeps everything as uint16 and still uses something like 150 GB of memory. The Coastlines process I'm running in Vietnam is similar, at the most extreme case (hilariously with 1 year of data for S-2 and ~35 years with Landsat!)

So, I'm pretty firm on us needing both an int and a float path.

@jessjaco
Copy link
Collaborator

jessjaco commented Jan 8, 2024

Well it's a matter of approach too. I almost never bring big float arrays into memory, just chunks. (That was the biggest reason I ended up modifying a lot of the existing coastlines code)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants