GeoDataset: avoid unnecessary reprojection #409

adamjstewart · 2022-02-16T22:08:55Z

This issue is meant to serve as a proposal for how to fix #278. Please try to poke holes in this proposal and find ways in which it won't work, or suggest alternative solutions.

Rationale

Geospatial data files are stored in a particular CRS. In order to build a single R-tree index of a geospatial dataset, the bounds of all files must be reprojected to the same CRS. During sampling, we also reproject the file itself to a common CRS so that we can stitch together multiple files from the same layer, or combine multiple layers in an IntersectionDataset.

Unfortunately, reprojection/resampling is very slow, and can negatively affect data loader performance (see section 4.2 of https://arxiv.org/abs/2111.08872). It can also lead to large distortions of images in polar regions. We would like to be able to avoid reprojection if possible.

Proposal

I propose to only reproject files during sampling when absolutely necessary. This includes any situation in which we sample from multiple files at once and all files are not in the same CRS, such as:

a single GeoDataset containing overlapping files on the border of two UTM zones
an IntersectionDataset with datasets in different CRSes
a UnionDataset with overlapping files and datasets in different CRSes

The details of how this can be done is left to the "Implementation" section below, but the basic idea is to add an additional parameter for __getitem__ that contains the CRS to project to. The sampler is then responsible for setting this CRS in a way that minimizes reprojection.

There is one issue that arises with this implementation. If we choose our bounding box in one CRS and reproject it to a different CRS, the bounding box will be rotated, and the enclosing bounding box may be larger than the original. This can result in tensors of different size that cannot be concatenated without padding. The "Implementation" section describes how this problem can be avoided.

Implementation

The details of what will need to be changed are as follows:

GeoDataset.index: instead of storing {bbox: filename}, store {bbox: [(filename, CRS), ...]}
{Intersection|Union}Dataset.index: instead of storing {bbox}, store {bbox: [(filename, CRS), ...]}
GeoDataset.__getitem__: instead of taking a BoundingBox(minx, maxx, miny, maxy, mint, maxt), accept a BoundingBox(x, y, t, width, height, time) and CRS, then project (x, y) to that CRS before sampling
GeoSampler: look at the native CRS of all files in the R-tree and select the mode as the CRS to return

By reprojecting only (x, y) instead of (minx, maxx, miny, maxy), we avoid issues with inconsistent tensor sizes during concatenation. The width/height can be applied directly in the new CRS without distortion.

Note that the R-tree index will still use (minx, maxx, miny, maxy, mint, maxt) as its index, while __getitem__ will take a BoundingBox(x, y, t, width, height, time). This is quite different from the previous approach that used the former for both.

Alternatives

There have been a few other alternatives proposed, each with their own pros/cons:

Always project to shared CRS—This is our current implementation. It can be slow and cause spatial distortion, but it's simple and straightforward.
Always sample in native CRS—This is the implementation we use for ChesapeakeCVPR. It is the fastest possible solution with the least distortion, but it disallows datasets with overlapping images in different CRS, so it's somewhere between VisionDataset and GeoDataset. It also suffers from the aforementioned bug with tensor concatenation shape mismatches, although we could employ the same solution as proposed here to fix that.
Add flag to control reprojection—We could add a flag that tells TorchGeo whether or not to reproject files. Unfortunately, this just punts the problem to the user and allows them to shoot themselves in the foot. This also results in more code paths to test and debug.
Require single CRS—We could require all users to reproject their data themselves and ensure that all files are in the same CRS. Then we never have to worry about reprojection. This is similar to what Raster Vision does (from my understanding). Fast at loading time, but removes most of what makes TorchGeo cool. Also increases storage demands.

@RitwikGupta

The text was updated successfully, but these errors were encountered:

weiji14 · 2022-02-17T16:08:55Z

Just to give my two cents as I've been following #278 closely and trying to figure out how to use torchgeo to load multiple Sentinel-2 and higher resolution datasets from different UTM zones without reprojection for an object detection project 😃

GeoDataset.index: instead of storing {bbox: filename}, store {bbox: [(filename, CRS), ...]}

{Intersection|Union}Dataset.index: instead of storing {bbox}, store {bbox: [(filename, CRS), ...]}

Ok with this, but instead of storing tuple(filename, CRS) in the rtree index, I would suggest storing a Python dict ({"filename": filename, "CRS": epsg}) or better, a namedtuple ((filename=filename, CRS=epsg)) instead. This would be a bit more future proof as there might be more metadata we want to store, e.g. the resolution of the raster GeoDataset (c.f. #74, e.g. when people want to store different Sentinel-2 images in their native resolution of 10m, 20m, 60m).

GeoDataset.__getitem__: instead of taking a BoundingBox(minx, maxx, miny, maxy, mint, maxt), accept a BoundingBox(x, y, t, width, height, time) and CRS, then project (x, y) to that CRS before sampling

Just a minor (but important) detail. I insist that the x and y in BoundingBox(x, y, t, width, height, time) be the middle x and y, and not some arbitrary corner. I've had problems before with having to flip data up-down because some people use minx/miny from the bottom left and some people use minx/miny from the top left 🙃

GeoSampler: look at the native CRS of all files in the R-tree and select the mode as the CRS to return

Could you elaborate more on this? By 'mode', do you mean having a flag where the default is reproject to the R-tree index's CRS (as with torchgeo v0.2.0), and a mode='native' that just samples at the raster's native projection?

Either way, I'm happy to help review and/or open a PR to implement this in torchgeo!

adamjstewart · 2022-02-17T20:01:19Z

Thanks for the feedback @weiji14!

I would suggest storing a Python dict... or better, a namedtuple... instead.

A named tuple could work. Proper type hints weren't added until Python 3.7 but we could use typing-extensions to backport this. I'm not super worried about being future proof because we rebuild the index each time, so it's not like we need to support old indices. I don't think many users directly pass a value to __getitem__, I think most users instantiate a GeoSampler and let that determine the value to pass.

I insist that the x and y in BoundingBox(x, y, t, width, height, time) be the middle x and y, and not some arbitrary corner.

I actually proposed this exact thing to @calebrob6 because I was worried that we would end up with more distortion if we use a corner instead of a centroid. However, the problem is that you can run into off-by-one errors if the width/height is odd vs even. It also adds a bit of complexity to the code if you want to specify a bounding box that rasterio can actually read. Again, I'm hoping that most users won't need to think about BoundingBox at all, they will just get it from a GeoSampler.

Could you elaborate more on this?

I mean "mode" in the mathematical sense. If I'm sampling from an IntersectionDataset and the area of overlap has files in (CRS1, CRS1), I will use CRS1. If the area has files in (CRS1, CRS2, CRS2), I'll use CRS2. If the area has files in (CRS1, CRS2, CRS3), I'll randomly choose one since they're all equally common. This minimizes the number of files we need to reproject.

adamjstewart · 2022-02-18T04:05:25Z

Proper type hints weren't added until Python 3.7 but we could use typing-extensions to backport this.

Actually, that's for TypedDict, not NamedTuple. The former was added in Python 3.7, the latter is available in all versions of Python we support.

adamjstewart · 2022-02-18T04:17:50Z

GeoDataset.index: instead of storing {bbox: filename}, store {bbox: [(filename, CRS), ...]}

{Intersection|Union}Dataset.index: instead of storing {bbox}, store {bbox: [(filename, CRS), ...]}

We might have a problem with this. I was hoping to contribute our intersection/union logic directly to rtree: Toblerity/rtree#210. However, it isn't clear what the behavior should be when computing an intersection for general rtree indices. For example, if an rtree stores general objects, the only possible way to compute an intersection would be to store the tuple of objects from both indices. So a dataset like ((ds1 & ds2) & ds3) would end up with an object like [[(bbox1, CRS1), (bbox2, CRS2)], (bbox3, CRS3)]. This would be less fun to parse than a single list, but we can't use a single list if we want general rtree support.

weiji14 · 2022-02-18T04:21:31Z

I would suggest storing a Python dict... or better, a namedtuple... instead.

A named tuple could work. Proper type hints weren't added until Python 3.7 but we could use typing-extensions to backport this. I'm not super worried about being future proof because we rebuild the index each time, so it's not like we need to support old indices. I don't think many users directly pass a value to __getitem__, I think most users instantiate a GeoSampler and let that determine the value to pass.

Is torchgeo still supporting Python 3.6? Considering that Python 3.6 is EOL already and many Python scientific/geospatial packages are moving towards adopting NEP29 (e.g. rasterio/rasterio#1813, Toblerity/Fiona#1034) which is technically Python 3.8+, I really don't feel the need to do the backporting. I did find b6925f6 and the closed PR #326 though, so if there's a reason to have Python 3.6 support for a few more months, that's ok too.

I insist that the x and y in BoundingBox(x, y, t, width, height, time) be the middle x and y, and not some arbitrary corner.

I actually proposed this exact thing to @calebrob6 because I was worried that we would end up with more distortion if we use a corner instead of a centroid. However, the problem is that you can run into off-by-one errors if the width/height is odd vs even. It also adds a bit of complexity to the code if you want to specify a bounding box that rasterio can actually read. Again, I'm hoping that most users won't need to think about BoundingBox at all, they will just get it from a GeoSampler.

Right, I'm thinking of large full scene satellite images several km wide that would definitely get distorted if we choose a corner vs a centroid. Perhaps raise a warning if the user sets an odd number width/height, saying that there will be some offset or resampling going on? Let's try to draft an implementation and see how it goes, like you say, most of this should be backend math calculations casual users shouldn't have to worry about.

Could you elaborate more on this?

I mean "mode" in the mathematical sense. If I'm sampling from an IntersectionDataset and the area of overlap has files in (CRS1, CRS1), I will use CRS1. If the area has files in (CRS1, CRS2, CRS3), I'll use CRS2. If the area has files in (CRS1, CRS2, CRS3), I'll randomly choose one since they're all equally common. This minimizes the number of files we need to reproject.

Ah ok, that statistical 'mode'! Personally, I'd prefer the sampling to be more deterministic by having an explicit flag/parameter to control what CRS is being returned, and whether or not any reprojection should happen. To your point on tensor concatenation shape mismatches (mentioned in alternative point 2 #409 (comment)), I think this will become a non-issue with #294, so we would be left with the problem on how to sample from overlapping regions, in which a 'mode' based sampling or 'first-raster' based sampling approach would be ok.

adamjstewart · 2022-02-18T04:38:27Z

Is torchgeo still supporting Python 3.6?

For the time being, yes. I would prefer to stick to Python's EOL timeline, but there are many systems that still use Python 3.6. I strongly disagree with numpy's aggressive deprecation timeline, as Google Colab still runs Python 3.7 by default. I think we'll likely compromise and stick to whatever versions of Python that PyTorch supports (currently 3.6–3.9 in the latest release, soon to be 3.7–3.9 in the next release). Unless I can convince @calebrob6 to drop 3.6 support since PyTorch has already dropped it in master.

Personally, I'd prefer the sampling to be more deterministic by having an explicit flag/parameter to control what CRS is being returned

This sounds like alternative 1 (our current behavior). If we do this, we require all samples to be projected to the same CRS, which leads to extreme distortion and slow data loading times.

For example, consider a dataset composed of files from 4 different UTM zones. If a sample contains files in zones A and B, we want to return a sample in one of those zones. If another sample contains files in zones C and D, we don't want to warp both files to zone A or B, we want to choose from C and D in order to minimize reprojection.

and whether or not any reprojection should happen.

This sounds like alternative 3. As I mentioned above, there are many issues with this. If I specify reproject=False and a sample contains files in two different CRSes, what should the behavior be? Crash? Only sample from one file? If we do the latter, the dataset basically becomes a VisionDataset and can no longer be combined intelligently with other GeoDatasets.

weiji14 · 2022-02-18T16:25:46Z

Is torchgeo still supporting Python 3.6?

For the time being, yes. I would prefer to stick to Python's EOL timeline, but there are many systems that still use Python 3.6. I strongly disagree with numpy's aggressive deprecation timeline, as Google Colab still runs Python 3.7 by default. I think we'll likely compromise and stick to whatever versions of Python that PyTorch supports (currently 3.6–3.9 in the latest release, soon to be 3.7–3.9 in the next release). Unless I can convince @calebrob6 to drop 3.6 support since PyTorch has already dropped it in master.

Yep, I think keeping Python 3.7 is definitely good since Pytorch still supports it. Let's continue this discussion at #413 since typing.NamedTuple should work in Python 3.6 as you said. I've got a draft implementation at #411 for this ready.

Personally, I'd prefer the sampling to be more deterministic by having an explicit flag/parameter to control what CRS is being returned

This sounds like alternative 1 (our current behavior). If we do this, we require all samples to be projected to the same CRS, which leads to extreme distortion and slow data loading times.

For example, consider a dataset composed of files from 4 different UTM zones. If a sample contains files in zones A and B, we want to return a sample in one of those zones. If another sample contains files in zones C and D, we don't want to warp both files to zone A or B, we want to choose from C and D in order to minimize reprojection.

and whether or not any reprojection should happen.

This sounds like alternative 3. As I mentioned above, there are many issues with this. If I specify reproject=False and a sample contains files in two different CRSes, what should the behavior be? Crash? Only sample from one file? If we do the latter, the dataset basically becomes a VisionDataset and can no longer be combined intelligently with other GeoDatasets.

Consider a semantic segmentation task (X -> Y) with:

Sentinel-2 images (X_1) in UTM zone 11N
Sentinel-2 image (X_2) in UTM zone 12N
Land use classification mask (Y) in EPSG:3857

I would say that the training input X should not be reprojected. The sampler should ideally take be taking 256x256 crops of the UTM zone 11N image and UTM zone 12N image, both in their native projections. The land use classification mask (which is of lower quality) in EPSG:3857 would then be reprojected to either UTM zone 11N or UTM zone 12N, depending on where the sampler is sampling from. The key is that the training image X should remain in the native UTM projection, because we will always be running inference on the native UTM projection of Sentinel-2 images (X), and not the projection of the mask file (EPSG:3857).

My initial idea of doing this is as follows, note that this assumes pixel sampling mode (c.f. #294):

## Land use classification (LUC) dataset stored in EPSG:3857
mask_dataset = RasterDataset(filepath="landuseclassification.tif")
# mask_sample = {
#        "image": torch.tensor(mask),  # shape: (1, 256, 256)
#        "crs": self.crs,  # in EPSG:3857
#        "bbox": query,  # in EPSG:3857
#  }

## Sentinel-2 image with 12 bands across UTM zone 11N and 12N
image_dataset11 = RasterDataset(filepath="S2_UTM_Zone_11N.tif")
image_dataset12 = RasterDataset(filepath="S2_UTM_Zone_12N.tif")
# image_sample = {
#        "image": torch.tensor(image),  # shape: (12, 256, 256)
#        "crs": self.crs,  # in UTM projection
#        "bbox": query,  # bounds in UTM projection
#  }

## Pair Sentinel-2 dataset with LUC dataset
utm11N_mask_dataset = IntersectionGeoDataset(dataset1=image_dataset11, dataset2=mask_dataset)
utm12N_mask_dataset = IntersectionGeoDataset(dataset1=image_dataset12, dataset2=mask_dataset)
# image_mask_sample1 = {
#        "image": torch.tensor(image_mask),  # shape: (13, 256, 256)
#        "crs": self.crs,  # in UTM projection 11N
#        "bbox": query,  # bounds in UTM projection 11N
#  }
# image_mask_sample2 = {
#        "image": torch.tensor(image_mask),  # shape: (13, 256, 256)
#        "crs": self.crs,  # in UTM projection 12N
#        "bbox": query,  # bounds in UTM projection 12N
#  }

## Combine UTM Zone 11N and Zone 12N GeoDatasets
full_image_dataset = UnionGeoDataset(
    dataset1=utmzone11N_mask_dataset, dataset2=utmzone12N_mask_dataset
)
# full_image_sample = {
#        "image": torch.tensor(image),  # shape: (13, 256, 256)
#        "crs": self.crs,  # in UTM projection 11N or 12N, depending on area
#        "bbox": query,  # bounds in what projection?? UTM 11N, 12N, or EPSG:3857?
#  }

So the problem is with the last Union operation - what projection should the Rtree index be in? UTM Zone 11N, 12N, or EPSG:3857? Maybe you're right and what I'm after might not be a job for IntersectionGeoDataset/UnionGeoDataset, but just a VisionDataset, or something else entirely like a ConcatGeoDataset or CombinedGeoDataset (for lack of a better term) that does not do reprojection at all (i.e. the user has to handle it manually).

It did occur to me that there is an inherent limitation with the rtree library that only allows for one projection type. I had an idea to store 3 rtree indexes, one for each GeoDataset being combined, and have GeoSampler do a for-loop through all of those. But not sure how well that would work in practice, and probably more trouble than it is worth.

adamjstewart · 2022-02-18T16:42:16Z

I would say that the training input X should not be reprojected.

What if the training dataset X contains a file A in UTM zone 11N and a file B in UTM zone 12N and both A and B intersect with the sampler bounding box? Do you a) crash, or b) only return one of the two images? How do you decide which image to return?

You do make a good point that we should avoid reprojecting float images and prefer reprojecting int label masks when possible, although it's going to be difficult to assert that preference in my proposed solution.

weiji14 · 2022-02-18T17:27:18Z

I would say that the training input X should not be reprojected.

What if the training dataset X contains a file A in UTM zone 11N and a file B in UTM zone 12N and both A and B intersect with the sampler bounding box? Do you a) crash, or b) only return one of the two images? How do you decide which image to return?

How about taking the image where the sampled bounding box overlaps with most? Or if the box happens to be in the middle and the overlap areas are the same, I would just go with returning the first one in the case of RandomGeoSampler (since it is random anyway). Not sure how to handle it for GridGeoSampler though, probably should return both images, but not sure if that is possible looking at the code.

You do make a good point that we should avoid reprojecting float images and prefer reprojecting int label masks when possible, although it's going to be difficult to assert that preference in my proposed solution.

Taking a step back, I think we can both agree that implementation details 1, 2 and 3 are good enough, and can be done right now just with some code refactoring. Most of the debate appears to be on implementation detail 4 (how GeoSampler should behave), and given that there are probably a whole lot of combinations we haven't even though about, maybe we should come up with the simplest method that makes sense (e.g. taking the 'mode' projection, or projection of the 'first' GeoDataset), and make it easier somehow for people to subclass or adapt GeoSampler to fit their own specific problem.

adamjstewart · 2022-02-18T17:34:17Z

The problem is that implementation details 1, 2, and 3 are only necessary if we decide to go with this proposal. Alternative 1 (not changing anything, requiring the user to specify the CRS to reproject to) means not changing anything. Let's decide if this proposal is the behavior we want to achieve before we worry about making any changes. I'm still interested in hearing other alternative proposals, especially since you seem interested in alternatives 1 and 3.

weiji14 · 2022-02-18T18:29:35Z

The problem is that implementation details 1, 2, and 3 are only necessary if we decide to go with this proposal. Alternative 1 (not changing anything, requiring the user to specify the CRS to reproject to) means not changing anything. Let's decide if this proposal is the behavior we want to achieve before we worry about making any changes. I'm still interested in hearing other alternative proposals, especially since you seem interested in alternatives 1 and 3.

Well, I would argue that implementation details 1 and 2 is actually quite useful regardless of what happens. At the very least, having the CRS information in the Rtree index would allow me to create a custom Dataset/DataModule containing data from different projections more easily, instead of having to subclass RasterDataset and rewrite these 70 lines of logic:

torchgeo/torchgeo/datasets/geo.py

Lines 313 to 383 in fa6e4e1

    
               def __init__( 
        
                   self, 
        
                   root: str, 
        
                   crs: Optional[CRS] = None, 
        
                   res: Optional[float] = None, 
        
                   transforms: Optional[Callable[[Dict[str, Any]], Dict[str, Any]]] = None, 
        
                   cache: bool = True, 
        
               ) -> None: 
        
                   """Initialize a new Dataset instance. 
        
                   Args: 
        
                       root: root directory where dataset can be found 
        
                       crs: :term:`coordinate reference system (CRS)` to warp to 
        
                           (defaults to the CRS of the first file found) 
        
                       res: resolution of the dataset in units of CRS 
        
                           (defaults to the resolution of the first file found) 
        
                       transforms: a function/transform that takes an input sample 
        
                           and returns a transformed version 
        
                       cache: if True, cache file handle to speed up repeated sampling 
        
                   Raises: 
        
                       FileNotFoundError: if no files are found in ``root`` 
        
                   """ 
        
                   super().__init__(transforms) 
        
                   self.root = root 
        
                   self.cache = cache 
        
                   # Populate the dataset index 
        
                   i = 0 
        
                   pathname = os.path.join(root, "**", self.filename_glob) 
        
                   filename_regex = re.compile(self.filename_regex, re.VERBOSE) 
        
                   for filepath in glob.iglob(pathname, recursive=True): 
        
                       match = re.match(filename_regex, os.path.basename(filepath)) 
        
                       if match is not None: 
        
                           try: 
        
                               with rasterio.open(filepath) as src: 
        
                                   # See if file has a color map 
        
                                   try: 
        
                                       self.cmap = src.colormap(1) 
        
                                   except ValueError: 
        
                                       pass 
        
                                   if crs is None: 
        
                                       crs = src.crs 
        
                                   if res is None: 
        
                                       res = src.res[0] 
        
                                   with WarpedVRT(src, crs=crs) as vrt: 
        
                                       minx, miny, maxx, maxy = vrt.bounds 
        
                           except rasterio.errors.RasterioIOError: 
        
                               # Skip files that rasterio is unable to read 
        
                               continue 
        
                           else: 
        
                               mint: float = 0 
        
                               maxt: float = sys.maxsize 
        
                               if "date" in match.groupdict(): 
        
                                   date = match.group("date") 
        
                                   mint, maxt = disambiguate_timestamp(date, self.date_format) 
        
                               coords = (minx, maxx, miny, maxy, mint, maxt) 
        
                               self.index.insert(i, coords, filepath) 
        
                               i += 1 
        
                   if i == 0: 
        
                       raise FileNotFoundError( 
        
                           f"No {self.__class__.__name__} data was found in '{root}'" 
        
                       ) 
        
                   self._crs = cast(CRS, crs) 
        
                   self.res = cast(float, res)

adamjstewart · 2022-03-06T20:57:23Z

When we make this switch, we should also switch from interleaved=True to interleaved=False in our rtree index. The former is known to be buggy and actually discouraged by the rtree maintainers: Toblerity/rtree#204 (comment)

adamjstewart · 2023-03-08T15:25:29Z

If anyone ever tries to implement this, we should use a dict, not a tuple. There are a lot of things we would like to optionally store in the R-tree that aren't always available but could be used by the sampler. For example, we could store percent cloud cover and have a sampler weighted by cloud cover.

AdeelH · 2024-08-10T20:40:32Z

Require single CRS—We could require all users to reproject their data themselves and ensure that all files are in the same CRS. Then we never have to worry about reprojection. This is similar to what Raster Vision does (from my understanding).

Sorry to resurrect this old issue, but re: Raster Vision: the above is incorrect.

The RV approach (which has not changed since this issue was created) is to have one GeoDataset per scene (i.e. per file) and then concatenate them using PyTorch's ConcatDataset for training etc. This means you can even concatenate GeoDatasets with ImageDatasets (RV's counterpart to NonGeoDatasets) or even with non-RV Datasets. So not only can you sample from a mix of scenes with different projections, but you can even do it from a mix of georeferenced and non-georeferenced data sources. At no point* is any file reprojected.

You can even take it a step further and stack bands from files with different projections. Needless to say, the labels can come from a source with different projection than the imagery as well.

The concatenation is possible because RV GeoDatasets have a length and are indexable by integers. Unlike TorchGeo, RV does not use specialized Samplers. The spatial sampling is part of the definition of GeoDataset subclasses, SlidingWindowGeoDataset and RandomWindowGeoDataset, and is done in pixel-space. The map-to-pixel coordinate and vice versa transformation is performed at a different level of abstraction, namely at the RasterSource and the VectorSource level.

Happy to provide more details about RV's approach, if needed.

*An exception is if you create a GeoDataset (or RasterioSource) by passing multiple URIs; in that case, those are mosaiced together into a single VRT using gdal.

adamjstewart · 2024-08-11T09:26:43Z

Thanks for the clarification @AdeelH! If I understand correctly, RV never does reprojection unless you stack bands or combine images and masks from different CRS? I think this is similar to what I'm proposing here.

AdeelH · 2024-08-11T20:18:03Z

Yes, if by reprojection you mean transforming the bounding box being read between pixel and map coordinates. E.g. when sampling from a MultiRasterSource, the query bounding box (if I'm using the TorchGeo terminology correctly), which is initially in pixel coords, will

get transformed to map coords
and then, for each sub-RasterSource:
- will get transformed to pixel coords in the sub-RasterSource's respective pixel space
- which will then be used to read a chip
finally, the sub-chips will be concatenated along the channel dimension

Each RasterSource has an associated CRSTransformer which can transform bounding boxes between pixel coords and coords in some map CRS (EPSG:4326 by default).

adamjstewart · 2024-08-11T20:54:04Z

By reprojection, I mean using gdalwarp (or rasterio.warp) to project the data from one CRS to another. That's what this issue is about.

AdeelH · 2024-08-11T21:28:50Z

In that case, no, RV never does reprojection nor requires users to do it.

adamjstewart · 2024-08-12T06:33:17Z

If users want to combine images and masks in the same location but two different CRSs, who does the reprojection? The user?

AdeelH · 2024-08-12T15:29:19Z

It's supposed to work the same way as when stacking bands from different sources that I described above. Since RV samples windows in pixel coords e.g. Box(0, 0, 100, 100), if we're sampling a window from the image RasterSource, that window will be transformed to the pixel coords of the mask RasterSource such that it still references the same spatial area. Although as I'm looking at the code now, the band-stacking implementation is currently more general than the image-mask alignment implementation. The former can handle different CRSs, extents, and resolutions while the latter can only handle different CRSs and extents.

If the masks are coming from a vector data source (e.g. GeoJSON), the geometries are transformed to pixel coords in-memory, as shown here.

Edit: Note: It's entirely possible that this approach is not fully general and does not work for all CRS combinations. If you can think of cases where it might fail, I'd be eager to know.

adamjstewart · 2024-08-12T17:48:20Z

If the images are in different coordinate systems, don't you end up with rotated/warped images like in Figure 2 of https://arxiv.org/abs/2111.08872? How do you even compute pixel coordinates if the images don't have the same bounds?

AdeelH · 2024-08-12T19:15:49Z

Example noteboook showing stacking of bands from NAIP, Sentinel-2, and Landsat with geospatial alignment without reprojection: https://colab.research.google.com/drive/1trjhV2ohQN9AkJb889XuOymr-Z9F7YLt?usp=sharing

Sampled chip:

What the scenes look like in QGIS:

adamjstewart added datasets Geospatial or benchmark datasets samplers Samplers for indexing datasets labels Feb 16, 2022

calebrob6 mentioned this issue Feb 16, 2022

Add pixel sampling mode #294

Merged

weiji14 mentioned this issue Feb 17, 2022

Store projection metadata in rtree index of GeoDataset #411

Closed

weiji14 mentioned this issue Feb 18, 2022

Timeline for dropping Python 3.6 support #413

Closed

adamjstewart added this to the 0.3.0 milestone Feb 18, 2022

weiji14 mentioned this issue Mar 10, 2022

Integration tests fail on macOS #349

Closed

This was referenced Mar 23, 2022

Add dataset table to the documentation #435

Merged

Add PreChippedGeoSampler for pre-chipped geospatial datasets #479

Merged

calebrob6 mentioned this issue Jul 2, 2022

Cheasapeake test set image and label size are 2x bigger #619

Closed

adamjstewart mentioned this issue Jul 6, 2022

Add prediction utilities #560

Closed

adamjstewart modified the milestones: 0.3.0, 0.4.0 Jul 9, 2022

adamjstewart mentioned this issue Nov 9, 2022

ERROR 1: Point outside of projection domain #893

Closed

adamjstewart removed this from the 0.4.0 milestone Dec 6, 2022

adamjstewart mentioned this issue May 21, 2023

TileDatasets #1353

Open

adamjstewart mentioned this issue Jun 9, 2023

Stitching: Access patch geo-transform in callback after predict #1407

Open

adriantre mentioned this issue Feb 13, 2024

How to avoid nodata-only patches #1330

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GeoDataset: avoid unnecessary reprojection #409

GeoDataset: avoid unnecessary reprojection #409

adamjstewart commented Feb 16, 2022

weiji14 commented Feb 17, 2022 •

edited

Loading

adamjstewart commented Feb 17, 2022

adamjstewart commented Feb 18, 2022

adamjstewart commented Feb 18, 2022

weiji14 commented Feb 18, 2022

adamjstewart commented Feb 18, 2022

weiji14 commented Feb 18, 2022 •

edited

Loading

adamjstewart commented Feb 18, 2022

weiji14 commented Feb 18, 2022

adamjstewart commented Feb 18, 2022

weiji14 commented Feb 18, 2022

adamjstewart commented Mar 6, 2022

adamjstewart commented Mar 8, 2023

AdeelH commented Aug 10, 2024 •

edited

Loading

adamjstewart commented Aug 11, 2024

AdeelH commented Aug 11, 2024

adamjstewart commented Aug 11, 2024

AdeelH commented Aug 11, 2024 •

edited

Loading

adamjstewart commented Aug 12, 2024

AdeelH commented Aug 12, 2024 •

edited

Loading

adamjstewart commented Aug 12, 2024

AdeelH commented Aug 12, 2024 •

edited

Loading

GeoDataset: avoid unnecessary reprojection #409

GeoDataset: avoid unnecessary reprojection #409

Comments

adamjstewart commented Feb 16, 2022

Rationale

Proposal

Implementation

Alternatives

weiji14 commented Feb 17, 2022 • edited Loading

adamjstewart commented Feb 17, 2022

adamjstewart commented Feb 18, 2022

adamjstewart commented Feb 18, 2022

weiji14 commented Feb 18, 2022

adamjstewart commented Feb 18, 2022

weiji14 commented Feb 18, 2022 • edited Loading

adamjstewart commented Feb 18, 2022

weiji14 commented Feb 18, 2022

adamjstewart commented Feb 18, 2022

weiji14 commented Feb 18, 2022

adamjstewart commented Mar 6, 2022

adamjstewart commented Mar 8, 2023

AdeelH commented Aug 10, 2024 • edited Loading

adamjstewart commented Aug 11, 2024

AdeelH commented Aug 11, 2024

adamjstewart commented Aug 11, 2024

AdeelH commented Aug 11, 2024 • edited Loading

adamjstewart commented Aug 12, 2024

AdeelH commented Aug 12, 2024 • edited Loading

adamjstewart commented Aug 12, 2024

AdeelH commented Aug 12, 2024 • edited Loading

weiji14 commented Feb 17, 2022 •

edited

Loading

weiji14 commented Feb 18, 2022 •

edited

Loading

AdeelH commented Aug 10, 2024 •

edited

Loading

AdeelH commented Aug 11, 2024 •

edited

Loading

AdeelH commented Aug 12, 2024 •

edited

Loading

AdeelH commented Aug 12, 2024 •

edited

Loading