-
Notifications
You must be signed in to change notification settings - Fork 379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add splitting utilities for GeoDatasets #866
Add splitting utilities for GeoDatasets #866
Conversation
This feature serves the same purpose as The reason I originally chose to put this parameter in the sampler instead of the dataset is in order to save time. Instantiating a dataset is kinda slow (you need to recursively search the filesystem for files), so if you only need to instantiate 1 dataset and 3 samplers, that's much faster than 3 datasets and 1 sampler. That doesn't mean our current design is ideal, and I agree we at least need an easier way to easily split a dataset without knowledge of geospatial metadata. In addition to the split strategy you propose here, I would also love a way to overlay a grid over a dataset and randomly distribute cells to different splits (see Figure 2 in https://arxiv.org/abs/1805.02855 for an example of what I mean). I feel like we should add a |
One big difference with the GeoSampler's roi is when the dataset has non-contiguos rasters. In our specific problem, we have one raster per city so all the raster files in the RasterDataset are non-contiguos. Using one roi for the full dataset through the GeoSampler doesn't make sense in this case. Or at least I wouldn't know how to make it work. With this method, when the RasterDataset is loaded, each file is cropped by reducing the bounds that get added to the index. So for our specific case with cities, we can easily divide each raster/city into train/val/test datasets. Also, not needing the CRS beforehand is a big advantage. I don't know where it would be best to put all these splitting utilities but I think doing it on the dataset level gives you more flexibility. The GeoSampler's roi parameter can be added to the RasterDataset, but I don't think it's possible to do the same we are doing here. |
@adamjstewart I used the same original logic for splitting and turned it into a function that could go into utils. What do you think of this approach? It might be possible to do the same with a GeoSampler by splitting its underlying dataset
Also added a split() method to BoundingBox:
|
I'm currently preparing for a conference, but I'll try to look at this in more detail next week. I'll have to decide whether splitting should happen at the dataset or sampler level. |
Hi @adamjstewart, any thoughts on this dataset/sampler splitting strategy? Let me know what you think so I can finalize it or adapt it to work for samplers instead. One pro of splitting at the dataset level is that later one could use different sampling strategies for train/val/test, like using a RandomSampler for training and then a GridSampler for validation and testing. |
Apologies for taking so long to get back to this. Yes, this implementation is looking much better now! Note that we do also have some splitting stuff in At this point, I think the exact implementation details are less of a blocker than the naming and file location. I think the best path forward would be to sit down and brainstorm all potential ways that someone might want to split our datasets, and compare them to splitting utilities in other libraries. Just trying to avoid adding functions here and then moving or renaming them later. We should also consider if we want to merge the sampler roi functionality into this. I've seen a lot of users find sampler splitting less intuitive than dataset splitting simply because torchvision has no samplers and everything must be done at the dataset level. Sorry this isn't a quick and straightforward addition. This is how it always is when you want to make a big change. Adding a new dataset is routine, but adding a new way of interacting with datasets requires a lot of thought and careful design. |
No worries! I've been kind of busy and just now I have some time I can put into this. I hadn't seen the What do you think about merging both into one For the sampler's roi functionality, we could add another function in the same file and call it something like Let me know what you think. |
I'm worried about this because in the case of GeoDataset there are several different ways you might want to split the data. So I don't think having a single function will work. That's why we need to enumerate all possible splitting strategies so we can decide on function names. |
Ok, I see your point. Let's list them then and see how we can move forward. I'll start with some we have discussed here:
|
I went forward and re-wrote the definitions into a more functional form and included a proposal for the function names. I think it would make a lot of sense to have this in a new What are your thoughts @adamjstewart? Any other strategy you would add at this stage? Regular splits:
Spatial splits:
Time splits:
|
Those names sound good to me. How do you propose these splitters work? Would they accept a dataset as input and create multiple datasets? Or would they be arguments to a sampler? |
I'm thinking on a function that takes a dataset as an argument and returns multiple datasets, like the I'm not sure whether to ask for |
The latest version of random_split accepts both a list of lengths and a list of percentages. I would like to do the same, both for consistency with PyTorch and because it's more useful this way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a ton of tedious error-prone code required here (a couple releases back we fixed a ton of floating point rounding issues for stuff like this) but your unit tests are so thorough it gives me confidence that things are likely bug-free, nice job!
@calebrob6 is planning on drawing up some pictures to illustrate what each splitter does in a follow-up PR, should help with the docstrings.
Most of my review comments are clarification comments or type hint suggestions, I couldn't actually find any bugs or missing features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a super convenient utility, thanks a lot @pmandiola! Looking ahead with the WIP of #877 there are likely to be some other changes/adaptions since we have not settled on an implementation for Time Series Support. However, I think this is a great starting point and convenient to generally create time series splits from more difficult to inspect raster data as opposed to a dataframe where you can just slice your data. Another case important to spatio-temporal data looking ahead would be to create splits both across time and geospatial location.
With respect to your question about time-series analysis and rolling windows @adamjstewart , I have made some comments in the time-series support PR. So I think we can discuss that there.
Thanks @adamjstewart and @nilsleh for your reviews! I think I addressed most of your comments but happy to change anything else if needed. |
Thanks again for the hard work on this! Sorry it took so long to review! |
No worries, happy to help! |
* add extent_crop to BoundingBox * add extent_crop param to RasterDataset * train_test_split function * minor changes * fix circular import * remove extent_crop * move existing functions to new file * refactor random_nongeo_split * refactor random_bbox_splitting * add roi_split * add random_bbox_assignment * add input checks * fix input type * minor reorder * add tests * add non-overlapping test * more tests * fix tests * additional tests * check overlapping rois * add time_series_split with tests * fix random_nongeo_split to work with fractions in torch 1.9 * modify random_nongeo_split test for coverage * add random_grid_cell_assignment with tests * add test * insert object into new indexes * check grid_size * better tests * small type fix * fix again * rm .DS_Store * fix typo Co-authored-by: Adam J. Stewart <[email protected]> * bump version added * add to __init__ * add to datasets.rst * use accumulate from itertools * clarify grid_size * remove random_nongeo_split * remove _create_geodataset_like * black reformatting * Update tests/datasets/test_splits.py Co-authored-by: Adam J. Stewart <[email protected]> * change import * docstrings * undo intersection change * use microsecond * use isclose * black * fix typing * add comments --------- Co-authored-by: Adam J. Stewart <[email protected]>
Some of our users have commented that the documentation isn't clear enough about the difference between some of these splitting utilities. Also that we don't provide recommendations of which splitting utility to use for which situation. @calebrob6 I know you're passionate about this. Any interest/time in making diagrams to explain the differences between these functions and adding some text to explain how to choose which one is right for you? |
This PR adds the ability to create train/val/test RasterDatasets from the same underlying raster files, inspired from a functionality in the rastervision package.
An
extent_crop
param is added to RasterDataset to indicate the portion of the raster files to be skipped. This way, one could create a train RasterDataset using only the top 80% of the files by specifyingextent_crop=(0.2, 0, 0, 0)
, and a test RasterDataset using the remaining 20% withextent_crop=(0, 0, 0.8, 0)
.I got a bit stuck with how to implement a test for this, but before figuring it out I wanted to check if this looks ok to you.