
Added substation segmentation dataset #2352

Open · wants to merge 22 commits into main
Conversation


@rijuld commented Oct 17, 2024

No description provided.

@github-actions bot added the documentation, datasets, and testing labels Oct 17, 2024
@adamjstewart added this to the 0.7.0 milestone Oct 17, 2024
@adamjstewart (Collaborator)

Hi @rijuld, thanks for the contribution! If you're new to creating PyTorch datasets, I highly recommend reading the following tutorials:

The only difference between datasets in torchvision and NonGeoDatasets in TorchGeo is that our __getitem__ returns a dictionary instead of a tuple. Other than that, they share all the same basic components.

Most of your issues seem to be due to the use of args. I think you just need to remove this and explicitly list all parameters in the function signature. This will also simplify your testing code. Take a look at other existing datasets; we have about 75 examples to choose from. If you find one that is similar to your dataset, it shouldn't actually require that many changes to get it working.
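
For reference, a minimal sketch of that shape — explicit constructor parameters and a dict-returning __getitem__ (the class, field names, and sizes here are illustrative, not this PR's API):

```python
import torch
from torch import Tensor
from torchgeo.datasets import NonGeoDataset


class ToySegmentationDataset(NonGeoDataset):
    """Toy example: explicit parameters, dict-returning __getitem__."""

    def __init__(self, root: str = 'data', split: str = 'train') -> None:
        # Each valid input is a named, typed parameter -- no opaque `args` object.
        self.root = root
        self.split = split

    def __getitem__(self, index: int) -> dict[str, Tensor]:
        # A real dataset would load these from disk; random tensors keep
        # the sketch self-contained.
        image = torch.rand(4, 228, 228)
        mask = torch.randint(0, 2, (228, 228))
        # TorchGeo returns a dict, not the (image, label) tuple torchvision uses.
        return {'image': image, 'mask': mask}

    def __len__(self) -> int:
        return 100
```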

@rijuld (Author) commented Oct 22, 2024

Hi @adamjstewart, thanks a ton for the feedback! I will go through these tutorials.

image = image[:4, :, :, :] if self.use_timepoints else image[0]
return torch.from_numpy(image)

def _apply_transforms(self, image: torch.Tensor, mask: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
@nilsleh (Collaborator) commented Oct 24, 2024

Hi @rijuld, thank you for contributing this dataset. As another pointer, torchgeo datasets usually have an accompanying datamodule that defines things like the train/val/test split, but also common data augmentations (flips, color augmentations, etc.) through the kornia package. So in essence, torchgeo datasets simply load a particular sample, and the augmentations are applied on GPU over the batch.

For example, in this dataset the __getitem__ method loads the image and mask, and then we have a corresponding datamodule where we define augmentations like resizing and others, which will automatically be applied with a lightning training setup. This keeps the datasets streamlined and "minimal", and makes use of existing augmentation implementations like Kornia.
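
As a rough sketch of that pattern (assuming the SubstationDataset class from this PR; the augmentation choices are just examples):

```python
import kornia.augmentation as K
from torchgeo.datamodules import NonGeoDataModule
from torchgeo.transforms import AugmentationSequential


class SubstationDataModule(NonGeoDataModule):
    """Hypothetical datamodule; augmentations run on GPU over whole batches."""

    def __init__(self, batch_size: int = 64, num_workers: int = 0, **kwargs) -> None:
        # SubstationDataset is the dataset class added in this PR.
        super().__init__(SubstationDataset, batch_size, num_workers, **kwargs)
        # Applied automatically to training batches after GPU transfer.
        self.train_aug = AugmentationSequential(
            K.RandomHorizontalFlip(p=0.5),
            K.RandomVerticalFlip(p=0.5),
            data_keys=['image', 'mask'],
        )
```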

Let me know if I can help with any further questions.

@rijuld (Author)

Hi @nilsleh , thank you for the detailed explanation!

That makes perfect sense. I will try to make this minimal, implement it today, and reach out if I have any further questions.

Thanks again!

@rijuld (Author) commented Oct 30, 2024

Hi @nilsleh,

Hope you're doing well! I wanted to clarify whether it's essential to shift all data augmentations to the datamodule. If so, could you guide me on which specific parts of the dataset should be moved there?

I've already removed the geotransform and color transform and plan to add them to the datamodule in my next pull request. If there are other elements you’d suggest removing, I can address those too. Once these adjustments are made, would it be possible to merge this PR (pending review) without the datamodule updates?

Thank you very much for your help!

@nilsleh (Collaborator)

Apologies for the late response. Adam prefers having all data normalization in the datamodule for consistency, but I also don't think it is terrible to do in the dataset. If you move it to the datamodule, you can use the kornia Normalize module, which you can add to the augmentation sequence. It will then be applied in on_after_batch_transfer in the LightningDataModule.
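
A sketch of that (the mean/std values below are placeholders, not computed statistics for this dataset):

```python
import torch
import kornia.augmentation as K
from torchgeo.transforms import AugmentationSequential

# Placeholder per-band statistics -- replace with values computed on the data.
mean = torch.zeros(13)
std = torch.ones(13)

# Putting Normalize in the augmentation sequence means it runs in
# on_after_batch_transfer, i.e. on the whole batch after the GPU transfer.
aug = AugmentationSequential(
    K.Normalize(mean=mean, std=std),
    data_keys=['image', 'mask'],
)
```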

@rijuld (Author) commented Oct 30, 2024

@microsoft-github-policy-service agree

@rijuld requested a review from nilsleh October 30, 2024 17:55
@adamjstewart (Collaborator) left a comment

Not a bad data loader, it just doesn't match a single other data loader in TorchGeo. I highly recommend looking at some of the 80+ existing data loaders, and the unit tests for those data loaders, already built into TorchGeo before adding a new one from scratch. Especially for unit testing, you can probably just copy-n-paste most of the existing test code for a similar dataset.

@@ -0,0 +1,170 @@
"""This module handles the Substation segmentation dataset."""
Collaborator

Is there a non-segmentation version of this dataset? If not, let's just name the file substation.py.



class SubstationDataset(NonGeoDataset):
"""SubstationDataset is responsible for handling the loading and transformation of substation segmentation datasets.
Collaborator

These lines are likely over the 88-character line length limit.

Can you include a URL linking to the homepage and a more detailed description of the dataset? See other dataset files for examples of what we like to document.
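
For illustration, a docstring along these lines (the details below are drawn from this PR's datasets.csv entry and would need to be verified; homepage URL and citation still need to be filled in):

```python
class SubstationDataset(NonGeoDataset):
    """Substation segmentation dataset.

    Binary segmentation of electrical substations in Sentinel-2 imagery,
    with masks derived from OpenStreetMap.

    Dataset format:

    * ~27,000 image/mask pairs
    * 228x228 px patches at 10 m resolution
    * multispectral (MSI) imagery

    If you use this dataset in your research, please cite the original source.

    .. versionadded:: 0.7
    """
```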

directory: str = 'Substation'
filename_images: str = 'image_stack.tar.gz'
filename_masks: str = 'mask.tar.gz'
url_for_images: str = 'https://urldefense.proofpoint.com/v2/url?u=https-3A__storage.googleapis.com_tz-2Dml-2Dpublic_substation-2Dover-2D10km2-2Dcsv-2Dmain-2D444e360fd2b6444b9018d509d0e4f36e_image-5Fstack.tar.gz&d=DwMFaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=ypwhORbsf5rB8FTl-SAxjfN_U0jrVqx6UDyBtJHbKQY&m=-2QXCp-gZof5HwBsLg7VwQD-pnLedAo09YCzdDCUTqCI-0t789z0-HhhgwVbYtX7&s=zMCjuqjPMHRz5jeEWLCEufHvWxRPdlHEbPnUE7kXPrc&e='
Collaborator

Can you use the real URL instead of this url defense wrapper?

Collaborator

These downloads also need MD5 checksums
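
Typically the checksum lives next to the URL and is passed to the download helper — a sketch with a placeholder URL and hash:

```python
from torchgeo.datasets.utils import download_and_extract_archive

url = 'https://example.com/image_stack.tar.gz'  # real dataset URL goes here
md5 = '0123456789abcdef0123456789abcdef'  # placeholder, not the real hash

# Verifies the archive against the MD5 before extracting it.
download_and_extract_archive(url, download_root='data', md5=md5)
```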


def __init__(
self,
args: Any,
Collaborator

Let's get rid of args and instead have specific parameters for each valid input. This helps with type checking and documenting. At the moment, there is absolutely no documentation suggesting that args.use_time_stamp is a required attribute of this mysterious Any class that is not documented anywhere.
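
Concretely, something along these lines (a sketch; every name except use_timepoints, which appears in the PR's code, is illustrative):

```python
from collections.abc import Callable

from torch import Tensor
from torchgeo.datasets import NonGeoDataset


class SubstationDataset(NonGeoDataset):
    def __init__(
        self,
        root: str = 'data',
        use_timepoints: bool = False,
        transforms: Callable[[dict[str, Tensor]], dict[str, Tensor]] | None = None,
        download: bool = False,
        checksum: bool = False,
    ) -> None:
        """Initialize a new SubstationDataset instance.

        Args:
            root: root directory where the dataset can be found
            use_timepoints: whether to return a stack of timepoints per image
            transforms: a function/transform applied to each sample dict
            download: if True, download the dataset if it is not found in root
            checksum: if True, verify the MD5 of the downloaded files
        """
        self.root = root
        self.use_timepoints = use_timepoints
        self.transforms = transforms
        self.download = download
        self.checksum = checksum
```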

"""Returns the number of items in the dataset."""
return len(self.image_filenames)

def plot(self) -> None:
Collaborator

The plot method takes a sample as input and plots it. See every other dataset for an example of this.
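
Roughly this shape, following the convention in other datasets (a sketch of a dataset method; band selection for display is glossed over):

```python
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
from torch import Tensor


# A method of the dataset class, shown standalone for brevity.
def plot(self, sample: dict[str, Tensor], suptitle: str | None = None) -> Figure:
    """Plot a sample (an image/mask dict) returned by __getitem__."""
    image = sample['image'][:3].permute(1, 2, 0).numpy()  # first 3 bands as RGB
    mask = sample['mask'].squeeze().numpy()

    fig, axs = plt.subplots(1, 2, figsize=(8, 4))
    axs[0].imshow(image)
    axs[0].set_title('Image')
    axs[1].imshow(mask, cmap='gray')
    axs[1].set_title('Mask')
    for ax in axs:
        ax.axis('off')
    if suptitle is not None:
        fig.suptitle(suptitle)
    return fig
```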

image_dir_exists = os.path.exists(self.image_dir)
mask_dir_exists = os.path.exists(self.mask_dir)
if not (image_dir_exists and mask_dir_exists):
self._download()
Collaborator

Downloading random files from the internet without checksum verification should not happen by default, the user should have to pass download=True if they really want things to be downloaded. This violates the principle of least surprise.
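
The usual pattern is a _verify method that only downloads when the user opted in — a sketch:

```python
import os

from torchgeo.datasets import DatasetNotFoundError


# A method of the dataset class, shown standalone for brevity.
def _verify(self) -> None:
    """Check that the data is present; only download if the user asked for it."""
    if os.path.exists(self.image_dir) and os.path.exists(self.mask_dir):
        return
    if not self.download:
        raise DatasetNotFoundError(self)
    self._download()
```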

@@ -369,6 +369,11 @@ PASTIS

.. autoclass:: PASTIS

SubstationDataset
Collaborator

These should be in alphabetical order

@@ -49,6 +49,7 @@ Dataset,Task,Source,License,# Samples,# Classes,Size (px),Resolution (m),Bands
`SSL4EO`_-S12,T,Sentinel-1/2,"CC-BY-4.0",1M,-,264x264,10,"SAR, MSI"
`SSL4EO-L Benchmark`_,S,Landsat & CDL,"CC0-1.0",25K,134,264x264,30,MSI
`SSL4EO-L Benchmark`_,S,Landsat & NLCD,"CC0-1.0",25K,17,264x264,30,MSI
`SubstationDataset`_,S,OpenStreetMap & Sentinel-2, "CC BY-SA 2.0", 27K, 2, 228x228, 10, MSI
Collaborator

Suggested change:

```diff
-`SubstationDataset`_,S,OpenStreetMap & Sentinel-2, "CC BY-SA 2.0", 27K, 2, 228x228, 10, MSI
+`SubstationDataset`_,S,OpenStreetMap & Sentinel-2, "CC-BY-SA 2.0", 27K, 2, 228x228, 10, MSI
```

Should be a valid SPDX identifier.



if __name__ == '__main__':
pytest.main([__file__])
Collaborator

This is not needed, the file is run by pytest, not the other way around.
