
Added substation segmentation dataset #2352

Open · wants to merge 22 commits into main
Conversation


@rijuld commented Oct 17, 2024

No description provided.

@github-actions bot added the documentation, datasets, and testing labels Oct 17, 2024
@adamjstewart added this to the 0.7.0 milestone Oct 17, 2024
@adamjstewart (Collaborator)

Hi @rijuld, thanks for the contribution! If you're new to creating PyTorch datasets, I highly recommend reading the following tutorials:

The only difference between datasets in torchvision and NonGeoDatasets in TorchGeo is that our __getitem__ returns a dictionary instead of a tuple. Other than that, they share all the same basic components.

Most of your issues seem to be due to the use of args. I think you just need to remove this and explicitly list all parameters in the function signature. This will also simplify your testing code. Take a look at other existing datasets; we have about 75 examples to choose from. If you find one that is similar to your dataset, it shouldn't actually require that many changes to get it working.
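
For reference, a minimal sketch of that shape — explicit constructor parameters and a dict-returning __getitem__ (the class, field names, and sizes here are illustrative, not this PR's API):

```python
import torch
from torch import Tensor
from torchgeo.datasets import NonGeoDataset


class ToySegmentationDataset(NonGeoDataset):
    """Toy example: explicit parameters, dict-returning __getitem__."""

    def __init__(self, root: str = 'data', split: str = 'train') -> None:
        # Each valid input is a named, typed parameter -- no opaque `args` object.
        self.root = root
        self.split = split

    def __getitem__(self, index: int) -> dict[str, Tensor]:
        # A real dataset would load these from disk; random tensors keep
        # the sketch self-contained.
        image = torch.rand(4, 228, 228)
        mask = torch.randint(0, 2, (228, 228))
        # TorchGeo returns a dict, not the (image, label) tuple torchvision uses.
        return {'image': image, 'mask': mask}

    def __len__(self) -> int:
        return 100
```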

@rijuld (Author) commented Oct 22, 2024

Hi @adamjstewart, thanks a ton for the feedback! I will go through these tutorials.

image = image[:4, :, :, :] if self.use_timepoints else image[0]
return torch.from_numpy(image)

def _apply_transforms(self, image: torch.Tensor, mask: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
@nilsleh (Collaborator) commented Oct 24, 2024

Hi @rijuld, thank you for contributing this dataset. As another pointer, torchgeo datasets usually have an accompanying datamodule that defines things like the train/val/test split, but also common data augmentations (flips, color augmentations, etc.) through the kornia package. So in essence, torchgeo datasets simply load a particular sample, and the augmentations are applied on GPU over the batch.

For example, in this dataset the __getitem__ method loads the image and mask, and then we have a corresponding datamodule where we define augmentations like resizing and others, which will automatically be applied with a lightning training setup. This keeps the datasets streamlined and "minimal", and makes use of existing augmentation implementations like Kornia.
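
As a rough sketch of that pattern (assuming the SubstationDataset class from this PR; the augmentation choices are just examples):

```python
import kornia.augmentation as K
from torchgeo.datamodules import NonGeoDataModule
from torchgeo.transforms import AugmentationSequential


class SubstationDataModule(NonGeoDataModule):
    """Hypothetical datamodule; augmentations run on GPU over whole batches."""

    def __init__(self, batch_size: int = 64, num_workers: int = 0, **kwargs) -> None:
        # SubstationDataset is the dataset class added in this PR.
        super().__init__(SubstationDataset, batch_size, num_workers, **kwargs)
        # Applied automatically to training batches after GPU transfer.
        self.train_aug = AugmentationSequential(
            K.RandomHorizontalFlip(p=0.5),
            K.RandomVerticalFlip(p=0.5),
            data_keys=['image', 'mask'],
        )
```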

Let me know if I can help with any further questions.

@rijuld (Author)

Hi @nilsleh , thank you for the detailed explanation!

That makes perfect sense. I will try to make this minimal, implement it today, and reach out if I have any further questions.

Thanks again!

@rijuld (Author) commented Oct 30, 2024

Hi @nilsleh,

Hope you're doing well! I wanted to clarify whether it's essential to shift all data augmentations to the datamodule. If so, could you guide me on which specific parts of the dataset should be moved there?

I've already removed the geotransform and color transform and plan to add them to the datamodule in my next pull request. If there are other elements you’d suggest removing, I can address those too. Once these adjustments are made, would it be possible to merge this PR (pending review) without the datamodule updates?

Thank you very much for your help!

@nilsleh (Collaborator)

Apologies for the late response. Adam prefers having all data normalization in the datamodule for consistency, but I also don't think it is terrible to do in the dataset. If you move it to the datamodule, you can use the kornia Normalize module, which you can add to the augmentation sequence. It will then be applied in on_after_batch_transfer in the LightningDataModule.
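
A sketch of that (the mean/std values below are placeholders, not computed statistics for this dataset):

```python
import torch
import kornia.augmentation as K
from torchgeo.transforms import AugmentationSequential

# Placeholder per-band statistics -- replace with values computed on the data.
mean = torch.zeros(13)
std = torch.ones(13)

# Putting Normalize in the augmentation sequence means it runs in
# on_after_batch_transfer, i.e. on the whole batch after the GPU transfer.
aug = AugmentationSequential(
    K.Normalize(mean=mean, std=std),
    data_keys=['image', 'mask'],
)
```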

@rijuld (Author) commented Oct 30, 2024

@microsoft-github-policy-service agree

@rijuld requested a review from nilsleh October 30, 2024 17:55
@adamjstewart (Collaborator) left a comment

Not a bad data loader, it just doesn't match a single other data loader in TorchGeo. I highly recommend looking at some of the 80+ existing data loaders, and the unit tests for those data loaders, already built into TorchGeo before adding a new one from scratch. Especially for unit testing, you can probably just copy-n-paste most of the existing test code for a similar dataset.

@@ -0,0 +1,170 @@
"""This module handles the Substation segmentation dataset."""
Collaborator

Is there a non-segmentation version of this dataset? If not, let's just name the file substation.py.



class SubstationDataset(NonGeoDataset):
"""SubstationDataset is responsible for handling the loading and transformation of substation segmentation datasets.
Collaborator

These lines are likely over the 88-character line length limit.

Can you include a URL linking to the homepage and a more detailed description of the dataset? See other dataset files for examples of what we like to document.
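
For illustration, a docstring along these lines (the details below are drawn from this PR's datasets.csv entry and would need to be verified; homepage URL and citation still need to be filled in):

```python
class SubstationDataset(NonGeoDataset):
    """Substation segmentation dataset.

    Binary segmentation of electrical substations in Sentinel-2 imagery,
    with masks derived from OpenStreetMap.

    Dataset format:

    * ~27,000 image/mask pairs
    * 228x228 px patches at 10 m resolution
    * multispectral (MSI) imagery

    If you use this dataset in your research, please cite the original source.

    .. versionadded:: 0.7
    """
```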

directory: str = 'Substation'
filename_images: str = 'image_stack.tar.gz'
filename_masks: str = 'mask.tar.gz'
url_for_images: str = 'https://urldefense.proofpoint.com/v2/url?u=https-3A__storage.googleapis.com_tz-2Dml-2Dpublic_substation-2Dover-2D10km2-2Dcsv-2Dmain-2D444e360fd2b6444b9018d509d0e4f36e_image-5Fstack.tar.gz&d=DwMFaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=ypwhORbsf5rB8FTl-SAxjfN_U0jrVqx6UDyBtJHbKQY&m=-2QXCp-gZof5HwBsLg7VwQD-pnLedAo09YCzdDCUTqCI-0t789z0-HhhgwVbYtX7&s=zMCjuqjPMHRz5jeEWLCEufHvWxRPdlHEbPnUE7kXPrc&e='
Collaborator

Can you use the real URL instead of this url defense wrapper?

Collaborator

These downloads also need MD5 checksums
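
Typically the checksum lives next to the URL and is passed to the download helper — a sketch with a placeholder URL and hash:

```python
from torchgeo.datasets.utils import download_and_extract_archive

url = 'https://example.com/image_stack.tar.gz'  # real dataset URL goes here
md5 = '0123456789abcdef0123456789abcdef'  # placeholder, not the real hash

# Verifies the archive against the MD5 before extracting it.
download_and_extract_archive(url, download_root='data', md5=md5)
```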


def __init__(
self,
args: Any,
Collaborator

Let's get rid of args and instead have specific parameters for each valid input. This helps with type checking and documenting. At the moment, there is absolutely no documentation suggesting that args.use_time_stamp is a required attribute of this mysterious Any class that is not documented anywhere.
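
Concretely, something along these lines (a sketch; every name except use_timepoints, which appears in the PR's code, is illustrative):

```python
from collections.abc import Callable

from torch import Tensor
from torchgeo.datasets import NonGeoDataset


class SubstationDataset(NonGeoDataset):
    def __init__(
        self,
        root: str = 'data',
        use_timepoints: bool = False,
        transforms: Callable[[dict[str, Tensor]], dict[str, Tensor]] | None = None,
        download: bool = False,
        checksum: bool = False,
    ) -> None:
        """Initialize a new SubstationDataset instance.

        Args:
            root: root directory where the dataset can be found
            use_timepoints: whether to return a stack of timepoints per image
            transforms: a function/transform applied to each sample dict
            download: if True, download the dataset if it is not found in root
            checksum: if True, verify the MD5 of the downloaded files
        """
        self.root = root
        self.use_timepoints = use_timepoints
        self.transforms = transforms
        self.download = download
        self.checksum = checksum
```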

"""Returns the number of items in the dataset."""
return len(self.image_filenames)

def plot(self) -> None:
Collaborator

The plot method takes a sample as input and plots it. See every other dataset for an example of this.
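
Roughly this shape, following the convention in other datasets (a sketch of a dataset method; band selection for display is glossed over):

```python
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
from torch import Tensor


# A method of the dataset class, shown standalone for brevity.
def plot(self, sample: dict[str, Tensor], suptitle: str | None = None) -> Figure:
    """Plot a sample (an image/mask dict) returned by __getitem__."""
    image = sample['image'][:3].permute(1, 2, 0).numpy()  # first 3 bands as RGB
    mask = sample['mask'].squeeze().numpy()

    fig, axs = plt.subplots(1, 2, figsize=(8, 4))
    axs[0].imshow(image)
    axs[0].set_title('Image')
    axs[1].imshow(mask, cmap='gray')
    axs[1].set_title('Mask')
    for ax in axs:
        ax.axis('off')
    if suptitle is not None:
        fig.suptitle(suptitle)
    return fig
```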

image_dir_exists = os.path.exists(self.image_dir)
mask_dir_exists = os.path.exists(self.mask_dir)
if not (image_dir_exists and mask_dir_exists):
self._download()
Collaborator

Downloading random files from the internet without checksum verification should not happen by default, the user should have to pass download=True if they really want things to be downloaded. This violates the principle of least surprise.
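
The usual pattern is a _verify method that only downloads when the user opted in — a sketch:

```python
import os

from torchgeo.datasets import DatasetNotFoundError


# A method of the dataset class, shown standalone for brevity.
def _verify(self) -> None:
    """Check that the data is present; only download if the user asked for it."""
    if os.path.exists(self.image_dir) and os.path.exists(self.mask_dir):
        return
    if not self.download:
        raise DatasetNotFoundError(self)
    self._download()
```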

@@ -369,6 +369,11 @@ PASTIS

.. autoclass:: PASTIS

SubstationDataset
Collaborator

These should be in alphabetical order

@@ -49,6 +49,7 @@ Dataset,Task,Source,License,# Samples,# Classes,Size (px),Resolution (m),Bands
`SSL4EO`_-S12,T,Sentinel-1/2,"CC-BY-4.0",1M,-,264x264,10,"SAR, MSI"
`SSL4EO-L Benchmark`_,S,Landsat & CDL,"CC0-1.0",25K,134,264x264,30,MSI
`SSL4EO-L Benchmark`_,S,Landsat & NLCD,"CC0-1.0",25K,17,264x264,30,MSI
`SubstationDataset`_,S,OpenStreetMap & Sentinel-2, "CC BY-SA 2.0", 27K, 2, 228x228, 10, MSI
Collaborator

Suggested change:

```diff
-`SubstationDataset`_,S,OpenStreetMap & Sentinel-2, "CC BY-SA 2.0", 27K, 2, 228x228, 10, MSI
+`SubstationDataset`_,S,OpenStreetMap & Sentinel-2, "CC-BY-SA 2.0", 27K, 2, 228x228, 10, MSI
```

Should be a valid SPDX identifier.



if __name__ == '__main__':
pytest.main([__file__])
Collaborator

This is not needed, the file is run by pytest, not the other way around.
