Implemented custom dataset creator class #962
base: nextjs
Conversation
training/training/core/dataset.py
Outdated
    shuffle: bool = True,
):
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="dlp-upload-bucket", Key=f"{uid}/tabular/{name}")
We should allow models other than tabular to access the uploaded datasets, but this is okay for now.
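A minimal sketch of that generalization, not part of the PR: make the data type a parameter of the S3 key instead of hard-coding "tabular" (the helper and parameter names are illustrative):

def dataset_key(uid: str, name: str, data_type: str = "tabular") -> str:
    # data_type could be "tabular", "image", etc., mirroring the upload layout.
    return f"{uid}/{data_type}/{name}"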
TODO: add error handling if such a directory doesn't exist
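A minimal sketch of that error handling, assuming we want a missing upload to surface as a clear error rather than an unhandled ClientError (the helper name and message are illustrative):

import boto3
from botocore.exceptions import ClientError

def fetch_tabular_object(uid: str, name: str) -> bytes:
    # Bucket and key layout follow the snippet above.
    s3 = boto3.client("s3")
    try:
        obj = s3.get_object(Bucket="dlp-upload-bucket", Key=f"{uid}/tabular/{name}")
    except ClientError as e:
        # S3 reports a missing key as the "NoSuchKey" error code.
        if e.response["Error"]["Code"] == "NoSuchKey":
            raise FileNotFoundError(f"No uploaded dataset at {uid}/tabular/{name}") from e
        raise
    return obj["Body"].read()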
respond to my questions/comments
training/training/core/dataset.py
Outdated
y = data[target_name]
X = data.drop(target_name, axis=1)
if y.apply(pd.to_numeric, errors="coerce").isnull().any():
    le = LabelEncoder()
@farisdurrani not sure if we need this? If so, should we have a way to track label encoder so that when we build confusion matrix, we have a mapping of number to label?
I'm having a hard time finding how we solved this problem in the past version of our code.
Do we need to store the label encoder object for this case in order to recover the original labels? @farisdurrani
If not, is there a simpler way?
Yes, we do. We could store an encoding in the metadata of the uploaded dataset, but that's unnecessarily complicated. So just do it manually, passing the encoder object down through the functions.
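A minimal sketch of that approach, building on the encoding check in the snippet above (the helper name is illustrative); the returned encoder can later recover the original labels via inverse_transform when building the confusion matrix:

from typing import Optional, Tuple

import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_target(y: pd.Series) -> Tuple[pd.Series, Optional[LabelEncoder]]:
    # Only encode when the target column is not already numeric.
    if y.apply(pd.to_numeric, errors="coerce").isnull().any():
        le = LabelEncoder()
        return pd.Series(le.fit_transform(y), index=y.index), le
    return y, None

# Later, e.g. when labeling a confusion matrix:
# class_names = encoder.inverse_transform(range(len(encoder.classes_))) if encoder else None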
training/training/core/dataset.py
Outdated
@@ -98,3 +102,64 @@ def getCategoryList(self) -> list[str]:
        if self._category_list is None:
            raise Exception("Category list not available")
        return self._category_list


class CustomDatasetCreator(TrainTestDatasetCreator):
@farisdurrani should we name this class TabularCustomDatasetCreator if the scope is tabular? Then we can still preserve extensibility.
@NMBridges food for thought
That's a good idea. Dataset can mean anything; adding Tabular to the name makes it more specific.
Along with addressing any necessary changes in the PR.
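A hedged sketch of the rename, assuming the class otherwise keeps the implementation from this PR (the docstring is illustrative):

class TabularCustomDatasetCreator(TrainTestDatasetCreator):
    """Creates train/test datasets from a user-uploaded tabular (CSV) file.

    Scoping the name to tabular leaves room for sibling creators
    (e.g. image or audio) under the same TrainTestDatasetCreator interface.
    """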
added other questions
training/training/core/dataset.py
Outdated
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="dlp-upload-bucket", Key=f"{uid}/tabular/{name}")
data = pd.read_csv(io.BytesIO(obj["Body"].read()))
y = data[target_name]
@farisdurrani @dwu359 can we guarantee that, at the invocation of this function, the name of the target column from the user-uploaded CSV dataset in S3 will be available?
The frontend already peeks at the headers of the CSV files from S3 so users can select the target and feature names, and it sends the Trainspace data, which includes those target/feature names, to training. So yes, the target column names should be available at this point.
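A minimal sketch of that header peek, assuming the bucket/key layout from the snippets in this PR (fetching only the first bytes via a Range request is an illustrative optimization, not necessarily what the frontend does):

import io

import boto3
import pandas as pd

def peek_csv_columns(uid: str, name: str) -> list[str]:
    s3 = boto3.client("s3")
    # A Range request avoids downloading the whole file just for the header row.
    obj = s3.get_object(
        Bucket="dlp-upload-bucket",
        Key=f"{uid}/tabular/{name}",
        Range="bytes=0-65535",
    )
    first_line = obj["Body"].read().split(b"\n", 1)[0]
    return list(pd.read_csv(io.BytesIO(first_line)).columns)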
Can you confirm? @farisdurrani
We'll need to test this code to make sure it works fine for default and uploaded datasets, but I will say yes for now.
from abc import ABC, abstractmethod
from typing import Callable, Optional, Union, cast

from numpy import ndarray
from sklearn.model_selection import train_test_split
from sklearn.utils import Bunch
from sklearn.conftest import fetch_california_housing
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris, load_wine
from torch.utils.data import TensorDataset
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset
from torch.autograd import Variable

from sklearn.preprocessing import LabelEncoder
import boto3

🚫 [pyright] reported by reviewdog 🐶
Import could not be resolved (reportMissingImports) for each of: "numpy", "sklearn.model_selection", "sklearn.utils", "sklearn.conftest", "sklearn.datasets", "torch", "torch.utils.data", "torch.autograd", "sklearn.preprocessing", and "boto3".
SonarCloud Quality Gate failed. 0 Bugs. No Coverage information.
Fix the build errors and resolve all comments; this is good to go for me.
Added CustomDatasetCreator class
What user problem are we solving?
#913
What solution does this PR provide?
Allows the backend to use user-uploaded tabular datasets for training.
Testing Methodology
Uploaded a testing dataset to dlp-upload-bucket/nolan/tabular/antennae-lengths.csv, performed a function call on the endpoint, and verified that the data was preserved during download. Also implemented automatic string encoding for non-numeric labels.
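A minimal sketch of that round-trip check, assuming a local copy of the uploaded file is available (the test body is illustrative, not the PR's actual test):

import io

import boto3
import pandas as pd

def test_upload_roundtrip():
    local = pd.read_csv("antennae-lengths.csv")
    s3 = boto3.client("s3")
    obj = s3.get_object(
        Bucket="dlp-upload-bucket", Key="nolan/tabular/antennae-lengths.csv"
    )
    remote = pd.read_csv(io.BytesIO(obj["Body"].read()))
    # The downloaded frame should match the local file exactly.
    pd.testing.assert_frame_equal(local, remote)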
Any other considerations