Implemented custom dataset creator class #962

Open
wants to merge 5 commits into
base: nextjs
Changes from 4 commits
65 changes: 65 additions & 0 deletions training/training/core/dataset.py
@@ -13,6 +13,10 @@
from torch.utils.data import Dataset
from torch.autograd import Variable

from sklearn.preprocessing import LabelEncoder
import boto3
import io


class TrainTestDatasetCreator(ABC):
    "Creator that creates train and test PyTorch datasets"
@@ -98,3 +102,64 @@ def getCategoryList(self) -> list[str]:
        if self._category_list is None:
            raise Exception("Category list not available")
        return self._category_list


class CustomDatasetCreator(TrainTestDatasetCreator):
Member

@farisdurrani Should we name this class TabularCustomDatasetCreator if its scope is tabular? That way we can still preserve extensibility.

Member

@NMBridges food for thought

Member

That's a good idea. "Dataset" can mean anything; adding "Tabular" to the name makes it more specific.

    """Pulls user-uploaded dataset from S3 bucket and converts it to readable format"""

    def __init__(
        self,
        X: pd.DataFrame,
        y: pd.Series,
        test_size: float,
        shuffle: bool,
        category_list: Optional[list[str]],
    ) -> None:
        super().__init__()
        self._category_list = category_list
        self._X_train, self._X_test, self._y_train, self._y_test = cast(
            tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series],
            train_test_split(X, y, test_size=test_size, shuffle=shuffle),
        )

    @classmethod
    def read_s3(
        cls,
        uid: str,
        name: str,
        test_size: float,
        target_name: str,
        shuffle: bool = True,
    ):
        s3 = boto3.client("s3")
        obj = s3.get_object(Bucket="dlp-upload-bucket", Key=f"{uid}/tabular/{name}")
Member

We should allow models other than tabular ones to access the uploaded datasets, but this is okay for now.

Contributor Author

TODO: add error handling if such a directory doesn't exist
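
One way to cover this TODO, as a minimal sketch: the boto3 S3 client exposes a modeled `NoSuchKey` exception under `client.exceptions`, which `get_object` raises when nothing exists at the key. The helper name `fetch_uploaded_csv` and the mapping to `FileNotFoundError` are assumptions for illustration, not part of this PR.

```python
import io


def fetch_uploaded_csv(
    s3, uid: str, name: str, bucket: str = "dlp-upload-bucket"
) -> io.BytesIO:
    """Download a user's tabular upload, failing clearly if the key is absent.

    `s3` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    The helper name and the FileNotFoundError mapping are hypothetical.
    """
    key = f"{uid}/tabular/{name}"
    try:
        obj = s3.get_object(Bucket=bucket, Key=key)
    except s3.exceptions.NoSuchKey as err:
        # get_object raises this when no object exists at the requested key.
        raise FileNotFoundError(
            f"No uploaded dataset at s3://{bucket}/{key}"
        ) from err
    # Wrap the payload so it can be handed straight to pd.read_csv.
    return io.BytesIO(obj["Body"].read())
```

`read_s3` could then call `pd.read_csv(fetch_uploaded_csv(s3, uid, name))` and let the caller decide how to report the missing upload.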

        data = pd.read_csv(io.BytesIO(obj["Body"].read()))
        y = data[target_name]
Member

@farisdurrani @dwu359 Can we guarantee that, when this function is invoked, the name of the target column from the user-uploaded CSV dataset in S3 will be available?

Member

The frontend should already peek at the headers of the CSV files in S3 so that users can select the target and feature names. The frontend then sends the Trainspace data, which includes the target and feature names, to training. So yes, the target column name should be available at this point.

Member

Can you confirm? @farisdurrani

Member

We'll need to test this code to make sure it works for both default and uploaded datasets, but I will say yes for now.
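
Until that is tested end to end, a cheap guard would be to check the CSV header before indexing `data[target_name]`, so a missing column fails with a clear message rather than a bare `KeyError`. A minimal stdlib sketch; `require_target_column` is a hypothetical helper, not part of this PR.

```python
import csv
import io


def require_target_column(csv_bytes: bytes, target_name: str) -> list[str]:
    """Return the CSV header, raising ValueError if target_name is not in it."""
    # "utf-8-sig" also accepts files that start with a byte-order mark.
    text = io.StringIO(csv_bytes.decode("utf-8-sig"))
    # Only the first row is parsed; the data rows are not inspected.
    header = next(csv.reader(text), [])
    if target_name not in header:
        raise ValueError(
            f"Target column {target_name!r} not found; available columns: {header}"
        )
    return header
```

The same check works on the `pd.DataFrame` directly (`target_name in data.columns`) if reading the whole file first is acceptable.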

        X = data.drop(target_name, axis=1)
        if y.apply(pd.to_numeric, errors="coerce").isnull().any():
            le = LabelEncoder()
Member

@farisdurrani I'm not sure we need this. If we do, should we have a way to keep track of the label encoder so that, when we build the confusion matrix, we have a mapping from number to label?

I'm having a hard time finding how we solved this problem in the previous version of our code.

Member

I can't recall off the top of my head, but I believe we did encode the headers in our original code, since the generated confusion matrix contains only numbers. Mapping the encoded values back to the original labels has been a work in progress.

[image]

Member

Do we need to store the label encoder object in this case in order to recover the original labels? @farisdurrani

If not, is there a simpler way?

Member

Yes, we do. We may be able to store the encoding in the metadata of the uploaded dataset, but that's unnecessarily complicated. So just do it manually, passing the encoder object down through the functions.
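
A possibly lighter variant, offered only as a sketch: `LabelEncoder` encodes each label as its index in the sorted `classes_` array, so the ordered class list alone is enough to turn confusion-matrix indices back into labels. The dependency-free stand-in below mirrors that sorted-index behavior; `encode_labels` and `decode_labels` are hypothetical names, not part of this PR.

```python
def encode_labels(labels: list[str]) -> tuple[list[int], list[str]]:
    """Map labels to integer codes; return the codes and the ordered class list."""
    classes = sorted(set(labels))  # position in this list == encoded value
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


def decode_labels(codes: list[int], classes: list[str]) -> list[str]:
    """Recover the original labels from integer codes."""
    return [classes[code] for code in codes]


codes, classes = encode_labels(["cat", "dog", "cat", "bird"])
print(codes)    # [1, 2, 1, 0]
print(classes)  # ['bird', 'cat', 'dog']
```

With scikit-learn itself, the equivalent list is `list(le.classes_)` after `le.fit(y)`, which could be passed along as `category_list` in place of the encoder object.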

            le.fit(y)
            y = pd.Series(np.array(le.transform(y)))
        return cls(X, y, test_size, shuffle, [target_name])

    def createTrainDataset(self) -> Dataset:
        X_train_tensor = Variable(torch.Tensor(self._X_train.to_numpy()))
        X_train_tensor = torch.reshape(
            X_train_tensor, (X_train_tensor.size()[0], 1, X_train_tensor.size()[1])
        )
        X_train_tensor.requires_grad_(True)

        y_train_tensor = Variable(torch.Tensor(self._y_train.to_numpy()))
        y_train_tensor = torch.reshape(y_train_tensor, (y_train_tensor.size()[0], 1))
        return TensorDataset(X_train_tensor, y_train_tensor)

    def createTestDataset(self) -> Dataset:
        X_test_tensor = Variable(torch.Tensor(self._X_test.to_numpy()))
        X_test_tensor = torch.reshape(
            X_test_tensor, (X_test_tensor.size()[0], 1, X_test_tensor.size()[1])
        )
        X_test_tensor.requires_grad_(True)

        y_test_tensor = Variable(torch.Tensor(self._y_test.to_numpy()))
        y_test_tensor = torch.reshape(y_test_tensor, (y_test_tensor.size()[0], 1))
        return TensorDataset(X_test_tensor, y_test_tensor)