Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implemented custom dataset creator class #962
base: nextjs
Are you sure you want to change the base?
Implemented custom dataset creator class #962
Changes from 4 commits
7b243d7
dfb61fe
a26fe41
0650679
f80d62d
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@farisdurrani should we name this class TabularCustomDatasetCreator if the scope is tabular? then we can still preserve extensibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NMBridges food for thought
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. Dataset can mean anything, adding Tabular to the name makes it more specific
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should allow models other than tabular to access the uploaded datasets but this is okay for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: add error handling if such a directory doesn't exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@farisdurrani @dwu359 can we guarantee that at the invocation of this function, name of the target col from user uploaded csv dataset to s3 would be available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should already be a part of the frontend to peek the headers of the csv files from s3 so the users can select the target and feature names. The frontend will send the Trainspace data which includes the target/feature names to training. So, yes. The target col names should be available at this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you confirm? @farisdurrani
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need to test this code to make sure it works fine for default and uploaded datasets but I will say yes for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@farisdurrani not sure if we need this? If so, should we have a way to track label encoder so that when we build confusion matrix, we have a mapping of number to label?
im having a hard time finding how we solved this problem in the past version of our code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't recall on top of my head but I believe we did encode the headers in our original code, since the confusion matrix generated only contains numbers. It has been a WIP to map the encodes back to the original labels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to store label encoder object for this case in order to recover the original labels? @farisdurrani
If not, any simpler way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we do. We may be able to store an encoding in the metadata of the uploaded dataset but that's unnecessarily complicated. So just do it manually, passing along the encoder object down the functions