Add Exact Deduplication Feature to Preprocessing Pipeline #2072
Open: olivermolenschot wants to merge 10 commits into axolotl-ai-cloud:main from olivermolenschot:deduplicate-datasets
+767 −51
Commits on Nov 11, 2024
- Commit 22d6c6c
Commits on Nov 14, 2024
- Commit c76b4f6
Commits on Nov 18, 2024
- Commit 3d06ca9
- Commit 58f02e1
- Commit 4423403
Commits on Nov 19, 2024
- Commit 0c3ebbc
Commits on Nov 28, 2024
- Improve deduplication for train/eval overlap (5e245cb)
  Changed the deduplication function to use a more memory-efficient hashing method. Applied Git suggestions to improve clarity and maintainability.

  The deduplication now handles cases where train and eval datasets have overlapping elements.
- Improve deduplication for train/eval overlap (c465e0e)
  Changed the deduplication function to use a more memory-efficient hashing method. Applied Git suggestions to improve clarity and maintainability.

  The deduplication now handles cases where train and eval datasets have overlapping elements.
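For readers without the diff at hand, here is a minimal sketch of hash-based exact deduplication with train/eval overlap handling. It is not the code from this PR: the function names `row_hash` and `deduplicate`, the choice of SHA-256 over a JSON serialization, and the policy of dropping overlapping rows from the eval split (rather than from train) are all assumptions made for illustration. Rows are plain dicts here instead of a dataset object.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Hypothetical helper: a stable digest of one row's content."""
    payload = json.dumps(row, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def deduplicate(train, eval_rows=None):
    """Drop exact duplicates, keeping the first occurrence of each row.

    Only digests are kept in memory, not the rows themselves. Rows in
    eval_rows that already appeared in train are dropped from eval
    (an assumed policy, chosen for this sketch).
    """
    seen = set()

    unique_train = []
    for row in train:
        digest = row_hash(row)
        if digest not in seen:
            seen.add(digest)
            unique_train.append(row)

    if eval_rows is None:
        return unique_train, None

    unique_eval = []
    for row in eval_rows:
        digest = row_hash(row)
        if digest not in seen:  # also filters rows already seen in train
            seen.add(digest)
            unique_eval.append(row)

    return unique_train, unique_eval
```

Sharing one `seen` set across both splits is what removes the train/eval overlap; with two separate sets, each split would be deduplicated only against itself.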
Commits on Nov 29, 2024
- Apply suggestions from code review (a885105)
  To handle the original case where we do not do deduplication
  Co-authored-by: Wing Lian <[email protected]>
Commits on Nov 30, 2024
- Improve false collision detection to ensure dataset integrity (dbe779d)
  - Added test cases to simulate and verify handling of forced hash collisions between datasets.
  - Ensured that datasets with identical hashes but different content are correctly identified, preventing incorrect deduplication.
  - Updated unit tests to include scenarios where collisions occur across both training and evaluation datasets, as well as within a single dataset.
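The collision handling described in the commit message above can be illustrated with a small sketch. Again, this is not the PR's implementation: `dedupe_collision_safe`, its `hash_fn` and `seen` parameters, and the bucket-of-rows layout are hypothetical. The idea shown is that a matching hash alone is not treated as proof of a duplicate; the row content is compared before anything is dropped.

```python
import hashlib
import json
from collections import defaultdict


def default_hash(row: dict) -> str:
    """Hypothetical default: SHA-256 over a canonical JSON form of the row."""
    payload = json.dumps(row, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def dedupe_collision_safe(rows, hash_fn=default_hash, seen=None):
    """Deduplicate rows, confirming equality whenever two hashes match.

    `seen` maps each hash to the list of distinct rows that produced it.
    Passing the `seen` returned for one split into the call for another
    split also removes duplicates across the two.
    """
    seen = defaultdict(list) if seen is None else seen
    unique = []
    for row in rows:
        bucket = seen[hash_fn(row)]
        if row not in bucket:  # same hash but different content is kept
            bucket.append(row)
            unique.append(row)
    return unique, seen


# Forcing every row into one bucket simulates a worst-case collision.
a, b, c = {"text": "a"}, {"text": "b"}, {"text": "c"}
train_unique, seen = dedupe_collision_safe([a, b, a], hash_fn=lambda row: 0)
eval_unique, _ = dedupe_collision_safe([a, c], hash_fn=lambda row: 0, seen=seen)
```

Keeping the rows in each bucket costs more memory than storing digests alone, so this trades some of the memory savings for the guarantee that distinct rows are never merged.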