Add Exact Deduplication Feature to Preprocessing Pipeline #2072

Open
wants to merge 10 commits into
base: main
from

Commits on Nov 11, 2024

  1. 22d6c6c

Commits on Nov 14, 2024

  1. c76b4f6

Commits on Nov 18, 2024

  1. 3d06ca9
  2. 58f02e1
  3. 4423403

Commits on Nov 19, 2024

  1. 0c3ebbc

Commits on Nov 28, 2024

  1. Improve deduplication for train/eval overlap

    Changed the deduplication function to use a more memory-efficient hashing method. Applied Git suggestions to improve clarity and maintainability.

    The deduplication now handles cases where train and eval datasets have overlapping elements.
    olivermolenschot committed Nov 28, 2024
    5e245cb
  2. Improve deduplication for train/eval overlap

    Changed the deduplication function to use a more memory-efficient hashing method. Applied Git suggestions to improve clarity and maintainability.

    The deduplication now handles cases where train and eval datasets have overlapping elements.
    olivermolenschot committed Nov 28, 2024
    c465e0e
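
The commit messages above describe hash-based exact deduplication that also covers train/eval overlap. A minimal sketch of that idea follows. It is not the PR's code: the names `row_digest` and `deduplicate`, the use of SHA-256 over a JSON serialization, and the choice to drop overlapping rows from the eval split rather than the train split are all assumptions made for illustration.

```python
import hashlib
import json


def row_digest(row: dict) -> str:
    """Hypothetical helper: fixed-size digest of a row's content.

    Keeping only digests in the `seen` set uses less memory than
    keeping the full rows.
    """
    canonical = json.dumps(row, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(train, eval_rows=None):
    """Drop exact duplicates within train, then within eval, and drop
    eval rows that already appear in train (assumed direction)."""
    seen = set()

    def unique(rows):
        kept = []
        for row in rows:
            digest = row_digest(row)
            if digest not in seen:
                seen.add(digest)
                kept.append(row)
        return kept

    train_out = unique(train)
    # `seen` still holds the train digests, so overlapping eval rows are skipped.
    eval_out = unique(eval_rows) if eval_rows is not None else None
    return train_out, eval_out


train = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
eval_rows = [{"text": "b"}, {"text": "c"}]
train_out, eval_out = deduplicate(train, eval_rows)
```

Because the same `seen` set is shared across both passes, a row present in both splits survives only in the split processed first. This sketch trusts the digest alone, so it does not guard against hash collisions.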

Commits on Nov 29, 2024

  1. Apply suggestions from code review

    To handle the original case where we do not do deduplication
    
    Co-authored-by: Wing Lian <[email protected]>
    olivermolenschot and winglian authored Nov 29, 2024
    a885105

Commits on Nov 30, 2024

  1. Improve false collision detection to ensure dataset integrity

    - Added test cases to simulate and verify handling of forced hash collisions between datasets.
    - Ensured that datasets with identical hashes but different content are correctly identified, preventing incorrect deduplication.
    - Updated unit tests to include scenarios where collisions occur across both training and evaluation datasets, as well as within a single dataset.
    olivermolenschot committed Nov 30, 2024
    dbe779d
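
The last commit message describes guarding against false collisions, where two different rows share a hash. One way to do that is sketched below, assuming a hypothetical `deduplicate_collision_safe` function with an injectable `hash_fn` so that a test can force collisions. This is an illustration of the technique, not the PR's implementation.

```python
import hashlib
import json
from collections import defaultdict


def default_hash(row: dict) -> str:
    """Hypothetical default: SHA-256 over a canonical JSON serialization."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode("utf-8")).hexdigest()


def deduplicate_collision_safe(rows, hash_fn=default_hash):
    """Drop a row only if an earlier row has the same hash AND the same content."""
    buckets = defaultdict(list)  # hash value -> rows already kept with that hash
    kept = []
    for row in rows:
        bucket = buckets[hash_fn(row)]
        # A matching hash alone is not enough: compare the actual content.
        if any(row == earlier for earlier in bucket):
            continue
        bucket.append(row)
        kept.append(row)
    return kept


# Force every row into the same bucket to simulate hash collisions.
rows = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
result = deduplicate_collision_safe(rows, hash_fn=lambda row: "same")
```

With the forced collision, `{"text": "b"}` is still kept because its content differs from `{"text": "a"}`, while the true duplicate is dropped. The trade-off is that kept rows are held in the buckets for comparison, which costs more memory than storing digests alone.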