Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Adding dataset deduplication process #1946

Open
5 tasks done
Weyaxi opened this issue Oct 5, 2024 · 5 comments · May be fixed by #2072
Open
5 tasks done

Feature Request: Adding dataset deduplication process #1946

Weyaxi opened this issue Oct 5, 2024 · 5 comments · May be fixed by #2072
Labels
enhancement New feature or request

Comments

@Weyaxi
Copy link

Weyaxi commented Oct 5, 2024

⚠️ Please check that this feature request hasn't been suggested before.

  • I searched previous Ideas in Discussions didn't find any similar feature requests.
  • I searched previous Issues didn't find any similar feature requests.

🔖 Feature description

A dataset deduplication progress feature could be useful for Axolotl. Especially since many users input their datasets in various formats and configurations, having a deduplication process at the end when all these datasets are merged would be very beneficial for developers fine-tuning models.

✔️ Solution

In my use case, adding a 'dedup_datasets_in_end' (this variable name is only a example) variable and the necessary parameters for the deduplication process would be very beneficial.

❓ Alternatives

There are many algorithms, GitHub repositories, and tools for dataset deduplication. For example, the main algorithm that comes to mind is MinHash. Incorporating such algorithms over time would be very beneficial.

📝 Additional Context

No response

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this feature has not been requested yet.
  • I have provided enough information for the maintainers to understand and evaluate this request.
@Weyaxi Weyaxi added the enhancement New feature or request label Oct 5, 2024
@olivermolenschot
Copy link
Contributor

olivermolenschot commented Nov 13, 2024

@Weyaxi Are you talking about exact deduplication or fuzzy deduplication? I think exact deduplication is more revelant.

@Weyaxi
Copy link
Author

Weyaxi commented Nov 13, 2024

Hi @olivermolenschot,

I was referring to exact deduplication here, but it might be worth discussing the addition of fuzzy deduplication later on as well :)

The use case for this is as follows:

There are large curated datasets that have been published, but when developers want to use both of them, they have to work on deduplicating the merged datasets. This is because there will often be some duplication (e.g., Big Dataset A contains samples from small dataset x, and Big Dataset B also contains samples from small dataset x for example). But this can give devs some hard time because of the format diffrences etc.

@olivermolenschot
Copy link
Contributor

Can you give examples on what would be format differences @Weyaxi ? I think I can easily provide a de-duplication feature for when the rows are an exact match. however if the format changes, we need to be more precise about what type of format changes are occurring. Covering all possible format changes might be tedious.

@Weyaxi
Copy link
Author

Weyaxi commented Nov 13, 2024

The format difference I mentioned is that, for example, I use both ShareGPT and Alpaca-type datasets at the same time when writing my config, but Axolotl merges those datasets into a single format in the end, right?

So, if I wanted to handle deduplication on my own, I would need to follow these steps in a typical scenario:

  1. Convert all datasets to a single format.
  2. Perform deduplication on my own.
  3. Create a new dataset.
  4. Input that dataset into Axolotl.

If Axolotl could handle that with a single line of change on the config file, it would be very beneficial IMO.

That's what I'm talking about.

@olivermolenschot
Copy link
Contributor

I'm working on this feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants