Feature Request: Adding dataset deduplication process #1946

Weyaxi · 2024-10-05T11:07:05Z

⚠️ Please check that this feature request hasn't been suggested before.

I searched previous Ideas in Discussions didn't find any similar feature requests.
I searched previous Issues didn't find any similar feature requests.

🔖 Feature description

A dataset deduplication progress feature could be useful for Axolotl. Especially since many users input their datasets in various formats and configurations, having a deduplication process at the end when all these datasets are merged would be very beneficial for developers fine-tuning models.

✔️ Solution

In my use case, adding a 'dedup_datasets_in_end' (this variable name is only a example) variable and the necessary parameters for the deduplication process would be very beneficial.

❓ Alternatives

There are many algorithms, GitHub repositories, and tools for dataset deduplication. For example, the main algorithm that comes to mind is MinHash. Incorporating such algorithms over time would be very beneficial.

📝 Additional Context

No response

Acknowledgements

My issue title is concise, descriptive, and in title casing.
I have searched the existing issues to make sure this feature has not been requested yet.
I have provided enough information for the maintainers to understand and evaluate this request.

olivermolenschot · 2024-11-13T20:11:13Z

@Weyaxi Are you talking about exact deduplication or fuzzy deduplication? I think exact deduplication is more revelant.

Weyaxi · 2024-11-13T23:03:48Z

Hi @olivermolenschot,

I was referring to exact deduplication here, but it might be worth discussing the addition of fuzzy deduplication later on as well :)

The use case for this is as follows:

There are large curated datasets that have been published, but when developers want to use both of them, they have to work on deduplicating the merged datasets. This is because there will often be some duplication (e.g., Big Dataset A contains samples from small dataset x, and Big Dataset B also contains samples from small dataset x for example). But this can give devs some hard time because of the format diffrences etc.

olivermolenschot · 2024-11-13T23:07:42Z

Can you give examples on what would be format differences @Weyaxi ? I think I can easily provide a de-duplication feature for when the rows are an exact match. however if the format changes, we need to be more precise about what type of format changes are occurring. Covering all possible format changes might be tedious.

Weyaxi · 2024-11-13T23:25:54Z

The format difference I mentioned is that, for example, I use both ShareGPT and Alpaca-type datasets at the same time when writing my config, but Axolotl merges those datasets into a single format in the end, right?

So, if I wanted to handle deduplication on my own, I would need to follow these steps in a typical scenario:

Convert all datasets to a single format.
Perform deduplication on my own.
Create a new dataset.
Input that dataset into Axolotl.

If Axolotl could handle that with a single line of change on the config file, it would be very beneficial IMO.

That's what I'm talking about.

olivermolenschot · 2024-11-16T05:56:27Z

I'm working on this feature.

Weyaxi added the enhancement New feature or request label Oct 5, 2024

olivermolenschot linked a pull request Nov 18, 2024 that will close this issue

Add Exact Deduplication Feature to Preprocessing Pipeline #2072

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Request: Adding dataset deduplication process #1946

Feature Request: Adding dataset deduplication process #1946

Weyaxi commented Oct 5, 2024 •

edited

Loading

olivermolenschot commented Nov 13, 2024 •

edited

Loading

Weyaxi commented Nov 13, 2024

olivermolenschot commented Nov 13, 2024

Weyaxi commented Nov 13, 2024

olivermolenschot commented Nov 16, 2024

Feature Request: Adding dataset deduplication process #1946

Feature Request: Adding dataset deduplication process #1946

Comments

Weyaxi commented Oct 5, 2024 • edited Loading

⚠️ Please check that this feature request hasn't been suggested before.

🔖 Feature description

✔️ Solution

❓ Alternatives

📝 Additional Context

Acknowledgements

olivermolenschot commented Nov 13, 2024 • edited Loading

Weyaxi commented Nov 13, 2024

olivermolenschot commented Nov 13, 2024

Weyaxi commented Nov 13, 2024

olivermolenschot commented Nov 16, 2024

Weyaxi commented Oct 5, 2024 •

edited

Loading

olivermolenschot commented Nov 13, 2024 •

edited

Loading