
Add Exact Deduplication Feature to Preprocessing Pipeline #2072

Open · wants to merge 10 commits into main

Conversation

olivermolenschot (Contributor)

Description

I created a feature that deduplicates datasets (exact matches) while maintaining the original order. It is activated by setting exact_deduplication = true in the YAML configuration file. The feature works for both RL and non-RL tasks; however, it is not triggered for pretraining_dataset, and it runs only via the preprocess command, not the train command.
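
For illustration, a minimal sketch of how the flag might gate deduplication during dataset preparation. The flag cfg.dataset_exact_deduplication and the helper deduplicate_and_log_datasets both appear later in this thread; the module path and the surrounding function are assumptions made for the example, not the PR's exact code.

from axolotl.utils.data.utils import deduplicate_and_log_datasets  # assumed module path


def maybe_deduplicate(cfg, train_dataset, eval_dataset):
    if cfg.dataset_exact_deduplication:
        # Exact-match deduplication; keeps the first occurrence and preserves row order
        train_dataset, eval_dataset, _ = deduplicate_and_log_datasets(
            train_dataset=train_dataset, eval_dataset=eval_dataset
        )
    return train_dataset, eval_dataset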

Motivation and Context

Users often have datasets from different sources, which may contain duplicate rows. This feature automates the deduplication process, reducing the manual effort previously required by users to remove duplicate entries.
Resolves #1946

How has this been tested?

  • Unit tests: provided and verified for all deduplication functions and for the functions that call them directly.
  • Manual testing:
    • Tested YAML configurations with exact_deduplication set to true / false / None by running python -m axolotl.cli.preprocess
  • Environment: Python 3.10.15, Torch 2.3.1+cu121, Conda environment.

Screenshots (if appropriate)

N/A

Types of changes

  • New feature (non-breaking change which adds functionality)

@olivermolenschot (Contributor Author)

@NanoCode012 the pre-commit is failing on code that I did not change. Any insights on this?

@NanoCode012 (Collaborator)

> @NanoCode012 the pre-commit is failing on code that I did not change. Any insights on this?

I think it's detecting duplicates because the test configs are similar across tests. It's fine to ignore this case in tests.

You can add a # pylint: disable=duplicate-code at the top of the duplicated section. For ex:

# pylint: disable=duplicate-code
self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
self.tokenizer.add_special_tokens(
    {
        "bos_token": "<s>",
        "eos_token": "</s>",
        "unk_token": "<unk>",
    }
)

You can find the section from the CI / locally running pre-commit.

@olivermolenschot (Contributor Author)

@NanoCode012 Your suggestion worked. The pre-commit now passes, thanks.

As for the pytests, they pass locally (I skipped the e2e folder, as done here too), but they fail on GitHub. A test file that I did not create or modify is also failing:

FAILED tests/test_datasets.py::TestDatasetPreparation::test_load_hub - datasets.data_files.EmptyDatasetError: The directory at /home/runner/work/axolotl/axolotl/mhenrichsen/alpaca_2k_test doesn't contain any data files

I think it might have to do with loading the dataset from Hugging Face. Did you experience this, or do you have any insights on quick fixes?

@NanoCode012 (Collaborator)

I ran your test locally, and it does pass.

Hugging Face sometimes bugs out in our CI, especially for dataset loading.

@winglian (Collaborator) commented Nov 20, 2024

We recently added some pytest fixtures to pre-download the popular datasets and models used in the tests, but it was mostly for the e2e tests. We could consider moving those up to the top level tests. https://github.com/axolotl-ai-cloud/axolotl/blob/main/tests/e2e/conftest.py
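
For reference, a fixture along these lines could be promoted to the top-level tests (a minimal sketch assuming huggingface_hub's snapshot_download; the actual fixtures in tests/e2e/conftest.py may differ):

import pytest
from huggingface_hub import snapshot_download


@pytest.fixture(scope="session", autouse=True)
def download_alpaca_2k_test():
    # Pre-download the dataset that tests/test_datasets.py needs so individual
    # tests don't depend on a live (and occasionally flaky) Hub download.
    snapshot_download(repo_id="mhenrichsen/alpaca_2k_test", repo_type="dataset")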

@olivermolenschot (Contributor Author)

It looks like many of the previously failing tests now pass, @winglian. Only one test failed, because it exceeded the maximum allotted time:
The job running on runner GitHub Actions 230 has exceeded the maximum execution time of 20 minutes.
I'm guessing it has something to do with pre-storing the datasets and models.

@olivermolenschot (Contributor Author)

On Attempt 4, the only failures were the e2e tests. I thought that folder needed to be skipped, @NanoCode012. Can we skip it, or is it needed?

@NanoCode012 (Collaborator)

Hey @olivermolenschot , sorry for this. I'm not sure why the tests are acting up for this PR. I restarted yours again, but it is rare for us to have to do this more than once.

> Can we skip it or is it needed?

E2E tests are quite important to run. We run them last, after our other quick tests, as they are slower and more expensive.

@winglian (Collaborator) left a comment

Just a few minor nits, but this is a great addition!

@@ -178,7 +183,7 @@ def load_tokenized_prepared_datasets(
+ "|".join(
sorted(
[
f"{d.path}:{d.type}:{d.shards}:{d.conversation}{d.split}"
f"{d.path}: {d.type}: {d.shards}: {d.conversation}{d.split}"
Collaborator

The spaces here are probably not necessary for consistency since this gets hashed.

Contributor Author

flake8 fails when I try to commit without these spaces; that's why I included them. What should I do instead?

Resolved (outdated) review threads on: src/axolotl/utils/data/utils.py, tests/test_exact_deduplication.py (x2), src/axolotl/utils/config/models/input/v0_4_1/__init__.py
Comment on lines 18 to 24
def collect_unique_rows(*, dataset: Dataset) -> list:
"""Converts each row to a tuple, keeping only unique rows."""
# Convert dict_values to list for JSON serialization
unique_row_tuples = list(
dict.fromkeys(tuple(json.dumps(list(row.values())) for row in dataset))
)
return unique_row_tuples
Collaborator

Is this the best way for getting the unique rows? It seems very complex and expensive.

Collaborator

What if we use their built-in .unique()? (Please test whether it flattens the data or not.) https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.Dataset.unique

Another concern is memory usage, since your method has to load the entire dataset into memory.

Contributor Author

From what I saw online, there is no built-in method in Hugging Face datasets to get the unique rows of a dataset (see this). Their built-in .unique() function takes in a column and returns a sorted list of unique values for that column.

Regarding whether there is an easier way: if there is no built-in function, I think this might be the easiest approach.

Regarding the cost of memory, this method only gets activated in areas of the code where the data is already fully loaded (for pretraining_datasets it is not activated). Since we are creating a list of unique rows, we are using at most roughly 2x the dataset size as memory (this includes the space taken by the original dataset).

Regarding how expensive/fast the method is, one thing I could do to make it faster is to directly use a hashing algorithm rather than converting every row to a JSON. Originally I was converting a row to a JSON because it makes it hashable, but I could simply use a hashing algorithm which should remove the overhead of converting to a JSON and back.

Please let me know what you think so I can start implementing those changes.

Collaborator

> Their built-in .unique() function takes in a column and returns a sorted list of unique values for that column.

Hm, I see. I misunderstood the docs and thought that, if given multiple columns, it would adjust.

> Regarding how expensive/fast the method is, one thing I could do to make it faster is to directly use a hashing algorithm rather than converting every row to a JSON. Originally I was converting a row to a JSON because it makes it hashable, but I could simply use a hashing algorithm which should remove the overhead of converting to a JSON and back.

By expensive, I mean more the computation cost, since the data gets converted between multiple types.

How about using pandas with their df.drop_duplicates? You should be able to convert between Dataset and DataFrame.
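
For reference, a minimal sketch of that round trip on toy data (assuming all cell values are hashable; as discussed below, list-valued cells need extra handling):

from datasets import Dataset

dataset = Dataset.from_dict({"instruction": ["a", "a", "b"], "output": ["x", "x", "y"]})

df = dataset.to_pandas()                # Dataset -> pandas DataFrame
deduplicated_df = df.drop_duplicates()  # drops exact-duplicate rows, keeps the first occurrence
deduplicated_dataset = Dataset.from_pandas(deduplicated_df, preserve_index=False)
print(len(deduplicated_dataset))        # 2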

Contributor Author

Yeah that can also work. I'll try it out and commit the changes to the PR soon.

@olivermolenschot (Contributor Author), Nov 27, 2024

I tried using df.drop_duplicates. Initially it fails when a cell's content is a list, because lists are not hashable. From my research, the most reasonable way to proceed would be to cast the whole DataFrame to strings and then deduplicate. Something like this:

    # Cast to str so that unhashable cells (e.g. lists) can be compared
    unique_indices = df.astype(str).drop_duplicates().index
    deduplicated_df = df.loc[unique_indices]
    deduplicated_dataset = Dataset.from_pandas(deduplicated_df, preserve_index=False)

But this would be fairly expensive, because df.astype(str) apparently creates a copy of the dataset. Instead, we can use hashing, which would be more memory-efficient for our task, especially for large datasets. Here we would only need to store one hash per row, plus an index per unique row:

seen_hashes = set()
unique_indices = []

# Keep the first occurrence of each row, preserving the original order
for idx, row in enumerate(dataset):
    row_hash = hash(str(row))
    if row_hash not in seen_hashes:
        seen_hashes.add(row_hash)
        unique_indices.append(idx)

deduplicated_dataset = dataset.select(unique_indices)

Contributor Author

I changed it to the hashing method, using sha256 to have enough expressiveness for diverse dataset sizes. This method also gives us the benefit of being able to remove duplicates between train_dataset and eval_dataset, because we can use the same set for both. This is what I did in the last commits (7-8).
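
A minimal sketch of that approach on toy data (illustrative names; not necessarily the exact code in the PR):

import hashlib

from datasets import Dataset


def deduplicate(dataset: Dataset, seen_hashes: set) -> Dataset:
    # Keep the first occurrence of each row, preserving the original order.
    unique_indices = []
    for idx, row in enumerate(dataset):
        row_hash = hashlib.sha256(str(row).encode("utf-8")).hexdigest()
        if row_hash not in seen_hashes:
            seen_hashes.add(row_hash)
            unique_indices.append(idx)
    return dataset.select(unique_indices)


# Sharing one set across splits also drops eval rows that already appear in train.
shared_hashes = set()
train_dataset = deduplicate(Dataset.from_dict({"text": ["a", "a", "b"]}), shared_hashes)
eval_dataset = deduplicate(Dataset.from_dict({"text": ["b", "c"]}), shared_hashes)
print(train_dataset["text"], eval_dataset["text"])  # ['a', 'b'] ['c']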

Resolved (outdated) review thread on tests/test_exact_deduplication.py
Changed the deduplication function to use a more memory-efficient hashing method. Applied Git suggestions to improve clarity and maintainability.

The deduplication now handles cases where train and eval datasets have overlapping elements.
Contributor Author

Note that this file avoids pylint duplicate-code issues with other test files, since some tests shared repeated constants. @winglian

@winglian (Collaborator)

FAILED tests/e2e/test_packing_loss.py::TestPackedLlama::test_loss_packed - Un...
FAILED tests/e2e/test_relora_llama.py::TestReLoraLlama::test_relora - Unbound...
FAILED tests/e2e/test_reward_model_llama.py::TestRewardModelLoraLlama::test_rm_fft

		else:
			if cfg.dataset_exact_deduplication:
				train_dataset, _, _ = deduplicate_and_log_datasets(train_dataset=dataset)
>           eval_dataset = None
E           UnboundLocalError: cannot access local variable 'train_dataset' where it is not associated with a value

src/axolotl/utils/data/sft.py:607: UnboundLocalError

To handle the original case where we do not do deduplication

Co-authored-by: Wing Lian <[email protected]>
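
For readers following along, a guess at the shape of that fix (not the PR's exact diff), based on the traceback above: make sure train_dataset is assigned whether or not deduplication runs.

if cfg.dataset_exact_deduplication:
    train_dataset, _, _ = deduplicate_and_log_datasets(train_dataset=dataset)
else:
    # Handle the original case where we do not do deduplication
    train_dataset = dataset
eval_dataset = None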
@winglian (Collaborator) commented Nov 30, 2024

Hmmm. It doesn't make sense why installing from sdist results in a test failure of:

FAILED tests/test_exact_deduplication.py::TestDeduplicateNonRL::test_prepare_dataset_without_deduplication - ValueError: data_files must be either a string or list of strings

Flaky test, it seems; rerunning seems to pass now.

- Added test cases to simulate and verify handling of forced hash collisions between datasets.
- Ensured that datasets with identical hashes but different content are correctly identified, preventing incorrect deduplication.
- Updated unit tests to include scenarios where collisions occur across both training and evaluation datasets, as well as within a single dataset.
@olivermolenschot (Contributor Author)

Any idea why the pytest fails @winglian? It works locally.

@winglian (Collaborator) commented Nov 30, 2024

Looks like it passed on some of the other versions of Python, which usually indicates a flaky test. My best guess is that it failed to download the dataset directly from HF, ending up in that else clause. I'll trigger it to run again shortly.

EDIT: but the failure was the same as the previous one:

FAILED tests/test_exact_deduplication.py::TestDeduplicateNonRL::test_prepare_dataset_with_deduplication_eval - ValueError: data_files must be either a string or list of strings

Successfully merging this pull request may close these issues.

Feature Request: Adding dataset deduplication process