
Add Exact Deduplication Feature to Preprocessing Pipeline #2072

Open · wants to merge 10 commits into main

Conversation

olivermolenschot (Contributor)

Description

I created a feature that deduplicates datasets (exact matches) while maintaining the original order. It is activated by setting exact_deduplication = true in the YAML configuration file. The feature works for both RL and non-RL tasks; however, it is not triggered for pretraining_dataset, and it runs only via the preprocess command, not the train command.
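
For illustration, a minimal sketch of how the flag might gate deduplication during dataset preparation. The flag cfg.dataset_exact_deduplication and the helper deduplicate_and_log_datasets both appear later in this thread; the module path and the surrounding function are assumptions made for the example, not the PR's exact code.

from axolotl.utils.data.utils import deduplicate_and_log_datasets  # assumed module path


def maybe_deduplicate(cfg, train_dataset, eval_dataset):
    if cfg.dataset_exact_deduplication:
        # Exact-match deduplication; keeps the first occurrence and preserves row order
        train_dataset, eval_dataset, _ = deduplicate_and_log_datasets(
            train_dataset=train_dataset, eval_dataset=eval_dataset
        )
    return train_dataset, eval_dataset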

Motivation and Context

Users often have datasets from different sources, which may contain duplicate rows. This feature automates the deduplication process, reducing the manual effort previously required by users to remove duplicate entries.
Resolves #1946

How has this been tested?

  • Unit tests: provided and verified for all deduplication functions and for the functions that call them directly.
  • Manual testing:
    • Tested YAML configurations with exact_deduplication set to true / false / None by running python -m axolotl.cli.preprocess
  • Environment: Python 3.10.15, Torch 2.3.1+cu121, Conda environment.

Screenshots (if appropriate)

N/A

Types of changes

  • New feature (non-breaking change which adds functionality)

@olivermolenschot (Contributor Author)

@NanoCode012 the pre-commit is failing on code that I did not change. Any insights on this?

@NanoCode012 (Collaborator)

> @NanoCode012 the pre-commit is failing on code that I did not change. Any insights on this?

I think it's detecting duplicates because the test configs are similar across tests. It's fine to ignore this case in tests.

You can add a # pylint: disable=duplicate-code at the top of the duplicated section. For ex:

# pylint: disable=duplicate-code
self.tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
self.tokenizer.add_special_tokens(
    {
        "bos_token": "<s>",
        "eos_token": "</s>",
        "unk_token": "<unk>",
    }
)

You can find the section from the CI / locally running pre-commit.

@olivermolenschot (Contributor Author)

@NanoCode012 Your suggestion worked. The pre-commit now passes, thanks.

As for the pytests, they pass locally (I skipped the e2e folder, as done here too), but they fail on GitHub. A test file that I did not create or modify is also failing:

FAILED tests/test_datasets.py::TestDatasetPreparation::test_load_hub - datasets.data_files.EmptyDatasetError: The directory at /home/runner/work/axolotl/axolotl/mhenrichsen/alpaca_2k_test doesn't contain any data files

I think it might have to do with loading the dataset from Hugging Face. Did you experience this, or do you have any insights on quick fixes?

@NanoCode012 (Collaborator)

I ran your test locally, and it does pass.

Hugging Face sometimes bugs out in our CI, especially for dataset loading.

@winglian (Collaborator) commented Nov 20, 2024

We recently added some pytest fixtures to pre-download the popular datasets and models used in the tests, but it was mostly for the e2e tests. We could consider moving those up to the top level tests. https://github.com/axolotl-ai-cloud/axolotl/blob/main/tests/e2e/conftest.py
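
For reference, a fixture along these lines could be promoted to the top-level tests (a minimal sketch assuming huggingface_hub's snapshot_download; the actual fixtures in tests/e2e/conftest.py may differ):

import pytest
from huggingface_hub import snapshot_download


@pytest.fixture(scope="session", autouse=True)
def download_alpaca_2k_test():
    # Pre-download the dataset that tests/test_datasets.py needs so individual
    # tests don't depend on a live (and occasionally flaky) Hub download.
    snapshot_download(repo_id="mhenrichsen/alpaca_2k_test", repo_type="dataset")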

@olivermolenschot (Contributor Author)

It looks like many of the previously failing tests now pass, @winglian. Only one test failed, because it exceeded the maximum allotted time:
The job running on runner GitHub Actions 230 has exceeded the maximum execution time of 20 minutes.
I'm guessing it has something to do with pre-storing the datasets and models.

@olivermolenschot (Contributor Author)

On Attempt 4, the only failures were the e2e tests. I thought that folder needed to be skipped, @NanoCode012. Can we skip it, or is it needed?

@NanoCode012 (Collaborator)

Hey @olivermolenschot , sorry for this. I'm not sure why the tests are acting up for this PR. I restarted yours again, but it is rare for us to have to do this more than once.

> Can we skip it or is it needed?

E2E tests are quite important to run. We run them last, after our other quick tests, as they are slower and more expensive.

@winglian (Collaborator) left a comment

Just a few minor nits, but this is a great addition!

@@ -178,7 +183,7 @@ def load_tokenized_prepared_datasets(
+ "|".join(
sorted(
[
f"{d.path}:{d.type}:{d.shards}:{d.conversation}{d.split}"
f"{d.path}: {d.type}: {d.shards}: {d.conversation}{d.split}"
Collaborator

The spaces here are probably not necessary for consistency since this gets hashed.

Contributor Author

flake8 fails when I try to commit without these spaces; that's why I included them. What should I do instead?

Resolved (outdated) review threads on: src/axolotl/utils/data/utils.py, tests/test_exact_deduplication.py (x2), src/axolotl/utils/config/models/input/v0_4_1/__init__.py
Comment on lines 18 to 24
def collect_unique_rows(*, dataset: Dataset) -> list:
"""Converts each row to a tuple, keeping only unique rows."""
# Convert dict_values to list for JSON serialization
unique_row_tuples = list(
dict.fromkeys(tuple(json.dumps(list(row.values())) for row in dataset))
)
return unique_row_tuples
Collaborator

Is this the best way for getting the unique rows? It seems very complex and expensive.

Collaborator

What if we use their built-in .unique()? (Please test whether it flattens the data or not.) https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.Dataset.unique

Another concern is memory usage, since your method has to load the entire dataset into memory.

Contributor Author

From what I saw online, there is no built-in method in Hugging Face datasets to get the unique rows of a dataset (see this). Their built-in .unique() function takes in a column and returns a sorted list of unique values for that column.

Regarding whether there is an easier way: if there is no built-in function, I think this might be the easiest approach.

Regarding the cost of memory, this method only gets activated in areas of the code where the data is already fully loaded (for pretraining_datasets it is not activated). Since we are creating a list of unique rows, we are using at most roughly 2x the dataset size as memory (this includes the space taken by the original dataset).

Regarding how expensive/fast the method is, one thing I could do to make it faster is to directly use a hashing algorithm rather than converting every row to a JSON. Originally I was converting a row to a JSON because it makes it hashable, but I could simply use a hashing algorithm which should remove the overhead of converting to a JSON and back.

Please let me know what you think so I can start implementing those changes.

Collaborator

> Their built-in .unique() function takes in a column and returns a sorted list of unique values for that column.

Hm, I see. I misunderstood the docs and thought that, if given multiple columns, it would adjust.

> Regarding how expensive/fast the method is, one thing I could do to make it faster is to directly use a hashing algorithm rather than converting every row to a JSON. Originally I was converting a row to a JSON because it makes it hashable, but I could simply use a hashing algorithm which should remove the overhead of converting to a JSON and back.

By expensive, I mean more the computation cost, since the data gets converted between multiple types.

How about using pandas with their df.drop_duplicates? You should be able to convert between Dataset and DataFrame.
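
For reference, a minimal sketch of that round trip on toy data (assuming all cell values are hashable; as discussed below, list-valued cells need extra handling):

from datasets import Dataset

dataset = Dataset.from_dict({"instruction": ["a", "a", "b"], "output": ["x", "x", "y"]})

df = dataset.to_pandas()                # Dataset -> pandas DataFrame
deduplicated_df = df.drop_duplicates()  # drops exact-duplicate rows, keeps the first occurrence
deduplicated_dataset = Dataset.from_pandas(deduplicated_df, preserve_index=False)
print(len(deduplicated_dataset))        # 2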

Contributor Author

Yeah that can also work. I'll try it out and commit the changes to the PR soon.

@olivermolenschot (Contributor Author), Nov 27, 2024

I tried using df.drop_duplicates. Initially it fails when a cell's content is a list, because lists are not hashable. From my research, the most reasonable way to proceed would be to cast the whole DataFrame to strings and then deduplicate. Something like this:

    # Cast to str so that unhashable cells (e.g. lists) can be compared
    unique_indices = df.astype(str).drop_duplicates().index
    deduplicated_df = df.loc[unique_indices]
    deduplicated_dataset = Dataset.from_pandas(deduplicated_df, preserve_index=False)

But this would be fairly expensive, because df.astype(str) apparently creates a copy of the dataset. Instead, we can use hashing, which would be more memory-efficient for our task, especially for large datasets. Here we would only need to store one hash per row, plus an index per unique row:

seen_hashes = set()
unique_indices = []

# Keep the first occurrence of each row, preserving the original order
for idx, row in enumerate(dataset):
    row_hash = hash(str(row))
    if row_hash not in seen_hashes:
        seen_hashes.add(row_hash)
        unique_indices.append(idx)

deduplicated_dataset = dataset.select(unique_indices)

Contributor Author

I changed it to the hashing method, using sha256 to have enough expressiveness for diverse dataset sizes. This method also gives us the benefit of being able to remove duplicates between train_dataset and eval_dataset, because we can use the same set for both. This is what I did in the last commits (7-8).
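
A minimal sketch of that approach on toy data (illustrative names; not necessarily the exact code in the PR):

import hashlib

from datasets import Dataset


def deduplicate(dataset: Dataset, seen_hashes: set) -> Dataset:
    # Keep the first occurrence of each row, preserving the original order.
    unique_indices = []
    for idx, row in enumerate(dataset):
        row_hash = hashlib.sha256(str(row).encode("utf-8")).hexdigest()
        if row_hash not in seen_hashes:
            seen_hashes.add(row_hash)
            unique_indices.append(idx)
    return dataset.select(unique_indices)


# Sharing one set across splits also drops eval rows that already appear in train.
shared_hashes = set()
train_dataset = deduplicate(Dataset.from_dict({"text": ["a", "a", "b"]}), shared_hashes)
eval_dataset = deduplicate(Dataset.from_dict({"text": ["b", "c"]}), shared_hashes)
print(train_dataset["text"], eval_dataset["text"])  # ['a', 'b'] ['c']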

Resolved (outdated) review thread on tests/test_exact_deduplication.py
Changed the deduplication function to use a more memory-efficient hashing method. Applied Git suggestions to improve clarity and maintainability.

The deduplication now handles cases where train and eval datasets have overlapping elements.
Contributor Author

Note that this file avoids pylint duplicate-code issues with other test files, since some tests shared repeated constants. @winglian

@winglian (Collaborator)

FAILED tests/e2e/test_packing_loss.py::TestPackedLlama::test_loss_packed - Un...
FAILED tests/e2e/test_relora_llama.py::TestReLoraLlama::test_relora - Unbound...
FAILED tests/e2e/test_reward_model_llama.py::TestRewardModelLoraLlama::test_rm_fft

		else:
			if cfg.dataset_exact_deduplication:
				train_dataset, _, _ = deduplicate_and_log_datasets(train_dataset=dataset)
>           eval_dataset = None
E           UnboundLocalError: cannot access local variable 'train_dataset' where it is not associated with a value

src/axolotl/utils/data/sft.py:607: UnboundLocalError

To handle the original case where we do not do deduplication

Co-authored-by: Wing Lian <[email protected]>
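
For readers following along, a guess at the shape of that fix (not the PR's exact diff), based on the traceback above: make sure train_dataset is assigned whether or not deduplication runs.

if cfg.dataset_exact_deduplication:
    train_dataset, _, _ = deduplicate_and_log_datasets(train_dataset=dataset)
else:
    # Handle the original case where we do not do deduplication
    train_dataset = dataset
eval_dataset = None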
@winglian (Collaborator) commented Nov 30, 2024

Hmmm. It doesn't make sense why installing from sdist results in a test failure of:

FAILED tests/test_exact_deduplication.py::TestDeduplicateNonRL::test_prepare_dataset_without_deduplication - ValueError: data_files must be either a string or list of strings

Flaky test, it seems; rerunning seems to pass now.

- Added test cases to simulate and verify handling of forced hash collisions between datasets.
- Ensured that datasets with identical hashes but different content are correctly identified, preventing incorrect deduplication.
- Updated unit tests to include scenarios where collisions occur across both training and evaluation datasets, as well as within a single dataset.
@olivermolenschot (Contributor Author)

Any idea why the pytest fails @winglian? It works locally.

@winglian (Collaborator) commented Nov 30, 2024

Looks like it passed on some of the other versions of Python, which usually indicates a flaky test. My best guess is that it failed to download the dataset directly from HF, ending up in that else clause. I'll trigger it to run again shortly.

EDIT: but the failure was the same as the previous one:

FAILED tests/test_exact_deduplication.py::TestDeduplicateNonRL::test_prepare_dataset_with_deduplication_eval - ValueError: data_files must be either a string or list of strings

Successfully merging this pull request may close these issues.

Feature Request: Adding dataset deduplication process