Add Exact Deduplication Feature to Preprocessing Pipeline #2072
base: main
Conversation
@NanoCode012 the pre-commit is failing on code that I did not change. Any insights on this?
I think it's detecting duplicates, as the test configs are similar across tests. It's fine to ignore this case in tests. You can add something like the snippet in axolotl/tests/test_packed_dataset.py, lines 20 to 28 (at 0c8b1d8).
You can find the section from the CI, or by running pre-commit locally.
e52137d to 0c3ebbc
@NanoCode012 Your suggestion worked. The pre-commit now passes, thanks. As for the pytests, they pass locally (I skipped the e2e folder, as done here too), but they fail on GitHub. It seems to have to do with loading the dataset from HuggingFace. A test file that I did not create/modify is also failing:
I think it might have to do with loading the dataset from HuggingFace. Did you experience this, or have any insights on quick fixes?
I ran your test locally, and it does pass. HuggingFace sometimes bugs out in our CI, especially for dataset loading.
We recently added some pytest fixtures to pre-download the popular datasets and models used in the tests, but it was mostly for the e2e tests. We could consider moving those up to the top-level tests. https://github.com/axolotl-ai-cloud/axolotl/blob/main/tests/e2e/conftest.py
It looks like many of the previously failed tests now pass, @winglian. Only one test failed, because it exceeded the maximum allotted time:
On attempt 4, the only tests that failed are the e2e tests. I thought that folder needed to be skipped, @NanoCode012. Can we skip it, or is it needed?
Hey @olivermolenschot, sorry for this. I'm not sure why the tests are acting up for this PR. I restarted yours again, but it is rare for us to have to do this more than once.
E2E tests are quite important to run. We run them last, after our other quick tests, as they are slower and more expensive.
Just a few minor nits, but this is a great addition!
@@ -178,7 +183,7 @@ def load_tokenized_prepared_datasets(
        + "|".join(
            sorted(
                [
-                   f"{d.path}:{d.type}:{d.shards}:{d.conversation}{d.split}"
+                   f"{d.path}: {d.type}: {d.shards}: {d.conversation}{d.split}"
The spaces here are probably not necessary for consistency since this gets hashed.
flake8 fails when I try to commit because of these spaces; that's why I included them. What should I do instead?
src/axolotl/utils/data/utils.py
def collect_unique_rows(*, dataset: Dataset) -> list:
    """Converts each row to a tuple, keeping only unique rows."""
    # Convert dict_values to list for JSON serialization
    unique_row_tuples = list(
        dict.fromkeys(tuple(json.dumps(list(row.values())) for row in dataset))
    )
    return unique_row_tuples
Is this the best way to get the unique rows? It seems very complex and expensive.
What if we use their built-in .unique()? (Please test whether it flattens the data or not.) https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.Dataset.unique
Another concern is memory usage from having to load the entire dataset into memory, as in your method.
From what I saw online, there is no inbuilt method in huggingface datasets to get the unique rows of a dataset (see this). Their inbuilt .unique() function takes a column and provides a sorted list of unique values for that column.
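For reference, a quick sanity check of the per-column behavior on a toy dataset (illustrative only, not code from the PR):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "a"], "label": [0, 1, 0]})
# .unique() works per column; it does not deduplicate whole rows.
print(ds.unique("text"))   # ['a', 'b']
print(ds.unique("label"))  # [0, 1]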
Regarding whether there is an easier way: if there is no inbuilt function, I think this might be the easiest approach.
Regarding the cost of memory, this method only gets activated in areas of the code where the data is already fully loaded (it is not activated for pretraining_datasets). Since we are creating a list of unique rows, we are using at most roughly 2x the dataset size in memory (this includes the space taken by the original dataset).
Regarding how expensive/fast the method is, one thing I could do to make it faster is to directly use a hashing algorithm rather than converting every row to a JSON. Originally I was converting a row to a JSON because it makes it hashable, but I could simply use a hashing algorithm which should remove the overhead of converting to a JSON and back.
Please let me know what you think so I can start implementing those changes.
Their inbuilt .unique() function takes a column and provides a sorted list of unique values for that column.
Hm, I see. I misunderstood the docs and thought that, given multiple columns, it would adjust.
Regarding how expensive/fast the method is, one thing I could do to make it faster is to directly use a hashing algorithm rather than converting every row to a JSON. Originally I was converting a row to a JSON because it makes it hashable, but I could simply use a hashing algorithm which should remove the overhead of converting to a JSON and back.
By expensive, I mean more the computational cost, since the data gets converted between multiple types.
How about using pandas with their df.drop_duplicates? You should be able to convert between Dataset and DataFrame.
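A minimal sketch of that round-trip, assuming the whole dataset fits in memory (and noting the list-column caveat discussed below):

from datasets import Dataset


def drop_exact_duplicates(dataset: Dataset) -> Dataset:
    """Round-trip through pandas and drop exact-duplicate rows."""
    df = dataset.to_pandas()
    # drop_duplicates keeps the first occurrence, preserving row order
    deduplicated_df = df.drop_duplicates()
    return Dataset.from_pandas(deduplicated_df, preserve_index=False)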
Yeah that can also work. I'll try it out and commit the changes to the PR soon.
I tried using df.drop_duplicates. Initially it fails when the cell content is a list, because lists are not hashable. From my research, the most reasonable way to proceed would be to cast the whole dataframe to string and then proceed. Something like this:
unique_indices = df.astype(str).drop_duplicates().index
deduplicated_df = df.loc[unique_indices]
deduplicated_dataset = Dataset.from_pandas(deduplicated_df, preserve_index=False)
But this would be fairly expensive because apparently df.astype(str) will create a duplicate of the dataset. Instead, we can use hashing, which would be more memory-efficient for our task, especially for large datasets. Here we would only need to store one hash and one integer index per row:
seen_hashes = set()
unique_indices = []
for idx, row in enumerate(dataset):
    # Hash the stringified row so it can be stored in a set
    row_hash = hash(str(row))
    if row_hash not in seen_hashes:
        seen_hashes.add(row_hash)
        unique_indices.append(idx)
# Keep only the first occurrence of each row, preserving order
deduplicated_dataset = dataset.select(unique_indices)
I changed it to the method where I hash each row, using sha256 to have enough expressiveness for diverse dataset sizes. Using this method also gives us the benefit of being able to remove duplicates between train_dataset and eval_dataset, because we can use the same set for both. This is what I did in the last commit (7-8).
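A minimal sketch of that approach, with a seen-hash set shared across the train and eval splits (the names here are illustrative and may differ from the PR's actual implementation):

import hashlib
from typing import Optional, Set

from datasets import Dataset


def deduplicate(dataset: Dataset, seen_hashes: Optional[Set[str]] = None) -> Dataset:
    """Keep only the first occurrence of each exact-duplicate row, preserving order."""
    if seen_hashes is None:
        seen_hashes = set()
    unique_indices = []
    for idx, row in enumerate(dataset):
        # sha256 of the stringified row gives a stable, collision-resistant key
        row_hash = hashlib.sha256(str(row).encode("utf-8")).hexdigest()
        if row_hash not in seen_hashes:
            seen_hashes.add(row_hash)
            unique_indices.append(idx)
    return dataset.select(unique_indices)


# Sharing one set also drops eval rows that already appear in train:
# shared = set()
# train_dataset = deduplicate(train_dataset, shared)
# eval_dataset = deduplicate(eval_dataset, shared)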
Changed the deduplication function to use a more memory-efficient hashing method. Applied Git suggestions to improve clarity and maintainability. The deduplication now handles cases where train and eval datasets have overlapping elements.
src/axolotl/utils/data/constants.py
Note that this file avoids pylint issues across other files, as there were repeated constants between some tests. @winglian
FAILED tests/e2e/test_packing_loss.py::TestPackedLlama::test_loss_packed - Un...
To handle the original case where we do not do deduplication
Co-authored-by: Wing Lian <[email protected]>
Flaky test, it seems; rerunning seems to pass now.
- Added test cases to simulate and verify handling of forced hash collisions between datasets.
- Ensured that datasets with identical hashes but different content are correctly identified, preventing incorrect deduplication.
- Updated unit tests to include scenarios where collisions occur across both training and evaluation datasets, as well as within a single dataset.
Any idea why the pytest fails, @winglian? It works locally.
Looks like it passed on some of the other versions of Python. Usually this indicates a flaky test. My best guess is that it failed to download the dataset directly from HF, ending up in that else clause. I'll trigger it to run again shortly. EDIT: but the failure was the same as the previous one:
Description
I created a feature that will deduplicate datasets (exact matches) and maintain the same order. This feature is activated by setting exact_deduplication = true in the yaml configuration file. The feature works for both RL tasks and non-RL tasks. However, it will not be triggered for pretraining_dataset. The feature is triggered only via the preprocess command and not the train command.
Motivation and Context
Users often have datasets from different sources, which may contain duplicate rows. This feature automates the deduplication process, reducing the manual effort previously required by users to remove duplicate entries.
Resolves #1946
How has this been tested?
Tested with all configurations (exact_deduplication = true/false/None) by running python -m axolotl.cli.preprocess.
Screenshots (if appropriate)
N/A
Types of changes