This repository is heavily inspired by the BigCode repository and is mostly a refactoring of their code, the bulk of which was written by Chenghao Mou (awesome work!).
pip install decontamination
First, you need to specify which benchmarks you want to clean your data of. You do this by passing the benchmark names as they appear on the Hugging Face Hub (i.e. their `datasets` repository names). For example, if you want to clean your data of the HumanEval and LAMBADA benchmarks, you would do the following:
import os
from datasets import load_dataset
from decontamination.core import BenchmarkCleaner

# make your Hugging Face access token available to `datasets`
# (an `!export HF_ACCESS_TOKEN=<TOKEN>` notebook cell runs in a subshell
#  and would not persist for the Python process)
os.environ["HF_ACCESS_TOKEN"] = "<TOKEN>"

# load your dataset
dataset = load_dataset("bigcode/the-stack-smol", data_dir="data/python", split="train")

# benchmarks to clean against, by their names on the Hugging Face Hub
benchmarks = ["openai_humaneval", "lambada"]
cleaner = BenchmarkCleaner(benchmarks, "/tmp/benchmarks", threshold=0.1, num_perm=128)

# clean the dataset
cleaned_dataset = cleaner.clean(dataset, column="content", check_for_fp=True)
[01/24/23 00:27:37] INFO Benchmark datasets already exist. Skipping hashing. core.py:181
/home/nathan/miniconda3/envs/decontamination/lib/python3.10/site-packages/datasets/arrow_dataset.py:1533: FutureWarning: 'fs' was deprecated in favor of 'storage_options' in version 2.8.0 and will be removed in 3.0.0.
You can remove this warning by passing 'storage_options=fs.storage_options' instead.
  warnings.warn(
Checking for false positives...: 100%|██████████| 8636/8636 [00:33<00:00, 261.25it/s]
Checking for false positives...: 100%|██████████| 8805/8805 [06:58<00:00, 21.06it/s]
Checking for false positives...: 100%|██████████| 8722/8722 [06:39<00:00, 21.82it/s]
Filtering duplicates... #0: 100%|██████████| 1/1 [00:00<00:00, 140.36ba/s]
Filtering duplicates... #1: 100%|██████████| 1/1 [00:00<00:00, 123.28ba/s]
...
Filtering duplicates... #31: 100%|██████████| 1/1 [00:00<00:00, 69.66ba/s]
[01/24/23 00:41:50] INFO Data Number : 10000 core.py:277
INFO Duplicate Number : 3932 core.py:278
INFO Duplicate Rate : 39.32% core.py:279
INFO Total Time : 853.66 seconds core.py:280
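The `threshold` and `num_perm` arguments correspond to MinHash-based near-duplicate detection, which is what the BigCode code this package refactors is built on. As a self-contained illustration of what a Jaccard threshold like 0.1 means (this sketch uses the `datasketch` library and naive whitespace tokenization purely for demonstration; it is not the package's internal implementation):

from datasketch import MinHash

def fingerprint(text: str, num_perm: int = 128) -> MinHash:
    # hash each whitespace-separated token into a fixed-size MinHash signature
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

a = fingerprint("def add(a, b):\n    return a + b")
b = fingerprint("def add(x, y):\n    return x + y")
# pairs whose estimated Jaccard similarity exceeds the threshold
# (0.1 in the example above) are treated as potential contamination
print(a.jaccard(b))

A lower threshold is more aggressive: it flags documents that share even a small fraction of their token sets with a benchmark, at the cost of more false positives (which is what the `check_for_fp=True` pass then re-examines).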
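The cleaned dataset is a regular `datasets.Dataset`, so you can check how much of your data survived and save the result as usual. A minimal sketch (the row counts below simply mirror the run above, where 3,932 of 10,000 rows were flagged):

# rows removed = the reported duplicate number (3932 of 10000 above)
print(f"{len(dataset):,} -> {len(cleaned_dataset):,} rows")
cleaned_dataset.save_to_disk("/tmp/the-stack-smol-cleaned")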