From 8e97415fd7105705b9a6ae8b4fed7f680a9b533c Mon Sep 17 00:00:00 2001
From: Christian Copeland <93938308+christiancopeland@users.noreply.github.com>
Date: Tue, 16 Jan 2024 20:05:26 -0500
Subject: [PATCH] Update README.md

grammar fixes
---
 pii/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/pii/README.md b/pii/README.md
index 63d1a20..23d447f 100644
--- a/pii/README.md
+++ b/pii/README.md
@@ -4,12 +4,12 @@ We provide code to detect Names, Emails, IP addresses, Passwords API/SSH keys in
 ## NER approach
 For the **NER** model based approach (e.g [StarPII](https://huggingface.co/bigcode/starpii)), please go to the `ner` folder. 
 
-We provide the code used for training a PII NER model to detect : Names, Emails, Keys, Passwords & IP addresses (more details in our paper: [StarCoder: May The Source Be With You](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view)).  You will also find the code (and `slurm` scripts) used for running PII Inference on [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata), we were able to detect PII in 800GB of text in 800 GPU-hours on A100 80GB. To replace secrets we used teh following tokens:
+We provide the code used for training a PII NER model to detect: Names, Emails, Keys, Passwords & IP addresses (more details in our paper: [StarCoder: May The Source Be With You](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view)). You will also find the code (and `slurm` scripts) used for running PII inference on [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata). We were able to detect PII in 800GB of text in 800 GPU-hours on A100 80GB. To replace secrets, we used the following tokens:
 `<NAME>, <EMAIL>, <KEY>, <PASSWORD>`
 To mask IP addresses, we randomly selected an IP address from 5~synthetic, private, non-internet-facing IP addresses of the same type.
 
 ## Regex approach
-Below we explain the regex based approach to dectect Emails, IP addresses adn keys only:
+Below we explain the regex-based approach to detect Emails, IP addresses and keys only:
 We use regexes for emails and IP addresses (they are adapted from [BigScience PII pipeline](https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training/02_pii)). And we use [detect-secrets](https://github.com/Yelp/detect-secrets) for finding secrets keys. We additionally implement some filters on top to reduce the number of false positives. There is also some evaluation code to test the pipeline on a PII benchmark we annotated.
 
 ## Usage of the regex approach