Personally Identifiable Information (PII) refers to any data that could potentially identify a specific individual, such as names, addresses, phone numbers, email addresses, Social Security numbers, and IP addresses.
This repo serves as a starting point for an (almost) complete solution: a containerized API endpoint for a classifier that detects whether a given text contains PII. The code covers downloading an open-source pre-trained LLM (Large Language Model), fine-tuning it locally on an open-source PII dataset, saving the model weights, and deploying them by building a TorchServe server in a Docker container that can be queried from the command line to classify user input text.
This repo follows an MVP (Minimum Viable Product) approach, meaning that it consists of the minimal components for an idea / experiment that can be easily shared with other users / developers and then adapted / extended to specific use cases. Step 1 is to fine-tune a pre-trained LLM on PII data. Step 2 is to deploy the model to a server for users to query. The two steps must run in that order, but each can be executed independently, either locally or on a remote server. For example, you could first fine-tune the LLM on a local machine and then deploy the model on a server in GCP or AWS.
Built with
- Python (3.10 or higher)
- Docker (installation required)
- HuggingFace (a personal account is necessary for the access token)
- PyTorch + HF's Transformers + TorchServe
To make this demo easily accessible, one of the most established open-source PII datasets was used: pii-masking from the company AI4Privacy. However, you can relatively easily use your own data by modifying the `PIIDataset` class. The pii-masking dataset includes 103 classes of different PII use cases such as names, IP addresses, and emails. More technical information about the dataset:
- 5.6m tokens with 65k PII examples.
- Multiple languages
- Human-in-the-loop validated high quality dataset
- Synthetic data generated using proprietary algorithms
For more information about this dataset, please follow this link.
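To take a quick look at the data yourself, you can pull it straight from the HuggingFace Hub. A minimal sketch, assuming the dataset ID is `ai4privacy/pii-masking-65k` (verify the exact ID via the link above):

```python
# Sketch: inspect the AI4Privacy PII dataset (the dataset ID is an assumption).
from datasets import load_dataset

dataset = load_dataset("ai4privacy/pii-masking-65k")
print(dataset)              # available splits and column names
print(dataset["train"][0])  # one labelled PII example
```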
To build a text classifier, Meta's opt-350m LLM was chosen. It is a pre-trained, decoder-only model of quite a small size (350 million parameters, i.e. less than 1 GB in storage), making it ideal for running locally on a laptop and iterating quickly. It is by no means an ideal model for a final product, which is why the code is written in a way that it can easily be replaced by another HF model (either public or private).
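For illustration, swapping the backbone amounts to changing one HF model ID. A minimal sketch, assuming only that the model is referenced by its ID (the classification head and label mapping are handled by the repo's training code):

```python
# Sketch: load the pre-trained backbone by its HF model ID.
# Any other public or private HF model ID can be substituted here.
from transformers import AutoModel, AutoTokenizer

model_name = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # ~350M
```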
During server initialization, a simple function is executed to approximate the optimal inference batch size for the available GPU of that server. After loading the model on the available GPU (if it exists!), it runs inference on the provided dataset with a very large batch size, expecting an OOM (Out Of Memory) error, and gradually decreases the batch size until the whole dataset can pass through.
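As a rough illustration of that search, here is a minimal sketch (the function name, the halving schedule, and the tensor-shaped `dataset` are all assumptions; the repo's actual implementation may differ):

```python
import torch

def find_max_batch_size(model, dataset, device, start_batch_size=1024):
    """Shrink the batch size on OOM until the whole dataset passes through."""
    batch_size = start_batch_size
    while batch_size >= 1:
        try:
            with torch.no_grad():
                for i in range(0, len(dataset), batch_size):
                    batch = dataset[i : i + batch_size].to(device)
                    model(batch)
            return batch_size  # the full dataset fit through at this size
        except torch.cuda.OutOfMemoryError:  # plain RuntimeError on torch < 1.13
            torch.cuda.empty_cache()  # release the failed allocation
            batch_size //= 2  # back off and retry with a smaller batch
    raise RuntimeError("Even a batch size of 1 does not fit on this GPU")
```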
The project consists of two sequential steps: first locally fine-tuning the model, then serving it with Docker.
- Clone the repo to your desired location:

```
git clone https://github.com/Kostis-S-Z/pii_detector.git
cd pii_detector
```
- Create a virtual environment and install the dependencies:

```
python3 -m venv ./venv  # Or use any other virtual environment manager such as poetry
source venv/bin/activate
pip install -r requirements.txt
```
- Run the training script `train_detector.py` locally to generate the fine-tuned model weights. Verify the training was successful by checking that a `pytorch_model.bin` file exists inside the `pii_detector` directory.

```
python3 train_detector.py
```
- Add your HF access token to `config.json` in the `pii_detector` directory. Find your token by following the instructions here.

```
"hf_token": "hf_XXX"
```
- Build the image with the Dockerfile. Supply the build arg `DEVICE=cpu` or `DEVICE=gpu` depending on the device you want to run it on.

```
docker build --build-arg DEVICE=cpu -t my_pii_detector_image .
```
- Deploy the image locally or to a server / cloud of your choice for inference.

```
docker run -d -p 7080:80 --name my_pii_detector_container my_pii_detector_image
```
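You can verify the container started correctly with standard Docker commands (using the container name chosen above):

```
docker ps                                 # the container should be listed as running
docker logs -f my_pii_detector_container  # follow the TorchServe startup logs
```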
Note: When initializing the TorchServe handler on a GPU, it will run a search for the optimal batch size on an inference dataset. By default, this is the same dataset as used for training the model. To change this, either update the `train_detector.py` script before training, or manually update the field `dataset_name` in the `config.json` in the `pii_detector` directory with the HF dataset ID of your choice. Note however that the dataset you provide needs to follow the same format as the one used for training.
- Test the server by sending a request, e.g. for local deployment:

```
curl -X POST http://localhost:7080/predictions/pii_detector -d '["My name is Bob"]'
```
You should expect back a response that looks like this:

```
[["O","O","O","FIRSTNAME"]]
```

which corresponds to `"My"` -> `"O"` (meaning non-PII text), `"name"` -> `"O"`, `"is"` -> `"O"`, `"Bob"` -> `"FIRSTNAME"`.
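If you would rather query the endpoint from Python than from curl, a minimal equivalent sketch using the `requests` library:

```python
# Sketch: the same request as the curl call above, sent from Python.
import json
import requests

response = requests.post(
    "http://localhost:7080/predictions/pii_detector",
    data=json.dumps(["My name is Bob"]),
)
print(response.json())  # e.g. [["O", "O", "O", "FIRSTNAME"]]
```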
The following fields can be set in `config.json`:

- `dataset_name`: HF ID of a labelled dataset of PII samples that will be used for fine-tuning the LLM.
- `inference_dataset_name`: HF ID of a labelled dataset (of the same format as the one above!) used for evaluation during serving.
- `model_name`: HF ID of a pre-trained LLM suitable for fine-tuning for text classification.
- `train_dataset_size`: Set the size of the training dataset.
- `eval_dataset_size`: Set the size of the evaluation dataset.
- `max_len`: Set the maximum length of the input text sequence for the tokenizer.
- `batch_size`: Set the number of samples in a training batch.
- `PIIDataset`: Use either `samples_to_use` OR `slice_start` / `slice_end` to explicitly define how many samples from the data to use for training OR which exact slice to train with.
- `TrainingArguments`: Check here for the extensive list of hyper-parameters you can modify for training.
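Putting the fields together, a `config.json` could look roughly like this (all values are illustrative rather than the repo's defaults, the `PIIDataset` / `TrainingArguments` options are omitted, and you should never commit a real token):

```
{
  "hf_token": "hf_XXX",
  "model_name": "facebook/opt-350m",
  "dataset_name": "ai4privacy/pii-masking-65k",
  "inference_dataset_name": "ai4privacy/pii-masking-65k",
  "train_dataset_size": 10000,
  "eval_dataset_size": 1000,
  "max_len": 128,
  "batch_size": 32
}
```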
Distributed under the MIT License. See `LICENSE` for more information.