diff --git a/README.md b/README.md index 87af208..2490071 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,20 @@ +![ONS and DSC logos](https://github.com/datasciencecampus/awesome-campus/blob/master/ons_dsc_logo.png) + # `pprl_toolkit`: a toolkit for privacy-preserving record linkage +> "We find ourselves living in a society which is rich with data and the opportunities that comes with this. Yet, when disconnected, this data is limited in its usefulness. ... Being able to link data will be vital for enhancing our understanding of society, driving policy change for greater public good." Sir Ian Diamond, the National Statistician + +The Privacy Preserving Record Linkage (PPRL) toolkit demonstrates the feasibility of record linkage in difficult 'eyes off' settings. It has been designed for a situation where two organisations (perhaps in different jurisdictions) want to link their datasets at record level, to enrich the information they contain, but neither party is able to send sensitive personal identifiers -- such as names, addresses or dates of birth -- to the other. Building on [previous ONS research](https://www.gov.uk/government/publications/joined-up-data-in-government-the-future-of-data-linking-methods/privacy-preserving-record-linkage-in-the-context-of-a-national-statistics-institute), the toolkit implements a well-known privacy-preserving linkage method in a new way to improve performance, and wraps it in a secure cloud architecture to demonstrate the potential of a layered approach. + +The toolkit has been developed by data scientists at the [Data Science Campus](https://datasciencecampus.ons.gov.uk/) of the UK Office for National Statistics. This project has benefitted from early collaborations with colleagues at NHS England. + +The two parts of the toolkit are: + +* a Python package for privacy-preserving record linkage with Bloom filters and hash embeddings, that can be used locally with no cloud set-up +* instructions, scripts and resources to run record linkage in a cloud-based secure enclave. This part of the toolkit requires you to set up Google Cloud accounts with billing + +We're publishing the repo as a prototype and teaching tool. Please feel free to download, adapt and experiment with it in compliance with the open-source license. You can submit issues [here](https://github.com/datasciencecampus/pprl_toolkit/issues). However, as this is an experimental repo, the development team cannot commit to maintaining the repo or responding to issues. If you'd like to collaborate with us, to put these ideas into practice for the public good, please [get in touch](https://datasciencecampus.ons.gov.uk/contact/). + ## Installation To install the package from source, you must clone the repository before @@ -32,9 +47,11 @@ pre-commit install ## Getting started +The Python package implements the Bloom filter linkage method ([Schnell et al., 2009](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-9-41)), and can also implement pretrained Hash embeddings ([Miranda et al., 2022](https://arxiv.org/abs/2212.09255)), if a suitable large, pre-matched corpus of data is available. + Let us consider a small example where we want to link two excerpts of data on bands. In this scenario, we are looking at some toy data on the members of a -fictional, German rock trio called "Verknüpfung". +fictional, German rock trio called "Verknüpfung". In this example we will see how to use untrained Bloom filters to match data. ### Loading the data @@ -51,7 +68,6 @@ matching. 
We will use the toolkit to identify these matches. ... "last_name": ["Daten", "Gorman", "Knopf"], ... "gender": ["f", "m", "f"], ... "instrument": ["bass", "guitar", "drums"], -... "vocals_ever": [True, True, True], ... } ... ) >>> df2 = pd.DataFrame( @@ -59,7 +75,6 @@ matching. We will use the toolkit to identify these matches. ... "name": ["Laura Datten", "Greta Knopf", "Casper Goreman"], ... "sex": ["female", "female", "male"], ... "main_instrument": ["bass guitar", "percussion", "electric guitar"], -... "vocals": ["yes", "sometimes", "sometimes"], ... } ... ) @@ -74,89 +89,60 @@ matching. We will use the toolkit to identify these matches. ### Creating and assigning a feature factory The next step is to decide how to process each of the columns in our datasets. - -To do this, we define a feature factory that maps column types to feature -generation functions, and a column specification for each dataset mapping our -columns to column types in the factory. +The `pprl.embedder.features` module provides functions that process different data types so that they can be embedded into the Bloom filter. We pass these functions into the embedder in a dictionary called a feature factory. We also provide a column specification for each dataset mapping our columns to column types in the factory. ```python >>> from pprl.embedder import features +>>> from functools import partial >>> >>> factory = dict( ... name=features.gen_name_features, ... sex=features.gen_sex_features, -... misc=features.gen_misc_features, +... instrument=partial(features.gen_misc_shingled_features, label="instrument"), ... ) >>> spec1 = dict( ... first_name="name", ... last_name="name", ... gender="sex", -... instrument="misc", -... vocals_ever="misc", +... instrument="instrument", ... ) ->>> spec2 = dict(name="name", sex="sex", main_instrument="misc", vocals="misc") +>>> spec2 = dict(name="name", sex="sex", main_instrument="instrument") ``` ### Embedding the data With our specifications sorted out, we can get to creating our Bloom filter -embedding. Before doing so, we need to decide on two parameters: the size of -the filter and the number of hashes. By default, these are `2**10` and `2`, -respectively. - -Once we've decided, we can create our `Embedder` instance and use it to embed -our data with their column specifications. +embedding. We can create our `Embedder` instance and use it to embed +our data with their column specifications. The `Embedder` object has two more parameters: the size of the filter and the number of hashes. We can use the defaults. ```python >>> from pprl.embedder.embedder import Embedder >>> ->>> embedder = Embedder(factory, bf_size=2**10, num_hashes=2) +>>> embedder = Embedder(factory, bf_size=1024, num_hashes=2) >>> edf1 = embedder.embed(df1, colspec=spec1, update_thresholds=True) >>> edf2 = embedder.embed(df2, colspec=spec2, update_thresholds=True) ``` -If we take a look at one of these embedded datasets, we can see that it has a -whole bunch of new columns. There is a `_features` column for each of the -original columns containing their pre-embedding string features. Then there are -three additional columns: `bf_indices`, `bf_norms` and `thresholds`. 
- -```python ->>> edf1.columns -Index(['first_name', 'last_name', 'gender', 'instrument', 'vocals_ever', - 'first_name_features', 'last_name_features', 'gender_features', - 'instrument_features', 'vocals_ever_features', 'all_features', - 'bf_indices', 'bf_norms', 'thresholds'], - dtype='object') - -``` - - - ### Performing the linkage -We can now perform the linkage by comparing these Bloom filter embeddings. We -use the Soft Cosine Measure to calculate record-wise similarity and an adapted -Hungarian algorithm to match the records based on those similarities. +We can now perform the linkage by comparing these Bloom filter embeddings. The package +uses the Soft Cosine Measure to calculate record-wise similarity scores. ```python >>> similarities = embedder.compare(edf1, edf2) >>> similarities -SimilarityArray([[0.86017213, 0.14285716, 0.12803688], - [0.13216962, 0.13483999, 0.50067019], - [0.12126782, 0.76292716, 0.09240265]]) +SimilarityArray([[0.80074101, 0.18160957, 0.09722178], + [0.40124732, 0.1877348 , 0.58792979], + [0.13147656, 0.51426533, 0.11772856]]) ``` -This `SimilarityArray` object is an augmented `numpy.ndarray` that can perform -our matching. The matching itself has a number of parameters that allow you to -control how similar two embedded records must be to be matched. In this case, -let's say that two records can only be matched if their pairwise similarity is -at least `0.5`. +Lastly, we compute the matching using an adapted Hungarian algorithm with local match thresholds: ```python ->>> matching = similarities.match(abs_cutoff=0.5) +>>> matching = similarities.match() >>> matching (array([0, 1, 2]), array([0, 2, 1])) @@ -167,150 +153,30 @@ So, all three of the records in each dataset were matched correctly. Excellent! ## Working in the cloud -The toolkit is configured to work on Google Cloud Platform (GCP) provided you -have a team of users with Google Cloud accounts and the appropriate -permissions. In particular, `pprl_toolkit`'s cloud functionality is built on -top of a GCP Confidential Space. This setting means that nobody ever has direct -access to each other's data, and the datasets to be linked are only ever -brought together in a secure environment. - -Have a read through [this tutorial](https://cloud.google.com/confidential-computing/confidential-space/docs/create-your-first-confidential-space-environment) -if you would like to get to grips with how it all works on the inside. - -### Determining roles - -There are four roles to fill in a data linkage project: two data-owning -parties, a workload author, and a workload operator. A workload is how we refer -to the linkage operation itself. These roles can be summarised as follows: - -- A data-owning **party** is responsible for embedding and uploading their data - to the cloud. They also download their results. -- The workload **author** creates and uploads a Docker image to a GCP Artifact - Registry. -- The workload **operator** runs the uploaded Docker image on a Confidential - Space virtual machine. - -> [!NOTE] -> We have set up `pprl_toolkit` to allow any configuration of these roles among -> users. You could do it all yourself, split the workload roles between two -> data owning-parties, or use a third-party administrator to maintain the -> workload. - -### Creating your projects - -Once you have decided who will be filling which role(s), every member of your -linkage project will need to set up a GCP project. The names of these projects -will be used in file names and GCP storage buckets. 
As such, they need to be -descriptive and [unique](https://cloud.google.com/storage/docs/buckets#naming). - -> [!TIP] -> It may be worth appending a hash of some sort to every project name to help -> ensure their uniqueness. - -Each user will also need to have their Google Cloud administrator grant them -certain IAM roles on their project depending on which role(s) they are playing -in the linkage: - -- **Data-owning party**: - - Cloud KMS Admin (`roles/cloudkms.admin`) - - IAM Workload Identity Pool Admin (`roles/iam.workloadIdentityPoolAdmin`) - - Service Usage Admin (`roles/serviceusage.serviceUsageAdmin`) - - Service Account Admin (`roles/iam.serviceAccountAdmin`) - - Storage Admin (`roles/storage.admin`) -- **Workload author**: - - Artifact Registry Administrator (`roles/artifactregistry.admin`) -- **Workload operator**: - - Compute Admin (`roles/compute.admin`) - - Security Admin (`roles/securityAdmin`) - - Storage Admin (`roles/storage.admin`) - -### Toolkit configuration - -Now you've got your roles sorted out and projects set up, you (and all other -users) have to write down your project's configuration in an environment file -for `pprl_toolkit`. Make sure that everyone has installed `pprl_toolkit` first. - -We have provided an example in `.env.example`. All you need to do is copy that -file to `.env` and fill in your project's details. Everyone in your project -should have identical environment files. - -### Creating the other resources - -The last step in setting your linkage project up is to create and configure all -the other resources on GCP. We have packaged up these steps into a series of -`bash` scripts, located in the `scripts/` directory. They should be executed in -order from the `scripts/` directory: - -1. The data-owning parties set up a key encryption key, a bucket in which to - store their encrypted data, data encryption key and results, a service - account for accessing said bucket and key, and a workload identity pool to - allow impersonations under stringent conditions: - ```bash - sh ./01-setup-party-resources.sh - ``` -2. The workload operator sets up a bucket for the parties to put their - (non-sensitive) attestation credentials, and a service account for running - the workload: - ```bash - sh ./02-setup-workload-operator.sh - ``` -3. The workload author sets up an Artifact Registry on GCP, creates a Docker - image and uploads that image to their registry: - ```bash - sh ./03-setup-workload-author.sh - ``` -4. The data-owning parties authorise the workload operator's service account to - use the workload identity pool to impersonate their service account in a - Confidential Space: - ```bash - sh ./04-authorise-workload.sh - ``` - -### Processing and uploading the datasets - -> [!IMPORTANT] -> This section only applies to data-owning parties. The workload author is -> finished now, and the workload operator should wait for this section to be -> completed before moving on to the next section. - -Now that all the cloud infrastructure has been set up, we are ready to start -the first step in doing the actual linkage. Much like the toy example above, -that is to make a Bloom filter embedding of each dataset. - -For users who prefer a graphical user interface, we have included a Flask app -to handle the processing and uploading of data behind the scenes. This app will -also be used to download the results once the linkage has completed. 
- -To launch the app, run the following in your terminal: - -```bash -python -m flask --app src/pprl/app run -``` -You should now be able to find the app in your browser of choice at -[127.0.0.1:5000](http://127.0.0.1:5000). +![A diagram of the PPRL cloud architecture, with the secure enclave and key management services](https://github.com/datasciencecampus/pprl_toolkit/blob/main/assets/pprl_cloud_diagram.png?raw=true) -Once you have worked through the selection, processing, and GCP upload portions -of the app, you will be at a holding page. This page can be updated by clicking -the button, and when your results are ready you will be taken to another page -where you can download them. +The cloud demo uses a Google Cloud Platform (GCP) Confidential Space compute instance, which is a virtual machine (VM) using AMD [Secure Encrypted Virtualisation](https://www.amd.com/en/developer/sev.html) (AMD-SEV) technology to encrypt data in-memory. -### Running the linkage +The Confidential Space VM can also provide cryptographically signed documents, called attestations, which the server can use to prove that it is running in a secure environment before gaining access to data. -> [!IMPORTANT] -> This section only applies to the workload operator. +The cloud demo assigns four roles: two data-owning +parties, a workload author, and a workload operator. These roles can be summarised as follows: -Once the data-owning parties have uploaded their processed data, you are able -to begin the linkage. To do so, run the `05-run-workload.sh` bash script from -`scripts/`: +- Each data-owning **party** is responsible for embedding and uploading their data + to the cloud. They also download their results. +- The workload **author** audits and assures the source code of the server, and then builds and uploads the server as a Docker image. +- The workload **operator** sets up and runs the Confidential + Space virtual machine, which uses the Docker image to perform the record linkage. -```bash -cd /path/to/pprl_toolkit/scripts -sh ./05-run-workload.sh -``` +We have set up `pprl_toolkit` to allow any configuration of these roles among +users. You could do it all yourself, split the workload roles between two +data owning-parties, or ask a trusted third party to maintain the +workload. -You can follow the progress of the workload from the Logs Explorer on GCP. Once -it is complete, the data-owning parties will be able to download their results. +> [!WARNING] +> The cloud demo requires you to set up one or more Google Cloud accounts with billing. The cost of running the demo should be very small, or within your free quota. +> However, you should ensure that all resources are torn down after running the demo to avoid ongoing charges. ## Building the documentation @@ -332,7 +198,7 @@ the API reference material: python -m quartodoc build ``` -This will create a bunch of files under `docs/reference/`. You can render the +This will create a set of Quarto files under `docs/reference/`. 
You can render the documentation itself with the following command, opening a local version of the site in your browser: diff --git a/docs/_static/02-client-screenshot.png b/docs/_static/02-client-screenshot.png index 7a075a6..207e035 100644 Binary files a/docs/_static/02-client-screenshot.png and b/docs/_static/02-client-screenshot.png differ diff --git a/docs/assets/pprl_cloud_diagram.png b/docs/assets/pprl_cloud_diagram.png new file mode 100644 index 0000000..64db255 Binary files /dev/null and b/docs/assets/pprl_cloud_diagram.png differ diff --git a/docs/tutorials/linkage_example_verknupfung.qmd b/docs/tutorials/linkage_example_verknupfung.qmd new file mode 100644 index 0000000..904f7a3 --- /dev/null +++ b/docs/tutorials/linkage_example_verknupfung.qmd @@ -0,0 +1,175 @@ +--- +title: "Exploring a simple linkage example" +format: html +jupyter: + kernelspec: + name: "pprl" + language: "python" + display_name: "pprl" +--- + +The Python package implements the Bloom filter linkage method ([Schnell et al., 2009](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-9-41)), and can also implement pretrained Hash embeddings ([Miranda et al., 2022](https://arxiv.org/abs/2212.09255)), if a suitable large, pre-matched corpus of data is available. + +Let us consider a small example where we want to link two excerpts of data on +bands. In this scenario, we are looking at some toy data on the members of a +fictional, German rock trio called "Verknüpfung". In this example we will see how to use untrained Bloom filters to match data. + +### Loading the data + +First, we load our data into `pandas.DataFrame` objects. Here, the first +records align, but the other two records should be swapped to have an aligned +matching. We will use the toolkit to identify these matches. + +```{python} +import pandas as pd + +df1 = pd.DataFrame( + { + "first_name": ["Laura", "Kaspar", "Grete"], + "last_name": ["Daten", "Gorman", "Knopf"], + "gender": ["f", "m", "f"], + "date_of_birth": ["01/03/1977", "31/12/1975", "12/7/1981"], + "instrument": ["bass", "guitar", "drums"], + } +) +df2 = pd.DataFrame( + { + "name": ["Laura Datten", "Greta Knopf", "Casper Goreman"], + "sex": ["female", "female", "male"], + "main_instrument": ["bass guitar", "percussion", "electric guitar"], + "birth_date": ["1977-03-23", "1981-07-12", "1975-12-31"], + } +) +``` + +> [!NOTE] +> These datasets don't have the same column names or follow the same encodings, +> and there are several spelling mistakes in the names of the band members, as well as a typo in the dates. +> +> Thankfully, the `pprl_toolkit` is flexible enough to handle this! + +### Creating and assigning a feature factory + +The next step is to decide how to process each of the columns in our datasets. + +To do this, we define a feature factory that maps column types to feature +generation functions, and a column specification for each dataset mapping our +columns to column types in the factory. 
+
+```{python}
+from pprl.embedder import features
+from functools import partial
+
+factory = dict(
+    name=features.gen_name_features,
+    sex=features.gen_sex_features,
+    misc=features.gen_misc_features,
+    dob=features.gen_dateofbirth_features,
+    instrument=partial(features.gen_misc_shingled_features, label="instrument")
+)
+spec1 = dict(
+    first_name="name",
+    last_name="name",
+    gender="sex",
+    instrument="instrument",
+    date_of_birth="dob",
+)
+spec2 = dict(name="name", sex="sex", main_instrument="instrument", birth_date="dob")
+```
+
+> [!TIP]
+> The feature generation functions, `features.gen_XXX_features`, have sensible default parameters, but sometimes they need to be passed into the feature factory with non-default parameters, as we do above to set a feature label.
+> There are two ways to achieve this: either use `functools.partial` to set the parameters (as above), or pass keyword arguments as a dictionary of dictionaries to the `Embedder` as `ff_args`.
+
+### Embedding the data
+
+With our specifications sorted out, we can get to creating our Bloom filter
+embedding. Before doing so, we need to decide on two parameters: the size of
+the filter and the number of hashes. By default, these are `1024` and `2`,
+respectively.
+
+Once we've decided, we can create our `Embedder` instance and use it to embed
+our data with their column specifications.
+
+```{python}
+from pprl.embedder.embedder import Embedder
+
+embedder = Embedder(factory, bf_size=1024, num_hashes=2)
+edf1 = embedder.embed(df1, colspec=spec1, update_thresholds=True)
+edf2 = embedder.embed(df2, colspec=spec2, update_thresholds=True)
+```
+
+If we take a look at one of these embedded datasets, we can see that it has a
+whole bunch of new columns. There is a `_features` column for each of the
+original columns containing their pre-embedding string features, and there's
+an `all_features` column that combines the features. Then there are three
+additional columns: `bf_indices`, `bf_norms` and `thresholds`.
+
+```{python}
+edf1.columns
+```
+
+The `bf_indices` column contains the Bloom filters, represented compactly as a list of non-zero indices for each record. The `bf_norms` column contains the norm of each Bloom filter with respect to the Soft Cosine Measure (SCM) matrix. In this case, since we are using an untrained model, the SCM matrix is the identity matrix, and the norm is just the Euclidean norm of the Bloom filter represented as a binary vector, which is equal to `np.sqrt(len(bf_indices[i]))` for record `i`. The norm is used to scale the similarity measures so that they take values between -1 and 1.
+
+The `thresholds` column is calculated to provide, for each record, a threshold similarity score below which it will not be matched. It's like a reserve price in an auction -- it stops a record being matched to another record when the similarity isn't high enough. In this respect, the method implemented here differs from other linkage methods, which typically have only one global threshold score for the entire dataset.
+
+### The processed features
+
+Let's take a look at how the features are processed into small text strings (shingles) before being hashed into the Bloom filter. The first record in each dataset refers to the same person, although the raw data are not identical, so we can compare the processed features for these two records to see how `pprl_toolkit` puts them into a comparable format.
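+As a quick sanity check before going through the columns one by one, we can count how many shingles the two records share overall. This is an illustrative sketch rather than part of the toolkit's API: it assumes that `all_features` holds a plain list of shingle strings, as the per-column printouts below suggest.
+
+```{python}
+# Count the shingles that the first records of each dataset have in common.
+# The Bloom filter similarity is driven by exactly this kind of overlap.
+shared = set(edf1.all_features[0]) & set(edf2.all_features[0])
+print(f"{len(shared)} shared shingles, e.g. {sorted(shared)[:5]}")
+```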
+ +First, we'll look at date of birth: + +```{python} +print(edf1.date_of_birth_features[0]) +print(edf2.birth_date_features[0]) +``` + +Python can parse the different formats easily. Although the dates are slightly different in the dataset, the year and month will still match, even though the day will not. + +Then we'll look at name: + +```{python} +print(edf1.first_name_features[0] + edf1.last_name_features[0]) +print(edf2.name_features[0]) +``` + +The two datasets store the names differently, but this doesn't matter for the Bloom filter method because it treats each record like a bag of features. By default, the name processor produces 2-grams, 3-grams and 4-grams. + +The sex processing function just converts different formats to lowercase and takes the first letter. This will often be enough: + +```{python} +print(edf1.gender_features[0]) +print(edf2.sex_features[0]) +``` + + +Finally, we'll see how our instrument feature function (`partial(features.gen_misc_shingled_features, label="instrument")`) processed the data: + +```{python} +print(edf1.instrument_features[0]) +print(edf2.main_instrument_features[0]) +``` + +Setting the `label` argument was important to ensure that the shingles match (and are hashed to the same slots) because the default behaviour of the function is to use the column name as a label: since the two columns have different names, the default wouldn't have allowed the features to match to each other. + +### Performing the linkage + +We can now perform the linkage by comparing these Bloom filter embeddings. We +use the Soft Cosine Measure (which in this untrained model, is equivalent to a normal cosine similarity metric) to calculate record-wise similarity and an adapted +Hungarian algorithm to match the records based on those similarities. + +```{python} +similarities = embedder.compare(edf1, edf2) +similarities +``` + +This `SimilarityArray` object is an augmented `numpy.ndarray` that can perform +our matching. The matching itself can optionally be called with an absolute threshold score, but it doesn't need one. + +```{python} +matching = similarities.match() +matching +``` + +So, all three of the records in each dataset were matched correctly. Excellent! diff --git a/docs/tutorials/working_in_the_cloud.md b/docs/tutorials/working_in_the_cloud.md new file mode 100644 index 0000000..0ea3cf3 --- /dev/null +++ b/docs/tutorials/working_in_the_cloud.md @@ -0,0 +1,153 @@ +--- +title: "Using the cloud demo" +format: html +jupyter: + kernelspec: + name: "pprl" + language: "python" + display_name: "pprl" +--- + + + + +![A diagram of the PPRL cloud architecture, with the secure enclave and key management services](https://github.com/datasciencecampus/pprl_toolkit/blob/main/assets/pprl_cloud_diagram.png?raw=true) + +The cloud demo uses a Google Cloud Platform (GCP) Confidential Space compute instance, which is a virtual machine (VM) using AMD [Secure Encrypted Virtualisation](https://www.amd.com/en/developer/sev.html) (AMD-SEV) technology to encrypt data in-memory. +The Confidential Space VM can also provide cryptographically signed documents, called attestations, which the server can use to prove that it is running in a secure environment before gaining access to data. + +The cloud demo assigns four roles: two data-owning +parties, a workload author, and a workload operator. These roles can be summarised as follows: + +- Each data-owning **party** is responsible for embedding and uploading their data + to the cloud. They also download their results. 
+- The workload **author** audits and assures the source code of the server, and then builds and uploads the server as a Docker image. +- The workload **operator** sets up and runs the Confidential + Space virtual machine, which uses the Docker image to perform the record linkage. + +We have set up `pprl_toolkit` to allow any configuration of these roles among +users. You could do it all yourself, split the workload roles between two +data owning-parties, or ask a trusted third party to maintain the +workload. + +[This Google tutorial](https://cloud.google.com/confidential-computing/confidential-space/docs/create-your-first-confidential-space-environment) +provides a simple example to familiarise yourselves with the concepts and commands. + +> [!WARNING] The cloud demo requires you to set up one or more Google Cloud accounts with billing. The cost of running the demo should be very small, or within your free quota. +> However, you should ensure that all resources are torn down after running the demo to avoid ongoing charges. + +### Creating your projects + +Once you have decided who will be filling which role(s), every member of your +linkage project will need to set up a GCP project. The names of these projects +will be used in file names and GCP storage buckets. As such, they need to be +descriptive and [globally unique](https://cloud.google.com/storage/docs/buckets#naming). + +> [!TIP] +> It may be worth appending a hash of some sort to every project name to help +> ensure their uniqueness. + +Each user will also need to have their Google Cloud administrator grant them +certain IAM roles on their project depending on which role(s) they are playing +in the linkage: + +- **Data-owning party**: + - Cloud KMS Admin (`roles/cloudkms.admin`) + - IAM Workload Identity Pool Admin (`roles/iam.workloadIdentityPoolAdmin`) + - Service Usage Admin (`roles/serviceusage.serviceUsageAdmin`) + - Service Account Admin (`roles/iam.serviceAccountAdmin`) + - Storage Admin (`roles/storage.admin`) +- **Workload author**: + - Artifact Registry Administrator (`roles/artifactregistry.admin`) +- **Workload operator**: + - Compute Admin (`roles/compute.admin`) + - Security Admin (`roles/securityAdmin`) + - Storage Admin (`roles/storage.admin`) + +### Toolkit configuration + +Now you've got your roles sorted out and projects set up, you (and all other +users) have to write down your project's configuration in an environment file +for `pprl_toolkit`. Make sure that everyone has installed `pprl_toolkit` first. + +We have provided an example in `.env.example`. All you need to do is copy that +file to `.env` and fill in your project's details. Everyone in your project +should have identical environment files. + +### Creating the other resources + +The last step in setting your linkage project up is to create and configure all +the other resources on GCP. We have packaged up these steps into a series of +`bash` scripts, located in the `scripts/` directory. They should be executed in +order from the `scripts/` directory: + +1. The data-owning parties set up a key encryption key, a bucket in which to + store their encrypted data, data encryption key and results, a service + account for accessing said bucket and key, and a workload identity pool to + allow impersonations under stringent conditions: + ```bash + sh ./01-setup-party-resources.sh + ``` +2. 
The workload operator sets up a bucket for the parties to put their
+   (non-sensitive) attestation credentials, and a service account for running
+   the workload:
+   ```bash
+   sh ./02-setup-workload-operator.sh
+   ```
+3. The workload author sets up an Artifact Registry on GCP, creates a Docker
+   image and uploads that image to their registry:
+   ```bash
+   sh ./03-setup-workload-author.sh
+   ```
+4. The data-owning parties authorise the workload operator's service account to
+   use the workload identity pool to impersonate their service account in a
+   Confidential Space:
+   ```bash
+   sh ./04-authorise-workload.sh
+   ```
+
+### Processing and uploading the datasets
+
+> [!IMPORTANT]
+> This section only applies to data-owning parties. The workload author is
+> finished now, and the workload operator should wait for this section to be
+> completed before moving on to the next section.
+
+Now that all the cloud infrastructure has been set up, we are ready to start
+the first step of the actual linkage. As in the simple linkage example
+tutorial, that first step is to make a Bloom filter embedding of each dataset.
+
+For users who prefer a graphical user interface, we have included a Flask app
+to handle the processing and uploading of data behind the scenes. This app will
+also be used to download the results once the linkage has completed.
+
+To launch the app, run the following in your terminal:
+
+```bash
+python -m flask --app src/pprl/app run
+```
+
+You should now be able to find the app in your browser of choice at
+[127.0.0.1:5000](http://127.0.0.1:5000).
+
+Once you have worked through the selection, processing, and GCP upload portions
+of the app, you will be at a holding page. This page can be refreshed by
+clicking the button, and when your results are ready you will be taken to
+another page where you can download them.
+
+### Running the linkage
+
+> [!IMPORTANT]
+> This section only applies to the workload operator.
+
+Once the data-owning parties have uploaded their processed data, you can begin
+the linkage. To do so, run the `05-run-workload.sh` bash script from
+`scripts/`:
+
+```bash
+cd /path/to/pprl_toolkit/scripts
+sh ./05-run-workload.sh
+```
+
+You can follow the progress of the workload from the Logs Explorer on GCP. Once
+it is complete, the data-owning parties will be able to download their results.
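+
+If you prefer the command line to the GCP console for these last steps, the progress check and the results download can also be done with the standard Google Cloud tools. The snippet below is a hypothetical sketch only: the exact log filter, bucket name and results object depend on your `.env` configuration and the resources created by the setup scripts, and the Flask app remains the documented route for retrieving results.
+
+```bash
+# Workload operator: read recent logs from the Confidential Space VM
+# (an alternative to browsing the Logs Explorer).
+gcloud logging read 'resource.type="gce_instance"' \
+    --project=<workload-operator-project> --freshness=1h --limit=20
+
+# Data-owning party: check whether results have landed in your party bucket,
+# then copy them down. Note that objects in this bucket may be encrypted.
+gsutil ls gs://<your-party-bucket>/
+gsutil cp gs://<your-party-bucket>/<results-object> .
+```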