diff --git a/README.md b/README.md index fcdb7fc..d4a0543 100644 --- a/README.md +++ b/README.md @@ -175,8 +175,14 @@ data owning-parties, or ask a trusted third party to maintain the workload. > [!WARNING] -> The cloud demo requires you to set up one or more Google Cloud accounts with billing. The cost of running the demo should be very small, or within your free quota. -> However, you should ensure that all resources are torn down after running the demo to avoid ongoing charges. +> The cloud demo requires you to set up one or more Google Cloud accounts with +> billing. The cost of running the demo should be very small, or within your +> free quota. However, you should ensure that all resources are torn down after +> running the demo to avoid ongoing charges. + +Please refer to our +[cloud tutorial](https://datasciencecampus.github.io/pprl_toolkit/docs/tutorials/in-the-cloud) +for further details on how to get working in the cloud. ## Building the documentation diff --git a/_quarto.yml b/_quarto.yml index 7b52771..14530c7 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -18,6 +18,14 @@ website: url: https://github.com/datasciencecampus/pprl - text: Open an issue url: https://github.com/datasciencecampus/pprl/issues + sidebar: + style: docked + search: true + contents: + - text: About + href: index.qmd + - auto: "*.qmd" + reader-mode: true page-footer: left: > All content is available under the @@ -29,7 +37,9 @@ website: format: html: mainfont: Arial - theme: solar + theme: + light: flatly + dark: darkly lang: en-GB metadata-files: diff --git a/docs/_static/app-home-screenshot.png b/docs/_static/app-home-screenshot.png new file mode 100644 index 0000000..008798f Binary files /dev/null and b/docs/_static/app-home-screenshot.png differ diff --git a/docs/tutorials/00-getting-started.qmd b/docs/tutorials/00-getting-started.qmd deleted file mode 100644 index 4c54f76..0000000 --- a/docs/tutorials/00-getting-started.qmd +++ /dev/null @@ -1,110 +0,0 @@ ---- -title: Getting 
started -description: > - Make yourself familiar with how PPRL works and how to get up and running. ---- - - -This tutorial provides an overview of how the PPRL methodology works. We split -the process into X steps, focusing on the bigger picture; we point you in the -direction of relevant tutorials when we gloss over some of the finer details. - - -## Assembling a linkage team - -There are three roles in any linkage project: two data-owning _parties_ and a -linkage _administrator_. These roles need not be fulfilled by three people. It -is perfectly possible to perform PPRL on your own, or perhaps you are working -under a trust model that allows one of the data-owning parties to also be the -administrator. - - -From here on, we will assume you are running PPRL with more than one person and -in the cloud. If you intend on using PPRL on your own or locally, please follow -this tutorial. Otherwise, you must decide who will be doing what from the -outset. Each role comes with different responsibilities, but all roles require -a Google Cloud Platform account and access to the `gcloud` command-line tool. - -### Data-owning party - -Often referred to as a _party_, a data owner is responsible for the -storage and preparation of some confidential data. They create a Bloom filter -embedding of their confidential data using an agreed configuration, and then -share that with the administrator to process. Once the administrator is -finished, they return the results of the linkage to the respective parties. - -### Linkage administrator - -The _administrator_ runs the linkage itself given some embedded data from the -parties. They are responsible for setting up and running a -[Confidential Space](https://cloud.google.com/docs/security/confidential-space) -in which to perform the linkage. This setting means that the administrator -never has access to the parties' data directly, and they can only be accessed -via the PPRL code itself. 
- - -## Names and projects - -Once you have decided who will be playing which role(s), you need to decide on -a naming structure and make some projects. You will need a name for each party -and the administrator. These names will be used in configuration files, Google -Cloud projects and buckets. As such, they need to be descriptive and unique. - -### Choosing a unique name - -Since Google Cloud bucket names must be -[globally unique](https://cloud.google.com/storage/docs/buckets#naming), we -recommend using a hash in your names to ensure that they are unique. - -For example, say the US Census Bureau and UK Office for National Statistics -(ONS) are looking to link some data on ex-patriated residents with PPRL. Then -they might use `uk-ons` and `us-cb` as their party names, which are succinct -and descriptive. However, they rule out future PPRL projects with the same -names. So, they could make a hash of their project aims and append it to -their names: - -```bash -$ echo -n "pprl uk-ons us-cb ex-pats" | sha256sum -698ea25dc5bc3086fa9600297cba9531615ca875bd3927f9229b7e2899f6dc48 - -``` - -This is very long. You might only want to use the first few characters of this -hash. Note that Google Cloud bucket names also can't be more than 63 characters -long without dots. You can trim it down like so: - -```bash -$ echo -n "pprl uk-ons us-cb ex-pats" | sha256num | cut -c 1-7 -698ea25 -``` - -So our names would be: `uk-ons-698ea25`, `us-cb-698ea25`, and `admin-698ea25`. - -### Party project and roles - -Now each party must create a project with the name as defined above. If one of -parties is also acting as the administrator, they should still follow the -guidance for administrators below. 
- -Importantly, each party will need to have their Google Cloud administrator -grant them the following IAM roles on their project: - -- Cloud KMS Administrator (`roles/cloudkms.admin`) -- IAM Workload Identity Pool Administrator - (`roles/iam.workloadIdentityPoolAdmin`) -- Service Usage Administrator (`roles/serviceusage.serviceUseageAdmin`) -- Service Account Administrator (`roles/iam.serviceAccountAdmin`) -- Storage Administrator (`roles/storage.admin`) - -### Administrator project and roles - -The administrator also needs to set up a Google Cloud project with the agreed -name. The administrator in PPRL takes on two Confidential Space -[roles](https://cloud.google.com/confidential-computing/confidential-space/docs/confidential-space-overview#roles): -they are the workload author and operator. As such, they will need the -following IAM roles added to their account on the linkage administrator -project: - -- Storage Administrator (`roles/storage.admin`) -- Compute Administrator (`roles/compute.admin`) -- Security Administrator (`roles/securityAdmin`) -- Artifact Registry Administrator (`roles/artifactregistry.admin`) diff --git a/docs/tutorials/01-setup-server.qmd b/docs/tutorials/01-setup-server.qmd deleted file mode 100644 index 2b57881..0000000 --- a/docs/tutorials/01-setup-server.qmd +++ /dev/null @@ -1,115 +0,0 @@ ---- -title: Setting up the server -tags: [server] ---- - -In this tutorial, you will learn how to set up the cloud infrastructure to -allow private record linkage on behalf of two parties. - -::: {.callout-important} -You only need to follow this tutorial if you are going to be the administrator -of the linkage. See our client set-up tutorial if you're looking to get your -data linked privately by another party. -::: - - -## Setting yourself up - -The technical details of the set-up process have been abstracted away into -`bash` scripts. However, for these to work, you need to be set up properly. 
- -### Google Cloud - -First, you will need to -[set up a Google Cloud project](https://developers.google.com/workspace/guides/create-project) -to act as the centre of the record linkage. The server will run as a virtual -machine (VM) from this project, and all billing will route through it. - -Next, you need to -[install the Google Cloud CLI](https://cloud.google.com/sdk/docs/install). -Once you have that installed, you need to log in and authenticate yourself. -Run the following in your terminal: - -```bash -gcloud auth login -``` - -::: {.callout-tip} -You may also need to configure `gcloud` to use your project. To do so, run -the following in your terminal: - -```bash -gcloud config set project -``` -::: - -### Docker - -You will also need [Docker installed](https://docs.docker.com/get-docker/) -on your machine. - -To [run the scripts](running-the-scripts), you will also need to have the app -running in the background, and be logged in. - -### Environment variables - -For the sake of security, we make use of a hidden environment file (`.env`) to -store the environment variables used in this project. - -To set up the server using our scripts, you must edit your `.env` file further -to the instructions in the README, adding three lines: - -```bash -PPRL_PROJECT_NAME= -PPRL_PROJECT_REGION= -PPRL_PROJECT_ZONE= -``` - -You can find the region and zone of your project in the Google Cloud Console. For example, our region is -`europe-west2`, and the zone is `europe-west2-c`. - -::: {.callout-important} -Please ensure that you do not include spaces or quotation marks in the lines -above as they will cause the scripts to throw an error. 
-::: - - -## Setting up the architecture - -There are two set-up scripts located in `scripts/`: - -- `01-setup-server.sh`: sets up the body of the architecture, comprising the - key management service and keys, service accounts (including the confidential - VM) and their permissions, storage buckets, a workload identity pool, and all - the credentials -- `02-setup-docker.sh`: sets up an Artifact Registry for your project, and - uploads a Docker image to be run on the confidential server - -To execute these scripts, run the following commands in your terminal: - -```bash -sh /path/to/pprl/scripts/01-setup-server.sh -sh /path/to/pprl/scripts/02-setup-docker.sh -``` - -There will be points where you will have to authenticate in the browser. -Once you've finished, the script should continue in your terminal. - - -## Sharing the credentials - -Once you've run these scripts, there should be three new JSON files on your -machine. These are: - -- `attestation_credentials.json`: attestation credentials for VM. These get - included in the Docker image that is uploaded by the second script. You don't - need to do anything with these. -- `party-[1,2]-service-account-credentials.json`: credentials (including a - private key) for each party. - -You must share the party credential files with the respective parties, and keep -them safe yourself. - -Once you've done that, you should be ready for your parties to work through our -client tutorial. After they're finished, you can work through the pipeline -tutorial to execute the matching. diff --git a/docs/tutorials/02-client.qmd b/docs/tutorials/02-client.qmd deleted file mode 100644 index ce5e37d..0000000 --- a/docs/tutorials/02-client.qmd +++ /dev/null @@ -1,123 +0,0 @@ ---- -title: Acting as a data-owning party -tags: [client] ---- - -This tutorial describes how to go about uploading your data to GCP for linkage -on a confidential server. 
To do so, you will install and run a Flask app that -provides a graphical user interface (GUI) for uploading data. - -::: {.callout-important} -You will only be able to upload your data if the cloud infrastructure has been -[set up](01-setup-server.qmd) by the administrator of your linkage project. -You do not need to go through that process if you are not the administrator. -::: - - -## Setting yourself up - -### Receiving your credentials - -In order to be able to upload your data to GCP, you will need a set of -credentials for a service account. This account and the credentials were -created by your linkage administrator when setting up the confidential server. -It is their responsibility to share the credentials with you, so reach out if -you have not yet received them. - -The credentials file comes in JSON format and should be named -`party--service-account-credentials.json`, where `` -indicates which party you are in the linkage (one or two). The contents of the -file should look something like this: - -```json -{ - "type": "service_account", - "project_id": "", - "private_key_id": "", - "private_key": "", - "client_email": "party-<1,2>-service-account@.iam.gserviceaccount.com", - "client_id": "", - "auth_uri": "https://accounts.google.com/o/oauth2/auth", - "token_uri": "https://oauth2.googleapis.com/token", - "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", - "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/party-<1,2>-service-account%40.iam.gserviceaccount.com", - "universe_domain": "googleapis.com" -} -``` - -Once you have received your credentials file, put it somewhere safe and -accessible on your machine. We recommend the `secrets/` directory at the root -directory of the `pprl` repository. - -### Environment variables - -Regardless of where you've stored the credentials file, you now need to let -`pprl` know where it is by storing its path as an environment variable. 
- -To avoid exposing your credentials elsewhere, or bloating your system -environment, we recommend you add the path to a hidden environment file called -`.env`. If you have not yet created one, you can do so by running this command: - -```bash -touch .env -``` - -::: {.callout-tip} -If you're on Windows, use this command instead: - -```ps -echo. > .env -``` -::: - -You should then add the following line to the `.env` file, replacing the path -and file name as needed: - -```shell -GOOGLE_APPLICATION_CREDENTIALS=/path/to/party-<1,2>-service-account-credentials.json -``` - - -## Running the app - -You should be all set up now to run the app and upload your data. - -::: {.callout-important} -You will also need two important things: - -1. Your dataset to be linked with at least sex, name, and DOB columns -2. A _salt_ string that is secret and agreed between you and the other linkage - party -::: - -From the root directory of this project, run the following command: - -```bash -python -m flask --app src/pprl/app/app run -``` - -This should provide a URL to a local host server like `https://127.0.0.1:5000`. -Open the URL in your browser and you will be in the GUI, pictured below. - -![A screenshot of the app](_static/02-client-screenshot.png) - -From here, the process to upload your data is as follows: - -1. Click `Choose file` to open your file browser -2. Navigate to and select your dataset -3. Click `Submit` -4. Assign types to each column in your dataset -5. Enter the agreed salt -6. Click `Upload file to GCP` - -:::{.callout-tip} -If you don't want to upload the data to GCP for linkage, you can choose to -download the processed data instead by clicking `Download file locally`. - -This downloads the bloom filters of your data without performing any matching -or linkage in the cloud. -::: - -Once both parties have uploaded their datasets, your linkage administrator can -execute the [linkage process](03-run-server.qmd). 
Upon completion, your results -will be available to download and your linkage will be complete. diff --git a/docs/tutorials/03-run-server.qmd b/docs/tutorials/03-run-server.qmd deleted file mode 100644 index 309d7a7..0000000 --- a/docs/tutorials/03-run-server.qmd +++ /dev/null @@ -1,45 +0,0 @@ ---- -title: Running the matching on the server -tags: [server] ---- - -This tutorial shows you how to run the matching pipeline in a confidential -server on GCP. Before you attempt this tutorial, please ensure you have worked -through our [architecture set-up](01-setup-server.qmd), and both of your -data-owning parties have successfully [uploaded their data](02-client.qmd) to -GCP. - -::: {.callout-important} -You only need to follow this tutorial if you are going to be the administrator -of the linkage. See our client set-up tutorial if you're looking to get your -data linked privately by another party. -::: - -## Running the pipeline - -Once your parties have uploaded their data, you can run the pipeline -server by executing the following commands in your terminal: - -```bash -export $(grep 'PPRL_PROJECT' .env | xargs -0) -gcloud compute instances create pprl-tee \ - --confidential-compute \ - --shielded-secure-boot \ - --maintenance-policy=TERMINATE \ - --scopes=cloud-platform \ - --zone=$PPRL_PROJECT_ZONE \ - --image-project=confidential-space-images \ - --image-family=confidential-space \ - --service-account=run-confidential-vm@$PPRL_PROJECT_NAME.iam.gserviceaccount.com \ - --metadata="^~^tee-image-reference=$PPRL_PROJECT_REGION-docker.pkg.dev/$PPRL_PROJECT_NAME/pprl-repo/pprl-server-image:latest" -``` - -## Closing down - -Once the data linkage has taken place, the instance should stop running -automatically. 
run the following command to delete the -virtual machine: - -```bash -echo "Y" | gcloud compute instances delete projects/$PPRL_PROJECT_NAME/zones/$PPRL_PROJECT_ZONE/instances/pprl-tee -``` diff --git a/docs/tutorials/linkage_example_febrl.qmd b/docs/tutorials/example-febrl.qmd similarity index 98% rename from docs/tutorials/linkage_example_febrl.qmd rename to docs/tutorials/example-febrl.qmd index 0b4fcc1..90345e2 100644 --- a/docs/tutorials/linkage_example_febrl.qmd +++ b/docs/tutorials/example-febrl.qmd @@ -1,5 +1,6 @@ --- -title: "Linking the FEBRL datasets using the PPRL package" +title: Linking the FEBRL datasets +description: Using PPRL locally to link two well-known datasets format: html jupyter: kernelspec: diff --git a/docs/tutorials/linkage_example_verknupfung.qmd b/docs/tutorials/example-verknupfung.qmd similarity index 100% rename from docs/tutorials/linkage_example_verknupfung.qmd rename to docs/tutorials/example-verknupfung.qmd diff --git a/docs/tutorials/in-the-cloud.qmd b/docs/tutorials/in-the-cloud.qmd new file mode 100644 index 0000000..e5dd538 --- /dev/null +++ b/docs/tutorials/in-the-cloud.qmd @@ -0,0 +1,317 @@ +--- +title: Working in the cloud +description: > + Get you and your collaborators performing linkage in the cloud +--- + +This tutorial provides an overview of how to use `pprl_toolkit` on +Google Cloud Platform (GCP). We go over how to assemble and assign roles in a +linkage team, how to set up everybody's projects, and end with executing the +linkage itself. + +![A diagram of the PPRL cloud architecture, with the secure enclave and key management services](https://github.com/datasciencecampus/pprl_toolkit/blob/main/assets/pprl_cloud_diagram.png?raw=true) + +Above is a diagram showing the PPRL cloud architecture. 
The cloud demo uses a +Google Cloud Platform (GCP) Confidential Space compute instance, which is a +virtual machine (VM) using AMD +[Secure Encrypted Virtualisation](https://www.amd.com/en/developer/sev.html) +(AMD-SEV) technology to encrypt data in-memory. The Confidential Space VM can +also provide cryptographically signed documents, called attestations, which the +server can use to prove that it is running in a secure environment before +gaining access to data. + + +## Assembling a linkage team + +There are four roles to fill in any PPRL project: two data-owning **parties**, +a workload **author**, and a workload **operator**. A workload is how we refer +to the resources for the linkage operation itself (i.e. the containerised +linkage code and the environment in which to run it.) + +These roles need not be fulfilled by four separate people. It is perfectly +possible to perform PPRL on your own, or perhaps you are working under a trust +model that allows one of the data-owning parties to author the workload while +the other is the operator. + +::: {.callout-tip} +In fact, `pprl_toolkit` is set up to allow any configuration of these roles +among up to four people. +::: + +In any case, you must decide who will be doing what from the outset. Each role +comes with different responsibilities, but all roles require a GCP account and +access to the `gcloud` command-line tool. Additionally, everyone in the linkage +project will need to install `pprl_toolkit`. + +### Data-owning party + +Often referred to as just a **party**, a data owner is responsible for the +storage and preparation of some confidential data. During set-up, each party +also sets up a storage bucket, a key management service, and a workload +identity pool that allows the party to share permissions with the server during +the linkage operation. + +They create a Bloom filter embedding of their confidential data using an agreed +configuration, and then upload that to GCP for processing. 
Once the workload +operator is finished, the parties are able to retrieve their linkage results. + +### Workload author + +The workload **author** is responsible for building a Docker image containing +the cloud-based linkage code and uploading it to a GCP Artifact Registry. This +image is the workload to be run by the operator. + +### Workload operator + +The workload **operator** runs the linkage itself using some embedded data from +the parties and an image from the author. They are responsible for setting up +and running a +[Confidential Space](https://cloud.google.com/docs/security/confidential-space) +in which to perform the linkage. This setting ensures that nobody ever has +access to all the data at once, and that the data can only be accessed via the +linkage code itself. + + +## Creating your GCP projects + +Once you have decided who will be playing which role(s), you need to decide on +a naming structure and make some GCP projects. You will need a project for each +member of the linkage project - not one for each role. The names of these +projects will be used throughout the cloud implementation, from configuration +files to buckets. As such, they need to be descriptive and unique. + +::: {.callout-warning} +Since Google Cloud bucket names must be +[globally unique](https://cloud.google.com/storage/docs/buckets#naming), we +highly recommend using a hash in your project names to ensure that they are +unique. This will ensure that bucket names are also globally unique. + +Our aim is to create a globally unique name (and thus ID) for each project. +::: + +For example, say the US Census Bureau and UK Office for National Statistics +(ONS) are looking to link some data on ex-patriated residents with PPRL. Then +they might use `us-cb` and `uk-ons` as their party names, which are succinct +and descriptive. However, they are generic and rule out future PPRL projects +with the same names. 
+ +As a remedy, they could make a hash of their project description to create an +identifier: + +```bash +$ echo -n "pprl us-cb uk-ons ex-pats-analysis" | sha256sum +d59a50241dc78c3f926b565937b99614b7bb7c84e44fb780440718cb2b0ddc1b - +``` + +This is very long. You might only want to use the first few characters of this +hash. Note that Google Cloud bucket names also can't be more than 63 characters +long without dots. + +You can trim it down like so: + +```bash +$ echo -n "pprl us-cb uk-ons ex-pats-analysis" | sha256sum | cut -c 1-7 +d59a502 +``` + +So, our names would be: `uk-ons-d59a502`, `us-cb-d59a502`. If they had a +third-party linkage administrator (authoring and operating the workload), they +would have a project called something like `admin-d59a502`. + + +## Setting up your projects + +Once you have decided on a naming structure, it is time to create the GCP +projects. Each project will need specific Identity and Access Management (IAM) +roles granted to them by the project owner's GCP Administrator. Which IAM roles +depends on the linkage role they are playing. If someone is fulfilling more +than one role, they should follow all the relevant sections below. 
+ +::: {.callout-tip} +If you have Administrator permissions for your GCP project, you can grant these +roles using the `gcloud` command-line tool: + +```bash +gcloud projects add-iam-policy-binding \ + --member=user: \ + --role= +``` +::: + +### Data-owning parties + +Each data-owning party requires the following IAM roles: + +| Title | Code | Purpose | +|----------------------------------|----------------------------------------|-----------------------------------| +| Cloud KMS Admin | `roles/cloudkms.admin` | Managing encryption keys | +| IAM Workload Identity Pool Admin | `roles/iam.workloadIdentityPoolAdmin` | Managing an impersonation service | +| Service Usage Admin | `roles/serviceusage.serviceUsageAdmin` | Managing access to other APIs | +| Service Account Admin | `roles/iam.serviceAccountAdmin` | Managing a service account | +| Storage Admin | `roles/storage.admin` | Managing a bucket for their data | + +### Workload author + +The workload author only requires one IAM role: + +| Title | Code | Purpose | +|---------------------------------|--------------------------------|----------------------------------------| +| Artifact Registry Administrator | `roles/artifactregistry.admin` | Managing the registry for the workload | + +### Workload operator + +The workload operator requires three IAM roles: + +| Title | Code | Purpose | +|----------------|---------------------------|-------------------------------------| +| Compute Admin | `roles/compute.admin` | Managing the virtual machine | +| Security Admin | `roles/iam.securityAdmin` | Ability to set and get IAM policies | +| Storage Admin | `roles/storage.admin` | Managing a shared bucket | + + +## Configuring `pprl_toolkit` + +Now your linkage team has its projects made up, you need to configure +`pprl_toolkit`. This configuration tells the package where to look and what to +call things; we do this with a single environment file containing a short +collection of key-value pairs. 
+ +We have provided an example environment file in `.env.example`. Copy or rename +that file to `.env` in the root of the `pprl_toolkit` installation. Then, fill +in your project details as necessary. + +For our example above, let's say the ONS will be the workload author and the US +Census Bureau will be the workload operator. The environment file would look +something like this: + +```bash +PARTY_1_PROJECT=us-cb-d59a502 +PARTY_1_KEY_VERSION=1 + +PARTY_2_PROJECT=uk-ons-d59a502 +PARTY_2_KEY_VERSION=1 + +WORKLOAD_AUTHOR_PROJECT=uk-ons-d59a502 +WORKLOAD_AUTHOR_PROJECT_REGION=europe-west2 + +WORKLOAD_OPERATOR_PROJECT=us-cb-d59a502 +WORKLOAD_OPERATOR_PROJECT_ZONE=us-east4-a +``` + +::: {.callout-important} +Your environment file should be identical among all the members of your linkage +project. +::: + + +## Creating the other resources + +The last step in setting up your linkage project is to create and configure the +other resources on GCP. To make things straightforward for users, we have +packaged up the steps to do this into a number of `bash` scripts. These scripts +are located in the `scripts/` directory and are numbered. You and your team +must execute them from the `scripts/` directory in their named order according +to which role(s) each member is fulfilling in the linkage project. + +::: {.callout-tip} +Make sure you have set up `gcloud` on the command line. Once you've installed +it, log in and set the application default: + +```bash +gcloud auth login +gcloud auth application-default login +``` +::: + +1. The data-owning parties set up: a key encryption key; a bucket in which to + store their encrypted data, data encryption key and results; a service + account for accessing said bucket and key; and a workload identity pool to + allow impersonations under stringent conditions. + ```bash + sh ./01-setup-party-resources.sh + ``` +2. 
The workload operator sets up a bucket for the parties to put their + (non-sensitive) attestation credentials, and a service account for running + the workload. + ```bash + sh ./02-setup-workload-operator.sh + ``` +3. The workload author sets up an Artifact Registry on GCP, creates a Docker + image and uploads that image to their registry. + ```bash + sh ./03-setup-workload-author.sh + ``` +4. The data-owning parties authorise the workload operator's service account to + use the workload identity pool to impersonate their service account in a + Confidential Space. + ```bash + sh ./04-authorise-workload.sh + ``` + + +## Processing and uploading the datasets + +::: {.callout-important} +This section only applies to data-owning parties. The workload author is +finished now, and the workload operator should wait for this section to be +completed before moving on to the next section. +::: + + +Now that all the cloud infrastructure has been set up, we are ready to start +the first step in doing the actual linkage. That is, to create a Bloom filter +embedding of your data, encrypt it, and upload it to GCP. + +For users who prefer a graphical user interface, we have included a Flask app +to handle the processing and uploading of data behind the scenes. This app will +also be used to download the results once the linkage has completed. + +To launch the app, run the following in your terminal: + +```bash +python -m flask --app src/pprl/app run +``` + +You should now be able to find the app in your browser of choice at +[127.0.0.1:5000](http://127.0.0.1:5000). It should look something like this: + +![A screenshot of the app](../_static/app-home-screenshot.png) + +From here, the process to upload your data is as follows: + +1. Choose which party you are uploading for. Click `Submit`. +2. Select `Upload local file` and click `Choose file` to open your file + browser. Navigate to and select your dataset. Click `Submit`. +3. Assign types to each column in your dataset. 
Enter the agreed salt. +4. Click `Upload file to GCP`. + +::: {.callout-note} +If you choose to use the Flask app to process your data, you will use a set of +defaults for processing the confidential data before it gets embedded. If you +want more control, then you'll have to agree an embedding configuration with +the other data-owning party and do the processing directly. +::: + +Once you have worked through the selection, processing, and GCP upload portions +of the app, you will be at a holding page. This page can be updated by clicking +the button, and when your results are ready you will be taken to another page +where you can download them. + + +## Running the linkage + +::: {.callout-important} +This section only applies to the workload operator. +::: + +Once the data-owning parties have uploaded their processed data, you are able +to begin the linkage. To do so, run the `05-run-workload.sh` bash script from +`scripts/`: + +```bash +cd /path/to/pprl_toolkit/scripts +sh ./05-run-workload.sh +``` + +You can follow the progress of the workload from the Logs Explorer on GCP. Once +it is complete, the data-owning parties will be able to download their results. diff --git a/docs/tutorials/index.qmd b/docs/tutorials/index.qmd index e2f83c7..0fa58c3 100644 --- a/docs/tutorials/index.qmd +++ b/docs/tutorials/index.qmd @@ -4,7 +4,9 @@ listing: type: table contents: - "*.qmd" - fields: [title, description, tags] + fields: [title, description, reading-time] + sort-ui: false + filter-ui: false --- These tutorials walk you through some of the essential workflows for `pprl`. 
diff --git a/docs/tutorials/working_in_the_cloud.md b/docs/tutorials/working_in_the_cloud.md deleted file mode 100644 index 0ea3cf3..0000000 --- a/docs/tutorials/working_in_the_cloud.md +++ /dev/null @@ -1,153 +0,0 @@ ---- -title: "Using the cloud demo" -format: html -jupyter: - kernelspec: - name: "pprl" - language: "python" - display_name: "pprl" ---- - - - - -![A diagram of the PPRL cloud architecture, with the secure enclave and key management services](https://github.com/datasciencecampus/pprl_toolkit/blob/main/assets/pprl_cloud_diagram.png?raw=true) - -The cloud demo uses a Google Cloud Platform (GCP) Confidential Space compute instance, which is a virtual machine (VM) using AMD [Secure Encrypted Virtualisation](https://www.amd.com/en/developer/sev.html) (AMD-SEV) technology to encrypt data in-memory. -The Confidential Space VM can also provide cryptographically signed documents, called attestations, which the server can use to prove that it is running in a secure environment before gaining access to data. - -The cloud demo assigns four roles: two data-owning -parties, a workload author, and a workload operator. These roles can be summarised as follows: - -- Each data-owning **party** is responsible for embedding and uploading their data - to the cloud. They also download their results. -- The workload **author** audits and assures the source code of the server, and then builds and uploads the server as a Docker image. -- The workload **operator** sets up and runs the Confidential - Space virtual machine, which uses the Docker image to perform the record linkage. - -We have set up `pprl_toolkit` to allow any configuration of these roles among -users. You could do it all yourself, split the workload roles between two -data owning-parties, or ask a trusted third party to maintain the -workload. 
- -[This Google tutorial](https://cloud.google.com/confidential-computing/confidential-space/docs/create-your-first-confidential-space-environment) -provides a simple example to familiarise yourselves with the concepts and commands. - -> [!WARNING] The cloud demo requires you to set up one or more Google Cloud accounts with billing. The cost of running the demo should be very small, or within your free quota. -> However, you should ensure that all resources are torn down after running the demo to avoid ongoing charges. - -### Creating your projects - -Once you have decided who will be filling which role(s), every member of your -linkage project will need to set up a GCP project. The names of these projects -will be used in file names and GCP storage buckets. As such, they need to be -descriptive and [globally unique](https://cloud.google.com/storage/docs/buckets#naming). - -> [!TIP] -> It may be worth appending a hash of some sort to every project name to help -> ensure their uniqueness. - -Each user will also need to have their Google Cloud administrator grant them -certain IAM roles on their project depending on which role(s) they are playing -in the linkage: - -- **Data-owning party**: - - Cloud KMS Admin (`roles/cloudkms.admin`) - - IAM Workload Identity Pool Admin (`roles/iam.workloadIdentityPoolAdmin`) - - Service Usage Admin (`roles/serviceusage.serviceUsageAdmin`) - - Service Account Admin (`roles/iam.serviceAccountAdmin`) - - Storage Admin (`roles/storage.admin`) -- **Workload author**: - - Artifact Registry Administrator (`roles/artifactregistry.admin`) -- **Workload operator**: - - Compute Admin (`roles/compute.admin`) - - Security Admin (`roles/securityAdmin`) - - Storage Admin (`roles/storage.admin`) - -### Toolkit configuration - -Now you've got your roles sorted out and projects set up, you (and all other -users) have to write down your project's configuration in an environment file -for `pprl_toolkit`. 
Make sure that everyone has installed `pprl_toolkit` first. - -We have provided an example in `.env.example`. All you need to do is copy that -file to `.env` and fill in your project's details. Everyone in your project -should have identical environment files. - -### Creating the other resources - -The last step in setting your linkage project up is to create and configure all -the other resources on GCP. We have packaged up these steps into a series of -`bash` scripts, located in the `scripts/` directory. They should be executed in -order from the `scripts/` directory: - -1. The data-owning parties set up a key encryption key, a bucket in which to - store their encrypted data, data encryption key and results, a service - account for accessing said bucket and key, and a workload identity pool to - allow impersonations under stringent conditions: - ```bash - sh ./01-setup-party-resources.sh - ``` -2. The workload operator sets up a bucket for the parties to put their - (non-sensitive) attestation credentials, and a service account for running - the workload: - ```bash - sh ./02-setup-workload-operator.sh - ``` -3. The workload author sets up an Artifact Registry on GCP, creates a Docker - image and uploads that image to their registry: - ```bash - sh ./03-setup-workload-author.sh - ``` -4. The data-owning parties authorise the workload operator's service account to - use the workload identity pool to impersonate their service account in a - Confidential Space: - ```bash - sh ./04-authorise-workload.sh - ``` - -### Processing and uploading the datasets - -> [!IMPORTANT] -> This section only applies to data-owning parties. The workload author is -> finished now, and the workload operator should wait for this section to be -> completed before moving on to the next section. - -Now that all the cloud infrastructure has been set up, we are ready to start -the first step in doing the actual linkage. 
Much like the toy example above, -that is to make a Bloom filter embedding of each dataset. - -For users who prefer a graphical user interface, we have included a Flask app -to handle the processing and uploading of data behind the scenes. This app will -also be used to download the results once the linkage has completed. - -To launch the app, run the following in your terminal: - -```bash -python -m flask --app src/pprl/app run -``` - -You should now be able to find the app in your browser of choice at -[127.0.0.1:5000](http://127.0.0.1:5000). - -Once you have worked through the selection, processing, and GCP upload portions -of the app, you will be at a holding page. This page can be updated by clicking -the button, and when your results are ready you will be taken to another page -where you can download them. - -### Running the linkage - -> [!IMPORTANT] -> This section only applies to the workload operator. - -Once the data-owning parties have uploaded their processed data, you are able -to begin the linkage. To do so, run the `05-run-workload.sh` bash script from -`scripts/`: - -```bash -cd /path/to/pprl_toolkit/scripts -sh ./05-run-workload.sh -``` - -You can follow the progress of the workload from the Logs Explorer on GCP. Once -it is complete, the data-owning parties will be able to download their results. diff --git a/index.qmd b/index.qmd index 94c31b6..6be93bf 100644 --- a/index.qmd +++ b/index.qmd @@ -1,6 +1,7 @@ --- title: Welcome to the `pprl` documentation! toc: false +sidebar: false about: template: marquee links: diff --git a/src/pprl/app/templates/choose-data.html b/src/pprl/app/templates/choose-data.html index 969dd1a..a35d032 100644 --- a/src/pprl/app/templates/choose-data.html +++ b/src/pprl/app/templates/choose-data.html @@ -19,7 +19,7 @@

Choose a dataset

- +


diff --git a/src/pprl/app/templates/home.html b/src/pprl/app/templates/home.html index 1e1111e..025deff 100644 --- a/src/pprl/app/templates/home.html +++ b/src/pprl/app/templates/home.html @@ -4,9 +4,12 @@

Welcome to the PPRL application

- This application is for data owners to process and upload their data to the - Google Cloud Platform Confidential Space set up by your linkage - administrator. + This application is for data-owning parties to process and upload their data + to a Google Cloud Platform (GCP) bucket. Once both parties have uploaded + their data, the operator can run the workload to link your datasets in a + secure environment. + + Keep this app open and you will be able to download your results at the end.