31 blurb readme (#33)

* Testing adding image to readme
* Added blurb and assets
* Fixed logos inclusion by removing quotes around url
* add hyperlink to issues
* Added Sir Ian quote
* Removed tutorial content to new file
* Removed tutorial content to new file
* Lengthened german band example
* Changed the name of 'working_in_the_cloud.md'
* Add para break
* Fixed warning block
* Fix the example values for doctest
* Fixed doctest failure
* Update docs/tutorials/linkage_example_verknupfung.qmd
  Co-authored-by: Henry Wilde <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Henry Wilde <[email protected]>

---------

Co-authored-by: Henry Wilde <[email protected]>
datasciencecampus · Apr 2, 2024 · f3c553c
1 parent 4367d88
commit f3c553c

Showing 5 changed files with 379 additions and 185 deletions.
diff --git a/README.md b/README.md
@@ -1,5 +1,20 @@
+![ONS and DSC logos](https://github.com/datasciencecampus/awesome-campus/blob/master/ons_dsc_logo.png)
+
 # `pprl_toolkit`: a toolkit for privacy-preserving record linkage
 
+> "We find ourselves living in a society which is rich with data and the opportunities that comes with this. Yet, when disconnected, this data is limited in its usefulness. ... Being able to link data will be vital for enhancing our understanding of society, driving policy change for greater public good." Sir Ian Diamond, the National Statistician
+
+The Privacy Preserving Record Linkage (PPRL) toolkit demonstrates the feasibility of record linkage in difficult 'eyes off' settings. It has been designed for a situation where two organisations (perhaps in different jurisdictions) want to link their datasets at record level, to enrich the information they contain, but neither party is able to send sensitive personal identifiers -- such as names, addresses or dates of birth -- to the other. Building on [previous ONS research](https://www.gov.uk/government/publications/joined-up-data-in-government-the-future-of-data-linking-methods/privacy-preserving-record-linkage-in-the-context-of-a-national-statistics-institute), the toolkit implements a well-known privacy-preserving linkage method in a new way to improve performance, and wraps it in a secure cloud architecture to demonstrate the potential of a layered approach.
+
+The toolkit has been developed by data scientists at the [Data Science Campus](https://datasciencecampus.ons.gov.uk/) of the UK Office for National Statistics. This project has benefitted from early collaborations with colleagues at NHS England.
+
+The two parts of the toolkit are:
+
+* a Python package for privacy-preserving record linkage with Bloom filters and hash embeddings, which can be used locally with no cloud set-up
+* instructions, scripts and resources to run record linkage in a cloud-based secure enclave. This part of the toolkit requires you to set up Google Cloud accounts with billing
+
+We're publishing the repo as a prototype and teaching tool. Please feel free to download, adapt and experiment with it in compliance with the open-source license. You can submit issues [here](https://github.com/datasciencecampus/pprl_toolkit/issues). However, as this is an experimental repo, the development team cannot commit to maintaining the repo or responding to issues. If you'd like to collaborate with us, to put these ideas into practice for the public good, please [get in touch](https://datasciencecampus.ons.gov.uk/contact/).
+
 ## Installation
 
 To install the package from source, you must clone the repository before
@@ -32,9 +47,11 @@ pre-commit install
 
 ## Getting started
 
+The Python package implements the Bloom filter linkage method ([Schnell et al., 2009](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-9-41)), and can also implement pretrained hash embeddings ([Miranda et al., 2022](https://arxiv.org/abs/2212.09255)), if a suitably large, pre-matched corpus of data is available.
+
 Let us consider a small example where we want to link two excerpts of data on
 bands. In this scenario, we are looking at some toy data on the members of a
-fictional, German rock trio called "Verknüpfung".
+fictional, German rock trio called "Verknüpfung". In this example we will see how to use untrained Bloom filters to match data.
 
 ### Loading the data
 
@@ -51,15 +68,13 @@ matching. We will use the toolkit to identify these matches.
 ...         "last_name": ["Daten", "Gorman", "Knopf"],
 ...         "gender": ["f", "m", "f"],
 ...         "instrument": ["bass", "guitar", "drums"],
-...         "vocals_ever": [True, True, True],
 ...     }
 ... )
 >>> df2 = pd.DataFrame(
 ...     {
 ...         "name": ["Laura Datten", "Greta Knopf", "Casper Goreman"],
 ...         "sex": ["female", "female", "male"],
 ...         "main_instrument": ["bass guitar", "percussion", "electric guitar"],
-...         "vocals": ["yes", "sometimes", "sometimes"],
 ...     }
 ... )
 
@@ -74,89 +89,60 @@ matching. We will use the toolkit to identify these matches.
 ### Creating and assigning a feature factory
 
 The next step is to decide how to process each of the columns in our datasets.
-
-To do this, we define a feature factory that maps column types to feature
-generation functions, and a column specification for each dataset mapping our
-columns to column types in the factory.
+The `pprl.embedder.features` module provides functions that process different data types so that they can be embedded into the Bloom filter. We pass these functions into the embedder in a dictionary called a feature factory. We also provide a column specification for each dataset mapping our columns to column types in the factory.
 
 ```python
 >>> from pprl.embedder import features
+>>> from functools import partial
 >>>
 >>> factory = dict(
 ...     name=features.gen_name_features,
 ...     sex=features.gen_sex_features,
-...     misc=features.gen_misc_features,
+...     instrument=partial(features.gen_misc_shingled_features, label="instrument"),
 ... )
 >>> spec1 = dict(
 ...     first_name="name",
 ...     last_name="name",
 ...     gender="sex",
-...     instrument="misc",
-...     vocals_ever="misc",
+...     instrument="instrument",
 ... )
->>> spec2 = dict(name="name", sex="sex", main_instrument="misc", vocals="misc")
+>>> spec2 = dict(name="name", sex="sex", main_instrument="instrument")
 
 ```
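To make the roles of the feature factory and the column specification concrete, here is a minimal, self-contained sketch that does not use `pprl` at all. The function `gen_shingles` and its output format are hypothetical stand-ins, not the package's `gen_misc_shingled_features`; the sketch only illustrates how `functools.partial` fixes the `label` argument so that every entry in the factory can be called with a column of values alone.

```python
from functools import partial


def gen_shingles(values, label, n=2):
    """Hypothetical feature generator: label-prefixed character n-grams."""
    features = []
    for value in values:
        text = f"_{value.lower()}_"  # pad so that word boundaries form shingles
        shingles = [text[i : i + n] for i in range(len(text) - n + 1)]
        features.append([f"{label}<{s}>" for s in shingles])
    return features


# A feature factory maps a column type to the function that processes it.
# partial() binds the label, leaving a function of the column values only.
factory = {"instrument": partial(gen_shingles, label="instrument")}

# A column specification maps each dataset column to a column type.
spec = {"main_instrument": "instrument"}

data = {"main_instrument": ["bass guitar", "percussion"]}

# Apply the function registered for each column's type to that column.
features = {
    column: factory[column_type](data[column])
    for column, column_type in spec.items()
}
print(features["main_instrument"][1][:3])
```

Because two datasets can map differently named columns (`instrument`, `main_instrument`) to the same column type, both are processed by the same function and produce comparable features.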
 
 ### Embedding the data
 
 With our specifications sorted out, we can get to creating our Bloom filter
-embedding. Before doing so, we need to decide on two parameters: the size of
-the filter and the number of hashes. By default, these are `2**10` and `2`,
-respectively.
-
-Once we've decided, we can create our `Embedder` instance and use it to embed
-our data with their column specifications.
+embedding. We can create our `Embedder` instance and use it to embed
+our data with their column specifications. The `Embedder` object has two more parameters: the size of the filter (`bf_size`) and the number of hashes (`num_hashes`). We use the default values here, `1024` and `2`, and pass them explicitly for clarity.
 
 ```python
 >>> from pprl.embedder.embedder import Embedder
 >>>
->>> embedder = Embedder(factory, bf_size=2**10, num_hashes=2)
+>>> embedder = Embedder(factory, bf_size=1024, num_hashes=2)
 >>> edf1 = embedder.embed(df1, colspec=spec1, update_thresholds=True)
 >>> edf2 = embedder.embed(df2, colspec=spec2, update_thresholds=True)
 
 ```
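For readers new to Bloom filters, the following sketch shows what the two parameters control. It is an illustration under stated assumptions, not the package's implementation: the function `bloom_indices` is hypothetical, and the use of seeded SHA-256 digests is an assumed hashing scheme chosen only because it is available in the standard library. Each string feature is hashed `num_hashes` times, and each hash selects one of `bf_size` positions in the filter.

```python
import hashlib


def bloom_indices(features, bf_size=1024, num_hashes=2):
    """Map string features to a sorted list of set-bit positions."""
    indices = set()
    for feature in features:
        for seed in range(num_hashes):
            # Assumed scheme: prefix the feature with a seed to get
            # num_hashes different hash functions from one digest.
            digest = hashlib.sha256(f"{seed}:{feature}".encode("utf-8")).digest()
            indices.add(int.from_bytes(digest[:8], "big") % bf_size)
    return sorted(indices)


# Bigrams of two similar names share features, so their filters overlap.
laura = bloom_indices(["la", "au", "ur", "ra"])
lara = bloom_indices(["la", "ar", "ra"])
shared = set(laura) & set(lara)
print(len(laura), len(lara), len(shared))
```

A larger `bf_size` reduces accidental collisions between unrelated features, while more hashes set more positions per feature.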
 
-If we take a look at one of these embedded datasets, we can see that it has a
-whole bunch of new columns. There is a `_features` column for each of the
-original columns containing their pre-embedding string features. Then there are
-three additional columns: `bf_indices`, `bf_norms` and `thresholds`.
-
-```python
->>> edf1.columns
-Index(['first_name', 'last_name', 'gender', 'instrument', 'vocals_ever',
-       'first_name_features', 'last_name_features', 'gender_features',
-       'instrument_features', 'vocals_ever_features', 'all_features',
-       'bf_indices', 'bf_norms', 'thresholds'],
-      dtype='object')
-
-```
-
-<!-- TODO: What do these columns actually describe? -->
-
 ### Performing the linkage
 
-We can now perform the linkage by comparing these Bloom filter embeddings. We
-use the Soft Cosine Measure to calculate record-wise similarity and an adapted
-Hungarian algorithm to match the records based on those similarities.
+We can now perform the linkage by comparing these Bloom filter embeddings. The package
+uses the Soft Cosine Measure to calculate record-wise similarity scores.
 
 ```python
 >>> similarities = embedder.compare(edf1, edf2)
 >>> similarities
-SimilarityArray([[0.86017213, 0.14285716, 0.12803688],
-                 [0.13216962, 0.13483999, 0.50067019],
-                 [0.12126782, 0.76292716, 0.09240265]])
+SimilarityArray([[0.80074101, 0.18160957, 0.09722178],
+                 [0.40124732, 0.1877348 , 0.58792979],
+                 [0.13147656, 0.51426533, 0.11772856]])
 
 ```
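As a rough illustration of the Soft Cosine Measure itself, the sketch below computes the standard formula `a·S·b / sqrt((a·S·a)(b·S·b))` in plain Python. The feature-similarity matrix `S` used here is made up for the example; how the package builds its own matrix is not described in this README, so treat the values as assumptions.

```python
import math


def soft_cosine(a, b, s):
    """Soft cosine between vectors a and b under feature-similarity matrix s."""

    def form(x, y):
        # Bilinear form x^T S y.
        return sum(
            x[i] * s[i][j] * y[j] for i in range(len(x)) for j in range(len(y))
        )

    return form(a, b) / math.sqrt(form(a, a) * form(b, b))


a = [1, 1, 0]
b = [0, 1, 1]

# With the identity matrix the measure reduces to ordinary cosine similarity.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Hypothetical matrix in which features 0 and 2 are considered half-similar.
related = [[1, 0, 0.5], [0, 1, 0], [0.5, 0, 1]]

plain = soft_cosine(a, b, identity)
soft = soft_cosine(a, b, related)
print(plain, soft)
```

The second score is higher because the measure gives partial credit when two records set different but related features.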
 
-This `SimilarityArray` object is an augmented `numpy.ndarray` that can perform
-our matching. The matching itself has a number of parameters that allow you to
-control how similar two embedded records must be to be matched. In this case,
-let's say that two records can only be matched if their pairwise similarity is
-at least `0.5`.
+Lastly, we compute the matching using an adapted Hungarian algorithm with local match thresholds:
 
 ```python
->>> matching = similarities.match(abs_cutoff=0.5)
+>>> matching = similarities.match()
 >>> matching
 (array([0, 1, 2]), array([0, 2, 1]))
 
@@ -167,150 +153,30 @@ So, all three of the records in each dataset were matched correctly. Excellent!
 
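To see why this matching is the one returned, the sketch below solves the same assignment problem by brute force over the similarity scores printed above. This is not the package's adapted Hungarian algorithm and it applies no match thresholds; it simply tries every one-to-one pairing and keeps the one with the largest total similarity, which is only practical for toy inputs. The helper `best_assignment` is a hypothetical name.

```python
from itertools import permutations

# Similarity scores copied from the example output above.
similarities = [
    [0.80074101, 0.18160957, 0.09722178],
    [0.40124732, 0.1877348, 0.58792979],
    [0.13147656, 0.51426533, 0.11772856],
]


def best_assignment(scores):
    """Return the column order that maximises the total matched similarity."""
    n = len(scores)
    return max(
        permutations(range(n)),
        key=lambda cols: sum(scores[row][col] for row, col in enumerate(cols)),
    )


columns = best_assignment(similarities)
print(list(range(len(columns))), list(columns))  # [0, 1, 2] [0, 2, 1]
```

The Hungarian algorithm finds the same optimum in polynomial time, which matters once the datasets have more than a handful of records.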
 ## Working in the cloud
 
-The toolkit is configured to work on Google Cloud Platform (GCP) provided you
-have a team of users with Google Cloud accounts and the appropriate
-permissions. In particular, `pprl_toolkit`'s cloud functionality is built on
-top of a GCP Confidential Space. This setting means that nobody ever has direct
-access to each other's data, and the datasets to be linked are only ever
-brought together in a secure environment.
-
-Have a read through [this tutorial](https://cloud.google.com/confidential-computing/confidential-space/docs/create-your-first-confidential-space-environment)
-if you would like to get to grips with how it all works on the inside.
-
-### Determining roles
-
-There are four roles to fill in a data linkage project: two data-owning
-parties, a workload author, and a workload operator. A workload is how we refer
-to the linkage operation itself. These roles can be summarised as follows:
-
-- A data-owning **party** is responsible for embedding and uploading their data
-  to the cloud. They also download their results.
-- The workload **author** creates and uploads a Docker image to a GCP Artifact
-  Registry.
-- The workload **operator** runs the uploaded Docker image on a Confidential
-  Space virtual machine.
-
-> [!NOTE]
-> We have set up `pprl_toolkit` to allow any configuration of these roles among
-> users. You could do it all yourself, split the workload roles between two
-> data owning-parties, or use a third-party administrator to maintain the
-> workload.
-
-### Creating your projects
-
-Once you have decided who will be filling which role(s), every member of your
-linkage project will need to set up a GCP project. The names of these projects
-will be used in file names and GCP storage buckets. As such, they need to be
-descriptive and [unique](https://cloud.google.com/storage/docs/buckets#naming).
-
-> [!TIP]
-> It may be worth appending a hash of some sort to every project name to help
-> ensure their uniqueness.
-
-Each user will also need to have their Google Cloud administrator grant them
-certain IAM roles on their project depending on which role(s) they are playing
-in the linkage:
-
-- **Data-owning party**:
-  - Cloud KMS Admin (`roles/cloudkms.admin`)
-  - IAM Workload Identity Pool Admin (`roles/iam.workloadIdentityPoolAdmin`)
-  - Service Usage Admin (`roles/serviceusage.serviceUsageAdmin`)
-  - Service Account Admin (`roles/iam.serviceAccountAdmin`)
-  - Storage Admin (`roles/storage.admin`)
-- **Workload author**:
-  - Artifact Registry Administrator (`roles/artifactregistry.admin`)
-- **Workload operator**:
-  - Compute Admin (`roles/compute.admin`)
-  - Security Admin (`roles/securityAdmin`)
-  - Storage Admin (`roles/storage.admin`)
-
-### Toolkit configuration
-
-Now you've got your roles sorted out and projects set up, you (and all other
-users) have to write down your project's configuration in an environment file
-for `pprl_toolkit`. Make sure that everyone has installed `pprl_toolkit` first.
-
-We have provided an example in `.env.example`. All you need to do is copy that
-file to `.env` and fill in your project's details. Everyone in your project
-should have identical environment files.
-
-### Creating the other resources
-
-The last step in setting your linkage project up is to create and configure all
-the other resources on GCP. We have packaged up these steps into a series of
-`bash` scripts, located in the `scripts/` directory. They should be executed in
-order from the `scripts/` directory:
-
-1. The data-owning parties set up a key encryption key, a bucket in which to
-   store their encrypted data, data encryption key and results, a service
-   account for accessing said bucket and key, and a workload identity pool to
-   allow impersonations under stringent conditions:
-   ```bash
-   sh ./01-setup-party-resources.sh <name-of-party-project>
-   ```
-2. The workload operator sets up a bucket for the parties to put their
-   (non-sensitive) attestation credentials, and a service account for running
-   the workload:
-   ```bash
-   sh ./02-setup-workload-operator.sh
-   ```
-3. The workload author sets up an Artifact Registry on GCP, creates a Docker
-   image and uploads that image to their registry:
-   ```bash
-   sh ./03-setup-workload-author.sh
-   ```
-4. The data-owning parties authorise the workload operator's service account to
-   use the workload identity pool to impersonate their service account in a
-   Confidential Space:
-   ```bash
-   sh ./04-authorise-workload.sh <name-of-party-project>
-   ```
-
-### Processing and uploading the datasets
-
-> [!IMPORTANT]
-> This section only applies to data-owning parties. The workload author is
-> finished now, and the workload operator should wait for this section to be
-> completed before moving on to the next section.
-
-Now that all the cloud infrastructure has been set up, we are ready to start
-the first step in doing the actual linkage. Much like the toy example above,
-that is to make a Bloom filter embedding of each dataset.
-
-For users who prefer a graphical user interface, we have included a Flask app
-to handle the processing and uploading of data behind the scenes. This app will
-also be used to download the results once the linkage has completed.
-
-To launch the app, run the following in your terminal:
-
-```bash
-python -m flask --app src/pprl/app run
-```
 
-You should now be able to find the app in your browser of choice at
-[127.0.0.1:5000](http://127.0.0.1:5000).
+![A diagram of the PPRL cloud architecture, with the secure enclave and key management services](https://github.com/datasciencecampus/pprl_toolkit/blob/main/assets/pprl_cloud_diagram.png?raw=true)
 
-Once you have worked through the selection, processing, and GCP upload portions
-of the app, you will be at a holding page. This page can be updated by clicking
-the button, and when your results are ready you will be taken to another page
-where you can download them.
+The cloud demo uses a Google Cloud Platform (GCP) Confidential Space compute instance, which is a virtual machine (VM) using AMD [Secure Encrypted Virtualisation](https://www.amd.com/en/developer/sev.html) (AMD-SEV) technology to encrypt data in-memory.
 
-### Running the linkage
+The Confidential Space VM can also provide cryptographically signed documents, called attestations, which the server can use to prove that it is running in a secure environment before gaining access to data.
 
-> [!IMPORTANT]
-> This section only applies to the workload operator.
+The cloud demo assigns four roles: two data-owning
+parties, a workload author, and a workload operator. These roles can be summarised as follows:
 
-Once the data-owning parties have uploaded their processed data, you are able
-to begin the linkage. To do so, run the `05-run-workload.sh` bash script from
-`scripts/`:
+- Each data-owning **party** is responsible for embedding and uploading their data
+  to the cloud. They also download their results.
+- The workload **author** audits and assures the source code of the server, and then builds and uploads the server as a Docker image.
+- The workload **operator** sets up and runs the Confidential
+  Space virtual machine, which uses the Docker image to perform the record linkage.
 
-```bash
-cd /path/to/pprl_toolkit/scripts
-sh ./05-run-workload.sh
-```
+We have set up `pprl_toolkit` to allow any configuration of these roles among
+users. You could do it all yourself, split the workload roles between two
+data-owning parties, or ask a trusted third party to maintain the
+workload.
 
-You can follow the progress of the workload from the Logs Explorer on GCP. Once
-it is complete, the data-owning parties will be able to download their results.
+> [!WARNING]
+> The cloud demo requires you to set up one or more Google Cloud accounts with billing. The cost of running the demo should be very small, or within your free quota.
+> However, you should ensure that all resources are torn down after running the demo to avoid ongoing charges.
 
 ## Building the documentation
 
@@ -332,7 +198,7 @@ the API reference material:
 python -m quartodoc build
 ```
 
-This will create a bunch of files under `docs/reference/`. You can render the
+This will create a set of Quarto files under `docs/reference/`. You can render the
 documentation itself with the following command, opening a local version of the
 site in your browser:
 

diff --git a/docs/_static/02-client-screenshot.png b/docs/_static/02-client-screenshot.png
diff --git a/docs/assets/pprl_cloud_diagram.png b/docs/assets/pprl_cloud_diagram.png