import os
import time
from functools import partial
@@ -359,7 +359,7 @@ Linking the FEBRL datasets
Load the data
The data comprise two datasets of 5,000 records each, with no duplicates within either dataset, and every record has a valid match in the other dataset.
After loading the data, we can parse the true matched ID number from the indices.
-
+
feb4a, feb4b = load_febrl4()

feb4a["true_id"] = (
@@ -382,7 +382,7 @@ Create a feature
Pass a dictionary of dictionaries of keyword arguments as an optional ff_args parameter (e.g. ff_args = {"dob": {"dayfirst": False, "yearfirst": True}}), or use functools.partial(), as we have below.
-
+
feature_factory = dict(
    name=feat.gen_name_features,
    dob=partial(feat.gen_dateofbirth_features, dayfirst=False, yearfirst=True),
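For comparison, a sketch of the ff_args route (not the tutorial’s code; it reuses the Embedder signature shown in the API run-through later in this diff):

feature_factory = dict(
    name=feat.gen_name_features,
    dob=feat.gen_dateofbirth_features,
)
ff_args = {"dob": {"dayfirst": False, "yearfirst": True}}
embedder = Embedder(feature_factory, ff_args, bf_size=1024, num_hashes=2)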
@@ -396,7 +396,7 @@ Create a feature
Initialise the embedder instance
This instance embeds each feature twice into a Bloom filter of length 1024.
-
+
embedder = Embedder(feature_factory, bf_size=1024, num_hashes=2)
@@ -418,7 +418,7 @@ Embed the datasets
For example, to ensure suburb doesn’t collide with state (if they happened to hold the same value), gen_misc_features() would encode each of their tokens as suburb<token> and state<token>, respectively. If you want to map different columns into the same feature, such as address below, you can set the label explicitly when passing the function to the embedder.
-
+
colspec = dict(
    given_name="name",
    surname="name",
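The cell is truncated by the hunk, but the idea of sharing a label across columns can be sketched as follows. The address_1 and address_2 column names are hypothetical here, and this assumes gen_misc_features accepts a label keyword, as the paragraph above implies:

colspec = dict(
    given_name="name",
    surname="name",
    address_1="address",
    address_2="address",
)
feature_factory["address"] = partial(feat.gen_misc_features, label="address")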
@@ -436,7 +436,7 @@ Embed the datasets
edf2 = embedder.embed(feb4b, colspec=colspec)
Store the embedded datasets and their embedder to file.
-
+
"party1_data.json")
edf1.to_json("party2_data.json")
edf2.to_json("embedder.pkl") embedder.to_pickle(
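On the receiving side, the stored artefacts can be loaded back with the methods shown in the Embedder API run-through (a sketch; file names as above):

embedder = Embedder.from_pickle("embedder.pkl")
edf1 = EmbeddedDataFrame(pd.read_json("party1_data.json"), embedder)
edf2 = EmbeddedDataFrame(pd.read_json("party2_data.json"), embedder)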
@@ -445,7 +445,7 @@ Embed the datasets
Calculate similarity
Compute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.
-
+
start = time.time()

edf1.update_thresholds()
edf2.update_thresholds()

end = time.time()
@@ -453,22 +453,22 @@ Calculate similarity
print(f"Updating thresholds took {end - start:.2f} seconds")
-Updating thresholds took 8.35 seconds
+Updating thresholds took 8.40 seconds
Compute the matrix of similarity scores.
-
+
similarity_scores = embedder.compare(edf1,edf2)
Compute a match
Use the similarity scores to compute a match, using the Hungarian algorithm. First, we compute the match with the row thresholds.
-
+
matching = similarity_scores.match(require_thresholds=True)
Using the true IDs, evaluate the precision and recall of the match.
-
+
def get_results(edf1, edf2, matching):
"""Get the results for a given matching."""
@@ -488,11 +488,11 @@ Compute a match
_ = get_results(edf1, edf2, matching)
-True pos: 4973 | False pos: 0 | Precision: 100.0% | Recall: 99.5%
+True pos: 4969 | False pos: 0 | Precision: 100.0% | Recall: 99.4%
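In other words, with 4969 matches returned and none of them incorrect, precision is 4969 / 4969 = 100.0% and recall is 4969 / 5000 ≈ 99.4%.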
Then, we compute the match without using the row thresholds, calculating the same performance metrics:
-
+
matching = similarity_scores.match(require_thresholds=False)
_ = get_results(edf1, edf2, matching)
diff --git a/docs/tutorials/example-verknupfung.html b/docs/tutorials/example-verknupfung.html
index 681bff2..4e571c0 100644
--- a/docs/tutorials/example-verknupfung.html
+++ b/docs/tutorials/example-verknupfung.html
@@ -341,7 +341,7 @@ Exploring a simple linkage example
Loading the data
First, we load our data into pandas.DataFrame objects. Here, the first records align, but the other two records should be swapped to have an aligned matching. We will use the toolkit to identify these matches.
-
+
import pandas as pd
df1 = pd.DataFrame(
@@ -381,7 +381,7 @@ Loading the data
Creating and assigning a feature factory
The next step is to decide how to process each of the columns in our datasets.
To do this, we define a feature factory that maps column types to feature generation functions, and a column specification for each dataset mapping our columns to column types in the factory.
-
+
from pprl.embedder import features
from functools import partial
@@ -419,7 +419,7 @@ C
Embedding the data
With our specifications sorted out, we can get to creating our Bloom filter embedding. Before doing so, we need to decide on two parameters: the size of the filter and the number of hashes. By default, these are 1024 and 2, respectively.
Once we’ve decided, we can create our Embedder instance and use it to embed our data with their column specifications.
-
+
from pprl.embedder.embedder import Embedder
embedder = Embedder(factory, bf_size=1024, num_hashes=2)
@@ -428,7 +428,7 @@ Embedding the data
edf2 = embedder.embed(df2, colspec=spec2, update_thresholds=True)
If we take a look at one of these embedded datasets, we can see that it has a whole bunch of new columns. There is a _features column for each of the original columns containing their pre-embedding string features, and there’s an all_features column that combines the features. Then there are three additional columns: bf_indices, bf_norms and thresholds.
-
+
edf1.columns
Index(['first_name', 'last_name', 'gender', 'date_of_birth', 'instrument',
@@ -439,15 +439,15 @@ Embedding the data
The bf_indices column contains the Bloom filters, represented compactly as a list of non-zero indices for each record.
-
+
print(edf1.bf_indices[0])
-[2, 646, 903, 262, 9, 654, 15, 272, 17, 146, 526, 532, 531, 282, 667, 413, 670, 544, 288, 931, 292, 808, 937, 172, 942, 559, 816, 691, 820, 567, 440, 56, 823, 60, 61, 318, 319, 320, 577, 444, 836, 583, 332, 972, 590, 77, 593, 338, 465, 468, 84, 82, 851, 600, 211, 218, 861, 613, 871, 744, 238, 367, 881, 758, 890, 379, 1021, 763]
+[2, 262, 646, 903, 9, 526, 15, 272, 654, 146, 531, 532, 17, 282, 667, 413, 670, 544, 288, 931, 292, 808, 937, 172, 942, 559, 816, 691, 820, 567, 823, 440, 56, 60, 61, 318, 319, 320, 444, 577, 836, 583, 332, 77, 972, 590, 465, 593, 211, 468, 82, 851, 338, 600, 84, 218, 861, 613, 871, 744, 238, 367, 881, 758, 890, 379, 1021, 763]
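If a dense vector is ever needed, the compact form can be expanded (a sketch, using the bf_size of 1024 chosen above):

import numpy as np

# Expand the compact index list back into a dense 0/1 vector.
dense = np.zeros(1024, dtype=int)
dense[edf1.bf_indices[0]] = 1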
The bf_norms column contains the norm of each Bloom filter with respect to the Soft Cosine Measure (SCM) matrix. In this case, since we are using an untrained model, the SCM matrix is an identity matrix, and the norm is just the Euclidean norm of the Bloom filter represented as a binary vector, which is equal to np.sqrt(len(bf_indices[i])) for record i. The norm is used to scale the similarity measures so that they take values between -1 and 1.
The thresholds column is calculated to provide, for each record, a threshold similarity score below which it will not be matched. It’s like a reserve price in an auction – it stops a record being matched to another record when the similarity isn’t high enough. This is an innovative feature of our method; other linkage methods typically only have one global threshold score for the entire dataset.
-
+
print(edf1.loc[:,["bf_norms","thresholds"]])
print(edf2.loc[:,["bf_norms","thresholds"]])
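As an illustration of that scaling (not the library’s internal code): with an identity SCM matrix, the similarity of two records reduces to ordinary cosine similarity between their binary Bloom vectors, which can be computed directly from bf_indices:

import numpy as np

a = set(edf1.bf_indices[0])
b = set(edf2.bf_indices[0])
# Dot product of two binary vectors is the size of the index intersection.
cosine = len(a & b) / (np.sqrt(len(a)) * np.sqrt(len(b)))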
@@ -467,7 +467,7 @@ Embedding the data
The processed features
Let’s take a look at how the features are processed into small text strings (shingles) before being hashed into the Bloom filter. The first record in the first dataset is the same person as the first record in the second dataset, although the data is not identical, so we can compare the processed features for these records to see how pprl puts them into a format where they can be compared.
First, we’ll look at date of birth:
-
+
print(edf1.date_of_birth_features[0])
print(edf2.birth_date_features[0])
@@ -477,7 +477,7 @@ The processed featu
Python can parse the different formats easily. Although the dates are slightly different in the dataset, the year and month will still match, even though the day will not.
Then we’ll look at name:
-
+
print(edf1.first_name_features[0] + edf1.last_name_features[0])
print(edf2.name_features[0])
@@ -487,7 +487,7 @@ The processed featu
The two datasets store the names differently, but this doesn’t matter for the Bloom filter method because it treats each record like a bag of features. By default, the name processor produces 2-grams and 3-grams.
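A rough sketch of that default behaviour (illustrative only, not the pprl implementation):

def shingle(name, sizes=(2, 3)):
    # Pad with underscores, then slide a window of each size across the name.
    padded = f"_{name.lower()}_"
    return [padded[i : i + n] for n in sizes for i in range(len(padded) - n + 1)]

shingle("Laura")  # ['_l', 'la', 'au', 'ur', 'ra', 'a_', '_la', 'lau', 'aur', 'ura', 'ra_']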
The sex processing function just converts different formats to lowercase and takes the first letter. This will often be enough:
-
+
print(edf1.gender_features[0])
print(edf2.sex_features[0])
@@ -496,7 +496,7 @@ The processed featu
Finally, we’ll see how our instrument feature function (partial(features.gen_misc_shingled_features, label="instrument")) processed the data:
-
+
print(edf1.instrument_features[0])
print(edf2.main_instrument_features[0])
@@ -509,7 +509,7 @@ The processed featu
Performing the linkage
We can now perform the linkage by comparing these Bloom filter embeddings. We use the Soft Cosine Measure (which in this untrained model, is equivalent to a normal cosine similarity metric) to calculate record-wise similarity and an adapted Hungarian algorithm to match the records based on those similarities.
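To give a feel for the assignment step, here is a sketch using scipy’s implementation of the Hungarian algorithm rather than pprl’s adapted version; the numbers are rounded from the similarity matrix in this tutorial:

import numpy as np
from scipy.optimize import linear_sum_assignment

scores = np.array([
    [0.80, 0.10, 0.10],
    [0.34, 0.16, 0.63],
    [0.12, 0.54, 0.12],
])
# Choose the one-to-one assignment that maximises total similarity.
rows, cols = linear_sum_assignment(scores, maximize=True)
print(rows, cols)  # [0 1 2] [0 2 1]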
-
+
similarities = embedder.compare(edf1, edf2)
similarities
@@ -519,7 +519,7 @@ Performing the link
This SimilarityArray object is an augmented numpy.ndarray that can perform our matching. The matching itself can optionally be called with an absolute threshold score, but it doesn’t need one.
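For example, discarding any pair that scores below 0.5 would look like the call used in the run-through later in this diff:

matching = similarities.match(abs_cutoff=0.5)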
-
+
matching = similarities.match()
matching
diff --git a/docs/tutorials/index.html b/docs/tutorials/index.html
index 113b2fa..d14e966 100644
--- a/docs/tutorials/index.html
+++ b/docs/tutorials/index.html
@@ -384,7 +384,7 @@ Tutorials
-
+
Embedder API run-through
@@ -395,7 +395,7 @@ Tutorials
5 min
-
+
Exploring a simple linkage example
@@ -406,7 +406,7 @@ Tutorials
6 min
-
+
Linking the FEBRL datasets
@@ -417,7 +417,7 @@ Tutorials
4 min
-
+
Working in the cloud
diff --git a/docs/tutorials/run-through.html b/docs/tutorials/run-through.html
index ed46085..756526b 100644
--- a/docs/tutorials/run-through.html
+++ b/docs/tutorials/run-through.html
@@ -346,9 +346,9 @@ Embedder API run-through
the config module, which includes our package configuration (such as the location of data directories)
some classes from the main embedder module
-
+
import os
-
+import numpy as np
import pandas as pd
from pprl import EmbeddedDataFrame, Embedder, config
@@ -357,42 +357,45 @@ Embedder API run-through
Data set-up
For this demo we’ll create a really minimal pair of datasets. Notice that they don’t have to have the same structure or field names.
-
+
df1 = pd.DataFrame(
    dict(
        id=[1,2,3],
        forename=["Henry", "Sally", "Ina"],
        surname = ["Tull", "Brown", "Lawrey"],
-        dob=["1/1/2001", "2/1/2001", "4/10/1995"],
+        dob=["", "2/1/2001", "4/10/1995"],
        gender=["male", "Male", "Female"],
-    )
-)
-
-df2 = pd.DataFrame(
-    dict(
-        personid=[4,5,6],
-        full_name=["Harry Tull", "Sali Brown", "Ina Laurie"],
-        date_of_birth=["2/1/2001", "2/1/2001", "4/11/1995"],
-        sex=["M", "M", "F"],
-    )
-)
+        county=["", np.NaN, "County Durham"]
+    )
+)
+
+df2 = pd.DataFrame(
+    dict(
+        personid=[4,5,6],
+        full_name=["Harry Tull", "Sali Brown", "Ina Laurie"],
+        date_of_birth=["2/1/2001", "2/1/2001", "4/11/1995"],
+        sex=["M", "M", "F"],
+        county=["Rutland", "Powys", "Durham"]
+    )
+)
Features are extracted as different kinds of string objects from each field, ready to be hash embedded into the Bloom filters. We need to specify the feature extraction functions we’ll need.
In this case we’ll need one extractor for names, one for dates of birth, and one for sex/gender records. We create a dict with the functions we need. We create another dict to store any keyword arguments we want to pass in to each function (in this case we use all the default arguments so the keyword argument dictionaries are empty):
-
+
feature_factory = dict(
    name=feat.gen_name_features,
    dob=feat.gen_dateofbirth_features,
    sex=feat.gen_sex_features,
-)
-
-ff_args = dict(name={}, sex={}, dob={})
+    misc=feat.gen_misc_features
+)
+
+ff_args = dict(name={}, sex={}, dob={})
Embedding
Now we can create an Embedder object. We want our Bloom filter vectors to have a length of 1024 elements, and we choose to hash each feature two times. These choices seem to work ok, but we haven’t explored them systematically.
-
+
embedder = Embedder(feature_factory,
    ff_args,
    bf_size = 2**10,
@@ -400,21 +403,21 @@ Embedding
)
Now we can hash embed the dataset into an EmbeddedDataFrame (EDF). For this we need to pass a column specification colspec that maps each column of the data into the feature_factory functions. Any columns not mapped will not contribute to the embedding.
-
+
edf1 = embedder.embed(
-    df1, colspec=dict(forename="name", surname="name", dob="dob", gender="sex")
+    df1, colspec=dict(forename="name", surname="name", dob="dob", gender="sex", county="misc")
)
edf2 = embedder.embed(
-    df2, colspec=dict(full_name="name", date_of_birth="dob", sex="sex")
+    df2, colspec=dict(full_name="name", date_of_birth="dob", sex="sex", county="misc")
)
print(edf1)
print(edf2)
- id forename surname dob gender \
-0 1 Henry Tull 1/1/2001 male
-1 2 Sally Brown 2/1/2001 Male
-2 3 Ina Lawrey 4/10/1995 Female
+ id forename surname dob gender county \
+0 1 Henry Tull male
+1 2 Sally Brown 2/1/2001 Male NaN
+2 3 Ina Lawrey 4/10/1995 Female County Durham
forename_features \
0 [_h, he, en, nr, ry, y_, _he, hen, enr, nry, ry_]
@@ -426,44 +429,44 @@ Embedding
1 [_b, br, ro, ow, wn, n_, _br, bro, row, own, wn_]
2 [_l, la, aw, wr, re, ey, y_, _la, law, awr, wr...
- dob_features gender_features \
-0 [day<01>, month<01>, year<2001>] [sex<m>]
-1 [day<02>, month<01>, year<2001>] [sex<m>]
-2 [day<04>, month<10>, year<1995>] [sex<f>]
+ dob_features gender_features county_features \
+0 [] [sex<m>]
+1 [day<02>, month<01>, year<2001>] [sex<m>]
+2 [day<04>, month<10>, year<1995>] [sex<f>] [county<county durham>]
all_features \
-0 [ll_, _tu, day<01>, ul, l_, sex<m>, ull, y_, _...
-1 [lly, day<02>, wn_, sex<m>, sal, wn, y_, ly_, ...
-2 [_in, _i, ey_, wr, y_, rey, wre, sex<f>, _l, _...
+0 [ll, nr, ll_, _t, ull, _tu, _he, he, tu, hen, ...
+1 [all, ll, ro, n_, ow, sa, ly_, bro, month<01>,...
+2 [ina, ey, _in, re, wr, aw, law, la, na_, ey_, ...
bf_indices bf_norms
-0 [130, 644, 773, 903, 135, 776, 778, 265, 654, ... 6.708204
+0 [644, 773, 135, 776, 265, 778, 271, 402, 404, ... 6.244998
1 [129, 258, 130, 776, 523, 525, 398, 271, 671, ... 7.141428
-2 [647, 394, 269, 13, 15, 532, 155, 28, 667, 413... 6.855655
- personid full_name date_of_birth sex \
-0 4 Harry Tull 2/1/2001 M
-1 5 Sali Brown 2/1/2001 M
-2 6 Ina Laurie 4/11/1995 F
+2 [647, 394, 269, 13, 15, 532, 667, 155, 413, 28... 7.000000
+ personid full_name date_of_birth sex county \
+0 4 Harry Tull 2/1/2001 M Rutland
+1 5 Sali Brown 2/1/2001 M Powys
+2 6 Ina Laurie 4/11/1995 F Durham
full_name_features \
0 [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...
1 [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...
2 [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...
- date_of_birth_features sex_features \
-0 [day<02>, month<01>, year<2001>] [sex<m>]
-1 [day<02>, month<01>, year<2001>] [sex<m>]
-2 [day<04>, month<11>, year<1995>] [sex<f>]
+ date_of_birth_features sex_features county_features \
+0 [day<02>, month<01>, year<2001>] [sex<m>] [county<rutland>]
+1 [day<02>, month<01>, year<2001>] [sex<m>] [county<powys>]
+2 [day<04>, month<11>, year<1995>] [sex<f>] [county<durham>]
all_features \
-0 [ll_, _tu, day<02>, ar, ul, l_, sex<m>, ull, y...
-1 [day<02>, wn_, sex<m>, wn, sal, ow, al, n_, al...
-2 [ri, _in, _i, aur, ie_, ur, sex<f>, _l, au, _l...
+0 [ll, ll_, rr, rry, ar, _ha, _t, ha, ull, count...
+1 [county<powys>, ro, li_, n_, ow, sa, bro, ali,...
+2 [ina, ie, aur, e_, _in, uri, la, na_, county<d...
bf_indices bf_norms
-0 [640, 130, 644, 135, 776, 10, 778, 271, 402, 5... 6.708204
-1 [130, 523, 525, 398, 271, 152, 671, 803, 806, ... 6.855655
-2 [646, 647, 394, 269, 15, 272, 531, 532, 665, 6... 6.782330
+0 [640, 130, 644, 135, 776, 10, 778, 271, 402, 5... 6.855655
+1 [130, 523, 525, 398, 271, 152, 671, 803, 806, ... 7.000000
+2 [646, 647, 394, 269, 15, 272, 531, 532, 665, 6... 6.928203
@@ -475,7 +478,7 @@ Training
Computing the similarity scores and the matching
Now we have two embedded datasets, we can compare them and compute all the pairwise Cosine similarity scores.
First, we have to compute the vector norms of each Bloom vector (for scaling the Cosine similarity) and the thresholds (thresholds are explained here [link]). Computing the thresholds can be time-consuming for a larger dataset, because it essentially computes all pairwise comparisons of the data to itself.
-
+
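Concretely, the two routes look something like this (a sketch using the update_thresholds flag and method shown elsewhere in these tutorials; colspec1 stands in for the column specification defined above):

# Compute norms and thresholds at embed time...
edf1 = embedder.embed(df1, colspec=colspec1, update_thresholds=True)

# ...or embed first and compute the thresholds later, e.g. on the server.
edf1 = embedder.embed(df1, colspec=colspec1)
edf1.update_thresholds()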
@@ -489,9 +492,11 @@ full_name
date_of_birth
sex
+county
full_name_features
date_of_birth_features
sex_features
+county_features
all_features
bf_indices
bf_norms
@@ -505,13 +510,15 @@
1
@@ -519,13 +526,15 @@
2
@@ -533,13 +542,15 @@
+
similarities = embedder.compare(edf1,edf2)
print(similarities)
-[[0.6666667 0.17395416 0. ]
- [0.29223802 0.79658223 0.08258402]
- [0.08697708 0.10638298 0.58067873]]
+[[0.60728442 0.09150181 0. ]
+ [0.2859526 0.78015612 0.08084521]
+ [0.08335143 0.10204083 0.57735028]]
Finally, you can compute the matching:
-
+
matching = similarities.match(abs_cutoff=0.5)
print(matching)
@@ -574,24 +585,24 @@ Serialisation and file I/O
That’s how to do the workflow in one session. However, this demo follows a multi-stage workflow, so we need to be able to pass objects around. There are a couple of methods that enable file I/O and serialisation.
First, the Embedder object itself needs to be written to file and loaded. The idea is to train it, then share it with the data-owning parties and with the matching server. For this purpose, it’s possible to pickle the entire Embedder object.
-
+
"embedder.pkl")
embedder.to_pickle(
= Embedder.from_pickle("embedder.pkl") embedder_copy
The copy has the same functionality as the original:
-
+
similarities = embedder_copy.compare(edf1,edf2)
print(similarities)
-[[0.6666667 0.17395416 0. ]
- [0.29223802 0.79658223 0.08258402]
- [0.08697708 0.10638298 0.58067873]]
+[[0.60728442 0.09150181 0. ]
+ [0.2859526 0.78015612 0.08084521]
+ [0.08335143 0.10204083 0.57735028]]
NB: This won’t work if two datasets were embedded with different Embedder instances, even if they’re identical. The compare() method checks for the same embedder object memory reference, so it won’t work if one dataset was embedded with the original and the other with the copy. The way to fix this is to re-initialise the EmbeddedDataFrame with the new Embedder object.
-
+
edf2_copy = EmbeddedDataFrame(edf2, embedder_copy)
In this case, be careful that the Embedder is compatible with the Bloom filter vectors in the EDF (i.e. that it uses the same parameters and feature factories): while you can refresh the norms and thresholds, you can’t refresh the bf_indices without re-embedding the data frame.
@@ -599,7 +610,7 @@ Serialisation an
Serialising the data
The EDF objects are just a thin wrapper around pandas.DataFrame instances, so you can serialise to JSON using the normal methods.
-
+
"edf1.json")
edf1.to_json(
= pd.read_json("edf1.json")
@@ -613,7 +624,7 @@ Serialising the data
The bf_indices, bf_norms and thresholds columns will be preserved. However, this demotes the data frames back to normal pandas.DataFrame instances and loses the link to an Embedder instance.
To fix this, just re-initialise them:
-
+
edf1_copy = EmbeddedDataFrame(edf1_copy, embedder_copy)
diff --git a/search.json b/search.json
index 8c077a3..ab28c72 100644
--- a/search.json
+++ b/search.json
@@ -223,7 +223,7 @@
"href": "docs/tutorials/example-verknupfung.html",
"title": "Exploring a simple linkage example",
"section": "",
- "text": "The Python package implements the Bloom filter linkage method (Schnell et al., 2009), and can also implement pretrained Hash embeddings (Miranda et al., 2022), if a suitable large, pre-matched corpus of data is available.\nLet us consider a small example where we want to link two excerpts of data on bands. In this scenario, we are looking at some toy data on the members of a fictional, German rock trio called “Verknüpfung”. In this example we will see how to use untrained Bloom filters to match data.\n\nLoading the data\nFirst, we load our data into pandas.DataFrame objects. Here, the first records align, but the other two records should be swapped to have an aligned matching. We will use the toolkit to identify these matches.\n\nimport pandas as pd\n\ndf1 = pd.DataFrame(\n {\n \"first_name\": [\"Laura\", \"Kaspar\", \"Grete\"],\n \"last_name\": [\"Daten\", \"Gorman\", \"Knopf\"],\n \"gender\": [\"F\", \"M\", \"F\"],\n \"date_of_birth\": [\"01/03/1977\", \"31/12/1975\", \"12/7/1981\"],\n \"instrument\": [\"bass\", \"guitar\", \"drums\"],\n }\n)\ndf2 = pd.DataFrame(\n {\n \"name\": [\"Laura Datten\", \"Greta Knopf\", \"Casper Goreman\"],\n \"sex\": [\"female\", \"female\", \"male\"],\n \"main_instrument\": [\"bass guitar\", \"percussion\", \"electric guitar\"],\n \"birth_date\": [\"1977-03-23\", \"1981-07-12\", \"1975-12-31\"],\n }\n)\n\n\n\n\n\n\n\nNote\n\n\n\nThese datasets don’t have the same column names or follow the same encodings, and there are several spelling mistakes in the names of the band members, as well as a typo in the dates.\nThankfully, the PPRL Toolkit is flexible enough to handle this!\n\n\n\n\nCreating and assigning a feature factory\nThe next step is to decide how to process each of the columns in our datasets.\nTo do this, we define a feature factory that maps column types to feature generation functions, and a column specification for each dataset mapping our columns to column types in the factory.\n\nfrom pprl.embedder import features\nfrom functools import partial\n\nfactory = dict(\n name=features.gen_name_features,\n sex=features.gen_sex_features,\n misc=features.gen_misc_features,\n dob=features.gen_dateofbirth_features,\n instrument=partial(features.gen_misc_shingled_features, label=\"instrument\")\n)\nspec1 = dict(\n first_name=\"name\",\n last_name=\"name\",\n gender=\"sex\",\n instrument=\"instrument\",\n date_of_birth=\"dob\",\n)\nspec2 = dict(name=\"name\", sex=\"sex\", main_instrument=\"instrument\", birth_date=\"dob\")\n\n\n\n\n\n\n\nTip\n\n\n\nThe feature generation functions, features.gen_XXX_features have sensible default parameters, but sometimes have to be passed in to the feature factory with different parameters, such as to set a feature label in the example above. There are two ways to achieve this. Either use functools.partial to set parameters (as above), or pass keyword arguments as a dictionary of dictionaries to the Embedder as ff_args.\n\n\n\n\nEmbedding the data\nWith our specifications sorted out, we can get to creating our Bloom filter embedding. Before doing so, we need to decide on two parameters: the size of the filter and the number of hashes. 
By default, these are 1024 and 2, respectively.\nOnce we’ve decided, we can create our Embedder instance and use it to embed our data with their column specifications.\n\nfrom pprl.embedder.embedder import Embedder\n\nembedder = Embedder(factory, bf_size=1024, num_hashes=2)\n\nedf1 = embedder.embed(df1, colspec=spec1, update_thresholds=True)\nedf2 = embedder.embed(df2, colspec=spec2, update_thresholds=True)\n\nIf we take a look at one of these embedded datasets, we can see that it has a whole bunch of new columns. There is a _features column for each of the original columns containing their pre-embedding string features, and there’s an all_features column that combines the features. Then there are three additional columns: bf_indices, bf_norms and thresholds.\n\nedf1.columns\n\nIndex(['first_name', 'last_name', 'gender', 'date_of_birth', 'instrument',\n 'first_name_features', 'last_name_features', 'gender_features',\n 'instrument_features', 'date_of_birth_features', 'all_features',\n 'bf_indices', 'bf_norms', 'thresholds'],\n dtype='object')\n\n\nThe bf_indices column contains the Bloom filters, represented compactly as a list of non-zero indices for each record.\n\nprint(edf1.bf_indices[0])\n\n[2, 646, 903, 262, 9, 654, 15, 272, 17, 146, 526, 532, 531, 282, 667, 413, 670, 544, 288, 931, 292, 808, 937, 172, 942, 559, 816, 691, 820, 567, 440, 56, 823, 60, 61, 318, 319, 320, 577, 444, 836, 583, 332, 972, 590, 77, 593, 338, 465, 468, 84, 82, 851, 600, 211, 218, 861, 613, 871, 744, 238, 367, 881, 758, 890, 379, 1021, 763]\n\n\nThe bf_norms column contains the norm of each Bloom filter with respect to the Soft Cosine Measure (SCM) matrix. In this case since we are using an untrained model, the SCM matrix is an identity matrix, and the norm is just the Euclidean norm of the Bloom filter represented as a binary vector, which is equal to np.sqrt(len(bf_indices[i])) for record i. The norm is used to scale the similarity measures so that they take values between -1 and 1.\nThe thresholds column is calculated to provide, for each record, a threshold similarity score below which it will not be matched. It’s like a reserve price in an auction – it stops a record being matched to another record when the similarity isn’t high enough. This is an innovative feature of our method; other linkage methods typically only have one global threshold score for the entire dataset.\n\nprint(edf1.loc[:,[\"bf_norms\",\"thresholds\"]])\nprint(edf2.loc[:,[\"bf_norms\",\"thresholds\"]])\n\n bf_norms thresholds\n0 8.246211 0.114332\n1 9.055386 0.143159\n2 8.485281 0.143159\n bf_norms thresholds\n0 9.695360 0.294345\n1 9.380832 0.157014\n2 10.862781 0.294345\n\n\n\n\n\nThe processed features\nLet’s take a look at how the features are processed into small text strings (shingles) before being hashed into the Bloom filter. The first record in the first dataset is the same person as the first record in the second dataset, although the data is not identical, so we can compare the processed features for these records to see how pprl puts them into a format where they can be compared.\nFirst, we’ll look at date of birth:\n\nprint(edf1.date_of_birth_features[0])\nprint(edf2.birth_date_features[0])\n\n['day<01>', 'month<03>', 'year<1977>']\n['day<23>', 'month<03>', 'year<1977>']\n\n\nPython can parse the different formats easily. 
Although the dates are slightly different in the dataset, the year and month will still match, even though the day will not.\nThen we’ll look at name:\n\nprint(edf1.first_name_features[0] + edf1.last_name_features[0])\nprint(edf2.name_features[0])\n\n['_l', 'la', 'au', 'ur', 'ra', 'a_', '_la', 'lau', 'aur', 'ura', 'ra_', '_d', 'da', 'at', 'te', 'en', 'n_', '_da', 'dat', 'ate', 'ten', 'en_']\n['_l', 'la', 'au', 'ur', 'ra', 'a_', '_d', 'da', 'at', 'tt', 'te', 'en', 'n_', '_la', 'lau', 'aur', 'ura', 'ra_', '_da', 'dat', 'att', 'tte', 'ten', 'en_']\n\n\nThe two datasets store the names differently, but this doesn’t matter for the Bloom filter method because it treats each record like a bag of features. By default, the name processor produces 2-grams and 3-grams.\nThe sex processing function just converts different formats to lowercase and takes the first letter. This will often be enough:\n\nprint(edf1.gender_features[0])\nprint(edf2.sex_features[0])\n\n['sex<f>']\n['sex<f>']\n\n\nFinally, we’ll see how our instrument feature function (partial(features.gen_misc_shingled_features, label=\"instrument\")) processed the data:\n\nprint(edf1.instrument_features[0])\nprint(edf2.main_instrument_features[0])\n\n['instrument<_b>', 'instrument<ba>', 'instrument<as>', 'instrument<ss>', 'instrument<s_>', 'instrument<_ba>', 'instrument<bas>', 'instrument<ass>', 'instrument<ss_>']\n['instrument<_b>', 'instrument<ba>', 'instrument<as>', 'instrument<ss>', 'instrument<s_>', 'instrument<_g>', 'instrument<gu>', 'instrument<ui>', 'instrument<it>', 'instrument<ta>', 'instrument<ar>', 'instrument<r_>', 'instrument<_ba>', 'instrument<bas>', 'instrument<ass>', 'instrument<ss_>', 'instrument<_gu>', 'instrument<gui>', 'instrument<uit>', 'instrument<ita>', 'instrument<tar>', 'instrument<ar_>']\n\n\nSetting the label argument was important to ensure that the shingles match (and are hashed to the same slots) because the default behaviour of the function is to use the column name as a label: since the two columns have different names, the default wouldn’t have allowed the features to match to each other.\n\n\nPerforming the linkage\nWe can now perform the linkage by comparing these Bloom filter embeddings. We use the Soft Cosine Measure (which in this untrained model, is equivalent to a normal cosine similarity metric) to calculate record-wise similarity and an adapted Hungarian algorithm to match the records based on those similarities.\n\nsimilarities = embedder.compare(edf1, edf2)\nsimilarities\n\nSimilarityArray([[0.80050047, 0.10341754, 0.10047246],\n [0.34170424, 0.16480856, 0.63029481],\n [0.12155416, 0.54020787, 0.11933984]])\n\n\nThis SimilarityArray object is an augmented numpy.ndarray that can perform our matching. The matching itself can optionally be called with an absolute threshold score, but it doesn’t need one.\n\nmatching = similarities.match()\nmatching\n\n(array([0, 1, 2]), array([0, 2, 1]))\n\n\nSo, all three of the records in each dataset were matched correctly. Excellent!",
+ "text": "The Python package implements the Bloom filter linkage method (Schnell et al., 2009), and can also implement pretrained Hash embeddings (Miranda et al., 2022), if a suitable large, pre-matched corpus of data is available.\nLet us consider a small example where we want to link two excerpts of data on bands. In this scenario, we are looking at some toy data on the members of a fictional, German rock trio called “Verknüpfung”. In this example we will see how to use untrained Bloom filters to match data.\n\nLoading the data\nFirst, we load our data into pandas.DataFrame objects. Here, the first records align, but the other two records should be swapped to have an aligned matching. We will use the toolkit to identify these matches.\n\nimport pandas as pd\n\ndf1 = pd.DataFrame(\n {\n \"first_name\": [\"Laura\", \"Kaspar\", \"Grete\"],\n \"last_name\": [\"Daten\", \"Gorman\", \"Knopf\"],\n \"gender\": [\"F\", \"M\", \"F\"],\n \"date_of_birth\": [\"01/03/1977\", \"31/12/1975\", \"12/7/1981\"],\n \"instrument\": [\"bass\", \"guitar\", \"drums\"],\n }\n)\ndf2 = pd.DataFrame(\n {\n \"name\": [\"Laura Datten\", \"Greta Knopf\", \"Casper Goreman\"],\n \"sex\": [\"female\", \"female\", \"male\"],\n \"main_instrument\": [\"bass guitar\", \"percussion\", \"electric guitar\"],\n \"birth_date\": [\"1977-03-23\", \"1981-07-12\", \"1975-12-31\"],\n }\n)\n\n\n\n\n\n\n\nNote\n\n\n\nThese datasets don’t have the same column names or follow the same encodings, and there are several spelling mistakes in the names of the band members, as well as a typo in the dates.\nThankfully, the PPRL Toolkit is flexible enough to handle this!\n\n\n\n\nCreating and assigning a feature factory\nThe next step is to decide how to process each of the columns in our datasets.\nTo do this, we define a feature factory that maps column types to feature generation functions, and a column specification for each dataset mapping our columns to column types in the factory.\n\nfrom pprl.embedder import features\nfrom functools import partial\n\nfactory = dict(\n name=features.gen_name_features,\n sex=features.gen_sex_features,\n misc=features.gen_misc_features,\n dob=features.gen_dateofbirth_features,\n instrument=partial(features.gen_misc_shingled_features, label=\"instrument\")\n)\nspec1 = dict(\n first_name=\"name\",\n last_name=\"name\",\n gender=\"sex\",\n instrument=\"instrument\",\n date_of_birth=\"dob\",\n)\nspec2 = dict(name=\"name\", sex=\"sex\", main_instrument=\"instrument\", birth_date=\"dob\")\n\n\n\n\n\n\n\nTip\n\n\n\nThe feature generation functions, features.gen_XXX_features have sensible default parameters, but sometimes have to be passed in to the feature factory with different parameters, such as to set a feature label in the example above. There are two ways to achieve this. Either use functools.partial to set parameters (as above), or pass keyword arguments as a dictionary of dictionaries to the Embedder as ff_args.\n\n\n\n\nEmbedding the data\nWith our specifications sorted out, we can get to creating our Bloom filter embedding. Before doing so, we need to decide on two parameters: the size of the filter and the number of hashes. 
By default, these are 1024 and 2, respectively.\nOnce we’ve decided, we can create our Embedder instance and use it to embed our data with their column specifications.\n\nfrom pprl.embedder.embedder import Embedder\n\nembedder = Embedder(factory, bf_size=1024, num_hashes=2)\n\nedf1 = embedder.embed(df1, colspec=spec1, update_thresholds=True)\nedf2 = embedder.embed(df2, colspec=spec2, update_thresholds=True)\n\nIf we take a look at one of these embedded datasets, we can see that it has a whole bunch of new columns. There is a _features column for each of the original columns containing their pre-embedding string features, and there’s an all_features column that combines the features. Then there are three additional columns: bf_indices, bf_norms and thresholds.\n\nedf1.columns\n\nIndex(['first_name', 'last_name', 'gender', 'date_of_birth', 'instrument',\n 'first_name_features', 'last_name_features', 'gender_features',\n 'instrument_features', 'date_of_birth_features', 'all_features',\n 'bf_indices', 'bf_norms', 'thresholds'],\n dtype='object')\n\n\nThe bf_indices column contains the Bloom filters, represented compactly as a list of non-zero indices for each record.\n\nprint(edf1.bf_indices[0])\n\n[2, 262, 646, 903, 9, 526, 15, 272, 654, 146, 531, 532, 17, 282, 667, 413, 670, 544, 288, 931, 292, 808, 937, 172, 942, 559, 816, 691, 820, 567, 823, 440, 56, 60, 61, 318, 319, 320, 444, 577, 836, 583, 332, 77, 972, 590, 465, 593, 211, 468, 82, 851, 338, 600, 84, 218, 861, 613, 871, 744, 238, 367, 881, 758, 890, 379, 1021, 763]\n\n\nThe bf_norms column contains the norm of each Bloom filter with respect to the Soft Cosine Measure (SCM) matrix. In this case since we are using an untrained model, the SCM matrix is an identity matrix, and the norm is just the Euclidean norm of the Bloom filter represented as a binary vector, which is equal to np.sqrt(len(bf_indices[i])) for record i. The norm is used to scale the similarity measures so that they take values between -1 and 1.\nThe thresholds column is calculated to provide, for each record, a threshold similarity score below which it will not be matched. It’s like a reserve price in an auction – it stops a record being matched to another record when the similarity isn’t high enough. This is an innovative feature of our method; other linkage methods typically only have one global threshold score for the entire dataset.\n\nprint(edf1.loc[:,[\"bf_norms\",\"thresholds\"]])\nprint(edf2.loc[:,[\"bf_norms\",\"thresholds\"]])\n\n bf_norms thresholds\n0 8.246211 0.114332\n1 9.055386 0.143159\n2 8.485281 0.143159\n bf_norms thresholds\n0 9.695360 0.294345\n1 9.380832 0.157014\n2 10.862781 0.294345\n\n\n\n\n\nThe processed features\nLet’s take a look at how the features are processed into small text strings (shingles) before being hashed into the Bloom filter. The first record in the first dataset is the same person as the first record in the second dataset, although the data is not identical, so we can compare the processed features for these records to see how pprl puts them into a format where they can be compared.\nFirst, we’ll look at date of birth:\n\nprint(edf1.date_of_birth_features[0])\nprint(edf2.birth_date_features[0])\n\n['day<01>', 'month<03>', 'year<1977>']\n['day<23>', 'month<03>', 'year<1977>']\n\n\nPython can parse the different formats easily. 
Although the dates are slightly different in the dataset, the year and month will still match, even though the day will not.\nThen we’ll look at name:\n\nprint(edf1.first_name_features[0] + edf1.last_name_features[0])\nprint(edf2.name_features[0])\n\n['_l', 'la', 'au', 'ur', 'ra', 'a_', '_la', 'lau', 'aur', 'ura', 'ra_', '_d', 'da', 'at', 'te', 'en', 'n_', '_da', 'dat', 'ate', 'ten', 'en_']\n['_l', 'la', 'au', 'ur', 'ra', 'a_', '_d', 'da', 'at', 'tt', 'te', 'en', 'n_', '_la', 'lau', 'aur', 'ura', 'ra_', '_da', 'dat', 'att', 'tte', 'ten', 'en_']\n\n\nThe two datasets store the names differently, but this doesn’t matter for the Bloom filter method because it treats each record like a bag of features. By default, the name processor produces 2-grams and 3-grams.\nThe sex processing function just converts different formats to lowercase and takes the first letter. This will often be enough:\n\nprint(edf1.gender_features[0])\nprint(edf2.sex_features[0])\n\n['sex<f>']\n['sex<f>']\n\n\nFinally, we’ll see how our instrument feature function (partial(features.gen_misc_shingled_features, label=\"instrument\")) processed the data:\n\nprint(edf1.instrument_features[0])\nprint(edf2.main_instrument_features[0])\n\n['instrument<_b>', 'instrument<ba>', 'instrument<as>', 'instrument<ss>', 'instrument<s_>', 'instrument<_ba>', 'instrument<bas>', 'instrument<ass>', 'instrument<ss_>']\n['instrument<_b>', 'instrument<ba>', 'instrument<as>', 'instrument<ss>', 'instrument<s_>', 'instrument<_g>', 'instrument<gu>', 'instrument<ui>', 'instrument<it>', 'instrument<ta>', 'instrument<ar>', 'instrument<r_>', 'instrument<_ba>', 'instrument<bas>', 'instrument<ass>', 'instrument<ss_>', 'instrument<_gu>', 'instrument<gui>', 'instrument<uit>', 'instrument<ita>', 'instrument<tar>', 'instrument<ar_>']\n\n\nSetting the label argument was important to ensure that the shingles match (and are hashed to the same slots) because the default behaviour of the function is to use the column name as a label: since the two columns have different names, the default wouldn’t have allowed the features to match to each other.\n\n\nPerforming the linkage\nWe can now perform the linkage by comparing these Bloom filter embeddings. We use the Soft Cosine Measure (which in this untrained model, is equivalent to a normal cosine similarity metric) to calculate record-wise similarity and an adapted Hungarian algorithm to match the records based on those similarities.\n\nsimilarities = embedder.compare(edf1, edf2)\nsimilarities\n\nSimilarityArray([[0.80050047, 0.10341754, 0.10047246],\n [0.34170424, 0.16480856, 0.63029481],\n [0.12155416, 0.54020787, 0.11933984]])\n\n\nThis SimilarityArray object is an augmented numpy.ndarray that can perform our matching. The matching itself can optionally be called with an absolute threshold score, but it doesn’t need one.\n\nmatching = similarities.match()\nmatching\n\n(array([0, 1, 2]), array([0, 2, 1]))\n\n\nSo, all three of the records in each dataset were matched correctly. Excellent!",
"crumbs": [
"About",
"Docs",
@@ -340,7 +340,7 @@
"href": "docs/tutorials/run-through.html",
"title": "Embedder API run-through",
"section": "",
- "text": "This article shows the main classes, methods and functionality of the Embedder API.\nFirst, we’ll import a few modules, including:\nimport os\n\nimport pandas as pd\n\nfrom pprl import EmbeddedDataFrame, Embedder, config\nfrom pprl.embedder import features as feat",
+ "text": "This article shows the main classes, methods and functionality of the Embedder API.\nFirst, we’ll import a few modules, including:\nimport os\nimport numpy as np\nimport pandas as pd\n\nfrom pprl import EmbeddedDataFrame, Embedder, config\nfrom pprl.embedder import features as feat",
"crumbs": [
"About",
"Docs",
@@ -353,7 +353,7 @@
"href": "docs/tutorials/run-through.html#data-set-up",
"title": "Embedder API run-through",
"section": "Data set-up",
- "text": "Data set-up\nFor this demo we’ll create a really minimal pair of datasets. Notice that they don’t have to have the same structure or field names.\n\ndf1 = pd.DataFrame(\n dict(\n id=[1,2,3],\n forename=[\"Henry\", \"Sally\", \"Ina\"],\n surname = [\"Tull\", \"Brown\", \"Lawrey\"],\n dob=[\"1/1/2001\", \"2/1/2001\", \"4/10/1995\"],\n gender=[\"male\", \"Male\", \"Female\"],\n )\n)\n\ndf2 = pd.DataFrame(\n dict(\n personid=[4,5,6],\n full_name=[\"Harry Tull\", \"Sali Brown\", \"Ina Laurie\"],\n date_of_birth=[\"2/1/2001\", \"2/1/2001\", \"4/11/1995\"],\n sex=[\"M\", \"M\", \"F\"],\n )\n)\n\nFeatures are extracted as different kinds of string objects from each field, ready to be hash embedded into the Bloom filters. We need to specify the feature extraction functions we’ll need.\nIn this case we’ll need one extractor for names, one for dates of birth, and one for sex/gender records. We create a dict with the functions we need. We create another dict to store any keyword arguments we want to pass in to each function (in this case we use all the default arguments so the keyword argument dictionaries are empty):\n\nfeature_factory = dict(\n name=feat.gen_name_features,\n dob=feat.gen_dateofbirth_features,\n sex=feat.gen_sex_features,\n)\n\nff_args = dict(name={}, sex={}, dob={})",
+ "text": "Data set-up\nFor this demo we’ll create a really minimal pair of datasets. Notice that they don’t have to have the same structure or field names.\n\ndf1 = pd.DataFrame(\n dict(\n id=[1,2,3],\n forename=[\"Henry\", \"Sally\", \"Ina\"],\n surname = [\"Tull\", \"Brown\", \"Lawrey\"],\n dob=[\"\", \"2/1/2001\", \"4/10/1995\"],\n gender=[\"male\", \"Male\", \"Female\"],\n county=[\"\", np.NaN, \"County Durham\"]\n )\n)\n\ndf2 = pd.DataFrame(\n dict(\n personid=[4,5,6],\n full_name=[\"Harry Tull\", \"Sali Brown\", \"Ina Laurie\"],\n date_of_birth=[\"2/1/2001\", \"2/1/2001\", \"4/11/1995\"],\n sex=[\"M\", \"M\", \"F\"],\n county=[\"Rutland\", \"Powys\", \"Durham\"]\n )\n)\n\nFeatures are extracted as different kinds of string objects from each field, ready to be hash embedded into the Bloom filters. We need to specify the feature extraction functions we’ll need.\nIn this case we’ll need one extractor for names, one for dates of birth, and one for sex/gender records. We create a dict with the functions we need. We create another dict to store any keyword arguments we want to pass in to each function (in this case we use all the default arguments so the keyword argument dictionaries are empty):\n\nfeature_factory = dict(\n name=feat.gen_name_features,\n dob=feat.gen_dateofbirth_features,\n sex=feat.gen_sex_features,\n misc=feat.gen_misc_features\n)\n\nff_args = dict(name={}, sex={}, dob={})",
"crumbs": [
"About",
"Docs",
@@ -366,7 +366,7 @@
"href": "docs/tutorials/run-through.html#embedding",
"title": "Embedder API run-through",
"section": "Embedding",
- "text": "Embedding\nNow we can create an Embedder object. We want our Bloom filter vectors to have a length of 1024 elements, and we choose to hash each feature two times. These choices seem to work ok, but we haven’t explored them systematically.\n\nembedder = Embedder(feature_factory,\n ff_args,\n bf_size = 2**10,\n num_hashes=2,\n )\n\nNow we can hash embed the dataset into an EmbeddedDataFrame (EDF). For this we need to pass a column specification colspec that maps each column of the data into the feature_factory functions. Any columns not mapped will not contribute to the embedding.\n\nedf1 = embedder.embed(\n df1, colspec=dict(forename=\"name\", surname=\"name\", dob=\"dob\", gender=\"sex\")\n)\nedf2 = embedder.embed(\n df2, colspec=dict(full_name=\"name\", date_of_birth=\"dob\", sex=\"sex\")\n)\n\nprint(edf1)\nprint(edf2)\n\n id forename surname dob gender \\\n0 1 Henry Tull 1/1/2001 male \n1 2 Sally Brown 2/1/2001 Male \n2 3 Ina Lawrey 4/10/1995 Female \n\n forename_features \\\n0 [_h, he, en, nr, ry, y_, _he, hen, enr, nry, ry_] \n1 [_s, sa, al, ll, ly, y_, _sa, sal, all, lly, ly_] \n2 [_i, in, na, a_, _in, ina, na_] \n\n surname_features \\\n0 [_t, tu, ul, ll, l_, _tu, tul, ull, ll_] \n1 [_b, br, ro, ow, wn, n_, _br, bro, row, own, wn_] \n2 [_l, la, aw, wr, re, ey, y_, _la, law, awr, wr... \n\n dob_features gender_features \\\n0 [day<01>, month<01>, year<2001>] [sex<m>] \n1 [day<02>, month<01>, year<2001>] [sex<m>] \n2 [day<04>, month<10>, year<1995>] [sex<f>] \n\n all_features \\\n0 [ll_, _tu, day<01>, ul, l_, sex<m>, ull, y_, _... \n1 [lly, day<02>, wn_, sex<m>, sal, wn, y_, ly_, ... \n2 [_in, _i, ey_, wr, y_, rey, wre, sex<f>, _l, _... \n\n bf_indices bf_norms \n0 [130, 644, 773, 903, 135, 776, 778, 265, 654, ... 6.708204 \n1 [129, 258, 130, 776, 523, 525, 398, 271, 671, ... 7.141428 \n2 [647, 394, 269, 13, 15, 532, 155, 28, 667, 413... 6.855655 \n personid full_name date_of_birth sex \\\n0 4 Harry Tull 2/1/2001 M \n1 5 Sali Brown 2/1/2001 M \n2 6 Ina Laurie 4/11/1995 F \n\n full_name_features \\\n0 [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _... \n1 [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _... \n2 [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _... \n\n date_of_birth_features sex_features \\\n0 [day<02>, month<01>, year<2001>] [sex<m>] \n1 [day<02>, month<01>, year<2001>] [sex<m>] \n2 [day<04>, month<11>, year<1995>] [sex<f>] \n\n all_features \\\n0 [ll_, _tu, day<02>, ar, ul, l_, sex<m>, ull, y... \n1 [day<02>, wn_, sex<m>, wn, sal, ow, al, n_, al... \n2 [ri, _in, _i, aur, ie_, ur, sex<f>, _l, au, _l... \n\n bf_indices bf_norms \n0 [640, 130, 644, 135, 776, 10, 778, 271, 402, 5... 6.708204 \n1 [130, 523, 525, 398, 271, 152, 671, 803, 806, ... 6.855655 \n2 [646, 647, 394, 269, 15, 272, 531, 532, 665, 6... 6.782330",
+ "text": "Embedding\nNow we can create an Embedder object. We want our Bloom filter vectors to have a length of 1024 elements, and we choose to hash each feature two times. These choices seem to work ok, but we haven’t explored them systematically.\n\nembedder = Embedder(feature_factory,\n ff_args,\n bf_size = 2**10,\n num_hashes=2,\n )\n\nNow we can hash embed the dataset into an EmbeddedDataFrame (EDF). For this we need to pass a column specification colspec that maps each column of the data into the feature_factory functions. Any columns not mapped will not contribute to the embedding.\n\nedf1 = embedder.embed(\n df1, colspec=dict(forename=\"name\", surname=\"name\", dob=\"dob\", gender=\"sex\", county=\"misc\")\n)\nedf2 = embedder.embed(\n df2, colspec=dict(full_name=\"name\", date_of_birth=\"dob\", sex=\"sex\", county=\"misc\")\n)\n\nprint(edf1)\nprint(edf2)\n\n id forename surname dob gender county \\\n0 1 Henry Tull male \n1 2 Sally Brown 2/1/2001 Male NaN \n2 3 Ina Lawrey 4/10/1995 Female County Durham \n\n forename_features \\\n0 [_h, he, en, nr, ry, y_, _he, hen, enr, nry, ry_] \n1 [_s, sa, al, ll, ly, y_, _sa, sal, all, lly, ly_] \n2 [_i, in, na, a_, _in, ina, na_] \n\n surname_features \\\n0 [_t, tu, ul, ll, l_, _tu, tul, ull, ll_] \n1 [_b, br, ro, ow, wn, n_, _br, bro, row, own, wn_] \n2 [_l, la, aw, wr, re, ey, y_, _la, law, awr, wr... \n\n dob_features gender_features county_features \\\n0 [] [sex<m>] \n1 [day<02>, month<01>, year<2001>] [sex<m>] \n2 [day<04>, month<10>, year<1995>] [sex<f>] [county<county durham>] \n\n all_features \\\n0 [ll, nr, ll_, _t, ull, _tu, _he, he, tu, hen, ... \n1 [all, ll, ro, n_, ow, sa, ly_, bro, month<01>,... \n2 [ina, ey, _in, re, wr, aw, law, la, na_, ey_, ... \n\n bf_indices bf_norms \n0 [644, 773, 135, 776, 265, 778, 271, 402, 404, ... 6.244998 \n1 [129, 258, 130, 776, 523, 525, 398, 271, 671, ... 7.141428 \n2 [647, 394, 269, 13, 15, 532, 667, 155, 413, 28... 7.000000 \n personid full_name date_of_birth sex county \\\n0 4 Harry Tull 2/1/2001 M Rutland \n1 5 Sali Brown 2/1/2001 M Powys \n2 6 Ina Laurie 4/11/1995 F Durham \n\n full_name_features \\\n0 [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _... \n1 [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _... \n2 [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _... \n\n date_of_birth_features sex_features county_features \\\n0 [day<02>, month<01>, year<2001>] [sex<m>] [county<rutland>] \n1 [day<02>, month<01>, year<2001>] [sex<m>] [county<powys>] \n2 [day<04>, month<11>, year<1995>] [sex<f>] [county<durham>] \n\n all_features \\\n0 [ll, ll_, rr, rry, ar, _ha, _t, ha, ull, count... \n1 [county<powys>, ro, li_, n_, ow, sa, bro, ali,... \n2 [ina, ie, aur, e_, _in, uri, la, na_, county<d... \n\n bf_indices bf_norms \n0 [640, 130, 644, 135, 776, 10, 778, 271, 402, 5... 6.855655 \n1 [130, 523, 525, 398, 271, 152, 671, 803, 806, ... 7.000000 \n2 [646, 647, 394, 269, 15, 272, 531, 532, 665, 6... 6.928203",
"crumbs": [
"About",
"Docs",
@@ -392,7 +392,7 @@
"href": "docs/tutorials/run-through.html#computing-the-similarity-scores-and-the-matching",
"title": "Embedder API run-through",
"section": "Computing the similarity scores and the matching",
- "text": "Computing the similarity scores and the matching\nNow we have two embedded datasets, we can compare them and compute all the pairwise Cosine similarity scores.\nFirst, we have to compute the vector norms of each Bloom vector (for scaling the Cosine similarity) and the thresholds (thresholds are explained here [link]). Computing the thresholds can be time-consuming for a larger dataset, because it essentially computes all pairwise comparisons of the data to itself.\n\n\n\n\n\n\n\n\n\n\npersonid\nfull_name\ndate_of_birth\nsex\nfull_name_features\ndate_of_birth_features\nsex_features\nall_features\nbf_indices\nbf_norms\nthresholds\n\n\n\n\n0\n4\nHarry Tull\n2/1/2001\nM\n[_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...\n[day<02>, month<01>, year<2001>]\n[sex<m>]\n[ll_, _tu, day<02>, ar, ul, l_, sex<m>, ull, y...\n[640, 130, 644, 135, 776, 10, 778, 271, 402, 5...\n6.708204\n0.195698\n\n\n1\n5\nSali Brown\n2/1/2001\nM\n[_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...\n[day<02>, month<01>, year<2001>]\n[sex<m>]\n[day<02>, wn_, sex<m>, wn, sal, ow, al, n_, al...\n[130, 523, 525, 398, 271, 152, 671, 803, 806, ...\n6.855655\n0.195698\n\n\n2\n6\nIna Laurie\n4/11/1995\nF\n[_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...\n[day<04>, month<11>, year<1995>]\n[sex<f>]\n[ri, _in, _i, aur, ie_, ur, sex<f>, _l, au, _l...\n[646, 647, 394, 269, 15, 272, 531, 532, 665, 6...\n6.782330\n0.086026\n\n\n\n\n\n\n\n\nNB: there’s also a flag to compute these at the same time as the embedding, but it doesn’t by default because, depending on the workflow, you may wish to compute the norms and thresholds at different times (e.g. on the server).\nNow you can compute the similarities:\n\nsimilarities = embedder.compare(edf1,edf2)\n\nprint(similarities)\n\n[[0.6666667 0.17395416 0. ]\n [0.29223802 0.79658223 0.08258402]\n [0.08697708 0.10638298 0.58067873]]\n\n\nFinally, you can compute the matching:\n\nmatching = similarities.match(abs_cutoff=0.5)\n\nprint(matching)\n\n(array([0, 1, 2]), array([0, 1, 2]))",
+ "text": "Computing the similarity scores and the matching\nNow we have two embedded datasets, we can compare them and compute all the pairwise Cosine similarity scores.\nFirst, we have to compute the vector norms of each Bloom vector (for scaling the Cosine similarity) and the thresholds (thresholds are explained here [link]). Computing the thresholds can be time-consuming for a larger dataset, because it essentially computes all pairwise comparisons of the data to itself.\n\n\n\n\n\n\n\n\n\n\npersonid\nfull_name\ndate_of_birth\nsex\ncounty\nfull_name_features\ndate_of_birth_features\nsex_features\ncounty_features\nall_features\nbf_indices\nbf_norms\nthresholds\n\n\n\n\n0\n4\nHarry Tull\n2/1/2001\nM\nRutland\n[_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...\n[day<02>, month<01>, year<2001>]\n[sex<m>]\n[county<rutland>]\n[ll, ll_, rr, rry, ar, _ha, _t, ha, ull, count...\n[640, 130, 644, 135, 776, 10, 778, 271, 402, 5...\n6.855655\n0.187541\n\n\n1\n5\nSali Brown\n2/1/2001\nM\nPowys\n[_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...\n[day<02>, month<01>, year<2001>]\n[sex<m>]\n[county<powys>]\n[county<powys>, ro, li_, n_, ow, sa, bro, ali,...\n[130, 523, 525, 398, 271, 152, 671, 803, 806, ...\n7.000000\n0.187541\n\n\n2\n6\nIna Laurie\n4/11/1995\nF\nDurham\n[_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...\n[day<04>, month<11>, year<1995>]\n[sex<f>]\n[county<durham>]\n[ina, ie, aur, e_, _in, uri, la, na_, county<d...\n[646, 647, 394, 269, 15, 272, 531, 532, 665, 6...\n6.928203\n0.082479\n\n\n\n\n\n\n\n\nNB: there’s also a flag to compute these at the same time as the embedding, but it doesn’t by default because, depending on the workflow, you may wish to compute the norms and thresholds at different times (e.g. on the server).\nNow you can compute the similarities:\n\nsimilarities = embedder.compare(edf1,edf2)\n\nprint(similarities)\n\n[[0.60728442 0.09150181 0. ]\n [0.2859526 0.78015612 0.08084521]\n [0.08335143 0.10204083 0.57735028]]\n\n\nFinally, you can compute the matching:\n\nmatching = similarities.match(abs_cutoff=0.5)\n\nprint(matching)\n\n(array([0, 1, 2]), array([0, 1, 2]))",
"crumbs": [
"About",
"Docs",
@@ -405,7 +405,7 @@
"href": "docs/tutorials/run-through.html#serialisation-and-file-io",
"title": "Embedder API run-through",
"section": "Serialisation and file I/O",
- "text": "Serialisation and file I/O\nThat’s how to do the workflow in one session. However, this demo follows a multi-stage workflow, so we need to be able to pass objects around. There are a couple of methods that enable file I/O and serialisation.\nFirst, the Embedder object itself needs to be written to file and loaded. The idea is to train it, share it to the data owning parties, and also to the matching server. For this purpose, it’s possible to pickle the entire Embedder object.\n\nembedder.to_pickle(\"embedder.pkl\")\n\nembedder_copy = Embedder.from_pickle(\"embedder.pkl\")\n\nThe copy has the same functionality as the original:\n\nsimilarities = embedder_copy.compare(edf1,edf2)\n\nprint(similarities)\n\n[[0.6666667 0.17395416 0. ]\n [0.29223802 0.79658223 0.08258402]\n [0.08697708 0.10638298 0.58067873]]\n\n\nNB: This won’t work if two datasets were embedded with different Embedder instances, even if they’re identical. The compare() method checks for the same embedder object memory reference so it won’t work if one was embedded with the original and the other with the copy. The way to fix this is to re-initialise the EmbeddedDataFrame with the new Embedder object.\n\nedf2_copy = EmbeddedDataFrame(edf2, embedder_copy)\n\nIn this case, be careful that the Embedder is compatible with the Bloom filter vectors in the EDF (i.e. uses the same parameters and feature factories), because while you can refresh the norms and thresholds, you can’t refresh the ‘bf_indices’ without reembedding the data frame.",
+ "text": "Serialisation and file I/O\nThat’s how to do the workflow in one session. However, this demo follows a multi-stage workflow, so we need to be able to pass objects around. There are a couple of methods that enable file I/O and serialisation.\nFirst, the Embedder object itself needs to be written to file and loaded. The idea is to train it, share it to the data owning parties, and also to the matching server. For this purpose, it’s possible to pickle the entire Embedder object.\n\nembedder.to_pickle(\"embedder.pkl\")\n\nembedder_copy = Embedder.from_pickle(\"embedder.pkl\")\n\nThe copy has the same functionality as the original:\n\nsimilarities = embedder_copy.compare(edf1,edf2)\n\nprint(similarities)\n\n[[0.60728442 0.09150181 0. ]\n [0.2859526 0.78015612 0.08084521]\n [0.08335143 0.10204083 0.57735028]]\n\n\nNB: This won’t work if two datasets were embedded with different Embedder instances, even if they’re identical. The compare() method checks for the same embedder object memory reference so it won’t work if one was embedded with the original and the other with the copy. The way to fix this is to re-initialise the EmbeddedDataFrame with the new Embedder object.\n\nedf2_copy = EmbeddedDataFrame(edf2, embedder_copy)\n\nIn this case, be careful that the Embedder is compatible with the Bloom filter vectors in the EDF (i.e. uses the same parameters and feature factories), because while you can refresh the norms and thresholds, you can’t refresh the ‘bf_indices’ without reembedding the data frame.",
"crumbs": [
"About",
"Docs",
@@ -496,7 +496,7 @@
"href": "docs/tutorials/example-febrl.html#calculate-similarity",
"title": "Linking the FEBRL datasets",
"section": "Calculate similarity",
- "text": "Calculate similarity\nCompute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.\n\nstart = time.time()\nedf1.update_thresholds()\nedf2.update_thresholds()\nend = time.time()\n\nprint(f\"Updating thresholds took {end - start:.2f} seconds\")\n\nUpdating thresholds took 8.35 seconds\n\n\nCompute the matrix of similarity scores.\n\nsimilarity_scores = embedder.compare(edf1,edf2)",
+ "text": "Calculate similarity\nCompute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.\n\nstart = time.time()\nedf1.update_thresholds()\nedf2.update_thresholds()\nend = time.time()\n\nprint(f\"Updating thresholds took {end - start:.2f} seconds\")\n\nUpdating thresholds took 8.40 seconds\n\n\nCompute the matrix of similarity scores.\n\nsimilarity_scores = embedder.compare(edf1,edf2)",
"crumbs": [
"About",
"Docs",
@@ -509,7 +509,7 @@
"href": "docs/tutorials/example-febrl.html#compute-a-match",
"title": "Linking the FEBRL datasets",
"section": "Compute a match",
- "text": "Compute a match\nUse the similarity scores to compute a match, using the Hungarian algorithm. First, we compute the match with the row thresholds.\n\nmatching = similarity_scores.match(require_thresholds=True)\n\nUsing the true IDs, evaluate the precision and recall of the match.\n\ndef get_results(edf1, edf2, matching):\n \"\"\"Get the results for a given matching.\"\"\"\n\n trueids_matched1 = edf1.iloc[matching[0], edf1.columns.get_loc(\"true_id\")]\n trueids_matched2 = edf2.iloc[matching[1], edf2.columns.get_loc(\"true_id\")]\n\n nmatches = len(matching[0])\n truepos = sum(map(np.equal, trueids_matched1, trueids_matched2))\n falsepos = nmatches - truepos\n\n print(\n f\"True pos: {truepos} | False pos: {falsepos} | \"\n f\"Precision: {truepos / nmatches:.1%} | Recall: {truepos / 5000:.1%}\"\n )\n\n return nmatches, truepos, falsepos\n\n_ = get_results(edf1, edf2, matching)\n\nTrue pos: 4973 | False pos: 0 | Precision: 100.0% | Recall: 99.5%\n\n\nThen, we compute the match without using the row thresholds, calculating the same performance metrics:\n\nmatching = similarity_scores.match(require_thresholds=False)\n_ = get_results(edf1, edf2, matching)\n\nTrue pos: 5000 | False pos: 0 | Precision: 100.0% | Recall: 100.0%\n\n\nWithout using the row thresholds, the number of false positives is larger, but the recall is much better. For some uses this balance may be preferable.\nIn testing, the use of local row thresholds provides a better trade-off between precision and recall, compared to using a single absolute threshold. It has the additional advantage, in a privacy-preserving setting, of being automatic and not requiring clerical review to set the level.",
+ "text": "Compute a match\nUse the similarity scores to compute a match, using the Hungarian algorithm. First, we compute the match with the row thresholds.\n\nmatching = similarity_scores.match(require_thresholds=True)\n\nUsing the true IDs, evaluate the precision and recall of the match.\n\ndef get_results(edf1, edf2, matching):\n \"\"\"Get the results for a given matching.\"\"\"\n\n trueids_matched1 = edf1.iloc[matching[0], edf1.columns.get_loc(\"true_id\")]\n trueids_matched2 = edf2.iloc[matching[1], edf2.columns.get_loc(\"true_id\")]\n\n nmatches = len(matching[0])\n truepos = sum(map(np.equal, trueids_matched1, trueids_matched2))\n falsepos = nmatches - truepos\n\n print(\n f\"True pos: {truepos} | False pos: {falsepos} | \"\n f\"Precision: {truepos / nmatches:.1%} | Recall: {truepos / 5000:.1%}\"\n )\n\n return nmatches, truepos, falsepos\n\n_ = get_results(edf1, edf2, matching)\n\nTrue pos: 4969 | False pos: 0 | Precision: 100.0% | Recall: 99.4%\n\n\nThen, we compute the match without using the row thresholds, calculating the same performance metrics:\n\nmatching = similarity_scores.match(require_thresholds=False)\n_ = get_results(edf1, edf2, matching)\n\nTrue pos: 5000 | False pos: 0 | Precision: 100.0% | Recall: 100.0%\n\n\nWithout using the row thresholds, the number of false positives is larger, but the recall is much better. For some uses this balance may be preferable.\nIn testing, the use of local row thresholds provides a better trade-off between precision and recall, compared to using a single absolute threshold. It has the additional advantage, in a privacy-preserving setting, of being automatic and not requiring clerical review to set the level.",
"crumbs": [
"About",
"Docs",
@@ -586,7 +586,7 @@
"href": "docs/reference/features.html",
"title": "features",
"section": "",
- "text": "embedder.features\nFeature generation functions for various column types.\n\n\n\n\n\nName\nDescription\n\n\n\n\ngen_dateofbirth_features\nGenerate labelled date features from a series of dates of birth.\n\n\ngen_double_metaphone\nGenerate the double methaphones of a string.\n\n\ngen_features\nGenerate string features of various types.\n\n\ngen_misc_features\nGenerate miscellaneous categorical features for a series.\n\n\ngen_misc_shingled_features\nGenerate shingled labelled features.\n\n\ngen_name_features\nGenerate a features series for a series of names.\n\n\ngen_ngram\nGenerate n-grams from a set of tokens.\n\n\ngen_sex_features\nGenerate labelled sex features from a series of sexes.\n\n\ngen_skip_grams\nGenerate skip 2-grams from a set of tokens.\n\n\nsplit_string_underscore\nSplit and underwrap a string at typical punctuation marks.\n\n\n\n\n\nembedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=['day<01>', 'month<01>', 'year<2050>'])\nGenerate labelled date features from a series of dates of birth.\nFeatures take the form [\"day<dd>\", \"month<mm>\", \"year<YYYY>\"]. Note that this feature generator can be used for any sort of date data, not just dates of birth.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndob\npandas.pandas.Series\nSeries of dates of birth.\nrequired\n\n\ndayfirst\nbool\nWhether the day comes first in the DOBs. Passed to pd.to_datetime() and defaults to True.\nTrue\n\n\nyearfirst\nbool\nWhether the year comes first in the DOBs. Passed to pd.to_datetime() and defaults to False.\nFalse\n\n\ndefault\nlist[str]\nDefault date to fill in missing data in feature (list) form. Default is the feature form of 2050-01-01.\n['day<01>', 'month<01>', 'year<2050>']\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of date features.\n\n\n\n\n\n\n\nembedder.features.gen_double_metaphone(string)\nGenerate the double methaphones of a string.\nThis function is a generator containing all the possible, non-empty double metaphones of a given string, separated by spaces. This function uses the metaphone.doublemetaphone() function under the hood, ignoring any empty strings. See their repository for details.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString from which to derive double metaphones.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next double metaphone in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_features(string, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate string features of various types.\nThis function is a generator capable of producing n-grams, skip 2-grams, and double metaphones from a single string. These outputs are referred to as features.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nBase string from which to generate features.\nrequired\n\n\nngram_length\nlist\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. 
Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next feature in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_misc_features(field, label=None)\nGenerate miscellaneous categorical features for a series.\nUseful for keeping raw columns in the linkage data. All features use a label and take the form [\"label<option>\"] except for missing data, which are coded as \"\".\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries from which to generate our features.\nrequired\n\n\nlabel\nNone | str | typing.Hashable\nLabel for the series. By default, the name of the series is used if available. Otherwise, if not specified, misc is used.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of miscellaneous features.\n\n\n\n\n\n\n\nembedder.features.gen_misc_shingled_features(field, ngram_length=[2, 3], use_gen_skip_grams=False, label=None)\nGenerate shingled labelled features.\nGenerate n-grams, with a label to distinguish them from (and ensure they’re hashed separately from) names. Like gen_name_features(), this function makes a call to gen_features() via pd.Series.apply().\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries of string data.\nrequired\n\n\nngram_length\nlist\nShingle sizes to generate. By default [2, 3].\n[2, 3]\n\n\nuse_gen_skip_grams\nbool\nWhether to generate skip 2-grams. False by default.\nFalse\n\n\nlabel\nstr\nA label to differentiate from other shingled features. If field has no name, this defaults to zz.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of shingled string features.\n\n\n\n\n\n\n\nembedder.features.gen_name_features(names, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate a features series for a series of names.\nEffectively, this function is a call to pd.Series.apply() using our gen_features() string feature generator function.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nnames\npandas.pandas.Series\nSeries of names.\nrequired\n\n\nngram_length\nlist[int]\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of features.\n\n\n\n\n\n\n\nembedder.features.gen_ngram(split_tokens, ngram_length)\nGenerate n-grams from a set of tokens.\nThis is a generator function that contains a series of n-grams the size of the sliding window.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form n-grams.\nrequired\n\n\nngram_length\nlist\nDesired lengths of n-grams. 
For examples, ngram_length=[2, 3] would generate all 2-grams and 3-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next n-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_sex_features(sexes)\nGenerate labelled sex features from a series of sexes.\nFeatures take the form [\"sex<option>\"] or [\"\"] for missing data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsexes\npandas.pandas.Series\nSeries of sex data.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of sex features.\n\n\n\n\n\n\n\nembedder.features.gen_skip_grams(split_tokens)\nGenerate skip 2-grams from a set of tokens.\nThis function is a generator that contains a series of skip 2-grams.\n\n\n>>> string = \"dave james\"\n>>> tokens = split_string_underscore(string)\n>>> skips = list(gen_skip_grams(tokens))\n>>> print(skips)\n[\"_a\", \"dv\", \"ae\", \"v_\", \"_a\", \"jm\", \"ae\", \"ms\", \"e_\"]\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form skip 2-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next skip 2-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.split_string_underscore(string)\nSplit and underwrap a string at typical punctuation marks.\nCurrently, we split at any combination of spaces, dashes, dots, commas, or underscores.\n\n\n>>> strings = (\"dave william johnson\", \"Francesca__Hogan-O'Malley\")\n>>> for string in strings:\n... print(split_string_underscore(string))\n[\"_dave_\", \"_william_\", \"_johnson_\"]\n[\"_Francesca_\", \"_Hogan_\", \"_O'Malley_\"]\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString to split.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nlist[str]\nList of the split and wrapped tokens.",
+ "text": "embedder.features\nFeature generation functions for various column types.\n\n\n\n\n\nName\nDescription\n\n\n\n\ngen_dateofbirth_features\nGenerate labelled date features from a series of dates of birth.\n\n\ngen_double_metaphone\nGenerate the double methaphones of a string.\n\n\ngen_features\nGenerate string features of various types.\n\n\ngen_misc_features\nGenerate miscellaneous categorical features for a series.\n\n\ngen_misc_shingled_features\nGenerate shingled labelled features.\n\n\ngen_name_features\nGenerate a features series for a series of names.\n\n\ngen_ngram\nGenerate n-grams from a set of tokens.\n\n\ngen_sex_features\nGenerate labelled sex features from a series of sexes.\n\n\ngen_skip_grams\nGenerate skip 2-grams from a set of tokens.\n\n\nsplit_string_underscore\nSplit and underwrap a string at typical punctuation marks.\n\n\n\n\n\nembedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=[])\nGenerate labelled date features from a series of dates of birth.\nFeatures take the form [\"day<dd>\", \"month<mm>\", \"year<YYYY>\"]. Note that this feature generator can be used for any sort of date data, not just dates of birth.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndob\npandas.pandas.Series\nSeries of dates of birth.\nrequired\n\n\ndayfirst\nbool\nWhether the day comes first in the DOBs. Passed to pd.to_datetime() and defaults to True.\nTrue\n\n\nyearfirst\nbool\nWhether the year comes first in the DOBs. Passed to pd.to_datetime() and defaults to False.\nFalse\n\n\ndefault\nlist[str]\nDefault date to fill in missing data in feature (list) form. Default is the feature form of 2050-01-01.\n[]\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of date features.\n\n\n\n\n\n\n\nembedder.features.gen_double_metaphone(string)\nGenerate the double methaphones of a string.\nThis function is a generator containing all the possible, non-empty double metaphones of a given string, separated by spaces. This function uses the metaphone.doublemetaphone() function under the hood, ignoring any empty strings. See their repository for details.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString from which to derive double metaphones.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next double metaphone in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_features(string, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate string features of various types.\nThis function is a generator capable of producing n-grams, skip 2-grams, and double metaphones from a single string. These outputs are referred to as features.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nBase string from which to generate features.\nrequired\n\n\nngram_length\nlist\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next feature in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_misc_features(field, label=None)\nGenerate miscellaneous categorical features for a series.\nUseful for keeping raw columns in the linkage data. 
All features use a label and take the form [\"label<option>\"] except for missing data, which are coded as \"\".\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries from which to generate our features.\nrequired\n\n\nlabel\nNone | str | typing.Hashable\nLabel for the series. By default, the name of the series is used if available. Otherwise, if not specified, misc is used.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of miscellaneous features.\n\n\n\n\n\n\n\nembedder.features.gen_misc_shingled_features(field, ngram_length=[2, 3], use_gen_skip_grams=False, label=None)\nGenerate shingled labelled features.\nGenerate n-grams, with a label to distinguish them from (and ensure they’re hashed separately from) names. Like gen_name_features(), this function makes a call to gen_features() via pd.Series.apply().\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries of string data.\nrequired\n\n\nngram_length\nlist\nShingle sizes to generate. By default [2, 3].\n[2, 3]\n\n\nuse_gen_skip_grams\nbool\nWhether to generate skip 2-grams. False by default.\nFalse\n\n\nlabel\nstr\nA label to differentiate from other shingled features. If field has no name, this defaults to zz.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of shingled string features.\n\n\n\n\n\n\n\nembedder.features.gen_name_features(names, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate a features series for a series of names.\nEffectively, this function is a call to pd.Series.apply() using our gen_features() string feature generator function.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nnames\npandas.pandas.Series\nSeries of names.\nrequired\n\n\nngram_length\nlist[int]\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of features.\n\n\n\n\n\n\n\nembedder.features.gen_ngram(split_tokens, ngram_length)\nGenerate n-grams from a set of tokens.\nThis is a generator function that contains a series of n-grams the size of the sliding window.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form n-grams.\nrequired\n\n\nngram_length\nlist\nDesired lengths of n-grams. 
For examples, ngram_length=[2, 3] would generate all 2-grams and 3-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next n-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_sex_features(sexes)\nGenerate labelled sex features from a series of sexes.\nFeatures take the form [\"sex<option>\"] or [\"\"] for missing data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsexes\npandas.pandas.Series\nSeries of sex data.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of sex features.\n\n\n\n\n\n\n\nembedder.features.gen_skip_grams(split_tokens)\nGenerate skip 2-grams from a set of tokens.\nThis function is a generator that contains a series of skip 2-grams.\n\n\n>>> string = \"dave james\"\n>>> tokens = split_string_underscore(string)\n>>> skips = list(gen_skip_grams(tokens))\n>>> print(skips)\n[\"_a\", \"dv\", \"ae\", \"v_\", \"_a\", \"jm\", \"ae\", \"ms\", \"e_\"]\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form skip 2-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next skip 2-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.split_string_underscore(string)\nSplit and underwrap a string at typical punctuation marks.\nCurrently, we split at any combination of spaces, dashes, dots, commas, or underscores.\n\n\n>>> strings = (\"dave william johnson\", \"Francesca__Hogan-O'Malley\")\n>>> for string in strings:\n... print(split_string_underscore(string))\n[\"_dave_\", \"_william_\", \"_johnson_\"]\n[\"_Francesca_\", \"_Hogan_\", \"_O'Malley_\"]\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString to split.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nlist[str]\nList of the split and wrapped tokens.",
"crumbs": [
"About",
"Docs",
@@ -599,7 +599,7 @@
"href": "docs/reference/features.html#functions",
"title": "features",
"section": "",
- "text": "Name\nDescription\n\n\n\n\ngen_dateofbirth_features\nGenerate labelled date features from a series of dates of birth.\n\n\ngen_double_metaphone\nGenerate the double methaphones of a string.\n\n\ngen_features\nGenerate string features of various types.\n\n\ngen_misc_features\nGenerate miscellaneous categorical features for a series.\n\n\ngen_misc_shingled_features\nGenerate shingled labelled features.\n\n\ngen_name_features\nGenerate a features series for a series of names.\n\n\ngen_ngram\nGenerate n-grams from a set of tokens.\n\n\ngen_sex_features\nGenerate labelled sex features from a series of sexes.\n\n\ngen_skip_grams\nGenerate skip 2-grams from a set of tokens.\n\n\nsplit_string_underscore\nSplit and underwrap a string at typical punctuation marks.\n\n\n\n\n\nembedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=['day<01>', 'month<01>', 'year<2050>'])\nGenerate labelled date features from a series of dates of birth.\nFeatures take the form [\"day<dd>\", \"month<mm>\", \"year<YYYY>\"]. Note that this feature generator can be used for any sort of date data, not just dates of birth.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndob\npandas.pandas.Series\nSeries of dates of birth.\nrequired\n\n\ndayfirst\nbool\nWhether the day comes first in the DOBs. Passed to pd.to_datetime() and defaults to True.\nTrue\n\n\nyearfirst\nbool\nWhether the year comes first in the DOBs. Passed to pd.to_datetime() and defaults to False.\nFalse\n\n\ndefault\nlist[str]\nDefault date to fill in missing data in feature (list) form. Default is the feature form of 2050-01-01.\n['day<01>', 'month<01>', 'year<2050>']\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of date features.\n\n\n\n\n\n\n\nembedder.features.gen_double_metaphone(string)\nGenerate the double methaphones of a string.\nThis function is a generator containing all the possible, non-empty double metaphones of a given string, separated by spaces. This function uses the metaphone.doublemetaphone() function under the hood, ignoring any empty strings. See their repository for details.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString from which to derive double metaphones.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next double metaphone in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_features(string, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate string features of various types.\nThis function is a generator capable of producing n-grams, skip 2-grams, and double metaphones from a single string. These outputs are referred to as features.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nBase string from which to generate features.\nrequired\n\n\nngram_length\nlist\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next feature in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_misc_features(field, label=None)\nGenerate miscellaneous categorical features for a series.\nUseful for keeping raw columns in the linkage data. 
All features use a label and take the form [\"label<option>\"] except for missing data, which are coded as \"\".\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries from which to generate our features.\nrequired\n\n\nlabel\nNone | str | typing.Hashable\nLabel for the series. By default, the name of the series is used if available. Otherwise, if not specified, misc is used.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of miscellaneous features.\n\n\n\n\n\n\n\nembedder.features.gen_misc_shingled_features(field, ngram_length=[2, 3], use_gen_skip_grams=False, label=None)\nGenerate shingled labelled features.\nGenerate n-grams, with a label to distinguish them from (and ensure they’re hashed separately from) names. Like gen_name_features(), this function makes a call to gen_features() via pd.Series.apply().\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries of string data.\nrequired\n\n\nngram_length\nlist\nShingle sizes to generate. By default [2, 3].\n[2, 3]\n\n\nuse_gen_skip_grams\nbool\nWhether to generate skip 2-grams. False by default.\nFalse\n\n\nlabel\nstr\nA label to differentiate from other shingled features. If field has no name, this defaults to zz.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of shingled string features.\n\n\n\n\n\n\n\nembedder.features.gen_name_features(names, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate a features series for a series of names.\nEffectively, this function is a call to pd.Series.apply() using our gen_features() string feature generator function.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nnames\npandas.pandas.Series\nSeries of names.\nrequired\n\n\nngram_length\nlist[int]\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of features.\n\n\n\n\n\n\n\nembedder.features.gen_ngram(split_tokens, ngram_length)\nGenerate n-grams from a set of tokens.\nThis is a generator function that contains a series of n-grams the size of the sliding window.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form n-grams.\nrequired\n\n\nngram_length\nlist\nDesired lengths of n-grams. 
For examples, ngram_length=[2, 3] would generate all 2-grams and 3-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next n-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_sex_features(sexes)\nGenerate labelled sex features from a series of sexes.\nFeatures take the form [\"sex<option>\"] or [\"\"] for missing data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsexes\npandas.pandas.Series\nSeries of sex data.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of sex features.\n\n\n\n\n\n\n\nembedder.features.gen_skip_grams(split_tokens)\nGenerate skip 2-grams from a set of tokens.\nThis function is a generator that contains a series of skip 2-grams.\n\n\n>>> string = \"dave james\"\n>>> tokens = split_string_underscore(string)\n>>> skips = list(gen_skip_grams(tokens))\n>>> print(skips)\n[\"_a\", \"dv\", \"ae\", \"v_\", \"_a\", \"jm\", \"ae\", \"ms\", \"e_\"]\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form skip 2-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next skip 2-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.split_string_underscore(string)\nSplit and underwrap a string at typical punctuation marks.\nCurrently, we split at any combination of spaces, dashes, dots, commas, or underscores.\n\n\n>>> strings = (\"dave william johnson\", \"Francesca__Hogan-O'Malley\")\n>>> for string in strings:\n... print(split_string_underscore(string))\n[\"_dave_\", \"_william_\", \"_johnson_\"]\n[\"_Francesca_\", \"_Hogan_\", \"_O'Malley_\"]\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString to split.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nlist[str]\nList of the split and wrapped tokens.",
+ "text": "Name\nDescription\n\n\n\n\ngen_dateofbirth_features\nGenerate labelled date features from a series of dates of birth.\n\n\ngen_double_metaphone\nGenerate the double methaphones of a string.\n\n\ngen_features\nGenerate string features of various types.\n\n\ngen_misc_features\nGenerate miscellaneous categorical features for a series.\n\n\ngen_misc_shingled_features\nGenerate shingled labelled features.\n\n\ngen_name_features\nGenerate a features series for a series of names.\n\n\ngen_ngram\nGenerate n-grams from a set of tokens.\n\n\ngen_sex_features\nGenerate labelled sex features from a series of sexes.\n\n\ngen_skip_grams\nGenerate skip 2-grams from a set of tokens.\n\n\nsplit_string_underscore\nSplit and underwrap a string at typical punctuation marks.\n\n\n\n\n\nembedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=[])\nGenerate labelled date features from a series of dates of birth.\nFeatures take the form [\"day<dd>\", \"month<mm>\", \"year<YYYY>\"]. Note that this feature generator can be used for any sort of date data, not just dates of birth.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndob\npandas.pandas.Series\nSeries of dates of birth.\nrequired\n\n\ndayfirst\nbool\nWhether the day comes first in the DOBs. Passed to pd.to_datetime() and defaults to True.\nTrue\n\n\nyearfirst\nbool\nWhether the year comes first in the DOBs. Passed to pd.to_datetime() and defaults to False.\nFalse\n\n\ndefault\nlist[str]\nDefault date to fill in missing data in feature (list) form. Default is the feature form of 2050-01-01.\n[]\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of date features.\n\n\n\n\n\n\n\nembedder.features.gen_double_metaphone(string)\nGenerate the double methaphones of a string.\nThis function is a generator containing all the possible, non-empty double metaphones of a given string, separated by spaces. This function uses the metaphone.doublemetaphone() function under the hood, ignoring any empty strings. See their repository for details.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString from which to derive double metaphones.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next double metaphone in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_features(string, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate string features of various types.\nThis function is a generator capable of producing n-grams, skip 2-grams, and double metaphones from a single string. These outputs are referred to as features.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nBase string from which to generate features.\nrequired\n\n\nngram_length\nlist\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next feature in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_misc_features(field, label=None)\nGenerate miscellaneous categorical features for a series.\nUseful for keeping raw columns in the linkage data. 
All features use a label and take the form [\"label<option>\"] except for missing data, which are coded as \"\".\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries from which to generate our features.\nrequired\n\n\nlabel\nNone | str | typing.Hashable\nLabel for the series. By default, the name of the series is used if available. Otherwise, if not specified, misc is used.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of miscellaneous features.\n\n\n\n\n\n\n\nembedder.features.gen_misc_shingled_features(field, ngram_length=[2, 3], use_gen_skip_grams=False, label=None)\nGenerate shingled labelled features.\nGenerate n-grams, with a label to distinguish them from (and ensure they’re hashed separately from) names. Like gen_name_features(), this function makes a call to gen_features() via pd.Series.apply().\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries of string data.\nrequired\n\n\nngram_length\nlist\nShingle sizes to generate. By default [2, 3].\n[2, 3]\n\n\nuse_gen_skip_grams\nbool\nWhether to generate skip 2-grams. False by default.\nFalse\n\n\nlabel\nstr\nA label to differentiate from other shingled features. If field has no name, this defaults to zz.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of shingled string features.\n\n\n\n\n\n\n\nembedder.features.gen_name_features(names, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate a features series for a series of names.\nEffectively, this function is a call to pd.Series.apply() using our gen_features() string feature generator function.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nnames\npandas.pandas.Series\nSeries of names.\nrequired\n\n\nngram_length\nlist[int]\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of features.\n\n\n\n\n\n\n\nembedder.features.gen_ngram(split_tokens, ngram_length)\nGenerate n-grams from a set of tokens.\nThis is a generator function that contains a series of n-grams the size of the sliding window.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form n-grams.\nrequired\n\n\nngram_length\nlist\nDesired lengths of n-grams. 
For examples, ngram_length=[2, 3] would generate all 2-grams and 3-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next n-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_sex_features(sexes)\nGenerate labelled sex features from a series of sexes.\nFeatures take the form [\"sex<option>\"] or [\"\"] for missing data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsexes\npandas.pandas.Series\nSeries of sex data.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of sex features.\n\n\n\n\n\n\n\nembedder.features.gen_skip_grams(split_tokens)\nGenerate skip 2-grams from a set of tokens.\nThis function is a generator that contains a series of skip 2-grams.\n\n\n>>> string = \"dave james\"\n>>> tokens = split_string_underscore(string)\n>>> skips = list(gen_skip_grams(tokens))\n>>> print(skips)\n[\"_a\", \"dv\", \"ae\", \"v_\", \"_a\", \"jm\", \"ae\", \"ms\", \"e_\"]\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form skip 2-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next skip 2-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.split_string_underscore(string)\nSplit and underwrap a string at typical punctuation marks.\nCurrently, we split at any combination of spaces, dashes, dots, commas, or underscores.\n\n\n>>> strings = (\"dave william johnson\", \"Francesca__Hogan-O'Malley\")\n>>> for string in strings:\n... print(split_string_underscore(string))\n[\"_dave_\", \"_william_\", \"_johnson_\"]\n[\"_Francesca_\", \"_Hogan_\", \"_O'Malley_\"]\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString to split.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nlist[str]\nList of the split and wrapped tokens.",
"crumbs": [
"About",
"Docs",
diff --git a/sitemap.xml b/sitemap.xml
index 812ac85..b322d28 100644
--- a/sitemap.xml
+++ b/sitemap.xml
@@ -2,66 +2,66 @@
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/index.html</loc>
-    <lastmod>2024-05-02T10:17:42.950Z</lastmod>
+    <lastmod>2024-05-08T14:03:10.597Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/reference/index.html</loc>
-    <lastmod>2024-05-02T10:18:31.033Z</lastmod>
+    <lastmod>2024-05-08T14:03:57.385Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/reference/config.html</loc>
-    <lastmod>2024-05-02T10:18:31.153Z</lastmod>
+    <lastmod>2024-05-08T14:03:57.505Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/reference/cloud.html</loc>
-    <lastmod>2024-05-02T10:18:31.185Z</lastmod>
+    <lastmod>2024-05-08T14:03:57.537Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/reference/embedder.html</loc>
-    <lastmod>2024-05-02T10:18:31.101Z</lastmod>
+    <lastmod>2024-05-08T14:03:57.453Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/reference/encryption.html</loc>
-    <lastmod>2024-05-02T10:18:31.149Z</lastmod>
+    <lastmod>2024-05-08T14:03:57.501Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/tutorials/example-verknupfung.html</loc>
-    <lastmod>2024-05-02T10:17:42.950Z</lastmod>
+    <lastmod>2024-05-08T14:03:10.597Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/tutorials/in-the-cloud.html</loc>
-    <lastmod>2024-05-02T10:17:42.950Z</lastmod>
+    <lastmod>2024-05-08T14:03:10.597Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/tutorials/run-through.html</loc>
-    <lastmod>2024-05-02T10:17:42.950Z</lastmod>
+    <lastmod>2024-05-08T14:03:10.597Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/tutorials/example-febrl.html</loc>
-    <lastmod>2024-05-02T10:17:42.950Z</lastmod>
+    <lastmod>2024-05-08T14:03:10.597Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/tutorials/index.html</loc>
-    <lastmod>2024-05-02T10:17:42.950Z</lastmod>
+    <lastmod>2024-05-08T14:03:10.597Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/reference/local.html</loc>
-    <lastmod>2024-05-02T10:18:31.189Z</lastmod>
+    <lastmod>2024-05-08T14:03:57.541Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/reference/bloom_filters.html</loc>
-    <lastmod>2024-05-02T10:18:31.053Z</lastmod>
+    <lastmod>2024-05-08T14:03:57.405Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/reference/features.html</loc>
-    <lastmod>2024-05-02T10:18:31.137Z</lastmod>
+    <lastmod>2024-05-08T14:03:57.485Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/reference/perform.html</loc>
-    <lastmod>2024-05-02T10:18:31.201Z</lastmod>
+    <lastmod>2024-05-08T14:03:57.553Z</lastmod>
  </url>
  <url>
    <loc>https://datasciencecampus.github.io/pprl_toolkit/docs/reference/utils.html</loc>
-    <lastmod>2024-05-02T10:18:31.169Z</lastmod>
+    <lastmod>2024-05-08T14:03:57.521Z</lastmod>