diff --git a/.nojekyll b/.nojekyll
index 3f6c135..cb3b51d 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-6b37c638
\ No newline at end of file
+7064a115
\ No newline at end of file
diff --git a/docs/reference/features.html b/docs/reference/features.html
index c108714..ecd8ab1 100644
--- a/docs/reference/features.html
+++ b/docs/reference/features.html
@@ -378,17 +378,17 @@

Functions

gen_dateofbirth_features

-

embedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=['day<01>', 'month<01>', 'year<2050>'])

+

embedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=[])

Generate labelled date features from a series of dates of birth.

Features take the form ["day<dd>", "month<mm>", "year<YYYY>"]. Note that this feature generator can be used for any sort of date data, not just dates of birth.
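
As a rough illustration of the feature form described above, the sketch below builds the labelled features with the standard library only. The helper name `date_features` and its `fmt` parameter are hypothetical and not part of the package, whose generator takes `dayfirst` and `yearfirst` instead. Values that cannot be parsed fall back to `default`, which mirrors the empty-list default in the new signature.

```python
from datetime import datetime


def date_features(dob, fmt="%d/%m/%Y", default=None):
    """Hypothetical helper: turn one date string into labelled features."""
    try:
        parsed = datetime.strptime(dob, fmt)
    except (TypeError, ValueError):
        # Missing or malformed dates fall back to the default feature list.
        return list(default or [])
    return [
        f"day<{parsed.day:02d}>",
        f"month<{parsed.month:02d}>",
        f"year<{parsed.year:04d}>",
    ]


print(date_features("2/1/2001"))  # ['day<02>', 'month<01>', 'year<2001>']
print(date_features(""))          # []
```

With an empty default, a record with a missing date contributes no date features at all, rather than sharing a placeholder date with every other incomplete record.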

Parameters

+++--- @@ -421,7 +421,7 @@

Parameters

default list[str] Default date to fill in missing data in feature (list) form. Default is the feature form of 2050-01-01.
-['day<01>', 'month<01>', 'year<2050>']
+[]
diff --git a/docs/tutorials/example-febrl.html b/docs/tutorials/example-febrl.html
index 0f83fce..c30a2d5 100644
--- a/docs/tutorials/example-febrl.html
+++ b/docs/tutorials/example-febrl.html
@@ -343,7 +343,7 @@

Linking the FEBRL datasets

This tutorial shows how the package can be used locally to match the FEBRL datasets, included as example datasets in the recordlinkage package.

-
+
import os
 import time
 from functools import partial
@@ -359,7 +359,7 @@ 

Linking the FEBRL datasets

Load the data

We use two datasets of 5000 records each, with no duplicates within either dataset; every record has a valid match in the other dataset.

After loading the data, we can parse the true matched ID number from the indices.

-
+
feb4a, feb4b = load_febrl4()
 
 feb4a["true_id"] = (
@@ -382,7 +382,7 @@ 

Create a feature
  • Pass a dictionary of dictionaries of keyword arguments as an optional ff_args parameter (e.g. ff_args = {"dob": {"dayfirst": False, "yearfirst": True}})
  • Use functools.partial(), as we have below.
  • -
    +
    feature_factory = dict(
         name=feat.gen_name_features,
         dob=partial(feat.gen_dateofbirth_features, dayfirst=False, yearfirst=True),
    @@ -396,7 +396,7 @@ 

    Create a feature

    Initialise the embedder instance

    This instance embeds each feature twice into a Bloom filter of length 1024.

    -
    +
    embedder = Embedder(feature_factory, bf_size=1024, num_hashes=2)
    @@ -418,7 +418,7 @@

    Embed the datasets

    For example, to ensure suburb doesn’t collide with state (if they happened to be the same), gen_misc_features() would encode each of their tokens as suburb<token> and state<token>, respectively. If you want to map different columns into the same feature, such as address below, you can set the label explicitly when passing the function to the embedder.

    -
    +
    colspec = dict(
         given_name="name",
         surname="name",
    @@ -436,7 +436,7 @@ 

    Embed the datasets

    edf2 = embedder.embed(feb4b, colspec=colspec)

    Store the embedded datasets and their embedder to file.

    -
    +
    edf1.to_json("party1_data.json")
     edf2.to_json("party2_data.json")
     embedder.to_pickle("embedder.pkl")
    @@ -445,7 +445,7 @@

    Embed the datasets

    Calculate similarity

    Compute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.

    -
    +
    start = time.time()
     edf1.update_thresholds()
     edf2.update_thresholds()
    @@ -453,22 +453,22 @@ 

    Calculate similarity
     print(f"Updating thresholds took {end - start:.2f} seconds")

    -
    Updating thresholds took 8.35 seconds
    +
    Updating thresholds took 8.40 seconds

    Compute the matrix of similarity scores.

    -
    +
    similarity_scores = embedder.compare(edf1,edf2)

    Compute a match

    Use the similarity scores to compute a match, using the Hungarian algorithm. First, we compute the match with the row thresholds.

    -
    +
    matching = similarity_scores.match(require_thresholds=True)

    Using the true IDs, evaluate the precision and recall of the match.

    -
    +
    def get_results(edf1, edf2, matching):
         """Get the results for a given matching."""
     
    @@ -488,11 +488,11 @@ 

    Compute a match

    _ = get_results(edf1, edf2, matching)
    -
    True pos: 4973 | False pos: 0 | Precision: 100.0% | Recall: 99.5%
    +
    True pos: 4969 | False pos: 0 | Precision: 100.0% | Recall: 99.4%

    Then, we compute the match without using the row thresholds, calculating the same performance metrics:

    -
    +
    matching = similarity_scores.match(require_thresholds=False)
     _ = get_results(edf1, edf2, matching)
    diff --git a/docs/tutorials/example-verknupfung.html b/docs/tutorials/example-verknupfung.html
    index 681bff2..4e571c0 100644
    --- a/docs/tutorials/example-verknupfung.html
    +++ b/docs/tutorials/example-verknupfung.html
    @@ -341,7 +341,7 @@

    Exploring a simple linkage example

    Loading the data

    First, we load our data into pandas.DataFrame objects. Here, the first records align, but the other two records should be swapped to have an aligned matching. We will use the toolkit to identify these matches.

    -
    +
    import pandas as pd
     
     df1 = pd.DataFrame(
    @@ -381,7 +381,7 @@ 

    Loading the data

    Creating and assigning a feature factory

    The next step is to decide how to process each of the columns in our datasets.

    To do this, we define a feature factory that maps column types to feature generation functions, and a column specification for each dataset mapping our columns to column types in the factory.

    -
    +
    from pprl.embedder import features
     from functools import partial
     
    @@ -419,7 +419,7 @@ 

    C

    Embedding the data

    With our specifications sorted out, we can get to creating our Bloom filter embedding. Before doing so, we need to decide on two parameters: the size of the filter and the number of hashes. By default, these are 1024 and 2, respectively.

    Once we’ve decided, we can create our Embedder instance and use it to embed our data with their column specifications.

    -
    +
    from pprl.embedder.embedder import Embedder
     
     embedder = Embedder(factory, bf_size=1024, num_hashes=2)
    @@ -428,7 +428,7 @@ 

    Embedding the data

    edf2 = embedder.embed(df2, colspec=spec2, update_thresholds=True)

    If we take a look at one of these embedded datasets, we can see that it has a whole bunch of new columns. There is a _features column for each of the original columns containing their pre-embedding string features, and there’s an all_features column that combines the features. Then there are three additional columns: bf_indices, bf_norms and thresholds.

    -
    +
    edf1.columns
    Index(['first_name', 'last_name', 'gender', 'date_of_birth', 'instrument',
    @@ -439,15 +439,15 @@ 

    Embedding the data

    The bf_indices column contains the Bloom filters, represented compactly as a list of non-zero indices for each record.

    -
    +
    print(edf1.bf_indices[0])
    -
    [2, 646, 903, 262, 9, 654, 15, 272, 17, 146, 526, 532, 531, 282, 667, 413, 670, 544, 288, 931, 292, 808, 937, 172, 942, 559, 816, 691, 820, 567, 440, 56, 823, 60, 61, 318, 319, 320, 577, 444, 836, 583, 332, 972, 590, 77, 593, 338, 465, 468, 84, 82, 851, 600, 211, 218, 861, 613, 871, 744, 238, 367, 881, 758, 890, 379, 1021, 763]
    +
    [2, 262, 646, 903, 9, 526, 15, 272, 654, 146, 531, 532, 17, 282, 667, 413, 670, 544, 288, 931, 292, 808, 937, 172, 942, 559, 816, 691, 820, 567, 823, 440, 56, 60, 61, 318, 319, 320, 444, 577, 836, 583, 332, 77, 972, 590, 465, 593, 211, 468, 82, 851, 338, 600, 84, 218, 861, 613, 871, 744, 238, 367, 881, 758, 890, 379, 1021, 763]

    The bf_norms column contains the norm of each Bloom filter with respect to the Soft Cosine Measure (SCM) matrix. In this case since we are using an untrained model, the SCM matrix is an identity matrix, and the norm is just the Euclidean norm of the Bloom filter represented as a binary vector, which is equal to np.sqrt(len(bf_indices[i])) for record i. The norm is used to scale the similarity measures so that they take values between -1 and 1.
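
To make the index representation and the norm concrete, here is a minimal sketch using only the standard library. The hashing scheme (seeded SHA-256 reduced modulo the filter size) and the names `bloom_indices`, `norm` and `cosine` are assumptions for illustration; the package may hash features differently. For binary vectors with an identity SCM matrix, the cosine similarity reduces to the number of shared indices divided by the product of the two norms.

```python
import hashlib
import math


def bloom_indices(features, bf_size=1024, num_hashes=2):
    """Hypothetical embedding: hash each feature num_hashes times."""
    indices = set()
    for feature in features:
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{feature}".encode("utf-8")).hexdigest()
            indices.add(int(digest, 16) % bf_size)
    return sorted(indices)


def norm(indices):
    """Euclidean norm of the binary vector with ones at `indices`."""
    return math.sqrt(len(indices))


def cosine(indices_a, indices_b):
    """Cosine similarity of two binary vectors given as index lists."""
    if not indices_a or not indices_b:
        return 0.0
    shared = len(set(indices_a) & set(indices_b))
    return shared / (norm(indices_a) * norm(indices_b))


a = bloom_indices(["sex<f>", "day<01>", "month<03>", "year<1977>"])
b = bloom_indices(["sex<f>", "day<23>", "month<03>", "year<1977>"])
print(norm(a), cosine(a, b))
```

The first record above lists 68 indices, so its norm is `math.sqrt(68)`, roughly 8.246.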

    The thresholds column is calculated to provide, for each record, a threshold similarity score below which it will not be matched. It’s like a reserve price in an auction – it stops a record being matched to another record when the similarity isn’t high enough. This is an innovative feature of our method; other linkage methods typically only have one global threshold score for the entire dataset.

    -
    +
    print(edf1.loc[:,["bf_norms","thresholds"]])
     print(edf2.loc[:,["bf_norms","thresholds"]])
    @@ -467,7 +467,7 @@

    Embedding the data

    The processed features

    Let’s take a look at how the features are processed into small text strings (shingles) before being hashed into the Bloom filter. The first record in the first dataset is the same person as the first record in the second dataset, although the data is not identical, so we can compare the processed features for these records to see how pprl puts them into a format where they can be compared.

    First, we’ll look at date of birth:

    -
    +
    print(edf1.date_of_birth_features[0])
     print(edf2.birth_date_features[0])
    @@ -477,7 +477,7 @@

    The processed featu

    Python can parse the different formats easily. Although the dates are slightly different in the dataset, the year and month will still match, even though the day will not.

    Then we’ll look at name:

    -
    +
    print(edf1.first_name_features[0] + edf1.last_name_features[0])
     print(edf2.name_features[0])
    @@ -487,7 +487,7 @@

    The processed featu

    The two datasets store the names differently, but this doesn’t matter for the Bloom filter method because it treats each record like a bag of features. By default, the name processor produces 2-grams and 3-grams.
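
The shingling step can be sketched as follows. The function `name_shingles` is a hypothetical stand-in, not the package's name processor; it assumes names are lowercased, split on whitespace, and padded with an underscore on each side before the 2-grams and then the 3-grams are taken.

```python
def name_shingles(name, sizes=(2, 3)):
    """Hypothetical shingler: lowercase, pad each token with '_', emit n-grams."""
    tokens = [f"_{token}_" for token in name.lower().split()]
    return [
        token[i : i + n]
        for n in sizes
        for token in tokens
        for i in range(len(token) - n + 1)
    ]


print(name_shingles("Laura"))
# ['_l', 'la', 'au', 'ur', 'ra', 'a_', '_la', 'lau', 'aur', 'ura', 'ra_']
```

Because the result is treated as a bag of features, it makes no difference whether first and last names arrive in one column or two.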

    The sex processing function just converts different formats to lowercase and takes the first letter. This will often be enough:

    -
    +
    print(edf1.gender_features[0])
     print(edf2.sex_features[0])
    @@ -496,7 +496,7 @@

    The processed featu

    Finally, we’ll see how our instrument feature function (partial(features.gen_misc_shingled_features, label="instrument")) processed the data:

    -
    +
    print(edf1.instrument_features[0])
     print(edf2.main_instrument_features[0])
    @@ -509,7 +509,7 @@

    The processed featu

    Performing the linkage

    We can now perform the linkage by comparing these Bloom filter embeddings. We use the Soft Cosine Measure (which in this untrained model, is equivalent to a normal cosine similarity metric) to calculate record-wise similarity and an adapted Hungarian algorithm to match the records based on those similarities.

    -
    +
    similarities = embedder.compare(edf1, edf2)
     similarities
    @@ -519,7 +519,7 @@

    Performing the link

    This SimilarityArray object is an augmented numpy.ndarray that can perform our matching. The matching itself can optionally be called with an absolute threshold score, but it doesn’t need one.

    -
    +
    matching = similarities.match()
     matching
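
For intuition about what the matching step optimises, the sketch below finds the one-to-one assignment with the highest total similarity by brute force. This is a deliberately simple substitute for the adapted Hungarian algorithm the package uses: it ignores thresholds and only scales to tiny inputs. The scores are the similarity values this example reports for its three records, and `best_assignment` is a hypothetical name.

```python
from itertools import permutations

similarities = [
    [0.80050047, 0.10341754, 0.10047246],
    [0.34170424, 0.16480856, 0.63029481],
    [0.12155416, 0.54020787, 0.11933984],
]


def best_assignment(scores):
    """Brute-force the one-to-one assignment with the highest total score."""
    rows = range(len(scores))
    best = max(
        permutations(range(len(scores[0])), len(scores)),
        key=lambda cols: sum(scores[r][c] for r, c in zip(rows, cols)),
    )
    return list(rows), list(best)


print(best_assignment(similarities))  # ([0, 1, 2], [0, 2, 1])
```

The first records align and the other two are swapped, which is the alignment described when the data was loaded.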
    diff --git a/docs/tutorials/index.html b/docs/tutorials/index.html
    index 113b2fa..d14e966 100644
    --- a/docs/tutorials/index.html
    +++ b/docs/tutorials/index.html
    @@ -384,7 +384,7 @@

    Tutorials

    - + Embedder API run-through @@ -395,7 +395,7 @@

    Tutorials

    5 min - + Exploring a simple linkage example @@ -406,7 +406,7 @@

    Tutorials

    6 min - + Linking the FEBRL datasets @@ -417,7 +417,7 @@

    Tutorials

    4 min - + Working in the cloud
    diff --git a/docs/tutorials/run-through.html b/docs/tutorials/run-through.html
    index ed46085..756526b 100644
    --- a/docs/tutorials/run-through.html
    +++ b/docs/tutorials/run-through.html
    @@ -346,9 +346,9 @@

    Embedder API run-through

  • the config module, which includes our package configuration (such as the location of data directories)
  • some classes from the main embedder module
  • -
    +
    import os
    -
    +import numpy as np
     import pandas as pd
     
     from pprl import EmbeddedDataFrame, Embedder, config
    @@ -357,42 +357,45 @@ 

    Embedder API run-through

    Data set-up

    For this demo we’ll create a really minimal pair of datasets. Notice that they don’t have to have the same structure or field names.

    -
    +
    df1 = pd.DataFrame(
         dict(
             id=[1,2,3],
             forename=["Henry", "Sally", "Ina"],
             surname = ["Tull", "Brown", "Lawrey"],
    -        dob=["1/1/2001", "2/1/2001", "4/10/1995"],
    +        dob=["", "2/1/2001", "4/10/1995"],
             gender=["male", "Male", "Female"],
    -    )
    -)
    -
    -df2 = pd.DataFrame(
    -    dict(
    -        personid=[4,5,6],
    -        full_name=["Harry Tull", "Sali Brown", "Ina Laurie"],
    -        date_of_birth=["2/1/2001", "2/1/2001", "4/11/1995"],
    -        sex=["M", "M", "F"],
    -    )
    -)
    +        county=["", np.NaN, "County Durham"]
    +    )
    +)
    +
    +df2 = pd.DataFrame(
    +    dict(
    +        personid=[4,5,6],
    +        full_name=["Harry Tull", "Sali Brown", "Ina Laurie"],
    +        date_of_birth=["2/1/2001", "2/1/2001", "4/11/1995"],
    +        sex=["M", "M", "F"],
    +        county=["Rutland", "Powys", "Durham"]
    +    )
    +)

    Features are extracted as different kinds of string objects from each field, ready to be hash embedded into the Bloom filters. We need to specify the feature extraction functions we’ll need.

    In this case we’ll need one extractor for names, one for dates of birth, and one for sex/gender records. We create a dict with the functions we need. We create another dict to store any keyword arguments we want to pass in to each function (in this case we use all the default arguments so the keyword argument dictionaries are empty):

    -
    +
    feature_factory = dict(
         name=feat.gen_name_features,
         dob=feat.gen_dateofbirth_features,
         sex=feat.gen_sex_features,
    -)
    -
    -ff_args = dict(name={}, sex={}, dob={})
    +    misc=feat.gen_misc_features
    +)
    +
    +ff_args = dict(name={}, sex={}, dob={})

    Embedding

    Now we can create an Embedder object. We want our Bloom filter vectors to have a length of 1024 elements, and we choose to hash each feature two times. These choices seem to work ok, but we haven’t explored them systematically.

    -
    +
    embedder = Embedder(feature_factory,
                         ff_args,
                         bf_size = 2**10,
    @@ -400,21 +403,21 @@ 

    Embedding

    )

    Now we can hash embed the dataset into an EmbeddedDataFrame (EDF). For this we need to pass a column specification colspec that maps each column of the data into the feature_factory functions. Any columns not mapped will not contribute to the embedding.

    -
    +
    edf1 = embedder.embed(
    -    df1, colspec=dict(forename="name", surname="name", dob="dob", gender="sex")
    +    df1, colspec=dict(forename="name", surname="name", dob="dob", gender="sex", county="misc")
     )
     edf2 = embedder.embed(
    -    df2, colspec=dict(full_name="name", date_of_birth="dob", sex="sex")
    +    df2, colspec=dict(full_name="name", date_of_birth="dob", sex="sex", county="misc")
     )
     
     print(edf1)
     print(edf2)
    -
       id forename surname        dob  gender  \
    -0   1    Henry    Tull   1/1/2001    male   
    -1   2    Sally   Brown   2/1/2001    Male   
    -2   3      Ina  Lawrey  4/10/1995  Female   
    +
       id forename surname        dob  gender         county  \
    +0   1    Henry    Tull               male                  
    +1   2    Sally   Brown   2/1/2001    Male            NaN   
    +2   3      Ina  Lawrey  4/10/1995  Female  County Durham   
     
                                        forename_features  \
     0  [_h, he, en, nr, ry, y_, _he, hen, enr, nry, ry_]   
    @@ -426,44 +429,44 @@ 

    Embedding

     1  [_b, br, ro, ow, wn, n_, _br, bro, row, own, wn_]
     2  [_l, la, aw, wr, re, ey, y_, _la, law, awr, wr...
    -  dob_features  gender_features  \
    -0  [day<01>, month<01>, year<2001>]  [sex<m>]
    -1  [day<02>, month<01>, year<2001>]  [sex<m>]
    -2  [day<04>, month<10>, year<1995>]  [sex<f>]
    +  dob_features  gender_features  county_features  \
    +0  []  [sex<m>]
    +1  [day<02>, month<01>, year<2001>]  [sex<m>]
    +2  [day<04>, month<10>, year<1995>]  [sex<f>]  [county<county durham>]
       all_features  \
    -0  [ll_, _tu, day<01>, ul, l_, sex<m>, ull, y_, _...
    -1  [lly, day<02>, wn_, sex<m>, sal, wn, y_, ly_, ...
    -2  [_in, _i, ey_, wr, y_, rey, wre, sex<f>, _l, _...
    +0  [ll, nr, ll_, _t, ull, _tu, _he, he, tu, hen, ...
    +1  [all, ll, ro, n_, ow, sa, ly_, bro, month<01>,...
    +2  [ina, ey, _in, re, wr, aw, law, la, na_, ey_, ...
       bf_indices  bf_norms
    -0  [130, 644, 773, 903, 135, 776, 778, 265, 654, ...  6.708204
    +0  [644, 773, 135, 776, 265, 778, 271, 402, 404, ...  6.244998
     1  [129, 258, 130, 776, 523, 525, 398, 271, 671, ...  7.141428
    -2  [647, 394, 269, 13, 15, 532, 155, 28, 667, 413...  6.855655
    -  personid  full_name  date_of_birth  sex  \
    -0  4  Harry Tull  2/1/2001  M
    -1  5  Sali Brown  2/1/2001  M
    -2  6  Ina Laurie  4/11/1995  F
    +2  [647, 394, 269, 13, 15, 532, 667, 155, 413, 28...  7.000000
    +  personid  full_name  date_of_birth  sex  county  \
    +0  4  Harry Tull  2/1/2001  M  Rutland
    +1  5  Sali Brown  2/1/2001  M  Powys
    +2  6  Ina Laurie  4/11/1995  F  Durham
       full_name_features  \
     0  [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...
     1  [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...
     2  [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...
    -  date_of_birth_features  sex_features  \
    -0  [day<02>, month<01>, year<2001>]  [sex<m>]
    -1  [day<02>, month<01>, year<2001>]  [sex<m>]
    -2  [day<04>, month<11>, year<1995>]  [sex<f>]
    +  date_of_birth_features  sex_features  county_features  \
    +0  [day<02>, month<01>, year<2001>]  [sex<m>]  [county<rutland>]
    +1  [day<02>, month<01>, year<2001>]  [sex<m>]  [county<powys>]
    +2  [day<04>, month<11>, year<1995>]  [sex<f>]  [county<durham>]
       all_features  \
    -0  [ll_, _tu, day<02>, ar, ul, l_, sex<m>, ull, y...
    -1  [day<02>, wn_, sex<m>, wn, sal, ow, al, n_, al...
    -2  [ri, _in, _i, aur, ie_, ur, sex<f>, _l, au, _l...
    +0  [ll, ll_, rr, rry, ar, _ha, _t, ha, ull, count...
    +1  [county<powys>, ro, li_, n_, ow, sa, bro, ali,...
    +2  [ina, ie, aur, e_, _in, uri, la, na_, county<d...
       bf_indices  bf_norms
    -0  [640, 130, 644, 135, 776, 10, 778, 271, 402, 5...  6.708204
    -1  [130, 523, 525, 398, 271, 152, 671, 803, 806, ...  6.855655
    -2  [646, 647, 394, 269, 15, 272, 531, 532, 665, 6...  6.782330
    +0  [640, 130, 644, 135, 776, 10, 778, 271, 402, 5...  6.855655
    +1  [130, 523, 525, 398, 271, 152, 671, 803, 806, ...  7.000000
    +2  [646, 647, 394, 269, 15, 272, 531, 532, 665, 6...  6.928203
    @@ -475,7 +478,7 @@

    Training

    Computing the similarity scores and the matching

    Now that we have two embedded datasets, we can compare them and compute all the pairwise cosine similarity scores.

    First, we have to compute the vector norms of each Bloom vector (for scaling the Cosine similarity) and the thresholds (thresholds are explained here [link]). Computing the thresholds can be time-consuming for a larger dataset, because it essentially computes all pairwise comparisons of the data to itself.
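
The quadratic cost is easiest to see in code. The sketch below assumes that a row's threshold is its highest similarity to any other record in the same dataset; that rule is a guess for illustration, and the package may define thresholds differently. The helper names and the toy index lists are hypothetical, and the lists are assumed to be non-empty.

```python
import math


def cosine(indices_a, indices_b):
    """Cosine similarity of two non-empty binary vectors given as index lists."""
    shared = len(set(indices_a) & set(indices_b))
    return shared / math.sqrt(len(indices_a) * len(indices_b))


def row_thresholds(bf_indices):
    """Assumed rule: a row's threshold is its best score against other rows."""
    thresholds = []
    for i, row in enumerate(bf_indices):
        # Every row is compared with every other row, hence the O(n^2) cost.
        others = [cosine(row, other) for j, other in enumerate(bf_indices) if j != i]
        thresholds.append(max(others, default=0.0))
    return thresholds


toy = [[1, 2, 3, 4], [3, 4, 5, 6], [7, 8, 9, 10]]
print(row_thresholds(toy))  # [0.5, 0.5, 0.0]
```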

    -
    +
    @@ -489,9 +492,11 @@

    full_name date_of_birth sex +county full_name_features date_of_birth_features sex_features +county_features all_features bf_indices bf_norms @@ -505,13 +510,15 @@

    1 @@ -519,13 +526,15 @@

    2 @@ -533,13 +542,15 @@

    +
    similarities = embedder.compare(edf1,edf2)
     
     print(similarities)
    -
    [[0.6666667  0.17395416 0.        ]
    - [0.29223802 0.79658223 0.08258402]
    - [0.08697708 0.10638298 0.58067873]]
    +
    [[0.60728442 0.09150181 0.        ]
    + [0.2859526  0.78015612 0.08084521]
    + [0.08335143 0.10204083 0.57735028]]

    Finally, you can compute the matching:

    -
    +
    matching = similarities.match(abs_cutoff=0.5)
     
     print(matching)
    @@ -574,24 +585,24 @@

    Serialisation and file I/O

    That’s how to do the workflow in one session. However, this demo follows a multi-stage workflow, so we need to be able to pass objects around. There are a couple of methods that enable file I/O and serialisation.

    First, the Embedder object itself needs to be written to file and loaded. The idea is to train it, then share it with the data-owning parties and with the matching server. For this purpose, it’s possible to pickle the entire Embedder object.

    -
    +
    embedder.to_pickle("embedder.pkl")
     
     embedder_copy = Embedder.from_pickle("embedder.pkl")

    The copy has the same functionality as the original:

    -
    +
    similarities = embedder_copy.compare(edf1,edf2)
     
     print(similarities)
    -
    [[0.6666667  0.17395416 0.        ]
    - [0.29223802 0.79658223 0.08258402]
    - [0.08697708 0.10638298 0.58067873]]
    +
    [[0.60728442 0.09150181 0.        ]
    + [0.2859526  0.78015612 0.08084521]
    + [0.08335143 0.10204083 0.57735028]]

    NB: This won’t work if two datasets were embedded with different Embedder instances, even if they’re identical. The compare() method checks for the same embedder object memory reference so it won’t work if one was embedded with the original and the other with the copy. The way to fix this is to re-initialise the EmbeddedDataFrame with the new Embedder object.

    -
    +
    edf2_copy = EmbeddedDataFrame(edf2, embedder_copy)

    In this case, be careful that the Embedder is compatible with the Bloom filter vectors in the EDF (i.e. uses the same parameters and feature factories), because while you can refresh the norms and thresholds, you can’t refresh the ‘bf_indices’ without re-embedding the data frame.
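
The difference between an identical embedder and the same embedder object can be shown in plain Python. `ToyEmbedder` and `same_embedder` are hypothetical stand-ins used only to contrast equality with identity; they do not reproduce how `compare()` is implemented.

```python
class ToyEmbedder:
    """Hypothetical stand-in for an embedder; holds only its parameters."""

    def __init__(self, bf_size=1024, num_hashes=2):
        self.bf_size = bf_size
        self.num_hashes = num_hashes

    def __eq__(self, other):
        return (self.bf_size, self.num_hashes) == (other.bf_size, other.num_hashes)


def same_embedder(left, right):
    """Identity check: True only when both names refer to one object."""
    return left is right


original = ToyEmbedder()
copy = ToyEmbedder()
print(original == copy, same_embedder(original, copy))  # True False
```

A pickled and reloaded embedder behaves like `copy` here: equal in every parameter, but a different object in memory.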

    @@ -599,7 +610,7 @@

    Serialisation an

    Serialising the data

    The EDF objects are just a thin wrapper around pandas.DataFrame instances, so you can serialise to JSON using the normal methods.

    -
    +
    edf1.to_json("edf1.json")
     
     edf1_copy = pd.read_json("edf1.json")
    @@ -613,7 +624,7 @@ 

    Serialising the data

    The bf_indices, bf_norms and thresholds columns will be preserved. However, this demotes the data frames back to normal pandas.DataFrame instances and loses the link to an Embedder instance.

    To fix this, just re-initialise them:

    -
    +
    edf1_copy = EmbeddedDataFrame(edf1_copy, embedder_copy)
    diff --git a/search.json b/search.json index 8c077a3..ab28c72 100644 --- a/search.json +++ b/search.json @@ -223,7 +223,7 @@ "href": "docs/tutorials/example-verknupfung.html", "title": "Exploring a simple linkage example", "section": "", - "text": "The Python package implements the Bloom filter linkage method (Schnell et al., 2009), and can also implement pretrained Hash embeddings (Miranda et al., 2022), if a suitable large, pre-matched corpus of data is available.\nLet us consider a small example where we want to link two excerpts of data on bands. In this scenario, we are looking at some toy data on the members of a fictional, German rock trio called “Verknüpfung”. In this example we will see how to use untrained Bloom filters to match data.\n\nLoading the data\nFirst, we load our data into pandas.DataFrame objects. Here, the first records align, but the other two records should be swapped to have an aligned matching. We will use the toolkit to identify these matches.\n\nimport pandas as pd\n\ndf1 = pd.DataFrame(\n {\n \"first_name\": [\"Laura\", \"Kaspar\", \"Grete\"],\n \"last_name\": [\"Daten\", \"Gorman\", \"Knopf\"],\n \"gender\": [\"F\", \"M\", \"F\"],\n \"date_of_birth\": [\"01/03/1977\", \"31/12/1975\", \"12/7/1981\"],\n \"instrument\": [\"bass\", \"guitar\", \"drums\"],\n }\n)\ndf2 = pd.DataFrame(\n {\n \"name\": [\"Laura Datten\", \"Greta Knopf\", \"Casper Goreman\"],\n \"sex\": [\"female\", \"female\", \"male\"],\n \"main_instrument\": [\"bass guitar\", \"percussion\", \"electric guitar\"],\n \"birth_date\": [\"1977-03-23\", \"1981-07-12\", \"1975-12-31\"],\n }\n)\n\n\n\n\n\n\n\nNote\n\n\n\nThese datasets don’t have the same column names or follow the same encodings, and there are several spelling mistakes in the names of the band members, as well as a typo in the dates.\nThankfully, the PPRL Toolkit is flexible enough to handle this!\n\n\n\n\nCreating and assigning a feature factory\nThe next step is to decide how to process each of the columns 
in our datasets.\nTo do this, we define a feature factory that maps column types to feature generation functions, and a column specification for each dataset mapping our columns to column types in the factory.\n\nfrom pprl.embedder import features\nfrom functools import partial\n\nfactory = dict(\n name=features.gen_name_features,\n sex=features.gen_sex_features,\n misc=features.gen_misc_features,\n dob=features.gen_dateofbirth_features,\n instrument=partial(features.gen_misc_shingled_features, label=\"instrument\")\n)\nspec1 = dict(\n first_name=\"name\",\n last_name=\"name\",\n gender=\"sex\",\n instrument=\"instrument\",\n date_of_birth=\"dob\",\n)\nspec2 = dict(name=\"name\", sex=\"sex\", main_instrument=\"instrument\", birth_date=\"dob\")\n\n\n\n\n\n\n\nTip\n\n\n\nThe feature generation functions, features.gen_XXX_features have sensible default parameters, but sometimes have to be passed in to the feature factory with different parameters, such as to set a feature label in the example above. There are two ways to achieve this. Either use functools.partial to set parameters (as above), or pass keyword arguments as a dictionary of dictionaries to the Embedder as ff_args.\n\n\n\n\nEmbedding the data\nWith our specifications sorted out, we can get to creating our Bloom filter embedding. Before doing so, we need to decide on two parameters: the size of the filter and the number of hashes. By default, these are 1024 and 2, respectively.\nOnce we’ve decided, we can create our Embedder instance and use it to embed our data with their column specifications.\n\nfrom pprl.embedder.embedder import Embedder\n\nembedder = Embedder(factory, bf_size=1024, num_hashes=2)\n\nedf1 = embedder.embed(df1, colspec=spec1, update_thresholds=True)\nedf2 = embedder.embed(df2, colspec=spec2, update_thresholds=True)\n\nIf we take a look at one of these embedded datasets, we can see that it has a whole bunch of new columns. 
There is a _features column for each of the original columns containing their pre-embedding string features, and there’s an all_features column that combines the features. Then there are three additional columns: bf_indices, bf_norms and thresholds.\n\nedf1.columns\n\nIndex(['first_name', 'last_name', 'gender', 'date_of_birth', 'instrument',\n 'first_name_features', 'last_name_features', 'gender_features',\n 'instrument_features', 'date_of_birth_features', 'all_features',\n 'bf_indices', 'bf_norms', 'thresholds'],\n dtype='object')\n\n\nThe bf_indices column contains the Bloom filters, represented compactly as a list of non-zero indices for each record.\n\nprint(edf1.bf_indices[0])\n\n[2, 646, 903, 262, 9, 654, 15, 272, 17, 146, 526, 532, 531, 282, 667, 413, 670, 544, 288, 931, 292, 808, 937, 172, 942, 559, 816, 691, 820, 567, 440, 56, 823, 60, 61, 318, 319, 320, 577, 444, 836, 583, 332, 972, 590, 77, 593, 338, 465, 468, 84, 82, 851, 600, 211, 218, 861, 613, 871, 744, 238, 367, 881, 758, 890, 379, 1021, 763]\n\n\nThe bf_norms column contains the norm of each Bloom filter with respect to the Soft Cosine Measure (SCM) matrix. In this case since we are using an untrained model, the SCM matrix is an identity matrix, and the norm is just the Euclidean norm of the Bloom filter represented as a binary vector, which is equal to np.sqrt(len(bf_indices[i])) for record i. The norm is used to scale the similarity measures so that they take values between -1 and 1.\nThe thresholds column is calculated to provide, for each record, a threshold similarity score below which it will not be matched. It’s like a reserve price in an auction – it stops a record being matched to another record when the similarity isn’t high enough. 
This is an innovative feature of our method; other linkage methods typically only have one global threshold score for the entire dataset.\n\nprint(edf1.loc[:,[\"bf_norms\",\"thresholds\"]])\nprint(edf2.loc[:,[\"bf_norms\",\"thresholds\"]])\n\n bf_norms thresholds\n0 8.246211 0.114332\n1 9.055386 0.143159\n2 8.485281 0.143159\n bf_norms thresholds\n0 9.695360 0.294345\n1 9.380832 0.157014\n2 10.862781 0.294345\n\n\n\n\n\nThe processed features\nLet’s take a look at how the features are processed into small text strings (shingles) before being hashed into the Bloom filter. The first record in the first dataset is the same person as the first record in the second dataset, although the data is not identical, so we can compare the processed features for these records to see how pprl puts them into a format where they can be compared.\nFirst, we’ll look at date of birth:\n\nprint(edf1.date_of_birth_features[0])\nprint(edf2.birth_date_features[0])\n\n['day<01>', 'month<03>', 'year<1977>']\n['day<23>', 'month<03>', 'year<1977>']\n\n\nPython can parse the different formats easily. Although the dates are slightly different in the dataset, the year and month will still match, even though the day will not.\nThen we’ll look at name:\n\nprint(edf1.first_name_features[0] + edf1.last_name_features[0])\nprint(edf2.name_features[0])\n\n['_l', 'la', 'au', 'ur', 'ra', 'a_', '_la', 'lau', 'aur', 'ura', 'ra_', '_d', 'da', 'at', 'te', 'en', 'n_', '_da', 'dat', 'ate', 'ten', 'en_']\n['_l', 'la', 'au', 'ur', 'ra', 'a_', '_d', 'da', 'at', 'tt', 'te', 'en', 'n_', '_la', 'lau', 'aur', 'ura', 'ra_', '_da', 'dat', 'att', 'tte', 'ten', 'en_']\n\n\nThe two datasets store the names differently, but this doesn’t matter for the Bloom filter method because it treats each record like a bag of features. By default, the name processor produces 2-grams and 3-grams.\nThe sex processing function just converts different formats to lowercase and takes the first letter. 
This will often be enough:\n\nprint(edf1.gender_features[0])\nprint(edf2.sex_features[0])\n\n['sex<f>']\n['sex<f>']\n\n\nFinally, we’ll see how our instrument feature function (partial(features.gen_misc_shingled_features, label=\"instrument\")) processed the data:\n\nprint(edf1.instrument_features[0])\nprint(edf2.main_instrument_features[0])\n\n['instrument<_b>', 'instrument<ba>', 'instrument<as>', 'instrument<ss>', 'instrument<s_>', 'instrument<_ba>', 'instrument<bas>', 'instrument<ass>', 'instrument<ss_>']\n['instrument<_b>', 'instrument<ba>', 'instrument<as>', 'instrument<ss>', 'instrument<s_>', 'instrument<_g>', 'instrument<gu>', 'instrument<ui>', 'instrument<it>', 'instrument<ta>', 'instrument<ar>', 'instrument<r_>', 'instrument<_ba>', 'instrument<bas>', 'instrument<ass>', 'instrument<ss_>', 'instrument<_gu>', 'instrument<gui>', 'instrument<uit>', 'instrument<ita>', 'instrument<tar>', 'instrument<ar_>']\n\n\nSetting the label argument was important to ensure that the shingles match (and are hashed to the same slots) because the default behaviour of the function is to use the column name as a label: since the two columns have different names, the default wouldn’t have allowed the features to match to each other.\n\n\nPerforming the linkage\nWe can now perform the linkage by comparing these Bloom filter embeddings. We use the Soft Cosine Measure (which in this untrained model, is equivalent to a normal cosine similarity metric) to calculate record-wise similarity and an adapted Hungarian algorithm to match the records based on those similarities.\n\nsimilarities = embedder.compare(edf1, edf2)\nsimilarities\n\nSimilarityArray([[0.80050047, 0.10341754, 0.10047246],\n [0.34170424, 0.16480856, 0.63029481],\n [0.12155416, 0.54020787, 0.11933984]])\n\n\nThis SimilarityArray object is an augmented numpy.ndarray that can perform our matching. 
The matching itself can optionally be called with an absolute threshold score, but it doesn’t need one.\n\nmatching = similarities.match()\nmatching\n\n(array([0, 1, 2]), array([0, 2, 1]))\n\n\nSo, all three of the records in each dataset were matched correctly. Excellent!", + "text": "The Python package implements the Bloom filter linkage method (Schnell et al., 2009), and can also implement pretrained Hash embeddings (Miranda et al., 2022), if a suitable large, pre-matched corpus of data is available.\nLet us consider a small example where we want to link two excerpts of data on bands. In this scenario, we are looking at some toy data on the members of a fictional, German rock trio called “Verknüpfung”. In this example we will see how to use untrained Bloom filters to match data.\n\nLoading the data\nFirst, we load our data into pandas.DataFrame objects. Here, the first records align, but the other two records should be swapped to have an aligned matching. We will use the toolkit to identify these matches.\n\nimport pandas as pd\n\ndf1 = pd.DataFrame(\n {\n \"first_name\": [\"Laura\", \"Kaspar\", \"Grete\"],\n \"last_name\": [\"Daten\", \"Gorman\", \"Knopf\"],\n \"gender\": [\"F\", \"M\", \"F\"],\n \"date_of_birth\": [\"01/03/1977\", \"31/12/1975\", \"12/7/1981\"],\n \"instrument\": [\"bass\", \"guitar\", \"drums\"],\n }\n)\ndf2 = pd.DataFrame(\n {\n \"name\": [\"Laura Datten\", \"Greta Knopf\", \"Casper Goreman\"],\n \"sex\": [\"female\", \"female\", \"male\"],\n \"main_instrument\": [\"bass guitar\", \"percussion\", \"electric guitar\"],\n \"birth_date\": [\"1977-03-23\", \"1981-07-12\", \"1975-12-31\"],\n }\n)\n\n\n\n\n\n\n\nNote\n\n\n\nThese datasets don’t have the same column names or follow the same encodings, and there are several spelling mistakes in the names of the band members, as well as a typo in the dates.\nThankfully, the PPRL Toolkit is flexible enough to handle this!\n\n\n\n\nCreating and assigning a feature factory\nThe next step is to decide 
how to process each of the columns in our datasets.\nTo do this, we define a feature factory that maps column types to feature generation functions, and a column specification for each dataset mapping our columns to column types in the factory.\n\nfrom pprl.embedder import features\nfrom functools import partial\n\nfactory = dict(\n name=features.gen_name_features,\n sex=features.gen_sex_features,\n misc=features.gen_misc_features,\n dob=features.gen_dateofbirth_features,\n instrument=partial(features.gen_misc_shingled_features, label=\"instrument\")\n)\nspec1 = dict(\n first_name=\"name\",\n last_name=\"name\",\n gender=\"sex\",\n instrument=\"instrument\",\n date_of_birth=\"dob\",\n)\nspec2 = dict(name=\"name\", sex=\"sex\", main_instrument=\"instrument\", birth_date=\"dob\")\n\n\n\n\n\n\n\nTip\n\n\n\nThe feature generation functions, features.gen_XXX_features have sensible default parameters, but sometimes have to be passed in to the feature factory with different parameters, such as to set a feature label in the example above. There are two ways to achieve this. Either use functools.partial to set parameters (as above), or pass keyword arguments as a dictionary of dictionaries to the Embedder as ff_args.\n\n\n\n\nEmbedding the data\nWith our specifications sorted out, we can get to creating our Bloom filter embedding. Before doing so, we need to decide on two parameters: the size of the filter and the number of hashes. By default, these are 1024 and 2, respectively.\nOnce we’ve decided, we can create our Embedder instance and use it to embed our data with their column specifications.\n\nfrom pprl.embedder.embedder import Embedder\n\nembedder = Embedder(factory, bf_size=1024, num_hashes=2)\n\nedf1 = embedder.embed(df1, colspec=spec1, update_thresholds=True)\nedf2 = embedder.embed(df2, colspec=spec2, update_thresholds=True)\n\nIf we take a look at one of these embedded datasets, we can see that it has a whole bunch of new columns. 
There is a _features column for each of the original columns containing their pre-embedding string features, and there’s an all_features column that combines the features. Then there are three additional columns: bf_indices, bf_norms and thresholds.\n\nedf1.columns\n\nIndex(['first_name', 'last_name', 'gender', 'date_of_birth', 'instrument',\n 'first_name_features', 'last_name_features', 'gender_features',\n 'instrument_features', 'date_of_birth_features', 'all_features',\n 'bf_indices', 'bf_norms', 'thresholds'],\n dtype='object')\n\n\nThe bf_indices column contains the Bloom filters, represented compactly as a list of non-zero indices for each record.\n\nprint(edf1.bf_indices[0])\n\n[2, 262, 646, 903, 9, 526, 15, 272, 654, 146, 531, 532, 17, 282, 667, 413, 670, 544, 288, 931, 292, 808, 937, 172, 942, 559, 816, 691, 820, 567, 823, 440, 56, 60, 61, 318, 319, 320, 444, 577, 836, 583, 332, 77, 972, 590, 465, 593, 211, 468, 82, 851, 338, 600, 84, 218, 861, 613, 871, 744, 238, 367, 881, 758, 890, 379, 1021, 763]\n\n\nThe bf_norms column contains the norm of each Bloom filter with respect to the Soft Cosine Measure (SCM) matrix. In this case since we are using an untrained model, the SCM matrix is an identity matrix, and the norm is just the Euclidean norm of the Bloom filter represented as a binary vector, which is equal to np.sqrt(len(bf_indices[i])) for record i. The norm is used to scale the similarity measures so that they take values between -1 and 1.\nThe thresholds column is calculated to provide, for each record, a threshold similarity score below which it will not be matched. It’s like a reserve price in an auction – it stops a record being matched to another record when the similarity isn’t high enough. 
This is an innovative feature of our method; other linkage methods typically only have one global threshold score for the entire dataset.\n\nprint(edf1.loc[:,[\"bf_norms\",\"thresholds\"]])\nprint(edf2.loc[:,[\"bf_norms\",\"thresholds\"]])\n\n bf_norms thresholds\n0 8.246211 0.114332\n1 9.055386 0.143159\n2 8.485281 0.143159\n bf_norms thresholds\n0 9.695360 0.294345\n1 9.380832 0.157014\n2 10.862781 0.294345\n\n\n\n\n\nThe processed features\nLet’s take a look at how the features are processed into small text strings (shingles) before being hashed into the Bloom filter. The first record in the first dataset is the same person as the first record in the second dataset, although the data is not identical, so we can compare the processed features for these records to see how pprl puts them into a format where they can be compared.\nFirst, we’ll look at date of birth:\n\nprint(edf1.date_of_birth_features[0])\nprint(edf2.birth_date_features[0])\n\n['day<01>', 'month<03>', 'year<1977>']\n['day<23>', 'month<03>', 'year<1977>']\n\n\nPython can parse the different formats easily. Although the dates are slightly different in the dataset, the year and month will still match, even though the day will not.\nThen we’ll look at name:\n\nprint(edf1.first_name_features[0] + edf1.last_name_features[0])\nprint(edf2.name_features[0])\n\n['_l', 'la', 'au', 'ur', 'ra', 'a_', '_la', 'lau', 'aur', 'ura', 'ra_', '_d', 'da', 'at', 'te', 'en', 'n_', '_da', 'dat', 'ate', 'ten', 'en_']\n['_l', 'la', 'au', 'ur', 'ra', 'a_', '_d', 'da', 'at', 'tt', 'te', 'en', 'n_', '_la', 'lau', 'aur', 'ura', 'ra_', '_da', 'dat', 'att', 'tte', 'ten', 'en_']\n\n\nThe two datasets store the names differently, but this doesn’t matter for the Bloom filter method because it treats each record like a bag of features. By default, the name processor produces 2-grams and 3-grams.\nThe sex processing function just converts different formats to lowercase and takes the first letter. 
This will often be enough:\n\nprint(edf1.gender_features[0])\nprint(edf2.sex_features[0])\n\n['sex<f>']\n['sex<f>']\n\n\nFinally, we’ll see how our instrument feature function (partial(features.gen_misc_shingled_features, label=\"instrument\")) processed the data:\n\nprint(edf1.instrument_features[0])\nprint(edf2.main_instrument_features[0])\n\n['instrument<_b>', 'instrument<ba>', 'instrument<as>', 'instrument<ss>', 'instrument<s_>', 'instrument<_ba>', 'instrument<bas>', 'instrument<ass>', 'instrument<ss_>']\n['instrument<_b>', 'instrument<ba>', 'instrument<as>', 'instrument<ss>', 'instrument<s_>', 'instrument<_g>', 'instrument<gu>', 'instrument<ui>', 'instrument<it>', 'instrument<ta>', 'instrument<ar>', 'instrument<r_>', 'instrument<_ba>', 'instrument<bas>', 'instrument<ass>', 'instrument<ss_>', 'instrument<_gu>', 'instrument<gui>', 'instrument<uit>', 'instrument<ita>', 'instrument<tar>', 'instrument<ar_>']\n\n\nSetting the label argument was important to ensure that the shingles match (and are hashed to the same slots) because the default behaviour of the function is to use the column name as a label: since the two columns have different names, the default wouldn’t have allowed the features to match to each other.\n\n\nPerforming the linkage\nWe can now perform the linkage by comparing these Bloom filter embeddings. We use the Soft Cosine Measure (which in this untrained model, is equivalent to a normal cosine similarity metric) to calculate record-wise similarity and an adapted Hungarian algorithm to match the records based on those similarities.\n\nsimilarities = embedder.compare(edf1, edf2)\nsimilarities\n\nSimilarityArray([[0.80050047, 0.10341754, 0.10047246],\n [0.34170424, 0.16480856, 0.63029481],\n [0.12155416, 0.54020787, 0.11933984]])\n\n\nThis SimilarityArray object is an augmented numpy.ndarray that can perform our matching. 
The matching itself can optionally be called with an absolute threshold score, but it doesn’t need one.\n\nmatching = similarities.match()\nmatching\n\n(array([0, 1, 2]), array([0, 2, 1]))\n\n\nSo, all three of the records in each dataset were matched correctly. Excellent!", "crumbs": [ "About", "Docs", @@ -340,7 +340,7 @@ "href": "docs/tutorials/run-through.html", "title": "Embedder API run-through", "section": "", - "text": "This article shows the main classes, methods and functionality of the Embedder API.\nFirst, we’ll import a few modules, including:\nimport os\n\nimport pandas as pd\n\nfrom pprl import EmbeddedDataFrame, Embedder, config\nfrom pprl.embedder import features as feat", + "text": "This article shows the main classes, methods and functionality of the Embedder API.\nFirst, we’ll import a few modules, including:\nimport os\nimport numpy as np\nimport pandas as pd\n\nfrom pprl import EmbeddedDataFrame, Embedder, config\nfrom pprl.embedder import features as feat", "crumbs": [ "About", "Docs", @@ -353,7 +353,7 @@ "href": "docs/tutorials/run-through.html#data-set-up", "title": "Embedder API run-through", "section": "Data set-up", - "text": "Data set-up\nFor this demo we’ll create a really minimal pair of datasets. Notice that they don’t have to have the same structure or field names.\n\ndf1 = pd.DataFrame(\n dict(\n id=[1,2,3],\n forename=[\"Henry\", \"Sally\", \"Ina\"],\n surname = [\"Tull\", \"Brown\", \"Lawrey\"],\n dob=[\"1/1/2001\", \"2/1/2001\", \"4/10/1995\"],\n gender=[\"male\", \"Male\", \"Female\"],\n )\n)\n\ndf2 = pd.DataFrame(\n dict(\n personid=[4,5,6],\n full_name=[\"Harry Tull\", \"Sali Brown\", \"Ina Laurie\"],\n date_of_birth=[\"2/1/2001\", \"2/1/2001\", \"4/11/1995\"],\n sex=[\"M\", \"M\", \"F\"],\n )\n)\n\nFeatures are extracted as different kinds of string objects from each field, ready to be hash embedded into the Bloom filters. 
We need to specify the feature extraction functions we’ll need.\nIn this case we’ll need one extractor for names, one for dates of birth, and one for sex/gender records. We create a dict with the functions we need. We create another dict to store any keyword arguments we want to pass in to each function (in this case we use all the default arguments so the keyword argument dictionaries are empty):\n\nfeature_factory = dict(\n name=feat.gen_name_features,\n dob=feat.gen_dateofbirth_features,\n sex=feat.gen_sex_features,\n)\n\nff_args = dict(name={}, sex={}, dob={})", + "text": "Data set-up\nFor this demo we’ll create a really minimal pair of datasets. Notice that they don’t have to have the same structure or field names.\n\ndf1 = pd.DataFrame(\n dict(\n id=[1,2,3],\n forename=[\"Henry\", \"Sally\", \"Ina\"],\n surname = [\"Tull\", \"Brown\", \"Lawrey\"],\n dob=[\"\", \"2/1/2001\", \"4/10/1995\"],\n gender=[\"male\", \"Male\", \"Female\"],\n county=[\"\", np.NaN, \"County Durham\"]\n )\n)\n\ndf2 = pd.DataFrame(\n dict(\n personid=[4,5,6],\n full_name=[\"Harry Tull\", \"Sali Brown\", \"Ina Laurie\"],\n date_of_birth=[\"2/1/2001\", \"2/1/2001\", \"4/11/1995\"],\n sex=[\"M\", \"M\", \"F\"],\n county=[\"Rutland\", \"Powys\", \"Durham\"]\n )\n)\n\nFeatures are extracted as different kinds of string objects from each field, ready to be hash embedded into the Bloom filters. We need to specify the feature extraction functions we’ll need.\nIn this case we’ll need one extractor for names, one for dates of birth, and one for sex/gender records. We create a dict with the functions we need. 
We create another dict to store any keyword arguments we want to pass in to each function (in this case we use all the default arguments so the keyword argument dictionaries are empty):\n\nfeature_factory = dict(\n name=feat.gen_name_features,\n dob=feat.gen_dateofbirth_features,\n sex=feat.gen_sex_features,\n misc=feat.gen_misc_features\n)\n\nff_args = dict(name={}, sex={}, dob={})", "crumbs": [ "About", "Docs", @@ -366,7 +366,7 @@ "href": "docs/tutorials/run-through.html#embedding", "title": "Embedder API run-through", "section": "Embedding", - "text": "Embedding\nNow we can create an Embedder object. We want our Bloom filter vectors to have a length of 1024 elements, and we choose to hash each feature two times. These choices seem to work ok, but we haven’t explored them systematically.\n\nembedder = Embedder(feature_factory,\n ff_args,\n bf_size = 2**10,\n num_hashes=2,\n )\n\nNow we can hash embed the dataset into an EmbeddedDataFrame (EDF). For this we need to pass a column specification colspec that maps each column of the data into the feature_factory functions. Any columns not mapped will not contribute to the embedding.\n\nedf1 = embedder.embed(\n df1, colspec=dict(forename=\"name\", surname=\"name\", dob=\"dob\", gender=\"sex\")\n)\nedf2 = embedder.embed(\n df2, colspec=dict(full_name=\"name\", date_of_birth=\"dob\", sex=\"sex\")\n)\n\nprint(edf1)\nprint(edf2)\n\n id forename surname dob gender \\\n0 1 Henry Tull 1/1/2001 male \n1 2 Sally Brown 2/1/2001 Male \n2 3 Ina Lawrey 4/10/1995 Female \n\n forename_features \\\n0 [_h, he, en, nr, ry, y_, _he, hen, enr, nry, ry_] \n1 [_s, sa, al, ll, ly, y_, _sa, sal, all, lly, ly_] \n2 [_i, in, na, a_, _in, ina, na_] \n\n surname_features \\\n0 [_t, tu, ul, ll, l_, _tu, tul, ull, ll_] \n1 [_b, br, ro, ow, wn, n_, _br, bro, row, own, wn_] \n2 [_l, la, aw, wr, re, ey, y_, _la, law, awr, wr... 
\n\n dob_features gender_features \\\n0 [day<01>, month<01>, year<2001>] [sex<m>] \n1 [day<02>, month<01>, year<2001>] [sex<m>] \n2 [day<04>, month<10>, year<1995>] [sex<f>] \n\n all_features \\\n0 [ll_, _tu, day<01>, ul, l_, sex<m>, ull, y_, _... \n1 [lly, day<02>, wn_, sex<m>, sal, wn, y_, ly_, ... \n2 [_in, _i, ey_, wr, y_, rey, wre, sex<f>, _l, _... \n\n bf_indices bf_norms \n0 [130, 644, 773, 903, 135, 776, 778, 265, 654, ... 6.708204 \n1 [129, 258, 130, 776, 523, 525, 398, 271, 671, ... 7.141428 \n2 [647, 394, 269, 13, 15, 532, 155, 28, 667, 413... 6.855655 \n personid full_name date_of_birth sex \\\n0 4 Harry Tull 2/1/2001 M \n1 5 Sali Brown 2/1/2001 M \n2 6 Ina Laurie 4/11/1995 F \n\n full_name_features \\\n0 [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _... \n1 [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _... \n2 [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _... \n\n date_of_birth_features sex_features \\\n0 [day<02>, month<01>, year<2001>] [sex<m>] \n1 [day<02>, month<01>, year<2001>] [sex<m>] \n2 [day<04>, month<11>, year<1995>] [sex<f>] \n\n all_features \\\n0 [ll_, _tu, day<02>, ar, ul, l_, sex<m>, ull, y... \n1 [day<02>, wn_, sex<m>, wn, sal, ow, al, n_, al... \n2 [ri, _in, _i, aur, ie_, ur, sex<f>, _l, au, _l... \n\n bf_indices bf_norms \n0 [640, 130, 644, 135, 776, 10, 778, 271, 402, 5... 6.708204 \n1 [130, 523, 525, 398, 271, 152, 671, 803, 806, ... 6.855655 \n2 [646, 647, 394, 269, 15, 272, 531, 532, 665, 6... 6.782330", + "text": "Embedding\nNow we can create an Embedder object. We want our Bloom filter vectors to have a length of 1024 elements, and we choose to hash each feature two times. These choices seem to work ok, but we haven’t explored them systematically.\n\nembedder = Embedder(feature_factory,\n ff_args,\n bf_size = 2**10,\n num_hashes=2,\n )\n\nNow we can hash embed the dataset into an EmbeddedDataFrame (EDF). For this we need to pass a column specification colspec that maps each column of the data into the feature_factory functions. 
Any columns not mapped will not contribute to the embedding.\n\nedf1 = embedder.embed(\n df1, colspec=dict(forename=\"name\", surname=\"name\", dob=\"dob\", gender=\"sex\", county=\"misc\")\n)\nedf2 = embedder.embed(\n df2, colspec=dict(full_name=\"name\", date_of_birth=\"dob\", sex=\"sex\", county=\"misc\")\n)\n\nprint(edf1)\nprint(edf2)\n\n id forename surname dob gender county \\\n0 1 Henry Tull male \n1 2 Sally Brown 2/1/2001 Male NaN \n2 3 Ina Lawrey 4/10/1995 Female County Durham \n\n forename_features \\\n0 [_h, he, en, nr, ry, y_, _he, hen, enr, nry, ry_] \n1 [_s, sa, al, ll, ly, y_, _sa, sal, all, lly, ly_] \n2 [_i, in, na, a_, _in, ina, na_] \n\n surname_features \\\n0 [_t, tu, ul, ll, l_, _tu, tul, ull, ll_] \n1 [_b, br, ro, ow, wn, n_, _br, bro, row, own, wn_] \n2 [_l, la, aw, wr, re, ey, y_, _la, law, awr, wr... \n\n dob_features gender_features county_features \\\n0 [] [sex<m>] \n1 [day<02>, month<01>, year<2001>] [sex<m>] \n2 [day<04>, month<10>, year<1995>] [sex<f>] [county<county durham>] \n\n all_features \\\n0 [ll, nr, ll_, _t, ull, _tu, _he, he, tu, hen, ... \n1 [all, ll, ro, n_, ow, sa, ly_, bro, month<01>,... \n2 [ina, ey, _in, re, wr, aw, law, la, na_, ey_, ... \n\n bf_indices bf_norms \n0 [644, 773, 135, 776, 265, 778, 271, 402, 404, ... 6.244998 \n1 [129, 258, 130, 776, 523, 525, 398, 271, 671, ... 7.141428 \n2 [647, 394, 269, 13, 15, 532, 667, 155, 413, 28... 7.000000 \n personid full_name date_of_birth sex county \\\n0 4 Harry Tull 2/1/2001 M Rutland \n1 5 Sali Brown 2/1/2001 M Powys \n2 6 Ina Laurie 4/11/1995 F Durham \n\n full_name_features \\\n0 [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _... \n1 [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _... \n2 [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _... 
\n\n date_of_birth_features sex_features county_features \\\n0 [day<02>, month<01>, year<2001>] [sex<m>] [county<rutland>] \n1 [day<02>, month<01>, year<2001>] [sex<m>] [county<powys>] \n2 [day<04>, month<11>, year<1995>] [sex<f>] [county<durham>] \n\n all_features \\\n0 [ll, ll_, rr, rry, ar, _ha, _t, ha, ull, count... \n1 [county<powys>, ro, li_, n_, ow, sa, bro, ali,... \n2 [ina, ie, aur, e_, _in, uri, la, na_, county<d... \n\n bf_indices bf_norms \n0 [640, 130, 644, 135, 776, 10, 778, 271, 402, 5... 6.855655 \n1 [130, 523, 525, 398, 271, 152, 671, 803, 806, ... 7.000000 \n2 [646, 647, 394, 269, 15, 272, 531, 532, 665, 6... 6.928203", "crumbs": [ "About", "Docs", @@ -392,7 +392,7 @@ "href": "docs/tutorials/run-through.html#computing-the-similarity-scores-and-the-matching", "title": "Embedder API run-through", "section": "Computing the similarity scores and the matching", - "text": "Computing the similarity scores and the matching\nNow we have two embedded datasets, we can compare them and compute all the pairwise Cosine similarity scores.\nFirst, we have to compute the vector norms of each Bloom vector (for scaling the Cosine similarity) and the thresholds (thresholds are explained here [link]). 
Computing the thresholds can be time-consuming for a larger dataset, because it essentially computes all pairwise comparisons of the data to itself.\n\n\n\n\n\n\n\n\n\n\npersonid\nfull_name\ndate_of_birth\nsex\nfull_name_features\ndate_of_birth_features\nsex_features\nall_features\nbf_indices\nbf_norms\nthresholds\n\n\n\n\n0\n4\nHarry Tull\n2/1/2001\nM\n[_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...\n[day<02>, month<01>, year<2001>]\n[sex<m>]\n[ll_, _tu, day<02>, ar, ul, l_, sex<m>, ull, y...\n[640, 130, 644, 135, 776, 10, 778, 271, 402, 5...\n6.708204\n0.195698\n\n\n1\n5\nSali Brown\n2/1/2001\nM\n[_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...\n[day<02>, month<01>, year<2001>]\n[sex<m>]\n[day<02>, wn_, sex<m>, wn, sal, ow, al, n_, al...\n[130, 523, 525, 398, 271, 152, 671, 803, 806, ...\n6.855655\n0.195698\n\n\n2\n6\nIna Laurie\n4/11/1995\nF\n[_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...\n[day<04>, month<11>, year<1995>]\n[sex<f>]\n[ri, _in, _i, aur, ie_, ur, sex<f>, _l, au, _l...\n[646, 647, 394, 269, 15, 272, 531, 532, 665, 6...\n6.782330\n0.086026\n\n\n\n\n\n\n\n\nNB: there’s also a flag to compute these at the same time as the embedding, but it doesn’t by default because, depending on the workflow, you may wish to compute the norms and thresholds at different times (e.g. on the server).\nNow you can compute the similarities:\n\nsimilarities = embedder.compare(edf1,edf2)\n\nprint(similarities)\n\n[[0.6666667 0.17395416 0. 
]\n [0.29223802 0.79658223 0.08258402]\n [0.08697708 0.10638298 0.58067873]]\n\n\nFinally, you can compute the matching:\n\nmatching = similarities.match(abs_cutoff=0.5)\n\nprint(matching)\n\n(array([0, 1, 2]), array([0, 1, 2]))", + "text": "Computing the similarity scores and the matching\nNow we have two embedded datasets, we can compare them and compute all the pairwise Cosine similarity scores.\nFirst, we have to compute the vector norms of each Bloom vector (for scaling the Cosine similarity) and the thresholds (thresholds are explained here [link]). Computing the thresholds can be time-consuming for a larger dataset, because it essentially computes all pairwise comparisons of the data to itself.\n\n\n\n\n\n\n\n\n\n\npersonid\nfull_name\ndate_of_birth\nsex\ncounty\nfull_name_features\ndate_of_birth_features\nsex_features\ncounty_features\nall_features\nbf_indices\nbf_norms\nthresholds\n\n\n\n\n0\n4\nHarry Tull\n2/1/2001\nM\nRutland\n[_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...\n[day<02>, month<01>, year<2001>]\n[sex<m>]\n[county<rutland>]\n[ll, ll_, rr, rry, ar, _ha, _t, ha, ull, count...\n[640, 130, 644, 135, 776, 10, 778, 271, 402, 5...\n6.855655\n0.187541\n\n\n1\n5\nSali Brown\n2/1/2001\nM\nPowys\n[_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...\n[day<02>, month<01>, year<2001>]\n[sex<m>]\n[county<powys>]\n[county<powys>, ro, li_, n_, ow, sa, bro, ali,...\n[130, 523, 525, 398, 271, 152, 671, 803, 806, ...\n7.000000\n0.187541\n\n\n2\n6\nIna Laurie\n4/11/1995\nF\nDurham\n[_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...\n[day<04>, month<11>, year<1995>]\n[sex<f>]\n[county<durham>]\n[ina, ie, aur, e_, _in, uri, la, na_, county<d...\n[646, 647, 394, 269, 15, 272, 531, 532, 665, 6...\n6.928203\n0.082479\n\n\n\n\n\n\n\n\nNB: there’s also a flag to compute these at the same time as the embedding, but it doesn’t by default because, depending on the workflow, you may wish to compute the norms and thresholds at different times (e.g. 
on the server).\nNow you can compute the similarities:\n\nsimilarities = embedder.compare(edf1,edf2)\n\nprint(similarities)\n\n[[0.60728442 0.09150181 0. ]\n [0.2859526 0.78015612 0.08084521]\n [0.08335143 0.10204083 0.57735028]]\n\n\nFinally, you can compute the matching:\n\nmatching = similarities.match(abs_cutoff=0.5)\n\nprint(matching)\n\n(array([0, 1, 2]), array([0, 1, 2]))", "crumbs": [ "About", "Docs", @@ -405,7 +405,7 @@ "href": "docs/tutorials/run-through.html#serialisation-and-file-io", "title": "Embedder API run-through", "section": "Serialisation and file I/O", - "text": "Serialisation and file I/O\nThat’s how to do the workflow in one session. However, this demo follows a multi-stage workflow, so we need to be able to pass objects around. There are a couple of methods that enable file I/O and serialisation.\nFirst, the Embedder object itself needs to be written to file and loaded. The idea is to train it, share it to the data owning parties, and also to the matching server. For this purpose, it’s possible to pickle the entire Embedder object.\n\nembedder.to_pickle(\"embedder.pkl\")\n\nembedder_copy = Embedder.from_pickle(\"embedder.pkl\")\n\nThe copy has the same functionality as the original:\n\nsimilarities = embedder_copy.compare(edf1,edf2)\n\nprint(similarities)\n\n[[0.6666667 0.17395416 0. ]\n [0.29223802 0.79658223 0.08258402]\n [0.08697708 0.10638298 0.58067873]]\n\n\nNB: This won’t work if two datasets were embedded with different Embedder instances, even if they’re identical. The compare() method checks for the same embedder object memory reference so it won’t work if one was embedded with the original and the other with the copy. The way to fix this is to re-initialise the EmbeddedDataFrame with the new Embedder object.\n\nedf2_copy = EmbeddedDataFrame(edf2, embedder_copy)\n\nIn this case, be careful that the Embedder is compatible with the Bloom filter vectors in the EDF (i.e. 
uses the same parameters and feature factories), because while you can refresh the norms and thresholds, you can’t refresh the ‘bf_indices’ without reembedding the data frame.", + "text": "Serialisation and file I/O\nThat’s how to do the workflow in one session. However, this demo follows a multi-stage workflow, so we need to be able to pass objects around. There are a couple of methods that enable file I/O and serialisation.\nFirst, the Embedder object itself needs to be written to file and loaded. The idea is to train it, share it to the data owning parties, and also to the matching server. For this purpose, it’s possible to pickle the entire Embedder object.\n\nembedder.to_pickle(\"embedder.pkl\")\n\nembedder_copy = Embedder.from_pickle(\"embedder.pkl\")\n\nThe copy has the same functionality as the original:\n\nsimilarities = embedder_copy.compare(edf1,edf2)\n\nprint(similarities)\n\n[[0.60728442 0.09150181 0. ]\n [0.2859526 0.78015612 0.08084521]\n [0.08335143 0.10204083 0.57735028]]\n\n\nNB: This won’t work if two datasets were embedded with different Embedder instances, even if they’re identical. The compare() method checks for the same embedder object memory reference so it won’t work if one was embedded with the original and the other with the copy. The way to fix this is to re-initialise the EmbeddedDataFrame with the new Embedder object.\n\nedf2_copy = EmbeddedDataFrame(edf2, embedder_copy)\n\nIn this case, be careful that the Embedder is compatible with the Bloom filter vectors in the EDF (i.e. 
uses the same parameters and feature factories), because while you can refresh the norms and thresholds, you can’t refresh the ‘bf_indices’ without reembedding the data frame.", "crumbs": [ "About", "Docs", @@ -496,7 +496,7 @@ "href": "docs/tutorials/example-febrl.html#calculate-similarity", "title": "Linking the FEBRL datasets", "section": "Calculate similarity", - "text": "Calculate similarity\nCompute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.\n\nstart = time.time()\nedf1.update_thresholds()\nedf2.update_thresholds()\nend = time.time()\n\nprint(f\"Updating thresholds took {end - start:.2f} seconds\")\n\nUpdating thresholds took 8.35 seconds\n\n\nCompute the matrix of similarity scores.\n\nsimilarity_scores = embedder.compare(edf1,edf2)", + "text": "Calculate similarity\nCompute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.\n\nstart = time.time()\nedf1.update_thresholds()\nedf2.update_thresholds()\nend = time.time()\n\nprint(f\"Updating thresholds took {end - start:.2f} seconds\")\n\nUpdating thresholds took 8.40 seconds\n\n\nCompute the matrix of similarity scores.\n\nsimilarity_scores = embedder.compare(edf1,edf2)", "crumbs": [ "About", "Docs", @@ -509,7 +509,7 @@ "href": "docs/tutorials/example-febrl.html#compute-a-match", "title": "Linking the FEBRL datasets", "section": "Compute a match", - "text": "Compute a match\nUse the similarity scores to compute a match, using the Hungarian algorithm. 
First, we compute the match with the row thresholds.\n\nmatching = similarity_scores.match(require_thresholds=True)\n\nUsing the true IDs, evaluate the precision and recall of the match.\n\ndef get_results(edf1, edf2, matching):\n \"\"\"Get the results for a given matching.\"\"\"\n\n trueids_matched1 = edf1.iloc[matching[0], edf1.columns.get_loc(\"true_id\")]\n trueids_matched2 = edf2.iloc[matching[1], edf2.columns.get_loc(\"true_id\")]\n\n nmatches = len(matching[0])\n truepos = sum(map(np.equal, trueids_matched1, trueids_matched2))\n falsepos = nmatches - truepos\n\n print(\n f\"True pos: {truepos} | False pos: {falsepos} | \"\n f\"Precision: {truepos / nmatches:.1%} | Recall: {truepos / 5000:.1%}\"\n )\n\n return nmatches, truepos, falsepos\n\n_ = get_results(edf1, edf2, matching)\n\nTrue pos: 4973 | False pos: 0 | Precision: 100.0% | Recall: 99.5%\n\n\nThen, we compute the match without using the row thresholds, calculating the same performance metrics:\n\nmatching = similarity_scores.match(require_thresholds=False)\n_ = get_results(edf1, edf2, matching)\n\nTrue pos: 5000 | False pos: 0 | Precision: 100.0% | Recall: 100.0%\n\n\nWithout using the row thresholds, the number of false positives is larger, but the recall is much better. For some uses this balance may be preferable.\nIn testing, the use of local row thresholds provides a better trade-off between precision and recall, compared to using a single absolute threshold. It has the additional advantage, in a privacy-preserving setting, of being automatic and not requiring clerical review to set the level.", + "text": "Compute a match\nUse the similarity scores to compute a match, using the Hungarian algorithm. 
First, we compute the match with the row thresholds.\n\nmatching = similarity_scores.match(require_thresholds=True)\n\nUsing the true IDs, evaluate the precision and recall of the match.\n\ndef get_results(edf1, edf2, matching):\n \"\"\"Get the results for a given matching.\"\"\"\n\n trueids_matched1 = edf1.iloc[matching[0], edf1.columns.get_loc(\"true_id\")]\n trueids_matched2 = edf2.iloc[matching[1], edf2.columns.get_loc(\"true_id\")]\n\n nmatches = len(matching[0])\n truepos = sum(map(np.equal, trueids_matched1, trueids_matched2))\n falsepos = nmatches - truepos\n\n print(\n f\"True pos: {truepos} | False pos: {falsepos} | \"\n f\"Precision: {truepos / nmatches:.1%} | Recall: {truepos / 5000:.1%}\"\n )\n\n return nmatches, truepos, falsepos\n\n_ = get_results(edf1, edf2, matching)\n\nTrue pos: 4969 | False pos: 0 | Precision: 100.0% | Recall: 99.4%\n\n\nThen, we compute the match without using the row thresholds, calculating the same performance metrics:\n\nmatching = similarity_scores.match(require_thresholds=False)\n_ = get_results(edf1, edf2, matching)\n\nTrue pos: 5000 | False pos: 0 | Precision: 100.0% | Recall: 100.0%\n\n\nWithout using the row thresholds, the number of false positives is larger, but the recall is much better. For some uses this balance may be preferable.\nIn testing, the use of local row thresholds provides a better trade-off between precision and recall, compared to using a single absolute threshold. 
It has the additional advantage, in a privacy-preserving setting, of being automatic and not requiring clerical review to set the level.", "crumbs": [ "About", "Docs", @@ -586,7 +586,7 @@ "href": "docs/reference/features.html", "title": "features", "section": "", - "text": "embedder.features\nFeature generation functions for various column types.\n\n\n\n\n\nName\nDescription\n\n\n\n\ngen_dateofbirth_features\nGenerate labelled date features from a series of dates of birth.\n\n\ngen_double_metaphone\nGenerate the double methaphones of a string.\n\n\ngen_features\nGenerate string features of various types.\n\n\ngen_misc_features\nGenerate miscellaneous categorical features for a series.\n\n\ngen_misc_shingled_features\nGenerate shingled labelled features.\n\n\ngen_name_features\nGenerate a features series for a series of names.\n\n\ngen_ngram\nGenerate n-grams from a set of tokens.\n\n\ngen_sex_features\nGenerate labelled sex features from a series of sexes.\n\n\ngen_skip_grams\nGenerate skip 2-grams from a set of tokens.\n\n\nsplit_string_underscore\nSplit and underwrap a string at typical punctuation marks.\n\n\n\n\n\nembedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=['day<01>', 'month<01>', 'year<2050>'])\nGenerate labelled date features from a series of dates of birth.\nFeatures take the form [\"day<dd>\", \"month<mm>\", \"year<YYYY>\"]. Note that this feature generator can be used for any sort of date data, not just dates of birth.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndob\npandas.pandas.Series\nSeries of dates of birth.\nrequired\n\n\ndayfirst\nbool\nWhether the day comes first in the DOBs. Passed to pd.to_datetime() and defaults to True.\nTrue\n\n\nyearfirst\nbool\nWhether the year comes first in the DOBs. Passed to pd.to_datetime() and defaults to False.\nFalse\n\n\ndefault\nlist[str]\nDefault date to fill in missing data in feature (list) form. 
Default is the feature form of 2050-01-01.\n['day<01>', 'month<01>', 'year<2050>']\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of date features.\n\n\n\n\n\n\n\nembedder.features.gen_double_metaphone(string)\nGenerate the double methaphones of a string.\nThis function is a generator containing all the possible, non-empty double metaphones of a given string, separated by spaces. This function uses the metaphone.doublemetaphone() function under the hood, ignoring any empty strings. See their repository for details.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString from which to derive double metaphones.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next double metaphone in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_features(string, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate string features of various types.\nThis function is a generator capable of producing n-grams, skip 2-grams, and double metaphones from a single string. These outputs are referred to as features.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nBase string from which to generate features.\nrequired\n\n\nngram_length\nlist\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next feature in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_misc_features(field, label=None)\nGenerate miscellaneous categorical features for a series.\nUseful for keeping raw columns in the linkage data. 
All features use a label and take the form [\"label<option>\"] except for missing data, which are coded as \"\".\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries from which to generate our features.\nrequired\n\n\nlabel\nNone | str | typing.Hashable\nLabel for the series. By default, the name of the series is used if available. Otherwise, if not specified, misc is used.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of miscellaneous features.\n\n\n\n\n\n\n\nembedder.features.gen_misc_shingled_features(field, ngram_length=[2, 3], use_gen_skip_grams=False, label=None)\nGenerate shingled labelled features.\nGenerate n-grams, with a label to distinguish them from (and ensure they’re hashed separately from) names. Like gen_name_features(), this function makes a call to gen_features() via pd.Series.apply().\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries of string data.\nrequired\n\n\nngram_length\nlist\nShingle sizes to generate. By default [2, 3].\n[2, 3]\n\n\nuse_gen_skip_grams\nbool\nWhether to generate skip 2-grams. False by default.\nFalse\n\n\nlabel\nstr\nA label to differentiate from other shingled features. If field has no name, this defaults to zz.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of shingled string features.\n\n\n\n\n\n\n\nembedder.features.gen_name_features(names, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate a features series for a series of names.\nEffectively, this function is a call to pd.Series.apply() using our gen_features() string feature generator function.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nnames\npandas.pandas.Series\nSeries of names.\nrequired\n\n\nngram_length\nlist[int]\nLengths of n-grams to make. 
Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of features.\n\n\n\n\n\n\n\nembedder.features.gen_ngram(split_tokens, ngram_length)\nGenerate n-grams from a set of tokens.\nThis is a generator function that contains a series of n-grams the size of the sliding window.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form n-grams.\nrequired\n\n\nngram_length\nlist\nDesired lengths of n-grams. For examples, ngram_length=[2, 3] would generate all 2-grams and 3-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next n-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_sex_features(sexes)\nGenerate labelled sex features from a series of sexes.\nFeatures take the form [\"sex<option>\"] or [\"\"] for missing data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsexes\npandas.pandas.Series\nSeries of sex data.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of sex features.\n\n\n\n\n\n\n\nembedder.features.gen_skip_grams(split_tokens)\nGenerate skip 2-grams from a set of tokens.\nThis function is a generator that contains a series of skip 2-grams.\n\n\n>>> string = \"dave james\"\n>>> tokens = split_string_underscore(string)\n>>> skips = list(gen_skip_grams(tokens))\n>>> print(skips)\n[\"_a\", \"dv\", \"ae\", \"v_\", \"_a\", \"jm\", \"ae\", \"ms\", \"e_\"]\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form skip 2-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next skip 2-gram 
in the sequence.\n\n\n\n\n\n\n\nembedder.features.split_string_underscore(string)\nSplit and underwrap a string at typical punctuation marks.\nCurrently, we split at any combination of spaces, dashes, dots, commas, or underscores.\n\n\n>>> strings = (\"dave william johnson\", \"Francesca__Hogan-O'Malley\")\n>>> for string in strings:\n... print(split_string_underscore(string))\n[\"_dave_\", \"_william_\", \"_johnson_\"]\n[\"_Francesca_\", \"_Hogan_\", \"_O'Malley_\"]\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString to split.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nlist[str]\nList of the split and wrapped tokens.", + "text": "embedder.features\nFeature generation functions for various column types.\n\n\n\n\n\nName\nDescription\n\n\n\n\ngen_dateofbirth_features\nGenerate labelled date features from a series of dates of birth.\n\n\ngen_double_metaphone\nGenerate the double methaphones of a string.\n\n\ngen_features\nGenerate string features of various types.\n\n\ngen_misc_features\nGenerate miscellaneous categorical features for a series.\n\n\ngen_misc_shingled_features\nGenerate shingled labelled features.\n\n\ngen_name_features\nGenerate a features series for a series of names.\n\n\ngen_ngram\nGenerate n-grams from a set of tokens.\n\n\ngen_sex_features\nGenerate labelled sex features from a series of sexes.\n\n\ngen_skip_grams\nGenerate skip 2-grams from a set of tokens.\n\n\nsplit_string_underscore\nSplit and underwrap a string at typical punctuation marks.\n\n\n\n\n\nembedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=[])\nGenerate labelled date features from a series of dates of birth.\nFeatures take the form [\"day<dd>\", \"month<mm>\", \"year<YYYY>\"]. 
Note that this feature generator can be used for any sort of date data, not just dates of birth.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndob\npandas.pandas.Series\nSeries of dates of birth.\nrequired\n\n\ndayfirst\nbool\nWhether the day comes first in the DOBs. Passed to pd.to_datetime() and defaults to True.\nTrue\n\n\nyearfirst\nbool\nWhether the year comes first in the DOBs. Passed to pd.to_datetime() and defaults to False.\nFalse\n\n\ndefault\nlist[str]\nDefault date to fill in missing data in feature (list) form. Default is the feature form of 2050-01-01.\n[]\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of date features.\n\n\n\n\n\n\n\nembedder.features.gen_double_metaphone(string)\nGenerate the double methaphones of a string.\nThis function is a generator containing all the possible, non-empty double metaphones of a given string, separated by spaces. This function uses the metaphone.doublemetaphone() function under the hood, ignoring any empty strings. See their repository for details.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString from which to derive double metaphones.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next double metaphone in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_features(string, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate string features of various types.\nThis function is a generator capable of producing n-grams, skip 2-grams, and double metaphones from a single string. These outputs are referred to as features.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nBase string from which to generate features.\nrequired\n\n\nngram_length\nlist\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. 
Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next feature in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_misc_features(field, label=None)\nGenerate miscellaneous categorical features for a series.\nUseful for keeping raw columns in the linkage data. All features use a label and take the form [\"label<option>\"] except for missing data, which are coded as \"\".\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries from which to generate our features.\nrequired\n\n\nlabel\nNone | str | typing.Hashable\nLabel for the series. By default, the name of the series is used if available. Otherwise, if not specified, misc is used.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of miscellaneous features.\n\n\n\n\n\n\n\nembedder.features.gen_misc_shingled_features(field, ngram_length=[2, 3], use_gen_skip_grams=False, label=None)\nGenerate shingled labelled features.\nGenerate n-grams, with a label to distinguish them from (and ensure they’re hashed separately from) names. Like gen_name_features(), this function makes a call to gen_features() via pd.Series.apply().\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries of string data.\nrequired\n\n\nngram_length\nlist\nShingle sizes to generate. By default [2, 3].\n[2, 3]\n\n\nuse_gen_skip_grams\nbool\nWhether to generate skip 2-grams. False by default.\nFalse\n\n\nlabel\nstr\nA label to differentiate from other shingled features. 
If field has no name, this defaults to zz.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of shingled string features.\n\n\n\n\n\n\n\nembedder.features.gen_name_features(names, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate a features series for a series of names.\nEffectively, this function is a call to pd.Series.apply() using our gen_features() string feature generator function.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nnames\npandas.pandas.Series\nSeries of names.\nrequired\n\n\nngram_length\nlist[int]\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of features.\n\n\n\n\n\n\n\nembedder.features.gen_ngram(split_tokens, ngram_length)\nGenerate n-grams from a set of tokens.\nThis is a generator function that contains a series of n-grams the size of the sliding window.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form n-grams.\nrequired\n\n\nngram_length\nlist\nDesired lengths of n-grams. 
For examples, ngram_length=[2, 3] would generate all 2-grams and 3-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next n-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_sex_features(sexes)\nGenerate labelled sex features from a series of sexes.\nFeatures take the form [\"sex<option>\"] or [\"\"] for missing data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsexes\npandas.pandas.Series\nSeries of sex data.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of sex features.\n\n\n\n\n\n\n\nembedder.features.gen_skip_grams(split_tokens)\nGenerate skip 2-grams from a set of tokens.\nThis function is a generator that contains a series of skip 2-grams.\n\n\n>>> string = \"dave james\"\n>>> tokens = split_string_underscore(string)\n>>> skips = list(gen_skip_grams(tokens))\n>>> print(skips)\n[\"_a\", \"dv\", \"ae\", \"v_\", \"_a\", \"jm\", \"ae\", \"ms\", \"e_\"]\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form skip 2-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next skip 2-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.split_string_underscore(string)\nSplit and underwrap a string at typical punctuation marks.\nCurrently, we split at any combination of spaces, dashes, dots, commas, or underscores.\n\n\n>>> strings = (\"dave william johnson\", \"Francesca__Hogan-O'Malley\")\n>>> for string in strings:\n... 
print(split_string_underscore(string))\n[\"_dave_\", \"_william_\", \"_johnson_\"]\n[\"_Francesca_\", \"_Hogan_\", \"_O'Malley_\"]\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString to split.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nlist[str]\nList of the split and wrapped tokens.", "crumbs": [ "About", "Docs", @@ -599,7 +599,7 @@ "href": "docs/reference/features.html#functions", "title": "features", "section": "", - "text": "Name\nDescription\n\n\n\n\ngen_dateofbirth_features\nGenerate labelled date features from a series of dates of birth.\n\n\ngen_double_metaphone\nGenerate the double methaphones of a string.\n\n\ngen_features\nGenerate string features of various types.\n\n\ngen_misc_features\nGenerate miscellaneous categorical features for a series.\n\n\ngen_misc_shingled_features\nGenerate shingled labelled features.\n\n\ngen_name_features\nGenerate a features series for a series of names.\n\n\ngen_ngram\nGenerate n-grams from a set of tokens.\n\n\ngen_sex_features\nGenerate labelled sex features from a series of sexes.\n\n\ngen_skip_grams\nGenerate skip 2-grams from a set of tokens.\n\n\nsplit_string_underscore\nSplit and underwrap a string at typical punctuation marks.\n\n\n\n\n\nembedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=['day<01>', 'month<01>', 'year<2050>'])\nGenerate labelled date features from a series of dates of birth.\nFeatures take the form [\"day<dd>\", \"month<mm>\", \"year<YYYY>\"]. Note that this feature generator can be used for any sort of date data, not just dates of birth.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndob\npandas.pandas.Series\nSeries of dates of birth.\nrequired\n\n\ndayfirst\nbool\nWhether the day comes first in the DOBs. Passed to pd.to_datetime() and defaults to True.\nTrue\n\n\nyearfirst\nbool\nWhether the year comes first in the DOBs. 
Passed to pd.to_datetime() and defaults to False.\nFalse\n\n\ndefault\nlist[str]\nDefault date to fill in missing data in feature (list) form. Default is the feature form of 2050-01-01.\n['day<01>', 'month<01>', 'year<2050>']\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of date features.\n\n\n\n\n\n\n\nembedder.features.gen_double_metaphone(string)\nGenerate the double methaphones of a string.\nThis function is a generator containing all the possible, non-empty double metaphones of a given string, separated by spaces. This function uses the metaphone.doublemetaphone() function under the hood, ignoring any empty strings. See their repository for details.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString from which to derive double metaphones.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next double metaphone in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_features(string, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate string features of various types.\nThis function is a generator capable of producing n-grams, skip 2-grams, and double metaphones from a single string. These outputs are referred to as features.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nBase string from which to generate features.\nrequired\n\n\nngram_length\nlist\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. 
Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next feature in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_misc_features(field, label=None)\nGenerate miscellaneous categorical features for a series.\nUseful for keeping raw columns in the linkage data. All features use a label and take the form [\"label<option>\"] except for missing data, which are coded as \"\".\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries from which to generate our features.\nrequired\n\n\nlabel\nNone | str | typing.Hashable\nLabel for the series. By default, the name of the series is used if available. Otherwise, if not specified, misc is used.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of miscellaneous features.\n\n\n\n\n\n\n\nembedder.features.gen_misc_shingled_features(field, ngram_length=[2, 3], use_gen_skip_grams=False, label=None)\nGenerate shingled labelled features.\nGenerate n-grams, with a label to distinguish them from (and ensure they’re hashed separately from) names. Like gen_name_features(), this function makes a call to gen_features() via pd.Series.apply().\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries of string data.\nrequired\n\n\nngram_length\nlist\nShingle sizes to generate. By default [2, 3].\n[2, 3]\n\n\nuse_gen_skip_grams\nbool\nWhether to generate skip 2-grams. False by default.\nFalse\n\n\nlabel\nstr\nA label to differentiate from other shingled features. 
If field has no name, this defaults to zz.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of shingled string features.\n\n\n\n\n\n\n\nembedder.features.gen_name_features(names, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate a features series for a series of names.\nEffectively, this function is a call to pd.Series.apply() using our gen_features() string feature generator function.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nnames\npandas.pandas.Series\nSeries of names.\nrequired\n\n\nngram_length\nlist[int]\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of features.\n\n\n\n\n\n\n\nembedder.features.gen_ngram(split_tokens, ngram_length)\nGenerate n-grams from a set of tokens.\nThis is a generator function that contains a series of n-grams the size of the sliding window.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form n-grams.\nrequired\n\n\nngram_length\nlist\nDesired lengths of n-grams. 
For examples, ngram_length=[2, 3] would generate all 2-grams and 3-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next n-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_sex_features(sexes)\nGenerate labelled sex features from a series of sexes.\nFeatures take the form [\"sex<option>\"] or [\"\"] for missing data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsexes\npandas.pandas.Series\nSeries of sex data.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of sex features.\n\n\n\n\n\n\n\nembedder.features.gen_skip_grams(split_tokens)\nGenerate skip 2-grams from a set of tokens.\nThis function is a generator that contains a series of skip 2-grams.\n\n\n>>> string = \"dave james\"\n>>> tokens = split_string_underscore(string)\n>>> skips = list(gen_skip_grams(tokens))\n>>> print(skips)\n[\"_a\", \"dv\", \"ae\", \"v_\", \"_a\", \"jm\", \"ae\", \"ms\", \"e_\"]\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form skip 2-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next skip 2-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.split_string_underscore(string)\nSplit and underwrap a string at typical punctuation marks.\nCurrently, we split at any combination of spaces, dashes, dots, commas, or underscores.\n\n\n>>> strings = (\"dave william johnson\", \"Francesca__Hogan-O'Malley\")\n>>> for string in strings:\n... 
print(split_string_underscore(string))\n[\"_dave_\", \"_william_\", \"_johnson_\"]\n[\"_Francesca_\", \"_Hogan_\", \"_O'Malley_\"]\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString to split.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nlist[str]\nList of the split and wrapped tokens.", + "text": "Name\nDescription\n\n\n\n\ngen_dateofbirth_features\nGenerate labelled date features from a series of dates of birth.\n\n\ngen_double_metaphone\nGenerate the double methaphones of a string.\n\n\ngen_features\nGenerate string features of various types.\n\n\ngen_misc_features\nGenerate miscellaneous categorical features for a series.\n\n\ngen_misc_shingled_features\nGenerate shingled labelled features.\n\n\ngen_name_features\nGenerate a features series for a series of names.\n\n\ngen_ngram\nGenerate n-grams from a set of tokens.\n\n\ngen_sex_features\nGenerate labelled sex features from a series of sexes.\n\n\ngen_skip_grams\nGenerate skip 2-grams from a set of tokens.\n\n\nsplit_string_underscore\nSplit and underwrap a string at typical punctuation marks.\n\n\n\n\n\nembedder.features.gen_dateofbirth_features(dob, dayfirst=True, yearfirst=False, default=[])\nGenerate labelled date features from a series of dates of birth.\nFeatures take the form [\"day<dd>\", \"month<mm>\", \"year<YYYY>\"]. Note that this feature generator can be used for any sort of date data, not just dates of birth.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\ndob\npandas.pandas.Series\nSeries of dates of birth.\nrequired\n\n\ndayfirst\nbool\nWhether the day comes first in the DOBs. Passed to pd.to_datetime() and defaults to True.\nTrue\n\n\nyearfirst\nbool\nWhether the year comes first in the DOBs. Passed to pd.to_datetime() and defaults to False.\nFalse\n\n\ndefault\nlist[str]\nDefault date to fill in missing data in feature (list) form. 
Default is the feature form of 2050-01-01.\n[]\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of date features.\n\n\n\n\n\n\n\nembedder.features.gen_double_metaphone(string)\nGenerate the double methaphones of a string.\nThis function is a generator containing all the possible, non-empty double metaphones of a given string, separated by spaces. This function uses the metaphone.doublemetaphone() function under the hood, ignoring any empty strings. See their repository for details.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString from which to derive double metaphones.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next double metaphone in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_features(string, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate string features of various types.\nThis function is a generator capable of producing n-grams, skip 2-grams, and double metaphones from a single string. These outputs are referred to as features.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nBase string from which to generate features.\nrequired\n\n\nngram_length\nlist\nLengths of n-grams to make. Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next feature in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_misc_features(field, label=None)\nGenerate miscellaneous categorical features for a series.\nUseful for keeping raw columns in the linkage data. 
All features use a label and take the form [\"label<option>\"] except for missing data, which are coded as \"\".\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries from which to generate our features.\nrequired\n\n\nlabel\nNone | str | typing.Hashable\nLabel for the series. By default, the name of the series is used if available. Otherwise, if not specified, misc is used.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of miscellaneous features.\n\n\n\n\n\n\n\nembedder.features.gen_misc_shingled_features(field, ngram_length=[2, 3], use_gen_skip_grams=False, label=None)\nGenerate shingled labelled features.\nGenerate n-grams, with a label to distinguish them from (and ensure they’re hashed separately from) names. Like gen_name_features(), this function makes a call to gen_features() via pd.Series.apply().\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nfield\npandas.pandas.Series\nSeries of string data.\nrequired\n\n\nngram_length\nlist\nShingle sizes to generate. By default [2, 3].\n[2, 3]\n\n\nuse_gen_skip_grams\nbool\nWhether to generate skip 2-grams. False by default.\nFalse\n\n\nlabel\nstr\nA label to differentiate from other shingled features. If field has no name, this defaults to zz.\nNone\n\n\n\n\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of shingled string features.\n\n\n\n\n\n\n\nembedder.features.gen_name_features(names, ngram_length=[2, 3], use_gen_ngram=True, use_gen_skip_grams=False, use_double_metaphone=False)\nGenerate a features series for a series of names.\nEffectively, this function is a call to pd.Series.apply() using our gen_features() string feature generator function.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nnames\npandas.pandas.Series\nSeries of names.\nrequired\n\n\nngram_length\nlist[int]\nLengths of n-grams to make. 
Ignored if use_gen_ngram=False.\n[2, 3]\n\n\nuse_gen_ngram\nbool\nWhether to create n-grams. Default is True.\nTrue\n\n\nuse_gen_skip_grams\nbool\nWhether to create skip 2-grams. Default is False.\nFalse\n\n\nuse_double_metaphone\nbool\nWhether to create double metaphones. Default is False.\nFalse\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of features.\n\n\n\n\n\n\n\nembedder.features.gen_ngram(split_tokens, ngram_length)\nGenerate n-grams from a set of tokens.\nThis is a generator function that contains a series of n-grams the size of the sliding window.\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form n-grams.\nrequired\n\n\nngram_length\nlist\nDesired lengths of n-grams. For examples, ngram_length=[2, 3] would generate all 2-grams and 3-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next n-gram in the sequence.\n\n\n\n\n\n\n\nembedder.features.gen_sex_features(sexes)\nGenerate labelled sex features from a series of sexes.\nFeatures take the form [\"sex<option>\"] or [\"\"] for missing data.\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsexes\npandas.pandas.Series\nSeries of sex data.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\npandas.pandas.Series\nSeries containing lists of sex features.\n\n\n\n\n\n\n\nembedder.features.gen_skip_grams(split_tokens)\nGenerate skip 2-grams from a set of tokens.\nThis function is a generator that contains a series of skip 2-grams.\n\n\n>>> string = \"dave james\"\n>>> tokens = split_string_underscore(string)\n>>> skips = list(gen_skip_grams(tokens))\n>>> print(skips)\n[\"_a\", \"dv\", \"ae\", \"v_\", \"_a\", \"jm\", \"ae\", \"ms\", \"e_\"]\n\n\n\n\n\n\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nsplit_tokens\nlist\nAll the split-up tokens from which to form skip 2-grams.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nstr\nThe next skip 2-gram 
in the sequence.\n\n\n\n\n\n\n\nembedder.features.split_string_underscore(string)\nSplit and underwrap a string at typical punctuation marks.\nCurrently, we split at any combination of spaces, dashes, dots, commas, or underscores.\n\n\n>>> strings = (\"dave william johnson\", \"Francesca__Hogan-O'Malley\")\n>>> for string in strings:\n... print(split_string_underscore(string))\n[\"_dave_\", \"_william_\", \"_johnson_\"]\n[\"_Francesca_\", \"_Hogan_\", \"_O'Malley_\"]\n\n\n\n\n\n\nName\nType\nDescription\nDefault\n\n\n\n\nstring\nstr\nString to split.\nrequired\n\n\n\n\n\n\n\n\n\nType\nDescription\n\n\n\n\nlist[str]\nList of the split and wrapped tokens.", "crumbs": [ "About", "Docs", diff --git a/sitemap.xml b/sitemap.xml index 812ac85..b322d28 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,66 +2,66 @@ https://datasciencecampus.github.io/pprl_toolkit/index.html - 2024-05-02T10:17:42.950Z + 2024-05-08T14:03:10.597Z https://datasciencecampus.github.io/pprl_toolkit/docs/reference/index.html - 2024-05-02T10:18:31.033Z + 2024-05-08T14:03:57.385Z https://datasciencecampus.github.io/pprl_toolkit/docs/reference/config.html - 2024-05-02T10:18:31.153Z + 2024-05-08T14:03:57.505Z https://datasciencecampus.github.io/pprl_toolkit/docs/reference/cloud.html - 2024-05-02T10:18:31.185Z + 2024-05-08T14:03:57.537Z https://datasciencecampus.github.io/pprl_toolkit/docs/reference/embedder.html - 2024-05-02T10:18:31.101Z + 2024-05-08T14:03:57.453Z https://datasciencecampus.github.io/pprl_toolkit/docs/reference/encryption.html - 2024-05-02T10:18:31.149Z + 2024-05-08T14:03:57.501Z https://datasciencecampus.github.io/pprl_toolkit/docs/tutorials/example-verknupfung.html - 2024-05-02T10:17:42.950Z + 2024-05-08T14:03:10.597Z https://datasciencecampus.github.io/pprl_toolkit/docs/tutorials/in-the-cloud.html - 2024-05-02T10:17:42.950Z + 2024-05-08T14:03:10.597Z https://datasciencecampus.github.io/pprl_toolkit/docs/tutorials/run-through.html - 2024-05-02T10:17:42.950Z + 2024-05-08T14:03:10.597Z 
https://datasciencecampus.github.io/pprl_toolkit/docs/tutorials/example-febrl.html - 2024-05-02T10:17:42.950Z + 2024-05-08T14:03:10.597Z https://datasciencecampus.github.io/pprl_toolkit/docs/tutorials/index.html - 2024-05-02T10:17:42.950Z + 2024-05-08T14:03:10.597Z https://datasciencecampus.github.io/pprl_toolkit/docs/reference/local.html - 2024-05-02T10:18:31.189Z + 2024-05-08T14:03:57.541Z https://datasciencecampus.github.io/pprl_toolkit/docs/reference/bloom_filters.html - 2024-05-02T10:18:31.053Z + 2024-05-08T14:03:57.405Z https://datasciencecampus.github.io/pprl_toolkit/docs/reference/features.html - 2024-05-02T10:18:31.137Z + 2024-05-08T14:03:57.485Z https://datasciencecampus.github.io/pprl_toolkit/docs/reference/perform.html - 2024-05-02T10:18:31.201Z + 2024-05-08T14:03:57.553Z https://datasciencecampus.github.io/pprl_toolkit/docs/reference/utils.html - 2024-05-02T10:18:31.169Z + 2024-05-08T14:03:57.521Z