
Built site for gh-pages
daffidwilde committed Apr 3, 2024
1 parent 1564129 commit c01573f
Showing 5 changed files with 34 additions and 34 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
Original file line number Diff line number Diff line change
@@ -1 +1 @@
-c6e6eefd
+1fafc425
2 changes: 1 addition & 1 deletion docs/tutorials/example-febrl.html
@@ -437,7 +437,7 @@ <h2 class="anchored" data-anchor-id="calculate-similarity">Calculate similarity<
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Updating thresholds took </span><span class="sc">{</span>end <span class="op">-</span> start<span class="sc">:.2f}</span><span class="ss"> seconds"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
-<pre><code>Updating thresholds took 3.02 seconds</code></pre>
+<pre><code>Updating thresholds took 3.07 seconds</code></pre>
</div>
</div>
<p>Compute the matrix of similarity scores.</p>
30 changes: 15 additions & 15 deletions docs/tutorials/run-through.html
@@ -413,14 +413,14 @@ <h2 class="anchored" data-anchor-id="embedding">Embedding</h2>
2 [day&lt;04&gt;, month&lt;10&gt;, year&lt;1995&gt;] [sex&lt;f&gt;]

all_features \
-0 [ull, tu, he, hen, year&lt;2001&gt;, en, nry, sex&lt;m&gt;...
-1 [br, day&lt;02&gt;, ly, all, bro, own, year&lt;2001&gt;, w...
-2 [wr, ina, _l, day&lt;04&gt;, year&lt;1995&gt;, la, na_, a_...
+0 [tul, year&lt;2001&gt;, ry_, month&lt;01&gt;, l_, ull, sex...
+1 [ly, br, sal, year&lt;2001&gt;, _s, month&lt;01&gt;, sex&lt;m...
+2 [ina, month&lt;10&gt;, ey_, in, _in, la, na_, _i, ye...

bf_indices bf_norms
-0 [13, 271, 399, 147, 277, 537, 25, 802, 547, 80... 6.782330
+0 [13, 271, 399, 147, 277, 25, 537, 802, 547, 29... 6.782330
1 [259, 647, 521, 781, 147, 533, 918, 151, 408, ... 7.071068
-2 [898, 391, 655, 531, 148, 663, 151, 668, 413, ... 6.928203
+2 [898, 391, 655, 531, 148, 151, 663, 668, 413, ... 6.928203
personid full_name date_of_birth sex \
0 4 Harry Tull 2/1/2001 M
1 5 Sali Brown 2/1/2001 M
@@ -437,14 +437,14 @@ <h2 class="anchored" data-anchor-id="embedding">Embedding</h2>
2 [day&lt;04&gt;, month&lt;11&gt;, year&lt;1995&gt;] [sex&lt;f&gt;]

all_features \
-0 [ull, day&lt;02&gt;, tu, year&lt;2001&gt;, rr, sex&lt;m&gt;, _t,...
-1 [br, day&lt;02&gt;, bro, own, year&lt;2001&gt;, wn, sex&lt;m&gt;...
-2 [uri, month&lt;11&gt;, ina, rie, _l, day&lt;04&gt;, year&lt;1...
+0 [_ha, tul, year&lt;2001&gt;, ry_, month&lt;01&gt;, har, l_...
+1 [br, sal, year&lt;2001&gt;, _s, month&lt;01&gt;, sex&lt;m&gt;, o...
+2 [ina, in, _in, ri, la, au, na_, month&lt;11&gt;, ur,...

bf_indices bf_norms
-0 [259, 13, 399, 147, 277, 918, 547, 805, 293, 1... 6.855655
+0 [259, 13, 399, 147, 277, 918, 547, 293, 805, 1... 6.855655
1 [259, 515, 6, 647, 521, 781, 143, 15, 147, 533... 6.782330
-2 [385, 771, 647, 136, 391, 908, 653, 658, 786, ... 6.928203 </code></pre>
+2 [385, 771, 391, 136, 647, 908, 653, 786, 658, ... 6.928203 </code></pre>
</div>
</div>
</section>
@@ -488,8 +488,8 @@ <h2 class="anchored" data-anchor-id="computing-the-similarity-scores-and-the-mat
<td>[_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...</td>
<td>[day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;]</td>
<td>[sex&lt;m&gt;]</td>
-<td>[ull, day&lt;02&gt;, tu, year&lt;2001&gt;, rr, sex&lt;m&gt;, _t,...</td>
-<td>[259, 13, 399, 147, 277, 918, 547, 805, 293, 1...</td>
+<td>[_ha, tul, year&lt;2001&gt;, ry_, month&lt;01&gt;, har, l_...</td>
+<td>[259, 13, 399, 147, 277, 918, 547, 293, 805, 1...</td>
<td>6.855655</td>
<td>0.172053</td>
</tr>
@@ -502,7 +502,7 @@ <h2 class="anchored" data-anchor-id="computing-the-similarity-scores-and-the-mat
<td>[_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...</td>
<td>[day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;]</td>
<td>[sex&lt;m&gt;]</td>
-<td>[br, day&lt;02&gt;, bro, own, year&lt;2001&gt;, wn, sex&lt;m&gt;...</td>
+<td>[br, sal, year&lt;2001&gt;, _s, month&lt;01&gt;, sex&lt;m&gt;, o...</td>
<td>[259, 515, 6, 647, 521, 781, 143, 15, 147, 533...</td>
<td>6.782330</td>
<td>0.172053</td>
@@ -516,8 +516,8 @@ <h2 class="anchored" data-anchor-id="computing-the-similarity-scores-and-the-mat
<td>[_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...</td>
<td>[day&lt;04&gt;, month&lt;11&gt;, year&lt;1995&gt;]</td>
<td>[sex&lt;f&gt;]</td>
-<td>[uri, month&lt;11&gt;, ina, rie, _l, day&lt;04&gt;, year&lt;1...</td>
-<td>[385, 771, 647, 136, 391, 908, 653, 658, 786, ...</td>
+<td>[ina, in, _in, ri, la, au, na_, month&lt;11&gt;, ur,...</td>
+<td>[385, 771, 391, 136, 647, 908, 653, 786, 658, ...</td>
<td>6.928203</td>
<td>0.021281</td>
</tr>
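The run-through diff above only reshuffles example output from a Bloom-filter embedding workflow: character n-gram features are hashed (twice each) into a 1024-slot Bloom vector, and records are then compared by cosine similarity. As a rough, self-contained sketch of that technique — not the project's actual API; every name below is illustrative, assuming `bf_size = 2**10` and `num_hashes = 2` as in the tutorial text:

```python
import hashlib
import math

BF_SIZE = 2 ** 10  # 1024 slots, as bf_size in the tutorial
NUM_HASHES = 2     # each feature hashed twice, as num_hashes in the tutorial


def bigrams(text):
    """Character bigram features with boundary markers: 'Ina' -> ['_i', 'in', 'na', 'a_']."""
    padded = f"_{text.lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def bloom_indices(features):
    """Map each feature to NUM_HASHES slots of a BF_SIZE-slot Bloom vector."""
    indices = set()
    for feature in features:
        for seed in range(NUM_HASHES):
            digest = hashlib.sha256(f"{seed}:{feature}".encode()).hexdigest()
            indices.add(int(digest, 16) % BF_SIZE)
    return indices


def cosine(a, b):
    """Cosine similarity of two binary vectors, represented as sets of set-bit indices."""
    if not a or not b:
        return 0.0
    return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))


close = cosine(bloom_indices(bigrams("Henry Tull")), bloom_indices(bigrams("Harry Tull")))
far = cosine(bloom_indices(bigrams("Henry Tull")), bloom_indices(bigrams("Sali Brown")))
```

Close name variants such as "Henry Tull" and "Harry Tull" share most bigrams, so their Bloom vectors overlap heavily and score far above an unrelated pair — which is why the diff's `bf_indices` reorderings leave the `bf_norms` and similarity scores unchanged.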
6 changes: 3 additions & 3 deletions search.json
@@ -123,7 +123,7 @@
"href": "docs/tutorials/example-febrl.html#calculate-similarity",
"title": "Linking the FEBRL datasets",
"section": "Calculate similarity",
-"text": "Calculate similarity\nCompute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.\n\nstart = time.time()\nedf1.update_thresholds()\nedf2.update_thresholds()\nend = time.time()\n\nprint(f\"Updating thresholds took {end - start:.2f} seconds\")\n\nUpdating thresholds took 3.02 seconds\n\n\nCompute the matrix of similarity scores.\n\nsimilarity_scores = embedder.compare(edf1,edf2)"
+"text": "Calculate similarity\nCompute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.\n\nstart = time.time()\nedf1.update_thresholds()\nedf2.update_thresholds()\nend = time.time()\n\nprint(f\"Updating thresholds took {end - start:.2f} seconds\")\n\nUpdating thresholds took 3.07 seconds\n\n\nCompute the matrix of similarity scores.\n\nsimilarity_scores = embedder.compare(edf1,edf2)"
},
{
"objectID": "docs/tutorials/example-febrl.html#compute-a-match",
@@ -151,7 +151,7 @@
"href": "docs/tutorials/run-through.html#embedding",
"title": "Embedder API run-through",
"section": "Embedding",
"text": "Embedding\nNow we can create an Embedder object. We want our Bloom filter vectors to have a length of 1024 elements (actually 1025 because of an offset), and we choose to hash each feature two times. These choices seem to work ok, but we haven’t explored them systematically.\n\nembedder = Embedder(feature_factory,\n ff_args,\n bf_size = 2**10,\n num_hashes=2,\n )\n\nNow we can hash embed the dataset into an EmbeddedDataFrame (EDF). For this we need to pass a column specification colspec that maps each column of the data into the feature_factory functions. Any columns not mapped will not contribute to the embedding.\n\nedf1 = embedder.embed(\n df1, colspec=dict(forename=\"name\", surname=\"name\", dob=\"dob\", gender=\"sex\")\n)\nedf2 = embedder.embed(\n df2, colspec=dict(full_name=\"name\", date_of_birth=\"dob\", sex=\"sex\")\n)\n\nprint(edf1)\nprint(edf2)\n\n id forename surname dob gender \\\n0 1 Henry Tull 1/1/2001 male \n1 2 Sally Brown 2/1/2001 Male \n2 3 Ina Lawrey 4/10/1995 Female \n\n forename_features \\\n0 [_h, he, en, nr, ry, y_, _he, hen, enr, nry, ry_] \n1 [_s, sa, al, ll, ly, y_, _sa, sal, all, lly, ly_] \n2 [_i, in, na, a_, _in, ina, na_] \n\n surname_features \\\n0 [_t, tu, ul, ll, l_, _tu, tul, ull, ll_] \n1 [_b, br, ro, ow, wn, n_, _br, bro, row, own, wn_] \n2 [_l, la, aw, wr, re, ey, y_, _la, law, awr, wr... \n\n dob_features gender_features \\\n0 [day&lt;01&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n1 [day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n2 [day&lt;04&gt;, month&lt;10&gt;, year&lt;1995&gt;] [sex&lt;f&gt;] \n\n all_features \\\n0 [ull, tu, he, hen, year&lt;2001&gt;, en, nry, sex&lt;m&gt;... \n1 [br, day&lt;02&gt;, ly, all, bro, own, year&lt;2001&gt;, w... \n2 [wr, ina, _l, day&lt;04&gt;, year&lt;1995&gt;, la, na_, a_... \n\n bf_indices bf_norms \n0 [13, 271, 399, 147, 277, 537, 25, 802, 547, 80... 6.782330 \n1 [259, 647, 521, 781, 147, 533, 918, 151, 408, ... 
7.071068 \n2 [898, 391, 655, 531, 148, 663, 151, 668, 413, ... 6.928203 \n personid full_name date_of_birth sex \\\n0 4 Harry Tull 2/1/2001 M \n1 5 Sali Brown 2/1/2001 M \n2 6 Ina Laurie 4/11/1995 F \n\n full_name_features \\\n0 [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _... \n1 [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _... \n2 [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _... \n\n date_of_birth_features sex_features \\\n0 [day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n1 [day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n2 [day&lt;04&gt;, month&lt;11&gt;, year&lt;1995&gt;] [sex&lt;f&gt;] \n\n all_features \\\n0 [ull, day&lt;02&gt;, tu, year&lt;2001&gt;, rr, sex&lt;m&gt;, _t,... \n1 [br, day&lt;02&gt;, bro, own, year&lt;2001&gt;, wn, sex&lt;m&gt;... \n2 [uri, month&lt;11&gt;, ina, rie, _l, day&lt;04&gt;, year&lt;1... \n\n bf_indices bf_norms \n0 [259, 13, 399, 147, 277, 918, 547, 805, 293, 1... 6.855655 \n1 [259, 515, 6, 647, 521, 781, 143, 15, 147, 533... 6.782330 \n2 [385, 771, 647, 136, 391, 908, 653, 658, 786, ... 6.928203"
"text": "Embedding\nNow we can create an Embedder object. We want our Bloom filter vectors to have a length of 1024 elements (actually 1025 because of an offset), and we choose to hash each feature two times. These choices seem to work ok, but we haven’t explored them systematically.\n\nembedder = Embedder(feature_factory,\n ff_args,\n bf_size = 2**10,\n num_hashes=2,\n )\n\nNow we can hash embed the dataset into an EmbeddedDataFrame (EDF). For this we need to pass a column specification colspec that maps each column of the data into the feature_factory functions. Any columns not mapped will not contribute to the embedding.\n\nedf1 = embedder.embed(\n df1, colspec=dict(forename=\"name\", surname=\"name\", dob=\"dob\", gender=\"sex\")\n)\nedf2 = embedder.embed(\n df2, colspec=dict(full_name=\"name\", date_of_birth=\"dob\", sex=\"sex\")\n)\n\nprint(edf1)\nprint(edf2)\n\n id forename surname dob gender \\\n0 1 Henry Tull 1/1/2001 male \n1 2 Sally Brown 2/1/2001 Male \n2 3 Ina Lawrey 4/10/1995 Female \n\n forename_features \\\n0 [_h, he, en, nr, ry, y_, _he, hen, enr, nry, ry_] \n1 [_s, sa, al, ll, ly, y_, _sa, sal, all, lly, ly_] \n2 [_i, in, na, a_, _in, ina, na_] \n\n surname_features \\\n0 [_t, tu, ul, ll, l_, _tu, tul, ull, ll_] \n1 [_b, br, ro, ow, wn, n_, _br, bro, row, own, wn_] \n2 [_l, la, aw, wr, re, ey, y_, _la, law, awr, wr... \n\n dob_features gender_features \\\n0 [day&lt;01&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n1 [day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n2 [day&lt;04&gt;, month&lt;10&gt;, year&lt;1995&gt;] [sex&lt;f&gt;] \n\n all_features \\\n0 [tul, year&lt;2001&gt;, ry_, month&lt;01&gt;, l_, ull, sex... \n1 [ly, br, sal, year&lt;2001&gt;, _s, month&lt;01&gt;, sex&lt;m... \n2 [ina, month&lt;10&gt;, ey_, in, _in, la, na_, _i, ye... \n\n bf_indices bf_norms \n0 [13, 271, 399, 147, 277, 25, 537, 802, 547, 29... 6.782330 \n1 [259, 647, 521, 781, 147, 533, 918, 151, 408, ... 
7.071068 \n2 [898, 391, 655, 531, 148, 151, 663, 668, 413, ... 6.928203 \n personid full_name date_of_birth sex \\\n0 4 Harry Tull 2/1/2001 M \n1 5 Sali Brown 2/1/2001 M \n2 6 Ina Laurie 4/11/1995 F \n\n full_name_features \\\n0 [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _... \n1 [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _... \n2 [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _... \n\n date_of_birth_features sex_features \\\n0 [day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n1 [day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n2 [day&lt;04&gt;, month&lt;11&gt;, year&lt;1995&gt;] [sex&lt;f&gt;] \n\n all_features \\\n0 [_ha, tul, year&lt;2001&gt;, ry_, month&lt;01&gt;, har, l_... \n1 [br, sal, year&lt;2001&gt;, _s, month&lt;01&gt;, sex&lt;m&gt;, o... \n2 [ina, in, _in, ri, la, au, na_, month&lt;11&gt;, ur,... \n\n bf_indices bf_norms \n0 [259, 13, 399, 147, 277, 918, 547, 293, 805, 1... 6.855655 \n1 [259, 515, 6, 647, 521, 781, 143, 15, 147, 533... 6.782330 \n2 [385, 771, 391, 136, 647, 908, 653, 786, 658, ... 6.928203"
},
{
"objectID": "docs/tutorials/run-through.html#training",
@@ -165,7 +165,7 @@
"href": "docs/tutorials/run-through.html#computing-the-similarity-scores-and-the-matching",
"title": "Embedder API run-through",
"section": "Computing the similarity scores and the matching",
"text": "Computing the similarity scores and the matching\nNow we have two embedded datasets, we can compare them and compute all the pairwise Cosine similarity scores.\nFirst, we have to compute the vector norms of each Bloom vector (for scaling the Cosine similarity) and the thresholds (thresholds are explained here [link]). Computing the thresholds can be time-consuming for a larger dataset, because it essentially computes all pairwise comparisons of the data to itself.\n\n\n\n\n\n\n\n\n\npersonid\nfull_name\ndate_of_birth\nsex\nfull_name_features\ndate_of_birth_features\nsex_features\nall_features\nbf_indices\nbf_norms\nthresholds\n\n\n\n\n0\n4\nHarry Tull\n2/1/2001\nM\n[_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...\n[day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;]\n[sex&lt;m&gt;]\n[ull, day&lt;02&gt;, tu, year&lt;2001&gt;, rr, sex&lt;m&gt;, _t,...\n[259, 13, 399, 147, 277, 918, 547, 805, 293, 1...\n6.855655\n0.172053\n\n\n1\n5\nSali Brown\n2/1/2001\nM\n[_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...\n[day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;]\n[sex&lt;m&gt;]\n[br, day&lt;02&gt;, bro, own, year&lt;2001&gt;, wn, sex&lt;m&gt;...\n[259, 515, 6, 647, 521, 781, 143, 15, 147, 533...\n6.782330\n0.172053\n\n\n2\n6\nIna Laurie\n4/11/1995\nF\n[_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...\n[day&lt;04&gt;, month&lt;11&gt;, year&lt;1995&gt;]\n[sex&lt;f&gt;]\n[uri, month&lt;11&gt;, ina, rie, _l, day&lt;04&gt;, year&lt;1...\n[385, 771, 647, 136, 391, 908, 653, 658, 786, ...\n6.928203\n0.021281\n\n\n\n\n\n\n\nNB: there’s also a flag to compute these at the same time as the embedding, but it doesn’t by default because, depending on the workflow, you may wish to compute the norms and thresholds at different times (e.g. 
on the server).\nNow you can compute the similarities:\n\nsimilarities = embedder.compare(edf1,edf2)\n\nprint(similarities)\n\n[[0.66670521 0.13043478 0.02128141]\n [0.24754111 0.7923548 0.02041242]\n [0.10526899 0.04256282 0.54166665]]\n\n\nFinally, you can compute the matching:\n\nmatching = similarities.match(abs_cutoff=0.5)\n\nprint(matching)\n\n(array([0, 1, 2]), array([0, 1, 2]))"
"text": "Computing the similarity scores and the matching\nNow we have two embedded datasets, we can compare them and compute all the pairwise Cosine similarity scores.\nFirst, we have to compute the vector norms of each Bloom vector (for scaling the Cosine similarity) and the thresholds (thresholds are explained here [link]). Computing the thresholds can be time-consuming for a larger dataset, because it essentially computes all pairwise comparisons of the data to itself.\n\n\n\n\n\n\n\n\n\npersonid\nfull_name\ndate_of_birth\nsex\nfull_name_features\ndate_of_birth_features\nsex_features\nall_features\nbf_indices\nbf_norms\nthresholds\n\n\n\n\n0\n4\nHarry Tull\n2/1/2001\nM\n[_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...\n[day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;]\n[sex&lt;m&gt;]\n[_ha, tul, year&lt;2001&gt;, ry_, month&lt;01&gt;, har, l_...\n[259, 13, 399, 147, 277, 918, 547, 293, 805, 1...\n6.855655\n0.172053\n\n\n1\n5\nSali Brown\n2/1/2001\nM\n[_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...\n[day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;]\n[sex&lt;m&gt;]\n[br, sal, year&lt;2001&gt;, _s, month&lt;01&gt;, sex&lt;m&gt;, o...\n[259, 515, 6, 647, 521, 781, 143, 15, 147, 533...\n6.782330\n0.172053\n\n\n2\n6\nIna Laurie\n4/11/1995\nF\n[_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...\n[day&lt;04&gt;, month&lt;11&gt;, year&lt;1995&gt;]\n[sex&lt;f&gt;]\n[ina, in, _in, ri, la, au, na_, month&lt;11&gt;, ur,...\n[385, 771, 391, 136, 647, 908, 653, 786, 658, ...\n6.928203\n0.021281\n\n\n\n\n\n\n\nNB: there’s also a flag to compute these at the same time as the embedding, but it doesn’t by default because, depending on the workflow, you may wish to compute the norms and thresholds at different times (e.g. 
on the server).\nNow you can compute the similarities:\n\nsimilarities = embedder.compare(edf1,edf2)\n\nprint(similarities)\n\n[[0.66670521 0.13043478 0.02128141]\n [0.24754111 0.7923548 0.02041242]\n [0.10526899 0.04256282 0.54166665]]\n\n\nFinally, you can compute the matching:\n\nmatching = similarities.match(abs_cutoff=0.5)\n\nprint(matching)\n\n(array([0, 1, 2]), array([0, 1, 2]))"
},
{
"objectID": "docs/tutorials/run-through.html#serialisation-and-file-io",
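The search.json diff ends with a similarity matrix and a matching computed with an absolute cutoff (`similarities.match(abs_cutoff=0.5)` returning `(array([0, 1, 2]), array([0, 1, 2]))`). A hedged sketch of one way such a cutoff matching can work — a greedy one-to-one assignment; the library's actual `match` method may use a different solver — applied to the 3×3 matrix printed in the diff:

```python
# The 3x3 similarity matrix printed in the diff above.
similarities = [
    [0.66670521, 0.13043478, 0.02128141],
    [0.24754111, 0.7923548, 0.02041242],
    [0.10526899, 0.04256282, 0.54166665],
]


def match(scores, abs_cutoff):
    """Greedy one-to-one matching: take the best-scoring pairs at or above the cutoff."""
    ranked = sorted(
        ((scores[i][j], i, j) for i in range(len(scores)) for j in range(len(scores[i]))),
        reverse=True,
    )
    used_rows, used_cols, accepted = set(), set(), []
    for score, i, j in ranked:
        if score >= abs_cutoff and i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            accepted.append((i, j))
    # Return row and column indices of the matched pairs, ordered by row.
    accepted.sort()
    return [i for i, _ in accepted], [j for _, j in accepted]
```

On the matrix above, only the diagonal entries exceed 0.5, so `match(similarities, 0.5)` recovers the one-to-one pairing `([0, 1, 2], [0, 1, 2])` shown in the diff.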
