
Built site for gh-pages
daffidwilde committed Apr 3, 2024
1 parent 1564129 commit c01573f
Showing 5 changed files with 34 additions and 34 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
Original file line number Diff line number Diff line change
@@ -1 +1 @@
-c6e6eefd
+1fafc425
2 changes: 1 addition & 1 deletion docs/tutorials/example-febrl.html
@@ -437,7 +437,7 @@ <h2 class="anchored" data-anchor-id="calculate-similarity">Calculate similarity<
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Updating thresholds took </span><span class="sc">{</span>end <span class="op">-</span> start<span class="sc">:.2f}</span><span class="ss"> seconds"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
-<pre><code>Updating thresholds took 3.02 seconds</code></pre>
+<pre><code>Updating thresholds took 3.07 seconds</code></pre>
</div>
</div>
<p>Compute the matrix of similarity scores.</p>
30 changes: 15 additions & 15 deletions docs/tutorials/run-through.html
@@ -413,14 +413,14 @@ <h2 class="anchored" data-anchor-id="embedding">Embedding</h2>
2 [day&lt;04&gt;, month&lt;10&gt;, year&lt;1995&gt;] [sex&lt;f&gt;]

all_features \
-0 [ull, tu, he, hen, year&lt;2001&gt;, en, nry, sex&lt;m&gt;...
-1 [br, day&lt;02&gt;, ly, all, bro, own, year&lt;2001&gt;, w...
-2 [wr, ina, _l, day&lt;04&gt;, year&lt;1995&gt;, la, na_, a_...
+0 [tul, year&lt;2001&gt;, ry_, month&lt;01&gt;, l_, ull, sex...
+1 [ly, br, sal, year&lt;2001&gt;, _s, month&lt;01&gt;, sex&lt;m...
+2 [ina, month&lt;10&gt;, ey_, in, _in, la, na_, _i, ye...

bf_indices bf_norms
-0 [13, 271, 399, 147, 277, 537, 25, 802, 547, 80... 6.782330
+0 [13, 271, 399, 147, 277, 25, 537, 802, 547, 29... 6.782330
1 [259, 647, 521, 781, 147, 533, 918, 151, 408, ... 7.071068
-2 [898, 391, 655, 531, 148, 663, 151, 668, 413, ... 6.928203
+2 [898, 391, 655, 531, 148, 151, 663, 668, 413, ... 6.928203
personid full_name date_of_birth sex \
0 4 Harry Tull 2/1/2001 M
1 5 Sali Brown 2/1/2001 M
@@ -437,14 +437,14 @@ <h2 class="anchored" data-anchor-id="embedding">Embedding</h2>
2 [day&lt;04&gt;, month&lt;11&gt;, year&lt;1995&gt;] [sex&lt;f&gt;]

all_features \
-0 [ull, day&lt;02&gt;, tu, year&lt;2001&gt;, rr, sex&lt;m&gt;, _t,...
-1 [br, day&lt;02&gt;, bro, own, year&lt;2001&gt;, wn, sex&lt;m&gt;...
-2 [uri, month&lt;11&gt;, ina, rie, _l, day&lt;04&gt;, year&lt;1...
+0 [_ha, tul, year&lt;2001&gt;, ry_, month&lt;01&gt;, har, l_...
+1 [br, sal, year&lt;2001&gt;, _s, month&lt;01&gt;, sex&lt;m&gt;, o...
+2 [ina, in, _in, ri, la, au, na_, month&lt;11&gt;, ur,...

bf_indices bf_norms
-0 [259, 13, 399, 147, 277, 918, 547, 805, 293, 1... 6.855655
+0 [259, 13, 399, 147, 277, 918, 547, 293, 805, 1... 6.855655
1 [259, 515, 6, 647, 521, 781, 143, 15, 147, 533... 6.782330
-2 [385, 771, 647, 136, 391, 908, 653, 658, 786, ... 6.928203 </code></pre>
+2 [385, 771, 391, 136, 647, 908, 653, 786, 658, ... 6.928203 </code></pre>
</div>
</div>
</section>
@@ -488,8 +488,8 @@ <h2 class="anchored" data-anchor-id="computing-the-similarity-scores-and-the-mat
<td>[_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...</td>
<td>[day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;]</td>
<td>[sex&lt;m&gt;]</td>
-<td>[ull, day&lt;02&gt;, tu, year&lt;2001&gt;, rr, sex&lt;m&gt;, _t,...</td>
-<td>[259, 13, 399, 147, 277, 918, 547, 805, 293, 1...</td>
+<td>[_ha, tul, year&lt;2001&gt;, ry_, month&lt;01&gt;, har, l_...</td>
+<td>[259, 13, 399, 147, 277, 918, 547, 293, 805, 1...</td>
<td>6.855655</td>
<td>0.172053</td>
</tr>
@@ -502,7 +502,7 @@ <h2 class="anchored" data-anchor-id="computing-the-similarity-scores-and-the-mat
<td>[_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...</td>
<td>[day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;]</td>
<td>[sex&lt;m&gt;]</td>
-<td>[br, day&lt;02&gt;, bro, own, year&lt;2001&gt;, wn, sex&lt;m&gt;...</td>
+<td>[br, sal, year&lt;2001&gt;, _s, month&lt;01&gt;, sex&lt;m&gt;, o...</td>
<td>[259, 515, 6, 647, 521, 781, 143, 15, 147, 533...</td>
<td>6.782330</td>
<td>0.172053</td>
@@ -516,8 +516,8 @@ <h2 class="anchored" data-anchor-id="computing-the-similarity-scores-and-the-mat
<td>[_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...</td>
<td>[day&lt;04&gt;, month&lt;11&gt;, year&lt;1995&gt;]</td>
<td>[sex&lt;f&gt;]</td>
-<td>[uri, month&lt;11&gt;, ina, rie, _l, day&lt;04&gt;, year&lt;1...</td>
-<td>[385, 771, 647, 136, 391, 908, 653, 658, 786, ...</td>
+<td>[ina, in, _in, ri, la, au, na_, month&lt;11&gt;, ur,...</td>
+<td>[385, 771, 391, 136, 647, 908, 653, 786, 658, ...</td>
<td>6.928203</td>
<td>0.021281</td>
</tr>
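The run-through diff above only reshuffles example output from a Bloom-filter embedding workflow: character n-gram features are hashed (twice each) into a 1024-slot Bloom vector, and records are then compared by cosine similarity. As a rough, self-contained sketch of that technique — not the project's actual API; every name below is illustrative, assuming `bf_size = 2**10` and `num_hashes = 2` as in the tutorial text:

```python
import hashlib
import math

BF_SIZE = 2 ** 10  # 1024 slots, as bf_size in the tutorial
NUM_HASHES = 2     # each feature hashed twice, as num_hashes in the tutorial


def bigrams(text):
    """Character bigram features with boundary markers: 'Ina' -> ['_i', 'in', 'na', 'a_']."""
    padded = f"_{text.lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def bloom_indices(features):
    """Map each feature to NUM_HASHES slots of a BF_SIZE-slot Bloom vector."""
    indices = set()
    for feature in features:
        for seed in range(NUM_HASHES):
            digest = hashlib.sha256(f"{seed}:{feature}".encode()).hexdigest()
            indices.add(int(digest, 16) % BF_SIZE)
    return indices


def cosine(a, b):
    """Cosine similarity of two binary vectors, represented as sets of set-bit indices."""
    if not a or not b:
        return 0.0
    return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))


close = cosine(bloom_indices(bigrams("Henry Tull")), bloom_indices(bigrams("Harry Tull")))
far = cosine(bloom_indices(bigrams("Henry Tull")), bloom_indices(bigrams("Sali Brown")))
```

Close name variants such as "Henry Tull" and "Harry Tull" share most bigrams, so their Bloom vectors overlap heavily and score far above an unrelated pair — which is why the diff's `bf_indices` reorderings leave the `bf_norms` and similarity scores unchanged.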
6 changes: 3 additions & 3 deletions search.json
@@ -123,7 +123,7 @@
"href": "docs/tutorials/example-febrl.html#calculate-similarity",
"title": "Linking the FEBRL datasets",
"section": "Calculate similarity",
-"text": "Calculate similarity\nCompute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.\n\nstart = time.time()\nedf1.update_thresholds()\nedf2.update_thresholds()\nend = time.time()\n\nprint(f\"Updating thresholds took {end - start:.2f} seconds\")\n\nUpdating thresholds took 3.02 seconds\n\n\nCompute the matrix of similarity scores.\n\nsimilarity_scores = embedder.compare(edf1,edf2)"
+"text": "Calculate similarity\nCompute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.\n\nstart = time.time()\nedf1.update_thresholds()\nedf2.update_thresholds()\nend = time.time()\n\nprint(f\"Updating thresholds took {end - start:.2f} seconds\")\n\nUpdating thresholds took 3.07 seconds\n\n\nCompute the matrix of similarity scores.\n\nsimilarity_scores = embedder.compare(edf1,edf2)"
},
{
"objectID": "docs/tutorials/example-febrl.html#compute-a-match",
@@ -151,7 +151,7 @@
"href": "docs/tutorials/run-through.html#embedding",
"title": "Embedder API run-through",
"section": "Embedding",
"text": "Embedding\nNow we can create an Embedder object. We want our Bloom filter vectors to have a length of 1024 elements (actually 1025 because of an offset), and we choose to hash each feature two times. These choices seem to work ok, but we haven’t explored them systematically.\n\nembedder = Embedder(feature_factory,\n ff_args,\n bf_size = 2**10,\n num_hashes=2,\n )\n\nNow we can hash embed the dataset into an EmbeddedDataFrame (EDF). For this we need to pass a column specification colspec that maps each column of the data into the feature_factory functions. Any columns not mapped will not contribute to the embedding.\n\nedf1 = embedder.embed(\n df1, colspec=dict(forename=\"name\", surname=\"name\", dob=\"dob\", gender=\"sex\")\n)\nedf2 = embedder.embed(\n df2, colspec=dict(full_name=\"name\", date_of_birth=\"dob\", sex=\"sex\")\n)\n\nprint(edf1)\nprint(edf2)\n\n id forename surname dob gender \\\n0 1 Henry Tull 1/1/2001 male \n1 2 Sally Brown 2/1/2001 Male \n2 3 Ina Lawrey 4/10/1995 Female \n\n forename_features \\\n0 [_h, he, en, nr, ry, y_, _he, hen, enr, nry, ry_] \n1 [_s, sa, al, ll, ly, y_, _sa, sal, all, lly, ly_] \n2 [_i, in, na, a_, _in, ina, na_] \n\n surname_features \\\n0 [_t, tu, ul, ll, l_, _tu, tul, ull, ll_] \n1 [_b, br, ro, ow, wn, n_, _br, bro, row, own, wn_] \n2 [_l, la, aw, wr, re, ey, y_, _la, law, awr, wr... \n\n dob_features gender_features \\\n0 [day&lt;01&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n1 [day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n2 [day&lt;04&gt;, month&lt;10&gt;, year&lt;1995&gt;] [sex&lt;f&gt;] \n\n all_features \\\n0 [ull, tu, he, hen, year&lt;2001&gt;, en, nry, sex&lt;m&gt;... \n1 [br, day&lt;02&gt;, ly, all, bro, own, year&lt;2001&gt;, w... \n2 [wr, ina, _l, day&lt;04&gt;, year&lt;1995&gt;, la, na_, a_... \n\n bf_indices bf_norms \n0 [13, 271, 399, 147, 277, 537, 25, 802, 547, 80... 6.782330 \n1 [259, 647, 521, 781, 147, 533, 918, 151, 408, ... 
7.071068 \n2 [898, 391, 655, 531, 148, 663, 151, 668, 413, ... 6.928203 \n personid full_name date_of_birth sex \\\n0 4 Harry Tull 2/1/2001 M \n1 5 Sali Brown 2/1/2001 M \n2 6 Ina Laurie 4/11/1995 F \n\n full_name_features \\\n0 [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _... \n1 [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _... \n2 [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _... \n\n date_of_birth_features sex_features \\\n0 [day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n1 [day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n2 [day&lt;04&gt;, month&lt;11&gt;, year&lt;1995&gt;] [sex&lt;f&gt;] \n\n all_features \\\n0 [ull, day&lt;02&gt;, tu, year&lt;2001&gt;, rr, sex&lt;m&gt;, _t,... \n1 [br, day&lt;02&gt;, bro, own, year&lt;2001&gt;, wn, sex&lt;m&gt;... \n2 [uri, month&lt;11&gt;, ina, rie, _l, day&lt;04&gt;, year&lt;1... \n\n bf_indices bf_norms \n0 [259, 13, 399, 147, 277, 918, 547, 805, 293, 1... 6.855655 \n1 [259, 515, 6, 647, 521, 781, 143, 15, 147, 533... 6.782330 \n2 [385, 771, 647, 136, 391, 908, 653, 658, 786, ... 6.928203"
"text": "Embedding\nNow we can create an Embedder object. We want our Bloom filter vectors to have a length of 1024 elements (actually 1025 because of an offset), and we choose to hash each feature two times. These choices seem to work ok, but we haven’t explored them systematically.\n\nembedder = Embedder(feature_factory,\n ff_args,\n bf_size = 2**10,\n num_hashes=2,\n )\n\nNow we can hash embed the dataset into an EmbeddedDataFrame (EDF). For this we need to pass a column specification colspec that maps each column of the data into the feature_factory functions. Any columns not mapped will not contribute to the embedding.\n\nedf1 = embedder.embed(\n df1, colspec=dict(forename=\"name\", surname=\"name\", dob=\"dob\", gender=\"sex\")\n)\nedf2 = embedder.embed(\n df2, colspec=dict(full_name=\"name\", date_of_birth=\"dob\", sex=\"sex\")\n)\n\nprint(edf1)\nprint(edf2)\n\n id forename surname dob gender \\\n0 1 Henry Tull 1/1/2001 male \n1 2 Sally Brown 2/1/2001 Male \n2 3 Ina Lawrey 4/10/1995 Female \n\n forename_features \\\n0 [_h, he, en, nr, ry, y_, _he, hen, enr, nry, ry_] \n1 [_s, sa, al, ll, ly, y_, _sa, sal, all, lly, ly_] \n2 [_i, in, na, a_, _in, ina, na_] \n\n surname_features \\\n0 [_t, tu, ul, ll, l_, _tu, tul, ull, ll_] \n1 [_b, br, ro, ow, wn, n_, _br, bro, row, own, wn_] \n2 [_l, la, aw, wr, re, ey, y_, _la, law, awr, wr... \n\n dob_features gender_features \\\n0 [day&lt;01&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n1 [day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n2 [day&lt;04&gt;, month&lt;10&gt;, year&lt;1995&gt;] [sex&lt;f&gt;] \n\n all_features \\\n0 [tul, year&lt;2001&gt;, ry_, month&lt;01&gt;, l_, ull, sex... \n1 [ly, br, sal, year&lt;2001&gt;, _s, month&lt;01&gt;, sex&lt;m... \n2 [ina, month&lt;10&gt;, ey_, in, _in, la, na_, _i, ye... \n\n bf_indices bf_norms \n0 [13, 271, 399, 147, 277, 25, 537, 802, 547, 29... 6.782330 \n1 [259, 647, 521, 781, 147, 533, 918, 151, 408, ... 
7.071068 \n2 [898, 391, 655, 531, 148, 151, 663, 668, 413, ... 6.928203 \n personid full_name date_of_birth sex \\\n0 4 Harry Tull 2/1/2001 M \n1 5 Sali Brown 2/1/2001 M \n2 6 Ina Laurie 4/11/1995 F \n\n full_name_features \\\n0 [_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _... \n1 [_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _... \n2 [_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _... \n\n date_of_birth_features sex_features \\\n0 [day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n1 [day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;] [sex&lt;m&gt;] \n2 [day&lt;04&gt;, month&lt;11&gt;, year&lt;1995&gt;] [sex&lt;f&gt;] \n\n all_features \\\n0 [_ha, tul, year&lt;2001&gt;, ry_, month&lt;01&gt;, har, l_... \n1 [br, sal, year&lt;2001&gt;, _s, month&lt;01&gt;, sex&lt;m&gt;, o... \n2 [ina, in, _in, ri, la, au, na_, month&lt;11&gt;, ur,... \n\n bf_indices bf_norms \n0 [259, 13, 399, 147, 277, 918, 547, 293, 805, 1... 6.855655 \n1 [259, 515, 6, 647, 521, 781, 143, 15, 147, 533... 6.782330 \n2 [385, 771, 391, 136, 647, 908, 653, 786, 658, ... 6.928203"
},
{
"objectID": "docs/tutorials/run-through.html#training",
@@ -165,7 +165,7 @@
"href": "docs/tutorials/run-through.html#computing-the-similarity-scores-and-the-matching",
"title": "Embedder API run-through",
"section": "Computing the similarity scores and the matching",
"text": "Computing the similarity scores and the matching\nNow we have two embedded datasets, we can compare them and compute all the pairwise Cosine similarity scores.\nFirst, we have to compute the vector norms of each Bloom vector (for scaling the Cosine similarity) and the thresholds (thresholds are explained here [link]). Computing the thresholds can be time-consuming for a larger dataset, because it essentially computes all pairwise comparisons of the data to itself.\n\n\n\n\n\n\n\n\n\npersonid\nfull_name\ndate_of_birth\nsex\nfull_name_features\ndate_of_birth_features\nsex_features\nall_features\nbf_indices\nbf_norms\nthresholds\n\n\n\n\n0\n4\nHarry Tull\n2/1/2001\nM\n[_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...\n[day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;]\n[sex&lt;m&gt;]\n[ull, day&lt;02&gt;, tu, year&lt;2001&gt;, rr, sex&lt;m&gt;, _t,...\n[259, 13, 399, 147, 277, 918, 547, 805, 293, 1...\n6.855655\n0.172053\n\n\n1\n5\nSali Brown\n2/1/2001\nM\n[_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...\n[day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;]\n[sex&lt;m&gt;]\n[br, day&lt;02&gt;, bro, own, year&lt;2001&gt;, wn, sex&lt;m&gt;...\n[259, 515, 6, 647, 521, 781, 143, 15, 147, 533...\n6.782330\n0.172053\n\n\n2\n6\nIna Laurie\n4/11/1995\nF\n[_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...\n[day&lt;04&gt;, month&lt;11&gt;, year&lt;1995&gt;]\n[sex&lt;f&gt;]\n[uri, month&lt;11&gt;, ina, rie, _l, day&lt;04&gt;, year&lt;1...\n[385, 771, 647, 136, 391, 908, 653, 658, 786, ...\n6.928203\n0.021281\n\n\n\n\n\n\n\nNB: there’s also a flag to compute these at the same time as the embedding, but it doesn’t by default because, depending on the workflow, you may wish to compute the norms and thresholds at different times (e.g. 
on the server).\nNow you can compute the similarities:\n\nsimilarities = embedder.compare(edf1,edf2)\n\nprint(similarities)\n\n[[0.66670521 0.13043478 0.02128141]\n [0.24754111 0.7923548 0.02041242]\n [0.10526899 0.04256282 0.54166665]]\n\n\nFinally, you can compute the matching:\n\nmatching = similarities.match(abs_cutoff=0.5)\n\nprint(matching)\n\n(array([0, 1, 2]), array([0, 1, 2]))"
"text": "Computing the similarity scores and the matching\nNow we have two embedded datasets, we can compare them and compute all the pairwise Cosine similarity scores.\nFirst, we have to compute the vector norms of each Bloom vector (for scaling the Cosine similarity) and the thresholds (thresholds are explained here [link]). Computing the thresholds can be time-consuming for a larger dataset, because it essentially computes all pairwise comparisons of the data to itself.\n\n\n\n\n\n\n\n\n\npersonid\nfull_name\ndate_of_birth\nsex\nfull_name_features\ndate_of_birth_features\nsex_features\nall_features\nbf_indices\nbf_norms\nthresholds\n\n\n\n\n0\n4\nHarry Tull\n2/1/2001\nM\n[_h, ha, ar, rr, ry, y_, _t, tu, ul, ll, l_, _...\n[day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;]\n[sex&lt;m&gt;]\n[_ha, tul, year&lt;2001&gt;, ry_, month&lt;01&gt;, har, l_...\n[259, 13, 399, 147, 277, 918, 547, 293, 805, 1...\n6.855655\n0.172053\n\n\n1\n5\nSali Brown\n2/1/2001\nM\n[_s, sa, al, li, i_, _b, br, ro, ow, wn, n_, _...\n[day&lt;02&gt;, month&lt;01&gt;, year&lt;2001&gt;]\n[sex&lt;m&gt;]\n[br, sal, year&lt;2001&gt;, _s, month&lt;01&gt;, sex&lt;m&gt;, o...\n[259, 515, 6, 647, 521, 781, 143, 15, 147, 533...\n6.782330\n0.172053\n\n\n2\n6\nIna Laurie\n4/11/1995\nF\n[_i, in, na, a_, _l, la, au, ur, ri, ie, e_, _...\n[day&lt;04&gt;, month&lt;11&gt;, year&lt;1995&gt;]\n[sex&lt;f&gt;]\n[ina, in, _in, ri, la, au, na_, month&lt;11&gt;, ur,...\n[385, 771, 391, 136, 647, 908, 653, 786, 658, ...\n6.928203\n0.021281\n\n\n\n\n\n\n\nNB: there’s also a flag to compute these at the same time as the embedding, but it doesn’t by default because, depending on the workflow, you may wish to compute the norms and thresholds at different times (e.g. 
on the server).\nNow you can compute the similarities:\n\nsimilarities = embedder.compare(edf1,edf2)\n\nprint(similarities)\n\n[[0.66670521 0.13043478 0.02128141]\n [0.24754111 0.7923548 0.02041242]\n [0.10526899 0.04256282 0.54166665]]\n\n\nFinally, you can compute the matching:\n\nmatching = similarities.match(abs_cutoff=0.5)\n\nprint(matching)\n\n(array([0, 1, 2]), array([0, 1, 2]))"
},
{
"objectID": "docs/tutorials/run-through.html#serialisation-and-file-io",
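The search.json diff ends with a similarity matrix and a matching computed with an absolute cutoff (`similarities.match(abs_cutoff=0.5)` returning `(array([0, 1, 2]), array([0, 1, 2]))`). A hedged sketch of one way such a cutoff matching can work — a greedy one-to-one assignment; the library's actual `match` method may use a different solver — applied to the 3×3 matrix printed in the diff:

```python
# The 3x3 similarity matrix printed in the diff above.
similarities = [
    [0.66670521, 0.13043478, 0.02128141],
    [0.24754111, 0.7923548, 0.02041242],
    [0.10526899, 0.04256282, 0.54166665],
]


def match(scores, abs_cutoff):
    """Greedy one-to-one matching: take the best-scoring pairs at or above the cutoff."""
    ranked = sorted(
        ((scores[i][j], i, j) for i in range(len(scores)) for j in range(len(scores[i]))),
        reverse=True,
    )
    used_rows, used_cols, accepted = set(), set(), []
    for score, i, j in ranked:
        if score >= abs_cutoff and i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            accepted.append((i, j))
    # Return row and column indices of the matched pairs, ordered by row.
    accepted.sort()
    return [i for i, _ in accepted], [j for _, j in accepted]
```

On the matrix above, only the diagonal entries exceed 0.5, so `match(similarities, 0.5)` recovers the one-to-one pairing `([0, 1, 2], [0, 1, 2])` shown in the diff.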
