Built site for gh-pages

datasciencecampus · May 13, 2024 · de06c89
1 parent 8968887
commit de06c89

Showing 9 changed files with 127 additions and 96 deletions.
diff --git a/.nojekyll b/.nojekyll
@@ -1 +1 @@
-7064a115
+ce820a19
diff --git a/docs/reference/embedder.html b/docs/reference/embedder.html
@@ -417,26 +417,57 @@ <h4 class="anchored" data-anchor-id="methods">Methods</h4>
 </thead>
 <tbody>
 <tr class="odd">
+<td><a href="#pprl.embedder.embedder.EmbeddedDataFrame.anonymise">anonymise</a></td>
+<td>Remove raw data from embedded dataframe.</td>
+</tr>
+<tr class="even">
 <td><a href="#pprl.embedder.embedder.EmbeddedDataFrame.to_bloom_matrix">to_bloom_matrix</a></td>
 <td>Convert Bloom filter indices into a binary matrix.</td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td><a href="#pprl.embedder.embedder.EmbeddedDataFrame.update_norms">update_norms</a></td>
 <td>Generate vector norms for each row.</td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td><a href="#pprl.embedder.embedder.EmbeddedDataFrame.update_thresholds">update_thresholds</a></td>
 <td>Generate matching thresholds for each row of the data.</td>
 </tr>
 </tbody>
 </table>
+<section id="pprl.embedder.embedder.EmbeddedDataFrame.anonymise" class="level5">
+<h5 class="anchored" data-anchor-id="pprl.embedder.embedder.EmbeddedDataFrame.anonymise">anonymise</h5>
+<p><code>embedder.embedder.EmbeddedDataFrame.anonymise(keep=None)</code></p>
+<p>Remove raw data from embedded dataframe.</p>
+<p>Remove all columns from the embedded dataframe except those listed in <code>keep</code> and <code>bf_indices</code>, <code>bf_norms</code> and <code>thresholds</code>.</p>
+<section id="returns" class="level6">
+<h6 class="anchored" data-anchor-id="returns">Returns</h6>
+<table class="table">
+<colgroup>
+<col style="width: 8%">
+<col style="width: 91%">
+</colgroup>
+<thead>
+<tr class="header">
+<th>Type</th>
+<th>Description</th>
+</tr>
+</thead>
+<tbody>
+<tr class="odd">
+<td>list[str]</td>
+<td>Columns to be returned as they appear in the data in addition to <code>bf_indices</code>, <code>bf_norms</code> and <code>thresholds</code> if they are present in the data.</td>
+</tr>
+</tbody>
+</table>
+</section>
+</section>
 <section id="pprl.embedder.embedder.EmbeddedDataFrame.to_bloom_matrix" class="level5">
 <h5 class="anchored" data-anchor-id="pprl.embedder.embedder.EmbeddedDataFrame.to_bloom_matrix">to_bloom_matrix</h5>
 <p><code>embedder.embedder.EmbeddedDataFrame.to_bloom_matrix()</code></p>
 <p>Convert Bloom filter indices into a binary matrix.</p>
 <p>The matrix has a row for each row in the EDF. The number of columns is equal to <code>self.embedder.bf_size + self.embedder.offset</code>. Each row in the matrix is a Bloom filter expressed as a binary vector, with the ones corresponding to hashed features. This representation is used in the <code>Embedder.compare()</code> method.</p>
-<section id="returns" class="level6">
-<h6 class="anchored" data-anchor-id="returns">Returns</h6>
+<section id="returns-1" class="level6">
+<h6 class="anchored" data-anchor-id="returns-1">Returns</h6>
 <table class="table">
 <colgroup>
 <col style="width: 20%">
@@ -706,8 +737,8 @@ <h6 class="anchored" data-anchor-id="parameters-2">Parameters</h6>
 </tbody>
 </table>
 </section>
-<section id="returns-1" class="level6">
-<h6 class="anchored" data-anchor-id="returns-1">Returns</h6>
+<section id="returns-2" class="level6">
+<h6 class="anchored" data-anchor-id="returns-2">Returns</h6>
 <table class="table">
 <colgroup>
 <col style="width: 24%">
@@ -799,8 +830,8 @@ <h6 class="anchored" data-anchor-id="parameters-3">Parameters</h6>
 </tbody>
 </table>
 </section>
-<section id="returns-2" class="level6">
-<h6 class="anchored" data-anchor-id="returns-2">Returns</h6>
+<section id="returns-3" class="level6">
+<h6 class="anchored" data-anchor-id="returns-3">Returns</h6>
 <table class="table">
 <colgroup>
 <col style="width: 49%">
@@ -879,8 +910,8 @@ <h6 class="anchored" data-anchor-id="raises-1">Raises</h6>
 </tbody>
 </table>
 </section>
-<section id="returns-3" class="level6">
-<h6 class="anchored" data-anchor-id="returns-3">Returns</h6>
+<section id="returns-4" class="level6">
+<h6 class="anchored" data-anchor-id="returns-4">Returns</h6>
 <table class="table">
 <colgroup>
 <col style="width: 40%">
@@ -932,8 +963,8 @@ <h6 class="anchored" data-anchor-id="parameters-5">Parameters</h6>
 </tbody>
 </table>
 </section>
-<section id="returns-4" class="level6">
-<h6 class="anchored" data-anchor-id="returns-4">Returns</h6>
+<section id="returns-5" class="level6">
+<h6 class="anchored" data-anchor-id="returns-5">Returns</h6>
 <table class="table">
 <colgroup>
 <col style="width: 14%">
@@ -1157,8 +1188,8 @@ <h6 class="anchored" data-anchor-id="parameters-8">Parameters</h6>
 </tbody>
 </table>
 </section>
-<section id="returns-5" class="level6">
-<h6 class="anchored" data-anchor-id="returns-5">Returns</h6>
+<section id="returns-6" class="level6">
+<h6 class="anchored" data-anchor-id="returns-6">Returns</h6>
 <table class="table">
 <colgroup>
 <col style="width: 25%">
@@ -1241,8 +1272,8 @@ <h4 class="anchored" data-anchor-id="parameters-9">Parameters</h4>
 </tbody>
 </table>
 </section>
-<section id="returns-6" class="level4">
-<h4 class="anchored" data-anchor-id="returns-6">Returns</h4>
+<section id="returns-7" class="level4">
+<h4 class="anchored" data-anchor-id="returns-7">Returns</h4>
 <table class="table">
 <thead>
 <tr class="header">

diff --git a/docs/reference/utils.html b/docs/reference/utils.html
@@ -471,7 +471,7 @@ <h4 class="anchored" data-anchor-id="parameters-2">Parameters</h4>
 <tr class="odd">
 <td><code>other_columns</code></td>
 <td>None | list</td>
-<td>Columns to be returned as they appear in the data in addition to <code>bf_indices</code> and <code>bf_norms</code>.</td>
+<td>Columns to be returned as they appear in the data in addition to <code>bf_indices</code>, <code>bf_norms</code> and <code>thresholds</code>.</td>
 <td><code>None</code></td>
 </tr>
 <tr class="even">

diff --git a/docs/tutorials/example-febrl.html b/docs/tutorials/example-febrl.html
@@ -343,7 +343,7 @@ <h1 class="title">Linking the FEBRL datasets</h1>
 
 
 <p>This tutorial shows how the package can be used locally to match the <a href="http://users.cecs.anu.edu.au/~Peter.Christen/publications/hdkm2008slides.pdf">FEBRL</a> datasets, included as example datasets in the <a href="https://recordlinkage.readthedocs.io/en/latest/"><code>recordlinkage</code></a> package.</p>
-<div id="2658b4e0" class="cell" data-execution_count="1">
+<div id="614f9355" class="cell" data-execution_count="1">
 <div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> os</span>
 <span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> time</span>
 <span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> functools <span class="im">import</span> partial</span>
@@ -359,7 +359,7 @@ <h1 class="title">Linking the FEBRL datasets</h1>
 <h2 class="anchored" data-anchor-id="load-the-data">Load the data</h2>
 <p>The data comprise 5000 records across two datasets with no duplicates, and every record has a valid match in the other dataset.</p>
 <p>After loading the data, we can parse the true matched ID number from the indices.</p>
-<div id="454f8bcd" class="cell" data-execution_count="2">
+<div id="63fef37a" class="cell" data-execution_count="2">
 <div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>feb4a, feb4b <span class="op">=</span> load_febrl4()</span>
 <span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>feb4a[<span class="st">"true_id"</span>] <span class="op">=</span> (</span>
@@ -382,7 +382,7 @@ <h2 class="anchored" data-anchor-id="create-a-feature-factory">Create a feature
 <li>Pass a dictionary of dictionaries of keyword arguments as an optional <code>ff_args</code> parameter (e.g.&nbsp;<code>ff_args = {"dob": {"dayfirst": False, "yearfirst": True}}</code>)</li>
 <li>Use <code>functools.partial()</code>, as we have below.</li>
 </ol>
-<div id="b7ebef7c" class="cell" data-execution_count="3">
+<div id="cf5c1cab" class="cell" data-execution_count="3">
 <div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>feature_factory <span class="op">=</span> <span class="bu">dict</span>(</span>
 <span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>    name<span class="op">=</span>feat.gen_name_features,</span>
 <span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>    dob<span class="op">=</span>partial(feat.gen_dateofbirth_features, dayfirst<span class="op">=</span><span class="va">False</span>, yearfirst<span class="op">=</span><span class="va">True</span>),</span>
@@ -396,7 +396,7 @@ <h2 class="anchored" data-anchor-id="create-a-feature-factory">Create a feature
 <section id="initialise-the-embedder-instance" class="level2">
 <h2 class="anchored" data-anchor-id="initialise-the-embedder-instance">Initialise the embedder instance</h2>
 <p>This instance embeds each feature twice into a Bloom filter of length 1024.</p>
-<div id="f8a7a5e3" class="cell" data-execution_count="4">
+<div id="bcd35bd0" class="cell" data-execution_count="4">
 <div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>embedder <span class="op">=</span> Embedder(feature_factory, bf_size<span class="op">=</span><span class="dv">1024</span>, num_hashes<span class="op">=</span><span class="dv">2</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
 </section>
@@ -418,7 +418,7 @@ <h2 class="anchored" data-anchor-id="embed-the-datasets">Embed the datasets</h2>
 <p>For example, to ensure suburb doesn’t collide with state (if they happened to be the same), <code>gen_misc_features()</code> would encode each of their tokens as <code>suburb&lt;token&gt;</code> and <code>state&lt;token&gt;</code>, respectively. If you want to map different columns into the same feature, such as <code>address</code> below, you can set the label explicitly when passing the function to the embedder.</p>
 </div>
 </div>
-<div id="75454628" class="cell" data-execution_count="5">
+<div id="26c65cd3" class="cell" data-execution_count="5">
 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>colspec <span class="op">=</span> <span class="bu">dict</span>(</span>
 <span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>    given_name<span class="op">=</span><span class="st">"name"</span>,</span>
 <span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>    surname<span class="op">=</span><span class="st">"name"</span>,</span>
@@ -436,7 +436,7 @@ <h2 class="anchored" data-anchor-id="embed-the-datasets">Embed the datasets</h2>
 <span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a>edf2 <span class="op">=</span> embedder.embed(feb4b, colspec<span class="op">=</span>colspec)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
 <p>Store the embedded datasets and their embedder to file.</p>
-<div id="e043c2cd" class="cell" data-execution_count="6">
+<div id="f8d49a2a" class="cell" data-execution_count="6">
 <div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>edf1.to_json(<span class="st">"party1_data.json"</span>)</span>
 <span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>edf2.to_json(<span class="st">"party2_data.json"</span>)</span>
 <span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>embedder.to_pickle(<span class="st">"embedder.pkl"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -445,30 +445,30 @@ <h2 class="anchored" data-anchor-id="embed-the-datasets">Embed the datasets</h2>
 <section id="calculate-similarity" class="level2">
 <h2 class="anchored" data-anchor-id="calculate-similarity">Calculate similarity</h2>
 <p>Compute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.</p>
-<div id="58f15fdc" class="cell" data-execution_count="7">
+<div id="055e343c" class="cell" data-execution_count="7">
 <div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>start <span class="op">=</span> time.time()</span>
 <span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>edf1.update_thresholds()</span>
 <span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>edf2.update_thresholds()</span>
 <span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a>end <span class="op">=</span> time.time()</span>
 <span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a></span>
 <span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Updating thresholds took </span><span class="sc">{</span>end <span class="op">-</span> start<span class="sc">:.2f}</span><span class="ss"> seconds"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">
-<pre><code>Updating thresholds took 8.40 seconds</code></pre>
+<pre><code>Updating thresholds took 8.35 seconds</code></pre>
 </div>
 </div>
 <p>Compute the matrix of similarity scores.</p>
-<div id="58455085" class="cell" data-execution_count="8">
+<div id="d96a56b8" class="cell" data-execution_count="8">
 <div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>similarity_scores <span class="op">=</span> embedder.compare(edf1,edf2)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
 </section>
 <section id="compute-a-match" class="level2">
 <h2 class="anchored" data-anchor-id="compute-a-match">Compute a match</h2>
 <p>Use the similarity scores to compute a match, using the Hungarian algorithm. First, we compute the match with the row thresholds.</p>
-<div id="e5dbd978" class="cell" data-execution_count="9">
+<div id="04f6b677" class="cell" data-execution_count="9">
 <div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>matching <span class="op">=</span> similarity_scores.match(require_thresholds<span class="op">=</span><span class="va">True</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 </div>
 <p>Using the true IDs, evaluate the precision and recall of the match.</p>
-<div id="f72fd9d8" class="cell" data-execution_count="10">
+<div id="493670e2" class="cell" data-execution_count="10">
 <div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> get_results(edf1, edf2, matching):</span>
 <span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>    <span class="co">"""Get the results for a given matching."""</span></span>
 <span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -492,7 +492,7 @@ <h2 class="anchored" data-anchor-id="compute-a-match">Compute a match</h2>
 </div>
 </div>
 <p>Then, we compute the match without using the row thresholds, calculating the same performance metrics:</p>
-<div id="74d9fb69" class="cell" data-execution_count="11">
+<div id="7342aafc" class="cell" data-execution_count="11">
 <div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>matching <span class="op">=</span> similarity_scores.match(require_thresholds<span class="op">=</span><span class="va">False</span>)</span>
 <span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>_ <span class="op">=</span> get_results(edf1, edf2, matching)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output cell-output-stdout">