Built site for gh-pages
Quarto GHA Workflow Runner committed May 13, 2024
1 parent 8968887 commit de06c89
Showing 9 changed files with 127 additions and 96 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
7064a115
ce820a19
63 changes: 47 additions & 16 deletions docs/reference/embedder.html
@@ -417,26 +417,57 @@ <h4 class="anchored" data-anchor-id="methods">Methods</h4>
</thead>
<tbody>
<tr class="odd">
<td><a href="#pprl.embedder.embedder.EmbeddedDataFrame.anonymise">anonymise</a></td>
<td>Remove raw data from embedded dataframe.</td>
</tr>
<tr class="even">
<td><a href="#pprl.embedder.embedder.EmbeddedDataFrame.to_bloom_matrix">to_bloom_matrix</a></td>
<td>Convert Bloom filter indices into a binary matrix.</td>
</tr>
<tr class="even">
<tr class="odd">
<td><a href="#pprl.embedder.embedder.EmbeddedDataFrame.update_norms">update_norms</a></td>
<td>Generate vector norms for each row.</td>
</tr>
<tr class="odd">
<tr class="even">
<td><a href="#pprl.embedder.embedder.EmbeddedDataFrame.update_thresholds">update_thresholds</a></td>
<td>Generate matching thresholds for each row of the data.</td>
</tr>
</tbody>
</table>
<section id="pprl.embedder.embedder.EmbeddedDataFrame.anonymise" class="level5">
<h5 class="anchored" data-anchor-id="pprl.embedder.embedder.EmbeddedDataFrame.anonymise">anonymise</h5>
<p><code>embedder.embedder.EmbeddedDataFrame.anonymise(keep=None)</code></p>
<p>Remove raw data from embedded dataframe.</p>
<p>Remove all columns from the embedded dataframe except those listed in <code>keep</code>, plus <code>bf_indices</code>, <code>bf_norms</code> and <code>thresholds</code>.</p>
<section id="returns" class="level6">
<h6 class="anchored" data-anchor-id="returns">Returns</h6>
<table class="table">
<colgroup>
<col style="width: 8%">
<col style="width: 91%">
</colgroup>
<thead>
<tr class="header">
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>list[str]</td>
<td>Columns to be returned as they appear in the data in addition to <code>bf_indices</code>, <code>bf_norms</code> and <code>thresholds</code> if they are present in the data.</td>
</tr>
</tbody>
</table>
</section>
</section>
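<p>As a rough illustration of the column selection described for <code>anonymise</code>, the sketch below works on a plain list of column names rather than an embedded dataframe. It is not the library's implementation: the helper name <code>columns_after_anonymise</code> and the <code>RESERVED</code> tuple are hypothetical, and the only behaviour taken from the documentation is that <code>bf_indices</code>, <code>bf_norms</code> and <code>thresholds</code> survive when present, along with anything listed in <code>keep</code>.</p>

```python
# Hypothetical helper for illustration only; not part of the pprl package.
RESERVED = ("bf_indices", "bf_norms", "thresholds")


def columns_after_anonymise(columns, keep=None):
    """Return the column names that would survive anonymisation.

    Assumption: columns are returned in the order they appear in the data,
    and reserved columns are only included if they are actually present.
    """
    wanted = set(keep or []) | set(RESERVED)
    return [col for col in columns if col in wanted]


columns = ["given_name", "surname", "true_id", "bf_indices", "bf_norms"]
kept = columns_after_anonymise(columns, keep=["true_id"])
print(kept)
```

<p>With no <code>keep</code> argument, only the reserved columns that exist in the data remain, so the raw name fields in this toy example would be dropped.</p>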
<section id="pprl.embedder.embedder.EmbeddedDataFrame.to_bloom_matrix" class="level5">
<h5 class="anchored" data-anchor-id="pprl.embedder.embedder.EmbeddedDataFrame.to_bloom_matrix">to_bloom_matrix</h5>
<p><code>embedder.embedder.EmbeddedDataFrame.to_bloom_matrix()</code></p>
<p>Convert Bloom filter indices into a binary matrix.</p>
<p>The matrix has a row for each row in the EDF. The number of columns is equal to <code>self.embedder.bf_size + self.embedder.offset</code>. Each row in the matrix is a Bloom filter expressed as a binary vector, with the ones corresponding to hashed features. This representation is used in the <code>Embedder.compare()</code> method.</p>
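<p>The conversion from index lists to a binary matrix can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the method's actual code: the function name <code>indices_to_binary_matrix</code> is hypothetical, each row is assumed to be a list of integer positions that already fall within <code>bf_size + offset</code> columns, and nested lists stand in for whatever array type the library returns.</p>

```python
# Hypothetical sketch; the real method operates on an EmbeddedDataFrame.
def indices_to_binary_matrix(bf_indices, bf_size, offset=0):
    """Expand per-row Bloom filter indices into rows of zeros and ones."""
    n_cols = bf_size + offset  # column count as stated in the documentation
    matrix = []
    for indices in bf_indices:
        row = [0] * n_cols
        for position in indices:
            row[position] = 1  # a one marks a hashed feature
        matrix.append(row)
    return matrix


rows = [[0, 3], [1, 3, 5]]
matrix = indices_to_binary_matrix(rows, bf_size=6)
for row in matrix:
    print(row)
```

<p>Each output row has one entry per Bloom filter position, so two rows can be compared position by position.</p>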
<section id="returns" class="level6">
<h6 class="anchored" data-anchor-id="returns">Returns</h6>
<section id="returns-1" class="level6">
<h6 class="anchored" data-anchor-id="returns-1">Returns</h6>
<table class="table">
<colgroup>
<col style="width: 20%">
@@ -706,8 +737,8 @@ <h6 class="anchored" data-anchor-id="parameters-2">Parameters</h6>
</tbody>
</table>
</section>
<section id="returns-1" class="level6">
<h6 class="anchored" data-anchor-id="returns-1">Returns</h6>
<section id="returns-2" class="level6">
<h6 class="anchored" data-anchor-id="returns-2">Returns</h6>
<table class="table">
<colgroup>
<col style="width: 24%">
@@ -799,8 +830,8 @@ <h6 class="anchored" data-anchor-id="parameters-3">Parameters</h6>
</tbody>
</table>
</section>
<section id="returns-2" class="level6">
<h6 class="anchored" data-anchor-id="returns-2">Returns</h6>
<section id="returns-3" class="level6">
<h6 class="anchored" data-anchor-id="returns-3">Returns</h6>
<table class="table">
<colgroup>
<col style="width: 49%">
@@ -879,8 +910,8 @@ <h6 class="anchored" data-anchor-id="raises-1">Raises</h6>
</tbody>
</table>
</section>
<section id="returns-3" class="level6">
<h6 class="anchored" data-anchor-id="returns-3">Returns</h6>
<section id="returns-4" class="level6">
<h6 class="anchored" data-anchor-id="returns-4">Returns</h6>
<table class="table">
<colgroup>
<col style="width: 40%">
@@ -932,8 +963,8 @@ <h6 class="anchored" data-anchor-id="parameters-5">Parameters</h6>
</tbody>
</table>
</section>
<section id="returns-4" class="level6">
<h6 class="anchored" data-anchor-id="returns-4">Returns</h6>
<section id="returns-5" class="level6">
<h6 class="anchored" data-anchor-id="returns-5">Returns</h6>
<table class="table">
<colgroup>
<col style="width: 14%">
@@ -1157,8 +1188,8 @@ <h6 class="anchored" data-anchor-id="parameters-8">Parameters</h6>
</tbody>
</table>
</section>
<section id="returns-5" class="level6">
<h6 class="anchored" data-anchor-id="returns-5">Returns</h6>
<section id="returns-6" class="level6">
<h6 class="anchored" data-anchor-id="returns-6">Returns</h6>
<table class="table">
<colgroup>
<col style="width: 25%">
@@ -1241,8 +1272,8 @@ <h4 class="anchored" data-anchor-id="parameters-9">Parameters</h4>
</tbody>
</table>
</section>
<section id="returns-6" class="level4">
<h4 class="anchored" data-anchor-id="returns-6">Returns</h4>
<section id="returns-7" class="level4">
<h4 class="anchored" data-anchor-id="returns-7">Returns</h4>
<table class="table">
<thead>
<tr class="header">
2 changes: 1 addition & 1 deletion docs/reference/utils.html
@@ -471,7 +471,7 @@ <h4 class="anchored" data-anchor-id="parameters-2">Parameters</h4>
<tr class="odd">
<td><code>other_columns</code></td>
<td>None | list</td>
<td>Columns to be returned as they appear in the data in addition to <code>bf_indices</code> and <code>bf_norms</code>.</td>
<td>Columns to be returned as they appear in the data in addition to <code>bf_indices</code>, <code>bf_norms</code> and <code>thresholds</code>.</td>
<td><code>None</code></td>
</tr>
<tr class="even">
24 changes: 12 additions & 12 deletions docs/tutorials/example-febrl.html
@@ -343,7 +343,7 @@ <h1 class="title">Linking the FEBRL datasets</h1>


<p>This tutorial shows how the package can be used locally to match the <a href="http://users.cecs.anu.edu.au/~Peter.Christen/publications/hdkm2008slides.pdf">FEBRL</a> datasets, included as example datasets in the <a href="https://recordlinkage.readthedocs.io/en/latest/"><code>recordlinkage</code></a> package.</p>
<div id="2658b4e0" class="cell" data-execution_count="1">
<div id="614f9355" class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> os</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> time</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> functools <span class="im">import</span> partial</span>
@@ -359,7 +359,7 @@ <h1 class="title">Linking the FEBRL datasets</h1>
<h2 class="anchored" data-anchor-id="load-the-data">Load the data</h2>
<p>We use the two FEBRL4 datasets, which contain 5000 records each. Neither dataset contains duplicates, and every record has a valid match in the other dataset.</p>
<p>After loading the data, we can parse the true matched ID number from the indices.</p>
<div id="454f8bcd" class="cell" data-execution_count="2">
<div id="63fef37a" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>feb4a, feb4b <span class="op">=</span> load_febrl4()</span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>feb4a[<span class="st">"true_id"</span>] <span class="op">=</span> (</span>
@@ -382,7 +382,7 @@ <h2 class="anchored" data-anchor-id="create-a-feature-factory">Create a feature
<li>Pass a dictionary of dictionaries of keyword arguments as an optional <code>ff_args</code> parameter (e.g.&nbsp;<code>ff_args = {"dob": {"dayfirst": False, "yearfirst": True}}</code>)</li>
<li>Use <code>functools.partial()</code>, as we have below.</li>
</ol>
<div id="b7ebef7c" class="cell" data-execution_count="3">
<div id="cf5c1cab" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>feature_factory <span class="op">=</span> <span class="bu">dict</span>(</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a> name<span class="op">=</span>feat.gen_name_features,</span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> dob<span class="op">=</span>partial(feat.gen_dateofbirth_features, dayfirst<span class="op">=</span><span class="va">False</span>, yearfirst<span class="op">=</span><span class="va">True</span>),</span>
@@ -396,7 +396,7 @@ <h2 class="anchored" data-anchor-id="create-a-feature-factory">Create a feature
<section id="initialise-the-embedder-instance" class="level2">
<h2 class="anchored" data-anchor-id="initialise-the-embedder-instance">Initialise the embedder instance</h2>
<p>This instance embeds each feature twice into a Bloom filter of length 1024.</p>
<div id="f8a7a5e3" class="cell" data-execution_count="4">
<div id="bcd35bd0" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb4"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>embedder <span class="op">=</span> Embedder(feature_factory, bf_size<span class="op">=</span><span class="dv">1024</span>, num_hashes<span class="op">=</span><span class="dv">2</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
</section>
@@ -418,7 +418,7 @@ <h2 class="anchored" data-anchor-id="embed-the-datasets">Embed the datasets</h2>
<p>For example, to ensure suburb doesn’t collide with state (if they happened to be the same), <code>gen_misc_features()</code> would encode each of their tokens as <code>suburb&lt;token&gt;</code> and <code>state&lt;token&gt;</code>, respectively. If you want to map different columns into the same feature, such as <code>address</code> below, you can set the label explicitly when passing the function to the embedder.</p>
</div>
</div>
<div id="75454628" class="cell" data-execution_count="5">
<div id="26c65cd3" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>colspec <span class="op">=</span> <span class="bu">dict</span>(</span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> given_name<span class="op">=</span><span class="st">"name"</span>,</span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> surname<span class="op">=</span><span class="st">"name"</span>,</span>
@@ -436,7 +436,7 @@ <h2 class="anchored" data-anchor-id="embed-the-datasets">Embed the datasets</h2>
<span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a>edf2 <span class="op">=</span> embedder.embed(feb4b, colspec<span class="op">=</span>colspec)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>Store the embedded datasets and their embedder to file.</p>
<div id="e043c2cd" class="cell" data-execution_count="6">
<div id="f8d49a2a" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb6"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a>edf1.to_json(<span class="st">"party1_data.json"</span>)</span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>edf2.to_json(<span class="st">"party2_data.json"</span>)</span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>embedder.to_pickle(<span class="st">"embedder.pkl"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -445,30 +445,30 @@ <h2 class="anchored" data-anchor-id="embed-the-datasets">Embed the datasets</h2>
<section id="calculate-similarity" class="level2">
<h2 class="anchored" data-anchor-id="calculate-similarity">Calculate similarity</h2>
<p>Compute the row thresholds to provide a lower bound on matching similarity scores for each row. This operation is the most computationally intensive part of the whole process.</p>
<div id="58f15fdc" class="cell" data-execution_count="7">
<div id="055e343c" class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>start <span class="op">=</span> time.time()</span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a>edf1.update_thresholds()</span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>edf2.update_thresholds()</span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a>end <span class="op">=</span> time.time()</span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(<span class="ss">f"Updating thresholds took </span><span class="sc">{</span>end <span class="op">-</span> start<span class="sc">:.2f}</span><span class="ss"> seconds"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Updating thresholds took 8.40 seconds</code></pre>
<pre><code>Updating thresholds took 8.35 seconds</code></pre>
</div>
</div>
<p>Compute the matrix of similarity scores.</p>
<div id="58455085" class="cell" data-execution_count="8">
<div id="d96a56b8" class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>similarity_scores <span class="op">=</span> embedder.compare(edf1,edf2)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
</section>
<section id="compute-a-match" class="level2">
<h2 class="anchored" data-anchor-id="compute-a-match">Compute a match</h2>
<p>Use the similarity scores to compute a match, using the Hungarian algorithm. First, we compute the match with the row thresholds.</p>
<div id="e5dbd978" class="cell" data-execution_count="9">
<div id="04f6b677" class="cell" data-execution_count="9">
<div class="sourceCode cell-code" id="cb10"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a>matching <span class="op">=</span> similarity_scores.match(require_thresholds<span class="op">=</span><span class="va">True</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
<p>Using the true IDs, evaluate the precision and recall of the match.</p>
<div id="f72fd9d8" class="cell" data-execution_count="10">
<div id="493670e2" class="cell" data-execution_count="10">
<div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> get_results(edf1, edf2, matching):</span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a> <span class="co">"""Get the results for a given matching."""</span></span>
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a></span>
@@ -492,7 +492,7 @@ <h2 class="anchored" data-anchor-id="compute-a-match">Compute a match</h2>
</div>
</div>
<p>Then, we compute the match without using the row thresholds, calculating the same performance metrics:</p>
<div id="74d9fb69" class="cell" data-execution_count="11">
<div id="7342aafc" class="cell" data-execution_count="11">
<div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a>matching <span class="op">=</span> similarity_scores.match(require_thresholds<span class="op">=</span><span class="va">False</span>)</span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>_ <span class="op">=</span> get_results(edf1, edf2, matching)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">