Adjust multiple augur filter section for weighted sampling

Weighted sampling makes this scenario technically feasible, but practically difficult to achieve in a single augur filter call. Explain this trade-off in detail.
nextstrain · Aug 19, 2024 · 17ca960
1 parent 6ef6a19
commit 17ca960
Showing 1 changed file with 28 additions and 8 deletions.
diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -266,22 +266,42 @@ Subsampling using multiple ``augur filter`` commands
 ====================================================
 
 There are some subsampling strategies in which a single call to ``augur filter``
-does not suffice. One such strategy is "tiered subsampling". In this strategy,
-mutually exclusive sets of filters, each representing a "tier", are sampled with
-different subsampling rules. This is commonly used to create geographic tiers.
-Consider this subsampling scheme:
+does not suffice or is difficult to put together. One such strategy is "tiered
+subsampling". In this strategy, mutually exclusive sets of filters, each
+representing a "tier", are sampled with different subsampling rules. This is
+commonly used to create geographic tiers. Consider this subsampling scheme:
 
    Sample 100 sequences from Washington state and 50 sequences from the rest of the United States.
 
-This cannot be done in a single call to ``augur filter``. Instead, it can be
-decomposed into multiple schemes, each handled by a single call to ``augur
-filter``. Additionally, there is an extra step to combine the intermediate
-samples.
+This can be approximated in a single call by combining ``--subsample-max-sequences 150``,
+``--group-by state``, and ``--group-by-weights weights.tsv`` with this ``weights.tsv``:
+
+.. code-block::
+
+   state	weight
+   WA	100
+   OR	1.02
+   CA	1.02
+   ...
+
+The above is rather complex: the weights file must list every other state, and their weights must be calculated so that together they account for the remaining 50 sequences:
+
+.. math::
+
+  n_{\text{other sequences}} \times \frac{1}{n_{\text{other states}}} = 50 \times \frac{1}{49} \approx 1.02
+
+A simpler approach is to decompose this into multiple schemes, each handled by a
+single call to ``augur filter``. Additionally, there is an extra step to combine
+the intermediate samples.
 
    1. Sample 100 sequences from Washington state.
    2. Sample 50 sequences from the rest of the United States.
    3. Combine the samples.
 
+.. note::
+
+   FIXME: add note on difference compared to previous example due to lack of ``--group-by``
+
 Calling ``augur filter`` multiple times
 ---------------------------------------
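The weight arithmetic for the single-call approach in the hunk above can be checked with a short script. This is only a sketch under stated assumptions: the function names ``tier_weights`` and ``expected_counts`` are hypothetical, the count of 49 other states is taken from the formula in the diff, and the expected sample sizes assume that each group's target size is proportional to its weight, ignoring rounding and groups that hold fewer sequences than their target.

```python
# Sketch only: reproduces the weight calculation from the diff above.
# The helper names are hypothetical and not part of augur.


def tier_weights(focal_target, other_target, n_other_groups):
    """Return (focal_weight, per_group_weight) for a two-tier scheme.

    The focal group (Washington) keeps its target as its weight, and the
    remaining target is split evenly across the other groups.
    """
    if n_other_groups <= 0:
        raise ValueError("n_other_groups must be positive")
    return float(focal_target), other_target / n_other_groups


def expected_counts(focal_weight, other_weight, n_other_groups, max_sequences):
    """Return the expected (focal, per-other-group) sample sizes.

    Assumption: target sizes are proportional to weights, with no rounding
    and enough sequences available in every group.
    """
    total_weight = focal_weight + other_weight * n_other_groups
    focal = max_sequences * focal_weight / total_weight
    per_other = max_sequences * other_weight / total_weight
    return focal, per_other


# Values from the example: 100 from Washington, 50 from 49 other states.
focal_weight, other_weight = tier_weights(100, 50, 49)
focal_expected, per_other_expected = expected_counts(
    focal_weight, other_weight, 49, 150
)

print("weight per other state:", round(other_weight, 2))
print("expected from Washington:", round(focal_expected, 2))
print("expected from all other states:", round(per_other_expected * 49, 2))
```

Changing either target or the number of other states means recomputing every weight and regenerating the file, which is the practical difficulty the commit message describes.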