Adjust multiple augur filter section for weighted sampling
Weighted sampling makes this scenario technically feasible, but
practically difficult to achieve in a single augur filter call. Explain
this trade-off in detail.
victorlin committed Aug 19, 2024
1 parent 6ef6a19 commit 17ca960
Showing 1 changed file with 28 additions and 8 deletions.
36 changes: 28 additions & 8 deletions src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -266,22 +266,42 @@
Subsampling using multiple ``augur filter`` commands
====================================================

There are some subsampling strategies in which a single call to ``augur filter``
-does not suffice. One such strategy is "tiered subsampling". In this strategy,
-mutually exclusive sets of filters, each representing a "tier", are sampled with
-different subsampling rules. This is commonly used to create geographic tiers.
-Consider this subsampling scheme:
+does not suffice or is difficult to put together. One such strategy is "tiered
+subsampling". In this strategy, mutually exclusive sets of filters, each
+representing a "tier", are sampled with different subsampling rules. This is
+commonly used to create geographic tiers. Consider this subsampling scheme:

Sample 100 sequences from Washington state and 50 sequences from the rest of the United States.

-This cannot be done in a single call to ``augur filter``. Instead, it can be
-decomposed into multiple schemes, each handled by a single call to ``augur
-filter``. Additionally, there is an extra step to combine the intermediate
-samples.
+This can be approximated by ``--subsample-max-sequences 150`` + ``--group-by state`` +
+``--group-by-weights weights.tsv`` with this ``weights.tsv``:
+
+.. code-block::
+
+   state weight
+   WA 100
+   OR 1.02
+   CA 1.02
+   ...
+
+The above is rather complex, needing a list of all other states and a calculation to determine their weights:
+
+.. math::
+
+   {n_{\text{other sequences}}} * \frac{1}{{n_{\text{other states}}}} = 50 * \frac{1}{49} \approx 1.02
+
+A simpler approach is to decompose this into multiple schemes, each handled by a
+single call to ``augur filter``. Additionally, there is an extra step to combine
+the intermediate samples.

1. Sample 100 sequences from Washington state.
2. Sample 50 sequences from the rest of the United States.
3. Combine the samples.

+.. note::
+
+   FIXME: add note on difference compared to previous example due to lack of ``--group-by``
+
Calling ``augur filter`` multiple times
---------------------------------------
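
The three-step decomposition listed in the hunk above could be sketched roughly as follows. This is only an illustration and not the content of the collapsed part of the file: the input and output file names, the ``state`` metadata column (borrowed from the ``weights.tsv`` header), and the assumption that the metadata contains only United States sequences are all hypothetical. The sketch builds the ``augur filter`` argument lists and prints them without running anything.

```python
import shlex


def tiered_subsampling_commands(
    metadata="metadata.tsv",       # hypothetical input file name
    sequences="sequences.fasta",   # hypothetical input file name
    state_column="state",          # assumed metadata column name
):
    """Return three ``augur filter`` invocations as argument lists.

    Assumes the metadata holds only United States sequences, so
    "rest of the United States" is simply "state is not WA".
    """
    wa_strains = "sample_strains_wa.txt"        # hypothetical intermediate file
    other_strains = "sample_strains_other.txt"  # hypothetical intermediate file

    # Step 1: sample 100 sequences from Washington state.
    sample_wa = [
        "augur", "filter",
        "--metadata", metadata,
        "--query", f"{state_column} == 'WA'",
        "--subsample-max-sequences", "100",
        "--output-strains", wa_strains,
    ]

    # Step 2: sample 50 sequences from every other state.
    sample_other = [
        "augur", "filter",
        "--metadata", metadata,
        "--query", f"{state_column} != 'WA'",
        "--subsample-max-sequences", "50",
        "--output-strains", other_strains,
    ]

    # Step 3: combine the samples by excluding everything and then
    # force-including the strains selected in steps 1 and 2.
    combine = [
        "augur", "filter",
        "--metadata", metadata,
        "--sequences", sequences,
        "--exclude-all",
        "--include", wa_strains, other_strains,
        "--output-metadata", "subsampled_metadata.tsv",
        "--output-sequences", "subsampled_sequences.fasta",
    ]

    return [sample_wa, sample_other, combine]


if __name__ == "__main__":
    for command in tiered_subsampling_commands():
        print(shlex.join(command))
```

Because the first two calls use no ``--group-by``, each tier is sampled without any grouping, which differs from the weighted single-call approximation shown in the diff.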
