Add notes explaining implementation and edge cases
victorlin committed Aug 15, 2024
1 parent b379bb1 commit b9ec0bb
Showing 1 changed file with 19 additions and 0 deletions.
19 changes: 19 additions & 0 deletions src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -86,6 +86,25 @@ total sequences:
--output-sequences subsampled_sequences.fasta \
--output-metadata subsampled_metadata.tsv

.. note::

   ``--subsample-max-sequences`` works by internally calculating a value for
   ``--sequences-per-group``.

.. note::

   For these options, the number of targeted sequences per group does not take
   into account:

   1. The actual number of sequences available in the input data. For example,
      consider a dataset with 200 sequences available from 2023 and 100
      sequences available from 2024.
      ``--group-by year --subsample-max-sequences 300`` is equivalent to
      ``--group-by year --sequences-per-group 150``. This will take 150
      sequences from 2023 and all 100 sequences from 2024 for a total of 250
      sequences, which is less than the target of 300.
   2. Any sequences force-included by ``--include`` or ``--include-where``. This
      may result in more sequences than targeted for some groups.
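
The arithmetic in the note above can be sketched in a few lines of Python. This is a toy model, not augur's implementation: the function name ``sketch_group_sampling`` and its ``forced`` argument are hypothetical, and it assumes the target divides evenly across groups and that force-included sequences are added on top of the sampled ones.

```python
def sketch_group_sampling(available, max_sequences, forced=None):
    """Toy model of the two caveats above; NOT augur's implementation.

    available:     {group: number of sequences in the input data}
    max_sequences: the value given to --subsample-max-sequences
    forced:        {group: number of force-included sequences} (hypothetical input)
    """
    forced = forced or {}
    # Assumption: the target divides evenly across the observed groups.
    per_group = max_sequences // len(available)
    output = {}
    for group, n_available in available.items():
        n_forced = min(forced.get(group, 0), n_available)
        # Assumption: force-included sequences bypass subsampling and are
        # added on top of whatever is sampled from the rest of the group.
        n_sampled = min(per_group, n_available - n_forced)
        output[group] = n_sampled + n_forced
    return per_group, output


# Caveat 1: 2024 has fewer sequences than the per-group target.
per_group, output = sketch_group_sampling({"2023": 200, "2024": 100}, 300)
print(per_group, output, sum(output.values()))

# Caveat 2: ten force-included sequences push 2023 above the per-group target.
_, with_forced = sketch_group_sampling(
    {"2023": 200, "2024": 100}, 300, forced={"2023": 10}
)
print(with_forced)
```

In the first call the per-group target is 150 and the total is 250, matching the example in the note. In the second call the 2023 group ends up with 160 sequences under this model's assumptions.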

Subsampling using multiple ``augur filter`` commands
====================================================
