From b9ec0bb604395001e1ab8936687aba61524eb65b Mon Sep 17 00:00:00 2001
From: Victor Lin <13424970+victorlin@users.noreply.github.com>
Date: Thu, 15 Aug 2024 14:57:44 -0700
Subject: [PATCH] Add notes explaining implementation and edge cases

---
 .../filtering-and-subsampling.rst             | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index 7fbc314a..b2f8b6d9 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -86,6 +86,25 @@ total sequences:
      --output-sequences subsampled_sequences.fasta \
      --output-metadata subsampled_metadata.tsv
 
+.. note::
+
+   ``--subsample-max-sequences`` works by internally calculating a value for
+   ``--sequences-per-group``.
+
+.. note::
+
+   For these options, the number of targeted sequences per group does not take
+   into account:
+
+   1. The actual number of sequences available in the input data. For example,
+      consider a dataset with 200 sequences available from 2023 and 100
+      sequences available from 2024. ``--group-by year --subsample-max-sequences
+      300`` is equivalent to ``--group-by year --sequences-per-group 150``. This
+      will take 150 sequences from 2023 and all 100 sequences from 2024 for a
+      total of 250 sequences, which is less than the target of 300.
+   2. Any sequences force-included by ``--include`` or ``--include-where``. This
+      may result in higher than targeted sequences for some groups.
+
 Subsampling using multiple ``augur filter`` commands
 ====================================================
 