Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group segments by strains #18

Merged
merged 7 commits into from
Oct 11, 2024
Merged

Group segments by strains #18

merged 7 commits into from
Oct 11, 2024

Conversation

jameshadfield
Copy link
Member

This PR closes #15 and is the first implementation of nextstrain/pathogen-repo-guide#59. There are still some improvements to be made, but it's at a good stopping point to reflect on the approach. The main outstanding tasks/improvements are around:

  • how we resolve metadata disagreements across segments (of the same strain), and probably providing a way to manually resolve disagreements (e.g. via a YAML file)
  • improved logging
  • potentially change the way segment-specific information is presented in the Auspice JSON

Live builds

https://next.nextstrain.org/staging/oropouche/S/grouped-by-strain
https://next.nextstrain.org/staging/oropouche/M/grouped-by-strain
https://next.nextstrain.org/staging/oropouche/L/grouped-by-strain

Running locally

Trial ingest files are available at:

s3://nextstrain-data/files/workflows/oropouche/trials/grouped-by-strain/metadata.tsv.zst
s3://nextstrain-data/files/workflows/oropouche/trials/grouped-by-strain/S/sequences.fasta.zst
s3://nextstrain-data/files/workflows/oropouche/trials/grouped-by-strain/M/sequences.fasta.zst
s3://nextstrain-data/files/workflows/oropouche/trials/grouped-by-strain/L/sequences.fasta.zst

The phylo workflow may be run locally by adding --config ingest_url_prefix:"https://data.nextstrain.org/files/workflows/oropouche/trials/grouped-by-strain/"

Current conflicts in the metadata

The approach is currently overly conservative and removes a number of strains due to conflicting metadata. They fall into three groups, however in total we only drop 21 strains + another 8 segments so I think providing a manually curated conflict-resolution-YAML is the best way to resolve these, where appropriate.

Multiple sequences for a segment
There are 14 of these. Example:

Strain TRVL9760 had multiple sequences for segment S. Dropping this entire strain.

Disagreement of common strain fields
Seven occurrences. One for sampling date, six for authors:

ERROR! 'H498913' sampling dates disagree: {'1988-XX-XX', '1990-XX-XX'}
ERROR! 'LET-2083' Disagreement for 'authors', observed values: {'Limonta,D.,Peres-Restrepo,L.S.,Ciouderis,K.,Hernandez-Ortiz,J.P.,Osorio,J.R.,Perez,L.J.,Perez-Restrepo,L.,Ciuoderis,K.,Usuga,J.,Moreno,I.,Vargas,V.,Arevalo-Arbelaez,A.J.,Berg,M.G.,Cloherty,G.A.,Osorio,J.E.', 'Limonta,D.,Peres-Restrepo,L.S.,Ciouderis,K.,Hernandez-Ortiz,J.P.,Osorio,J.R.', 'Limonta,D.,Peres-Restrepo,L.S.,Ciouderis,K.,Hernandez-Ortiz,J.P.,Osorio,J.R.,Perez,L.J.,Perez-Restrepo,L.S.,Ciuoderis,K.,Usuga,J.,Moreno,I.,Vargas,V.,Arevalo-Arbelaez,A.J.,Berg,M.G.,Cloherty,G.A.,Osorio,J.E.'}

Accession mapped to no segments / multiple segments
n=8, examples:

Accession 'PP952119' (strain 'IRCCS-SCDC_1/2024') mapped to multiple segments: L, S
Accession 'KP347674' (strain 'OROV/MT_CbaH498/2012') mapped to no segments

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't fully finished review, but wanted to provide some workflow feedback.

ingest/rules/curate.smk Outdated Show resolved Hide resolved
ingest/rules/nextclade.smk Outdated Show resolved Hide resolved
ingest/scripts/parse_nextclade_tsv.py Outdated Show resolved Hide resolved
ingest/rules/curate.smk Outdated Show resolved Hide resolved
ingest/scripts/group_metadata.py Outdated Show resolved Hide resolved
ingest/scripts/group_metadata.py Outdated Show resolved Hide resolved
@miparedes
Copy link
Collaborator

miparedes commented Oct 4, 2024

Maybe I'm missing it somewhere but does this new workflow output a file, or even to the terminal, the strains that are being deduped? I ask because I noticed in the builds that are staged that all the Colombia seqs are missing (which can be seen in the live site) and I can't really tell if that's because they were dropped because theyre duplicates or not. Either way, even if some are duplicated I'd expect at least one of those seqs to make it through?

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the multiple iterations on this James! I mostly left non-blocking comments.

I think the only things that need to be resolved

ingest/rules/nextclade.smk Outdated Show resolved Hide resolved
Comment on lines +87 to +88
if len(segments_present)>1:
raise AssertionError(f"Accession '{accession}' (strain '{strain}') mapped to multiple segments: {', '.join(segments_present)}. Skipping this accession.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

Huh, could a single sequence align to multiple segments? Did you actually run into this situation in testing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

Accession 'PP952119' (strain 'IRCCS-SCDC_1/2024') mapped to multiple segments: L, S. Skipping this accession.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh weird, I would have expected the alignment to fail for one of the segments. I wonder if the params for the Nextclade alignment are too lax.

ingest/rules/curate.smk Outdated Show resolved Hide resolved
ingest/rules/group_segments.smk Outdated Show resolved Hide resolved
Produces a single large metadata file keyed off accession with both
strain names and segment-level information encoded therein. The
following commit will group this "long" metadata TSV into a "wide"
TSV where each row represents a strain.
Allows us to avoid another python script.

Original implementation by @joverlee521 in
<#18 (comment)>
jameshadfield added a commit that referenced this pull request Oct 9, 2024
Mismatched field values across segments (e.g. segments disagree on the
'date') are now resolved by choosing the most common occurrence with
the intention they are resolved upstream, as implemented here.

This approach was the third implementation. Initially I resolved
disagreements within `group_segments.py` via a provided resolutions
YAML. After discussion with @joverlee521 we decided this could be better
implemented via `augur curate` and the original implementation here did
this _after_ the segment grouping, however this made it impossible to
distinguish disagreements which will be fixed vs those which won't¹

¹ <#18 (comment)>
tsibley added a commit to nextstrain/conda-base that referenced this pull request Oct 9, 2024
They're provided in our other runtimes (almost by happenstance, as part
of the underlying OS image) and having them available in all runtimes
makes it much easier to write portable programs without having to deal
with GNU vs. BSD differences.

Note that typically these GNU programs would already be available in the
Conda runtime on Linux (via the host system), but not the Conda runtime
on macOS (unless installed separately, e.g. via Homebrew).  So
explicitly including the GNU-flavored programs here increases
consistency, isolation, and portability of the runtime.

These were ostensibly overlooked in "Provide GNU coreutils in the
runtime" (a0b6609).

Related-to: <nextstrain/oropouche#18 (comment)>
@genehack genehack requested a review from miparedes October 9, 2024 23:25
jameshadfield added a commit that referenced this pull request Oct 9, 2024
Mismatched field values across segments (e.g. segments disagree on the
'date') are now resolved by choosing the most common occurrence with
the intention they are resolved upstream, as implemented here.

This approach was the third implementation. Initially I resolved
disagreements within `group_segments.py` via a provided resolutions
YAML. After discussion with @joverlee521 we decided this could be better
implemented via `augur curate` and the original implementation here did
this _after_ the segment grouping, however this made it impossible to
distinguish disagreements which will be fixed vs those which won't¹

NOTE: Here we use accession as the ID, however using strain name would
be better going forward as it would reduce the duplication needed in the
current format. We can't (currently) do this in oropouche because strain
names are added _after_ the curate chain runs.

¹ <#18 (comment)>
@jameshadfield
Copy link
Member Author

@miparedes - I hope to merge this tomorrow, please let me know if you don't want this merged!

@miparedes
Copy link
Collaborator

I just did a fresh pull and everything seems to be running okay. The trees look pretty different than whats currently on nextstrain.org though. I reran everything using this current branch and uploaded the jsons to staging. If you compare https://nextstrain.org/staging/oropouche/S to https://nextstrain.org/oropouche/S you see a different tree topology plus 400 more seqs although the nextstrain.org one was just updated yesterday it seems. same with https://nextstrain.org/staging/oropouche/M where you see the time tree adn the divergence tree looking pretty different (the lower lineage in M seems to be the less diverged but close to the present in time?) but it's probably due to this weird tip https://nextstrain.org/staging/oropouche/M?l=clock&s=BeH505764 which should be easy to exclude.

There doesnt seem to be a massive upload of new seqs to explain the large difference between the two builds so I guess im not too sure where the discrepancy is coming from? Thoughts @jameshadfield ?

@jameshadfield
Copy link
Member Author

There doesnt seem to be a massive upload of new seqs to explain the large difference between the two builds so I guess im not too sure where the discrepancy is coming from?

Thanks for flagging! I forgot to update the phylo exclude/include lists to use strain names. Will sort that out ASAP.

Copy link
Collaborator

@miparedes miparedes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me. Thanks for all the hard work @jameshadfield ! The only outstanding thing is to figure out why the original build didnt include those recent brazil seqs but it isn't blocking given that it seems like the build from this PR is probably the more accurate one.

Transforms a metadata file with one row per accession into a file with
one row per strain. Where a strain contains sequences for multiple segments this
will group those segments together.

Segment-specific field names (i.e. those not in --common-strain-fields) are modified
to ensure their suffix is "_{segment}". For instance "accession → accession_HA".

Rows are matched on strain name and basic sanity checking is performed when grouping.
Segments with multiple matches for a given strain are dropped, and strains where all
segments have either zero or multiple matches are dropped entirely. Manual resolutions
may be provided via a `--resolutions` YAML which is a list of dictionaries, each with
keys "strain", "accession" and "segment" informing the program which accession (of
multiple) to use. The resolutions here are taken both from exploring the data and
the existing phylo exclude list.

Any disagreement within the "--common-strain-fields" will result in the strain being
dropped, however empty values may be replaced and ambiguous dates may be replaced with
specific ones (where appropriate).
Mismatched field values across segments (e.g. segments disagree on the
'date') are now resolved by choosing the most common occurrence with
the intention they are resolved upstream, as implemented here.

This approach was the third implementation. Initially I resolved
disagreements within `group_segments.py` via a provided resolutions
YAML. After discussion with @joverlee521 we decided this could be better
implemented via `augur curate` and the original implementation here did
this _after_ the segment grouping, however this made it impossible to
distinguish disagreements which will be fixed vs those which won't¹

NOTE: Here we use accession as the ID, however using strain name would
be better going forward as it would reduce the duplication needed in the
current format. We can't (currently) do this in oropouche because strain
names are added _after_ the curate chain runs.

¹ <#18 (comment)>
Updates the files which ingest uploads and makes the corresponding
changes to the phylogenetic workflow. As metadata (and sequences) now
use "strain" as the unique ID a number of simplifications can be made
to the workflow.

There is one regression: the "accession" column no longer exists and
is thus not exported. We'll fix this in a subsequent commit.
This adds back in segment-specific metadata to the Auspice JSON. There
are multiple ways this can be done, each with trade-offs. The approach
employed here leaves the "_{segment}" suffix on the field names.
Alternatively we could remap the metadata file for each `export` call so
that (e.g.) "accession_S" becomes "accession".
To use the new metadata format where we group by strain. Steps to regenerate:

1. Populate 'data/' with metadata & sequences from an ingest run.
2. Subsample this as example data via:

```
augur filter --metadata data/metadata.tsv --group-by country --subsample-max-sequences 30 --output-metadata example_data/metadata.tsv
augur filter --metadata example_data/metadata.tsv --sequences data/L/sequences.fasta --output-sequences example_data/sequences_L.fasta
augur filter --metadata example_data/metadata.tsv --sequences data/M/sequences.fasta --output-sequences example_data/sequences_M.fasta
augur filter --metadata example_data/metadata.tsv --sequences data/S/sequences.fasta --output-sequences example_data/sequences_S.fasta
```
@jameshadfield jameshadfield merged commit 06b3e6e into main Oct 11, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ingest: Switch to 1 metadata TSV + 3 FASTA files
5 participants