merge: Support sequences with cross-checking #1601

Open
wants to merge 9 commits into
base: master

Conversation

victorlin
Member

@victorlin victorlin commented Aug 27, 2024

Description of proposed changes

Initial prototype for sequence support in augur merge, where the metadata and sequence merges happen with additional cross-checking between the two.

Related issue(s)

Closes #1579

Checklist

  • Automated checks pass
  • Add a changelog message
  • Add tests
  • Update docs
  • Test on a pathogen repo (zika, avian-flu or mpox?)
    • James/Jennifer/Jover suggested zika should be a good candidate for initial testing.

@victorlin victorlin self-assigned this Aug 27, 2024

codecov bot commented Aug 27, 2024

Codecov Report

Attention: Patch coverage is 87.24280% with 31 lines in your changes missing coverage. Please review.

Project coverage is 72.51%. Comparing base (862aa37) to head (81b22d7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| augur/merge.py | 87.24% | 18 Missing and 13 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1601      +/-   ##
==========================================
+ Coverage   72.24%   72.51%   +0.27%     
==========================================
  Files          79       79              
  Lines        8268     8459     +191     
  Branches     1691     1731      +40     
==========================================
+ Hits         5973     6134     +161     
- Misses       2009     2027      +18     
- Partials      286      298      +12     


@victorlin victorlin force-pushed the victorlin/merge-sequences branch from d143872 to f86323a on August 30, 2024 22:55
@victorlin victorlin changed the title from "merge: Add --sequences + --output-sequences" to "merge: Add --sequences + --output-sequences with cross-checking" on Sep 16, 2024
@victorlin victorlin changed the title from "merge: Add --sequences + --output-sequences with cross-checking" to "merge: Support sequences with cross-checking" on Sep 16, 2024
@victorlin victorlin force-pushed the victorlin/merge-sequences branch from ac2dd27 to 275a87e on October 2, 2024 23:51
@victorlin victorlin force-pushed the victorlin/merge-sequences branch from 275a87e to 146084c on October 15, 2024 22:06
@victorlin victorlin force-pushed the victorlin/merge-sequences branch 2 times, most recently from 9c0fd88 to 4e40e48 on October 23, 2024 18:07
Preparing for use across functions.
Preparing for use across functions.
Preparing for use across functions.
Preparing for NamedSequences
Preparing for sequence support.
Sequence support will require the ability to load metadata into the
database without actually merging (if --output-metadata is not
specified).
Preparing for sequence support, which allows unnamed inputs.
Add sequence support in addition to the existing metadata support.
SeqKit is used to deduplicate across sequence files. Duplicates within
an individual sequence file are not supported. Those are checked by
reading IDs using read_sequences.
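
For illustration, the within-file duplicate check described in the commit message above could look roughly like the sketch below. This is not the code from this PR: `find_duplicate_ids` is a hypothetical helper that scans FASTA header lines directly, whereas the PR reads IDs using read_sequences.

```python
from collections import Counter
from typing import Iterable, List


def find_duplicate_ids(fasta_lines: Iterable[str]) -> List[str]:
    """Return sequence IDs that appear more than once, in first-seen order.

    Hypothetical helper for illustration only; not part of augur.
    """
    counts: Counter = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            # Treat the first whitespace-delimited token after '>' as the ID.
            parts = line[1:].split()
            if parts:
                counts[parts[0]] += 1
    # Counter preserves insertion order, so duplicates come out in the
    # order their IDs were first seen.
    return [seq_id for seq_id, n in counts.items() if n > 1]
```

A caller could raise an error naming the offending file whenever the returned list is non-empty, before handing the files to SeqKit for cross-file deduplication.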
@victorlin victorlin force-pushed the victorlin/merge-sequences branch from 128d62b to 32199b8 on October 23, 2024 20:59
@victorlin victorlin marked this pull request as ready for review October 23, 2024 21:50
@victorlin
Member Author

@tsibley this is ready for initial review whenever you get the chance!

@victorlin victorlin requested a review from tsibley October 23, 2024 21:50
Contributor

@genehack genehack left a comment

Kinda skimmed, so not marking approved at this point.

Would be nice if methods had return type annotations too; seemed like they generally didn't.

def run(args):
    print_info = print_err if not args.quiet else lambda *_: None
def validate_arguments(args):
    # These will make more sense when sequence support is added.
Contributor

This comment can go away, since this PR is what adds sequence support?

Comment on lines +460 to +463
def load_metadata(
    db: Database,
    metadata: List[NamedMetadata],
):
Contributor

Suggested change
- def load_metadata(
-     db: Database,
-     metadata: List[NamedMetadata],
- ):
+ def load_metadata(
+     db: Database,
+     metadata: List[NamedMetadata],
+ ) -> List[NamedMetadata]:

Comment on lines +615 to +617
# Confirm that seqkit is installed.
if which("seqkit") is None:
    raise AugurError("'seqkit' is not installed! This is required to merge sequences.")
Contributor

This duplicates a check that's also done on L717; that latter check is also looking at the SEQKIT env var, which this check isn't.

I'm not sure which one is correct, but it doesn't seem like both are useful?
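
One way to resolve the duplication would be a single lookup shared by both call sites. A minimal sketch, assuming the SEQKIT environment variable is meant to override the `seqkit` found on PATH. That precedence is an assumption, as are the helper name `resolve_seqkit` and the use of RuntimeError as a stand-in for AugurError:

```python
import os
from shutil import which
from typing import Mapping, Optional


def resolve_seqkit(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the path to the seqkit executable or raise if it is missing.

    Hypothetical helper: prefers the SEQKIT environment variable (assumed
    override) and falls back to looking up 'seqkit' on PATH.
    """
    env = os.environ if env is None else env
    candidate = env.get("SEQKIT") or "seqkit"
    path = which(candidate)
    if path is None:
        # The PR raises AugurError; RuntimeError keeps this sketch standalone.
        raise RuntimeError(
            f"{candidate!r} is not installed! This is required to merge sequences."
        )
    return path
```

With one helper like this, the early validation step and the later invocation would agree on which executable is used, and the error message would be produced in a single place.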


$ source "$TESTDIR"/_setup.sh

BASIC USAGE
Member

🐛 at 81b22d7: when a FASTA file doesn't have a trailing newline, the merged FASTA is malformed.

e.g. two FASTA files, each without a trailing newline:

# A.fasta
>A
ATGC
# B.fasta
>B
ATGC
$ augur merge  --sequences A=A.fasta B=B.fasta  --output-sequences -
>B
ATGC>AATGC

I also find it strange that B comes before A in the output - is this on purpose?
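
A possible guard against the missing-newline case, sketched under the assumption that each input can be handled as text in Python before concatenation. The PR itself delegates merging to SeqKit, so this is illustrative only, and `with_trailing_newline` is a hypothetical helper:

```python
from typing import Iterable, Iterator


def with_trailing_newline(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty chunk, appending a newline if it lacks one.

    Hypothetical helper for illustration only; not part of augur.
    """
    for text in chunks:
        if text and not text.endswith("\n"):
            text += "\n"
        yield text


# Two inputs without trailing newlines, in the order shown in the report above.
merged = "".join(with_trailing_newline([">B\nATGC", ">A\nATGC"]))
# merged == ">B\nATGC\n>A\nATGC\n"
```

Normalizing each input this way keeps the last sequence line of one file from running into the header of the next.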

Development

Successfully merging this pull request may close these issues.

merge: Support sequences
3 participants