-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
merge: Support sequences with cross-checking #1601
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1601 +/- ##
==========================================
+ Coverage 72.24% 72.51% +0.27%
==========================================
Files 79 79
Lines 8268 8459 +191
Branches 1691 1731 +40
==========================================
+ Hits 5973 6134 +161
- Misses 2009 2027 +18
- Partials 286 298 +12 ☔ View full report in Codecov by Sentry. |
d143872
to
f86323a
Compare
--sequences
+ --output-sequences
--sequences
+ --output-sequences
with cross-checking
--sequences
+ --output-sequences
with cross-checkingac2dd27
to
275a87e
Compare
275a87e
to
146084c
Compare
9c0fd88
to
4e40e48
Compare
Preparing for use across functions.
Preparing for use across functions.
Preparing for use across functions.
Preparing for NamedSequences
Preparing for sequence support.
Sequence support will require the ability to load metadata into the database without actually merging (if --output-metadata is not specified).
Preparing for sequence support, which allows unnamed inputs.
Add sequence support in addition to the existing metadata support. SeqKit is used to deduplicate across sequence files. Duplicates within an individual sequence file are not supported. Those are checked by reading IDs using read_sequences.
128d62b
to
32199b8
Compare
@tsibley this is ready for initial review whenever you get the chance! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kinda skimmed, so not marking approved at this point.
Would be nice if methods had return type annotations too; seemed like they generally didn't.
def run(args): | ||
print_info = print_err if not args.quiet else lambda *_: None | ||
def validate_arguments(args): | ||
# These will make more sense when sequence support is added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment can go away, as this is adding sequence support?
def load_metadata( | ||
db: Database, | ||
metadata: List[NamedMetadata], | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def load_metadata( | |
db: Database, | |
metadata: List[NamedMetadata], | |
): | |
def load_metadata( | |
db: Database, | |
metadata: List[NamedMetadata], | |
) -> List[NamedMetadata]: |
# Confirm that seqkit is installed. | ||
if which("seqkit") is None: | ||
raise AugurError("'seqkit' is not installed! This is required to merge sequences.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This duplicates a check that's also done on L717; that latter check is also looking at the SEQKIT
env var, which this check isn't.
I'm not sure which one is correct, but it doesn't seem like both are useful?
|
||
$ source "$TESTDIR"/_setup.sh | ||
|
||
BASIC USAGE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐛 at 81b22d7. when a FASTA doesn't have a trailing new line it results in a malformed merged FASTA.
e.g. 2 fasta files each without trailing new lines:
# A.fasta
>A
ATGC
# B.fasta
>B
ATGC
$ augur merge --sequences A=A.fasta B=B.fasta --output-sequences -
>B
ATGC>AATGC
I also find it strange that B comes before A in the output - is this on purpose?
Description of proposed changes
Initial prototype for sequence support in
augur merge
where metadata and sequence merge happens with additional cross-checking.Related issue(s)
Closes #1579
Checklist