-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow addition of unreleased INRB data #255
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This review is just of the test commands and ability of the workflow to produce a tree and not a full code review or phylogenetic review.
The test commands worked for me from an ambient Nextstrain Conda environment after I ran the following commands:
# Needed to install dependencies not in Nextstrain env.
conda install openpyxl
# Needed to make data dir in fresh clone of mpox repo.
mkdir -p data/
After those commands, both the test commands ran without issue and the output tree from the workflow had the private data included as expected. 🎉
I was surprised to see that the output sequences and metadata from curate_private_data.py
had default values instead of required values, but I see that the workflow expects those specific default names. Do you want to allow users to configure those names in the workflow config file? If they change the defaults, the workflow won't work as expected, so you could drop those output options completely and force the defaults.
Other minor comments included below.
33292bb
to
02e4c21
Compare
ded99fc
to
6b4b47c
Compare
A number of clade-Ia samples are missing these detailed geo resolutions, but they are still very helpful. Additional lat-longs provided by Dr. Eddy K. Lusamaki (INRB, DRC)
This approach is necessary because the mutations assigned to the top branches in the tree are random. I'd consider this approach temporary and we should revisit it in the near-future.
Consists of two main additions: 1. A script to parse private metadata (xlsx format) and sequences and convert them to a format compatible with this workflow. This was designed specifically for the INRBs use case where Excel is the source-of-truth. This is intended to be run from outside the snakemake pipeline because of the additional dependencies not available in the managed conda nextstrain runtime and also to allow users to inspect the output for any warnings. 2. Merging of metadata/sequences from multiple sources via an additonal rule in the snakemake workflow (only used when `--config private_data=true`). This will be able to be removed once `augur filter` accepts multiple inputs. The data source is one-hot encoded prior to subsampling so that can be referenced as needed.
Parameters and description via email correspondence
6b4b47c
to
46c85b0
Compare
These changes are to allow addition of private metadata + sequences to be spiked into the build (after filtering but prior to subsampling). They are intended for use by INRB using an excel + fasta file which they maintain. As such, these changes probably shouldn't be merged into our canonical mpox repo, but perhaps we fork the repo into the INRB organisation and maintain these (small) changes there?
If you have access to the test
xlsx
+fasta
(see this internal slack thread) then you can run from within thephylogenetic
directory via:python scripts/curate_private_data.py --sequences <fasta file> --xlsx <xlsx file> --remap-columns 'accession:accession' 'accession:strain' 'collection date:date' 'province:division' 'health zone:location' --fasta-header-idx 1
nextstrain build . --configfile defaults/clade-i/config.yaml --config private_data=true auspice_config=defaults/clade-i/auspice_config_inrb.json -f results/clade-i/good_metadata_combined.tsv auspice/mpox_clade-I.json
to do