Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow addition of unreleased INRB data #255

Closed
wants to merge 4 commits into from
Closed

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented May 28, 2024

These changes are to allow addition of private metadata + sequences to be spiked into the build (after filtering but prior to subsampling). They are intended for use by INRB using an excel + fasta file which they maintain. As such, these changes probably shouldn't be merged into our canonical mpox repo, but perhaps we fork the repo into the INRB organisation and maintain these (small) changes there?

If you have access to the test xlsx + fasta (see this internal slack thread) then you can run from within the phylogenetic directory via:

  1. python scripts/curate_private_data.py --sequences <fasta file> --xlsx <xlsx file> --remap-columns 'accession:accession' 'accession:strain' 'collection date:date' 'province:division' 'health zone:location' --fasta-header-idx 1
  2. nextstrain build . --configfile defaults/clade-i/config.yaml --config private_data=true auspice_config=defaults/clade-i/auspice_config_inrb.json -f results/clade-i/good_metadata_combined.tsv auspice/mpox_clade-I.json

to do

  • Metadata values need updating (e.g. they have "Homo-sapiens" we have "Homo sapiens")
  • Fork into INRB repo?
  • Test from INRB

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This review is just of the test commands and ability of the workflow to produce a tree and not a full code review or phylogenetic review.

The test commands worked for me from an ambient Nextstrain Conda environment after I ran the following commands:

# Needed to install dependencies not in Nextstrain env.
conda install openpyxl

# Needed to make data dir in fresh clone of mpox repo.
mkdir -p data/

After those commands, both the test commands ran without issue and the output tree from the workflow had the private data included as expected. 🎉

I was surprised to see that the output sequences and metadata from curate_private_data.py had default values instead of required values, but I see that the workflow expects those specific default names. Do you want to allow users to configure those names in the workflow config file? If they change the defaults, the workflow won't work as expected, so you could drop those output options completely and force the defaults.

Other minor comments included below.

phylogenetic/rules/prepare_sequences.smk Show resolved Hide resolved
phylogenetic/scripts/curate_private_data.py Outdated Show resolved Hide resolved
@jameshadfield jameshadfield force-pushed the james/inrb branch 2 times, most recently from 33292bb to 02e4c21 Compare May 31, 2024 03:26
A number of clade-Ia samples are missing these detailed geo resolutions,
but they are still very helpful.

Additional lat-longs provided by Dr. Eddy K. Lusamaki (INRB, DRC)
This approach is necessary because the mutations assigned to the top
branches in the tree are random. I'd consider this approach temporary
and we should revisit it in the near-future.
Consists of two main additions:

1. A script to parse private metadata (xlsx format) and sequences and
   convert them to a format compatible with this workflow. This was
   designed specifically for the INRBs use case where Excel is the
   source-of-truth. This is intended to be run from outside the
   snakemake pipeline because of the additional dependencies not
   available in the managed conda nextstrain runtime and also to allow
   users to inspect the output for any warnings.

2. Merging of metadata/sequences from multiple sources via an additonal
   rule in the snakemake workflow (only used when `--config
   private_data=true`). This will be able to be removed once `augur
   filter` accepts multiple inputs. The data source is one-hot encoded
   prior to subsampling so that can be referenced as needed.
Parameters and description via email correspondence
@jameshadfield
Copy link
Member Author

Superseded by #287

Do you want to allow users to configure those names in the workflow config file?

Done in #287 - thanks for the suggestion

@victorlin victorlin deleted the james/inrb branch November 20, 2024 01:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants