Allow addition of unreleased INRB data #255

jameshadfield · 2024-05-28T04:48:06Z

These changes are to allow addition of private metadata + sequences to be spiked into the build (after filtering but prior to subsampling). They are intended for use by INRB using an excel + fasta file which they maintain. As such, these changes probably shouldn't be merged into our canonical mpox repo, but perhaps we fork the repo into the INRB organisation and maintain these (small) changes there?

If you have access to the test xlsx + fasta (see this internal slack thread) then you can run from within the phylogenetic directory via:

python scripts/curate_private_data.py --sequences <fasta file> --xlsx <xlsx file> --remap-columns 'accession:accession' 'accession:strain' 'collection date:date' 'province:division' 'health zone:location' --fasta-header-idx 1
nextstrain build . --configfile defaults/clade-i/config.yaml --config private_data=true auspice_config=defaults/clade-i/auspice_config_inrb.json -f results/clade-i/good_metadata_combined.tsv auspice/mpox_clade-I.json

to do

Metadata values need updating (e.g. they have "Homo-sapiens" we have "Homo sapiens")
Fork into INRB repo?
Test from INRB

huddlej

This review is just of the test commands and ability of the workflow to produce a tree and not a full code review or phylogenetic review.

The test commands worked for me from an ambient Nextstrain Conda environment after I ran the following commands:

# Needed to install dependencies not in Nextstrain env.
conda install openpyxl

# Needed to make data dir in fresh clone of mpox repo.
mkdir -p data/

After those commands, both the test commands ran without issue and the output tree from the workflow had the private data included as expected. 🎉

I was surprised to see that the output sequences and metadata from curate_private_data.py had default values instead of required values, but I see that the workflow expects those specific default names. Do you want to allow users to configure those names in the workflow config file? If they change the defaults, the workflow won't work as expected, so you could drop those output options completely and force the defaults.

Other minor comments included below.

phylogenetic/rules/prepare_sequences.smk

phylogenetic/scripts/curate_private_data.py

A number of clade-Ia samples are missing these detailed geo resolutions, but they are still very helpful. Additional lat-longs provided by Dr. Eddy K. Lusamaki (INRB, DRC)

This approach is necessary because the mutations assigned to the top branches in the tree are random. I'd consider this approach temporary and we should revisit it in the near-future.

Consists of two main additions: 1. A script to parse private metadata (xlsx format) and sequences and convert them to a format compatible with this workflow. This was designed specifically for the INRBs use case where Excel is the source-of-truth. This is intended to be run from outside the snakemake pipeline because of the additional dependencies not available in the managed conda nextstrain runtime and also to allow users to inspect the output for any warnings. 2. Merging of metadata/sequences from multiple sources via an additonal rule in the snakemake workflow (only used when `--config private_data=true`). This will be able to be removed once `augur filter` accepts multiple inputs. The data source is one-hot encoded prior to subsampling so that can be referenced as needed.

Parameters and description via email correspondence

jameshadfield · 2024-11-18T23:41:38Z

Superseded by #287

Do you want to allow users to configure those names in the workflow config file?

Done in #287 - thanks for the suggestion

huddlej reviewed May 28, 2024

View reviewed changes

phylogenetic/rules/prepare_sequences.smk Show resolved Hide resolved

phylogenetic/scripts/curate_private_data.py Outdated Show resolved Hide resolved

jameshadfield force-pushed the james/inrb branch 2 times, most recently from 33292bb to 02e4c21 Compare May 31, 2024 03:26

jameshadfield force-pushed the james/inrb branch from ded99fc to 6b4b47c Compare June 17, 2024 08:17

jameshadfield added 4 commits September 27, 2024 13:10

Clade-I Update geo exports

412e4ab

A number of clade-Ia samples are missing these detailed geo resolutions, but they are still very helpful. Additional lat-longs provided by Dr. Eddy K. Lusamaki (INRB, DRC)

[clade-I] hardcode clade assignment

cd29467

This approach is necessary because the mutations assigned to the top branches in the tree are random. I'd consider this approach temporary and we should revisit it in the near-future.

INRB-specific configs

46c85b0

Parameters and description via email correspondence

jameshadfield force-pushed the james/inrb branch from 6b4b47c to 46c85b0 Compare September 27, 2024 03:13

jameshadfield closed this Nov 18, 2024

victorlin deleted the james/inrb branch November 20, 2024 01:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow addition of unreleased INRB data #255

Allow addition of unreleased INRB data #255

jameshadfield commented May 28, 2024 •

edited

Loading

huddlej left a comment

jameshadfield commented Nov 18, 2024

Allow addition of unreleased INRB data #255

Allow addition of unreleased INRB data #255

Conversation

jameshadfield commented May 28, 2024 • edited Loading

huddlej left a comment

Choose a reason for hiding this comment

jameshadfield commented Nov 18, 2024

jameshadfield commented May 28, 2024 •

edited

Loading