ingest: Rename Nextclade metadata fields with augur curate rename
This construction reads much clearer and cleaner.

Moves the Nextclade field map directly and more conveniently into the
YAML config instead of referencing a separate TSV file.  Putting the
field map into a separate file seemed to be only for the sake of the
--kv-file (-k) interface provided by `csvtk rename2`, which we're no
longer using here.  Backwards compatibility with configs that name a TSV
file is not preserved since this pathogen-repo-guide is expected to be
used to stamp out new repos, and we don't have any particular
process/plan for how to update previously stamped out repos.
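
As a rough illustration of why the inline map is convenient, the sketch below mirrors the two `params` expressions added to `join_metadata_and_nextclade` in this commit. The `field_map` dict is a hypothetical excerpt standing in for `config["nextclade"]["field_map"]`, and the helper function names are made up for this example.

```python
# Hypothetical excerpt of config["nextclade"]["field_map"] as loaded from the YAML config.
field_map = {
    "seqName": "seqName",
    "clade": "clade",
    "totalMissing": "missing_data",
    "qc.missingData.status": "QC_missing_data",
}


def field_map_args(field_map):
    """Build the old=new pairs that the rule passes to --field-map."""
    return [f"{old}={new}" for old, new in field_map.items()]


def renamed_fields(field_map):
    """Build the comma-separated list of new names that the rule passes to tsv-select."""
    return ",".join(field_map.values())


rename_args = field_map_args(field_map)
select_fields = renamed_fields(field_map)
```

Because Python dicts preserve insertion order, the selected columns follow the order in which the fields are listed in the YAML.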

Note that `augur curate` commands currently emit CSV-like TSVs that are
constrained to be IANA-like¹, so parsing them with tsv-utils is most
appropriate; hence the switch from `csvtk cut` to `tsv-select`.

¹ See <nextstrain/augur#1566>.
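
For intuition only, here is a minimal pure-Python sketch of what the rename-then-select step accomplishes on an IANA-like TSV (plain tab splitting, no quote handling). It is not how `augur curate rename` or `tsv-select` are implemented; `rename_and_select` and the sample data are hypothetical.

```python
def rename_and_select(tsv_text, field_map):
    """Rename header columns per field_map, then keep only the renamed columns.

    Assumes an IANA-like TSV: fields contain no tabs or newlines and no quoting.
    Raises ValueError if a mapped column is missing from the header.
    """
    lines = tsv_text.rstrip("\n").split("\n")
    header = lines[0].split("\t")
    # Rename: columns not in the map keep their original name.
    renamed = [field_map.get(name, name) for name in header]
    # Select: keep only the new names, in the order given by the map.
    keep = [renamed.index(new) for new in field_map.values()]
    out = ["\t".join(renamed[i] for i in keep)]
    for line in lines[1:]:
        values = line.split("\t")
        out.append("\t".join(values[i] for i in keep))
    return "\n".join(out) + "\n"


sample = "seqName\tclade\ttotalMissing\textra\nA\t1a\t5\tx\n"
result = rename_and_select(sample, {"seqName": "seqName", "totalMissing": "missing_data"})
```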

Ported-from: <nextstrain/measles@faebd64>
Related-to: <nextstrain/measles#52>
Related-to: <#65>
tsibley committed Oct 3, 2024
1 parent bd393f1 commit 443d0de
Showing 3 changed files with 25 additions and 31 deletions.
20 changes: 17 additions & 3 deletions ingest/defaults/nextclade_config.yaml
@@ -4,9 +4,23 @@ nextclade:
   # The name of the Nextclade dataset to use for running nextclade.
   # Run `nextclade dataset list` to get a full list of available Nextclade datasets
   dataset_name: ""
-  # Path to the mapping for renaming Nextclade output columns
-  # The path should be relative to the ingest directory
-  field_map: "defaults/nextclade_field_map.tsv"
+  # The first column should be the original column name of the Nextclade TSV
+  # The second column should be the new column name to use in the final metadata TSV
+  # Nextclade can have pathogen specific output columns so make sure to check which
+  # columns would be useful for your downstream phylogenetic analysis.
+  field_map:
+    seqName: "seqName"
+    clade: "clade"
+    coverage: "coverage"
+    totalMissing: "missing_data"
+    totalSubstitutions: "divergence"
+    totalNonACGTNs: "nonACGTN"
+    qc.missingData.status: "QC_missing_data"
+    qc.mixedSites.status: "QC_mixed_sites"
+    qc.privateMutations.status: "QC_rare_mutations"
+    qc.frameShifts.status: "QC_frame_shifts"
+    qc.stopCodons.status: "QC_stop_codons"
+    frameShifts: "frame_shifts"
   # This is the ID field you would use to match the Nextclade output with the record metadata.
   # This should be the new name that you have defined in your field map.
   id_field: "seqName"
17 changes: 0 additions & 17 deletions ingest/defaults/nextclade_field_map.tsv

This file was deleted.

19 changes: 8 additions & 11 deletions ingest/rules/nextclade.smk
@@ -65,24 +65,21 @@ rule join_metadata_and_nextclade:
     input:
         nextclade="results/nextclade.tsv",
         metadata="data/subset_metadata.tsv",
-        nextclade_field_map=config["nextclade"]["field_map"],
     output:
         metadata="results/metadata.tsv",
     params:
         metadata_id_field=config["curate"]["output_id_field"],
         nextclade_id_field=config["nextclade"]["id_field"],
+        nextclade_field_map=[f"{old}={new}" for old, new in config["nextclade"]["field_map"].items()],
+        nextclade_fields=",".join(config["nextclade"]["field_map"].values()),
     shell:
         r"""
-        export SUBSET_FIELDS=`grep -v '^#' {input.nextclade_field_map} | awk '{{print $1}}' | tr '\n' ',' | sed 's/,$//g'`
-        csvtk -tl cut -f $SUBSET_FIELDS \
-            {input.nextclade} \
-        | csvtk -tl rename2 \
-            -F \
-            -f '*' \
-            -p '(.+)' \
-            -r '{{kv}}' \
-            -k {input.nextclade_field_map} \
+        augur curate rename \
+            --metadata {input.nextclade:q} \
+            --id-column {params.nextclade_id_field:q} \
+            --field-map {params.nextclade_field_map:q} \
+            --output-metadata - \
+        | tsv-select --header --fields {params.nextclade_fields:q} \
         | tsv-join -H \
             --filter-file - \
             --key-fields {params.nextclade_id_field} \