ingest: Rename Nextclade metadata fields with augur curate rename

This construction reads much clearer and cleaner.

Moves the Nextclade field map directly and more conveniently into the YAML config instead of referencing a separate TSV file. Putting the field map into a separate file seemed to be only for the sake of the --kv-file (-k) interface provided by `csvtk rename2`, which we're no longer using here.

Backwards compatibility with configs that name a TSV file is not preserved, since this pathogen-repo-guide is expected to be used to stamp out new repos and we don't have any particular process/plan for how to update previously stamped out repos.

Note that `augur curate` commands currently emit CSV-like TSVs that are limited to be IANA-like¹, such that parsing them with tsv-utils is most appropriate; hence the switch from `csvtk cut` to `tsv-select`.

¹ See <nextstrain/augur#1566>.

Ported-from: <nextstrain/measles@faebd64>
Related-to: <nextstrain/measles#52>
Related-to: <#65>
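As a rough illustration of the change described in this message, the sketch below shows how a YAML `field_map` mapping can be turned into the `old=new` arguments and the comma-separated field list that the new Snakemake params build. The two list/join expressions mirror the params added in `ingest/rules/nextclade.smk`; everything else (the trimmed `field_map`, the `rename_and_select` helper, the sample rows) is hypothetical and only approximates the combined effect of renaming and then selecting columns. It is not how `augur curate rename` or `tsv-select` are implemented.

```python
# Hypothetical, trimmed-down copy of the `field_map` from nextclade_config.yaml.
field_map = {
    "seqName": "seqName",
    "clade": "clade",
    "totalMissing": "missing_data",
    "qc.missingData.status": "QC_missing_data",
}

# Same expressions as the new params in the rule:
# one "old=new" argument per entry, and the new names joined by commas.
rename_args = [f"{old}={new}" for old, new in field_map.items()]
select_fields = ",".join(field_map.values())


def rename_and_select(tsv_text, field_map):
    """Illustrative helper (not part of the repo): rename the mapped columns
    of a plain, unquoted TSV and keep only those columns, in map order."""
    lines = tsv_text.rstrip("\n").split("\n")
    header = lines[0].split("\t")
    index = {name: i for i, name in enumerate(header)}
    out = ["\t".join(field_map.values())]
    for line in lines[1:]:
        cells = line.split("\t")
        out.append("\t".join(cells[index[old]] for old in field_map))
    return "\n".join(out) + "\n"


# Made-up Nextclade-style rows, including one column that is not in the map.
example = (
    "seqName\tclade\ttotalMissing\tqc.missingData.status\textra\n"
    "A1\tB3\t12\tgood\tx\n"
)
result = rename_and_select(example, field_map)
print(rename_args)
print(select_fields)
print(result, end="")
```

Keeping the mapping in the config means both derived values come from a single source, so the renamed columns and the selected columns cannot drift apart the way a separately maintained TSV and field list could.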
nextstrain · Oct 3, 2024 · 443d0de
1 parent bd393f1
Showing 3 changed files with 25 additions and 31 deletions.
diff --git a/ingest/defaults/nextclade_config.yaml b/ingest/defaults/nextclade_config.yaml
@@ -4,9 +4,23 @@ nextclade:
   # The name of the Nextclade dataset to use for running nextclade.
   # Run `nextclade dataset list` to get a full list of available Nextclade datasets
   dataset_name: ""
-  # Path to the mapping for renaming Nextclade output columns
-  # The path should be relative to the ingest directory
-  field_map: "defaults/nextclade_field_map.tsv"
+  # The first column should be the original column name of the Nextclade TSV
+  # The second column should be the new column name to use in the final metadata TSV
+  # Nextclade can have pathogen specific output columns so make sure to check which
+  # columns would be useful for your downstream phylogenetic analysis.
+  field_map:
+    seqName: "seqName"
+    clade: "clade"
+    coverage: "coverage"
+    totalMissing: "missing_data"
+    totalSubstitutions: "divergence"
+    totalNonACGTNs: "nonACGTN"
+    qc.missingData.status: "QC_missing_data"
+    qc.mixedSites.status: "QC_mixed_sites"
+    qc.privateMutations.status: "QC_rare_mutations"
+    qc.frameShifts.status: "QC_frame_shifts"
+    qc.stopCodons.status: "QC_stop_codons"
+    frameShifts: "frame_shifts"
   # This is the ID field you would use to match the Nextclade output with the record metadata.
   # This should be the new name that you have defined in your field map.
   id_field: "seqName"
diff --git a/ingest/defaults/nextclade_field_map.tsv b/ingest/defaults/nextclade_field_map.tsv
diff --git a/ingest/rules/nextclade.smk b/ingest/rules/nextclade.smk
@@ -65,24 +65,21 @@ rule join_metadata_and_nextclade:
     input:
         nextclade="results/nextclade.tsv",
         metadata="data/subset_metadata.tsv",
-        nextclade_field_map=config["nextclade"]["field_map"],
     output:
         metadata="results/metadata.tsv",
     params:
         metadata_id_field=config["curate"]["output_id_field"],
         nextclade_id_field=config["nextclade"]["id_field"],
+        nextclade_field_map=[f"{old}={new}" for old, new in config["nextclade"]["field_map"].items()],
+        nextclade_fields=",".join(config["nextclade"]["field_map"].values()),
     shell:
         r"""
-        export SUBSET_FIELDS=`grep -v '^#' {input.nextclade_field_map} | awk '{{print $1}}' | tr '\n' ',' | sed 's/,$//g'`
-
-        csvtk -tl cut -f $SUBSET_FIELDS \
-            {input.nextclade} \
-        | csvtk -tl rename2 \
-            -F \
-            -f '*' \
-            -p '(.+)' \
-            -r '{{kv}}' \
-            -k {input.nextclade_field_map} \
+        augur curate rename \
+            --metadata {input.nextclade:q} \
+            --id-column {params.nextclade_id_field:q} \
+            --field-map {params.nextclade_field_map:q} \
+            --output-metadata - \
+        | tsv-select --header --fields {params.nextclade_fields:q} \
         | tsv-join -H \
             --filter-file - \
             --key-fields {params.nextclade_id_field} \