ingest/nextclade: Add join of metadata and Nextclade outputs
The shell script for joining the metadata and Nextclade outputs is taken
from @j23414's work in nextstrain/mpox#207

Co-authored-by: Jennifer Chang <[email protected]>
joverlee521 and j23414 committed Oct 10, 2023
1 parent 5ef742d commit 3f9888e
Showing 3 changed files with 53 additions and 0 deletions.
3 changes: 3 additions & 0 deletions ingest/config/defaults.yaml
@@ -82,3 +82,6 @@ nextclade:
  # The name of the Nextclade dataset to use for running nextclade.
  # Run `nextclade dataset list` to get a full list of available Nextclade datasets
  dataset_name: ""
  # Path to the mapping for renaming Nextclade output columns
  # The path should be relative to the ingest directory
  column_map: "config/nextclade_column_map.tsv"
18 changes: 18 additions & 0 deletions ingest/config/nextclade_column_map.tsv
@@ -0,0 +1,18 @@
# TSV file that is a mapping of column names for Nextclade output TSV
# The first column should be the original column name of the Nextclade TSV
# The second column should be the new column name to use in the final metadata TSV
# Nextclade can have pathogen specific output columns so make sure to check which
# columns would be useful for your downstream phylogenetic analysis.
seqName seqName
clade clade
lineage lineage
coverage coverage
totalMissing missing_data
totalSubstitutions divergence
totalNonACGTNs nonACGTN
qc.missingData.status QC_missing_data
qc.mixedSites.status QC_mixed_sites
qc.privateMutations.status QC_rare_mutations
qc.frameShifts.status QC_frame_shifts
qc.stopCodons.status QC_stop_codons
frameShifts frame_shifts
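
To illustrate how the mapping file above is meant to be read, here is a minimal Python sketch. It is not part of this commit; the function name `read_column_map` is hypothetical, and the sketch assumes the two columns are tab-separated and that lines starting with `#` are comments.

```python
import io


def read_column_map(handle):
    """Return {original_name: new_name} from a two-column mapping file."""
    column_map = {}
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and the leading `#` comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"Expected 2 tab-separated columns, got: {line!r}")
        original, new = fields
        column_map[original] = new
    return column_map


# Illustrative input only: a short excerpt of the mapping shown above.
example = io.StringIO(
    "# comment line\n"
    "seqName\tseqName\n"
    "totalMissing\tmissing_data\n"
)
print(read_column_map(example))
```

The keys are the columns that would be selected from the Nextclade TSV, and the values are the names they would carry in the final metadata TSV.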
32 changes: 32 additions & 0 deletions ingest/rules/nextclade.smk
@@ -48,3 +48,35 @@ rule run_nextclade:
        zip -rj {output.translations} results/translations
        """


rule join_metadata_and_nextclade:
    input:
        nextclade="results/nextclade.tsv",
        metadata="results/subset_metadata.tsv",
        nextclade_field_map=config["nextclade"]["column_map"],
    output:
        metadata="results/metadata.tsv",
    params:
        metadata_id_field=config["curate"]["output_id_field"],
    shell:
        """
        export SUBSET_FIELDS=`awk '!/^#/ {{print $1}}' {input.nextclade_field_map} | tr '\n' ',' | sed 's/,$//g'`
        csvtk -tl cut -f $SUBSET_FIELDS \
            {input.nextclade} \
        | csvtk -tl rename2 \
            -F \
            -f '*' \
            -p '(.+)' \
            -r '{{kv}}' \
            -k {input.nextclade_field_map} \
        | tsv-join -H \
            --filter-file - \
            --key-fields seqName \
            --data-fields {params.metadata_id_field} \
            --append-fields '*' \
            --write-all '?' \
            {input.metadata} \
        | tsv-select -H --exclude seqName \
        > {output.metadata}
        """
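
The join performed by the rule above can be sketched in plain Python to make its semantics easier to follow. This is an illustration under stated assumptions, not the commit's implementation: the function signature, the `accession` ID field, and the sample records are all made up, and a renamed column that collides with an existing metadata column is simply overwritten here.

```python
def join_metadata_and_nextclade(metadata_rows, nextclade_rows, column_map,
                                metadata_id_field, nextclade_id_field="seqName",
                                fill="?"):
    """Append renamed Nextclade columns to every metadata row (a left join)."""
    # Columns to append, excluding the join key, which is dropped from the output.
    appended = [(old, new) for old, new in column_map.items()
                if old != nextclade_id_field]
    # Index the Nextclade rows by sequence name for lookup.
    by_id = {row[nextclade_id_field]: row for row in nextclade_rows}

    joined = []
    for record in metadata_rows:
        match = by_id.get(record[metadata_id_field])
        out = dict(record)
        for old, new in appended:
            # Records without a Nextclade result keep their row, filled with `fill`.
            out[new] = match[old] if match is not None else fill
        joined.append(out)
    return joined


# Hypothetical sample data for illustration only.
metadata = [
    {"accession": "A1", "country": "USA"},
    {"accession": "B2", "country": "Peru"},
]
nextclade = [{"seqName": "A1", "clade": "IIb", "totalMissing": "12"}]
column_map = {"seqName": "seqName", "clade": "clade", "totalMissing": "missing_data"}

for row in join_metadata_and_nextclade(metadata, nextclade, column_map, "accession"):
    print(row)
```

As in the shell pipeline, every metadata record is kept, and a record with no matching Nextclade row gets `?` in each appended column.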
