Update ingest to pathogen-repo-template #234

Merged
merged 7 commits into from
Feb 6, 2024
ingest/curate.smk: Make the field map config more user friendly
Match changes in pathogen-repo-template
nextstrain/pathogen-repo-guide@5e1b1ef
joverlee521 committed Jan 30, 2024
commit edc904bf8fdab7d164ba06181dce5bd25963f399
6 changes: 5 additions & 1 deletion ingest/defaults/config.yaml
@@ -10,7 +10,11 @@ curate:
   # Fields to rename.
   # This is the first step in the pipeline, so any references to field names
   # in the configs below should use the new field names
-  field_map: ['collected=date', 'submitted=date_submitted', 'genbank_accession=accession', 'submitting_organization=institution']
+  field_map:
+    collected: date
+    submitted: date_submitted
+    genbank_accession: accession
+    submitting_organization: institution
   # Standardized strain name regex
   # Currently accepts any characters because we do not have a clear standard for strain names
   strain_regex: '^.+$'
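For anyone carrying the old list-style `field_map` in a local config, the change above amounts to splitting each `old=new` string into a key and a value. A minimal sketch of that conversion, assuming every entry contains exactly one meaningful `=` separator; the helper name `legacy_field_map_to_dict` is hypothetical and not part of this PR:

```python
def legacy_field_map_to_dict(pairs: list[str]) -> dict[str, str]:
    """Convert ['old=new', ...] into {'old': 'new', ...} (hypothetical helper)."""
    result: dict[str, str] = {}
    for pair in pairs:
        # Split on the first "=" only; anything after it belongs to the new name.
        old, sep, new = pair.partition("=")
        if not sep or not old or not new:
            raise ValueError(f"Expected 'old=new', got {pair!r}")
        result[old] = new
    return result


# The old list-style value from ingest/defaults/config.yaml.
legacy = [
    "collected=date",
    "submitted=date_submitted",
    "genbank_accession=accession",
    "submitting_organization=institution",
]

migrated = legacy_field_map_to_dict(legacy)

# Print the entries in the new `key: value` layout.
for old, new in migrated.items():
    print(f"{old}: {new}")
```

Dict insertion order is preserved, so the entries come out in the same order as the original list.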
9 changes: 8 additions & 1 deletion ingest/rules/curate.smk
@@ -36,6 +36,13 @@ rule concat_geolocation_rules:
         """


+def format_field_map(field_map: dict[str, str]) -> str:
+    """
+    Format dict to `"key1"="value1" "key2"="value2"...` for use in shell commands.
+    """
+    return " ".join([f'"{key}"="{value}"' for key, value in field_map.items()])
+
+
 rule curate:
     input:
         sequences_ndjson="data/sequences.ndjson",
@@ -47,7 +54,7 @@ rule curate:
     log:
         "logs/curate.txt",
     params:
-        field_map=config["curate"]["field_map"],
+        field_map=format_field_map(config["curate"]["field_map"]),
         strain_regex=config["curate"]["strain_regex"],
         strain_backup_fields=config["curate"]["strain_backup_fields"],
         date_fields=config["curate"]["date_fields"],
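As a rough check that the new mapping yields the same shell arguments as the old list, the helper can be run outside Snakemake. This is a sketch only: `format_field_map` is copied from the diff, while the use of `shlex.split` to approximate how a POSIX shell would tokenize the string is an assumption made for illustration, not something the workflow itself does.

```python
import shlex


def format_field_map(field_map: dict[str, str]) -> str:
    """
    Format dict to `"key1"="value1" "key2"="value2"...` for use in shell commands.
    """
    return " ".join([f'"{key}"="{value}"' for key, value in field_map.items()])


# The new mapping-style value from ingest/defaults/config.yaml.
field_map = {
    "collected": "date",
    "submitted": "date_submitted",
    "genbank_accession": "accession",
    "submitting_organization": "institution",
}

formatted = format_field_map(field_map)
print(formatted)

# Approximate shell word splitting: adjacent quoted parts join into one word,
# so each `"key"="value"` pair becomes a single `key=value` argument.
tokens = shlex.split(formatted)
print(tokens)
```

The double quotes keep a field name containing spaces together as one argument. They do not escape embedded double quotes or `$`, so field names with those characters would need extra handling, for example with `shlex.quote`.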