Copy ingest #13

Merged
merged 8 commits into from
Dec 5, 2023
NCBI Dataset field name transformations
Originally the field map was created to keep mpox NDJSON backward compatible
with field names used from NCBI Virus. However, this constraint is not
applicable to dengue.¹

This commit organizes field renaming into two parts.

1. Rename the NCBI output columns to match the NCBI mnemonics²
   (see "ncbi_field_map:" in `config/config.yaml`)
2. Where necessary, rename the NCBI mnemonics to match Nextstrain expected column names³
   (see "transform: fieldmap:" in `config/config.yaml`)

¹ #13 (comment)
² https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields
³ https://docs.nextstrain.org/projects/ncov/en/latest/reference/metadata-fields.html
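
The two renaming steps above can be sketched in plain Python. This is only an illustration, not the workflow's code: the helper names `parse_field_map` and `rename_fields` are hypothetical, the header-to-mnemonic pairs are a small assumed subset, and the sample record is made up. The `old=new` strings follow the style of `field_map` in `config/config.yaml`.

```python
def parse_field_map(entries):
    """Turn 'old=new' strings into a dict of old name -> new name."""
    return dict(entry.split("=", 1) for entry in entries)


def rename_fields(record, mapping):
    """Return a copy of record with keys renamed; unmapped keys are kept."""
    return {mapping.get(key, key): value for key, value in record.items()}


# Step 1 (assumed subset): NCBI output column headers -> NCBI mnemonics
header_to_mnemonic = {
    "Accession": "accession",
    "Isolate Lineage": "isolate-lineage",
    "Geographic Region": "geo-region",
}

# Step 2: NCBI mnemonics -> Nextstrain names, written in the 'old=new' style
mnemonic_to_nextstrain = parse_field_map([
    "accession=genbank_accession",
    "isolate-lineage=strain",
    "geo-region=region",
])

# Made-up record, for illustration only
raw = {
    "Accession": "XX000000",
    "Isolate Lineage": "example-strain",
    "Geographic Region": "Asia",
}

step1 = rename_fields(raw, header_to_mnemonic)
step2 = rename_fields(step1, mnemonic_to_nextstrain)
print(step2)
```

Keeping the two mappings separate means only the second one needs to change if the Nextstrain column names are revised later.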
j23414 committed Dec 5, 2023
commit 2684e248edd33a735163a25bb3f695ba0791f968
56 changes: 44 additions & 12 deletions ingest/config/config.yaml
@@ -2,22 +2,52 @@
sources: ['genbank']
# Pathogen NCBI Taxonomy ID
ncbi_taxon_id: '12637'
# Renames the NCBI dataset headers
ncbi_field_map: 'source-data/ncbi-dataset-field-map.tsv'
# The list of NCBI Datasets fields to include from NCBI Datasets output
# These need to be the mnemonics of the NCBI Datasets fields, see docs for full list of fields
# https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields
# Note: the "accession" field MUST be provided to match with the sequences
ncbi_datasets_fields:
- accession
- sourcedb
- isolate-lineage
- geo-region
- geo-location
- isolate-collection-date
- release-date
- update-date
- length
- host-name
- isolate-lineage-source
- submitter-names
- submitter-affiliation

# Params for the transform rule
transform:
# Fields to rename.
# NCBI Fields to rename to Nextstrain field names.
# This is the first step in the pipeline, so any references to field names
# in the configs below should use the new field names
field_map: ['collected=date', 'released=date_released', 'genbank_accession=accession', 'submitting_organization=institution']
field_map: [
'accession=genbank_accession',
'accession-rev=genbank_accession_rev',
'isolate-lineage=strain',
'sourcedb=database', # necessary for applying geo location rules
'geo-region=region',
'geo-location=location',
'host-name=host',
'isolate-collection-date=date',
'release-date=release_date',
'update-date=update_date',
'sra-accs=sra_accessions',
'submitter-names=authors',
'submitter-affiliation=institution',
]
# Standardized strain name regex
# Currently accepts any characters because we do not have a clear standard for strain names
strain_regex: '^.+$'
# Back up strain name field if 'strain' doesn't match regex above
strain_backup_fields: ['accession']
strain_backup_fields: ['genbank_accession']
# List of date fields to standardize
date_fields: ['date', 'date_released']
date_fields: ['date', 'release_date', 'update_date']
# Expected date formats present in date fields
# These date formats should use directives expected by datetime
# See https://docs.python.org/3.9/library/datetime.html#strftime-and-strptime-format-codes
@@ -47,24 +77,26 @@ transform:
# User annotations file
annotations: 'source-data/annotations.tsv'
# ID field used to merge annotations
annotations_id: 'accession'
annotations_id: 'genbank_accession'
# Field to use as the sequence ID in the FASTA file
id_field: 'accession'
id_field: 'genbank_accession'
# Field to use as the sequence in the FASTA file
sequence_field: 'sequence'
# Final output columns for the metadata TSV
metadata_columns: [
'accession',
'genbank_accession_rev',
'strain',
'genbank_accession',
'genbank_accession_rev',
'date',
'region',
'country',
'division',
'location',
'length',
'host',
'date_released',
'sra_accession',
'release_date',
'update_date',
'sra_accessions',
'abbr_authors',
'authors',
'institution'
17 changes: 0 additions & 17 deletions ingest/source-data/ncbi-dataset-field-map.tsv
Contributor

This field map was added to keep the monkeypox NDJSON backwards compatible with field names that we used from NCBI Virus. I don't think we need to maintain it here; we can just use the NCBI Dataset field names directly.

My reasoning for using the default NCBI Dataset field names is to centralize all field/data transformations in the curation pipeline that starts from the NDJSON.

Contributor Author

I somewhat appreciate the renaming because I'm not a fan of the spaces and capitalization in the NCBI Dataset field names (e.g. "Isolate Lineage" is being renamed to "strain" and "Geographic Region" to "region").

Do you prefer "Isolate Lineage" or would find "isolate_lineage" also acceptable?

Contributor

Hmm, I'd rather keep the fields exactly as NCBI has them. This way we can point users to the NCBI Dataset docs for a description of each field. I think it would help us avoid things like the submitted/released field mix-up in the NDJSON.

Contributor Author

I really like the idea of pointing users to the NCBI Dataset field documentation! However, it is worth noting that the field Geographic Region is documented as geographicRegion and Region. I guess that would be NCBI's responsibility to keep their documentation up-to-date.

Spaces in column names can require special handling (e.g. properly quoting column names), which I think our code has been made to handle. However, such spaces may pose challenges in other languages, scripts, or projects. Replacing spaces with underscores in the NCBI Dataset fields would still enable us to point to NCBI's documentation.

Nevertheless, I'm open to using dengue as a test pipeline to assess how our code handles columns with spaces. If needed, we can later reintroduce modifications to column titles.
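
For illustration, the underscore suggestion above could look roughly like this in Python. The helper name `normalize_column_name` is hypothetical and nothing like it exists in the workflow; the headers are the ones mentioned in this thread.

```python
def normalize_column_name(name):
    """Lowercase a column name and replace runs of whitespace with underscores."""
    return "_".join(name.strip().lower().split())


headers = ["Isolate Lineage", "Geographic Region", "Accession"]
normalized = [normalize_column_name(header) for header in headers]
print(normalized)  # ['isolate_lineage', 'geographic_region', 'accession']
```

Names produced this way stay close enough to the NCBI headers that a reader can still find them in NCBI's documentation.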

Contributor

> However, it is worth noting that the field Geographic Region is documented as geographicRegion and Region.

Oh right, I forgot that the NCBI reference docs have a nonintuitive setup. They have a top-level "Geographic" column name that gets prepended to the VirusAssembly.CollectionLocation fields. The dataformat tsv virus-genome docs for the available fields are a better place to see the full field names.

So I guess there's not that much benefit from keeping the spaces to match the column names... Welp, maybe we should transform the names back to the mnemonics so that we only have to maintain one set of column mappings.

Contributor

I might be a bit uncertain on desired names for the final metadata, so any feedback is welcome.

I've been looking around for any documentation for standard Nextstrain metadata fields and found a page in the ncov workflow docs. I think we can use these standard metadata fields across all pathogens.

Contributor Author

I revised the metadata fields in 116f363. The fields do not exactly match the docs since we are not pulling in GISAID data, but I tried to adhere to the stylistic choices (e.g. underscores preferred, lowercase). I am open to discussing any of the field names on a call if that would be helpful. Just let me know.

Contributor Author

Looks like auspice requires accession instead of genbank_accession to render the link. Unless there's a workaround that I'm not seeing.

j23414 marked this conversation as resolved.

This file was deleted.

51 changes: 11 additions & 40 deletions ingest/workflow/snakemake_rules/fetch_sequences.smk
@@ -44,57 +44,26 @@ rule extract_ncbi_dataset_sequences:
"""


def _get_ncbi_dataset_field_mnemonics(wildcards) -> str:
"""
Return list of NCBI Dataset report field mnemonics for fields that we want
to parse out of the dataset report. The column names in the output TSV
are different from the mnemonics.

See NCBI Dataset docs for full list of available fields and their column
names in the output:
https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields
"""
fields = [
"accession",
"sourcedb",
"isolate-lineage",
"geo-region",
"geo-location",
"isolate-collection-date",
"release-date",
"update-date",
"length",
"host-name",
"isolate-lineage-source",
"bioprojects",
"biosample-acc",
"sra-accs",
"submitter-names",
"submitter-affiliation",
]
return ",".join(fields)


rule format_ncbi_dataset_report:
# Formats the headers to be the same as before we used NCBI Datasets
# The only fields we do not have equivalents for are "title" and "publications"
# Formats the headers to match the NCBI mnemonic names
input:
dataset_package="data/ncbi_dataset.zip",
ncbi_field_map=config["ncbi_field_map"],
output:
ncbi_dataset_tsv=temp("data/ncbi_dataset_report.tsv"),
params:
fields_to_include=_get_ncbi_dataset_field_mnemonics,
ncbi_datasets_fields=",".join(config["ncbi_datasets_fields"]),
benchmark:
"benchmarks/format_ncbi_dataset_report.txt"
shell:
"""
dataformat tsv virus-genome \
--package {input.dataset_package} \
--fields {params.fields_to_include:q} \
| csvtk -tl rename2 -F -f '*' -p '(.+)' -r '{{kv}}' -k {input.ncbi_field_map} \
| csvtk -tl mutate -f genbank_accession_rev -n genbank_accession -p "^(.+?)\." \
| tsv-select -H -f genbank_accession --rest last \
--fields {params.ncbi_datasets_fields:q} \
--elide-header \
| csvtk add-header -t -l -n {params.ncbi_datasets_fields:q} \
| csvtk rename -t -f accession -n accession-rev \
| csvtk -tl mutate -f accession-rev -n accession -p "^(.+?)\." \
| tsv-select -H -f accession --rest last \
> {output.ncbi_dataset_tsv}
"""
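
The last three steps of the shell pipeline above (rename `accession` to `accession-rev`, derive an unversioned `accession`, move it to the first column) can be sketched in Python as follows. This is a reading of the intent for a single row, not the rule's implementation: the function name `split_versioned_accession` and the sample row are hypothetical, and falling back to the full value when there is no version suffix is a choice made for this sketch.

```python
import re

# Same pattern as in the csvtk mutate step: capture everything before the first "."
VERSION_PATTERN = re.compile(r"^(.+?)\.")


def split_versioned_accession(record):
    """Keep the versioned value as 'accession-rev' and put an unversioned 'accession' first."""
    versioned = record["accession"]
    match = VERSION_PATTERN.match(versioned)
    # Sketch assumption: keep the full value if no version suffix is present
    unversioned = match.group(1) if match else versioned
    rest = {key: value for key, value in record.items() if key != "accession"}
    return {"accession": unversioned, "accession-rev": versioned, **rest}


# Made-up row, for illustration only
row = {"accession": "AB123456.1", "sourcedb": "GenBank", "length": "10735"}
result = split_versioned_accession(row)
print(result)
```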

@@ -105,6 +74,8 @@ rule format_ncbi_datasets_ndjson:
ncbi_dataset_tsv="data/ncbi_dataset_report.tsv",
output:
ndjson="data/genbank.ndjson",
params:
ncbi_datasets_fields=",".join(config["ncbi_datasets_fields"]),
log:
"logs/format_ncbi_datasets_ndjson.txt",
benchmark:
@@ -114,7 +85,7 @@ rule format_ncbi_datasets_ndjson:
augur curate passthru \
--metadata {input.ncbi_dataset_tsv} \
--fasta {input.ncbi_dataset_sequences} \
--seq-id-column genbank_accession_rev \
--seq-id-column accession-rev \
--seq-field sequence \
--unmatched-reporting warn \
--duplicate-reporting warn \