-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Copy ingest #13
Merged
Copy ingest #13
Changes from 1 commit
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
11b67b1
Ingest: Copy ingest from mpox repo
j23414 71d2f56
git subrepo clone (merge) https://github.com/nextstrain/ingest ingest…
j23414 b64198c
Ingest: Replace mpox text and taxon id with dengue
j23414 50e6083
Remove Nextclade related rules
j23414 e1007c7
Remove unused reverse column
j23414 109f2eb
Clear pathogen specific user provided annotations and rules
j23414 94b0113
eradicate our confounding of release and submission
j23414 2684e24
NCBI Dataset field name transformations
j23414 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev
Previous commit
NCBI Dataset field name transformations
Originally the field map was created to keep mpox NDJSON backward compatible with field names used from NCBI Virus. However, this constraint is not applicable to dengue.¹ This commit organizes field renaming into two parts. 1. Rename the NCBI output columns to match the NCBI mnemonics² (see "ncbi_field_map:" in `config/config.yaml`) 2. Where necessary, rename the NCBI mnemonics to match Nextstrain expected column names³ (see "transform: fieldmap:" in `config/config.yaml`) ¹ #13 (comment) ² https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_virus-genome/#fields ³ https://docs.nextstrain.org/projects/ncov/en/latest/reference/metadata-fields.html
- Loading branch information
commit 2684e248edd33a735163a25bb3f695ba0791f968
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
j23414 marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
This file was deleted.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field map was added to keep the monkeypox NDJSON backwards compatible with field names that we used from NCBI Virus. I don't think we need to maintain it here and just directly use the NCBI Dataset field names.
My reasoning for using the default NCBI Dataset field names is to centralize all field/data transformations to the curation pipeline that start from the NDJSON.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I somewhat appreciate the renaming because I'm not a fan of the spaces and capitalization in the NCBI Dataset field names (e.g. "Isolate Lineage" is being renamed to "strain" and "Geographic Region" to "region").
Do you prefer "Isolate Lineage" or would find "isolate_lineage" also acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'd rather keep the fields exactly as NCBI has them. This way we can point users to the NCBI Dataset docs for a description of each field. I think it would help us avoid things like the submitted/released field mix up in the NDJSON.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the idea of pointing users to the NCBI Dataset field documentation! However, it is worth noting that the field
Geographic Region
is documented asgeographicRegion
andRegion
. I guess that would be NCBI's responsibility to keep their documentation up-to-date.Spaces in column names can require special handling (e.g. properly quoting column names) which I think our code has been made to handle. However, such spaces may post challenges in other languages, scripts, or projects. Replacing spaces with underscores in the NCBI Dataset fields, would still enable us to point to NCBI's documentation.
Nevertheless, I'm open to using dengue as a test pipeline to assess how our code handles columns with spaces. If needed, we can later reintroduce modifications to column titles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, I forgot that the NCBI reference docs have a nonintuitive set up. They have a top level "Geographic" column name that gets prepended to the VirusAssembly.CollectionLocation fields. The
dataformat tsv virus-genome
docs for the available fields is a better place to see the full field names.So I guess there's no that much benefit from keeping the space matching for the column names...Welp, maybe we should transform the names back to the mnemonics so that we only have to maintain one set of column mappings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been looking around for any documentation for standard Nextstrain metadata fields and found a page in the ncov workflow docs. I think we can use these standard metadata fields across all pathogens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revised the metadata fields in 116f363 The fields do not exactly match the docs since we are not pulling in gisaid data, but I tried to adhere to the stylistic choices (e.g. underscores preferred, lowercase). I am open to discussing any of the field names on a call if that would be helpful. Just let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like auspice requires
accession
instead ofgenbank_accession
to render the link. Unless there's a workaround that I'm not seeing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gisaid_epi_isl
andgenbank_accession
are special cases in AuspiceThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just documenting that this PR is on hold until we resolve following questions: