Move accession to the first column of metadata_all.tsv #36

Merged
merged 2 commits into from
Feb 24, 2024

Contributor

@j23414 j23414 commented Feb 23, 2024

Description of proposed changes

During the merge of USVI data and GenBank data, the accession field ended up as the last column. This caused confusion because the first column was named genbank_accession, which could be mistaken for the strain ID.

This commit moves the accession column to the first position so that accession and genbank_accession sit next to each other. This should make it clearer that accession is used as the strain ID, while genbank_accession, if provided, lets Auspice generate a URL to the NCBI GenBank record.
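
For readers less familiar with the column reorder, here is a minimal Python sketch of the intended effect: one named column moves to the front and the remaining columns keep their relative order. This is not the code used in the workflow, which does the reorder in a shell pipeline. The function name `move_column_first` and the sample rows are hypothetical, and the sketch assumes a non-empty, tab-separated input in which every row has as many cells as the header.

```python
def move_column_first(tsv_text: str, column: str) -> str:
    """Return tsv_text with `column` moved to the first position.

    All other columns keep their original relative order.
    Assumes non-empty input and rows as wide as the header (illustrative only).
    """
    lines = tsv_text.splitlines()
    header = lines[0].split("\t")
    idx = header.index(column)  # raises ValueError if the column is missing

    def reorder(line: str) -> list[str]:
        cells = line.split("\t")
        # Put the chosen cell first, then everything before and after it.
        return [cells[idx]] + cells[:idx] + cells[idx + 1:]

    return "\n".join("\t".join(reorder(line)) for line in lines) + "\n"


# Hypothetical sample: one GenBank row and one USVI row.
sample = (
    "genbank_accession\turl\taccession\n"
    "OK000001\t\tOK000001\n"
    "\thttps://example.org/usvi1\tUSVI1\n"
)
reordered = move_column_first(sample, "accession")
```

After the call, the header of `reordered` begins with accession, followed by genbank_accession and url in their original order.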

Related issue(s)

Checklist

  • Checks pass

During the merge of Usvi data and GenBank data, the accession field ended
up as the last column. This caused confusion as the first column was named
"genbank_accession" which could be mistaken for the strain ID.

This commit moves the "accession" column to the first column such that
"accession" and "genbank_accession" are next to each other; hopefully,
providing clarity that "accession" is being used as the strain ID, while
"genbank_accession" can be used to generate a url to the NCBI GenBank record
if provided.
@j23414 j23414 requested a review from a team February 23, 2024 05:52
Comment on lines 45 to +46
| csvtk concat -tl - {input.usvi_metadata} \
| tsv-select -H -f accession --rest last \
Member

(non-blocking)

Suggestion: In cases like this where a column name is ambiguous, add more detail somewhere in the repo. Maybe as the docstring of this rule:

rule append_usvi:
    """Appending USVI sequences.

    Notable columns:
    - accession: Either the GenBank accession or USVI accession.
    - genbank_accession: For Auspice to generate a URL to the NCBI GenBank record. Empty for USVI sequences.
    - url: ?
    """
    input:
        …

I don't know if this has been done in other repos, but it seems like it'd be useful to bring this context out of commit messages/PRs and into the code itself.

Contributor Author

@j23414 j23414 Feb 23, 2024

Thanks, I agree with writing the context into the code itself (as a docstring). Fixed in 3631e90, but let me know if the url explanation is confusing.

Member

(Discussion is beyond the scope of this PR now, so feel free to merge first.)

The Auspice source code indicates that when both are specified, genbank_accession takes precedence and url will be ignored.

Because of this behavior, I don't think url should be set for GenBank sequences. That would only cause confusion in the event that GenBank changes the URL, we try to update it in this file, and scratch our heads over why the old URL is still showing on Auspice. I would expect url to only be set for USVI sequences where there is no genbank_accession.
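
One way to guard against this kind of confusion would be a small check that flags any row setting both fields. The sketch below is only an illustration under stated assumptions: the helper name `rows_with_ignored_url` is hypothetical, rows are assumed to be dictionaries keyed by the column names discussed here (accession, genbank_accession, url), and it does not reproduce Auspice's own logic.

```python
def rows_with_ignored_url(rows: list[dict[str, str]]) -> list[str]:
    """Return the accession of every row that sets both genbank_accession and url.

    Per the discussion above, url would be ignored for such rows,
    so they are worth flagging. Empty or whitespace-only values count as unset.
    """
    flagged = []
    for row in rows:
        has_genbank = bool(row.get("genbank_accession", "").strip())
        has_url = bool(row.get("url", "").strip())
        if has_genbank and has_url:
            flagged.append(row.get("accession", ""))
    return flagged


# Hypothetical rows: the second one sets both fields and would be flagged.
example_rows = [
    {"accession": "OK000001", "genbank_accession": "OK000001", "url": ""},
    {"accession": "OK000002", "genbank_accession": "OK000002", "url": "https://example.org/x"},
    {"accession": "USVI1", "genbank_accession": "", "url": "https://example.org/usvi1"},
]
flagged = rows_with_ignored_url(example_rows)
```

A check like this could run after the metadata merge, so that a url set on a GenBank row is caught before anyone wonders why it never shows up.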

In cases like this where a column name is ambiguous ('accession' and 'genbank_accession'),
bring this context out of commit messages/PRs and into the code itself.
@j23414 j23414 merged commit 86012b2 into main Feb 24, 2024
32 checks passed
@j23414 j23414 deleted the mv_accession_first branch February 24, 2024 16:33
2 participants