Update dev to v1.2.1 #71

Merged 26 commits on Sep 11, 2024

Commits
61a756b
Update README.md (#61)
dthoward96 Aug 6, 2024
60da176
Update README.md (#63)
dthoward96 Aug 6, 2024
d2a65d9
in the GISAID covCLI workflow: addressed issues w/metadata formatting…
erikwolfsohn Aug 15, 2024
fe7f44f
Prevent failures when:
erikwolfsohn Aug 16, 2024
4cfc3a1
Added check for empty columns before attempting to drop them.
erikwolfsohn Aug 16, 2024
542e3d0
warn when submitting sra and biosample together while Link_Sample_Bet…
erikwolfsohn Aug 17, 2024
1a57c29
removed code preventing prod submissions, made sure bs-description on…
erikwolfsohn Aug 21, 2024
74e0cc4
add handling for GISAID submission when samples have previously been …
erikwolfsohn Aug 27, 2024
d5efaa0
make fasta optional - not required for biosample/sra submission.
erikwolfsohn Aug 30, 2024
de4a25f
for column w/ mix of data & blank values, don't write empty values to…
erikwolfsohn Aug 30, 2024
24a0deb
use titles for individual sra runs if they exist, otherwise apply gen…
erikwolfsohn Sep 3, 2024
4bf0c27
change check for empty title columns
erikwolfsohn Sep 3, 2024
aed3c3e
Merge branch 'seqsender_gisaid_dev' of github.com:erikwolfsohn/seqsen…
erikwolfsohn Sep 3, 2024
a4fff17
remove unused json reference
erikwolfsohn Sep 3, 2024
fee1a29
Update issue templates (#66)
dthoward96 Sep 4, 2024
971bfba
Merge pull request #64 from erikwolfsohn/seqsender_gisaid_dev
dthoward96 Sep 6, 2024
73ef85b
v1.2.1 release
dthoward96 Sep 11, 2024
66e211f
Merge pull request #68 from CDCgov/master
dthoward96 Sep 11, 2024
c13937a
v1.2.1 Release
dthoward96 Sep 11, 2024
fde148c
Merge branch 'v1.2.1.-Bug-Fixes-Update' of https://github.com/CDCgov/…
dthoward96 Sep 11, 2024
fcc0410
Merge pull request #69 from CDCgov/v1.2.1.-Bug-Fixes-Update
dthoward96 Sep 11, 2024
0baa9d0
Delete gisaid_cli/fluCLI/fluCLI
dthoward96 Sep 11, 2024
b28ab5e
Delete submission_log.csv
dthoward96 Sep 11, 2024
3c82e2b
Update seqsender.py
dthoward96 Sep 11, 2024
ee8fcba
Update settings.py
dthoward96 Sep 11, 2024
2e0bd2f
Merge pull request #70 from CDCgov/dthoward96-remove-CLI
dthoward96 Sep 11, 2024
3 changes: 3 additions & 0 deletions .dockerignore
@@ -0,0 +1,3 @@
/shiny/*
/vignettes/*
/docs/*
39 changes: 11 additions & 28 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -1,44 +1,27 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''
about: Create a report to help us improve SeqSender
title: "[BUG]"
labels: bug
assignees: dthoward96

---

**Describe the bug**
A clear and concise description of what feature is not working.

**Impact**
Please describe the impact this bug is causing to your program or organization.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.
- Which databases are you attempting submission to?
- Is the error related to a specific database/metadata field? If so, which?
- Steps to reproduce the behavior:

**Logs**
If applicable, please attach logs to help describe your problem.

**Desktop (please complete the following information):**
- OS: [e.g. iOS]
- Browser [e.g. chrome, safari]
- Version [e.g. 22]

**Smartphone (please complete the following information):**
- Device: [e.g. iPhone6]
- OS: [e.g. iOS8.1]
- Browser [e.g. stock browser, safari]
- Version [e.g. 22]
**Version**
- Version Number: [e.g. v1.2.0.]
- SeqSender Version: [e.g. Singularity, Docker, Script]
- OS [e.g. Linux, Mac]

**Additional context**
Add any other context about the problem here.
14 changes: 7 additions & 7 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -1,20 +1,20 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: ''
about: Suggest a new feature for SeqSender
title: "[FEATURE REQUEST]"
labels: enhancement
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
**Is this feature for general SeqSender usage or for submitting to a specific database? If so, which database?**
SeqSender or a specified database name.

**Is your feature request related to a problem? If so, please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/general.md
@@ -0,0 +1,12 @@
---
name: General
about: SeqSender Issue
title: ''
labels: ''
assignees: ''

---

**Is this related to a specific database? If so, which database?**

**Describe your issue below:**
17 changes: 0 additions & 17 deletions .github/ISSUE_TEMPLATE/maintenance.md

This file was deleted.

5 changes: 5 additions & 0 deletions .gitignore
@@ -10,6 +10,7 @@ submit.ready
*report.xml
test_input/test_metadata.tsv
upload_log.csv
submission_log.csv
*.vscode
*.Rproj
*.Rhistory
@@ -19,3 +20,7 @@ docker-compose-*.yaml

# ignore folders
**/.Rproj.user
**/test_data/*
**/gisaid_cli/*
**/COV_TEST_DATA/*
**/FLU_TEST_DATA/*
2 changes: 1 addition & 1 deletion README.Rmd
@@ -26,7 +26,7 @@ github_pages_url <- description$GITHUB_PAGES

<p style="font-size: 16px;"><em>Public Database Submission Pipeline</em></p>

**Beta Version**: v1.2.0. This pipeline is currently in Beta testing, and issues could appear during submission. Please use it at your own risk. Feedback and suggestions are welcome!
**Beta Version**: v1.2.1. This pipeline is currently in Beta testing, and issues could appear during submission. Please use it at your own risk. Feedback and suggestions are welcome!

**General Disclaimer**: This repository was created for use by CDC programs to collaborate on public health related projects in support of the [CDC mission](https://www.cdc.gov/about/organization/mission.htm). GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.

82 changes: 2 additions & 80 deletions README.md
@@ -11,7 +11,7 @@

</p>

**Beta Version**: 1.2.0. This pipeline is currently in Beta testing, and
**Beta Version**: 1.2.1. This pipeline is currently in Beta testing, and
issues could appear during submission. Please use it at your own risk.
Feedback and suggestions are welcome\!

@@ -23,7 +23,7 @@ CDC and its partners to share information and collaborate on software.
CDC use of GitHub does not imply an endorsement of any one particular
service, product, or enterprise.

# [Documentation](https://dthoward96.github.io/seqsender_test_website/)
# [Documentation](https://cdcgov.github.io/seqsender/)

## Overview

@@ -45,84 +45,6 @@ issue.
| Maintainer | [Dakota Howard](https://github.com/dthoward96) |
| Back-Up | [Reina Chau](https://github.com/rchau88), [Brian Lee](https://github.com/leebrian) |

## Prerequisites

- **NCBI Submissions**

`seqsender` utilizes a UI-less Data Submission Protocol to bulk upload
submission files (e.g., *submission.xml*, *submission.zip*, etc.) to
NCBI archives. The submission files are uploaded to the NCBI server via
FTP on the command line. Before attempting a submission with
`seqsender`, the submitter will need to

1. Have an NCBI account. To sign up, visit [NCBI
website](https://account.ncbi.nlm.nih.gov/).

2. Create a center account for your institution/lab (required for CDC
users, highly recommended for others); see [NCBI Center Account
Instructions](https://submit.ncbi.nlm.nih.gov/sarscov2/sra/#step6).
Center accounts allow you to perform UI-less submissions
as your institution/lab.

3. Create a submission group in the
[NCBI Submission Portal](https://submit.ncbi.nlm.nih.gov) (required
for CDC users, recommended for others).
A group should include all individuals who need access to UI-less
submissions through the web interface with your center account. Each
member of the group must also have an individual NCBI account; see the
[NCBI website](https://account.ncbi.nlm.nih.gov/).

4. For the requirements for FTP-only GenBank submissions, refer to
[NCBI GenBank FTP
Submissions](https://submit.ncbi.nlm.nih.gov/sarscov2/genbank/#step5).
That page applies only to COVID and Influenza.
For further questions, contact
<a href="mailto:[email protected]">[email protected]</a>
to discuss requirements for submissions.

5. Coordinate an NCBI namespace name (**spuid\_namespace**) that will be
used with Submitter Provided Unique Identifiers (**spuid**) in the
submission. The combination of **spuid\_namespace** and **spuid** is
used to report back assigned accessions as well as for cross-linking
objects within a submission. The values of **spuid\_namespace** are up
to the submitter to decide, but they must be unique and
well coordinated before making a submission.

<!-- end list -->

- **GISAID Submissions**

`seqsender` makes use of GISAID’s Command Line Interface tools to bulk
upload meta- and sequence-data to GISAID databases. Presently, the
pipeline supports upload to EpiFlu (**Influenza A Virus**), EpiCoV
(**SARS-COV-2**), EpiPox (**Monkeypox**), and EpiArbo (**Arbovirus**).
Before uploading, the submitter needs to

1. Have a GISAID account. To sign up, visit [GISAID
Platform](https://gisaid.org/).

2. Request a client-ID for your specified Epi(Flu/CoV/Pox/Arbo)
database in order to use its CLI tool. The CLI uses the
client-ID along with the username and password to authenticate with the
database before making a submission. To obtain a client-ID, please
email
<a href="mailto:[email protected]" >[email protected]</a> to
request one. ***Important note**: If the submitter would like to upload a
“test” submission first to become familiar with the
submission process before making a real submission, they should
additionally request a test client-ID for such submissions.*

3. Download the
<a href="https://cdcgov.github.io/seqsender/articles/images/fluCLI_download.png" target="_blank">EpiFlu</a>
or
<a href="https://cdcgov.github.io/seqsender/articles/images/covCLI_download.png" target="_blank">EpiCoV</a>
CLI from the **GISAID platform** and store it in the destination
of choice before performing a batch upload.

Here is a quick look at where to store the downloaded **GISAID CLI**
package.

![](man/figures/gisaid_cli_dir.png)

## Code Attributions

Dakota Howard and Reina Chau for majority of the code base with input
2 changes: 1 addition & 1 deletion argument_handler.py
@@ -72,7 +72,7 @@ def args_parser():
required=True)
file_parser.add_argument("--fasta_file",
help="Fasta file used to generate submission files; fasta header should match the column 'sequence_name' stored in your metadata. Input either full file path or if just file name it must be stored at '<submission_dir>/<submission_name>/<fasta_file>'.",
required=True)
default = None)
file_parser.add_argument("--table2asn",
help="Perform a table2asn submission instead of GenBank FTP submission for organism choices 'FLU' or 'COV'.",
required=False,
Expand Down
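Commit d5efaa0 makes `--fasta_file` optional because BioSample and SRA submissions do not need sequence data. The sketch below shows one way an optional argument with `default=None` could be paired with a later check for the databases that do need a FASTA file. It is only an illustration: the `--databases` flag, the `FASTA_REQUIRED` set, and the function names are assumptions made for this example and are not taken from `argument_handler.py`.

```python
import argparse
from typing import List, Optional

# Hypothetical: databases assumed to need sequence data. The PR only
# states that BioSample and SRA submissions do not require a FASTA file.
FASTA_REQUIRED = {"GENBANK", "GISAID"}


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser where the FASTA file is optional."""
    parser = argparse.ArgumentParser(prog="seqsender-sketch")
    # Hypothetical flag, used here only to carry the target databases.
    parser.add_argument("--databases", nargs="+", required=True)
    # Optional argument: absent on the command line means None.
    parser.add_argument("--fasta_file", default=None)
    return parser


def fasta_is_missing(databases: List[str], fasta_file: Optional[str]) -> bool:
    """Return True when a selected database needs a FASTA file but none was given."""
    needs_fasta = any(db.upper() in FASTA_REQUIRED for db in databases)
    return needs_fasta and fasta_file is None


if __name__ == "__main__":
    args = build_parser().parse_args(["--databases", "BIOSAMPLE", "SRA"])
    print(fasta_is_missing(args.databases, args.fasta_file))
```

Deferring the check until the target databases are known keeps the parser permissive while still catching a missing file before any submission files are generated.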
70 changes: 47 additions & 23 deletions biosample_sra_handler.py
@@ -58,7 +58,6 @@ def create_manual_submission_files(database: str, submission_dir: str, metadata:
column_ordered = ["sample_name","library_ID"]
prefix = "sra-"
# Create SRA specific fields
metadata["sra-title"] = config_dict["Description"]["Title"]
filename_cols = [col for col in metadata.columns.tolist() if re.match(r"sra-file_[1-9]\d*", col)]
# Correct index for filename column
for col in filename_cols:
@@ -69,8 +68,8 @@ def create_manual_submission_files(database: str, submission_dir: str, metadata:
rename_columns[col] = col.replace("sra-file_", "sra-filename")
elif "BIOSAMPLE" in database:
metadata_regex = "^bs-|^organism$|^collection_date$"
rename_columns = {"bs-description":"sample_title","bioproject":"bioproject_accession"}
drop_columns = ["bs-package"]
rename_columns = {"bioproject":"bioproject_accession"}
drop_columns = ["bs-title", "bs-comment", "bs-sample_title", "bs-sample_description"]
column_ordered = ["sample_name"]
prefix = "bs-"
else:
@@ -92,14 +91,31 @@ def create_manual_submission_files(database: str, submission_dir: str, metadata:
file_handler.save_csv(df=database_df, file_path=submission_dir, file_name="metadata.tsv", sep="\t")

# Create submission XML
def create_submission_xml(organism: str, database: str, submission_name: str, config_dict: Dict[str, Any], metadata: pd.DataFrame, failed_seqs_auto_removed: bool = True) -> bytes:
def create_submission_xml(organism: str, database: str, submission_name: str, config_dict: Dict[str, Any], metadata: pd.DataFrame) -> bytes:
# Submission XML header
root = etree.Element("Submission")
description = etree.SubElement(root, "Description")
title = etree.SubElement(description, "Title")
title.text = config_dict["Description"]["Title"]
comment = etree.SubElement(description, "Comment")
comment.text = config_dict["Description"]["Comment"]
if "BIOSAMPLE" in database:
if "bs-title" in metadata and pd.notnull(metadata["bs-title"].iloc[0]) and metadata["bs-title"].iloc[0].strip() != "":
title.text = metadata["bs-title"].iloc[0]
else:
title.text = submission_name + "-BS"
comment = etree.SubElement(description, "Comment")
if "bs-comment" in metadata and pd.notnull(metadata["bs-comment"].iloc[0]) and metadata["bs-comment"].iloc[0].strip() != "":
comment.text = metadata["bs-comment"].iloc[0]
else:
comment.text = "BioSample Submission"
elif "SRA" in database:
if "sra-title" in metadata and pd.notnull(metadata["sra-title"].iloc[0]) and metadata["sra-title"].iloc[0].strip() != "":
title.text = metadata["sra-title"].iloc[0]
else:
title.text = submission_name + "-SRA"
comment = etree.SubElement(description, "Comment")
if "sra-comment" in metadata and pd.notnull(metadata["sra-comment"].iloc[0]) and metadata["sra-comment"].iloc[0].strip() != "":
comment.text = metadata["sra-comment"].iloc[0]
else:
comment.text = "SRA Submission"
# Description info including organization and contact info
organization = etree.SubElement(description, "Organization", type=config_dict["Description"]["Organization"]["Type"], role=config_dict["Description"]["Organization"]["Role"])
org_name = etree.SubElement(organization, "Name")
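The title and comment fallback in this hunk repeats the same three-part emptiness test four times. Below is a minimal sketch of how that test could be factored out, assuming plain Python values rather than the pandas objects the PR works with. `has_text` and `title_or_default` are hypothetical names introduced for this example and are not part of SeqSender.

```python
import math
from typing import Any


def has_text(value: Any) -> bool:
    """Return True when value holds a non-blank string.

    None and float NaN (how a missing cell often appears) count as empty.
    """
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return str(value).strip() != ""


def title_or_default(value: Any, submission_name: str, suffix: str) -> str:
    """Use the user-supplied title if present, else '<submission_name><suffix>'."""
    if has_text(value):
        return str(value).strip()
    return f"{submission_name}{suffix}"


if __name__ == "__main__":
    print(title_or_default(None, "run1", "-BS"))
    print(title_or_default("Custom title", "run1", "-SRA"))
```

Note that the comparison must be against the empty string: a stripped string never equals the integer `0`, so a test written that way would accept blank titles.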
@@ -125,13 +141,18 @@ def create_submission_xml(organism: str, database: str, submission_name: str, co
sampleid = etree.SubElement(biosample, "SampleId")
spuid = etree.SubElement(sampleid, "SPUID", spuid_namespace=config_dict["Spuid_Namespace"])
spuid.text = row["bs-sample_name"]
descriptor = etree.SubElement(biosample, "Descriptor")
title = etree.SubElement(descriptor, "Title")
title.text = row["bs-description"]
if ("bs-sample_title" in metadata and pd.notnull(row["bs-sample_title"]) and row["bs-sample_title"].strip() != "") or ("bs-sample_description" in metadata and pd.notnull(row["bs-sample_description"]) and row["bs-sample_description"].strip() != ""):
descriptor = etree.SubElement(biosample, "Descriptor")
if "bs-sample_title" in metadata and pd.notnull(row["bs-sample_title"]) and row["bs-sample_title"].strip() != "":
sample_title = etree.SubElement(descriptor, "Title")
sample_title.text = row["bs-sample_title"]
if "bs-sample_description" in metadata and pd.notnull(row["bs-sample_description"]) and row["bs-sample_description"].strip() != "":
sample_description = etree.SubElement(descriptor, "Description")
sample_description.text = row["bs-sample_description"]
organismxml = etree.SubElement(biosample, "Organism")
organismname = etree.SubElement(organismxml, "OrganismName")
organismname.text = row["organism"]
if pd.notnull(row["bioproject"]) and row["bioproject"].strip() != "":
if "bioproject" in metadata and pd.notnull(row["bioproject"]) and row["bioproject"].strip() != "":
bioproject = etree.SubElement(biosample, "BioProject")
primaryid = etree.SubElement(bioproject, "PrimaryId", db="BioProject")
primaryid.text = row["bioproject"]
@@ -140,10 +161,12 @@ def create_submission_xml(organism: str, database: str, submission_name: str, co
# Attributes
attributes = etree.SubElement(biosample, "Attributes")
# Remove columns with bs-prefix that are not attributes
biosample_cols = [col for col in database_df.columns.tolist() if (col.startswith('bs-')) and (col not in ["bs-sample_name", "bs-package", "bs-description"])]
biosample_cols = [col for col in database_df.columns.tolist() if (col.startswith('bs-')) and (col not in ["bs-sample_name", "bs-package", "bs-title", "bs-comment", "bs-sample_title", "bs-sample_description"])]
for col in biosample_cols:
attribute = etree.SubElement(attributes, "Attribute", attribute_name=col.replace("bs-",""))
attribute.text = row[col]
attribute_value = row[col]
if pd.notnull(attribute_value) and attribute_value.strip() != "":
attribute = etree.SubElement(attributes, "Attribute", attribute_name=col.replace("bs-",""))
attribute.text = row[col]
# Add collection date to Attributes
attribute = etree.SubElement(attributes, "Attribute", attribute_name="collection_date")
attribute.text = row["collection_date"]
@@ -174,20 +197,21 @@ def create_submission_xml(organism: str, database: str, submission_name: str, co
datatype = etree.SubElement(file, "DataType")
datatype.text = "generic-data"
# Remove columns with sra- prefix that are not attributes
sra_cols = [col for col in database_df.columns.tolist() if col.startswith('sra-') and not re.match("(sra-sample_name|sra-file_location|sra-file_\d*)", col)]
sra_cols = [col for col in database_df.columns.tolist() if col.startswith('sra-') and not re.match(r"(sra-sample_name|sra-title|sra-comment|sra-file_location|sra-file_\d*)", col)]
for col in sra_cols:
attribute = etree.SubElement(addfiles, "Attribute", name=col.replace("sra-",""))
attribute.text = row[col]
attribute_value = row[col]
if pd.notnull(attribute_value) and attribute_value.strip() != "":
attribute = etree.SubElement(addfiles, "Attribute", name=col.replace("sra-",""))
attribute.text = row[col]
if pd.notnull(row["bioproject"]) and row["bioproject"].strip() != "":
attribute_ref_id = etree.SubElement(addfiles, "AttributeRefId", name="BioProject")
refid = etree.SubElement(attribute_ref_id, "RefId")
primaryid = etree.SubElement(refid, "PrimaryId")
primaryid.text = row["bioproject"]
if config_dict["Link_Sample_Between_NCBI_Databases"] and metadata.columns.str.contains("bs-sample_name").any():
attribute_ref_id = etree.SubElement(addfiles, "AttributeRefId", name="BioSample")
refid = etree.SubElement(attribute_ref_id, "RefId")
spuid = etree.SubElement(refid, "SPUID", spuid_namespace=config_dict["Spuid_Namespace"])
spuid.text = metadata.loc[metadata["sra-sample_name"] == row["sra-sample_name"], "bs-sample_name"].iloc[0]
attribute_ref_id = etree.SubElement(addfiles, "AttributeRefId", name="BioSample")
refid = etree.SubElement(attribute_ref_id, "RefId")
spuid = etree.SubElement(refid, "SPUID", spuid_namespace=config_dict["Spuid_Namespace"])
spuid.text = metadata.loc[metadata["sra-sample_name"] == row["sra-sample_name"], "bs-sample_name"].iloc[0]
identifier = etree.SubElement(addfiles, "Identifier")
spuid = etree.SubElement(identifier, "SPUID", spuid_namespace=config_dict["Spuid_Namespace"])
spuid.text = row["sra-sample_name"]
@@ -209,7 +233,7 @@ def create_biosample_sra_submission(organism: str, database: str, submission_nam
create_raw_reads_list(submission_dir=submission_dir, raw_files_list=raw_files_list)
manual_df = metadata.copy()
create_manual_submission_files(database=database, submission_dir=submission_dir, metadata=manual_df, config_dict=config_dict)
xml_str = create_submission_xml(organism=organism, database=database, submission_name=submission_name, metadata=metadata, config_dict=config_dict, failed_seqs_auto_removed=True)
xml_str = create_submission_xml(organism=organism, database=database, submission_name=submission_name, metadata=metadata, config_dict=config_dict)
file_handler.save_xml(xml_str, submission_dir)

# Read xml report and get status of the submission
Expand Down
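Two behaviours introduced in `create_submission_xml` are easy to exercise in isolation: the `Descriptor` element is only created when a sample title or description is present, and attributes with blank values are no longer written. The sketch below reproduces both on a single row. It is a simplification under stated assumptions: it uses the standard library's `xml.etree.ElementTree` instead of the `etree` module the PR uses, takes a plain `dict` instead of a DataFrame row, and `build_biosample` and `_clean` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

# Columns with the bs- prefix that are not BioSample attributes
# (same list as in the updated hunk).
NON_ATTRIBUTE_COLS = {
    "bs-sample_name", "bs-package", "bs-title", "bs-comment",
    "bs-sample_title", "bs-sample_description",
}


def _clean(value: Any) -> str:
    """Return the stripped string, or '' for non-string values such as None or NaN."""
    return value.strip() if isinstance(value, str) else ""


def build_biosample(row: Dict[str, Any]) -> ET.Element:
    """Build a BioSample element, skipping empty descriptor parts and attributes."""
    biosample = ET.Element("BioSample")
    title = _clean(row.get("bs-sample_title"))
    description = _clean(row.get("bs-sample_description"))
    # Only emit <Descriptor> when it would have at least one child.
    if title or description:
        descriptor = ET.SubElement(biosample, "Descriptor")
        if title:
            ET.SubElement(descriptor, "Title").text = title
        if description:
            ET.SubElement(descriptor, "Description").text = description
    attributes = ET.SubElement(biosample, "Attributes")
    for col, value in row.items():
        text = _clean(value)
        # Skip non-attribute columns and blank values.
        if col.startswith("bs-") and col not in NON_ATTRIBUTE_COLS and text:
            name = col[len("bs-"):]
            ET.SubElement(attributes, "Attribute", attribute_name=name).text = text
    return biosample


if __name__ == "__main__":
    row = {"bs-sample_name": "s1", "bs-sample_title": "Sample 1", "bs-host": "Homo sapiens", "bs-isolate": ""}
    print(ET.tostring(build_biosample(row), encoding="unicode"))
```

One detail worth keeping in any real implementation: `strip` has to be called. Comparing the bound method itself (`value.strip != ""`) is always true, so blank cells would slip through.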