Comment on the logic behind GISAID data prep
huddlej committed Oct 16, 2023
1 parent faeae9d commit 7784681
Showing 1 changed file with 19 additions and 0 deletions.
19 changes: 19 additions & 0 deletions profiles/gisaid/prepare_data.smk
@@ -1,3 +1,13 @@
# Assumes that the metadata XLS file was downloaded from GISAID for the same
# samples that appear in the raw sequences FASTA below.
#
# 1. Convert metadata from XLS to CSV for better downstream parsing.
# 2. Select only the metadata fields that we need.
# 3. Rename GISAID fields to Nextstrain standard field names.
# 4. Split the "location" field into four separate geographic fields with standard Nextstrain field names.
# 5. Remove whitespace in strain names to make names consistent with the sequence records as processed below.
# 6. Sort records in descending order by strain name and accession such that the most recent accession for each strain appears first.
# 7. Select the first record for each unique strain name in the metadata, keeping the most recent accession.
rule prepare_metadata:
    input:
        metadata="data/{lineage}/metadata.xls",
@@ -18,6 +28,15 @@ rule prepare_metadata:
| csvtk uniq -T -f strain > {output.metadata}
"""

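The metadata steps 2 through 7 listed above can be sketched in plain Python. This is only an illustration of the logic, not the rule's actual implementation, which pipes csvtk commands together. The GISAID column names in FIELD_MAP, the " / " separator in the location field, and the function name prepare_metadata are assumptions made for this example. Accessions are compared as plain strings here.

```python
# Hypothetical GISAID -> Nextstrain column mapping (assumed for illustration,
# not copied from the rule).
FIELD_MAP = {
    "Isolate_Name": "strain",
    "Isolate_Id": "accession",
    "Collection_Date": "date",
    "Location": "location",
}
GEO_FIELDS = ["region", "country", "division", "location"]


def prepare_metadata(records):
    """Apply steps 2-7 to an iterable of dicts keyed by GISAID column names."""
    prepared = []
    for record in records:
        # Steps 2-3: keep only the needed fields and rename them.
        row = {new: record.get(old, "") for old, new in FIELD_MAP.items()}

        # Step 4: split "region / country / division / location" into four
        # fields, padding with empty strings when fewer levels are present.
        parts = [part.strip() for part in row["location"].split("/")]
        parts = (parts + [""] * len(GEO_FIELDS))[: len(GEO_FIELDS)]
        row.update(zip(GEO_FIELDS, parts))

        # Step 5: remove all whitespace from the strain name.
        row["strain"] = "".join(row["strain"].split())
        prepared.append(row)

    # Step 6: sort descending by strain, then accession (string comparison),
    # so the most recent accession for each strain comes first.
    prepared.sort(key=lambda r: (r["strain"], r["accession"]), reverse=True)

    # Step 7: keep only the first record seen for each strain.
    seen = set()
    unique = []
    for row in prepared:
        if row["strain"] not in seen:
            seen.add(row["strain"])
            unique.append(row)
    return unique
```

Because the sort runs before deduplication, "keep the first record" is equivalent to "keep the latest accession" for each strain.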
# Assumes that the "raw sequences" FASTA is downloaded from GISAID with only the
# "Isolate_name" field selected such that each record looks like:
# ">strain name|accession".
#
# 1. Remove spaces from strain names.
# 2. Add unique id to duplicate strain name and accession pairs.
# 3. Sort sequences in descending order by strain and accession (latest accession comes first).
# 4. Replace "|" character with space, changing record name to strain name only.
# 5. Keep the first sequence for a given strain name, keeping the sequence for the most recent accession.
rule prepare_sequences:
    input:
        sequences="data/{lineage}/raw_sequences_{segment}.fasta",
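
The five sequence steps listed in the comment above can be sketched the same way. This is a minimal illustration assuming records arrive as (header, sequence) pairs with "strain name|accession" headers; the function name prepare_sequences and the "_N" suffix used for duplicates are hypothetical, and the rule itself (not shown in this diff) may use different tools. Unlike step 4 as described, this sketch drops the accession instead of keeping it as the record description.

```python
def prepare_sequences(records):
    """Apply steps 1-5 to (header, sequence) pairs.

    Each header is expected to look like "strain name|accession".
    """
    # Step 1: remove spaces from the header, and so from the strain name.
    cleaned = [(header.replace(" ", ""), seq) for header, seq in records]

    # Step 2: add a numeric suffix to repeated strain/accession pairs so that
    # every header is unique (the suffix format is an assumption).
    counts = {}
    renamed = []
    for name, seq in cleaned:
        counts[name] = counts.get(name, 0) + 1
        if counts[name] > 1:
            name = f"{name}_{counts[name]}"
        renamed.append((name, seq))

    # Step 3: sort descending by the full "strain|accession" header so the
    # latest accession for each strain comes first (string comparison).
    renamed.sort(key=lambda record: record[0], reverse=True)

    # Steps 4-5: reduce each header to the strain name and keep only the
    # first sequence seen for each strain.
    seen = set()
    kept = []
    for name, seq in renamed:
        strain = name.split("|", 1)[0]
        if strain not in seen:
            seen.add(strain)
            kept.append((strain, seq))
    return kept
```

Stripping spaces in step 1 is what keeps these strain names consistent with the whitespace-free names produced for the metadata.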