fixup: Document USVI data provenance
j23414 committed Jan 30, 2024
1 parent ed148b0 commit f9eff33
Showing 3 changed files with 4,630 additions and 3,413 deletions.
94 changes: 93 additions & 1 deletion phylogenetic/data/README.md
@@ -6,4 +6,96 @@ The primary source of data for this build is GenBank. However, there are instanc

This Zika build incorporates data from https://github.com/blab/zika-usvi/. The USVI sequences and metadata from that repository were curated and uploaded to https://github.com/nextstrain/fauna. They were then downloaded from fauna as sequences and metadata and filtered to keep only the records not yet submitted to NCBI GenBank. The resulting records are available as a pair of metadata and sequences files in this directory.

The USVI data are merged into the GenBank dataset by the `append_usvi` rule.

Steps to create the `metadata_usvi.tsv` and `sequences_usvi.fasta` files were as follows:

1. Sequences were uploaded to the fauna database following [these instructions](https://github.com/nextstrain/fauna/blob/f9e7955cb4381d5e881c337e005778ed43b7c56c/builds/ZIKA.md#fred-hutch-sequences).
2. Sequences were downloaded from the fauna database following [these instructions](https://github.com/nextstrain/fauna/blob/5d5a1f3faf06805a5f31e91df2c76b06e6f3bf6a/builds/ZIKA.md#download-from-fauna-parse-compress-and-push-to-s3) and saved as `zika.fasta`.
3. Sequences were ingested from GenBank following [these instructions](../README.md) and saved as `sequences.fasta`.
4. [NCBI Blastn](https://www.ncbi.nlm.nih.gov/books/NBK279690/) was used to identify fauna sequences that were not 100% identical to GenBank sequences, using the following commands:


```bash
GENBANK_SEQUENCES=sequences.fasta
FAUNA_SEQUENCES=zika.fasta

# Create a local blast database
makeblastdb \
  -in "${GENBANK_SEQUENCES}" \
  -dbtype nucl

# Blast fauna against GenBank
blastn \
  -db "${GENBANK_SEQUENCES}" \
  -query "${FAUNA_SEQUENCES}" \
  -num_alignments 1 \
  -outfmt 6 \
  -out blast_output.txt

# USVI strains that
# + match at 100% identity (column 3)
# + match over an alignment longer than 5000 nt (column 4; filters out short substring matches)
awk -F'\t' '$1 ~ /USVI/ && $3 >= 100 && $4 > 5000 {print $1}' blast_output.txt \
  > USVI_100_match.txt

less USVI_100_match.txt
# USVI/5/2016|zika|MW165881|2016-10-17|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI/43/2016|zika|MW165884|2016-07-19|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI/4/2016|zika|MW165880|2016-10-14|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI/35/2016|zika|MW165883|2016-09-08|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago
# USVI/25/2016|zika|MW165882|2016-09-27|north_america|usvi|saint_thomas|saint_thomas|genbank|genome|Santiago

# USVI strains that are not in the 100 match list
awk -F'\t' '$1 ~ /USVI/' blast_output.txt \
| grep -Fvf USVI_100_match.txt \
| awk -F'\t' '{print $1}' \
| sort \
| uniq \
> USVI_not_match.txt

head USVI_not_match.txt
# USVI/1/2016|zika|VI1_1d|2016-09-28|north_america|usvi|saint_croix|saint_croix|fh|genome|Black
# USVI/11/2016|zika|VI11|2016-03-22|north_america|usvi|saint_thomas|saint_thomas|fh|genome|Black
# USVI/12/2016|zika|VI12|2016-11-04|north_america|usvi|saint_croix|saint_croix|fh|genome|Black
# USVI/13/2016|zika|VI13|2016-08-13|north_america|usvi|saint_thomas|saint_thomas|fh|genome|Black
# USVI/19/2016|zika|VI19_12plex|2016-11-21|north_america|usvi|saint_croix|saint_croix|fh|genome|Black
# ...
```
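
To make the column logic in the `awk` filters concrete: with `-outfmt 6`, BLAST writes tab-separated rows whose first four columns are the query ID, subject ID, percent identity, and alignment length. The sketch below runs the same two-stage selection on a few made-up rows. The IDs and values (`USVI/A`, `X1`, and so on) and the temporary directory are invented for illustration and are not real BLAST results.

```bash
# Toy stand-in for blast_output.txt (all values invented for illustration).
# Columns: qseqid, sseqid, pident, length (the remaining outfmt 6 columns are omitted).
workdir=$(mktemp -d)
printf 'USVI/A|zika|X1\tX1\t100.000\t10700\n'   >  "$workdir/blast_output.txt"
printf 'USVI/B|zika|VI_B\tX2\t99.950\t10650\n'  >> "$workdir/blast_output.txt"
printf 'USVI/C|zika|VI_C\tX3\t100.000\t300\n'   >> "$workdir/blast_output.txt"
printf 'PRVABC59|zika|X4\tX4\t100.000\t10700\n' >> "$workdir/blast_output.txt"

# Stage 1: USVI queries with a 100% identical hit longer than 5000 nt
awk -F'\t' '$1 ~ /USVI/ && $3 >= 100 && $4 > 5000 {print $1}' \
  "$workdir/blast_output.txt" > "$workdir/USVI_100_match.txt"

# Stage 2: all remaining USVI queries, deduplicated
awk -F'\t' '$1 ~ /USVI/' "$workdir/blast_output.txt" \
  | grep -Fvf "$workdir/USVI_100_match.txt" \
  | awk -F'\t' '{print $1}' \
  | sort -u > "$workdir/USVI_not_match.txt"

cat "$workdir/USVI_not_match.txt"
```

In this toy input, `USVI/A` lands in the 100% match list, while `USVI/B` (below 100% identity) and `USVI/C` (100% identity but only over 300 nt) end up in the not-matched list. The non-USVI query is ignored by both stages.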

5. The corresponding `metadata_usvi.tsv` and `sequences_usvi.fasta` files were extracted using a combination of [smof](https://github.com/incertae-sedis/smof) and [augur parse](https://docs.nextstrain.org/projects/augur/en/stable/usage/cli/parse.html):

```bash
# Pulls out sequences based on a match against header strings
smof grep -f USVI_not_match.txt zika.fasta > usvi.fasta

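# Splits file into raw_metadata_usvi.tsv and sequences_usvi.fasta
augur parse \
  --sequences usvi.fasta \
  --output-sequences sequences_usvi.fasta \
  --output-metadata raw_metadata_usvi.tsv \
  --fields strain virus accession date region country division location institution segment authors url title journal paper_url \
  --prettify-fields region country division location

# Second pass: labels the third header field (the accession) as `strain`,
# which appears intended to key the records in sequences_usvi.fasta by
# accession so they line up with lengths_usvi.tsv below. This overwrites
# sequences_usvi.fasta; no.tsv is a throwaway metadata output.
augur parse \
  --sequences usvi.fasta \
  --output-sequences sequences_usvi.fasta \
  --output-metadata no.tsv \
  --fields a b strain c d e f g h i j k l m n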

# Add sequence lengths to metadata
echo "accession|length" | tr '|' '\t' > lengths_usvi.tsv
smof stat --length --byseq sequences_usvi.fasta >> lengths_usvi.tsv

tsv-join -H \
  --filter-file lengths_usvi.tsv \
  --key-fields accession \
  --append-fields length \
  raw_metadata_usvi.tsv \
  | tsv-select -H \
    --fields accession,strain,date,region,country,division,location,length,authors,institution,url \
  > metadata_usvi.tsv
```
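
The final join may be hard to picture without tsv-utils at hand. As a rough sketch, the snippet below reproduces the same idea with plain `awk` on two invented records: look up each record's length by accession and append it as a new column. It is a stand-in for illustration, not the command used to build the files in this directory. The file names, accessions, and values are made up.

```bash
workdir=$(mktemp -d)

# Toy metadata keyed by accession in column 2 (invented values)
printf 'strain\taccession\tdate\n'       >  "$workdir/raw_metadata.tsv"
printf 'USVI/A/2016\tVI_A\t2016-10-17\n' >> "$workdir/raw_metadata.tsv"
printf 'USVI/B/2016\tVI_B\t2016-08-11\n' >> "$workdir/raw_metadata.tsv"

# Toy length table in the same shape as lengths_usvi.tsv
printf 'accession\tlength\n' >  "$workdir/lengths.tsv"
printf 'VI_B\t10792\n'       >> "$workdir/lengths.tsv"
printf 'VI_A\t10807\n'       >> "$workdir/lengths.tsv"

# First file (NR == FNR): load accession -> length. The header row is loaded
# too, so the metadata header picks up the "length" column name.
# Second file: keep rows whose accession has a length and append that length.
awk -F'\t' -v OFS='\t' '
  NR == FNR { len[$1] = $2; next }
  $2 in len { print $0, len[$2] }
' "$workdir/lengths.tsv" "$workdir/raw_metadata.tsv" > "$workdir/metadata.tsv"

cat "$workdir/metadata.tsv"
```

Because the lookup is by key rather than by position, the order of rows in the length table does not matter, and a metadata row without a matching accession would be dropped from the output.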

52 changes: 26 additions & 26 deletions phylogenetic/data/metadata_usvi.tsv
@@ -1,26 +1,26 @@
genbank_accession genbank_accession_rev accession strain date region country division location length host release_date update_date sra_accessions authors institution url
USVI/37/2016 VI37 USVI/37/2016 2016-10-06 North America Usvi Saint Croix Saint Croix 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/2/2016 VI2 USVI/2/2016 2016-09-28 North America Usvi Saint Croix Saint Croix 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/42/2016 VI42 USVI/42/2016 2016-10-26 North America Usvi Saint Croix Saint Croix 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/41/2016 VI41 USVI/41/2016 2016-11-10 North America Usvi Saint Croix Saint Croix 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/6/2016 VI6 USVI/6/2016 2016-10-19 North America Usvi Saint John Saint John 10636 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/12/2016 VI12 USVI/12/2016 2016-11-04 North America Usvi Saint Croix Saint Croix 10636 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/36/2016 VI36 USVI/36/2016 2016-09-13 North America Usvi Saint Croix Saint Croix 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/30/2016 VI30_1d USVI/30/2016 2016-08-07 North America Usvi Saint Croix Saint Croix 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/13/2016 VI13 USVI/13/2016 2016-08-13 North America Usvi Saint Thomas Saint Thomas 10636 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/32/2016 VI32_12plex USVI/32/2016 2016-08-11 North America Usvi Saint Croix Saint Croix 10792 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/7/2016 VI7 USVI/7/2016 2016-10-27 North America Usvi Saint Thomas Saint Thomas 10636 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/28/2016 VI28_1d USVI/28/2016 2016-11-28 North America Usvi Saint Croix Saint Croix 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/34/2016 VI34 USVI/34/2016 2016-08-01 North America Usvi Saint Thomas Saint Thomas 10792 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/45/2016 VI45 USVI/45/2016 2016-08-03 North America Usvi Saint Thomas Saint Thomas 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/3/2016 VI3 USVI/3/2016 2016-09-26 North America Usvi Saint Thomas Saint Thomas 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/1/2016 VI1_1d USVI/1/2016 2016-09-28 North America Usvi Saint Croix Saint Croix 10792 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/46/2016 VI46 USVI/46/2016 2016-07-15 North America Usvi Saint Thomas Saint Thomas 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/11/2016 VI11 USVI/11/2016 2016-03-22 North America Usvi Saint Thomas Saint Thomas 10636 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/38/2016 VI38 USVI/38/2016 2016-10-25 North America Usvi Saint Thomas Saint Thomas 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/27/2016 VI27_1d USVI/27/2016 2016-08-19 North America Usvi Saint Thomas Saint Thomas 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/20/2016 VI20_12plex USVI/20/2016 2016-10-13 North America Usvi Saint Croix Saint Croix 10792 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/39/2016 VI39_12plex USVI/39/2016 2016-11-09 North America Usvi Saint Croix Saint Croix 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/44/2016 VI44 USVI/44/2016 2016-10-17 North America Usvi Saint Thomas Saint Thomas 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/19/2016 VI19_12plex USVI/19/2016 2016-11-21 North America Usvi Saint Croix Saint Croix 10792 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
USVI/23/2016 VI23_12plex USVI/23/2016 2016-07-12 North America Usvi Saint Thomas Saint Thomas 10807 Homo sapiens Black et al FH https://github.com/blab/zika-usvi/
accession	strain	date	region	country	division	location	length	authors	institution	url
VI44	USVI/44/2016	2016-10-17	North America	Usvi	Saint Thomas	Saint Thomas	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI20_12plex	USVI/20/2016	2016-10-13	North America	Usvi	Saint Croix	Saint Croix	10792	Black et al	fh	https://github.com/blab/zika-usvi/
VI41	USVI/41/2016	2016-11-10	North America	Usvi	Saint Croix	Saint Croix	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI32_12plex	USVI/32/2016	2016-08-11	North America	Usvi	Saint Croix	Saint Croix	10792	Black et al	fh	https://github.com/blab/zika-usvi/
VI12	USVI/12/2016	2016-11-04	North America	Usvi	Saint Croix	Saint Croix	10636	Black et al	fh	https://github.com/blab/zika-usvi/
VI46	USVI/46/2016	2016-07-15	North America	Usvi	Saint Thomas	Saint Thomas	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI23_12plex	USVI/23/2016	2016-07-12	North America	Usvi	Saint Thomas	Saint Thomas	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI28_1d	USVI/28/2016	2016-11-28	North America	Usvi	Saint Croix	Saint Croix	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI45	USVI/45/2016	2016-08-03	North America	Usvi	Saint Thomas	Saint Thomas	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI37	USVI/37/2016	2016-10-06	North America	Usvi	Saint Croix	Saint Croix	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI39_12plex	USVI/39/2016	2016-11-09	North America	Usvi	Saint Croix	Saint Croix	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI36	USVI/36/2016	2016-09-13	North America	Usvi	Saint Croix	Saint Croix	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI34	USVI/34/2016	2016-08-01	North America	Usvi	Saint Thomas	Saint Thomas	10792	Black et al	fh	https://github.com/blab/zika-usvi/
VI11	USVI/11/2016	2016-03-22	North America	Usvi	Saint Thomas	Saint Thomas	10636	Black et al	fh	https://github.com/blab/zika-usvi/
VI6	USVI/6/2016	2016-10-19	North America	Usvi	Saint John	Saint John	10636	Black et al	fh	https://github.com/blab/zika-usvi/
VI3	USVI/3/2016	2016-09-26	North America	Usvi	Saint Thomas	Saint Thomas	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI27_1d	USVI/27/2016	2016-08-19	North America	Usvi	Saint Thomas	Saint Thomas	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI13	USVI/13/2016	2016-08-13	North America	Usvi	Saint Thomas	Saint Thomas	10636	Black et al	fh	https://github.com/blab/zika-usvi/
VI7	USVI/7/2016	2016-10-27	North America	Usvi	Saint Thomas	Saint Thomas	10636	Black et al	fh	https://github.com/blab/zika-usvi/
VI38	USVI/38/2016	2016-10-25	North America	Usvi	Saint Thomas	Saint Thomas	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI19_12plex	USVI/19/2016	2016-11-21	North America	Usvi	Saint Croix	Saint Croix	10792	Black et al	fh	https://github.com/blab/zika-usvi/
VI1_1d	USVI/1/2016	2016-09-28	North America	Usvi	Saint Croix	Saint Croix	10792	Black et al	fh	https://github.com/blab/zika-usvi/
VI42	USVI/42/2016	2016-10-26	North America	Usvi	Saint Croix	Saint Croix	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI30_1d	USVI/30/2016	2016-08-07	North America	Usvi	Saint Croix	Saint Croix	10807	Black et al	fh	https://github.com/blab/zika-usvi/
VI2	USVI/2/2016	2016-09-28	North America	Usvi	Saint Croix	Saint Croix	10807	Black et al	fh	https://github.com/blab/zika-usvi/