To use covSampler to analyze your own data, you’ll need to prepare two files:
- A FASTA file with viral genomic sequences.
- A corresponding TSV file with metadata describing each sequence.
Prepare your nucleotide sequences in a FASTA format file named `sequences.fasta`.
You can see a formatted example sequence file here.
Prepare your metadata in a TSV format file named `metadata.tsv`.
A metadata file must include the following fields:
Fields | Description | Format |
---|---|---|
strain | Sequence name | The strain values in the metadata file must match the sequence names in the FASTA file |
date | Collection date | YYYY-MM-DD (ambiguous dates are not accepted) |
region_exposure | Continent | Africa / Asia / Europe / North America / Oceania / South America |
country_exposure | Country | Country |
division_exposure | Administrative division | Division |
pango_lineage* | Viral lineage under the Pango nomenclature | See the latest Pango lineage list |
* The covSampler workflow does not currently include Pango lineage assignment. You can assign Pango lineages using pangolin or nextclade.
You can see a formatted example metadata file here.
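
Before running the workflow, it can help to check the two files against the requirements in the table above. The following is a minimal sketch, not part of covSampler: the function names `fasta_names` and `check_metadata` are hypothetical, and it assumes that the sequence name is the full FASTA header line after `>`.

```python
import csv
import re
from datetime import datetime

# Required columns, taken from the metadata table above.
REQUIRED_FIELDS = [
    "strain", "date", "region_exposure", "country_exposure",
    "division_exposure", "pango_lineage",
]
CONTINENTS = {
    "Africa", "Asia", "Europe", "North America", "Oceania", "South America",
}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def fasta_names(lines):
    """Collect sequence names from FASTA lines.

    Assumption: the name is the whole header line after '>'.
    """
    return {line[1:].strip() for line in lines if line.startswith(">")}


def is_complete_date(value):
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not value or not DATE_PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def check_metadata(lines, names):
    """Return a list of problems found in the metadata TSV lines."""
    reader = csv.DictReader(lines, delimiter="\t")
    missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    for row in reader:
        strain = row["strain"]
        if strain not in names:
            problems.append(f"{strain}: no matching sequence in the FASTA file")
        if not is_complete_date(row["date"]):
            problems.append(f"{strain}: date {row['date']!r} is not YYYY-MM-DD")
        if row["region_exposure"] not in CONTINENTS:
            problems.append(f"{strain}: unknown region {row['region_exposure']!r}")
    return problems
```

Both functions accept any iterable of text lines, so they can be called with open file handles, for example `check_metadata(open("metadata.tsv", newline=""), fasta_names(open("sequences.fasta")))`. An empty list means that no problems were found by these particular checks.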
All data are in the `data/` directory. The raw data and intermediate data of each project will be stored in its corresponding directory.
For a new project (here named `tutorial_project`):
- Create your project data folder in `data/`.
- Create a `rawdata/` folder in `data/tutorial_project`.
- Move your sequence data and metadata into the `data/tutorial_project/rawdata/` folder.
Now, the `data/` directory structure should look like this:
data
├── README.md
├── example_project
│   └── rawdata
│       ├── metadata.tsv
│       └── sequences.fasta
└── tutorial_project
    └── rawdata
        ├── metadata.tsv
        └── sequences.fasta
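
The project setup steps above can also be scripted. This is a small sketch using only the Python standard library. The helper name `set_up_project` is hypothetical and not part of covSampler, and it assumes the input files may have any name and should be renamed to `sequences.fasta` and `metadata.tsv` when moved.

```python
import shutil
from pathlib import Path


def set_up_project(data_dir, project, sequences, metadata):
    """Create <data_dir>/<project>/rawdata/ and move both input files into it.

    The files are renamed to the names the tutorial expects.
    Returns the path of the rawdata folder.
    """
    rawdata = Path(data_dir) / project / "rawdata"
    # parents=True also creates the project folder itself if it is missing.
    rawdata.mkdir(parents=True, exist_ok=True)
    shutil.move(str(sequences), str(rawdata / "sequences.fasta"))
    shutil.move(str(metadata), str(rawdata / "metadata.tsv"))
    return rawdata
```

For example, `set_up_project("data", "tutorial_project", "my_sequences.fasta", "my_metadata.tsv")` would produce the `tutorial_project` branch of the layout shown above. The two input file names here are placeholders.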