
Import spreadsheet

martinghunt edited this page Apr 23, 2018 · 1 revision

Import spreadsheet

This page describes the format of the spreadsheet that is required as input to the import pipeline. The columns in the spreadsheet were primarily designed for the CRyPTIC project, but can be applied to other data.

A template spreadsheet called out.xlsx can be generated using this command:

singularity exec clockwork_container.img clockwork make_import_spreadsheet out.xlsx

The column headings must not be edited, otherwise the import pipeline will fail!

The columns are as follows:

  • subject_id. The ID of the subject/patient.

  • site_id. This is the name of the site where the data were generated.

  • lab_id. The ID for the sample that the lab used.

  • isolate_number. An isolate number for the sample.

  • sequence_replicate_number. The sequencing replicate number for this isolate.

  • submission_date. Any date you like (in the CRyPTIC project, this is the date when the sample was submitted for processing).

  • reads_file_1. The name of the first FASTQ file. This must exactly match the name of the file on disk when running import.

  • reads_file_1_md5. The md5 sum of reads_file_1.

  • reads_file_2. The name of the second FASTQ file. This must exactly match the name of the file on disk when running import.

  • reads_file_2_md5. The md5 sum of reads_file_2.

  • dataset_name. Your data can be grouped together into datasets of your choosing, for example a dataset for a publication. Pipelines can also be run on a per-dataset_name basis (instead of the default, which is to run on all samples). The same dataset_name can be used in different spreadsheets, allowing reads to be added to a dataset at a later date. If you do this, ensure that the dataset_name is identical across all spreadsheets.

  • instrument_model. The Illumina instrument used for sequencing. It must be one of the following: Illumina Genome Analyzer, Illumina Genome Analyzer II, Illumina Genome Analyzer IIx, Illumina HiSeq 2500, Illumina HiSeq 2000, Illumina HiSeq 1500, Illumina HiSeq 1000, Illumina MiSeq, Illumina HiScanSQ, HiSeq X Ten, NextSeq 500, HiSeq X Five, Illumina HiSeq 3000, Illumina HiSeq 4000, NextSeq 550.

  • ena_center_name. If reads are submitted to the ENA, this is the name of the submitting centre that will be used.

  • submit_to_ena. Must be 0 or 1. Set this to 1 if you want the reads to be submitted when running the ENA submission pipeline. Otherwise, use 0.

  • ena_on_hold. Must be 0 or 1. Only applies if submit_to_ena is 1. Setting this to 1 will keep the data hidden from the public for up to two years. Setting this to 0 will make the data public upon submission.

  • ena_run_accession. Set this to 0.

  • ena_sample_accession. Set this to 0.
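
Several of the columns above have constraints that are easy to check before running the import. The following is a minimal sketch in Python, not part of clockwork itself: the helper names md5_of_file and check_row are hypothetical, and it assumes each spreadsheet row has already been loaded into a dict mapping column name to value (as a string or integer). It shows one way to compute the md5 sums for reads_file_1_md5 and reads_file_2_md5, and to check instrument_model and the 0/1 columns against the rules listed above.

```python
import hashlib

# Allowed values, copied from the instrument_model entry in the column list.
ALLOWED_INSTRUMENT_MODELS = {
    "Illumina Genome Analyzer",
    "Illumina Genome Analyzer II",
    "Illumina Genome Analyzer IIx",
    "Illumina HiSeq 2500",
    "Illumina HiSeq 2000",
    "Illumina HiSeq 1500",
    "Illumina HiSeq 1000",
    "Illumina MiSeq",
    "Illumina HiScanSQ",
    "HiSeq X Ten",
    "NextSeq 500",
    "HiSeq X Five",
    "Illumina HiSeq 3000",
    "Illumina HiSeq 4000",
    "NextSeq 550",
}

# Columns documented as "Must be 0 or 1".
FLAG_COLUMNS = ("submit_to_ena", "ena_on_hold")


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex md5 digest of a file, reading it in chunks
    so that large FASTQ files are not loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_row(row):
    """Return a list of problems found in one row.

    `row` is assumed to be a dict of column name -> value (hypothetical
    input format; how the spreadsheet is loaded is left to the reader).
    """
    problems = []
    model = row.get("instrument_model")
    if model not in ALLOWED_INSTRUMENT_MODELS:
        problems.append(f"instrument_model not recognised: {model!r}")
    for column in FLAG_COLUMNS:
        if str(row.get(column)) not in ("0", "1"):
            problems.append(f"{column} must be 0 or 1")
    return problems
```

For example, md5_of_file("reads_1.fastq.gz") would give the value to enter in reads_file_1_md5 for a file of that (hypothetical) name, and check_row returns an empty list when a row passes these particular checks. The import pipeline may apply further checks that this sketch does not cover.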