Modern DNA sequencing machines generate several gigabytes (GB) of data per run. Organizing and archiving this data presents a challenge for small labs. We present a Sequence Upload and Data Archiving System (SeqUDAS) that aims to ease the task of maintaining a sequence data repository through process automation and an intuitive web interface.
- Automated upload and storage of sequence data to a central storage server.
- Data validation with MD5 checksums for data integrity assurance (see the sketch after this list).
- Illumina modules are incorporated to parse the run metrics binaries and generate a report similar to Illumina SAV.
- FastQC and MultiQC workflows are included to perform QC analysis automatically.
- A taxonomic report is generated from the Kraken results.
- Archival information, QC results, and the taxonomic report can be viewed through a mobile-friendly web interface.
- Sequence data can be passed along to another remote server (IRIDA) via its API.
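SeqUDAS uses md5deep for the checksum step; purely as an illustration of the integrity check, here is a minimal Python sketch that verifies a transferred run folder against an md5deep/md5sum-style manifest of `<md5>  <filename>` lines (the manifest name `checksums.md5` is an assumption, not part of SeqUDAS):

```python
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(run_dir, manifest="checksums.md5"):
    """Compare files in run_dir against an md5deep/md5sum-style manifest.

    Returns a list of (filename, reason) tuples for any problems found.
    """
    run_dir = Path(run_dir)
    failures = []
    for line in (run_dir / manifest).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(None, 1)
        target = run_dir / name.strip()
        if not target.exists():
            failures.append((name.strip(), "missing"))
        elif md5_of(target) != expected:
            failures.append((name.strip(), "checksum mismatch"))
    return failures
```

A non-empty return value would indicate that the transfer should be repeated for the affected files.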
SeqUDAS consists of three components:
- Data Manager: Installed on a PC directly attached to an Illumina sequencing machine.
- Data Analyzer: Installed on a server to run the data analysis jobs.
- Web Application: Installed on a web server to provide account management and report viewing.
The package requires:
| Component | Software requirements |
|---|---|
| Data Manager | Cygwin (Python, OpenSSH, cron, rsync, md5deep) |
| Data Analyzer | Python, Kraken, FastQC, MultiQC, md5deep |
| Web Application | Apache, MySQL, PHP, UserSpice, Bootstrap |
You must provide a configuration file on both the Data Manager and the Data Analyzer.
Here is an example for the Data Manager:
```ini
[basic]
sequencer = machine_name
run_id_prefix = BCCDCN
# Complete list of timezones: https://gist.github.com/heyalexej/8bf688fd67d7199be4a1682b3eec7568
timezone = Canada/Pacific
write_logfile_details = False
old_file_days_limit = 180
admin_email = [email protected]
logfile_dir = Log
send_email = False

[sequencer]
run_dirs = //machine/MiSeqAnalysis

[local]
ssh_private_key = /home/sequdas/.ssh/id_rsa

[server]
server_ssh_host = [email protected]
server_data_dir = /data/sequdas

[mysql_account]
mysql_host = 127.0.0.1
mysql_user = test
mysql_passwd = test
mysql_db = sequdas

[email_account]
gmail_user = test
gmail_pass = test
```
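The Data Manager reads this file and pushes finished run folders to the storage server with rsync over SSH. The following is only a sketch of that idea, not the actual sequdas_client.py code; the config filename, function names, and the run folder path are hypothetical:

```python
import configparser
import subprocess

def load_config(path="sequdas_client.cfg"):  # hypothetical config filename
    config = configparser.ConfigParser()
    config.read(path)
    return config

def upload_run(run_dir, config):
    """Mirror a finished run folder to the central storage server with rsync over SSH."""
    ssh_key = config["local"]["ssh_private_key"]
    host = config["server"]["server_ssh_host"]   # e.g. [email protected]
    dest = config["server"]["server_data_dir"]   # e.g. /data/sequdas
    cmd = [
        "rsync", "-av", "--partial",
        "-e", f"ssh -i {ssh_key}",
        run_dir.rstrip("/"),
        f"{host}:{dest}/",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    cfg = load_config()
    # Hypothetical MiSeq run folder name, used only for illustration.
    upload_run("//machine/MiSeqAnalysis/190101_M00123_0001_000000000-ABCDE", cfg)
```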
Here is an example for the Data Analyzer:
```ini
[basic]
write_logfile_details = False
admin_email = [email protected]
logfile_dir = Log
send_email = False

[reporter]
reporter_ssh_host = [email protected]
qc_dir = /home/sequdas/img

[mysql_account]
mysql_host = 127.0.0.1
mysql_user = test
mysql_passwd = test
mysql_db = sequdas

[email_account]
gmail_user = test
gmail_pass = test
```
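For illustration, the QC part of the pipeline (FastQC followed by MultiQC, steps 2 and 3 below) reduces to two external calls. This is a minimal sketch, assuming both tools are on the PATH; the directory paths are placeholders, and the real logic lives in sequdas_server.py:

```python
import glob
import os
import subprocess

def run_fastqc(fastq_dir, out_dir):
    """Run FastQC on every fastq.gz file in fastq_dir, writing reports to out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    fastqs = sorted(glob.glob(os.path.join(fastq_dir, "*.fastq.gz")))
    if fastqs:
        subprocess.run(["fastqc", "-o", out_dir] + fastqs, check=True)

def run_multiqc(qc_dir, out_dir):
    """Aggregate the FastQC results into a single MultiQC report."""
    subprocess.run(["multiqc", qc_dir, "-o", out_dir, "-f"], check=True)

if __name__ == "__main__":
    # Placeholder paths; qc_dir in the config above points at /home/sequdas/img.
    run_fastqc("/data/sequdas/run_folder/Data/Intensities/BaseCalls",
               "/home/sequdas/img/run_folder/fastqc")
    run_multiqc("/home/sequdas/img/run_folder/fastqc",
                "/home/sequdas/img/run_folder/multiqc")
```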
SeqUDAS uses cron to trigger jobs on a time schedule. Once you have installed the packages, you can install cron as a Windows service using cygrunsrv:

```bash
cygrunsrv --install cron --path /usr/sbin/cron --args -n
```
If you want to schedule the archiving job to run at 10:30 am every day, open the crontab:

```bash
crontab -e
```

and add the line:

```
30 10 * * * python //path_for_sequdas_client/sequdas_client.py
```
To run the analysis pipeline on the Data Analyzer:

```
python sequdas_server.py -i <input_directory> -o <output_directory> -s <step>

  -h --help
  -i --in_dir    input directory (required)
  -o --out_dir   output directory (required)
  -s --step      step (required)
                 step 1: Run MiSeq reporter
                 step 2: Run FastQC
                 step 3: Run MultiQC
                 step 4: Run Kraken
                 step 5: Run IRIDA uploader
  -u             SeqUDAS ID
  -e             False: won't send email (default)
                 True: send email
  -n             False: won't run the IRIDA uploader (default)
                 True: run the IRIDA uploader
  -t             False: run only the current step (default)
                 True: run the current step and all following steps
  -k             False: won't keep the Kraken results (default)
                 True: keep the Kraken results
```
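For example, to start from the FastQC step and run all following steps on a finished run (the input and output paths below are placeholders; the True/False argument form follows the option list above):

```bash
python sequdas_server.py -i /data/sequdas/190101_M00123_0001_000000000-ABCDE -o /data/sequdas/analysis -s 2 -t True -e False
```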
Examples of the reports that can be viewed through the web interface:

- An Illumina SAV report
- A MultiQC report
- A Kraken report
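The genus-level contamination summary is derived from the Kraken output. As a sketch only (the report filename and the 1% threshold are arbitrary), and assuming the standard Kraken report format of tab-separated columns (percentage of reads, clade read count, direct read count, rank code, NCBI taxonomy ID, name), genus-level hits can be extracted like this:

```python
import csv

def genus_level_hits(report_path, min_percent=1.0):
    """Return (genus, percent_of_reads) pairs above min_percent from a Kraken report."""
    hits = []
    with open(report_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            percent, _clade_reads, _direct_reads, rank, _taxid, name = row[:6]
            # Rank code "G" marks genus-level entries in the Kraken report.
            if rank == "G" and float(percent) >= min_percent:
                hits.append((name.strip(), float(percent)))
    return sorted(hits, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for genus, percent in genus_level_hits("sample.kraken_report.txt"):
        print(f"{genus}\t{percent:.2f}%")
```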
Version v0.1.2
- Added MultiQC pipeline to the QC report.
- Added the taxonomic analysis report (Kraken).
- Added the contamination detection results. Detected organisms are displayed at the genus level.
- Changed the tables to Bootstrap tables to support sorting and to display better on tablets, phones, and PCs.
- Added the sample information to the collapsible table.
Version v0.1.3
- Separated the code into different libraries and modules.
- Moved the analysis pipeline to run on the server only, to avoid interruptions from unstable internet connections.
- Fixed an issue with sample names containing spaces or dashes.
Our implementation of uploading data to IRIDA uses code from the IRIDA MiSeq Uploader.
Jun Duan: [email protected]
Dan Fornika: [email protected]
Damion Dooley: [email protected]
William Hsiao (PI): [email protected]