Skip to content

Latest commit

 

History

History
94 lines (74 loc) · 3.74 KB

README.md

File metadata and controls

94 lines (74 loc) · 3.74 KB

visualization_pipeline

Documentation:

->Requirements:
	flask
	scikit-allel
	GEOparse
	wget
	pickle
	numpy 
	matplotlib
	seaborn
	bcolz
	pandas
	

->GUI Website:
	In the src/ folder we have the instrastructure for a flask webpage.  The subdirectory venv contains the environment variables, the subdirectory app contains the material for the front facing webseries, and the subdirectory pipeline contains the python programs to run with the webpage.  

	To run the website: 
		Activate the virtual environment: 
			venv\Scripts\activate
		Tell Flask where to find the webpage file:
			export FLASK_APP=webpage_test.py
			or for windows set FLASK_APP=webpage_test.py
			OR you can permanently set it by adding FLASK_APP=webpage_test.py to .flaskenv
		Run: 
			On the command line run: flask run 
		Open up the local page in a web browser window.  
		
	Input Fields
		GEO Accession Number:  the series accession number of the study of interest which can be found at Gene Expression Omnibus (GEO)
		Output Label:  output label for resulting files (vcf, bcf, tab) and image files (for any figures)
		ChipType: Illumina Beadchip used in study.  This could be found on the GEO page for the study under platform. If you encounter errors in the pipeline, it may be that the version is incorrect.  
		Visualization: Types of visualization graphs to output. These will be saves as jpgs in the series folder.  
		Subsample: If you would like to just select certain samples from the series.  Once all the files are read in then the user can select the samples of interest.  
		
	TO DO:
		- allow user to select column in metadata
		- allow user to select subset of data 
	
->Command Line Program:
	Input Fields:
	
	TO DO: 
		- modify web pipeline program to run at the command line 
	

->Pipeline:
	1. Check if series (GSE) has been downloaded yet and create series directory.  
	2. Identify the files needed based on Beadchip type.  
	3. Download all the sample IDAT files and download metadata.
	4. Create bash file and run bash file which will:
		a. Move IDAT files into SAMPLE directories. 
		b. Move metadata into SAMPLE directories.  
		c. Run Autoconvert on all IDAT files > converts to GTC format  
		d. Create filelist of all GTC files (there will be one per sample)
		e. Convert GTC files to BCF format using BCFtools GTC2VCF
		f. Sort, Normalize, and Index BCF file. 
		g. Create VCF format from BCF format.  
	
	TO DO for Pipeline: 
		- compare a single sample to the other samples 
	
->Types of Visualization: 
	Heterozygosity - Graphs the heterozygosity (percent of 0/1 or 1/0 per variant) across the genome
	PCA - Principal Components Analysis 
	Circos - circular graph that can contain SNP density, heterozygosity, and FST
	Site Frequency Spectrum - Graph of the number of variants for each derived allele count
	Heat Topology - 
	
	TO DO: 
		- allow user to select specific tracts for the circos plots
		- more forms of visualization 

-> References: 
	- https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-i-hello-world
	- Scikit-allel
	- https://github.com/freeseek/gtc2vcf


-> Possible Tutorials to Explore: 
	-http://corearray.sourceforge.net/tutorials/SNPRelate/
	-https://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/clustering/hierarchical/clust_complete_linkage.ipynb
	-https://sebastianraschka.com/Articles/heatmaps_in_r.html
	-PCA http://hardingnj.github.io/2018/08/15/making-interactive-pca-plots.html
	-Scikit-allel http://alimanfoo.github.io/2016/06/10/scikit-allel-tour.html
	-https://plantgenomics.univie.ac.at/radseq/snpfiltering/ heatmap2 R function 
	-http://people.virginia.edu/~wc9c/KING/manual.html
	-Saving Plots in R https://www.stat.berkeley.edu/~s133/saving.html