#ePygenetics
This programme extracts data from .wig files based on user input. The output of the programme is a database called ePygenetics.csv
which contains all data extracted by the user.
##Project Description
This programme was designed as part of Medsci 736, a course on Digital Research Skills at the University of Auckland, New Zealand.
The project was designed by Dr Justin O'Sullivan from the Liggins Institute who wanted to add the epigenetic status of single nucleotide polymorphisms to his genetic data analysis.
In simple terms, this programme allows scientists to get extra information about person to person genetic variation. This information can then be used to better understand how genetics is related to disease and to look for treatments for diseases with a genetic component.
The programme was designed using data from the NIH Roadmap Epigenomics Repository which is open access. A subset of 6 modified files has been included in the test-data
folder for testing the programme. For more information see the data readme file
##Contributors
- Callum Chalmers
- Kreshnik Pireva
- Luis Miguel
- Bibiana Lee
##Licensing
All code and associated documentation in this repository is under the MIT license. Data in the test-data file
was downloaded from here and is covered by the NIH Epigenomic Data Policy. All other files are licensed under CC-BY-SA 4.0 International. For more information see the LICENSE file.
##Prerequisites
- Operating System
- The programme was scripted for Ubuntu 16.04 LTS, it has been tested and is also compatible with macOS Sierra 10.12 and Windows 10 Home
- If you are using the script on Windows, on line 13 of
ePygenetics.py
you will need to change the code to reados.system('cls')
for the programme to clear the screen
- Python
- To run the software you will need Python 3.5.2 or a compatible version
- Follow this link to download Python 3.5.2
- Alternatively under the MIT license you can update the code to be compatible with your own version of Python
- Python Packages
- The software requires the Python package Colorama version 0.3.7 or a compatible version
- To install this package:
- On Windows, open the command line and run the command
py -m pip install colorama==0.3.7
- On Linux, open the command line and run the command
pip install colorama==0.3.7
- On OS X, open the command line and run the command
pip3 install colorama==0.3.7
- If you already have colorama installed and the version is incompatible add the optional
-I
argument to ignore previous versions
- On Windows, open the command line and run the command
- For testing, the software requires the Python package pytest version 2.9.2 or a compatible version, see "Testing the programme" for instructions on how to run this
##Input Data Requirements
- The programme can only read Wiggle (.wig) files with a fixed step of 20, this is a standard file format for genetic data with a very specific structure, see this link for more information about this file type
- For the programme to read the files, both the script and the files need to be in the same directory
- At least one .wig file is needed to run the programme
- Files must contain a "-" in their name to be read by the programme, whatever text is before the dash will be the column heading in the output database so bear this in mind when naming files
##Limitations
- The last block of base pairs of each chromosome (the remainder when the chromosome length is divided by 1000) cannot be read due to a change in the file structure at this point, any SNP in this range will be treated as if it is not in the file and return NaN
- For the output to be valid, all files must use the same genome build, see this link for more information about genome releases
-
Download
test-data
and transfer the contents to the same directory as theePygenetics.py
script -
Open the command line to folder containing the
ePygenetics.py
script -
Type the command
python ePygenetics.py
-
You will be greeted with a menu with 4 options:
1. Add a cell line (loading a file) 2. Add a SNP 3. Help 4. Exit
###Loading the test data files
- Start by typing
1
and pressing enter
- You will be transferred to the Add a Cell Line function and a message will appear saying
Enter the cell line name (the characters before the '-' in the file name) or 0 to return to the main menu
- Type
CD34+
and press enter
- A message will appear which says
Cell line added
and you will be asked what you would like to do next
-
Type
1
and press enter -
Repeat the last two steps to add the following cell lines:
IMR90 FetalLung FetalKidney FetalHeart FetalAdrenal
-
If you enter something incorrectly, red text saying
Cell line file not in folder
will appear, press enter to continue and re-enter the cell line correctly -
If you enter a cell line that you have already entered, red text saying
Cell line already entered
will appear, enter1
to continue and re-enter a different cell line
- Once this is complete and you have added all the cell lines, type
2
and press enter to return to the main menu
###Adding SNPs
- On the main menu, type
2
and press enter
- You will be transferred to the Add a SNP function and a message will appear saying
Enter the chromosome and position of the SNP you wish to add or enter 0 to return to the main menu
-
Type
10
for the chromosome and press enter -
Type
58247
for the SNP position and press enter
-
The programme will then pause briefly while it searches through the loaded files
-
Once the data has been added, a message will appear saying
SNP added
and the programme will ask what you want to do next
-
Type
1
and press enter -
Repeat the last three steps using the following chromosome, SNP pairs:
chromosome = 10 SNP = 86145 chromosome = 10 SNP = 63726 chromosome = 10 SNP = 51102 chromosome = 10 SNP = 92003 chromosome = 10 SNP = 61994
- If you enter a SNP that you have already entered, red text saying
SNP already entered
will appear, enter1
to continue and re-enter a different SNP
- Once this is complete exit the programme by typing
3
and pressing enter
###Expected output
-
Go to the file containing the
ePygenetics.py
script and find the output databaseePygenetics.csv
-
If you open it with a text editor it should look like this:
snps,CD34+,IMR90,FetalLung,FetalHeart,FetalAdrenal,FetalKidney chr10-58247*,42,NaN,NaN,NaN,NaN,NaN, chr10-86145*,1,2,0,0,67,0, chr10-63726*,NaN,NaN,2,32,2,1, chr10-51102*,NaN,0,0,NaN,NaN,21, chr10-92003*,NaN,NaN,84,0,0,NaN, chr10-61994*,NaN,59,NaN,NaN,NaN,NaN,
-
If you open it with a Spreadsheet application it should look like this:
snps CD34+ IMR90 FetalLung FetalHeart FetalAdrenal FetalKidney chr10-58247* 42 NaN NaN NaN NaN NaN chr10-86145* 1 2 0 0 67 0 chr10-63726* NaN NaN 2 32 2 1 chr10-51102* NaN 0 0 NaN NaN 21 chr10-92003* NaN NaN 84 0 0 NaN chr10-61994* NaN 59 NaN NaN NaN NaN -
Alternatively, there is a file in the
test-data
folder calledePygenetics-sample-output.csv
which was generated by following these instructions and can be used as a comparison
##Running the programme using your own data
-
Open the command line to folder containing the Python script called
ePygenetics.py
-
Type the command
python ePygenetics.py
-
You will be greeted with a menu with 4 options:
1. Add a cell line (loading a file) 2. Add a SNP 3. Help 4. Exit
You will then be asked to choose which of the 4 options you would like to do. Type the number corresponding to the action you would like to perform and press enter.
###Adding a cell line (loading a file)
- A message will pop up saying
Enter the cell line name (the characters before the '-' in the file name) or 0 to return to the main menu
- If the output database
ePygenetics.csv
does not exist, it will be created as part of this process - For this action to work the corresponding file must be in the same directory as the
ePygenetics.py
script - The file must be named
[cell line]-[any text].wig
- The text before the dash in the file name must be unique for the programme to function correctly, for example if you have two files from the same cell line call them
Epithelial1-xxx.wig
andEpithelial2-xxx.wig
- Once a file is added, it must stay in the folder for the programme to run correctly, if the file is removed, the output will be invalid
- Type the cell line name and press enter
- If you enter a valid cell line, the programme will add a column to the database and search the file for any SNPs that have been entered into the file
- If you enter a cell line that does not have a corresponding file in the directory an error message will pop up saying
Cell line file not in folder
, to clear move the file into the directory and then press enter - If you enter a cell line that is already in the database, an error message will pop up saying
Cell line already entered
, see below for how to proceed - If you enter
0
you will returned to the menu - Once this is complete, the programme will notify you
Cell line added
and ask if you would like to add another cell line, return to the main menu or exit the programme - Type the character corresponding to how you would like to proceed and press enter
- If you press
1
, the you can add another file - If you press
2
, the programme will return to the menu - If you press
3
, the programme will exit
###Adding a SNP
- A message will pop up saying
Enter the chromosome and position of the SNP you wish to add or enter 0 to return to the main menu
- Type the chromosome your SNP is on and press enter
- If you enter a value that is not the numbers 0-23 or the letters m/M, x/X or y/Y, an error message will pop up saying
Illegal character entered
and you will be asked to press enter and will need to re-enter a valid input - If you enter
0
you will returned to the main menu - Then you will be asked to enter the position of the SNP on that chromosome. The position is the location of that SNP in base pairs.
- Type the SNP position and press enter
- If you enter
0
you will be returned to main menu - If the output database
ePygenetics.csv
does not exist then it will be created at this point - If you enter a SNP that is already in the database, an error message will appear saying
SNP already entered
, see below for how to proceed - If you enter a SNP by accident, the only way to remove it is to wait until the programme is finished and then to delete the entire from the output database
- At this point the programme will add a row to the database for that SNP
- The programme will then take the user parameters and use them to search for that SNP in all the files that have been loaded into the database using the Add a cell line function
- If no files have been loaded, it will simply notify you that your SNP has been added and ask how you want to proceed (see below)
- The programme will then return the number of contigs that aligned to the section of the genome that covers that SNP for each file. If the region of the genome that contains your SNP is not in the file, the programme will return NaN.
- The programme will then populate the row of the database using the values it extracted from each file
- A message will then appear saying
SNP added
- The programme will then ask if you want to add another SNP, want to return to the main menu or exit the programme
- Type the number corresponding to how you want to proceed and press enter
- If you press
1
, this cycle will repeat - If you press
2
, the programme will return to the menu - If you press
3
, the programme will exit
###Help
- You will be redirected to a brief manual that explains what each function does
- You will then be asked to press enter to continue and you will be redirected to the main menu
###Exit
- The programme will close
###Expected Output
- Once you are finished, all the data you have added can be visualised in the output database
ePygenetics.csv
which can be found in the same directory as theePygenetics.py
script. - This file can be opened in almost any text editor but is best visualised in Google Docs, Libre or Open Office Calc or Microsoft Excel.
- The output database should have all the added cell lines as column headings and all the added SNPs as row headings. All the cells within these columns and rows should be filled with data values or NaN
- Changing this database in any way could alter the way the programme runs so if you want to manipulate it, copy it to a different directory or copy it and rename it
##Testing the programme
- This software comes with a set of unit tests to check the programme is functioning correctly
- These were designed using pytest version 2.9.2
- To run the tests, follow these steps:
-
Download
test-data
and transfer the contents to the same directory as theePygenetics.py
script -
Download the
test_ePygenetics
folder and move the entire folder to the same directory as theePygenetics.py
script -
In the
ePygenetics.py
script, delete the last linemain()
and save the file -
To install pytest version 2.9.2, open the command line in the directory containing the
ePygenetics.py
script:- On Windows, run the command
py -m pip install pytest==2.9.2
- On Linux, run the command
pip install pytest==2.9.2
- On OS X, run the command
pip3 install pytest==2.9.2
- If you already have pytest installed, and the version is incompatible, include the optional
-I
argument to ignore previous versions
- On Windows, run the command
-
To check this worked correctly:
- On Windows, type the command
py -m pytest --version
- On Linux, type the command
python -m pytest --version
- On OS X, type the command
python3 -m pytest --version
- This should return a message similar to this, depending on your version, operating system and directory pathway:
This is pytest version 2.9.2, imported from /home/admin736/anaconda3/lib/python3.5/site-packages/pytest.py
- On Windows, type the command
-
Once this is complete:
- On Windows, run the command
py -m pytest -pyargs test-ePygenetics/test_ePygenetics.py
- On Linux, run the command
python -m pytest
- On OS X, run the command
python3 -m pytest
- On Windows, run the command
-
This will collect all the tests in the
test_ePygenetics
folder and run them -
The output should be similar to this, depending on your Python and pytest versions, your operation system and your rootdir, the most important thing is there are 16 tests and they all pass:
============================= test session starts ============================== platform linux -- Python 3.5.2, pytest-2.9.2, py-1.4.31, pluggy-0.3.1 rootdir: /home/admin736/Desktop/Assignments/CallumChalmers29-crispy-disco, inifile: collected 16 items test-ePygenetics/test_ePygenetics.py .................. ========================== 16 passed in 0.03 seconds ===========================
-
This confirms the programme is working correctly, if you do not get this screen, delete and redownload the contents of the GitHub repo, check all your package versions and try again
-
Once this is complete, in the
ePygenetics.py
script add the last linemain()
back in and save the file to run the programme properly
-