Pip pkg #39

Merged · 10 commits · Nov 24, 2021
62 changes: 62 additions & 0 deletions .github/workflows/deploy.yaml
@@ -0,0 +1,62 @@
name: Build and upload to PyPI

# Build on every branch push, tag push, and pull request change:
on: [push, pull_request]
# Alternatively, to publish when a (published) GitHub Release is created, use the following:
# on:
#   push:
#   pull_request:
#   release:
#     types:
#       - published

jobs:
  build_wheels:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-20.04]

    steps:
      - uses: actions/checkout@v2

      - name: Build wheels
        uses: pypa/[email protected]

      - uses: actions/upload-artifact@v2
        with:
          path: ./wheelhouse/*.whl

  build_sdist:
    name: Build source distribution
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Build sdist
        run: pipx run build --sdist

      - uses: actions/upload-artifact@v2
        with:
          path: dist/*.tar.gz

  upload_pypi:
    needs: [build_wheels, build_sdist]
    runs-on: ubuntu-latest
    # upload to PyPI on every tag starting with 'v'
    if: github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/v')
    # alternatively, to publish when a GitHub Release is created, use the following rule:
    # if: github.event_name == 'release' && github.event.action == 'published'
    steps:
      - uses: actions/download-artifact@v2
        with:
          name: artifact
          path: dist

      - uses: pypa/[email protected]
        with:
          user: __token__
          repository_url: https://test.pypi.org/legacy/
          password: ${{ secrets.PYPI_TOKEN }}
          # To test: repository_url: https://test.pypi.org/legacy/
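The `upload_pypi` job only fires on pushed refs that start with `refs/tags/v`, i.e. version tags. A minimal shell sketch of that filter (the ref names here are illustrative, not taken from the repository):

```shell
# Sketch of the upload_pypi 'if' condition: only refs matching refs/tags/v*
# (version tags such as v0.3) should trigger the PyPI upload.
would_upload() {
  case "$1" in
    refs/tags/v*) echo yes ;;
    *)            echo no  ;;
  esac
}

would_upload refs/heads/main   # -> no
would_upload refs/tags/v0.3    # -> yes
```

Branch pushes still build wheels and the sdist; only the final upload step is gated this way.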
5 changes: 5 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,5 @@
include diploshic/shicstats.pyf
include diploshic/testEmpirical.fvec
include diploshic/testing/*
include diploshic/training/*
include diploshic/utils.c
82 changes: 40 additions & 42 deletions README.md
@@ -22,48 +22,44 @@ such as `conda` or `pip`. The complete list of dependencies looks like this:

## Install on linux
I'm going to focus on the steps involved to install on a linux machine using Anaconda as our python source / main
package manager. First download and install Anaconda
package manager. Assuming you have conda installed, create a new conda env

```
$ wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
$ bash Anaconda3-5.0.1-Linux-x86_64.sh
```
That will give us the basics (numpy, scipy, scikit-learn, etc). Next lets install scikit-allel using `conda`
```
$ conda install -c conda-forge scikit-allel
```
That's easy. Installing tensorflow and keras can be slightly more touchy. You will need to determine if
you want to use a CPU-only implementation (probably) or a GPU implementation of tensorflow. See
https://www.tensorflow.org/install/install_linux for install instructions. I'm going to install the
CPU version for simplicity. tensorflow and keras are the libraries which handle the deep learning
portion of things so it is important to make sure the versions of these libraries play well together
```
$ pip install tensorflow
$ pip install keras
$ conda create -n diploshic python=3.9 --yes
```

Note that because I'm using the Anaconda version of python, pip will only install this in the anaconda directory
which is a good thing. Okay that should be the basics of dependencies. Now we are ready to install `diploS/HIC` itself
which is a good thing. Now we are ready to install `diploS/HIC` itself

```
$ git clone https://github.com/kern-lab/diploSHIC.git
$ cd diploSHIC
$ python setup.py install
$ pip install .
```
Assuming all the dependencies were installed this should be all set

This should automatically install all the dependencies including tensorflow.
You will need to determine if
you want to use a CPU-only implementation (probably) or a GPU implementation of tensorflow. See
https://www.tensorflow.org/install/install_linux for install instructions.


## Usage
The main program that you will interface with is `diploSHIC.py`. This script has four run modes that allow the user to
The main program that you will interface with is `diploSHIC`. This script is installed by default
in the conda environment `bin` directory.
This script has four run modes that allow the user to
perform each of the main steps in the supervised machine learning process. We will briefly lay out the modes of use
and then will provide a complete example of how to use the program for fun and profit.

`diploSHIC.py` uses the `argparse` module in python to try to give the user a complete, command line based help menu.
`diploSHIC` uses the `argparse` module in python to try to give the user a complete, command line based help menu.
We can see the top level of this help by typing

```
$ python diploSHIC.py -h
usage: diploSHIC.py [-h] {train,predict,fvecSim,fvecVcf} ...
$ diploSHIC -h
usage: diploSHIC [-h] {train,predict,fvecSim,fvecVcf} ...

calculate feature vectors, train, or predict with diploSHIC

possible modes (enter 'python diploSHIC.py modeName -h' for modeName's help message:
possible modes (enter 'diploSHIC modeName -h' for modeName's help message:
  {fvecSim,makeTrainingSets,train,fvecVcf,predict}
                        sub-command help
    fvecSim             Generate feature vectors from simulated data
@@ -76,6 +72,8 @@ possible modes (enter 'python diploSHIC.py modeName -h' for modeName's help mess
optional arguments:
  -h, --help            show this help message and exit
```
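As a quick orientation before the detailed sections that follow, the run modes are typically used in this order (a workflow sketch, not diploSHIC output; the simulations themselves come first, from an external tool such as discoal):

```shell
# Typical ordering of the five diploSHIC run modes (workflow sketch).
i=1
for mode in fvecSim makeTrainingSets train fvecVcf predict; do
  echo "step $i: diploSHIC $mode"
  i=$((i + 1))
done
```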


### before running diploSHIC: simulating training/testing data
All flavors of S/HIC require simulated data for training (and ideally, testing). Users can select whatever simulator
they prefer and parameterize them however they wish. We have included an example script in this respository
@@ -84,15 +82,15 @@ https://github.com/kern-lab/discoal).

### feature vector generation modes
The first task in our pipeline is generating feature vectors from simulation data (or empirical data) to
use with the CNN that we will train and then use for prediction. The `diploSHIC.py` script eases this
use with the CNN that we will train and then use for prediction. The `diploSHIC` script eases this
process with two run modes

#### fvecSim mode
The fvecSim run mode is used for turning ms-style output into feature vectors compatible with `diploSHIC.py`. The
The fvecSim run mode is used for turning ms-style output into feature vectors compatible with `diploSHIC`. The
help message from this mode looks like this
```
$ python diploSHIC.py fvecSim -h
usage: diploSHIC.py fvecSim [-h] [--totalPhysLen TOTALPHYSLEN]
$ diploSHIC fvecSim -h
usage: diploSHIC fvecSim [-h] [--totalPhysLen TOTALPHYSLEN]
                         [--numSubWins NUMSUBWINS]
                         [--maskFileName MASKFILENAME]
                         [--chrArmsForMasking CHRARMSFORMASKING]
@@ -152,8 +150,8 @@ for a fleshed out example of how to use these features.
The fvecVcf mode is used for calculating feature vectors from data that is stored as a VCF file.
The help message from this mode is as follows
```
$ python diploSHIC.py fvecVcf -h
usage: diploSHIC.py fvecVcf [-h] [--targetPop TARGETPOP]
$ diploSHIC fvecVcf -h
usage: diploSHIC fvecVcf [-h] [--targetPop TARGETPOP]
                         [--sampleToPopFileName SAMPLETOPOPFILENAME]
                         [--winSize WINSIZE] [--numSubWins NUMSUBWINS]
                         [--maskFileName MASKFILENAME]
@@ -219,8 +217,8 @@ Once we have feature vector files ready to go we can train and test our CNN and
Before entering train mode we need to consolidate our training set into 5 files, one for each class. This is done using the
makeTrainingSets mode whose help message is as follows:
```
$ python diploSHIC.py makeTrainingSets -h
usage: diploSHIC.py makeTrainingSets [-h]
$ diploSHIC makeTrainingSets -h
usage: diploSHIC makeTrainingSets [-h]
                                  neutTrainingFileName
                                  softTrainingFilePrefix
                                  hardTrainingFilePrefix
@@ -251,10 +249,10 @@ optional arguments:
  -h, --help            show this help message and exit
```
#### train mode
Here is the help message for the train mode of `diploSHIC.py`
Here is the help message for the train mode of `diploSHIC`
```
$ python diploSHIC.py train -h
usage: diploSHIC.py train [-h] [--epochs EPOCHS] [--numSubWins NUMSUBWINS]
$ diploSHIC train -h
usage: diploSHIC train [-h] [--epochs EPOCHS] [--numSubWins NUMSUBWINS]
                       trainDir testDir outputModel

required arguments:
@@ -273,19 +271,19 @@ optional arguments:
As you will see in a moment train mode is used for training the deep learning classifier. Its required
arguments are trainDir (the directory where the training feature vectors
are kept), testDir (the directory where the testing feature vectors are kept), and outputModel the file name for the trained
network. One note -- `diploSHIC.py` expects five files named `hard.fvec`, `soft.fvec`, `neut.fvec`, `linkedSoft.fvec`, and
network. One note -- `diploSHIC` expects five files named `hard.fvec`, `soft.fvec`, `neut.fvec`, `linkedSoft.fvec`, and
`linkedHard.fvec` in the training and testing directories. The training and testing directory can be the same directory in
which case 20% of the training examples are held out for use in testing and validation.
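A sketch of the layout train mode expects, using the five class-file names from the paragraph above (the directory name `trainingSets/` is illustrative, not mandated by diploSHIC):

```shell
# diploSHIC train looks for these five class files in both trainDir and testDir.
mkdir -p trainingSets
for cls in hard soft neut linkedSoft linkedHard; do
  touch "trainingSets/$cls.fvec"
done
ls trainingSets
```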

train mode has two options, the number of subwindows used for the feature vectors and the number of training epochs for the
network.

### predict mode
Once a classifier has been trained, one uses the predict mode of `diploSHIC.py` to classify empirical data. Here is the help
Once a classifier has been trained, one uses the predict mode of `diploSHIC` to classify empirical data. Here is the help
statement
```
$ python diploSHIC.py predict -h
usage: diploSHIC.py predict [-h] [--numSubWins NUMSUBWINS]
$ diploSHIC predict -h
usage: diploSHIC predict [-h] [--numSubWins NUMSUBWINS]
                         modelStructure modelWeights predictFile
                         predictFileOutput

@@ -311,11 +309,11 @@ genomic data). Let's quickly give that code a spin. The directories `testing/` a
formatted diploid feature vectors that are ready to be fed into diploSHIC. First we will train the diploSHIC CNN, but we will
restrict the number of training epochs to 10 to keep things relatively brief (this runs in less than 5 minutes on our server).
```
$ python diploSHIC.py train training/ testing/ fooModel --epochs 10
$ diploSHIC train training/ testing/ fooModel --epochs 10
```
as it runs, a bunch of information monitoring the training of the network will appear. We are tracking the loss and accuracy in the
validation set. When optimization is complete our trained network will be contained in two files, `fooModel.json` and
`fooModel.weights.hdf5`. The last bit of output from `diploSHIC.py` gives us information about the loss and accuracy on
`fooModel.weights.hdf5`. The last bit of output from `diploSHIC` gives us information about the loss and accuracy on
the held-out test data. From the above run, mine looks like this:
```
evaluation on test set:
@@ -326,7 +324,7 @@ Not bad. In practice I would set the `--epochs` value much higher than 10- the d
Now that we have a trained model we can make predictions on some empirical data. In the repo there is a file called `testEmpirical.fvec`
that we will use as input
```
$ python diploSHIC.py predict fooModel.json fooModel.weights.hdf5 testEmpirical.fvec testEmpirical.preds
$ diploSHIC predict fooModel.json fooModel.weights.hdf5 testEmpirical.fvec testEmpirical.preds
```
the output predictions will be saved in `testEmpirical.preds` and should be straightforward to interpret.

2 changes: 2 additions & 0 deletions build.sh
@@ -0,0 +1,2 @@
python setup.py build
python setup.py install --prefix=$PREFIX