Pip pkg #39

Merged · 10 commits · Nov 24, 2021
62 changes: 62 additions & 0 deletions .github/workflows/deploy.yaml
@@ -0,0 +1,62 @@
name: Build and upload to PyPI

# Build on every branch push, tag push, and pull request change:
on: [push, pull_request]
# Alternatively, to publish when a (published) GitHub Release is created, use the following:
# on:
#   push:
#   pull_request:
#   release:
#     types:
#       - published

jobs:
  build_wheels:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-20.04]

    steps:
      - uses: actions/checkout@v2

      - name: Build wheels
        uses: pypa/[email protected]

      - uses: actions/upload-artifact@v2
        with:
          path: ./wheelhouse/*.whl

  build_sdist:
    name: Build source distribution
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Build sdist
        run: pipx run build --sdist

      - uses: actions/upload-artifact@v2
        with:
          path: dist/*.tar.gz

  upload_pypi:
    needs: [build_wheels, build_sdist]
    runs-on: ubuntu-latest
    # upload to PyPI on every tag starting with 'v'
    if: github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags/v')
    # alternatively, to publish when a GitHub Release is created, use the following rule:
    # if: github.event_name == 'release' && github.event.action == 'published'
    steps:
      - uses: actions/download-artifact@v2
        with:
          name: artifact
          path: dist

      - uses: pypa/[email protected]
        with:
          user: __token__
          repository_url: https://test.pypi.org/legacy/
          password: ${{ secrets.PYPI_TOKEN }}
          # To test: repository_url: https://test.pypi.org/legacy/
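The `upload_pypi` job only fires on pushed refs that start with `refs/tags/v`, i.e. version tags. A minimal shell sketch of that filter (the ref names here are illustrative, not taken from the repository):

```shell
# Sketch of the upload_pypi 'if' condition: only refs matching refs/tags/v*
# (version tags such as v0.3) should trigger the PyPI upload.
would_upload() {
  case "$1" in
    refs/tags/v*) echo yes ;;
    *)            echo no  ;;
  esac
}

would_upload refs/heads/main   # -> no
would_upload refs/tags/v0.3    # -> yes
```

Branch pushes still build wheels and the sdist; only the final upload step is gated this way.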
5 changes: 5 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,5 @@
include diploshic/shicstats.pyf
include diploshic/testEmpirical.fvec
include diploshic/testing/*
include diploshic/training/*
include diploshic/utils.c
82 changes: 40 additions & 42 deletions README.md
@@ -22,48 +22,44 @@ such as `conda` or `pip`. The complete list of dependencies looks like this:

## Install on linux
I'm going to focus on the steps involved to install on a linux machine using Anaconda as our python source / main
package manager. First download and install Anaconda
package manager. Assuming you have conda installed, create a new conda env

```
$ wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
$ bash Anaconda3-5.0.1-Linux-x86_64.sh
```
That will give us the basics (numpy, scipy, scikit-learn, etc). Next lets install scikit-allel using `conda`
```
$ conda install -c conda-forge scikit-allel
```
That's easy. Installing tensorflow and keras can be slightly more touchy. You will need to determine if
you want to use a CPU-only implementation (probably) or a GPU implementation of tensorflow. See
https://www.tensorflow.org/install/install_linux for install instructions. I'm going to install the
CPU version for simplicity. tensorflow and keras are the libraries which handle the deep learning
portion of things so it is important to make sure the versions of these libraries play well together
```
$ pip install tensorflow
$ pip install keras
$ conda create -n diploshic python=3.9 --yes
```

Note that because I'm using the Anaconda version of python, pip will only install this in the anaconda directory
which is a good thing. Okay that should be the basics of dependencies. Now we are ready to install `diploS/HIC` itself
which is a good thing. Now we are ready to install `diploS/HIC` itself

```
$ git clone https://github.com/kern-lab/diploSHIC.git
$ cd diploSHIC
$ python setup.py install
$ pip install .
```
Assuming all the dependencies were installed this should be all set

This should automatically install all the dependencies including tensorflow.
You will need to determine if
you want to use a CPU-only implementation (probably) or a GPU implementation of tensorflow. See
https://www.tensorflow.org/install/install_linux for install instructions.


## Usage
The main program that you will interface with is `diploSHIC.py`. This script has four run modes that allow the user to
The main program that you will interface with is `diploSHIC`. This script is installed by default
in the conda environment `bin` directory.
This script has four run modes that allow the user to
perform each of the main steps in the supervised machine learning process. We will briefly lay out the modes of use
and then will provide a complete example of how to use the program for fun and profit.

`diploSHIC.py` uses the `argparse` module in python to try to give the user a complete, command line based help menu.
`diploSHIC` uses the `argparse` module in python to try to give the user a complete, command line based help menu.
We can see the top level of this help by typing

```
$ python diploSHIC.py -h
usage: diploSHIC.py [-h] {train,predict,fvecSim,fvecVcf} ...
$ diploSHIC -h
usage: diploSHIC [-h] {train,predict,fvecSim,fvecVcf} ...

calculate feature vectors, train, or predict with diploSHIC

possible modes (enter 'python diploSHIC.py modeName -h' for modeName's help message:
possible modes (enter 'diploSHIC modeName -h' for modeName's help message:
  {fvecSim,makeTrainingSets,train,fvecVcf,predict}
                        sub-command help
    fvecSim             Generate feature vectors from simulated data
@@ -76,6 +72,8 @@ possible modes (enter 'python diploSHIC.py modeName -h' for modeName's help mess
optional arguments:
  -h, --help            show this help message and exit
```
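As a quick orientation before the detailed sections that follow, the run modes are typically used in this order (a workflow sketch, not diploSHIC output; the simulations themselves come first, from an external tool such as discoal):

```shell
# Typical ordering of the five diploSHIC run modes (workflow sketch).
i=1
for mode in fvecSim makeTrainingSets train fvecVcf predict; do
  echo "step $i: diploSHIC $mode"
  i=$((i + 1))
done
```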


### before running diploSHIC: simulating training/testing data
All flavors of S/HIC require simulated data for training (and ideally, testing). Users can select whatever simulator
they prefer and parameterize them however they wish. We have included an example script in this respository
@@ -84,15 +82,15 @@ https://github.com/kern-lab/discoal).

### feature vector generation modes
The first task in our pipeline is generating feature vectors from simulation data (or empirical data) to
use with the CNN that we will train and then use for prediction. The `diploSHIC.py` script eases this
use with the CNN that we will train and then use for prediction. The `diploSHIC` script eases this
process with two run modes

#### fvecSim mode
The fvecSim run mode is used for turning ms-style output into feature vectors compatible with `diploSHIC.py`. The
The fvecSim run mode is used for turning ms-style output into feature vectors compatible with `diploSHIC`. The
help message from this mode looks like this
```
$ python diploSHIC.py fvecSim -h
usage: diploSHIC.py fvecSim [-h] [--totalPhysLen TOTALPHYSLEN]
$ diploSHIC fvecSim -h
usage: diploSHIC fvecSim [-h] [--totalPhysLen TOTALPHYSLEN]
                         [--numSubWins NUMSUBWINS]
                         [--maskFileName MASKFILENAME]
                         [--chrArmsForMasking CHRARMSFORMASKING]
@@ -152,8 +150,8 @@ for a fleshed out example of how to use these features.
The fvecVcf mode is used for calculating feature vectors from data that is stored as a VCF file.
The help message from this mode is as follows
```
$ python diploSHIC.py fvecVcf -h
usage: diploSHIC.py fvecVcf [-h] [--targetPop TARGETPOP]
$ diploSHIC fvecVcf -h
usage: diploSHIC fvecVcf [-h] [--targetPop TARGETPOP]
                         [--sampleToPopFileName SAMPLETOPOPFILENAME]
                         [--winSize WINSIZE] [--numSubWins NUMSUBWINS]
                         [--maskFileName MASKFILENAME]
@@ -219,8 +217,8 @@ Once we have feature vector files ready to go we can train and test our CNN and
Before entering train mode we need to consolidate our training set into 5 files, one for each class. This is done using the
makeTrainingSets mode whose help message is as follows:
```
$ python diploSHIC.py makeTrainingSets -h
usage: diploSHIC.py makeTrainingSets [-h]
$ diploSHIC makeTrainingSets -h
usage: diploSHIC makeTrainingSets [-h]
                                  neutTrainingFileName
                                  softTrainingFilePrefix
                                  hardTrainingFilePrefix
@@ -251,10 +249,10 @@ optional arguments:
  -h, --help            show this help message and exit
```
#### train mode
Here is the help message for the train mode of `diploSHIC.py`
Here is the help message for the train mode of `diploSHIC`
```
$ python diploSHIC.py train -h
usage: diploSHIC.py train [-h] [--epochs EPOCHS] [--numSubWins NUMSUBWINS]
$ diploSHIC train -h
usage: diploSHIC train [-h] [--epochs EPOCHS] [--numSubWins NUMSUBWINS]
                       trainDir testDir outputModel

required arguments:
@@ -273,19 +271,19 @@ optional arguments:
As you will see in a moment train mode is used for training the deep learning classifier. Its required
arguments are trainDir (the directory where the training feature vectors
are kept), testDir (the directory where the testing feature vectors are kept), and outputModel the file name for the trained
network. One note -- `diploSHIC.py` expects five files named `hard.fvec`, `soft.fvec`, `neut.fvec`, `linkedSoft.fvec`, and
network. One note -- `diploSHIC` expects five files named `hard.fvec`, `soft.fvec`, `neut.fvec`, `linkedSoft.fvec`, and
`linkedHard.fvec` in the training and testing directories. The training and testing directory can be the same directory in
which case 20% of the training examples are held out for use in testing and validation.
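A sketch of the layout train mode expects, using the five class-file names from the paragraph above (the directory name `trainingSets/` is illustrative, not mandated by diploSHIC):

```shell
# diploSHIC train looks for these five class files in both trainDir and testDir.
mkdir -p trainingSets
for cls in hard soft neut linkedSoft linkedHard; do
  touch "trainingSets/$cls.fvec"
done
ls trainingSets
```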

train mode has two options, the number of subwindows used for the feature vectors and the number of training epochs for the
network.

### predict mode
Once a classifier has been trained, one uses the predict mode of `diploSHIC.py` to classify empirical data. Here is the help
Once a classifier has been trained, one uses the predict mode of `diploSHIC` to classify empirical data. Here is the help
statement
```
$ python diploSHIC.py predict -h
usage: diploSHIC.py predict [-h] [--numSubWins NUMSUBWINS]
$ diploSHIC predict -h
usage: diploSHIC predict [-h] [--numSubWins NUMSUBWINS]
                         modelStructure modelWeights predictFile
                         predictFileOutput

@@ -311,11 +309,11 @@ genomic data). Let's quickly give that code a spin. The directories `testing/` a
formatted diploid feature vectors that are ready to be fed into diploSHIC. First we will train the diploSHIC CNN, but we will
restrict the number of training epochs to 10 to keep things relatively brief (this runs in less than 5 minutes on our server).
```
$ python diploSHIC.py train training/ testing/ fooModel --epochs 10
$ diploSHIC train training/ testing/ fooModel --epochs 10
```
as it runs, a bunch of information monitoring the training of the network will appear. We are tracking the loss and accuracy in the
validation set. When optimization is complete our trained network will be contained in two files, `fooModel.json` and
`fooModel.weights.hdf5`. The last bit of output from `diploSHIC.py` gives us information about the loss and accuracy on
`fooModel.weights.hdf5`. The last bit of output from `diploSHIC` gives us information about the loss and accuracy on
the held-out test data. From the above run, mine looks like this:
```
evaluation on test set:
@@ -326,7 +324,7 @@ Not bad. In practice I would set the `--epochs` value much higher than 10- the d
Now that we have a trained model we can make predictions on some empirical data. In the repo there is a file called `testEmpirical.fvec`
that we will use as input
```
$ python diploSHIC.py predict fooModel.json fooModel.weights.hdf5 testEmpirical.fvec testEmpirical.preds
$ diploSHIC predict fooModel.json fooModel.weights.hdf5 testEmpirical.fvec testEmpirical.preds
```
the output predictions will be saved in `testEmpirical.preds` and should be straightforward to interpret.

2 changes: 2 additions & 0 deletions build.sh
@@ -0,0 +1,2 @@
python setup.py build
python setup.py install --prefix=$PREFIX