Parallelize ci (#29)

* fixed assembly hashsum
* added a 'three tries' for nucleotide download
* parallelize dataset CI better
* chunks of 25
* fix addition oopsie
* add quotes for int
* fix 50 to 25 per chunk
* debug yaml int
* are dashes the new underscore?
* stash num per chunk in 'include'
* m
* back up to put num-per-chunk into matrix strategy
* back to underscore
* debugging this oddity
* commented more lines
* I found the syntax error for mathing
* exit with pass when zero samples in chunk
* what is github run number
* removed github run number
* can I bump up the parallel jobs to 20?
* what about to 50?
CDCgov · Jun 10, 2022 · ee93027
1 parent 3191269
Showing 1 changed file with 31 additions and 17 deletions.
diff --git a/.github/workflows/unit-testing.yml b/.github/workflows/unit-testing.yml
@@ -7,10 +7,10 @@ on: [push, create]
 jobs:
   build:
     runs-on: ubuntu-18.04
-    name: ${{ matrix.DATASET }}
+    name: ${{ matrix.DATASET }} (chunk${{ matrix.CHUNK }}, chunk size ${{ matrix.NUM_PER_CHUNK }})
     strategy:
       fail-fast: false
-      max-parallel: 3
+      max-parallel: 50
       matrix:
         DATASET:
           - datasets/sars-cov-2-voivoc.tsv
@@ -19,6 +19,10 @@ jobs:
           - datasets/sars-cov-2-coronahit-routine.tsv
           - datasets/sars-cov-2-SNF-A.tsv
           - datasets/sars-cov-2-failedQC.tsv
+        NUM_PER_CHUNK: 
+          - 25
+        # TODO is there a $SGE_TASK_ID equivalent instead of listing each chunk???
+        CHUNK: [25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400]
     steps:
       - name: Check out the repo
         uses: actions/checkout@v2
@@ -42,23 +46,33 @@ jobs:
       - name: unit testing - just env
         run:  |
           bats t/00_env.bats
-      - name: abbreviated unit testing with ${{ matrix.DATASET }}
-        if:   ${{ github.event_name != 'create' }}
+      - name: unit test chunk of ${{ matrix.DATASET }}
         run:  |
           export NCBI_API_KEY=${{ secrets.NCBI_API_KEY }}
           if [[ -z "$NCBI_API_KEY" ]]; then echo "NCBI_API_KEY not found in github secrets!"; fi;
-          # Get the header and just two samples for the abbreviated test
-          grep -B 999 -A 3 biosample_acc ${{ matrix.DATASET }} > ${{ matrix.DATASET }}.short
-          export DATASET=$(realpath ${{ matrix.DATASET }}).short
-          echo "Abbreviated dataset: $DATASET"
-          bats t/*
-      - name: full unit testing with ${{ matrix.DATASET }}
-        if:   ${{ github.event_name == 'create' }}
-        run:  |
-          export NCBI_API_KEY=${{ secrets.NCBI_API_KEY }}
-          if [[ -z "$NCBI_API_KEY" ]]; then echo "# NCBI_API_KEY not found in github secrets!"; fi;
-          echo "Full dataset: ${{ matrix.DATASET }}"
-          export DATASET=$(realpath ${{ matrix.DATASET }})
-          echo "DEBUG: allowing for error exit code in TAP"
+          
+          export DATASET=$(pwd -P)/${{ matrix.DATASET }}.${{ matrix.CHUNK }}.short
+          CHUNK=${{ matrix.CHUNK }}
+          NUM_PER_CHUNK=${{ matrix.NUM_PER_CHUNK }}
+
+          # Get the header of the dataset
+          grep -B 999 biosample_acc ${{ matrix.DATASET }} > $DATASET
+          # Get the samples of the dataset (everything past the header)
+          # and then get the number of lines dictated by CHUNK (e.g., 50, 100, 150,...)
+          #   with sed -n Xp
+          FIRST_LINE=$(($CHUNK - $NUM_PER_CHUNK + 1))
+          LAST_LINE=${{ matrix.CHUNK }}
+          grep -A 99999 biosample_acc ${{ matrix.DATASET }} | tail -n +2 | sed -n ${FIRST_LINE},${LAST_LINE}p >> $DATASET.body
+          cat $DATASET.body >> $DATASET
+
+          # If we have zero samples, just exit with pass
+          NUM_SAMPLES=$(wc -l < $DATASET.body)
+          if [[ $NUM_SAMPLES -lt 1 ]]; then
+            echo "Number of samples is zero; exiting with pass"
+            exit 0
+          fi
+
+          # Run the TAP compliant unit test which reads env variable $DATASET
+          echo "DATASET CHUNK $DATASET"
           bats t/*
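
Review note: the chunk arithmetic in the new step can be tried outside of CI. The sketch below is only an illustration under stated assumptions: `chunk_body` and `chunk_ends` are hypothetical helper names that do not exist in the workflow, and the demo file is made up (two metadata lines, one `biosample_acc` header line, five sample rows). `chunk_body` repeats the `grep | tail | sed` pipeline from the step, and `chunk_ends` shows one possible way to compute the CHUNK values from a sample count instead of listing them by hand.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in the workflow): print the sample rows of one chunk.
#   $1 = dataset path
#   $2 = CHUNK, the number of the last sample row in the chunk
#   $3 = NUM_PER_CHUNK
chunk_body() {
  local dataset=$1 chunk=$2 num_per_chunk=$3
  local first_line=$(( chunk - num_per_chunk + 1 ))
  # Take the biosample_acc header line and everything after it, drop the
  # header line itself, then keep only sample rows first_line..chunk.
  grep -A 99999 biosample_acc "$dataset" | tail -n +2 | sed -n "${first_line},${chunk}p"
}

# Hypothetical helper: list the CHUNK values needed to cover a number of samples.
#   $1 = number of samples, $2 = NUM_PER_CHUNK
chunk_ends() {
  local num_samples=$1 num_per_chunk=$2
  # No samples means no chunks.
  if [ "$num_samples" -lt 1 ]; then return 0; fi
  # Round the sample count up to the next multiple of num_per_chunk.
  local last=$(( (num_samples + num_per_chunk - 1) / num_per_chunk * num_per_chunk ))
  seq "$num_per_chunk" "$num_per_chunk" "$last"
}

# Made-up demo dataset: metadata lines, a header line, five sample rows.
demo=$(mktemp)
printf 'version\t1\nname\tdemo\nbiosample_acc\tstrain\n' > "$demo"
for i in 1 2 3 4 5; do
  printf 'SAMN%s\tstrain%s\n' "$i" "$i" >> "$demo"
done

chunk_body "$demo" 2 2          # sample rows 1-2
chunk_body "$demo" 6 2          # sample row 5 only (a short final chunk)
chunk_body "$demo" 8 2 | wc -l  # counts zero rows: the "exit with pass" case
chunk_ends 5 2                  # one CHUNK value per line
```

With a chunk size of 25 and the hardcoded list ending at 400, sample rows past 400 would never be selected by any chunk, so the list has to grow with the largest dataset. On the TODO: as far as I know, GitHub Actions exposes `strategy.job-index`, but it counts across the whole matrix (DATASET x CHUNK) rather than per dataset, so it is not a drop-in replacement for `$SGE_TASK_ID`. Generating the CHUNK list in an earlier job and passing it to the matrix with `fromJSON` may be closer to what the TODO asks for.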