Network Model Simulations on Hyak

Samuel Jenness edited this page Jun 23, 2016 · 5 revisions

This page describes the methods for simulating stochastic network models on Hyak for the CAMP project. It assumes you have already read the Step 1, Step 2, and Step 3 tutorials.

This page is under revision. Some helpful notes are still included below, but contact Sam/Steve for help.

R Script

Example contents of your R simulation script.

library("methods")
suppressMessages(library("EpiModelHIVmsm"))
library("EpiModelHPC")

args <- commandArgs(trailingOnly = TRUE)
simno <- args[1]
jobno <- args[2]
fsimno <- paste(simno, jobno, sep = ".")
print(fsimno)

# Load the target network statistics; this file is expected to provide the object st used below
load("est/nwstats.rda")

param <- param_msm(nwstats = st)
init <- init_msm(nwstats = st)
control <- control_msm(simno = fsimno,
                       nsteps = 52 * 50,
                       nsims = 16,
                       ncores = 16,
                       save.int = 500,
                       save.network = TRUE,
                       save.other = c("attr", "temp"))

netsim_hpc("est/fit.rda", param, init, control,
           save.min = TRUE, save.max = FALSE, compress = "xz")

Shell Scripts

Example contents of your bash shell script, runsim.sh:

#!/bin/bash

### User specs
#PBS -N sim$SIMNO
#PBS -l nodes=1:ppn=16,mem=44gb,feature=16core,walltime=05:00:00
#PBS -o /gscratch/csde/camp/out
#PBS -e /gscratch/csde/camp/out
#PBS -j oe
#PBS -d /gscratch/csde/camp
#PBS -m n

### Standard specs
HYAK_NPE=$(wc -l < $PBS_NODEFILE)
HYAK_NNODES=$(uniq $PBS_NODEFILE | wc -l )
HYAK_TPN=$((HYAK_NPE/HYAK_NNODES))
NODEMEM=`grep MemTotal /proc/meminfo | awk '{print $2}'`
NODEFREE=$((NODEMEM-2097152))
MEMPERTASK=$((NODEFREE/HYAK_TPN))
ulimit -v $MEMPERTASK
export MX_RCACHE=0

### Modules
module load r_3.2.4

### App
ALLARGS="${SIMNO} ${PBS_ARRAYID}"
echo runsim variables: $ALLARGS
echo

### Run
Rscript sim.R ${ALLARGS}

The #PBS lines set the Hyak job submission options, while the Standard specs lines address a specific memory-utilization issue by capping the virtual memory available to each task. The module load line should load the version of R that we currently use for building packages. The final line runs the R script in batch mode, passing the SIMNO and PBS_ARRAYID variables to it as command-line arguments.
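The memory arithmetic in the Standard specs lines can be checked by hand. The sketch below repeats the same calculation in a function so that it can be run away from the cluster. The function name mem_per_task and the 64 GB node size are illustrative assumptions, not Hyak values.

```shell
#!/bin/bash
# Hypothetical helper: same arithmetic as the Standard specs lines in runsim.sh.
# $1 = total node memory in kB (MemTotal from /proc/meminfo)
# $2 = tasks per node (HYAK_TPN)
mem_per_task() {
  local node_kb=$1 tasks=$2
  local held_back_kb=2097152   # 2 GB (2 * 1024 * 1024 kB) subtracted before dividing
  echo $(( (node_kb - held_back_kb) / tasks ))
}

# Assumed example: a 64 GB node (67108864 kB) running 16 tasks.
mem_per_task 67108864 16
```

The printed value is in kB, which is the unit that ulimit -v expects in bash.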

Transferring Scripts from CSDE to Hyak

Once the R and shell scripts are written, they can be transferred to Hyak for running. The Linux command scp (for secure copy) can be used to move data between Hyak and CSDE. To get started, log on to Hyak and make sure you are on a login node to transfer data.

In Step 1 we described how to set up an scp alias that automates this. Typing campscr on Hyak copies every file whose name contains .R or .s (in practice, our .R and .sh scripts) from our shared CSDE folder into our shared Hyak folder.

alias campscr='scp libra:/net/proj/camp/rdiff/*.[Rs]* /gscratch/csde/camp/'
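To see which files the *.[Rs]* pattern in this alias selects, the following sketch tests a few names against the same pattern locally. The helper name matches_campscr and the file names are made up for illustration; on Hyak the pattern is expanded on the remote host by scp, not by this function.

```shell
#!/bin/bash
# Hypothetical helper: test a file name against the same pattern as campscr.
matches_campscr() {
  case "$1" in
    *.[Rs]*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Illustrative file names only; none are taken from the project folder.
for f in sim.R runsim.sh master.sh notes.txt fit.rda sim.Rout; do
  if matches_campscr "$f"; then
    echo "copied:  $f"
  else
    echo "skipped: $f"
  fi
done
```

Note that the pattern is case sensitive (fit.rda is skipped) and also picks up names such as sim.Rout, so it is slightly broader than "only .R and .sh files".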

Running Scripts on Hyak

Running simulation jobs on Hyak consists of submitting the job and monitoring it for errors. The job submission software on Hyak automates many of these processes.

Job Submission

Once the R script and shell script are transferred to Hyak, they may be run with the qsub command. This command has many options, but the following is the minimum needed given our file structure and workflow.

qsub -t 1 -v SIMNO=1 runsim.sh

The -t parameter invokes an array job, which in this case has just one subjob. We use array subjobs to run the same parameter set multiple times; this is an alternative to running one big job with many simulations, and it performs better on Hyak when we use the backfill queue. The -v parameter passes environment variables into the runsim.sh shell script. In our case, SIMNO is the primary simulation ID number; as written above, runsim.sh passes it, together with the array ID, to sim.R as command-line arguments. Combined with the array ID, the unique ID of this simulation job would be 1.1, which is used in the name of the saved output file.
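As a rough illustration of how the two numbers combine, the sketch below builds the same SIMNO.ARRAYID string that the R script creates with paste(). The helper name full_sim_id is hypothetical, and the variables are set by hand here; on Hyak they would come from qsub and the scheduler.

```shell
#!/bin/bash
# Hypothetical helper mirroring paste(simno, jobno, sep = ".") in the R script.
full_sim_id() {
  echo "${1}.${2}"
}

SIMNO=2                        # would come from: qsub -v SIMNO=2
for PBS_ARRAYID in 1 2 3; do   # would be set by the scheduler for: -t 1-3
  echo "subjob ID: $(full_sim_id "$SIMNO" "$PBS_ARRAYID")"
done
```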

To submit a series of 7 subjobs for simulation 2, and to submit that on the backfill queue, one would use the following additional arguments. Note that the runsim.sh file does not need to be changed. The default -q argument is batch, which submits jobs only to CSDE nodes.

qsub -q bf -t 1-7 -v SIMNO=2 runsim.sh

Any of the job specifications in the runsim.sh file may be overridden by setting them on the command line. For example, one can change computing specs such as the walltime (the total time allotted for the job); using a shorter walltime helps expedite job submissions on the backfill queue.

qsub -q bf -t 1-7 -v SIMNO=2 -l walltime=04:00:00 runsim.sh

Note that any of these qsub arguments may be placed together in their own bash shell script for easier execution and historical record. For example, this might be the contents of a file called master.sh:

#!/bin/bash

qsub -t 1-7 -v SIMNO=1 runsim.sh
qsub -q bf -t 1-7 -v SIMNO=2 runsim.sh

To execute that shell file, and thereby run the two qsub commands, use the bash command:

bash master.sh
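Before submitting a long master.sh, it can help to print the commands first. The following is a minimal dry-run sketch: the helper qsub_cmd is hypothetical and only echoes the qsub lines used on this page, so nothing is submitted.

```shell
#!/bin/bash
# Hypothetical dry-run helper: prints a qsub command instead of submitting it.
# $1 = SIMNO, $2 = number of array subjobs, $3 = queue (optional)
qsub_cmd() {
  local simno=$1 nsub=$2 queue=$3
  local qflag=""
  if [ -n "$queue" ]; then
    qflag="-q ${queue} "
  fi
  echo "qsub ${qflag}-t 1-${nsub} -v SIMNO=${simno} runsim.sh"
}

qsub_cmd 1 7       # default queue
qsub_cmd 2 7 bf    # backfill queue
```

Once the printed lines look right, they can be pasted into master.sh as they are.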

Monitoring Jobs

Once a job is submitted, it will show up in the job queue. Once a node is free to run it, it will move from an eligible job to an active job. It will remain active until it completes, fails with an error, or is terminated for some other reason.

To check on the status of any jobs, use the showq -u <username> command. One example of its output:

JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

2882468[7]         sjenness    Running    16     6:18:51  Mon May 11 17:08:50
2882468[1]         sjenness    Running    16     6:18:51  Mon May 11 17:08:50
2882468[6]         sjenness    Running    16     6:19:36  Mon May 11 17:09:35
2882468[2]         sjenness    Running    16     6:19:36  Mon May 11 17:09:35
2882468[4]         sjenness    Running    16     6:19:36  Mon May 11 17:09:35
2882468[5]         sjenness    Running    16     6:19:36  Mon May 11 17:09:35
2882468[3]         sjenness    Running    16     6:19:36  Mon May 11 17:09:35

This shows an example of 7 subjobs for one qsub submission that I started. Additional information on where these jobs are running is available if you use the -r parameter along with showq.
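When many subjobs are listed, a quick count per state can be easier to read than the full listing. The sketch below does this with awk on three rows copied from the output above. The helper name count_states is hypothetical, and the column positions are assumed from that sample only.

```shell
#!/bin/bash
# Three rows copied from the showq listing above; on Hyak this text would be
# the output of: showq -u <username>
sample='2882468[7]         sjenness    Running    16     6:18:51  Mon May 11 17:08:50
2882468[1]         sjenness    Running    16     6:18:51  Mon May 11 17:08:50
2882468[6]         sjenness    Running    16     6:19:36  Mon May 11 17:09:35'

# Hypothetical helper: count job rows per STATE (third column) read from stdin.
# Only rows whose first field starts with a digit are treated as job rows.
count_states() {
  awk '$1 ~ /^[0-9]/ { n[$3]++ } END { for (s in n) print s, n[s] }'
}

printf '%s\n' "$sample" | count_states
```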

You could create a shorthand for using showq that displays extra information on just your jobs with:

alias myq='showq -u sjenness -r'

Another potentially helpful command is checkjob. Use this to get complete information on a particular job process by entering its job ID, as in: checkjob 2882468[1].

To check on the overall usage of CSDE's nodes (for example, to decide whether to submit a job to the backfill queue), use nodestate csde, which outputs information on the job and job class (see below):

n0607 = Busy : 2948479[4] -> batch : sjenness
n0608 = Busy : 2948479[5] -> batch : sjenness
n0784 = Busy : 2948479[2] -> batch : sjenness
n0785 = Busy : 2948479[3] -> batch : sjenness
n0786 = Busy : 2948479[7] -> batch : sjenness
n0787 = Busy : 2948479[6] -> batch : sjenness
n0788 = Busy : 2948479[1] -> batch : sjenness
n0789 = Busy : 2948480[6] -> batch : sjenness
n0790 = Busy : 2948480[1] -> batch : sjenness
n0791 = Busy : 2948480[3] -> batch : sjenness
n0792 = Busy : 2948480[4] -> batch : sjenness
n0793 = Busy : 2948480[5] -> batch : sjenness
n0794 = Busy : 2948480[2] -> batch : sjenness
n0795 = Busy : 2948480[7] -> batch : sjenness
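The same kind of summary can be made from this listing, for instance to see how many nodes each master job occupies. The sketch below uses five lines copied from the output above. The helper name nodes_per_job is hypothetical, it only handles lines in the Busy format shown here, and other node states are ignored because their format is not shown on this page.

```shell
#!/bin/bash
# Five lines copied from the nodestate listing above.
sample='n0607 = Busy : 2948479[4] -> batch : sjenness
n0608 = Busy : 2948479[5] -> batch : sjenness
n0789 = Busy : 2948480[6] -> batch : sjenness
n0790 = Busy : 2948480[1] -> batch : sjenness
n0791 = Busy : 2948480[3] -> batch : sjenness'

# Hypothetical helper: count Busy nodes per master job ID read from stdin.
# Field 5 holds the subjob ID; the "[index]" suffix is stripped before counting.
nodes_per_job() {
  awk '$3 == "Busy" { id = $5; sub(/\[.*/, "", id); n[id]++ }
       END { for (id in n) print id, n[id] }' | sort
}

printf '%s\n' "$sample" | nodes_per_job
```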

Finally, it is important to know how to manually cancel a job that has been started. To do so, use the following command:

mjobctl -c <JOBID>

Changing Job Class

If backfill jobs are taking a long time to get started, it is always possible to change the status of previously submitted backfill jobs to run in non-backfill mode, which we will call batch mode, on CSDE's nodes. This involves changing three attributes of the job as follows:

mjobctl -m class=batch -m account=csde -m qos=csde <JOBID>

The JOBID should correspond to that seen when running showq.

It may be helpful to create the following alias:

alias tobatch='mjobctl -m class=batch -m account=csde -m qos=csde'

To use this alias, one must type tobatch <JOBID> in the shell. If you are using array jobs, the JOBID will look like this: 294882[1] or 294882[2] and so on for the number of array jobs tied to that master job ID. To change the jobs to batch mode for all array jobs of a particular master job ID, use the following syntax:

tobatch x:294882.*

where 294882 is replaced with your own master job ID. This wildcard syntax also works for other calls to mjobctl, such as canceling jobs.
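If you have a subjob ID copied from showq, the wildcard expression can be derived from it rather than typed by hand. The following is a small sketch of that string manipulation; the helper name array_wildcard is hypothetical, and it only prints the expression without calling mjobctl.

```shell
#!/bin/bash
# Hypothetical helper: turn one array subjob ID into the wildcard expression
# used above, e.g. 294882[1] -> x:294882.*
array_wildcard() {
  local master=${1%%\[*}   # drop everything from the first "[" onward
  echo "x:${master}.*"
}

array_wildcard '294882[1]'
```

The printed expression could then be passed to the tobatch alias or to another mjobctl call.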

To change a job from batch class to backfill class, use a similar alias:

alias tobf='mjobctl -m class=bf -m account=csde-bf -m qos=csde-bf'

Transferring Data from Hyak to CSDE

By default, our R script will save a series of R data files within the data subfolder of our shared drive on Hyak. For data analysis purposes, we are interested in transferring the "minimum" files, which are smaller and quicker to load. In Step 1 we set up an alias, campdat, to transfer those data files from Hyak back to CSDE for interactive analysis. The alias definition below shows which files are transferred and where they end up.

alias campdat='scp /gscratch/csde/camp/data/*.min.rda libra:/net/proj/camp/rdiff/data'
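To preview which files the *.min.rda pattern would pick up before running the transfer, a listing like the following sketch can be used. The helper name list_min_files and the file names are made up for illustration; the actual names written by the R script may differ.

```shell
#!/bin/bash
# Hypothetical helper: list the base names in directory $1 that match *.min.rda.
list_min_files() {
  local f
  for f in "$1"/*.min.rda; do
    if [ -e "$f" ]; then
      basename "$f"
    fi
  done
}

# Illustrative only: a scratch directory with made-up file names.
tmpdir=$(mktemp -d)
touch "$tmpdir/sim.n1.1.rda" "$tmpdir/sim.n1.1.min.rda" "$tmpdir/sim.n1.2.min.rda"
list_min_files "$tmpdir"
rm -rf "$tmpdir"
```

On Hyak, the same function could be pointed at /gscratch/csde/camp/data to check what campdat would transfer.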