Network Model Simulations on Hyak
This page describes the methods for simulating stochastic network models on Hyak for the CAMP project. It presumes that you have read the Step 1, Step 2, and Step 3 tutorials.
This page is under revision. Some helpful notes are still included below, but contact Sam/Steve for help.
Example contents of your R simulation script.
library("methods")
suppressMessages(library("EpiModelHIVmsm"))
library("EpiModelHPC")
args <- commandArgs(trailingOnly = TRUE)
simno <- args[1]
jobno <- args[2]
fsimno <- paste(simno, jobno, sep = ".")
print(fsimno)
load("est/nwstats.rda")
param <- param_msm(nwstats = st)
init <- init_msm(nwstats = st)
control <- control_msm(simno = fsimno,
                       nsteps = 52 * 50,
                       nsims = 16,
                       ncores = 16,
                       save.int = 500,
                       save.network = TRUE,
                       save.other = c("attr", "temp"))
netsim_hpc("est/fit.rda", param, init, control,
           save.min = TRUE, save.max = FALSE, compress = "xz")
Example contents of your bash shell script, runsim.sh:
#!/bin/bash
### User specs
#PBS -N sim$SIMNO
#PBS -l nodes=1:ppn=16,mem=44gb,feature=16core,walltime=05:00:00
#PBS -o /gscratch/csde/camp/out
#PBS -e /gscratch/csde/camp/out
#PBS -j oe
#PBS -d /gscratch/csde/camp
#PBS -m n
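### Standard specs
# Tasks per node, derived from the node file that the scheduler provides
HYAK_NPE=$(wc -l < "$PBS_NODEFILE")
HYAK_NNODES=$(uniq "$PBS_NODEFILE" | wc -l)
HYAK_TPN=$((HYAK_NPE/HYAK_NNODES))
# Node memory in kB, less 2097152 kB (2 GB) held back, split across tasks
NODEMEM=$(grep MemTotal /proc/meminfo | awk '{print $2}')
NODEFREE=$((NODEMEM-2097152))
MEMPERTASK=$((NODEFREE/HYAK_TPN))
# Cap the virtual memory of each task at its share
ulimit -v "$MEMPERTASK"
export MX_RCACHE=0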
### Modules
module load r_3.2.4
### Arguments
ALLARGS="${SIMNO} ${PBS_ARRAYID}"
echo runsim variables: $ALLARGS
echo
### App
Rscript sim.R ${ALLARGS}
The #PBS lines set the Hyak job submission options, while the Standard specs lines handle a specific memory utilization issue by capping the virtual memory available to each task. The Modules section should load the version of R that we are currently using for building packages. The final line runs the R script in batch mode, passing the SIMNO and PBS_ARRAYID variables into the R script as command-line arguments.
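To make the memory arithmetic in the Standard specs lines concrete, here is a minimal sketch that repeats the same calculation with fixed inputs. The helper name mem_per_task and the example node size (64 GB with 16 tasks) are assumptions for illustration only; they are not part of runsim.sh.

```shell
#!/bin/bash
# Sketch of the "Standard specs" arithmetic with example inputs.
# mem_per_task is a hypothetical helper, not part of runsim.sh.
mem_per_task() {
  local node_mem_kb=$1      # MemTotal from /proc/meminfo, in kB
  local tasks_per_node=$2   # HYAK_TPN in runsim.sh
  local reserve_kb=2097152  # 2 GB held back, as in runsim.sh
  echo $(( (node_mem_kb - reserve_kb) / tasks_per_node ))
}

# Assumed example: a node reporting 67108864 kB (64 GB) running 16 tasks
mem_per_task 67108864 16   # prints 4063232
```

Under these assumed inputs, each task would be limited to 4063232 kB, a little under 4 GB.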
Once the R and shell scripts are written, they can be transferred to Hyak for running. The Linux command scp (for secure copy) can be used to move data between Hyak and CSDE. To get started, log on to Hyak and make sure you are on a login node to transfer data.
In Step 1 we described how to set up an scp alias that would automate this. Typing campscr on Hyak will therefore copy all the files ending with .R or .sh (more precisely, any file whose extension begins with R or s) from our shared CSDE folder into our shared Hyak folder.
alias campscr='scp libra:/net/proj/camp/rdiff/*.[Rs]* /gscratch/csde/camp/'
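To see which file names the *.[Rs]* pattern in the alias picks up, the glob can be tried locally on a scratch directory. This is only a sketch: the helper name list_campscr_matches and the file names are made up for illustration, and nothing is copied over the network.

```shell
#!/bin/bash
# Sketch: list which file names the *.[Rs]* pattern from campscr would match.
# The helper and the file names below are hypothetical.
list_campscr_matches() {
  local dir
  dir=$(mktemp -d) || return 1
  touch "$dir/sim.R" "$dir/runsim.sh" "$dir/master.sh" \
        "$dir/fit.rda" "$dir/notes.txt" "$dir/report.Rmd"
  # Same glob as in the alias, expanded locally; prints names on one line
  ( cd "$dir" && echo *.[Rs]* )
  rm -rf "$dir"
}

list_campscr_matches
```

The .R and .sh files match, and so does report.Rmd, while fit.rda (lowercase r) and notes.txt do not.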
Running simulation jobs on Hyak consists of submitting the job and monitoring it for errors. The job submission software on Hyak automates many of these processes.
Once the R script and shell script are transferred to Hyak, they may be run with the qsub command. There are many options to this command, but the following are the minimum necessary given our file structure and workflow.
qsub -t 1 -v SIMNO=1 runsim.sh
The -t parameter invokes an array job, which in this case has just one subjob. We will use array subjobs to run the same parameter set multiple times (this is an alternative to running one big job with lots of simulations, and it performs better on Hyak when we use the backfill queue). The -v parameter passes environment variables down into the runsim.sh shell script. In our case, SIMNO will be used as the primary simulation ID number; passing SIMNO=X means that runsim.sh hands X to the R script as its first argument (the example runsim.sh above runs sim.R; a version that calls sim${SIMNO}.R would instead execute a script called simX.R). In combination with the array ID, the unique ID of this simulation job would be 1.1. This ID will be used in the name of the saved output file.
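The combined ID is built in sim.R with paste(simno, jobno, sep = "."). The sketch below mirrors that step in bash to show which IDs a given submission would produce; full_sim_id is a hypothetical helper used only for illustration.

```shell
#!/bin/bash
# Sketch: build the combined simulation ID the same way sim.R does with
# paste(simno, jobno, sep = "."). full_sim_id is a hypothetical helper.
full_sim_id() {
  local simno=$1   # value passed with qsub -v SIMNO=...
  local jobno=$2   # array index, PBS_ARRAYID inside the job
  printf '%s.%s\n' "$simno" "$jobno"
}

# IDs for a 7-subjob array (-t 1-7) of simulation 2: 2.1, 2.2, ..., 2.7
for jobno in 1 2 3 4 5 6 7; do
  full_sim_id 2 "$jobno"
done
```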
To submit a series of 7 subjobs for simulation 2 on the backfill queue, one would use the following additional arguments. Note that the runsim.sh file does not need to be changed. The default -q argument is batch, which submits jobs only to CSDE nodes.
qsub -q bf -t 1-7 -v SIMNO=2 runsim.sh
Any of the job specifications in the runsim.sh file may be overridden by setting them on the command line. For example, computing specs like the walltime (total allotted time for the job) can be changed; using a shorter time helps expedite job submissions on the backfill queue.
qsub -q bf -t 1-7 -v SIMNO=2 -l walltime=04:00:00 runsim.sh
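As a sketch of how the pieces of these command lines fit together, the function below assembles a qsub command from its parts and prints it instead of submitting it. The name build_qsub and its argument order are assumptions made for this example; only the qsub flags already shown on this page are used.

```shell
#!/bin/bash
# Sketch: assemble a qsub command line from its parts and print it rather
# than submit it. build_qsub is a hypothetical helper for illustration.
build_qsub() {
  local simno=$1 range=$2 queue=$3 walltime=$4
  local cmd="qsub"
  # -q is omitted to get the default queue, as in the first example above
  if [ -n "$queue" ]; then cmd="$cmd -q $queue"; fi
  cmd="$cmd -t $range -v SIMNO=$simno"
  # -l walltime overrides the value set in the #PBS lines of runsim.sh
  if [ -n "$walltime" ]; then cmd="$cmd -l walltime=$walltime"; fi
  echo "$cmd runsim.sh"
}

build_qsub 1 1                 # qsub -t 1 -v SIMNO=1 runsim.sh
build_qsub 2 1-7 bf 04:00:00   # qsub -q bf -t 1-7 -v SIMNO=2 -l walltime=04:00:00 runsim.sh
```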
Note that any of these qsub commands may be placed together in their own bash shell script for easier execution and as a historical record. For example, this might be the contents of a file called master.sh:
#!/bin/bash
qsub -t 1-7 -v SIMNO=1 runsim.sh
qsub -q bf -t 1-7 -v SIMNO=2 runsim.sh
To execute that shell file, and thereby run the two qsub commands, use the bash command:
bash master.sh
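When a master.sh grows to many simulation numbers, a loop with a dry-run switch can help check the commands before anything is submitted. The following is one possible sketch, not our standard practice: the DRY_RUN variable and the submit and submit_sims helpers are hypothetical names introduced here.

```shell
#!/bin/bash
# Sketch: a master.sh variant that loops over simulation numbers.
# With DRY_RUN=1 (the default here) it only prints the commands.
# DRY_RUN, submit, and submit_sims are hypothetical names.
DRY_RUN=${DRY_RUN:-1}

submit() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "qsub $*"
  else
    qsub "$@"
  fi
}

submit_sims() {
  local simno
  for simno in "$@"; do
    submit -q bf -t 1-7 -v "SIMNO=$simno" runsim.sh
  done
}

# Prints three qsub commands, one per simulation number
submit_sims 1 2 3
```

Setting DRY_RUN=0 before running the script would call qsub for real, so the printed commands are worth reviewing first.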
Once a job is submitted, it will show up in the job queue. Once a node is free to run it, it will move from an eligible job to an active job. It will stay an active job until it completes, errors, or is terminated for some reason.
To check on the status of any jobs, use the showq -u <username> command. One example of its output:
JOBID       USERNAME  STATE    PROCS  REMAINING  STARTTIME
2882468[7]  sjenness  Running  16     6:18:51    Mon May 11 17:08:50
2882468[1]  sjenness  Running  16     6:18:51    Mon May 11 17:08:50
2882468[6]  sjenness  Running  16     6:19:36    Mon May 11 17:09:35
2882468[2]  sjenness  Running  16     6:19:36    Mon May 11 17:09:35
2882468[4]  sjenness  Running  16     6:19:36    Mon May 11 17:09:35
2882468[5]  sjenness  Running  16     6:19:36    Mon May 11 17:09:35
2882468[3]  sjenness  Running  16     6:19:36    Mon May 11 17:09:35
This shows an example of 7 subjobs for one qsub submission that I started. Additional information on where these jobs are running is available if you use the -r parameter along with showq.
You could create a shorthand for using showq that displays extra information on just your jobs with:
alias myq='showq -u sjenness -r'
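For a quick summary of many subjobs, showq-style output can be tallied by state. The sketch below assumes the column layout of the example output above (a single header line, STATE in the third column); count_states is a hypothetical helper, and real showq output may include extra header or summary lines that would need to be filtered out first.

```shell
#!/bin/bash
# Sketch: tally jobs by STATE from showq-style output read on stdin.
# count_states is a hypothetical helper; the layout (one header line,
# STATE in the third column) is assumed from the example output above.
count_states() {
  awk 'NR > 1 && NF >= 3 { n[$3]++ } END { for (s in n) print s, n[s] }'
}

# Sample rows copied from the example output above; prints "Running 3"
count_states <<'EOF'
JOBID USERNAME STATE PROCS REMAINING STARTTIME
2882468[7] sjenness Running 16 6:18:51 Mon May 11 17:08:50
2882468[1] sjenness Running 16 6:18:51 Mon May 11 17:08:50
2882468[6] sjenness Running 16 6:19:36 Mon May 11 17:09:35
EOF
```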
Another potentially helpful command is checkjob. Use this to get complete information on a particular job process by entering its job ID, as in: checkjob 2882468[1].
To check on the overall usage of CSDE's nodes (for example, to determine whether one should submit something as a backfill job), use nodestate csde, which outputs information on the job and job class (see below):
n0607 = Busy : 2948479[4] -> batch : sjenness
n0608 = Busy : 2948479[5] -> batch : sjenness
n0784 = Busy : 2948479[2] -> batch : sjenness
n0785 = Busy : 2948479[3] -> batch : sjenness
n0786 = Busy : 2948479[7] -> batch : sjenness
n0787 = Busy : 2948479[6] -> batch : sjenness
n0788 = Busy : 2948479[1] -> batch : sjenness
n0789 = Busy : 2948480[6] -> batch : sjenness
n0790 = Busy : 2948480[1] -> batch : sjenness
n0791 = Busy : 2948480[3] -> batch : sjenness
n0792 = Busy : 2948480[4] -> batch : sjenness
n0793 = Busy : 2948480[5] -> batch : sjenness
n0794 = Busy : 2948480[2] -> batch : sjenness
n0795 = Busy : 2948480[7] -> batch : sjenness
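To summarize output like this, the busy nodes can be counted per master job ID. This is a sketch that assumes the line layout shown above (state in the third field, job ID in the fifth); busy_nodes_per_job is a hypothetical helper, and the sample rows are copied from the listing above.

```shell
#!/bin/bash
# Sketch: count busy nodes per master job ID from nodestate-style output.
# busy_nodes_per_job is a hypothetical helper; the field positions are
# assumed from the example listing above.
busy_nodes_per_job() {
  awk '$3 == "Busy" { id = $5; sub(/\[.*$/, "", id); n[id]++ }
       END { for (id in n) print id, n[id] }' | sort
}

# Sample rows copied from the listing above
busy_nodes_per_job <<'EOF'
n0607 = Busy : 2948479[4] -> batch : sjenness
n0608 = Busy : 2948479[5] -> batch : sjenness
n0789 = Busy : 2948480[6] -> batch : sjenness
EOF
```

For these three sample rows, the function reports two busy nodes for job 2948479 and one for job 2948480.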
Finally, it is important to know how to manually cancel a job that has been started. To do so, use the following command:
mjobctl -c <JOBID>
If backfill jobs are taking a long time to get started, it is always possible to change the status of previously submitted backfill jobs to run in non-backfill mode, which we will call batch mode, on CSDE's nodes. This involves changing three attributes of the job as follows:
mjobctl -m class=batch -m account=csde -m qos=csde <JOBID>
The JOBID should correspond to that seen when running showq.
It may be helpful to create the following alias:
alias tobatch='mjobctl -m class=batch -m account=csde -m qos=csde'
To use this alias, one must type tobatch <JOBID> in the shell. If you are using array jobs, the JOBID will look like this: 294882[1] or 294882[2] and so on for the number of array jobs tied to that master job ID. To change the jobs to batch mode for all array jobs of a particular master job ID, use the following syntax:
tobatch x:294882.*
where 294882 would be replaced with the specific master job ID. Note that this wildcard syntax also works for other calls to mjobctl, such as canceling jobs.
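The wildcard expression can be derived from any one subjob ID by stripping the bracketed index. A minimal sketch follows; master_pattern is a hypothetical helper that only builds the string and does not call mjobctl.

```shell
#!/bin/bash
# Sketch: derive the wildcard expression for all subjobs of an array job
# from any one of its subjob IDs. master_pattern is a hypothetical helper.
master_pattern() {
  local jobid=$1
  # Strip the "[index]" suffix, e.g. 294882[1] -> 294882
  local master=${jobid%%\[*}
  echo "x:${master}.*"
}

master_pattern '294882[1]'   # prints x:294882.*
```

The printed string can then be used in place of <JOBID> in the mjobctl commands above.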
To change a job from batch class to backfill class, use a similar alias:
alias tobf='mjobctl -m class=bf -m account=csde-bf -m qos=csde-bf'
By default, our R script will save a series of R data files within the data subfolder of our shared drive on Hyak. For data analysis purposes, we are interested in transferring the "minimum" files, which are smaller and quicker to load. In Step 1 we set up an alias, campdat, to transfer those data files from Hyak back to CSDE for interactive analysis. The alias definition below shows which files are transferred and where they end up.
alias campdat='scp /gscratch/csde/camp/data/*.min.rda libra:/net/proj/camp/rdiff/data'