Running HJCFIT on ARCHER
There's good documentation about ARCHER on their website, but here's an extract of what has been useful to know so far.
####Filesystems
Once logged in to ARCHER, there are two different filesystems we need to be aware of:
- /home: This is the place where we land when logging in to ARCHER. It is only visible from the login nodes and the MOM nodes (the ones that can run aprun and send tasks to the compute nodes). You should use this space for storing job files (PBS files), outputs and the like.
- /work: This is visible from everywhere: login, MOM and compute nodes. You should use this for storing executable files that will be run by the compute nodes, and any other files the compute nodes need access to when running those executables.
Note that we have a shared folder for the RSDG and DCProgs teams under /work/ecse0506/ecse0506/shared where we can place test data sets, etc.
####.bashrc
To make your life easier, it is a good idea to add two lines to your .bashrc that define variables for moving around the filesystem:
export WORK=/work/ecse0506/ecse0506/$USER
export SHARED=/work/ecse0506/ecse0506/shared
You can also configure aliases, like:
alias qu="qstat -u $USER"
You can also make the system put you on the $WORK filesystem from the start, load a virtual environment, or source a script that loads all the modules needed to work on HJCFIT, so you don't have to do it every time you log in to ARCHER. See a sample .bashrc here.
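The pieces described in this section could be combined in a fragment like the one below. It is only a sketch, not the project's sample file: the exports and the alias come from the text above, while the load_modules.sh file name is a hypothetical placeholder for whatever script loads your modules. The guards keep the fragment harmless on machines where the directories do not exist.

```shell
# Sketch of a ~/.bashrc fragment; paths are the ones quoted above.
export WORK=/work/ecse0506/ecse0506/$USER
export SHARED=/work/ecse0506/ecse0506/shared

# Short alias for listing your own jobs.
alias qu="qstat -u $USER"

# Start in $WORK when it exists (it will not exist outside ARCHER).
if [ -d "$WORK" ]; then
    cd "$WORK"
fi

# Hypothetical script that loads the modules HJCFIT needs;
# the file name is an assumption, adjust it to your own setup.
if [ -f "$WORK/load_modules.sh" ]; then
    source "$WORK/load_modules.sh"
fi
```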
####Python Virtual Environment
To work with Python on ARCHER, we are using a virtual environment, which is the strategy recommended by ARCHER.
To create it, you can run this script that will install all the necessary packages to run HJCFIT in a virtual environment called dcprogs.
You will also need to install any extra packages or projects you need. For example, to work with DCPYPS, install it straight from its repository:
cd $WORK
pip install git+https://github.com/DCPROGS/DCPYPS.git
Once the virtual environment is ready, you can activate or deactivate it with:
source activate dcprogs
source deactivate dcprogs
Since we all have SAFE accounts configured for the project, we just need to do:
To get the code, build it and test it on the login node, do as usual for Unix systems:
git clone -b develop https://github.com/DCPROGS/HJCFIT.git
cd HJCFIT
mkdir build
cd build
cmake ..
make install
make test
Note that for this to work, your .bashrc should have loaded all the necessary modules. See the "Environment" section.
####Job files
Job scripts are written in a PBS file and follow a specific structure. Here's a sample job script that runs a hello world bash script:
#!/bin/bash --login
#PBS -N hello_archer
#PBS -l select=1
#PBS -l walltime=0:0:30
#PBS -A $BUDGET
# This shifts to the $WORK directory
cd $WORK
aprun -n 24 ./scripts/hello_archer.sh
Job files need a few parameters to be set in the header of the PBS script:
- -N <string>: specifies the job name
- -l select=<number>: number of nodes needed
- -l walltime=<hours>:<minutes>:<seconds>: time requested for the job
- -A <project_code>: budget code from which the used time will be subtracted
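The body of the job script changes (cd) to the $WORK folder, where we have our executables, and then uses aprun to execute the script in parallel as 24 processes (the -n option sets the number of processes, not nodes; here they all run on the single node requested with select=1).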
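The header and body structure described above can be sketched as a small helper that writes a job file from its parameters. The function name write_job and the budget code xyz are hypothetical and only serve the illustration; the directives are the ones from the sample script.

```shell
# Write a minimal PBS job file called <name>.pbs in the current directory.
# Usage: write_job <name> <nodes> <walltime> <budget> <command...>
write_job() {
    local name=$1 nodes=$2 walltime=$3 budget=$4
    shift 4
    cat > "$name.pbs" <<EOF
#!/bin/bash --login
#PBS -N $name
#PBS -l select=$nodes
#PBS -l walltime=$walltime
#PBS -A $budget

# Move to the work filesystem, which the compute nodes can see
cd \$WORK
$*
EOF
}

# "xyz" is a placeholder budget code, not a real project code.
write_job hello_archer 1 0:0:30 xyz aprun -n 24 ./scripts/hello_archer.sh
```

The resulting hello_archer.pbs has the same layout as the sample job script, with the budget code written out literally in the -A line.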
To submit a job in the queue, you can do this:
qsub myjobfile.pbs
This will submit it to the general queue, and you can check its status with
qstat -u $USER
Or
checkQueue
To delete a job you have submitted:
qdel <job_ID_seen_in_queue>
Use man qsub, etc., for more info.
Jobs in the standard queue sometimes take too long to start.
From 9am to 5pm, Monday to Friday, there is a short queue available to run interactive jobs. Once the job has launched, you will land on a MOM node, from which you can run aprun to trigger tasks on the compute nodes. This is very handy for short tests, for example when testing the project configuration. Jobs in this queue are restricted to 20 minutes of walltime and a maximum of 8 nodes. This is the command you need to run:
qsub -q short -IVl select=1,walltime=0:5:0 -A $BUDGET
- -q short: indicates we don't want to use the standard queue, but the short one.
- -I: indicates the job is interactive.
- -V: exports the environment variables of the submitting shell to the job.
- -l: followed by the resource list:
  - select=1: indicates one node will be used
  - walltime=0:5:0: indicates 5 minutes of time available for our job
- -A $BUDGET: -A followed by the project code indicates the budget that the time/resources used should be charged to.
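Since the short queue rejects anything beyond the limits quoted above (8 nodes, 20 minutes), a request can be checked before it is submitted. The sketch below only prints the qsub command rather than running it; the function names fits_short_queue and short_cmd are hypothetical, and minutes are assumed to be given as a whole number.

```shell
# Limits quoted in the text: at most 8 nodes and 20 minutes of walltime.
SHORT_MAX_NODES=8
SHORT_MAX_MINUTES=20

# Succeeds when the request fits the short queue.
# Usage: fits_short_queue <nodes> <minutes>
fits_short_queue() {
    [ "$1" -ge 1 ] && [ "$1" -le "$SHORT_MAX_NODES" ] &&
    [ "$2" -ge 1 ] && [ "$2" -le "$SHORT_MAX_MINUTES" ]
}

# Print (not run) the qsub command for an interactive short-queue job.
# Usage: short_cmd <nodes> <minutes>
short_cmd() {
    if fits_short_queue "$1" "$2"; then
        echo "qsub -q short -IVl select=$1,walltime=0:$2:0 -A \$BUDGET"
    else
        echo "request exceeds short queue limits" >&2
        return 1
    fi
}

short_cmd 1 5
```

Called with 1 node and 5 minutes, this prints the same command shown above; a request for 9 nodes or 30 minutes returns a non-zero status instead.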
In a short time, you'll be on a MOM node and sent to your $HOME folder. Remember to cd to $WORK again, otherwise you can't run aprun.
Once there, you can do things like running a likelihood test with 4 processes:
aprun -n 4 ./HJCFIT/build/likelihood/test_likelihood
More on aprun here.
See other kinds of ARCHER queues here.
####Copying files to and from ARCHER
You might need to copy files to/from ARCHER. This can be done via scp, for example. See ARCHER's documentation.
Note that if you are on a Windows machine and want to run scp from the command line, you can use chocolatey and install it like this:
choco install win32-openssh
####Virtualenvs on ARCHER
When running Anaconda Python in a virtual environment, you may see something like:
python: error while loading shared libraries: libpython3.5m.so.1.0: cannot open shared object file: No such file or directory
This happens because aprun copies python to the compute node.
It breaks because the rpath to libpython3.5m.so.1.0 is defined as $ORIGIN/../lib/ and the linker resolves $ORIGIN to the directory where the executable is located, which for the copied executable is no longer the environment's bin directory. You can prevent aprun from copying the executable by passing -b to it (see the aprun man page). Alternatively, you can set LD_LIBRARY_PATH to help python find the library.
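The LD_LIBRARY_PATH alternative can be sketched as follows. The helper name origin_lib_dir is hypothetical; it simply builds the directory that $ORIGIN/../lib would have pointed to for the original executable. The sketch assumes the python found on your PATH is the one from the active environment.

```shell
# Directory that $ORIGIN/../lib refers to for a given executable path.
origin_lib_dir() {
    printf '%s/../lib\n' "$(dirname "$1")"
}

# Assumption: the interpreter on PATH belongs to the active environment.
PYTHON_EXE=$(command -v python || command -v python3 || true)

if [ -n "$PYTHON_EXE" ]; then
    # Prepend the environment's lib directory, keeping any existing value.
    export LD_LIBRARY_PATH="$(origin_lib_dir "$PYTHON_EXE")${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
fi
```

Setting this in the job script (or in .bashrc, combined with -V) should let the copied executable find the library, although passing -b to aprun avoids the problem at its source.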
####Building mpi4py on ARCHER
Building mpi4py on ARCHER for use with the Cray MPI and Anaconda was a bit complicated.
The following worked for me:
export CC=cc
export CXX=CC
python setup.py build --mpi=mpich3
python setup.py install
where mpi.cfg contained:
# MPICH3 example
# --------------
[mpich3]
mpi_dir = /opt/cray/mpt/7.2.6/gni/mpich-gnu/51
mpicc = /opt/cray/craype/2.4.2/bin/cc
mpicxx = /opt/cray/craype/2.4.2/bin/CC
#include_dirs = %(mpi_dir)s/include
#libraries = mpich opa mpl rt pthread
#library_dirs = %(mpi_dir)s/lib
#runtime_library_dirs = %(library_dirs)s