This repo publishes the traces and steps to reproduce the experimental results in our paper “Dynamically Negotiating Capacity Between On-Demand and Batch Clusters”, to appear in SC18. The following instructions assume you run the experiments on Chameleon (www.chameleoncloud.org), but because our test environment is wrapped in Docker images, you should be able to run it on any infrastructure.
- Disk image: CC-Ubuntu16.04
- Baremetal instances: start 1 master instance and ceil(LCRC_WORKERS / 24) worker instances. E.g., to run 372 LCRC workers you need to start 1 master + 16 worker instances (see the sketch after this list).
- Clone the ‘master’ branch of the repository located at: https://github.com/Francis-Liu/LcrcOnDemand
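The worker-instance count is the ceiling of LCRC_WORKERS / 24; a minimal bash sketch of that calculation (372 is just the example value from above):
LCRC_WORKERS=372 # the number of LCRC workers you plan to emulate
echo $(( (LCRC_WORKERS + 23) / 24 )) # ceiling of LCRC_WORKERS / 24, prints 16 here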
Installation (on the master instance):
# upload your private key to ~/.ssh/id_rsa
chmod 400 ~/.ssh/id_rsa
vim ~/pssh_hosts.txt # add all worker instance IPs, one per line
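For reference, one way to create the hosts file without an editor; the IPs below are placeholders, substitute your own worker instance IPs:
cat > ~/pssh_hosts.txt <<'EOF'
10.140.81.101
10.140.81.102
10.140.81.103
EOF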
sudo apt-get update && sudo apt-get install -y pssh python-pip python-dev build-essential && sudo pip install --upgrade pip && sudo pip install spur
git clone git@github.com:Francis-Liu/LcrcOnDemand.git # download src
# install docker
cd LcrcOnDemand/docker
./install_docker.sh
<~/pssh_hosts.txt xargs -I % ssh-keyscan % >> ~/.ssh/known_hosts
parallel-rsync -v -h ~/pssh_hosts.txt install_docker.sh /home/cc/
parallel-ssh -h ~/pssh_hosts.txt -t 900 -i './install_docker.sh'
# run the following command to verify that docker is up and running on all worker instances
parallel-ssh -h ~/pssh_hosts.txt -t 900 -i 'docker ps'
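If you want a single number instead of scrolling through the output, a small sketch that relies on the [SUCCESS] marker parallel-ssh prints per host:
parallel-ssh -h ~/pssh_hosts.txt -t 900 -i 'docker ps' 2>&1 | grep -c SUCCESS # should equal the number of worker instances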
After installing Docker, log out of the master instance and log back in.
If you want to run the baseline experiment (batch-only cluster), run the following:
cd LcrcOnDemand/docker
# edit: docker-compose.yml
# substitute 372 with the number of compute nodes you want to launch in your cluster
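If you prefer to script the edit, a sketch that assumes the compute-node count appears in docker-compose.yml only as the literal 372 (check the file first):
sed -i 's/372/64/g' docker-compose.yml # example: launch a 64-node cluster instead of 372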
./launch-torque-only-cluster.sh
# run "docker service ps hpc_cluster_master" to locate at which server is the lcrc-master container running
# log on to that server, run "docker ps | grep master" to find the container id
Otherwise (batch + on-demand cluster), run the following:
# start a docker swarm
docker swarm init --advertise-addr <master ip> # use the non-floating IP
parallel-ssh -h ~/pssh_hosts.txt -i 'docker swarm join --token ...' # copy & paste the join command printed by docker swarm init
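Alternatively, a sketch that fetches the worker join token directly so nothing has to be pasted by hand (replace <master ip> with the same non-floating IP used above):
TOKEN=$(docker swarm join-token -q worker) # print only the worker join token
parallel-ssh -h ~/pssh_hosts.txt -i "docker swarm join --token $TOKEN <master ip>:2377" # 2377 is the default swarm management port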
docker node ls # check all workers are added to the swarm
docker network create --driver overlay --subnet=10.0.0.0/16 --attachable hpc-cluster-network
# run lcrc-head, replace 372 with the number of LCRC workers you want to run, e.g. 2
docker pull jtqv84/openstack_torque_master # make sure you pull the latest docker image
docker run -it -e "LCRC_WORKERS=372" --rm --privileged --name lcrc-head --hostname lcrc-head --network hpc-cluster-network -v /lib/modules:/lib/modules jtqv84/openstack_torque_master
# lcrc-head will quickly print a message "Waiting for all workers to become active", leave it alone
# open another terminal to start lcrc-workers
cd LcrcOnDemand/docker
./contextualize.py 372 # replace 372 with the number of workers you specified in "LCRC_WORKERS=" when running lcrc-head
# Go back to lcrc-head and check its output. You will see "Notify all workers I'm ready OK" at the end of startup. Press Ctrl-P then Ctrl-Q to detach from lcrc-head.
# wait until all the lcrc-workers are up and running
docker exec -it lcrc-head bash # start a terminal on the lcrc-head container
showq # expected to see "0 of XXX Processors Active" where XXX should = LCRC_WORKERS
source /devstack/openrc admin admin; nova service-list | grep nova-compute | grep up | wc -l # expected to return LCRC_WORKERS
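To block until every worker has registered instead of re-running the check by hand, a minimal polling sketch (run inside the lcrc-head container; 372 is the example LCRC_WORKERS value):
source /devstack/openrc admin admin
while [ "$(nova service-list | grep nova-compute | grep -c up)" -lt 372 ]; do sleep 30; done # loop until all nova-compute services report up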
nova net-list
nova net-delete <network ID> # delete the default private network to speed up instance launches
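If you want to script this step, a sketch that assumes the default network's label is "private" (confirm with nova net-list first):
NET_ID=$(nova net-list | awk '/ private / {print $2}') # second whitespace-separated field is the ID column
nova net-delete "$NET_ID"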
# On the master node (base run):
docker exec -it lcrc-head bash # start another terminal on the lcrc-head container
source /devstack/openrc admin admin; /set_pbsnodes.py 372 372; /pypy/bin/pypy -O /app.py 372 5 12 2>&1 | tee app.py.txt # app.py accepts 3 parameters: (1) LCRC_WORKERS; (2) Wait time; (3) Reserved nodes
docker exec -it lcrc-head bash # start another terminal on the lcrc-head container
source /devstack/openrc admin admin; /run_workload.py BASE_RUN1 /weekly_trace/2015-04-05/ 2>&1 | tee run_workload.txt # run the workload; 'BASE_RUN1' is any name you want to give your experiment
# On the master node (hint-algorithm run):
docker exec -it lcrc-head bash # start another terminal on the lcrc-head container
source /devstack/openrc admin admin; /set_pbsnodes.py 372 372; /pypy/bin/pypy -O /app.py 372 0 0 --hint 2>&1 | tee app.py.txt # --hint enables the hint algorithm
docker exec -it lcrc-head bash # start another terminal on the lcrc-head container
source /devstack/openrc admin admin; /run_workload.py HINT_RUN1 /synthetic_traces/LW_10_basic/ /hint_traces2/M_10_H15.csv H 2>&1 | tee run_workload.txt # M_10_H15.csv is the hint data file; 'H' tells run_workload.py to run the hint algorithm
# On the master node (predictive-algorithm run):
docker exec -it lcrc-head bash # start another terminal on the lcrc-head container
source /devstack/openrc admin admin; /set_pbsnodes.py 372 372; /pypy/bin/pypy -O /app.py 372 0 0 --enable 2>&1 | tee app.py.txt # --enable runs the predictive algorithm
docker exec -it lcrc-head bash # start another terminal on the lcrc-head container
source /devstack/openrc admin admin; /run_workload.py PREDICTIVE_RUN1 /synthetic_traces/M_10_basic/ /dynamic_reserve_trace/M_10_dynamic1.csv 2>&1 | tee run_workload.txt # M_10_dynamic1.csv is the predictive reserve historical data file
Cleanup (on the master node):
docker stop lcrc-head
parallel-ssh -h ~/pssh_hosts.txt -t 300 -i 'docker stop $(docker ps -a -q)'
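Optionally, if you plan to rebuild the cluster from scratch, a teardown sketch for the batch + on-demand setup that also dissolves the swarm and removes the overlay network:
parallel-ssh -h ~/pssh_hosts.txt -i 'docker swarm leave' # workers leave the swarm
docker network rm hpc-cluster-network
docker swarm leave --force # finally, the master leaves and the swarm is dissolved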