Cluster Usage Notes

gforney edited this page Jun 1, 2018 · 4 revisions

Becoming root

  • To become root, type:
 su -
  • Then type the password.
Note the "-" in "su -" is important. It causes root's startup files to be invoked (i.e., just typing su will not work).

Adding a user account

If the user had an account on the old cluster, use the uid defined in /etc/passwd.BAK on leo.cfr.nist.gov:

 useradd -u uid user

On burn use

 useradd -d /home4/user -u uid user
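The uid lookup can be scripted instead of read by eye. This is a sketch: jdoe and the sample entry are placeholders, the lookup runs against a scratch copy so it is safe to try, and the useradd command is printed rather than run (it requires root). On the cluster, point awk at the real /etc/passwd.BAK.

```shell
# Scratch stand-in for /etc/passwd.BAK; jdoe is a placeholder user.
bak=$(mktemp)
printf 'jdoe:x:12345:18660::/home/jdoe:/bin/bash\n' > "$bak"

user=jdoe
# uid is the 3rd colon-separated field of the passwd entry.
uid=$(awk -F: -v u="$user" '$1 == u { print $3 }' "$bak")
echo "useradd -u $uid $user"
# -> useradd -u 12345 jdoe
rm -f "$bak"
```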

Set password and synchronize

 passwd user
 pwconv
 passsync

Add information to /etc/passwd about the user you just added. For example, change

 jdoe:x:12345:18660::/home/jdoe:/bin/bash

to

 jdoe:x:12345:18660:John Doe (NIST):/home/jdoe:/bin/bash

See /etc/passwd for other examples. Then type passsync to update your changes.
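The hand edit above changes the fifth (GECOS) field of the passwd entry. The sketch below shows exactly which field changes, using the example entry from this page:

```shell
# The passwd entry before the edit (example from above).
line='jdoe:x:12345:18660::/home/jdoe:/bin/bash'
# Set field 5 (the GECOS / full-name field) and rebuild the line.
new=$(echo "$line" | awk -F: -v OFS=: '{ $5 = "John Doe (NIST)"; print }')
echo "$new"
# -> jdoe:x:12345:18660:John Doe (NIST):/home/jdoe:/bin/bash
```

On most Linux systems, `usermod -c "John Doe (NIST)" jdoe` makes the same edit without opening the file by hand; either way, run passsync afterwards as described above.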

Add a samba account

 smbpasswd -a user

Set up ssh keys using the instructions in the following section so that the user can log in to each cluster node without a password and run jobs on any or all nodes.

Setting Up ssh keys

Type:

 ssh-keygen -t rsa

Press Enter at each prompt until the command completes (this accepts the default key location and an empty passphrase).

cd into the ~/.ssh directory and type

 cat id_rsa.pub >> authorized_keys

Type:

setup_ssh.sh

to confirm that you can login to each cluster node without a password.

The first time you do this you will have to type yes for each cluster node. Note: having to type 'yes' multiple times is a bug in the cluster configuration which is being addressed.
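The key-generation and authorized_keys steps above can be done non-interactively (empty passphrase, matching pressing Enter at each prompt). This sketch uses a scratch directory so it is safe to run as-is; on the cluster the files live in ~/.ssh, which should be mode 700.

```shell
# Scratch directory standing in for ~/.ssh.
dir=$(mktemp -d)
# -N "" gives an empty passphrase; -q suppresses the banner output.
ssh-keygen -t rsa -N "" -f "$dir/id_rsa" -q
# Authorize the new public key for password-less login.
cat "$dir/id_rsa.pub" >> "$dir/authorized_keys"
# sshd refuses keys in world-readable authorized_keys files.
chmod 600 "$dir/authorized_keys"
```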

Removing a user account

Edit the /etc/passwd file and remove the user's entry.

Update the passwd file by typing:

 pwconv
 passsync

Remove the user's files with:

 cd /home
 rm -r user_name
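One hazard in the cd/rm pair above: if the cd fails, the rm runs in whatever directory you happened to be in. Chaining the two with && avoids that. The sketch below demonstrates on a scratch directory standing in for /home (user_name is a placeholder):

```shell
# Scratch directory standing in for /home.
home=$(mktemp -d)
mkdir "$home/user_name"

# && ensures rm only runs if the cd actually succeeded.
cd "$home" && rm -r user_name
```

On most Linux systems, `userdel -r user_name` removes the account and the home directory in one step; since this cluster edits /etc/passwd by hand, still run pwconv and passsync afterwards.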

Rebooting a node

If a node becomes unresponsive (say, blaze001), type the following command as root:

cluster_node_reboot.sh blaze001

Rebooting the cluster

  • To reboot the blaze compute nodes, login to blaze as root, cd to /usr/local/bin and type:
   ./cluster_blaze_reboot.sh
  • Likewise, to reboot the burn compute nodes, login to burn as root, cd to /usr/local/bin and type:
   ./cluster_burn_reboot.sh
  • To reboot blaze or burn type
   reboot
  • Run the script /usr/local/bin/check_cluster.sh to verify that all nodes are up and accessible.

Powering down the cluster

Power down the blaze compute nodes

  • login to blaze as root. Note: you must log in as root directly; you CANNOT log in as yourself and then switch to root
  • before proceeding, make sure ALL USERS are logged off (you'll have trouble unmounting file systems if you don't)
  • unmount all NFS-mounted file systems. Type:
  umount /firestore
  umount /home4
  umount /home2/smokevis
  umount /home2/smokevis2

Type

   df -k

to confirm that the file systems are unmounted. If for some reason you run into problems with the above umount commands, proceed with powering down the nodes anyway. Note that if users were logged in (including yourself), the umount commands might not work.
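Instead of eyeballing df output, the kernel's mount table can be checked directly. A sketch for one of the mount points above (works anywhere /proc/mounts exists):

```shell
# Report whether /firestore is still in the mount table. The spaces
# around the path keep e.g. /firestore2 from matching by accident.
if grep -q ' /firestore ' /proc/mounts; then
    echo "/firestore is still mounted"
else
    echo "/firestore is unmounted"
fi
```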

  • cd to /usr/local/bin and type
  ./cluster_blaze_off.sh

This will power off nodes blaze001->blaze119 (or whatever is the last node)

Power down the burn compute nodes

  • login to burn as root (if you were logged into blaze as root you may ssh from blaze). Again, you must log in as root directly; you CANNOT log in as yourself and then switch to root
  • before proceeding, make sure ALL USERS are logged off (again, you'll have trouble unmounting file systems if you don't)
  • unmount all NFS-mounted file systems. Type:
  umount /home
  umount /home2/smokevis
  umount /home2/smokevis2

Type

   df -k

to confirm that the file systems are unmounted. If for some reason you run into problems with the above umount commands, proceed with powering down the nodes anyway. Note that if users were logged in (including yourself), the umount commands might not work.

  • cd to /usr/local/bin and type
  ./cluster_burn_off.sh

This will power off nodes burn001->burn036 (or whatever is the last node)

Powering down the blaze and burn head nodes

To power down blaze and burn, type the following on each while logged in as root (if you logged into burn from blaze, you'll need to power down burn first):

   poweroff

Shutting down Power in Cluster Room

Now you have to go to the cluster room.

  • Turn off the UPS at the bottom of the blaze and burn cluster cabinets (also check the UPSs in the other two cabinets, though they do not appear to be working)
  • Pull down four large power switches on wall to right of cluster cabinet to the off position.
  • Turn off A/C (push small power off button, then turn large black switch to off).

Powering up the cluster

  • Turn on the four main circuit boxes on each side of the room (8 total).
  • Wait a few minutes - this gives the network switches time to boot up.
  • Turn on blaze and burn UPSs at bottom of cabinets.
  • Pull out the blaze console and enter the password 00000000 (eight 0's). The password is also written on the blaze console keyboard.
  • Press power button on blaze. After a few moments the screen should start giving messages.
  • Push button 2 on the blaze KVM switch to activate the burn console.
  • Turn on burn master node (red button on right)
  • Turn on smokevis, firevis and firestore (firestore large chassis directly above smokevis and firevis)
  • After blaze has booted up, cd to /usr/local/bin and type:
   ./cluster_blaze_on.sh 
  • After burn has booted up, login to burn, cd to /usr/local/bin and type:
  ./cluster_burn_on.sh
  • Turn on A/C!!! (first, turn large black switch, then hold power on button until unit kicks on).
  • After a few minutes, run the script /usr/local/bin/check_cluster.sh on both blaze and burn to verify that all nodes are up and accessible

Setting clocks to the correct time

On blaze, run the following script as root

  /usr/local/bin/RESET_CLOCK.sh

Troubleshooting nodes and taking them off the batch queue

For example, to take blaze011 off the default (batch) queue, perform the following steps:

  • Edit /var/spool/torque/server_priv/nodes on blaze, and change the following line
 blaze011 np=8 16g compute

to

 blaze011 np=8 16g testqueue
  • Create a new queue (mine is called testing_queue) using the following commands:
 qmgr -c "create queue testing_queue"
 qmgr -c "set queue testing_queue queue_type = Execution"
 qmgr -c "set queue testing_queue resources_default.neednodes = testqueue"
 qmgr -c "set queue testing_queue resources_default.nodes = 1"
 qmgr -c "set queue testing_queue enabled = True"
 qmgr -c "set queue testing_queue started = True"
  • Restart pbs_server and maui
Important! You should restart the PBS server with the following commands (NOT pbs_server stop and NOT pbs_server restart), or you risk stopping jobs that are currently running:
 qterm -t quick
 /etc/init.d/pbs_server start
 /etc/init.d/maui restart

Now, blaze011 will no longer accept jobs submitted to the default queue (batch), but you have to explicitly call the testing_queue like:

 qfds.sh -r -q testing_queue casename.fds
  • Useful queuing commands
  list available queues: qmgr -c 'p s'
  delete a queue: qmgr -c 'delete queue fire60s'
  when setting up torque, make sure scheduling is enabled:
  qmgr -c "set server scheduling = True"

Queues

Type

  qstat -q

to see a list of queues

Click on http://blaze.nist.gov/summary.html or http://burn.nist.gov/summary.html to see how queues on blaze and burn clusters are being used.

Torque configuration changes

A file named /etc/sysconfig/pbs_mom containing

 #!/bin/bash
 ulimit -s unlimited

was added to each compute node on the burn and blaze cluster. This was to ensure that unlimited stack was available to each node of an openmpi job.
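The effect of that fragment can be checked in an isolated shell. This is just a local sketch of what the fragment does when pbs_mom's init script sources it (raising the limit to unlimited may be disallowed for unprivileged users on some systems, hence the error redirect):

```shell
# Raise the stack limit the way the pbs_mom fragment does, then show
# the resulting soft limit for that shell.
bash -c 'ulimit -s unlimited 2>/dev/null; ulimit -s'
```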

Restarting ganglia

  • restart gmond and gmetad on head node with
  /etc/init.d/gmond restart
  /etc/init.d/gmetad restart
  • restart gmond on all nodes (gmetad does not run on compute nodes) with
  /usr/local/bin/ganglia_restart.sh

Fixing cluster problems

  • Make a User's home directory readable
  chmod 755 ~username
  • Make all files in a directory tree readable (i.e., accessible to everyone)
  cd to one level above the directory you wish to make readable and type the following command. If you don't own the directory, you'll need to be root:
  chmod +r -R directory_name
  • Samba is not working
  see if the samba daemon is running by typing: 
  ps -el | grep smb
  in a command shell.  If you don't see anything (or even if you do) type as root:
  /etc/init.d/smb restart
  to restart the daemon
  • Checking disk usage
  To see how much space is used by the directory named dir, type:
  du -ks dir
  To see how much space is used by all files/directories in the current directory, type:
  du -ks `ls`
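One caveat on the `chmod +r -R` recipe above: it adds read permission but not the execute (search) bit on directories, so subdirectories can remain untraversable for other users. The symbolic mode `a+rX` handles both; capital X sets execute only on directories (and on files already executable by someone). A sketch on a scratch tree:

```shell
# Build a scratch tree and lock it down for group/other.
tree=$(mktemp -d)
mkdir "$tree/sub" && touch "$tree/sub/file"
chmod -R go-rwx "$tree"

# a+rX: read for everyone; execute only where it makes sense, so the
# directories become searchable but the plain file does not become
# executable.
chmod -R a+rX "$tree"
ls -ld "$tree/sub"   # now mode drwxr-xr-x
```

On the cluster, `chmod -R a+rX directory_name` from one level above is a drop-in replacement for the command shown.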

Change email From addresses

 edit /etc/postfix/canonical by adding lines of the form:
 [email protected] [email protected]
 postmap /etc/postfix/canonical
 /etc/init.d/postfix restart