Cluster Usage Notes
- To become root, type:
su -
- Then type the password.
If the user has an account on the old cluster, use the uid defined in /etc/passwd.BAK on leo.cfr.nist.gov:
useradd -u uid user
On burn, use:
useradd -d /home4/user -u uid user
Set password and synchronize:
passwd user
pwconv
passsync
Add information to /etc/passwd about the user you just added. For example, change
jdoe:x:12345:18660::/home/jdoe:/bin/bash
to
jdoe:x:12345:18660:John Doe (NIST):/home/jdoe:/bin/bash
See /etc/passwd for other examples. Then type passsync to update your changes.
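The comment-field edit above can also be expressed as a small filter. This is a minimal sketch, not part of the cluster's tooling: set_gecos is a hypothetical helper name, and it only prints the modified line rather than writing to /etc/passwd, so the real file still has to be edited (and passsync run) by hand.

```shell
# Hypothetical helper (not a cluster script): print a passwd-style line
# with its fifth field (the comment field) replaced.
#   $1 = existing passwd line, $2 = new comment text
set_gecos() {
  printf '%s\n' "$1" |
    awk -F: -v OFS=: -v gecos="$2" '{ $5 = gecos; print }'
}

# Example using the entry shown above; nothing is written to /etc/passwd.
set_gecos 'jdoe:x:12345:18660::/home/jdoe:/bin/bash' 'John Doe (NIST)'
```

Printing the result first makes it easy to compare against the existing entries in /etc/passwd before changing anything.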
Add a samba account
smbpasswd -a user
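As a checklist, the account-creation commands so far can be collected in one place. The sketch below is a dry run only: add_user_commands is a hypothetical helper that prints the commands from these notes in order without executing them, and the manual edit of the comment field in /etc/passwd is not included.

```shell
# Hypothetical dry-run helper: print (do not run) the account-creation
# commands described above, in the order the notes give them.
#   $1 = user name, $2 = uid, $3 = optional home directory
#        (for example /home4/user on burn)
add_user_commands() {
  user=$1
  uid=$2
  home=${3:-}
  if [ -n "$home" ]; then
    echo "useradd -d $home -u $uid $user"
  else
    echo "useradd -u $uid $user"
  fi
  echo "passwd $user"
  echo "pwconv"
  echo "passsync"
  echo "smbpasswd -a $user"
}

# Example: the sequence for user jdoe with uid 12345.
add_user_commands jdoe 12345
```

Passing a third argument, as in add_user_commands jdoe 12345 /home4/jdoe, prints the burn form of the useradd line instead.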
Set up ssh keys using the instructions in the following section so that the user can log in to each cluster node without a password and run jobs on any or all nodes.
Type:
ssh-keygen -t rsa
Press return at each prompt until the command completes.
cd into the directory .ssh and type
cat id_rsa.pub >> authorized_keys
Type:
setup_ssh.sh
to confirm that you can log in to each cluster node without a password.
The first time you do this, you will have to type yes for each cluster node. Note that having to type 'yes' multiple times is a bug in the cluster configuration which is being addressed.
Edit the /etc/passwd file to remove the user's password entry
Update the passwd file by typing:
pwconv
passsync
Remove the user's files with:
cd /home
rm -r user_name
If a node becomes unresponsive (say, blaze001), type the following command as root:
cluster_node_reboot.sh blaze001
- To reboot the blaze compute nodes, login to blaze as root, cd to /usr/local/bin and type:
./cluster_blaze_reboot.sh
- Likewise, to reboot the burn compute nodes, login to burn as root, cd to /usr/local/bin and type:
./cluster_burn_reboot.sh
- To reboot blaze or burn itself, type:
reboot
- Run the script /usr/local/bin/check_cluster.sh to verify that all nodes are up and accessible.
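The contents of check_cluster.sh are not reproduced in these notes. As a rough illustration of the kind of check such a script might perform, the sketch below generates node names and tries a passwordless ssh to each one. It assumes node names are a prefix followed by a three-digit number (as in blaze001) and that ssh keys are already set up; node_names and check_nodes are hypothetical names, not the cluster's own scripts.

```shell
# Hypothetical helper: print node names prefix001 .. prefixNNN.
#   $1 = prefix (for example blaze), $2 = number of the last node
node_names() {
  i=1
  while [ "$i" -le "$2" ]; do
    printf '%s%03d\n' "$1" "$i"
    i=$((i + 1))
  done
}

# Hypothetical helper: try a passwordless ssh to each node and report
# whether it answered. BatchMode makes ssh fail instead of prompting.
check_nodes() {
  for node in $(node_names "$1" "$2"); do
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$node" true 2>/dev/null; then
      echo "$node up"
    else
      echo "$node DOWN"
    fi
  done
}

# Example: list the first three blaze node names (no network access needed).
node_names blaze 3
```

On the cluster, check_nodes blaze 119 would walk every blaze node; adjust the last node number to match the current configuration.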
- Log in to blaze as root. Note that you must log in as root directly; you CANNOT log in as yourself and then switch to root.
- Before proceeding, make sure ALL USERS are logged off (you'll have trouble unmounting file systems if you don't).
- Unmount all NFS mounted file systems. Type:
umount /firestore
umount /home4
umount /home2/smokevis
umount /home2/smokevis2
Type
df -k
to confirm that the file systems are unmounted. If for some reason you run into problems with the above umount commands, proceed with powering down the nodes anyway. Note that if users were logged in (including yourself), the above umount commands might not work.
- cd to /usr/local/bin and type
./cluster_blaze_off.sh
This will power off nodes blaze001->blaze119 (or whatever is the last node)
- Log in to burn as root (if you were logged into blaze as root, you may ssh from blaze). Again, note that you must log in as root directly; you CANNOT log in as yourself and then switch to root.
- Before proceeding, make sure ALL USERS are logged off (again, you'll have trouble unmounting file systems if you don't).
- Unmount all NFS mounted file systems. Type:
umount /home
umount /home2/smokevis
umount /home2/smokevis2
Type
df -k
to confirm that the file systems are unmounted. If for some reason you run into problems with the above umount commands, proceed with powering down the nodes anyway. Note that if users were logged in (including yourself), the above umount commands might not work.
- cd to /usr/local/bin and type
./cluster_burn_off.sh
This will power off nodes burn001->burn036 (or whatever is the last node)
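The df -k check in the steps above can be automated so that a leftover mount is not missed by eye. The following is a minimal sketch under stated assumptions: mounted_in and report_still_mounted are hypothetical helpers, and df is called with -P (POSIX output format) so that each file system stays on one line even when the server name is long.

```shell
# Hypothetical helper: succeed if mount point $1 appears in the last
# column of df -kP style output read from standard input.
mounted_in() {
  awk -v m="$1" 'NR > 1 && $NF == m { found = 1 } END { exit !found }'
}

# Hypothetical helper: report any of the listed mount points that are
# still mounted on this machine.
report_still_mounted() {
  for mp in "$@"; do
    if df -kP | mounted_in "$mp"; then
      echo "still mounted: $mp"
    fi
  done
}

# On blaze, after the umount commands:
#   report_still_mounted /firestore /home4 /home2/smokevis /home2/smokevis2
# On burn:
#   report_still_mounted /home /home2/smokevis /home2/smokevis2
```

No output means none of the listed file systems are mounted any longer.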
To power down blaze and burn, type the following while logged in (if you logged into burn from blaze, you'll need to power down burn first):
poweroff
Now you have to go to the cluster room.
- Turn off the UPS at the bottom of the blaze and burn cluster cabinets (also check the UPSs in the other two cabinets; they may not be working).
- Pull down four large power switches on wall to right of cluster cabinet to the off position.
- Turn off A/C (push small power off button, then turn large black switch to off).
- Turn on (four) main circuit boxes on each side of room, 8 total.
- Wait a few minutes. This gives the network switches time to "boot up".
- Turn on blaze and burn UPSs at bottom of cabinets.
- Pull out the blaze console and enter password 00000000 (8 0's). The password is located on the blaze console keyboard.
- Press power button on blaze. After a few moments the screen should start giving messages.
- Push button 2 on the blaze KVM switch to activate the burn console.
- Turn on burn master node (red button on right)
- Turn on smokevis, firevis and firestore (firestore is the large chassis directly above smokevis and firevis).
- After blaze has booted up, cd to /usr/local/bin and type:
./cluster_blaze_on.sh
- After burn has booted up, login to burn, cd to /usr/local/bin and type:
./cluster_burn_on.sh
- Turn on A/C!!! (first, turn large black switch, then hold power on button until unit kicks on).
- After a few minutes, run the script /usr/local/bin/check_cluster.sh on both blaze and burn to verify that all nodes are up and accessible
On blaze, run the following script as root
/usr/local/bin/RESET_CLOCK.sh
For example, to take blaze011 out of the default (batch) queue, perform the following steps:
- Edit /var/spool/torque/server_priv/nodes on blaze, and change the following line
blaze011 np=8 16g compute
to
blaze011 np=8 16g testqueue
- Create a new queue (mine is called testing_queue) using the following commands:
qmgr -c "create queue testing_queue"
qmgr -c "set queue testing_queue queue_type = Execution"
qmgr -c "set queue testing_queue resources_default.neednodes = testqueue"
qmgr -c "set queue testing_queue resources_default.nodes = 1"
qmgr -c "set queue testing_queue enabled = True"
qmgr -c "set queue testing_queue started = True"
- Restart pbs_server and maui:
qterm -t quick
/etc/init.d/pbs_server start
/etc/init.d/maui restart
Now blaze011 will no longer accept jobs submitted to the default queue (batch). To run on it, you have to name testing_queue explicitly, like:
qfds.sh -r -q testing_queue casename.fds
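The queue-creation commands in the steps above differ only in the queue name and the node property, so they can be generated for other queues. This is a dry-run sketch: queue_setup_commands is a hypothetical helper that prints the same six qmgr commands with the names substituted, leaving it to the administrator to review and run them.

```shell
# Hypothetical dry-run helper: print (do not run) the qmgr commands that
# create an execution queue tied to a node property.
#   $1 = queue name, $2 = node property used in the nodes file
queue_setup_commands() {
  queue=$1
  prop=$2
  cat <<EOF
qmgr -c "create queue $queue"
qmgr -c "set queue $queue queue_type = Execution"
qmgr -c "set queue $queue resources_default.neednodes = $prop"
qmgr -c "set queue $queue resources_default.nodes = 1"
qmgr -c "set queue $queue enabled = True"
qmgr -c "set queue $queue started = True"
EOF
}

# Example: reproduce the commands for the queue used in these notes.
queue_setup_commands testing_queue testqueue
```

The node property passed as the second argument must match the one written into /var/spool/torque/server_priv/nodes for the nodes that should serve the queue.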
- Useful queuing commands
list available queues:
qmgr -c 'p s'
delete a queue:
qmgr -c 'delete queue fire60s'
make sure the following is used when setting up torque:
qmgr -c "set server scheduling = True"
Type
qstat -q
to see a list of queues
Click on http://blaze.nist.gov/summary.html or http://burn.nist.gov/summary.html to see how queues on blaze and burn clusters are being used.
A file named /etc/sysconfig/pbs_mom containing
#!/bin/bash
ulimit -s unlimited
was added to each compute node on the burn and blaze cluster. This was to ensure that unlimited stack was available to each node of an openmpi job.
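One way to confirm that the setting is taking effect is to have a job print the stack limit it actually sees on the node where it runs. The sketch below is only an illustration; report_stack_limit is a hypothetical helper, and it reports the limit of whatever shell calls it.

```shell
# Hypothetical helper: print this machine's node name and the stack size
# limit seen by the current shell ("unlimited" if no limit applies).
report_stack_limit() {
  printf '%s stack=%s\n' "$(uname -n)" "$(ulimit -s)"
}

report_stack_limit
```

Run from inside a job on a compute node, the reported value should be unlimited if the pbs_mom file above is being honored.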
- restart gmond and gmetad on head node with
/etc/init.d/gmond restart
/etc/init.d/gmetad restart
- restart gmond on all nodes with (gmetad does not run on compute nodes)
- Make a User's home directory readable
chmod 755 ~username
- Make all files in a directory tree readable (ie accessible to everyone)
chmod +r -R directory_name
- Samba is not working
See if the samba daemon is running by typing
ps -el | grep smb
in a command shell. If you don't see anything (or even if you do), type as root:
/etc/init.d/smb restart
to restart the daemon
- Checking disk usage
To see how much space is used by the directory named dir, type:
du -ks dir
To see how much space is used by all files/directories in the current directory, type:
du -ks `ls`
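The du -ks `ls` form splits names on whitespace, so it misreports entries whose names contain spaces. A sketch of an alternative that uses a shell glob and sorts the largest entries first is shown below; usage_by_entry is a hypothetical helper name, and hidden (dot) entries are not matched by the glob.

```shell
# Hypothetical helper: show space used (in KB) by each entry in directory
# $1 (default: current directory), largest first. The glob keeps names
# containing spaces intact; hidden (dot) entries are not included.
usage_by_entry() {
  dir=${1:-.}
  du -ks -- "$dir"/* | sort -rn
}
```

For example, usage_by_entry /home4 would list each entry under /home4 with its size, making the largest users easy to spot.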
Edit /etc/postmap/canonical by adding lines of the form:
[email protected] [email protected]
Then type:
postmap /etc/init.d/canonical
/etc/init.d/postfix restart