# Admin Tasks
In the following, it is assumed that `$CHROOT` resolves to `/opt/ohpc/admin/images/<version>`.
## Warewulf
Warewulf is used for cluster management (provisioning the node images).
### Sharing of directories between `edi` and nodes
On `edi`, add the directory to `/etc/exports`:
```
# /home *(rw,no_subtree_check,fsid=10,no_root_squash)
/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)
/opt/spack *(rw,no_subtree_check,no_root_squash)
/opt/R *(rw,no_subtree_check,no_root_squash)
```
and run `exportfs -a` to re-export all entries.
For the nodes, the corresponding entry needs to go into `$CHROOT/etc/fstab`, like
```
# edi
192.168.1.1:/opt/spack /opt/spack nfs nfsvers=3,nodev 0 0
192.168.1.1:/opt/R /opt/R nfs nfsvers=3,nodev 0 0
# mars
10.232.16.12:/mnt/vol2/hpc/edi /home nfs rw,sync,user,hard,intr,_netdev,exec 0 0
```
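To verify the share is visible, the exports can be listed on `edi` and queried from a node (a quick sanity check; the node-side command assumes `showmount` from `nfs-utils` is installed):
```
# on edi: show what is currently exported
exportfs -v
# on a node: ask the NFS server for its export list
showmount -e 192.168.1.1
```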
### `renv` cache
The `renv` cache is mapped centrally to `/opt/R/renv` in RSW.
To share the RSW cache and `edi` cache with the nodes, an NFS share has been added.
See the previous section for more details.
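As a hedged sketch of the wiring: `renv` reads the cache location from the `RENV_PATHS_CACHE` environment variable, which can be set site-wide in the R installation's `Renviron.site` (the exact file path below is an assumption and depends on the R installation):
```
# e.g. in <R_HOME>/etc/Renviron.site (path is an assumption)
RENV_PATHS_CACHE=/opt/R/renv
```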
### Manage systemd services on nodes
```
pdsh -w c[0-5] systemctl <command>
```
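For example, to check the Slurm node daemon on all compute nodes (assuming the standard OpenHPC service name `slurmd`):
```
pdsh -w c[0-5] systemctl status slurmd
```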
### Enable systemd service in image
```
export CHROOT=<some path>
chroot $CHROOT systemctl enable <service>
```
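A change inside the image only reaches the nodes after the image is rebuilt and the nodes are rebooted. With Warewulf 3 (as shipped with OpenHPC) the rebuild is typically done via `wwvnfs`; a sketch, assuming that tooling:
```
# enable munge inside the image, then rebuild the VNFS
chroot $CHROOT systemctl enable munge
wwvnfs --chroot $CHROOT
```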
### Updating the nodes
Run `/root/update-nodes.sh`.
```{block, type='rmdcaution'}
Sometimes `munge` does not start after updating the nodes, leaving them out of sync with the controller.
Check `systemctl status munge` and, if necessary, restart `munge` on all nodes:
```
```
pdsh -w c[0-5] mkdir /var/log/munge
pdsh -w c[0-5] chown -R munge:munge /var/log/munge
pdsh -w c[0-5] systemctl restart munge
scontrol update nodename=c[0-5] state=resume
```
In addition, `/opt/R/renv` must be world-readable and writable; this is sometimes not the case and causes problems in combination with `renv`.
```
pdsh -w c[0-5] chmod -R 777 /opt/R/renv
```
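A quick sanity check for the permissions:
```
pdsh -w c[0-5] ls -ld /opt/R/renv
```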
### Sharing hostnames
The nodes must be aware of the RSW hostname and internal IP (docker0 gateway).
To do so, add a hostname/IP mapping into `$CHROOT/etc/hosts` and reboot the nodes.
```
192.168.1.1 edi
172.18.0.3 rsw
172.18.0.4 rsw-docker
```
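Since `$CHROOT/etc/hosts` lives inside the image, a rebuild is needed before the reboot; a sketch assuming the Warewulf 3 tooling mentioned above:
```
# rebuild the image so the new /etc/hosts is included, then reboot the nodes
wwvnfs --chroot $CHROOT
pdsh -w c[0-5] reboot
```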
## Slurm
Some notes:
- `/etc/slurm/slurm.conf` must always be identical everywhere (RSW, edi, nodes).
- In `/etc/slurm/slurm.conf`, two `SlurmctldHost` entries are needed (one for edi, one for RSW in the container); see the excerpt below.
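An illustrative excerpt of the relevant lines (the hostnames and their order are assumptions here; in Slurm, the first entry is the primary controller and later entries act as backups):
```
SlurmctldHost=edi
SlurmctldHost=rsw
```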
### Undrain a node
If a node is in the "drain" state, it can be undrained via
```
scontrol update NodeName=<node> State=DOWN Reason="undraining"
scontrol update NodeName=<node> State=RESUME
# or resume all nodes at once:
scontrol update NodeName=c[0-5] State=RESUME
```
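To list drained or down nodes together with the recorded reason:
```
sinfo -R
```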
### Reconfigure Slurm
For example, after a configuration change:
```
scontrol reconfigure
```
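`scontrol reconfigure` makes all Slurm daemons re-read `slurm.conf`. To double-check the active settings afterwards:
```
scontrol show config | grep -i slurmctldhost
```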
## Docker
### Pulling a new image
Run this as user `admingeogr`, which has AWS pull credentials configured:
```
cd /home/admingeogr/rsw
# log into AWS ECR repo
aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 222488041355.dkr.ecr.eu-central-1.amazonaws.com
docker-compose pull
```
### Update a container
```
cd /home/admingeogr/rsw
docker-compose up -d
```
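To confirm that the container was recreated and is running with the new image:
```
docker-compose ps
```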
### Clean up old images
```
# removes all unused images without prompting for confirmation
docker image prune -af
```