Commit

Bring your own lab (BYOL) doc - 2nd revision
sferlin committed Nov 23, 2023
1 parent 8594aaf commit 810e2da
Showing 4 changed files with 22 additions and 34 deletions.
2 changes: 1 addition & 1 deletion ansible/roles/bastion-install/tasks/main.yml
@@ -37,7 +37,7 @@
executable: pip3

- name: Scale / Alias lab bastion tasks
when: lab in rh_labs
when: lab in rh_labs or lab not in byol_labs
block:
- name: Clean lab interfaces
shell: |
3 changes: 2 additions & 1 deletion ansible/roles/validate-vars/tasks/main.yml
@@ -3,10 +3,11 @@

- name: Validate lab
fail:
msg: "Invalid lab selected('{{ lab }}') Select from {{ rh_labs }} and {{ cloud_labs }} "
msg: "Invalid lab selected('{{ lab }}') Select from {{ rh_labs }} and {{ cloud_labs }} and {{ byol_labs }}"
when:
- lab not in rh_labs
- lab not in cloud_labs
- lab not in byol_labs

- name: Check pull secret var is set
fail:
3 changes: 3 additions & 0 deletions ansible/vars/lab.yml
@@ -10,6 +10,9 @@ rh_labs:
- alias
- scalelab

byol_labs:
- byol

labs:
alias:
dns:
48 changes: 16 additions & 32 deletions docs/bastion-deploy-bm-byol.md
@@ -5,8 +5,8 @@ Assuming you received a set of machines, this guide will walk you through gettin
In a BYOL lab, due to non-standard interface names and NIC PCI slots, we have to craft jetlag's inventory file by hand.

The bastion machine needs 2 interfaces:
- The interface connected to the network, i.e., with an IP assigned, a L3 network.
- The control-plane interface, from which the cluster nodes are accessed (this is a L2 network, i.e., it does have an IP assigned).
- The interface connected to the lab network, i.e., with an IP assigned (an L3 network). This interface is usually referred to as *lab_network*, as it provides connectivity into the bastion machine.
- The control-plane interface, from which the cluster nodes are accessed (an L2 network, i.e., it does not have an IP assigned).

The cluster machines need (at least) 1 interface:
- The control-plane interface, from which other cluster nodes are accessed.
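To tell the two NIC types apart on an unfamiliar BYOL machine, `ip -br addr` is a quick check: the *lab_network* candidate carries an IPv4 address, while control-plane candidates are up but unaddressed. A minimal sketch; the interface names and addresses below are made-up sample data, not from any real lab:

```shell
# Sample `ip -br addr` output (hypothetical interfaces; replace with the
# real output from your bastion and cluster machines).
sample='eno1             UP             10.1.2.3/24
eno2             UP
ens1f0           UP             fe80::1/64'

# A third field holding an IPv4 address marks an L3 (lab_network) candidate;
# an interface that is UP with no IPv4 address (or only an IPv6 link-local
# one) is an L2 (control-plane) candidate.
echo "$sample" | awk '$3 ~ /^[0-9]+\./ {print "L3:", $1; next} {print "L2:", $1}'
```

Record the names and MAC addresses of the interfaces you pick this way; they go into the inventory file later in this guide.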
@@ -31,23 +31,8 @@ _**Table of Contents**_
Sometimes your bastion may have undesirable settings, such as `firewalld` enabled or `iptables` rules in place. We recommend fixing these before proceeding, e.g., by disabling `firewalld` and flushing the `iptables` rules.
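If you decide to clean these up, a minimal sketch (assuming `firewalld` is the active firewall and you are fine dropping all current `iptables` rules):

```console
[root@xxx-r660 ~]# systemctl disable --now firewalld
[root@xxx-r660 ~]# iptables -F
[root@xxx-r660 ~]# iptables -t nat -F
```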

1. Select your bastion machine from the allocation
2. Copy your ssh keys to the designated bastion machine

```console
[user@fedora ~]$ ssh-copy-id [email protected]
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 2 key(s) remain to be installed -- if you are prompted now it is to install the new keys
Warning: Permanently added 'xxx-r660.machine.com,x.x.x.x' (ECDSA) to the list of known hosts.
[email protected]'s password:

Number of key(s) added: 2

Now try logging into the machine, with: "ssh '[email protected]'"
and check to make sure that only the key(s) you wanted were added.
[user@fedora ~]$
```

3. Install some additional tools to help after reboot
2. Install some additional tools to help after reboot

Also, make sure RHEL has its repositories added and an active subscription, since `jetlag` requires several packages: `dnsmasq`, `frr`, `golang-bin`, `httpd`, `httpd-tools`, `ipmitool`, `python3-pip`, `podman`, and `skopeo`.

@@ -58,7 +43,7 @@ Updating Subscription Management repositories.
Complete!
```

4. Setup ssh keys on the bastion and copy to itself to permit local ansible interactions
3. Setup ssh keys on the bastion and copy to itself to permit local ansible interactions

```console
[root@xxx-r660 ~]# ssh-keygen
@@ -90,7 +75,7 @@ and check to make sure that only the key(s) you wanted were added.
[root@xxx-r660 ~]#
```

5. Clone `jetlag`
4. Clone `jetlag`

```console
[root@xxx-r660 ~]# git clone https://github.com/redhat-performance/jetlag.git
@@ -103,7 +88,7 @@ Receiving objects: 100% (4510/4510), 831.98 KiB | 21.33 MiB/s, done.
Resolving deltas: 100% (2450/2450), done.
```

6. Download your pull_secret.txt from [console.redhat.com/openshift/downloads](https://console.redhat.com/openshift/downloads) and place it in the root directory of `jetlag`
5. Download your pull_secret.txt from [console.redhat.com/openshift/downloads](https://console.redhat.com/openshift/downloads) and place it in the root directory of `jetlag`

```console
[root@xxx-r660 jetlag]# cat pull_secret.txt
@@ -112,7 +97,7 @@
...
```

7. Change to `jetlag` directory, and then run `source bootstrap.sh`
6. Change to `jetlag` directory, and then run `source bootstrap.sh`

```console
[root@xxx-r660 ~]# cd jetlag/
@@ -122,7 +107,7 @@ Collecting pip
(.ansible) [root@xxx-r660 jetlag]#
```

8. Subsequent bastion setup attempts
7. Subsequent bastion setup attempts

If you wish to install a different OCP version after having prepared your bastion, [clean up](https://github.com/redhat-performance/jetlag/blob/main/docs/troubleshooting.md#bastion---clean-all-container-services--podman-pods) all running pods and rerun the `setup-bastion.yml` playbook:
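For example (the inventory file name below is an assumption; use whichever inventory file you created for your lab):

```console
(.ansible) [root@xxx-r660 jetlag]# ansible-playbook -i ansible/inventory/byol.local ansible/setup-bastion.yml
```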

@@ -145,7 +130,7 @@ Copy the vars file and edit it to create the inventory with your lab info:

### Lab & cluster infrastructure vars

Change `lab` to `lab: scalelab or ibmcloud or alias`
Change `lab` to `lab: scalelab or ibmcloud or alias or byol`

Change `lab_cloud` to `lab_cloud: na`

@@ -210,7 +195,7 @@ The `ansible/vars/all.yml` now resembles ..
# Lab & cluster infrastructure vars
################################################################################
# Which lab to be deployed into (Ex scalelab)
lab: alias
lab: byol
# Which cloud in the lab environment (Ex cloud42)
lab_cloud: na
@@ -303,8 +288,9 @@ Choose wisely which server for which role: bastion, masters and workers. Make su
- Record the names and MACs of their L3 network NIC to be used for the inventory.
- Choose the control-plane NICs, the L2 NIC interface.
- Record the interface names and MACs of the chosen control-plane interfaces.
- Make sure you have root access to the BMCs, e.g., iDRAC for Dell machines. In the example below, `bmc_user` and `bmc_password` are set to root and password.

Now, copy the inventory file and edit it with the above info from your lab:
Now, copy the inventory file and edit it manually with the above information from your lab:

```
# Create inventory playbook will generate this for you much easier
@@ -330,7 +316,7 @@ boot_iso=discovery.iso
bmc_user=root
bmc_password=password
lab_interface=<lab_mac interface name>
network_interface=<anything, not used>
network_interface=<anything>
network_prefix=24
gateway=198.18.10.1
dns1=198.18.10.1
@@ -346,7 +332,7 @@ boot_iso=discovery.iso
bmc_user=root
bmc_password=password
lab_interface=<lab_mac interface name>
network_interface=<anything, not used>
network_interface=<anything>
network_prefix=24
gateway=198.18.10.1
dns1=198.18.10.1
@@ -408,11 +394,9 @@ In jetlag, we divide the cluster installation process into two phases: (1) Setup t
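Sketched as commands, the two phases look like this (the `byol.local` inventory path is an assumption; substitute the inventory file you crafted above):

```console
(.ansible) [root@xxx-r660 jetlag]# ansible-playbook -i ansible/inventory/byol.local ansible/setup-bastion.yml
(.ansible) [root@xxx-r660 jetlag]# ansible-playbook -i ansible/inventory/byol.local ansible/bm-deploy.yml
```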

- The task 'Clean lab interfaces' will fail if there is no file at `/root/clean-interface.sh`.

- I observed that the file `/root/bm/opm-linux.tar.gz` could fail to be downloaded.

- The task 'Stop and disable iptables' failed because `dnf install iptables-services` and a `systemctl start` had to be run first.

- Most importantly, while the setup-bastion can finish successfully, some pods did not come up with "permission denied". In [blog](https://www.redhat.com/sysadmin/container-permission-denied-errors) SELinux seemed to be the cause when containers mount a writable volume, as starting the container manually (without mounting the volume) worked. The direct fix to this issue is appending the ":Z" flag at the end (instead of touching SELinux in general) in a few locations in `jetlag`:
- Most importantly, while setup-bastion can finish successfully, some pods did not come up, failing with "permission denied". According to this [blog](https://www.redhat.com/sysadmin/container-permission-denied-errors), SELinux is the likely cause when containers mount a writable volume, and indeed starting the container manually (without mounting the volume) worked. The direct fix is to append the `:Z` flag to the volume mounts (instead of changing SELinux settings in general) in a few locations in `jetlag`:
- In `ansible/roles/bastion-assisted-installer/tasks/main.yml`:
- `/opt/assisted-service/data/postgresql:/var/lib/pgsql:Z`
- `/opt/assisted-service/nginx-ui.conf:/opt/bitnami/nginx/conf/server_blocks/nginx.conf:Z`
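The `:Z` volume option tells podman to relabel the mounted content with a private SELinux label so the container is allowed to write to it. As an illustration only (the image name is a placeholder, not the one jetlag uses):

```console
[root@xxx-r660 ~]# podman run --rm -v /opt/assisted-service/data/postgresql:/var/lib/pgsql:Z <image>
```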
@@ -429,4 +413,4 @@
- Make sure to inspect the 'BIOS Settings' of each machine for both the *boot order* and the *boot type*. `jetlag` will mount the .iso and instruct the machines to do a one-time boot from it; afterwards, they should be able to boot from the disk. In other words, check that the string in the boot order field contains the hard disk. Once booted, in the virtual console, you will see the L3 NIC interface with a 198.18.10.x address, which is correct according to our `byol.yml` above.
- [badfish](https://github.com/redhat-performance/badfish) could be used; however, it addresses machines by FQDN only, and we also do not yet have the configuration for Dell R660 and R760 in the interface config file.

- In the assistant installer GUI, under cluster events, if you observe any *permission denied* error, it is related to the SELinux issue pointed out above. If you however notice an issue related to *wrong booted device*, make sure to observe in your virtual console, if the machines booted from the disk, and if the boot order contains the option. This is a classic boot order issue. The correct steps in the assistant installer procedures are that the control-plane nodes will boot from the disk, be configured, and join the control-plane "nominated" as the bootstrap node (around 45 t0 47% of the installation) to continue with the installation of the worker nodes.
- In the assisted installer GUI, under cluster events, if you observe any *permission denied* error, it is related to the SELinux issue pointed out previously. If you instead notice an issue related to *wrong booted device*, check in your virtual console whether the machines booted from the disk, and whether the boot order contains the disk option; this is a classic boot order issue. The expected flow in the assisted installer is that the control-plane nodes boot from the disk, get configured, and join the control-plane node "nominated" as the bootstrap node (around 45-47% of the installation), after which the installation of the worker nodes continues.
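To check what a machine will boot from without opening the virtual console, the BMC can also be queried over IPMI; a sketch assuming the `bmc_user`/`bmc_password` values (root/password) from the inventory above:

```console
[root@xxx-r660 ~]# ipmitool -I lanplus -H <bmc address> -U root -P password chassis bootparam get 5
[root@xxx-r660 ~]# ipmitool -I lanplus -H <bmc address> -U root -P password chassis power status
```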
