
Add workflow to run tests on real hardware #2545

Merged — 13 commits, Oct 29, 2024
Conversation

MaisenbacherD
Contributor

This PR introduces a GitHub workflow that runs all test cases under the
tests directory on real hardware.

In preparation, the existing test cases have been fixed so that the new
run-tests workflow completes successfully (see the actions on my fork's
master branch).

The infrastructure that provides the self-hosted GitHub runner with real NVMe
devices attached to it is provided by Western Digital Corporation.
At a high level, this infrastructure consists of multiple storage nodes that
form a Kubernetes cluster. Within this cluster, KubeVirt virtual machines can
be spawned on demand with different hardware configurations. Those VMs are
owned by different users; one user group, isolated in its own namespace, is
the self-hosted GitHub runners.

The overall goal of this infrastructure is to provide device access to open
source projects for testing purposes. The first two pioneering candidates to
make use of this service are nvme-cli and ZenFS (see westerndigitalcorporation/zenfs#294).

The new self-hosted GitHub runner IaC is to be open-sourced.

If people want to see other NVMe devices available for testing in GitHub
workflows, we are happy to discuss details on how to move forward with
integrating new test devices.

Initial requirements for this self-hosted runner:

  • Share NVMe devices (both NVM and ZNS command sets. Western Digital
    Ultrastar® SN640 and Western Digital Ultrastar® ZN540 are provided)
  • Update the VM kernel nightly with Linus' master branch (through cron job)
  • Security:
    • Running in a dedicated KubeVirt VM, one per repository.
    • Restricted cluster network access: KubeVirt masquerade interface that only opens
      necessary ports.
    • Privileged code execution within the VM ('gh-runner' user). The VM will be reset
      after each workflow run.
    • Running in the gh-runner namespace with NetworkPolicies: ssh is allowed
      only from cluster-local IPs, https is allowed only to and from external
      addresses (no access to internal cluster services), and access to the VM
      image repository is allowed only from local IPs.
    • /etc/hosts.allow and /etc/hosts.deny rules for the gh-runner instances that
      only allow sshd access from cluster local IPs into the VM.
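
The nightly kernel update mentioned above could, for example, be driven by a
crontab entry along these lines (a sketch only; the script path, name, and
schedule are assumptions, not the actual infrastructure code):

```
# /etc/cron.d/update-kernel (hypothetical): rebuild and boot into
# Linus' master branch every night at 02:00
0 2 * * * root /usr/local/bin/update-kernel-from-linus-master.sh
```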

IMPORTANT: Configuration required by the GitHub repo that uses the self-hosted runner:

  • A single self-hosted runner instance (KubeVirt VM) is only allowed to serve
    one repository
  • Optionally: Prevent a group of people from allowing actions:
    Repo -> Settings -> Actions -> General -> Actions permissions
  • Prevent PRs from executing code on the self-hosted runner before getting approved:
    Repo -> Settings -> Actions -> General -> 'Require approval for all outside
    collaborators' -> Save
  • Restrict write access to the repository with the GITHUB_TOKEN:
    Repo -> Settings -> Actions -> General -> 'Read repository content and packages
    permissions' -> Save
  • Prevent GitHub Actions from creating or approving pull requests through the
    GITHUB_TOKEN:
    Repo -> Settings -> Actions -> General -> DISABLE 'Allow GitHub Actions to
    create and approve pull requests' -> Save
  • In reviews watch out for code injection and secret leaks within workflows
    (https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions)

In a GitHub workflow that uses a container on the self-hosted instance, the
block device has to be passed into the container:

...
    container:
      image: ghcr.io/igaw/linux-nvme/debian.python:latest
      options: '--device=/dev/nvme0n1:/dev/nvme0n1'
...
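
For context, a fuller job definition might look like the following. The image
name and `--device` option are taken from the snippet above; the runner label,
checkout step, and test invocation are assumptions, not the PR's actual
workflow file:

```yaml
jobs:
  run-tests:
    runs-on: self-hosted          # assumption: the WDC-provided runner label
    container:
      image: ghcr.io/igaw/linux-nvme/debian.python:latest
      options: '--device=/dev/nvme0n1:/dev/nvme0n1'
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: make test            # hypothetical invocation of the test suite
```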

Before this PR gets merged we should make sure that the self-hosted runner is
added to the repository, which requires a token exchange on a side channel. :)

@igaw
Collaborator

igaw commented Oct 28, 2024

Looks really good already. I think we should start using the json output for the new code and transform the old code afterwards when the CI is up and running. But if you think you would like first to get it running as it is and then change the tests accordingly, that's also fine with me.

This fixes "SyntaxWarning: invalid escape sequence" when `get_error_log`
is being called.

Signed-off-by: Dennis Maisenbacher <[email protected]>
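
The SyntaxWarning mentioned in this commit typically comes from a regular
expression written as a plain string literal; prefixing it with `r` makes it a
raw string, so backslashes reach the `re` module untouched. A minimal
illustration (not the PR's actual code):

```python
import re

# A plain string literal like "\d+" triggers
# "SyntaxWarning: invalid escape sequence '\d'" on recent Python versions,
# because \d is not a recognized string escape.
# A raw string avoids the warning:
pattern = r"\d+"

match = re.search(pattern, "Error Count: 42")
print(match.group())  # -> 42
```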
@MaisenbacherD
Contributor Author

Looks really good already. I think we should start using the json output for the new code and transform the old code afterwards when the CI is up and running. But if you think you would like first to get it running as it is and then change the tests accordingly, that's also fine with me.

Thanks for the review :) I changed the sections you commented on. Letting the tests run on my branch now.
I think it makes sense to use a container within the VM instead of running directly in the VM.
I will now experiment a bit with that before pushing this update here.

Calculate ncap with the configured flbas instead of setting a hard-coded
value that might be too big for the test device.

Also, `nvme list-ctrl` outputs a summary line of the ctrls present, which
must be skipped when parsing for the ctrl-id; otherwise this test fails.

Signed-off-by: Dennis Maisenbacher <[email protected]>
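
The idea of deriving ncap from the configured FLBAS rather than hard-coding it
can be sketched as follows. Function and field names are illustrative, not the
PR's actual code; the inputs correspond to fields reported by `nvme id-ns`:

```python
def lba_data_size(flbas, lbads_list):
    """Return the LBA data size in bytes for the selected LBA format.

    lbads_list holds the LBADS (log2 of the data size) value of each
    LBA format; the low nibble of FLBAS selects the active format.
    """
    lbads = lbads_list[flbas & 0xF]
    return 1 << lbads

def calc_ncap(total_capacity_bytes, flbas, lbads_list):
    """Number of logical blocks that fit into the available capacity."""
    return total_capacity_bytes // lba_data_size(flbas, lbads_list)

# Example: 1 GiB of capacity with a 4 KiB LBA format (LBADS = 12)
print(calc_ncap(1 << 30, 0, [12]))  # -> 262144
```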
Rework the `get_ocfs` function to correctly parse the ocfs field.
At the same time, refactor the `nvme id-ctrl` invocation that extracts the
ocfs field into separate functions. This is in preparation for following
commits that reuse the same functionality.

Furthermore, nvme_copy_test fails on devices that do not support any
copy formats. For the test to pass, we need to declare
`self.host_behavior_data` in any case.

Signed-off-by: Dennis Maisenbacher <[email protected]>
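
Parsing the OCFS (Optional Copy Formats Supported) field comes down to testing
individual bits of the value reported by `nvme id-ctrl`: bit n indicates
support for copy format n. A hedged sketch (names are illustrative, not the
PR's implementation):

```python
def supported_copy_formats(ocfs):
    """Return the copy format indexes whose bit is set in OCFS.

    An OCFS of 0 means the device supports no copy formats at all,
    which is the case nvme_copy_test has to tolerate.
    """
    return [fmt for fmt in range(16) if ocfs & (1 << fmt)]

print(supported_copy_formats(0b11))  # -> [0, 1]
print(supported_copy_formats(0))     # -> []
```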
Fix off by one errors and capacity allocation.

Small speedup for nvme_create_max_ns_test by reducing the I/O done by
`run_ns_io`. For this we introduce a new count parameter which can
override the default value of 10.

Signed-off-by: Dennis Maisenbacher <[email protected]>
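
The new count parameter can be pictured as a plain keyword argument whose
default preserves the old behavior. This is only a sketch of the idea (the
real `run_ns_io` issues actual I/O; here the loop just tallies bytes):

```python
def run_ns_io(lbads, count=10):
    """Sketch: 'issue' count chunks of block-sized I/O and return total bytes."""
    block_size = 1 << lbads
    total = 0
    for _ in range(count):
        total += block_size  # stand-in for one chunk of real namespace I/O
    return total

print(run_ns_io(12))           # default of 10 chunks -> 40960
print(run_ns_io(12, count=1))  # reduced I/O for nvme_create_max_ns_test -> 4096
```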
Run I/O after a successful ctrl reset.
This tests if the sqs and cqs are constructed after reset.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Set ncap and nsze to the lowest possible value, such that this test can
be run on different device capacities.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Specify mandatory namespace for Error Recovery (Feature Identifier 05h)
get features command.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Use the long option for a vendor specific id-ctrl instead of the verbose
flag 'v'.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Don't parse the smart log output just to print the (unsuccessfully) parsed
values. Instead, print the whole smart log output.

Signed-off-by: Dennis Maisenbacher <[email protected]>
@MaisenbacherD
Contributor Author

The checkpatch is failing with the following:

-------------------------------------------------------------
[13/13] Check commit - b75fb594d323ca1f5facf47c9062037969037bb4
-------------------------------------------------------------
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#18:
new file mode 100644

I am not sure if this is applicable since I can't spot a MAINTAINERS file 🤔

Check if the NVM 'compare' command is supported before running
nvme_compare_test, which would otherwise fail with an 'Invalid Command
Opcode'.

Signed-off-by: Dennis Maisenbacher <[email protected]>
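
Whether the controller implements the optional Compare command is advertised
in the ONCS field of `nvme id-ctrl`; to my understanding, bit 0 of ONCS
indicates Compare support (verify against your NVMe spec revision). A minimal,
illustrative check:

```python
def supports_compare(oncs):
    """True if ONCS bit 0 (Compare command supported) is set."""
    return bool(oncs & 0x1)

# A controller reporting ONCS with bit 0 clear would make nvme_compare_test
# fail with 'Invalid Command Opcode', so the test should be skipped:
print(supports_compare(0x5e))  # -> False
print(supports_compare(0x5f))  # -> True
```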
Look up whether the drive supports the `Get LBA Status` optional admin
command before executing an `nvme get-lba-status` or `nvme
lba-status-log` command.

Furthermore use the correct action value on `get-lba-status`.

Signed-off-by: Dennis Maisenbacher <[email protected]>
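
Similarly, support for the optional Get LBA Status admin command is advertised
in the OACS field of `nvme id-ctrl`; to my understanding this is OACS bit 9
(introduced with NVMe 1.4; verify against your spec revision). Sketch:

```python
GET_LBA_STATUS_BIT = 1 << 9  # OACS bit 9: Get LBA Status capability

def supports_get_lba_status(oacs):
    """True if the controller advertises the Get LBA Status admin command."""
    return bool(oacs & GET_LBA_STATUS_BIT)

print(supports_get_lba_status(0x200))  # -> True
print(supports_get_lba_status(0x017))  # -> False
```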
After a test has run, the `tearDown` function is called, which creates
and attaches a single ns with the full NVM capacity.
This is done so the caller or the next test case receives a reasonably
formatted drive.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Introduce a GitHub workflow which runs all test cases under the
`tests` directory on real hardware through a self-hosted runner.
This workflow is triggered nightly or on demand, since the tests take
about an hour to run.

Signed-off-by: Dennis Maisenbacher <[email protected]>
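
A nightly-plus-on-demand trigger like the one described here is typically
expressed in the workflow file as follows (a generic sketch; the actual cron
time in the PR may differ):

```yaml
on:
  schedule:
    - cron: '0 2 * * *'  # nightly; the exact time is an assumption
  workflow_dispatch:     # allows on-demand runs from the Actions tab
```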
@igaw
Collaborator

igaw commented Oct 29, 2024

You can ignore that checkpatch failure. We don't do the MAINTAINERS file thing, so any new file added to the project will trigger this message.

@igaw
Collaborator

igaw commented Oct 29, 2024

BTW, I've just remembered that we also have a Python binding for the library which provides the constants. Maybe we could use these in the future as well. Just an idea.

@igaw igaw merged commit 01abe63 into linux-nvme:master Oct 29, 2024
16 of 17 checks passed
@igaw
Collaborator

igaw commented Oct 29, 2024

Now, I have to figure out how to enable it :)

@MaisenbacherD
Contributor Author

Thanks for the review and merge :)
I will contact you to set up the VM and exchange the config token.
