
Add workflow to run tests on real hardware #2545

Merged — 13 commits, Oct 29, 2024
Conversation

MaisenbacherD
Contributor

This PR introduces a GitHub workflow that runs all test cases under the
tests directory on real hardware.

In preparation, the existing test cases have been fixed so that the new
run-tests workflow completes successfully (see the actions on my fork's
master branch).

The infrastructure that provides the self-hosted GitHub runner with real NVMe
devices attached to it is provided by Western Digital Corporation.
At a high level, this infrastructure consists of multiple storage nodes that
form a Kubernetes cluster. Within this cluster, KubeVirt virtual machines can
be spawned on demand with different hardware configurations. Those VMs are
owned by different users; one user group, isolated in its own namespace, is
the self-hosted GitHub runners.

The overall goal of this infrastructure is to provide device access to open
source projects for testing purposes. The first two pioneering candidates to
make use of this service are nvme-cli and ZenFS (see westerndigitalcorporation/zenfs#294).

The new self-hosted GitHub runner IaC is to be open-sourced.

If people want to see other NVMe devices available for testing in GitHub
workflows, we are happy to discuss details on how to move forward with
integrating new test devices.

Initial requirements for this self-hosted runner:

  • Share NVMe devices (both NVM and ZNS command sets. Western Digital
    Ultrastar® SN640 and Western Digital Ultrastar® ZN540 are provided)
  • Update the VM kernel nightly with Linus' master branch (through cron job)
  • Security:
    • Running in a dedicated KubeVirt VM, one per repository.
    • Restricted cluster network access: KubeVirt masquerade interface that only opens
      necessary ports.
    • Privileged code execution within the VM ('gh-runner' user). The VM will be reset
      after each workflow run.
    • Running in the gh-runner namespace with NetworkPolicies: ssh is allowed
      only from cluster-local IPs, https is allowed only to and from external
      addresses (no access to internal cluster services), and access to the VM
      image repository is allowed only from local IPs.
    • /etc/hosts.allow and /etc/hosts.deny rules for the gh-runner instances that
      only allow sshd access from cluster local IPs into the VM.
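
The nightly kernel update mentioned above could, for example, be driven by a
crontab entry along these lines (a sketch only; the script path, name, and
schedule are assumptions, not the actual infrastructure code):

```
# /etc/cron.d/update-kernel (hypothetical): rebuild and boot into
# Linus' master branch every night at 02:00
0 2 * * * root /usr/local/bin/update-kernel-from-linus-master.sh
```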

IMPORTANT: Configuration required by the GitHub repo that uses the self-hosted runner:

  • A single self-hosted runner instance (KubeVirt VM) is only allowed to serve
    one repository
  • Optionally: Prevent a group of people from allowing actions:
    Repo -> Settings -> Actions -> General -> Actions permissions
  • Prevent PRs from executing code on the self-hosted runner before getting approved:
    Repo -> Settings -> Actions -> General -> 'Require approval for all outside
    collaborators' -> Save
  • Restrict write access to the repository with the GITHUB_TOKEN:
    Repo -> Settings -> Actions -> General -> 'Read repository content and packages
    permissions' -> Save
  • Prevent GitHub Actions from creating or approving pull requests through the
    GITHUB_TOKEN:
    Repo -> Settings -> Actions -> General -> DISABLE 'Allow GitHub Actions to
    create and approve pull requests' -> Save
  • In reviews watch out for code injection and secret leaks within workflows
    (https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions)

In a GitHub workflow that uses a container on the self-hosted instance, the
block device has to be passed into the container:

...
    container:
      image: ghcr.io/igaw/linux-nvme/debian.python:latest
      options: '--device=/dev/nvme0n1:/dev/nvme0n1'
...
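
For context, a fuller job definition might look like the following. The image
name and `--device` option are taken from the snippet above; the runner label,
checkout step, and test invocation are assumptions, not the PR's actual
workflow file:

```yaml
jobs:
  run-tests:
    runs-on: self-hosted          # assumption: the WDC-provided runner label
    container:
      image: ghcr.io/igaw/linux-nvme/debian.python:latest
      options: '--device=/dev/nvme0n1:/dev/nvme0n1'
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: make test            # hypothetical invocation of the test suite
```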

Before this PR gets merged we should make sure that the self-hosted runner is
added to the repository, which requires a token exchange on a side channel. :)

@igaw
Collaborator

igaw commented Oct 28, 2024

Looks really good already. I think we should start using the json output for the new code and transform the old code afterwards when the CI is up and running. But if you think you would like first to get it running as it is and then change the tests accordingly, that's also fine with me.

This fixes "SyntaxWarning: invalid escape sequence" when `get_error_log`
is being called.

Signed-off-by: Dennis Maisenbacher <[email protected]>
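
The SyntaxWarning mentioned in this commit typically comes from a regular
expression written as a plain string literal; prefixing it with `r` makes it a
raw string, so backslashes reach the `re` module untouched. A minimal
illustration (not the PR's actual code):

```python
import re

# A plain string literal like "\d+" triggers
# "SyntaxWarning: invalid escape sequence '\d'" on recent Python versions,
# because \d is not a recognized string escape.
# A raw string avoids the warning:
pattern = r"\d+"

match = re.search(pattern, "Error Count: 42")
print(match.group())  # -> 42
```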
@MaisenbacherD
Contributor Author

Looks really good already. I think we should start using the json output for the new code and transform the old code afterwards when the CI is up and running. But if you think you would like first to get it running as it is and then change the tests accordingly, that's also fine with me.

Thanks for the review :) I changed the sections you commented on. Letting the tests run on my branch now.
I think it makes sense to use a container within the VM instead of running directly in the VM.
I will now experiment a bit with that before pushing this update here.

Calculate ncap with the configured flbas instead of setting a hard-coded
value that might be too big for the test device.

Also, `nvme list-ctrl` outputs a summary line of the ctrls present, which
must be skipped when parsing for the ctrl-id; otherwise this test fails.

Signed-off-by: Dennis Maisenbacher <[email protected]>
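
The idea of deriving ncap from the configured FLBAS rather than hard-coding it
can be sketched as follows. Function and field names are illustrative, not the
PR's actual code; the inputs correspond to fields reported by `nvme id-ns`:

```python
def lba_data_size(flbas, lbads_list):
    """Return the LBA data size in bytes for the selected LBA format.

    lbads_list holds the LBADS (log2 of the data size) value of each
    LBA format; the low nibble of FLBAS selects the active format.
    """
    lbads = lbads_list[flbas & 0xF]
    return 1 << lbads

def calc_ncap(total_capacity_bytes, flbas, lbads_list):
    """Number of logical blocks that fit into the available capacity."""
    return total_capacity_bytes // lba_data_size(flbas, lbads_list)

# Example: 1 GiB of capacity with a 4 KiB LBA format (LBADS = 12)
print(calc_ncap(1 << 30, 0, [12]))  # -> 262144
```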
Rework the `get_ocfs` function to correctly parse the ocfs field.
At the same time, refactor the `nvme id-ctrl` invocation that extracts the
ocfs field into separate functions. This is in preparation for following
commits that reuse the same functionality.

Furthermore, nvme_copy_test fails on devices that do not support any
copy formats. For the test to pass, we need to declare
`self.host_behavior_data` in any case.

Signed-off-by: Dennis Maisenbacher <[email protected]>
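
Parsing the OCFS (Optional Copy Formats Supported) field comes down to testing
individual bits of the value reported by `nvme id-ctrl`: bit n indicates
support for copy format n. A hedged sketch (names are illustrative, not the
PR's implementation):

```python
def supported_copy_formats(ocfs):
    """Return the copy format indexes whose bit is set in OCFS.

    An OCFS of 0 means the device supports no copy formats at all,
    which is the case nvme_copy_test has to tolerate.
    """
    return [fmt for fmt in range(16) if ocfs & (1 << fmt)]

print(supported_copy_formats(0b11))  # -> [0, 1]
print(supported_copy_formats(0))     # -> []
```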
Fix off by one errors and capacity allocation.

Small speedup for nvme_create_max_ns_test by reducing the I/O done by
`run_ns_io`. For this we introduce a new count parameter which can
override the default value of 10.

Signed-off-by: Dennis Maisenbacher <[email protected]>
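
The new count parameter can be pictured as a plain keyword argument whose
default preserves the old behavior. This is only a sketch of the idea (the
real `run_ns_io` issues actual I/O; here the loop just tallies bytes):

```python
def run_ns_io(lbads, count=10):
    """Sketch: 'issue' count chunks of block-sized I/O and return total bytes."""
    block_size = 1 << lbads
    total = 0
    for _ in range(count):
        total += block_size  # stand-in for one chunk of real namespace I/O
    return total

print(run_ns_io(12))           # default of 10 chunks -> 40960
print(run_ns_io(12, count=1))  # reduced I/O for nvme_create_max_ns_test -> 4096
```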
Run I/O after a successful ctrl reset.
This tests if the sqs and cqs are constructed after reset.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Set ncap and nsze to the lowest possible value, such that this test can
be run on different device capacities.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Specify mandatory namespace for Error Recovery (Feature Identifier 05h)
get features command.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Use the long option for a vendor specific id-ctrl instead of the verbose
flag 'v'.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Don't parse the smart log output just to print the (unsuccessfully) parsed
values. Instead, print the whole smart log output.

Signed-off-by: Dennis Maisenbacher <[email protected]>
@MaisenbacherD
Contributor Author

The checkpatch is failing with the following:

-------------------------------------------------------------
[13/13] Check commit - b75fb594d323ca1f5facf47c9062037969037bb4
-------------------------------------------------------------
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#18:
new file mode 100644

I am not sure if this is applicable since I can't spot a MAINTAINERS file 🤔

Check if the NVM 'compare' command is supported before running
nvme_compare_test, which would otherwise fail with an 'Invalid Command
Opcode'.

Signed-off-by: Dennis Maisenbacher <[email protected]>
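
Whether the controller implements the optional Compare command is advertised
in the ONCS field of `nvme id-ctrl`; to my understanding, bit 0 of ONCS
indicates Compare support (verify against your NVMe spec revision). A minimal,
illustrative check:

```python
def supports_compare(oncs):
    """True if ONCS bit 0 (Compare command supported) is set."""
    return bool(oncs & 0x1)

# A controller reporting ONCS with bit 0 clear would make nvme_compare_test
# fail with 'Invalid Command Opcode', so the test should be skipped:
print(supports_compare(0x5e))  # -> False
print(supports_compare(0x5f))  # -> True
```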
Look up whether the drive supports the `Get LBA Status` optional admin
command before executing an `nvme get-lba-status` or `nvme
lba-status-log` command.

Furthermore use the correct action value on `get-lba-status`.

Signed-off-by: Dennis Maisenbacher <[email protected]>
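
Similarly, support for the optional Get LBA Status admin command is advertised
in the OACS field of `nvme id-ctrl`; to my understanding this is OACS bit 9
(introduced with NVMe 1.4; verify against your spec revision). Sketch:

```python
GET_LBA_STATUS_BIT = 1 << 9  # OACS bit 9: Get LBA Status capability

def supports_get_lba_status(oacs):
    """True if the controller advertises the Get LBA Status admin command."""
    return bool(oacs & GET_LBA_STATUS_BIT)

print(supports_get_lba_status(0x200))  # -> True
print(supports_get_lba_status(0x017))  # -> False
```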
After a test has run, the `tearDown` function is called, which creates
and attaches a single ns with the full NVM capacity.
This is done so the caller or the next test case receives a reasonably
formatted drive.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Introduce a GitHub workflow which runs all test cases under the
`tests` directory on real hardware through a self-hosted runner.
This workflow is triggered nightly or on demand, since the tests take
about an hour to run.

Signed-off-by: Dennis Maisenbacher <[email protected]>
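
A nightly-plus-on-demand trigger like the one described here is typically
expressed in the workflow file as follows (a generic sketch; the actual cron
time in the PR may differ):

```yaml
on:
  schedule:
    - cron: '0 2 * * *'  # nightly; the exact time is an assumption
  workflow_dispatch:     # allows on-demand runs from the Actions tab
```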
@igaw
Collaborator

igaw commented Oct 29, 2024

You can ignore that checkpatch failure. We don't do the MAINTAINERS file thing, so any new file added to the project will trigger this message.

@igaw
Collaborator

igaw commented Oct 29, 2024

BTW, I've just remembered that we also have a Python binding for the library which provides the constants. Maybe we could use these in the future as well. Just an idea.

@igaw igaw merged commit 01abe63 into linux-nvme:master Oct 29, 2024
16 of 17 checks passed
@igaw
Collaborator

igaw commented Oct 29, 2024

Now, I have to figure out how to enable it :)

@MaisenbacherD
Contributor Author

Thanks for the review and merge :)
I will contact you to set up the VM and exchange the config token.
