
E2E CI Test for Operator Bundle #28

Merged

Conversation

adambkaplan
Member

@adambkaplan adambkaplan commented Nov 2, 2021

Changes

This change augments the e2e test suite to simulate the operator's deployment using OLM.
The setup consists of the following components:

  • A docker/distribution container registry, running in docker outside of any Kubernetes cluster.
  • A KinD cluster that is configured to resolve image refs which use "localhost" as the image registry domain.
  • OLM installed on the KinD cluster.
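The "localhost" resolution in the second bullet can be sketched as a KinD cluster config following the kind local-registry recipe. The registry container name `kind-registry` and port `5000` are assumptions, not necessarily the values used in this PR's scripts:

```yaml
# Hypothetical KinD config: containerd treats "localhost:5000" as a mirror of
# a registry container named "kind-registry" on the kind docker network.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors."localhost:5000"]
    endpoint = ["http://kind-registry:5000"]
```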

Once set up, the operator and its associated OLM bundle are built and pushed to the local container registry.
Next, an OLM catalog is built based on the catalog published in operatorhub.io.
The catalog is what allows OLM to find the Tekton operator that Shipwright depends on, and is likewise pushed to the local container registry.

Building the operator, bundle, and catalog with a fully on-cluster registry is problematic for several reasons:

  • Not all tools can push to the on-cluster registry in this fashion
  • Manifests need to be rewritten to reference the on-cluster DNS name for the registry
  • The catalog source needs to be pullable within the cluster.

The test runs as follows:

  • Create a namespace to run the operator under test
  • Create a CatalogSource using the catalog containing the operator under test.
  • Create an OperatorGroup which allows AllNamespace operators to be installed in the given namespace.
  • Create a Subscription to install the Shipwright operator and its associated Tekton operator.
  • Verify that the shipwright operator deploys successfully.
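The middle three steps above can be sketched as OLM manifests. This is a hedged illustration: the object names, namespace, channel, and catalog image are assumptions, not the values used in this PR:

```yaml
# Hypothetical CatalogSource pointing OLM at the locally built catalog image.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: shipwright-operator-catalog
  namespace: shipwright-operator
spec:
  sourceType: grpc
  image: localhost:5000/shipwright-operator-catalog:latest
---
# An OperatorGroup with an empty spec (no targetNamespaces) permits
# AllNamespaces-mode operators to be installed in this namespace.
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: shipwright-operator-group
  namespace: shipwright-operator
---
# Hypothetical Subscription; OLM resolves the operator (and its Tekton
# dependency) from the CatalogSource above.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: shipwright-operator
  namespace: shipwright-operator
spec:
  channel: alpha
  name: shipwright-operator
  source: shipwright-operator-catalog
  sourceNamespace: shipwright-operator
```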

See also:

  • https://kind.sigs.k8s.io/docs/user/local-registry/
  • https://olm.operatorframework.io/docs/tasks/creating-a-catalog/
  • https://olm.operatorframework.io/docs/tasks/make-catalog-available-on-cluster/
  • https://olm.operatorframework.io/docs/tasks/install-operator-with-olm/
  • https://olm.operatorframework.io/docs/advanced-tasks/operator-scoping-with-operatorgroups/
/kind cleanup

Submitter Checklist

  • Includes tests if functionality changed/was added
  • Includes docs if changes are user-facing
  • Set a kind label on this PR
  • Release notes block has been filled in, or marked NONE

See the contributor guide
for details on coding conventions, github and prow interactions, and the code review process.

Release Notes

Add CI and developer documentation on how to deploy the operator using OLM.

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Nov 2, 2021
@openshift-ci openshift-ci bot requested review from mattcui and zhangtbj November 2, 2021 14:48
@adambkaplan adambkaplan force-pushed the bundle-build-ci branch 5 times, most recently from ef0d8a4 to 8caa8ca Compare November 5, 2021 21:05
@adambkaplan adambkaplan changed the title WIP - Verify operator and bundle work in CI testing E2E CI Test for Operator Bundle Nov 8, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 8, 2021
@adambkaplan
Member Author

/hold

Should merge after #30

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 8, 2021
@gabemontero
Member

how hard would it be to enable either PSP or the PSP follow-on to turn on enforcement in our kind cluster, so that running as non-root is proven out?

If it is too much to take on with this PR, can you open an issue to track doing that longer term?

@gabemontero gabemontero mentioned this pull request Nov 8, 2021
@adambkaplan
Member Author

how hard would it be to enable either PSP or the PSP follow-on to turn on enforcement in our kind cluster, so that running as non-root is proven out?

@gabemontero are you referring to PodSecurityPolicy (deprecated) or the new PodSecurity admission plugin (which has enforcement mechanisms for the new Pod Security Standards)? For the latter, I would prefer we add it in a follow-up PR, as the plugin is only available in k8s 1.22.

@gabemontero
Member

how hard would it be to enable either PSP or the PSP follow-on to turn on enforcement in our kind cluster, so that running as non-root is proven out?

@gabemontero are you referring to PodSecurityPolicy (deprecated) or the new PodSecurity admission plugin (which has enforcement mechanisms for the new Pod Security Standards)? For the latter, I would prefer we add it in a follow-up PR, as the plugin is only available in k8s 1.22.

Referring to both. You could either

  1. do deprecated PSP now, then switch to the plugin when we go to 1.22
  2. not do PSP now (but we have no upstream validation in the meantime) and wait until 1.22 to do the plugin

If you do 1) now, great. If you want to open a tracking item to do 1) at some point in the future, outside of this PR, ok; or if you open a tracking item to do 2), ok.

As long as one of those 3 possible actions is taken, I'm good

@adambkaplan
Member Author

I don't think we need PodSecurityPolicy to ensure that we don't run pods as root - securityContext.runAsNonRoot enforces that the containers don't run as root at the kubelet level. A previous iteration of the runAsNonRoot PR failed because the rbac proxy ran as root on the KinD cluster. That version didn't fail on OpenShift because it has an admission webhook that sets an arbitrarily high runAsUser value on most pods/containers.

I can file an issue to enable the PodSecurity plugin when we upgrade to 1.22, and furthermore enforce the restricted profile for the operator.
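The two enforcement mechanisms discussed above can be sketched as follows. All names here are hypothetical, and the namespace label form requires the PodSecurity admission plugin available from k8s 1.22:

```yaml
# kubelet-level enforcement: with runAsNonRoot, the kubelet refuses to start
# any container in this pod that would run as UID 0.
apiVersion: v1
kind: Pod
metadata:
  name: example-operator-pod        # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true
  containers:
  - name: operator
    image: localhost:5000/operator:latest   # hypothetical image
---
# Namespace-level enforcement via the PodSecurity admission plugin (k8s 1.22+),
# enforcing the "restricted" Pod Security Standard profile.
apiVersion: v1
kind: Namespace
metadata:
  name: shipwright-operator         # hypothetical name
  labels:
    pod-security.kubernetes.io/enforce: restricted
```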

@adambkaplan
Member Author

Filed #33

@adambkaplan
Member Author

/hold cancel

#30 merged.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2021
@adambkaplan
Member Author

/approve

Self-approving

@openshift-ci
Contributor

openshift-ci bot commented Nov 17, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adambkaplan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 17, 2021
Member

@otaviof otaviof left a comment


I think we should use the same tools across all projects in the Shipwright organization, such as the hack scripts we need for KinD, a local container registry, and the like. Currently we duplicate those scripts on each new iteration, with slight differences in each place, making future maintenance less predictable.

I also miss having the Makefile as the "entrypoint" for (almost) all automation we need in a given project. The Makefile acts as the "source of authority" for automation, keeps environment variables under control, and serves as a place to document and invoke our automation inventory. Likewise, we should try to have the same Makefile targets across all projects in the organization, as much as possible.

So, I think we should find a way to avoid duplicating our CI scripts, and make sure the Makefile is the single place to trigger automation in the project. That could be something this PR tackles, or a future improvement.

.github/workflows/ci.yml
@adambkaplan
Member Author

I think we should use the same tools across all projects in the Shipwright organization, such as the hack scripts we need for KinD, a local container registry, and the like. Currently we duplicate those scripts on each new iteration, with slight differences in each place, making future maintenance less predictable.

I agree that this will become a challenge over time. It sounds like we are ready to build some test-infra tooling!

@adambkaplan
Member Author

Filed shipwright-io/community#44 to discuss how to approach common tooling.

Member

@otaviof otaviof left a comment


Some minor comments on scripting, but other than that it looks really good 👍🏼

attempts=1
while [[ ${attempts} -le 10 ]]; do
echo "Checking the status of the operator rollout - attempt ${attempts}"
if ${k8s} rollout status deployment "${namePrefix}operator" -n "${subNamespace}"; then
Member


Here you can use the option --timeout and let Kubernetes client handle the retries for you. We do the same on the CLI, please consider.

With the timeout option in place, we may also simplify the script not having to handle the attempt for-loop.

Member Author


I decided to use a single function to wait on pod status.

hack/run-operator-catalog.sh
Comment on lines 80 to 82
# Pod may not exist, in which case wait 30 seconds and try again
${KUBECTL_BIN} wait --for=condition=Ready pod -l "${label}" -n "${namespace}" --timeout "${timeout}" || \
sleep 30 && ${KUBECTL_BIN} wait --for=condition=Ready pod -l "${label}" -n "${namespace}" --timeout "${timeout}"
Member Author


The 30 second gap before the retry is needed because OLM does a lot of work to deploy the operator, and no pods may exist when the initial wait call is made.
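This wait-then-retry pattern can be factored into a small generic helper. This is a sketch: `retry` and its arguments are hypothetical names, not the functions actually used in hack/run-operator-catalog.sh:

```shell
#!/usr/bin/env bash
set -euo pipefail

# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, sleeping DELAY
# seconds between failed attempts; return non-zero after ATTEMPTS failures.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    echo "Attempt ${i}/${attempts} failed; retrying in ${delay}s" >&2
    sleep "${delay}"
  done
  return 1
}
```

With kubectl's own `--timeout` flag on `rollout status` or `wait`, most hand-rolled loops disappear; a helper like this is only needed when the resource may not exist yet, as in the OLM case above.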

@adambkaplan
Member Author

bump @otaviof


echo "# Using KinD context..."
${KUBECTL_BIN} config use-context "kind-kind"
cho "# KinD nodes:"
Member


Have we lost the letter e on echo?

Member Author


Yes - and not adding set -e here let it pass through 🤦

Member

@otaviof otaviof left a comment


Thanks, very good additions to this project!

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 3, 2021
This change augments the e2e test suite to simulate the
operator's deployment using OLM. The setup consists of the
following components:

- A docker/distribution container registry, running in
  docker outside of any Kubernetes cluster.
- A KinD cluster that is configured to resolve image refs
  which use "localhost" as the image registry domain.
- OLM installed on the KinD cluster.

Once set up, the operator and its associated OLM bundle are
built and pushed to the local container registry. Next,
an OLM catalog is built based on the catalog published in
operatorhub.io. The catalog is what allows OLM to find the
Tekton operator that Shipwright depends on, and is likewise
pushed to the local container registry.

Building the operator, bundle, and catalog with a fully on-cluster
registry is problematic for several reasons:

- Not all tools can push to the on-cluster registry in this fashion
- Manifests need to be rewritten to reference the on-cluster DNS name
for the registry
- The catalog source needs to be pullable within the cluster.

The test runs as follows:

- Create a namespace to run the operator under test
- Create a CatalogSource using the catalog containing the
  operator under test.
- Create an OperatorGroup which allows AllNamespace operators
  to be installed in the given namespace.
- Create a Subscription to install the Shipwright operator
  and its associated Tekton operator.
- Verify that the shipwright operator deploys successfully.

Contributor documentation has also been updated so that
developers can run this process using make commands on their
Kubernetes cluster of choice.

See also:
- https://kind.sigs.k8s.io/docs/user/local-registry/
- https://olm.operatorframework.io/docs/tasks/creating-a-catalog/
- https://olm.operatorframework.io/docs/tasks/make-catalog-available-on-cluster/
- https://olm.operatorframework.io/docs/tasks/install-operator-with-olm/
- https://olm.operatorframework.io/docs/advanced-tasks/operator-scoping-with-operatorgroups/
Use the `operatorhub/catalog_sa` image as the base for the catalog
index. The default operatorhub catalog appears to have a root-owned file
that causes `opm index add` to fail.

See operator-framework/operator-registry#870
o.MatchError will panic if a k8s NotFound error is returned. This is
fixed by checking that a NotFound error is raised separately from the
gomega match.
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 3, 2021
@openshift-ci
Contributor

openshift-ci bot commented Dec 3, 2021

New changes are detected. LGTM label has been removed.

@adambkaplan adambkaplan added the lgtm Indicates that a PR is ready to be merged. label Dec 6, 2021
@adambkaplan
Member Author

Re-tagged @otaviof's lgtm (I needed to push a minor fix for an issue that caused tests to break).

@openshift-merge-robot openshift-merge-robot merged commit ea9571c into shipwright-io:main Dec 6, 2021