Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-3009: Prune stale CCRs before aggregating scan results #221

Merged
merged 1 commit into from
Nov 14, 2023

Conversation

rhmdnd
Copy link

@rhmdnd rhmdnd commented Feb 16, 2023

Previously, the compliance operator would leave CCRs around and then
just overwrite them on subsequent scans. While the most recent scan data
was accurate, because it was overwriting existing check results, it gave
the impression that some changes weren't taking effect.

For example, if you create a tailored profile, run a scan, exclude a
rule, and rerun the scan, it appears the change you just made never took
effect because the result from the rule you ignored still exists.

To avoid this, let's check for any check results at scan time and make
sure we clean them up before we aggregate the new results.

@openshift-ci-robot
Copy link
Collaborator

@rhmdnd: This pull request references Jira Issue OCPBUGS-3009, which is invalid:

  • expected the bug to target the "4.13.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Previously, the compliance operator would leave CCRs around and then
just overwrite them on subsequent scans. While the most recent scan data
was accurate, because it was overwriting existing check results, it gave
the impression that some changes weren't taking effect.

For example, if you create a tailored profile, run a scan, exclude a
rule, and rerun the scan, it appears the change you just made never took
effect because the result from the rule you ignored still exists.

To avoid this, let's check for any check results at scan time and make
sure we clean them up before we aggregate the new results.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from jhrozek and xiaojiey February 16, 2023 22:18
@rhmdnd
Copy link
Author

rhmdnd commented Feb 16, 2023

This still needs some testing but I'm waiting on a cluster.

@@ -3,7 +3,7 @@ resources:

images:
- name: compliance-operator
newName: quay.io/compliance-operator/compliance-operator
newTag: latest
newName: image-registry.openshift-image-registry.svc:5000/openshift/compliance-operator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you remove this change from the commit?

@Vincent056
Copy link

Vincent056 commented Feb 16, 2023

I think it makes sense to remove CCR before aggregating phase, but I wonder if we should do that during phaseDoneHandler, when someone labels a scan to be rescanned. maybe in here: https://github.com/ComplianceAsCode/compliance-operator/pull/221/files#diff-7aba53b74d27b417a478166b1a983b145fb0e381975e97097b7794752b51edeeR612

@xiaojiey
Copy link
Collaborator

/hold for test

@xiaojiey
Copy link
Collaborator

xiaojiey commented Feb 23, 2023

@rhmdnd,
Verified with 4.13.0-0.nightly-2023-02-23-000625 and the code in the PR. seems it is NOT working as expected. Could you please help to check? Thanks.
Details:

1. Deploy compliance operator with code in the PR
2. Create a tp with a rule disabled in the tp:
oc apply -f -<<EOF
apiVersion: compliance.openshift.io/v1alpha1
kind: TailoredProfile
metadata:
  annotations:
    compliance.openshift.io/product-type: Platform
  name: test
  namespace: openshift-compliance
spec:
  description: test
  extends: ocp4-cis
  disableRules:
  - name: ocp4-kubelet-enable-streaming-connections
    rationale: we only want to test this rule
  title: test
EOF
tailoredprofile.compliance.openshift.io/test created
3. Create a ssb with above tp and check results:
$ oc compliance bind -N test1 -S default tailoredprofile/test
Creating ScanSettingBinding test1
$ oc get suite -w
NAME    PHASE     RESULT
test1   RUNNING   NOT-AVAILABLE
test1   AGGREGATING   NOT-AVAILABLE
test1   DONE          NON-COMPLIANT
test1   DONE          NON-COMPLIANT
^C
$ oc get ccr | grep kubelet-enable-streaming-connections
$ oc get ccr | grep kubelet-eviction-thresholds-set-hard-imagefs-available
test-kubelet-eviction-thresholds-set-hard-imagefs-available    PASS     medium
4. configure the tp with one more rule disabled:
$ oc apply -f -<<EOF
apiVersion: compliance.openshift.io/v1alpha1
kind: TailoredProfile
metadata:
  annotations:
    compliance.openshift.io/product-type: Platform
  name: test
  namespace: openshift-compliance
spec:
  description: test
  extends: ocp4-cis
  disableRules:
  - name: ocp4-kubelet-enable-streaming-connections
    rationale: we only want to test this rule
  - name: ocp4-kubelet-eviction-thresholds-set-hard-imagefs-available
    rationale: we only want to test this rule
  title: test
EOF
tailoredprofile.compliance.openshift.io/test configured
5. Tried to rerun the ssb directly and check the result:
$ oc compliance rerun-now scansettingbinding test1
Rerunning scans from 'test1': test
Re-running scan 'openshift-compliance/test'
[xiyuan@MiWiFi-RA69-srv compliance-operator (pr-221)]$ oc get suite -w
NAME    PHASE     RESULT
test1   RUNNING   NOT-AVAILABLE
test1   AGGREGATING   NOT-AVAILABLE
test1   DONE          NON-COMPLIANT
test1   DONE          NON-COMPLIANT
^C
$ oget ccr | grep kubelet-enable-streaming-connections
$ oc get ccr | grep kubelet-eviction-thresholds-set-hard-imagefs-available
test-kubelet-eviction-thresholds-set-hard-imagefs-available    PASS     medium
6. Create another ssb with the tp:
$ oc compliance bind -N test2 -S default tailoredprofile/test
Creating ScanSettingBinding test2
$ oc get suite -w
NAME    PHASE   RESULT
test1   DONE    NON-COMPLIANT
test2   DONE    NON-COMPLIANT
^C
$ oc get ccr | grep kubelet-enable-streaming-connections
$ oc get ccr | grep kubelet-eviction-thresholds-set-hard-imagefs-available
test-kubelet-eviction-thresholds-set-hard-imagefs-available    PASS     medium

@rhmdnd
Copy link
Author

rhmdnd commented Mar 9, 2023

I think this PR still needs some work, including an e2e test. I was in the process of adding a test and then discovered other issues in the test framework that were interfering with my test (specifically around resource cleanup).

@rhmdnd rhmdnd force-pushed the OCPBUGS-3009 branch 2 times, most recently from b238c94 to 75e89f2 Compare April 13, 2023 19:59
@rhmdnd
Copy link
Author

rhmdnd commented Apr 13, 2023

I was able to reproduce with the latest test.

tests/e2e/framework/client.go Outdated Show resolved Hide resolved
@rhmdnd
Copy link
Author

rhmdnd commented Apr 13, 2023

/retest

@rhmdnd
Copy link
Author

rhmdnd commented Apr 14, 2023

@xiaojiey I got a clean parallel run with the test. Should be good for another pre-merge validation test.

@rhmdnd rhmdnd requested review from mkumku and sheriff-rh April 14, 2023 03:37
@rhmdnd
Copy link
Author

rhmdnd commented Apr 14, 2023

/retest e2e-aws-serial

Retest due to infrastructure timeouts.

@openshift-ci
Copy link

openshift-ci bot commented Apr 14, 2023

@rhmdnd: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

  • /test e2e-aws-parallel
  • /test e2e-aws-serial
  • /test go-build
  • /test images
  • /test unit
  • /test verify

Use /test all to run all jobs.

In response to this:

/retest e2e-aws-serial

Retest due to infrastructure timeouts.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Collaborator

@mkumku mkumku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thank you!

@rhmdnd
Copy link
Author

rhmdnd commented Apr 14, 2023

/retest e2e-aws-serial

@rhmdnd rhmdnd removed the qe-approved label Nov 8, 2023
@openshift-ci-robot
Copy link
Collaborator

@rhmdnd: This pull request references Jira Issue OCPBUGS-3009, which is invalid:

  • expected the bug to target the "4.15.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Previously, the compliance operator would leave CCRs around and then
just overwrite them on subsequent scans. While the most recent scan data
was accurate, because it was overwriting existing check results, it gave
the impression that some changes weren't taking effect.

For example, if you create a tailored profile, run a scan, exclude a
rule, and rerun the scan, it appears the change you just made never took
effect because the result from the rule you ignored still exists.

To avoid this, let's check for any check results at scan time and make
sure we clean them up before we aggregate the new results.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xiaojiey
Copy link
Collaborator

xiaojiey commented Nov 9, 2023

/hold

Copy link

@Vincent056 Vincent056 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
just some inline comments but I think we can add that in another pr

@openshift-ci openshift-ci bot added the lgtm label Nov 9, 2023
@openshift-ci openshift-ci bot removed the lgtm label Nov 9, 2023
Copy link

@Vincent056 Vincent056 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Nov 9, 2023
@xiaojiey
Copy link
Collaborator

Pre-merge testing passed with 4.14.0-0.nightly-2023-11-09-092851 and code in #221

1. Create a tp and create a ssb with the tailoredprofile:

$ cat tp2.yaml 
apiVersion: compliance.openshift.io/v1alpha1
kind: TailoredProfile
metadata:
  name: test
  namespace: openshift-compliance
spec:
  description: test
  title: test
  enableRules:
    - name: ocp4-api-server-anonymous-auth
      rationale: platform
    - name: ocp4-api-server-api-priority-gate-enabled
      rationale: platform
    - name: ocp4-api-server-audit-log-maxbackup
      rationale: platform
    - name: ocp4-api-server-audit-log-maxsize
      rationale: platform
    - name: ocp4-api-server-encryption-provider-cipher
      rationale: platform
$ oc apply -f tp2.yaml 
tailoredprofile.compliance.openshift.io/test created
]$ oc get tp
NAME   STATE
test   READY
$ oc compliance bind -N test tailoredprofile/test
Creating ScanSettingBinding test
$ oc compliance bind -N test tailoredprofile/test
Creating ScanSettingBinding test
$ oc get suite
NAME   PHASE     RESULT
test   DONE          NON-COMPLIANT
$ oc get ccr
NAME                                         STATUS   SEVERITY
test-api-server-anonymous-auth               PASS     medium
test-api-server-api-priority-gate-enabled    FAIL     medium
test-api-server-audit-log-maxbackup          PASS     low
test-api-server-audit-log-maxsize            PASS     medium
test-api-server-encryption-provider-cipher   FAIL     medium
$ oc get cr
NAME                                         STATE
test-api-server-encryption-provider-cipher   NotApplied
2. Edit to replace the rule ocp4-api-server-encryption-provider-cipher with ocp4-audit-profile-set, and rerun the ssb. Check the result for newly added rule is created, the removed rule has been remove from the ccr and cr.

$ oc get tp test -o yaml
...
spec:
  description: test
  enableRules:
  - name: ocp4-api-server-anonymous-auth
    rationale: platform
  - name: ocp4-api-server-api-priority-gate-enabled
    rationale: platform
  - name: ocp4-api-server-audit-log-maxbackup
    rationale: platform
  - name: ocp4-api-server-audit-log-maxsize
    rationale: platform
  - name: ocp4-audit-profile-set
    rationale: platform
  title: test
status:
  id: xccdf_compliance.openshift.io_profile_test
  outputRef:
    name: test-tp
    namespace: openshift-compliance
  state: READY
$ oc compliance rerun-now scansettingbinding test
Rerunning scans from 'test': test
Re-running scan 'openshift-compliance/test'
$ oc get suite
NAME   PHASE     RESULT
test   DONE          NON-COMPLIANT
$ oc get ccr
NAME                                        STATUS   SEVERITY
test-api-server-anonymous-auth              PASS     medium
test-api-server-api-priority-gate-enabled   FAIL     medium
test-api-server-audit-log-maxbackup         PASS     low
test-api-server-audit-log-maxsize           PASS     medium
test-audit-profile-set                      PASS     medium
$ oc get ccr --no-headers |wc -l
5
$ oc get cr
No resources found in openshift-compliance namespace.
##3. Add "extends: ocp4-cis" in the tp, and rerun the test result:

$ oc get tp test -o=jsonpath={.spec} | jq -r
{
  "description": "test",
  "enableRules": [
    {
      "name": "ocp4-api-server-anonymous-auth",
      "rationale": "platform"
    },
    {
      "name": "ocp4-api-server-api-priority-gate-enabled",
      "rationale": "platform"
    },
    {
      "name": "ocp4-api-server-audit-log-maxbackup",
      "rationale": "platform"
    },
    {
      "name": "ocp4-api-server-audit-log-maxsize",
      "rationale": "platform"
    },
    {
      "name": "ocp4-audit-profile-set",
      "rationale": "platform"
    }
  ],
  "extends": "ocp4-cis",
  "title": "test"
}
$ oc compliance rerun-now scansettingbinding test
Rerunning scans from 'test': test
Re-running scan 'openshift-compliance/test'
$ oc get suite -w
NAME   PHASE     RESULT
test   DONE          NON-COMPLIANT
$ oc get ccr --no-headers |wc -l
88
$ oc get ccr | head
NAME                                                          STATUS   SEVERITY
test-accounts-restrict-service-account-tokens                 MANUAL   medium
test-accounts-unique-service-account                          MANUAL   medium
test-api-server-admission-control-plugin-alwaysadmit          PASS     medium
test-api-server-admission-control-plugin-alwayspullimages     PASS     high
test-api-server-admission-control-plugin-namespacelifecycle   PASS     medium
test-api-server-admission-control-plugin-noderestriction      PASS     medium
test-api-server-admission-control-plugin-scc                  PASS     medium
test-api-server-admission-control-plugin-service-account      PASS     medium
test-api-server-anonymous-auth                                PASS     medium

@xiaojiey
Copy link
Collaborator

/label qe-approved

@openshift-ci openshift-ci bot added qe-approved and removed lgtm labels Nov 10, 2023
@Vincent056
Copy link

I was trying to resolve the merge conflicts through github UI and then it got the master to merge to the branch to commit here

Previously, the compliance operator would leave CCRs around and then
just overwrite them on subsequent scans. While the most recent scan data
was accurate, because it was overwriting existing check results, it gave
the impression that some changes weren't taking effect.

For example, if you create a tailored profile, run a scan, exclude a
rule, and rerun the scan, it appears the change you just made never took
effect because the result from the rule you ignored still exists.

To avoid this, let's prune stale results when we aggregate new results.
@rhmdnd
Copy link
Author

rhmdnd commented Nov 10, 2023

Removed the hold flag since we have QE approval.

I also rebased locally to resolve the merge conflict with master in the same patch.

Thanks for all the reviews folks.

Copy link
Member

@yuumasato yuumasato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link

openshift-ci bot commented Nov 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhrozek, rhmdnd, Vincent056, yuumasato

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Vincent056,jhrozek,rhmdnd]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Vincent056
Copy link

/jira refresh

@openshift-ci-robot
Copy link
Collaborator

@Vincent056: This pull request references Jira Issue OCPBUGS-3009, which is invalid:

  • expected the bug to target the "4.15.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-bot openshift-merge-bot bot merged commit ebcd30a into ComplianceAsCode:master Nov 14, 2023
6 checks passed
@openshift-ci-robot
Copy link
Collaborator

@rhmdnd: Jira Issue OCPBUGS-3009: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-3009 has been moved to the MODIFIED state.

In response to this:

Previously, the compliance operator would leave CCRs around and then
just overwrite them on subsequent scans. While the most recent scan data
was accurate, because it was overwriting existing check results, it gave
the impression that some changes weren't taking effect.

For example, if you create a tailored profile, run a scan, exclude a
rule, and rerun the scan, it appears the change you just made never took
effect because the result from the rule you ignored still exists.

To avoid this, let's check for any check results at scan time and make
sure we clean them up before we aggregate the new results.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

rhmdnd added a commit to rhmdnd/compliance-operator that referenced this pull request Nov 28, 2023
A recent change improved how the aggregator pod handled compliance check
results, by allowing it to find all existing results, and prune results
that were stale. This makes the state of the Compliance Check Results
consistent with the latest run:

  ComplianceAsCode#221

To do this though, we needed to give the aggregator pod permissions to
list and delete Compliance Check Results. But, in that patch we forgot
to update the bundle build to include those new permissions. This means
bundle installs are currently broken for all scans because the
aggregator pod gets stuck in a crashloop, due to failing permissions.

This commit updates the manifest for the bundles so that bundle installs
work again.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants