
Adding felix service metric port #3534

Merged · 8 commits · Nov 23, 2024

Conversation

vikastigera (Contributor)

Description

Changes to add the felix metrics service port to the calico-node service. This removes the manual step otherwise required by the client to enable felix metrics for BYO Prometheus.

https://tigera.atlassian.net/browse/EV-5305

Additional changes are required to update felix-metrics-service-monitor.yaml to the following:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    team: network-operators
  name: felix-metrics
spec:
  endpoints:
    - honorLabels: true
      interval: 5s
      port: felix-metrics-port
      scheme: http
      scrapeTimeout: 5s
  namespaceSelector:
    matchNames:
      - calico-system
  selector:
    matchLabels:
      k8s-app: calico-node
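For context, the new port is exposed on the existing calico-node metrics service rather than a separate one. A sketch of what the resulting Service could look like (illustrative only; the service name and port numbers here are assumptions, not taken from this PR):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: calico-node-metrics    # assumed name, for illustration
  namespace: calico-system
spec:
  selector:
    k8s-app: calico-node
  ports:
    - name: calico-metrics-port
      port: 9081               # assumed
      targetPort: 9081
    - name: felix-metrics-port # the port this PR adds
      port: 9091               # assumed
      targetPort: 9091
```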

Testing

Screenshot from 2024-10-09 14-41-06
Screenshot from 2024-10-09 15-23-01

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files
  • If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

tmjd (Member) commented Oct 10, 2024

There is a PrometheusMetricsPort in FelixConfig that should be used instead of always using a built-in default value.

rene-dekker (Member) commented Oct 10, 2024

@tmjd
There is a PrometheusMetricsPort in FelixConfig that should be used instead of always using a built-in default value.

There is also a migration in migration/core.go that sets installation.spec.NodeMetricsPort. Should we perhaps:

  • In the core_controller, verify installation.spec.NodeMetricsPort and store it in a new variable "nodeMetricsPort".
  • If that value is not set, use *felixConfiguration.Spec.PrometheusMetricsPort if it exists.
  • Pass the new variable to the NodeConfiguration object.
  • In node.go, if NodeConfiguration.NodeMetricsPort is not nil, add it to the service.
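A minimal sketch of that fallback order, assuming pointer fields as in the operator's API types (the struct and function names below are illustrative stubs, and the default port value is an assumption, not taken from this PR):

```go
package main

import "fmt"

// Illustrative stubs; the real types live in the operator's API packages.
type installationSpec struct{ NodeMetricsPort *int32 }
type felixConfigSpec struct{ PrometheusMetricsPort *int32 }

// Assumed built-in default, for illustration only.
const defaultFelixMetricsPort int32 = 9091

// resolveMetricsPort prefers installation.spec.NodeMetricsPort, then
// felixConfiguration.spec.PrometheusMetricsPort, then the built-in default.
func resolveMetricsPort(inst installationSpec, felix felixConfigSpec) int32 {
	if inst.NodeMetricsPort != nil {
		return *inst.NodeMetricsPort
	}
	if felix.PrometheusMetricsPort != nil {
		return *felix.PrometheusMetricsPort
	}
	return defaultFelixMetricsPort
}

func main() {
	felixPort := int32(9095)
	fmt.Println(resolveMetricsPort(installationSpec{}, felixConfigSpec{PrometheusMetricsPort: &felixPort})) // 9095
	fmt.Println(resolveMetricsPort(installationSpec{}, felixConfigSpec{}))                                  // 9091
}
```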

tmjd (Member) commented Oct 11, 2024

Was felixConfiguration.Spec.PrometheusMetricsPort a new field that has been added? I'm wondering if that field was needed if we already had NodeMetricsPort in the operator.

If both NodeMetricsPort and PrometheusMetricsPort are set and are different should the operator report that as a problem?

vikastigera (Contributor, Author)

Was felixConfiguration.Spec.PrometheusMetricsPort a new field that has been added? I'm wondering if that field was needed if we already had NodeMetricsPort in the operator.

If both NodeMetricsPort and PrometheusMetricsPort are set and are different should the operator report that as a problem?

PrometheusMetricsPort is not a new field.
I have updated the code to use NodeMetricsPort; if it is nil, it falls back to PrometheusMetricsPort or the default value. (Per the API comment: "// If specified, this overrides any FelixConfiguration resources which may exist. If omitted, then")

tmjd (Member) commented Oct 28, 2024

So one thing I'm unclear on: should installation.NodeMetricsPort and felixConfig.PrometheusMetricsPort configure the same thing, or are they different?
I think the answer is that they are different, and that I was wrong originally to mention NodeMetricsPort, since it is used to configure calico-metrics-port while PrometheusMetricsPort is used for felix-metrics-port. For some reason I was thinking they were for the same thing, but clearly they are not.

vikastigera (Contributor, Author)

@tmjd
As per the documentation and code:
"NodeMetricsPort specifies which port calico/node serves prometheus metrics on. By default, metrics are not enabled. If specified, this overrides any FelixConfiguration resources which may exist. If omitted, then prometheus metrics may still be configured through FelixConfiguration."

The changes I have made are specific to enabling Felix metrics for the "Bring your own Prometheus" use case (https://docs.tigera.io/calico-enterprise/latest/operations/monitor/prometheus/byo-prometheus#scrape-metrics-from-specific-components).
Also, calico-metrics-port and felix-metrics-port serve different purposes.

Comment on lines 1770 to 1772
if c.cfg.Installation.NodeMetricsPort != nil {
return *c.cfg.Installation.NodeMetricsPort
}
Member

Since the two metrics ports are unrelated, other than both providing metrics information, I don't think there is any reason we should use NodeMetricsPort here.
If it is needed, I'd suggest moving this logic to the core_controller.

vikastigera (Contributor, Author)

Updated to use only PrometheusMetricsPort.

tmjd (Member) left a comment

LGTM as far as the operator API and operator code quality go. Was there a review of whether this is appropriate for Calico/Enterprise? For example, should the felix metrics port be exposed on the same service as the node metrics, or should there be a separate service?

vikastigera (Contributor, Author)

LGTM as far as the operator API and operator code quality go. Was there a review of whether this is appropriate for Calico/Enterprise? For example, should the felix metrics port be exposed on the same service as the node metrics, or should there be a separate service?

Rene and I discussed adding this to the same service, since we are using "calico-node" in the selector. We also discussed that the ServiceMonitor needs to be updated; I have updated the code for that.

tmjd (Member) commented Nov 21, 2024

@rene-dekker did you want to review this then to make sure it follows what you discussed with @vikastigera ?

@@ -373,18 +379,24 @@ func (r *ReconcileMonitor) Reconcile(ctx context.Context, request reconcile.Requ
return reconcile.Result{}, err
}

felixConfiguration, err := utils.GetFelixConfiguration(ctx, r.client)
if err != nil {
log.Error(err, "Error retrieving Felix configuration")
Member

Is there a reason that we log and move on as opposed to degrading and returning? Or would that create some sort of deadlock with the core controller?

Member

Good question, I wouldn't expect it to be problematic.

vikastigera (Contributor, Author)

Updated
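For illustration, the degrade-and-return behavior discussed above might look like the sketch below; the types are local stubs standing in for controller-runtime and the operator's status manager, not its real API:

```go
package main

import (
	"errors"
	"fmt"
)

// Local stubs standing in for controller-runtime and the operator's status manager.
type result struct{}

type statusManager struct{ degraded string }

func (s *statusManager) setDegraded(msg string, err error) {
	s.degraded = fmt.Sprintf("%s: %v", msg, err)
}

// Stubbed lookup that fails, to exercise the error path.
func getFelixConfiguration() (struct{}, error) {
	return struct{}{}, errors.New("felixconfigurations.crd.projectcalico.org not found")
}

// reconcile degrades and returns on error, so the failure is surfaced on
// status and the request is requeued, instead of logging and continuing
// with defaults.
func reconcile(status *statusManager) (result, error) {
	if _, err := getFelixConfiguration(); err != nil {
		status.setDegraded("Error retrieving Felix configuration", err)
		return result{}, err
	}
	return result{}, nil
}

func main() {
	s := &statusManager{}
	if _, err := reconcile(s); err != nil {
		fmt.Println("degraded:", s.degraded)
	}
}
```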

rene-dekker (Member)

Was there a review of whether this is appropriate for Calico/Enterprise? For example, should the felix metrics port be exposed on the same service as the node metrics, or should there be a separate service?

I think the vision we have around scraping is that we prefer users to get the metrics by scraping our prometheus and have our prometheus scrape node/felix. Users are struggling with the mTLS configuration and by leveraging the ExternalPrometheus option that could be simplified. With that in mind, I don't see any drawback from this, unless you can think of any reason.

tmjd (Member) commented Nov 22, 2024

Was there a review of whether this is appropriate for Calico/Enterprise? For example, should the felix metrics port be exposed on the same service as the node metrics, or should there be a separate service?

I think the vision we have around scraping is that we prefer users to get the metrics by scraping our prometheus and have our prometheus scrape node/felix. Users are struggling with the mTLS configuration and by leveraging the ExternalPrometheus option that could be simplified. With that in mind, I don't see any drawback from this, unless you can think of any reason.

I wasn't trying to suggest a problem with it; I just wanted to make sure someone had reviewed this change from the perspective of whether it is the right solution for Enterprise and addresses the problem it was targeting.

rene-dekker (Member) left a comment

lgtm

@rene-dekker rene-dekker merged commit 9cb3733 into tigera:master Nov 23, 2024
5 checks passed
@vikastigera vikastigera deleted the EV-5305 branch November 25, 2024 18:29