Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util: log slow GRPC calls #4847

Merged
merged 4 commits into from
Sep 20, 2024
Merged

util: log slow GRPC calls #4847

merged 4 commits into from
Sep 20, 2024

Conversation

gman0
Copy link
Contributor

@gman0 gman0 commented Sep 17, 2024

Describe what this PR does

Fixes #4839

This PR adds logs for slow GRPC calls. It does so by implementing a grpc middleware where it checks if the call's handler finishes after its context deadline, in which case a "slow call" message is logged.

The time duration is passed in through a new MiddlewareServerOptionConfig struct -- a couple of functions needed to be modified to accept a new arg.

User-facing changes:

  • New cmdline arg: --logslowopinterval
  • New Helm chart value logSlowOperationInterval

Logs with this PR included:

I0917 15:11:29.233484       1 utils.go:266] ID: 22 Req-ID: pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4 GRPC call: /csi.v1.Controller/CreateVolume
I0917 15:11:29.235487       1 utils.go:267] ID: 22 Req-ID: pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4 GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0917 15:11:29.236381       1 rbd_util.go:1339] ID: 22 Req-ID: pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4 setting disableInUseChecks: false image features: [layering] mounter: rbd
I0917 15:11:32.236436       1 utils.go:298] ID: 22 Req-ID: pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4 Still processing GRPC call /csi.v1.Controller/CreateVolume (3s)
I0917 15:11:32.237029       1 utils.go:300] ID: 22 Req-ID: pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4 Slow GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-a0e79396-1008-4c6b-b18b-b21ffeeb10f4","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}

Is there anything that requires special attention

Do you have any questions?

Is the change backward compatible?

Are there concerns around backward compatibility?

Provide any external context for the change, if any.

For example:

  • Kubernetes links that explain why the change is required
  • CSI spec related changes/catch-up that necessitates this patch
  • golang related practices that necessitates this change

Related issues

Mention any github issues relevant to this PR. Adding below line
will help to auto close the issue once the PR is merged.

Fixes: #issue_number

Future concerns

List items that are not part of the PR and do not impact it's
functionality, but are work items that can be taken up subsequently.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow
    guidelines in the developer
    guide
    .
  • Reviewed the developer guide on Submitting a Pull
    Request
  • Pending release
    notes

    updated with breaking and/or notable changes for the next major release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Show available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)

Copy link
Collaborator

@Madhu-1 Madhu-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gman0 Thanks for the PR, some nits.

@@ -118,6 +118,7 @@ charts and their default values.
| `commonLabels` | Labels to apply to all resources | `{}` |
| `logLevel` | Set logging level for csi containers. Supported values from 0 to 5. 0 for general useful logs, 5 for trace level verbosity. | `5` |
| `sidecarLogLevel` | Set logging level for csi sidecar containers. Supported values from 0 to 5. 0 for general useful logs, 5 for trace level verbosity. | `1` |
| `logSlowOperationsAfter` | Operations running longer than the specified time duration will be logged as slow. Setting the value to zero disables the feature. | `15s` |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we keep this value as same as GRPC timeout which we have at 60s seconds?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the same timeout probably defeats the purpose? It would be useful to know if an operation is still running before the timeout hits.

cmd/cephcsi.go Outdated
flag.DurationVar(
&conf.LogSlowOpsAfter,
"logslowopsafter",
time.Second*15,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to enable it by default? how about disable it by default and if the user requires they can enable it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with enabling by default, it can help with troubleshooting issues.

info *grpc.UnaryServerInfo,
handler grpc.UnaryHandler,
) (interface{}, error) {
ticker := time.NewTicker(timeout)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's have a defer for the ticker.Stop() to avoid resource leak

for {
select {
case <-callFinished:
break waitForCallFinished
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we return from here we dont need to use break right?

@@ -40,6 +40,9 @@ commonLabels: {}
logLevel: 5
# sidecarLogLevel is the variable for Kubernetes sidecar container's log level
sidecarLogLevel: 1
# Operations running longer than the specified time duration will be logged
# as slow. Setting the value to zero disables the feature.
logSlowOperationsAfter: 15s
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we comment it out and not enable it by default?

@Madhu-1 Madhu-1 requested a review from a team September 18, 2024 07:59
@gman0
Copy link
Contributor Author

gman0 commented Sep 18, 2024

@nixpanic, actually I like @Madhu-1's idea about waiting for context timeout. It's what's happening after, when a call is retried while the older call is still in flight -- this is what I wanted to log and to have a better visibility over.

I0918 15:33:48.756232       1 utils.go:298] ID: 25 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Still processing GRPC call /csi.v1.Controller/CreateVolume (45s)
I0918 15:33:48.756658       1 utils.go:300] ID: 25 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Slow GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0918 15:34:03.756752       1 utils.go:298] ID: 25 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Still processing GRPC call /csi.v1.Controller/CreateVolume (1m0s)
I0918 15:34:03.758209       1 utils.go:300] ID: 25 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Slow GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0918 15:34:04.270362       1 utils.go:266] ID: 27 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 GRPC call: /csi.v1.Controller/CreateVolume
I0918 15:34:04.270850       1 utils.go:267] ID: 27 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0918 15:34:04.274015       1 rbd_util.go:1339] ID: 27 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 setting disableInUseChecks: false image features: [layering] mounter: rbd
I0918 15:34:11.608149       1 utils.go:266] ID: 28 GRPC call: /csi.v1.Identity/Probe
I0918 15:34:11.608281       1 utils.go:267] ID: 28 GRPC request: {}
I0918 15:34:11.608368       1 utils.go:273] ID: 28 GRPC response: {}
I0918 15:34:18.756588       1 utils.go:298] ID: 25 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Still processing GRPC call /csi.v1.Controller/CreateVolume (1m15s)
I0918 15:34:18.757242       1 utils.go:300] ID: 25 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Slow GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I0918 15:34:19.271399       1 utils.go:298] ID: 27 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Still processing GRPC call /csi.v1.Controller/CreateVolume (15s)
I0918 15:34:19.272034       1 utils.go:300] ID: 27 Req-ID: pvc-aa8d7385-d6ae-4403-973d-3f26901003d5 Slow GRPC request: {"capacity_range":{"required_bytes":1048576},"name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","parameters":{"clusterID":"micro-osd","csi.storage.k8s.io/pv/name":"pvc-aa8d7385-d6ae-4403-973d-3f26901003d5","csi.storage.k8s.io/pvc/name":"rbd-pvc","csi.storage.k8s.io/pvc/namespace":"default","imageFeatures":"layering","pool":"rbdpool"},"secrets":"***stripped***","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}

Here in the log you can see this exact situation:

  • CreateVolume ID 25 is still in progress,
  • at 1-minute mark CreateVolume is retried with ID 27,
  • 25 continues on

Normally 27 would be blocked by a try-lock but in any case it's great that we can immediately see when a call is stuck. I propose we change this PR so that the "slow call" logs start only after a call exceeds its context deadline, and we keep this enabled by default. This way we can keep an eye on runaway GRPCs.

@gman0
Copy link
Contributor Author

gman0 commented Sep 19, 2024

I've made the changes described in the post above. I wonder if it makes sense to make the log interval configurable or just have it hardcoded to some sane value (30s might be ok)?

MetricsPort int // TCP port for liveness/grpc metrics requests
PollTime time.Duration // time interval in seconds between each poll
PoolTimeout time.Duration // probe timeout in seconds
LogSlowOpInterval time.Duration // GRPC calls running longer than this will be logged
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment here needs an update.

nixpanic
nixpanic previously approved these changes Sep 19, 2024
Copy link
Member

@nixpanic nixpanic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@nixpanic
Copy link
Member

You probably should consider adding a note to PendingReleaseNotes.md so that the new feature will be included in the release notes for the next version.

@mergify mergify bot dismissed nixpanic’s stale review September 19, 2024 12:51

Pull request has been modified.

@nixpanic
Copy link
Member

Two minor issues:

  1. The last commit misses your Signed-off-by line.
  2. yamllint complains about the line length
./PendingReleaseNotes.md:15: MD013 Line length

Further documentation is available for these failures:
 - MD013: https://github.com/markdownlint/markdownlint/blob/main/docs/RULES.md#md013---line-length

@gman0
Copy link
Contributor Author

gman0 commented Sep 19, 2024

CI is passing now, thanks! PTAL

@gman0 gman0 requested a review from Madhu-1 September 20, 2024 05:42
@Madhu-1
Copy link
Collaborator

Madhu-1 commented Sep 20, 2024

@Mergifyio queue

Copy link
Contributor

mergify bot commented Sep 20, 2024

queue

🛑 The pull request has been removed from the queue default

The pull request can't be updated.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@Madhu-1
Copy link
Collaborator

Madhu-1 commented Sep 20, 2024

@Mergifyio rebase

Robert Vasek added 4 commits September 20, 2024 06:55
This commit adds a gRPC middleware that logs calls that
keep running after their deadline.

Adds --logslowopinterval cmdline argument to pass the log rate.

Signed-off-by: Robert Vasek <[email protected]>
Copy link
Contributor

mergify bot commented Sep 20, 2024

rebase

✅ Branch has been successfully rebased

@Madhu-1
Copy link
Collaborator

Madhu-1 commented Sep 20, 2024

@Mergifyio refresh

Copy link
Contributor

mergify bot commented Sep 20, 2024

refresh

✅ Pull request refreshed

@Madhu-1
Copy link
Collaborator

Madhu-1 commented Sep 20, 2024

@Mergifyio queue

Copy link
Contributor

mergify bot commented Sep 20, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at c76338c

@mergify mergify bot added the ok-to-test Label to trigger E2E tests label Sep 20, 2024
@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-cephfs

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/upgrade-tests-rbd

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.29

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.30

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.29

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.30

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.29

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.30

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/k8s-e2e-external-storage/1.31

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e-helm/k8s-1.31

@ceph-csi-bot
Copy link
Collaborator

/test ci/centos/mini-e2e/k8s-1.31

@ceph-csi-bot ceph-csi-bot removed the ok-to-test Label to trigger E2E tests label Sep 20, 2024
@mergify mergify bot merged commit c76338c into ceph:devel Sep 20, 2024
36 of 37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Log a notification about long-running operations
4 participants