
[Bug]: Memory leak on all providers #538

Open
1 task done
IxDay opened this issue Jun 10, 2024 · 6 comments · Fixed by #539
Labels
bug Something isn't working needs:triage stale

Comments

@IxDay
Contributor

IxDay commented Jun 10, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Affected Resource(s)

All the providers

Resource MRs required to reproduce the bug

No response

Steps to Reproduce

  • Deploy vanilla Crossplane via the Helm chart
  • Deploy any resource from this project that needs reconciliation; the more resources are created, the faster memory grows
  • Wait and observe the memory increase of the provider pod

What happened?

The memory kept growing until the pod was restarted or OOM-killed, see:

Cloudplatform
Screenshot from 2024-06-10 14-07-28

Redis
Screenshot from 2024-06-10 14-13-56

Storage
Screenshot from 2024-06-10 14-15-20

Most of those drops are restarts; the curve is steeper on Cloudplatform because it is the most used in our setup. The behavior is consistent across all the modules.

Relevant Error Output Snippet

No response

Crossplane Version

1.14.5

Provider Version

1.1.0

Kubernetes Version

No response

Kubernetes Distribution

No response

Additional Info

Since we identified the culprit, I will not dig into too much detail here. We will publish a PR with the fix we deployed in order to discuss what should be done.

@IxDay IxDay added bug Something isn't working needs:triage labels Jun 10, 2024
IxDay added a commit to IxDay/provider-upjet-gcp that referenced this issue Jun 10, 2024
This patch fixes the propagation of context cancellation through the
call stack. It prevents leaks of channel and goroutine from the
[terraform provider][provider_code].

Fixes: crossplane-contrib#538

[provider_code]: https://github.com/hashicorp/terraform-provider-google/blob/1d1a50adf64af60815b7a08ffc5e9d3e856d2e9c/google/transport/batcher.go#L117-L123

Signed-off-by: Maxime Vidori <[email protected]>
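
For illustration, here is a minimal, self-contained Go sketch of the leak pattern described in the commit message above. It is not the actual provider code: `startBatcher` is a hypothetical stand-in for the batcher goroutine in the linked terraform-provider-google code. The point is that when a request-scoped context is replaced by a never-cancelled one, every call parks a goroutine (and the channel it waits on) that can never be released, whereas propagating and cancelling the caller's context lets it exit.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// startBatcher is a hypothetical stand-in for provider code that parks a
// goroutine on ctx.Done(). The goroutine (and the channel it selects on)
// is only released once the context passed in is cancelled.
func startBatcher(ctx context.Context) {
	go func() {
		select {
		case <-ctx.Done(): // released when the caller's context is cancelled
		case <-time.After(24 * time.Hour): // fallback; never the normal exit path
		}
	}()
}

func main() {
	fmt.Println("baseline goroutines:", runtime.NumGoroutine())

	// Fixed pattern: propagate a cancellable context and cancel it when the
	// work finishes, so each parked goroutine is released.
	for i := 0; i < 1000; i++ {
		ctx, cancel := context.WithCancel(context.Background())
		startBatcher(ctx)
		cancel()
	}
	time.Sleep(100 * time.Millisecond) // give released goroutines time to exit
	fmt.Println("after 1000 cancelled calls:", runtime.NumGoroutine())

	// Leaky pattern: a never-cancelled context per call, so cancellation is
	// never propagated down the call stack and every goroutine stays parked.
	for i := 0; i < 1000; i++ {
		startBatcher(context.Background())
	}
	fmt.Println("after 1000 uncancellable calls:", runtime.NumGoroutine())
}
```

Under these assumptions, the last count stays roughly 1000 above the baseline, mirroring the steady per-reconciliation growth visible in the graphs above; the patch referenced here addresses the same class of problem by propagating context cancellation through the call stack.
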
@mergenci
Collaborator

@IxDay, thanks for your discovery and beautiful report 🙏 I wanted to note here that other providers are likely to be affected if the underlying Terraform providers function similarly (links to the relevant lines):

  1. provider-azure
  2. provider-azuread

@momilo

momilo commented Jun 12, 2024

This is probably a (well-observed) generalisation of the issue I noticed with the pubsub provider, noted here. I suspect that addressing it at this level would also resolve the issues I've experienced.

@cterence

Reporting that we also experience this issue and are hit in a significant way since we manage hundreds of resources.

Example featuring our compute and container providers:
image

@turkenf
Collaborator

turkenf commented Sep 9, 2024

See: #539 (comment)

@turkenf turkenf reopened this Sep 9, 2024
@mergenci
Collaborator

I would like to follow up on my comment above. In a discussion with @ulucinar and @turkenf, we suspected that provider-upjet-azure and provider-upjet-azuread might not be experiencing memory leaks, because of differences in the underlying Terraform providers. We didn't test, but I wouldn't be surprised if there were no memory leaks in those providers.


This provider repo does not have enough maintainers to address every issue. Since there has been no activity in the last 90 days it is now marked as stale. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh will mark this issue as not stale.

@github-actions github-actions bot added the stale label Dec 13, 2024