504 GatewayTimeout: Error while creating a ContainerApp caused Pulumi to fail and not add it to state despite the resource being created #3200
Comments
Thanks for writing this up @zbuchheit. It would certainly be useful to have debug logs here to narrow down the specific source of the 504 error, if possible. A 504 gateway timeout should certainly be retryable, but I'm unsure whether this is already implemented within the Azure client library we're using. I expect this is particularly difficult to reproduce, as it was likely caused by a transient error on Azure which was probably resolved in a matter of seconds.
Identification of source areas
After the initial PUT request, which should be fairly fast to respond before handing off to an awaitable operation, we check the response for whether it was successful.
If that check fails, the error is surfaced there: pulumi-azure-native/provider/pkg/provider/provider.go, lines 888 to 890 at e06cde1.
If the read after the operation has completed fails, then a warning should be logged but the create should be marked as successful, though it might lack some of the outputs: pulumi-azure-native/provider/pkg/provider/provider.go, lines 908 to 910 at e06cde1.
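A minimal sketch of that behaviour, using hypothetical names rather than the actual code at the lines referenced above:

```go
package provider

import (
	"context"
	"log"
)

// azureClient is a stand-in for the provider's Azure REST client (hypothetical).
type azureClient interface {
	Get(ctx context.Context, id string) (map[string]interface{}, error)
}

// finishCreate sketches the behaviour described above: once the async operation
// has completed, read the resource back; if that read fails, log a warning and
// still report the create as successful, returning partial outputs.
func finishCreate(ctx context.Context, client azureClient, id string) (map[string]interface{}, error) {
	outputs, err := client.Get(ctx, id)
	if err != nil {
		log.Printf("warning: resource %s was created, but reading its state back failed: %v", id, err)
		return map[string]interface{}{"id": id}, nil // partial outputs only
	}
	return outputs, nil
}
```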
A possible source of an issue could be when we're awaiting the operation: if one of the polling requests is not parsable, it could get interpreted as a creation failure and would therefore surface as the creation failing. However, the handling of all of these possible errors looks sound and will still return the original error: pulumi-azure-native/provider/pkg/azure/client.go, lines 314 to 325 at e06cde1.
It's also possible that we could just set a retry count on the client library, which might resolve this immediate issue of a 504, though it doesn't explain why the partial state isn't created. Currently, I would say it's inconclusive where the 504 is not being handled. Additional information which might help here would be verbose logs that capture the original 504 response.
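For the retry idea, a minimal sketch of what configuring retries could look like, assuming the provider's HTTP calls go through a go-autorest client; the actual wiring inside pulumi-azure-native may differ:

```go
package provider

import (
	"net/http"
	"time"

	"github.com/Azure/go-autorest/autorest"
)

// configureRetries is a sketch of how a go-autorest based client could be told
// to retry on transient gateway errors such as 504.
func configureRetries(c *autorest.Client) {
	c.RetryAttempts = 3
	c.RetryDuration = 5 * time.Second
	// Retry the request for the listed status codes, backing off between attempts.
	c.Sender = autorest.DecorateSender(&http.Client{},
		autorest.DoRetryForStatusCodes(c.RetryAttempts, c.RetryDuration,
			http.StatusBadGateway,         // 502
			http.StatusServiceUnavailable, // 503
			http.StatusGatewayTimeout,     // 504
		))
}
```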
@zbuchheit weren't the logs from the pulumi up we've provided sufficient? It was several hundred MB, AFAIR :) Anyway, 504 or not, the takeaway from this and similar issues we've encountered with the azure-native provider (or Azure in general) is that Pulumi tends to forget about the resources it had tried to create, instead of registering them and verifying on the next occasion. Cancel the process or interrupt the connection and Pulumi might be completely unaware that the resource it had requested is actually there. It leaves the customer in a position where either their resources are duplicated during the next deploy (possibly unknowingly, causing costs/vulnerabilities) or the deployment is in a permanently failed state until manual intervention (import what was created/partially created, or delete that resource). Please correct me if that isn't the case. But overall, our expectation as a customer is that if any problem occurs, Pulumi verifies whether the resource is there or not and retries (unless it's obvious it shouldn't, e.g. validation failed).
Hi @mikocot, I believe the logs you shared didn't include the 504 GatewayTimeout occurrence, which is what I believe @danielrbradley is after when referring to verbose logs, but I will make sure the ones you sent are passed along.
Not possible, as it's not an error that we can reproduce; maybe you can by mocking the requests from Azure. We've only done extensive logging for the later deployments where resources were already created and causing conflicts. We can't really generate 100 MB+ logs for every run, especially since it's only part of the pipeline, and it also makes it slower :)
I destroyed the stack some days ago; it had been open for about two weeks with us paying for the resources both in Pulumi and Azure. It's unfortunately a big problem with many support cases; they cost us a lot in total.
I've sent the console logs directly to Zach, maybe that can help. I don't think they should be published here.
So from what you're saying, Pulumi should mark such a resource as 'potentially created' in its state if, e.g., a network issue happens and it can't tell whether it was created or not? It doesn't lose track of it? What about cancellations: we've seen numerous times that when the Pulumi process was cancelled in some way (e.g. GitHub timeouts), the state became corrupted and created resources had to be removed to proceed (or even to destroy the stack, as they blocked dependencies from cleanup). I'd assume a similar mechanism should protect Pulumi in both cases: some way of registering that a creation is/was ongoing, which is not updated until verified with Azure.
Hi @mikocot, apologies for the delay. I verified that the provider retries on 504, for a total of three retries. Therefore, I don't believe there is more we can do about this particular error. As to your question about Pulumi's approach to state on failure: yes, it can indeed happen that a resource creation seemingly fails and the resource doesn't exist in Pulumi's state, but on the service provider's side it was actually created despite the error return code. Similar for cancellations. Since Pulumi cannot know the real state on the service's side, it needs to make assumptions, which could be wrong either way (created or not created). For pending Creates, refresh can help (also available via pulumi up --refresh).
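To illustrate the refresh suggestion, a minimal sketch using Pulumi's Go Automation API; the stack name "dev" and the project directory "./infra" are placeholders, and the equivalent from the CLI is pulumi refresh followed by pulumi up:

```go
package main

import (
	"context"
	"log"

	"github.com/pulumi/pulumi/sdk/v3/go/auto"
)

func main() {
	ctx := context.Background()

	// "dev" and "./infra" are placeholders for an existing stack and project directory.
	stack, err := auto.UpsertStackLocalSource(ctx, "dev", "./infra")
	if err != nil {
		log.Fatal(err)
	}

	// Refresh reconciles Pulumi's state with what actually exists in Azure,
	// which is how a pending create can be resolved.
	if _, err := stack.Refresh(ctx); err != nil {
		log.Fatal(err)
	}

	// Then apply the desired configuration as usual.
	if _, err := stack.Up(ctx); err != nil {
		log.Fatal(err)
	}
}
```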
I've created an engine issue to track improving this situation: pulumi/pulumi#15958.
Looking a little closer at the error message, it's actually two separate errors, which can help us narrow down the issue:
This error is generated from this code:
(3) is the error from the follow-up GET for the resource's state.
Proposal
If the GET for the state also fails, we should still return the partial create result, but with the state missing. This will allow the create to be easily fixed with a refresh, as suggested by @thomas11 above.
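A sketch of what returning that partial result could look like, assuming the gRPC provider surfaces partial state via pulumirpc.ErrorResourceInitFailed; the function name and wiring here are illustrative, not the provider's actual code:

```go
package provider

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/types/known/structpb"

	pulumirpc "github.com/pulumi/pulumi/sdk/v3/proto/go"
)

// partialCreateError sketches the proposal: even when the follow-up GET for the
// resource's state fails, return a partial-init error that still carries the
// resource ID (and any known properties) so the engine records the resource and
// a later refresh can complete it.
func partialCreateError(id string, props *structpb.Struct, cause error) error {
	detail := &pulumirpc.ErrorResourceInitFailed{
		Id:         id,
		Properties: props,
		Reasons:    []string{cause.Error()},
	}
	st, err := status.New(codes.Unknown, cause.Error()).WithDetails(detail)
	if err != nil {
		return cause // fall back to the plain error if details can't be attached
	}
	return st.Err()
}
```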
This should also be handled already: as long as you're using SIGTERM, Pulumi should request the provider to shut down, which should result in a partial state. If you use SIGKILL, then it's expected that it might leave the state without some resources being fully updated.
If a read fails when trying to return a partial result, ensure we still return a partial result with the resource identifier so the operation can be completed using refresh at a later point. Fixes #3200, where the partial state failed to be written, resulting in the resource being created but lost from state.
What happened?
While attempting to create a container app I encountered a 504 GatewayTimeout where creating the container apps timed out at 800s, but they ended up actually being created in Azure. A subsequent pulumi up results in a 'resource already exists' error. This requires some manual intervention where I need to import the resource or, alternatively, delete the resource in the portal (resulting in downtime) and recreate it. I would consider this to be a transient error and would prefer Pulumi to be able to recover from a scenario like this.
Error Message
Example
Container App Code
I don't have a consistent repro as this is a network-related error.
Output of pulumi about
Additional context
In doing discovery in the logs, I see the workflow for creating this resource is as follows: a GET to check if the resource exists; if not, a PUT on the Azure endpoint to create the resource, which provides an Azure-AsyncOperation URL to perform GET calls against and watch for the operation to complete. It then polls this until it successfully completes. I don't have a verbose copy of the logs to verify whether the 504 comes from the PUT or the subsequent GET calls, but I wonder if there is a possibility of adding some handling for a scenario like this: when a 504 is encountered, another GET is fired off with a retry timer to check the resource, and if the resource doesn't exist after a reasonable threshold, bubble up the error (a sketch of what that could look like follows below). Much like #1266, this causes some friction in use of this provider and causes users to lose confidence in it.
Contributing
Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).