
504 GatewayTimeout Error while creating a ContainerApp caused Pulumi to fail and not add it to state despite the resource being created #3200

Closed
zbuchheit opened this issue Apr 4, 2024 · 9 comments · Fixed by #3219
Labels
awaiting-upstream (The issue cannot be resolved without action in another repository, may be owned by Pulumi) · customer/feedback (Feedback from customers) · kind/bug (Some behavior is incorrect or out of spec) · resolution/fixed (This issue was fixed)

Comments

@zbuchheit

zbuchheit commented Apr 4, 2024

What happened?

While attempting to create a container app, I encountered a 504 GatewayTimeout: the create timed out at 800s, but the container apps ended up actually being created in Azure. A subsequent pulumi up results in a "resource already exists" error.

This requires manual intervention: I either need to import the resource, or delete it in the portal (resulting in downtime) and recreate it.

I would consider this a transient error and would prefer Pulumi to be able to recover from a scenario like this.

Error Message

+ azure-native:app/v20230501:ContainerApp redacted **creating failed** error: resource partially created but read failed autorest/azure: Service returned an error. Status=504 Code="GatewayTimeout" Message="The gateway did not receive a response from 'Microsoft.App' within the specified time period.": autorest/azure: Service returned an error. Status=504 Code="GatewayTimeout" Message="The gateway did not receive a response from 'Microsoft.App' within the specified time period."

Example

Container App Code
using Pulumi;
using Pulumi.AzureNative.Resources;
using Pulumi.AzureNative.ContainerRegistry;
using Pulumi.AzureNative.ContainerRegistry.Inputs;
using Pulumi.AzureNative.App;
using Pulumi.AzureNative.App.Inputs;
using Pulumi.Docker;
using System.Collections.Generic;
using Pulumi.Docker.Inputs;
using DockerRegistryArgs = Pulumi.Docker.Inputs.RegistryArgs;
using System;

return await Pulumi.Deployment.RunAsync(async () =>
    {
        var resourceGroupName = "zbuchheitResourceGroup";
        var resourceGroup = new ResourceGroup("resourceGroup", new ResourceGroupArgs
        {
            ResourceGroupName = resourceGroupName,
            Location = "EastUS"
        });
        var config = new Pulumi.Config();
        var importing = config.GetBoolean("importing") ?? false;

        var registry = new Registry("registry", new Pulumi.AzureNative.ContainerRegistry.RegistryArgs
        {
            ResourceGroupName = resourceGroup.Name,
            Sku = new SkuArgs
            {
                Name = "basic"
            },
            AdminUserEnabled = true
        });

        var credentials = Output.Tuple(resourceGroup.Name, registry.Name).Apply(names =>
            ListRegistryCredentials.InvokeAsync(new ListRegistryCredentialsArgs
            {
                ResourceGroupName = names.Item1,
                RegistryName = names.Item2
            })
        );

        var username = credentials.Apply(creds => creds.Username);
        var password = credentials.Apply(creds => creds.Passwords[0].Value);

        var image = new Image("image", new ImageArgs
        {
            ImageName = Output.Format($"{registry.LoginServer}/timeout:latest"),
            Build = new DockerBuildArgs
            {
                Context = ".",
                Platform = "linux/amd64",
            },
            Registry = new DockerRegistryArgs
            {
                Server = registry.LoginServer,
                Username = username!,
                Password = password!
            }
        }, new CustomResourceOptions { IgnoreChanges = new List<string> { "build" } });

        var managedEnvironment = new ManagedEnvironment("myManagedEnvironment", new ManagedEnvironmentArgs
        {
            ResourceGroupName = resourceGroup.Name,
            Location = resourceGroup.Location
        });

        var containerAppName = "zbuchheitcontainerapp";
        var containerApp = new ContainerApp(containerAppName, new ContainerAppArgs
        {
            ContainerAppName = containerAppName,
            ResourceGroupName = resourceGroup.Name,
            Location = resourceGroup.Location,
            EnvironmentId = managedEnvironment.Id,
            ManagedEnvironmentId = managedEnvironment.Id,
            Configuration = new ConfigurationArgs
            {
                MaxInactiveRevisions = 100,
                Registries = new List<RegistryCredentialsArgs>
                {
                    new RegistryCredentialsArgs
                    {
                        Server = registry.LoginServer,
                        Username = username!,
                        PasswordSecretRef = "registry-pwd",
                    }
                },
                Secrets = new List<Pulumi.AzureNative.App.Inputs.SecretArgs>
                {
                    new Pulumi.AzureNative.App.Inputs.SecretArgs
                    {
                        Name = "registry-pwd",
                        Value = password!
                    },
                }
            },
            Template = new TemplateArgs
            {
                Scale = new ScaleArgs
                {
                    MaxReplicas = 10,
                },
                Containers = new List<Pulumi.AzureNative.App.Inputs.ContainerArgs>
                {
                    new Pulumi.AzureNative.App.Inputs.ContainerArgs
                    {
                        Name = "mycontainer",
                        Image = image.ImageName,
                    }
                }
            }
        }, new()
        {
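            // If the "importing" config flag is set, look up the app that already exists in Azure and import it into state instead of creating it again.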
            ImportId = importing
            ? GetContainerApp.InvokeAsync(new GetContainerAppArgs
            {
                ContainerAppName = containerAppName,
                ResourceGroupName = resourceGroupName
            }).Result.Id
            : null,
            // IgnoreChanges = new List<string> {"identity", "configuration.secrets[0]", "configuration.registries[0]", "template.revisionSuffix", "template.containers[0]" }
        });
    });

I don't have a consistent repro, as this is a network-related error.

Output of pulumi about

CLI          
Version      3.112.0
Go Version   go1.22.1
Go Compiler  gc

Plugins
NAME          VERSION
azure-native  2.35.0
command       0.9.2
docker        4.5.3
dotnet        unknown

Host     
OS       darwin
Version  14.2.1
Arch     arm64

This project is written in dotnet: executable='/usr/local/share/dotnet/dotnet' version='6.0.417'

Current Stack: zbuchheit-pulumi-corp/azure-cs/dev

Found no resources associated with dev

Found no pending operations associated with dev

Backend        
Name           pulumi.com
URL            https://app.pulumi.com/zbuchheit-pulumi-corp
User           zbuchheit-pulumi-corp
Organizations  zbuchheit-pulumi-corp
Token type     personal

Dependencies:
NAME                VERSION
Pulumi              3.60.0
Pulumi.AzureNative  2.35.0
Pulumi.Command      0.9.2
Pulumi.Docker       4.5.3

Additional context

While doing discovery in the logs, I see the workflow for creating this resource is as follows:

A GET to check whether the resource exists; if not, a PUT to the Azure endpoint to create the resource, which returns an Azure-AsyncOperation URL to perform GET calls against and watch for the operation to complete. The provider then polls this URL until the operation successfully completes.

I don't have a verbose copy of the logs to verify whether the 504 comes from the PUT or the subsequent GET calls, but I wonder if there is a possibility of adding handling for a scenario like this: when a 504 is encountered, fire off another GET with a retry timer to check the resource, and if the resource doesn't exist after a reasonable threshold, bubble up the error.
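For illustration, here is a minimal sketch (my own, not the provider's code; the function name, retry count, and backoff values are hypothetical) of the kind of confirmation step described above:

package sketch

import (
    "context"
    "fmt"
    "net/http"
    "time"
)

// confirmResourceExists re-checks the resource with a GET and backoff after a 504
// from the async-operation poll, instead of immediately treating the create as failed.
func confirmResourceExists(ctx context.Context, client *http.Client, resourceURL string) (bool, error) {
    backoff := 5 * time.Second
    for attempt := 0; attempt < 5; attempt++ {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, resourceURL, nil)
        if err != nil {
            return false, err
        }
        resp, err := client.Do(req)
        if err == nil {
            resp.Body.Close()
            switch resp.StatusCode {
            case http.StatusOK:
                return true, nil // the resource exists after all
            case http.StatusNotFound:
                return false, nil // the create genuinely did not happen
            }
            // 504 or another transient status: wait and retry
        }
        select {
        case <-ctx.Done():
            return false, ctx.Err()
        case <-time.After(backoff):
            backoff *= 2
        }
    }
    return false, fmt.Errorf("state of %s still unknown after retries", resourceURL)
}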

Much like #1266, this causes friction in using the provider and causes users to lose confidence in it.

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@zbuchheit zbuchheit added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team customer/feedback Feedback from customers labels Apr 4, 2024
@danielrbradley
Member

danielrbradley commented Apr 9, 2024

Thanks for writing this up @zbuchheit

It would certainly be useful to have debug logs here to narrow down the specific source of the 504 error if possible. A 504 gateway timeout should be retryable, but I'm unsure whether this is already implemented within the Azure client library we're using. I expect this is particularly difficult to reproduce, as it was likely caused by a transient error on Azure that resolved itself in a matter of seconds.

Identification of source areas

After the initial PUT request, which should be fairly fast to respond before handing off to an awaitable operation, we check whether the response status was < 400 to determine that a create was started:

created = resp.StatusCode < 400

If created is marked as true but there was an error while waiting for the operation to complete, we should then record a "partial create", which indicates the create was started; the next deployment should re-check for the completed creation against the recorded identifier:

if created {
    return id, nil, crudClient.HandleErrorWithCheckpoint(ctx, err, id, inputs, req.GetProperties())
}

If the read after the operation completes fails, a warning should be logged but the create should still be marked as successful, though it might lack some of the outputs:

k.host.Log(ctx, diag.Warning, resource.URN(req.GetUrn()), `Failed to read resource after Create. Please report this issue.
Verbose logs contain more information, see https://www.pulumi.com/docs/support/troubleshooting/#verbose-logging.`)
logging.V(9).Infof("failed read resource %q after creation: %s", id, err.Error())

A possible source of an issue could be while we're awaiting the operation: if one of the polling responses is not parsable, it could be interpreted as a creation failure and therefore surface as the creation failing. However, the handling of all of these possible errors looks sound and will still return the original created=true along with any error, which should result in a partial create:

future, err := azure.NewFutureFromResponse(resp)
if err != nil {
    return nil, created, err
}
err = future.WaitForCompletionRef(ctx, a.client)
if err != nil {
    return nil, created, err
}
resp, err = future.GetResult(a.client)
if err != nil {
    return nil, created, err
}

It's also possible that we could just set a retry count on the client library, which might resolve this immediate issue of a 504, though it wouldn't explain why the partial state isn't created.
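For reference, a rough sketch of what bumping the retry settings could look like, assuming the requests go through a go-autorest client (RetryAttempts and RetryDuration are real autorest.Client fields; the function name and the values below are placeholders, and where the provider actually constructs its client isn't shown here):

package sketch

import (
    "time"

    "github.com/Azure/go-autorest/autorest"
)

// newRetryingClient builds an autorest client configured to retry transient
// failures such as a 504 GatewayTimeout a few more times before giving up.
func newRetryingClient() autorest.Client {
    c := autorest.NewClientWithUserAgent("pulumi-azure-native")
    c.RetryAttempts = 5                // number of retries for retryable status codes
    c.RetryDuration = 10 * time.Second // delay between retries
    return c
}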

Currently, I would say it's inconclusive where the 504 is not being handled. Additional information that might help here:

  1. Verbose logs, if possible
  2. The Pulumi stack state after the timeout was created - to see if a partial create was registered.
  3. Console logs, to see if there are any additional clues as to the operation order around the error.

@danielrbradley danielrbradley added awaiting-feedback Blocked on input from the author and removed needs-triage Needs attention from the triage team labels Apr 9, 2024
@mikocot

mikocot commented Apr 10, 2024

@zbuchheit weren't the logs from pulumi up that we provided sufficient? It was several hundred MB AFAIR :)

Anyway, 504 or not, the takeaway from this and similar issues we've encountered with the azure-native provider (or Azure in general) is that Pulumi tends to forget about the resources it had tried to create, instead of registering them and verifying them on the next occasion.

Cancel the process or interrupt the connection and Pulumi might be completely unaware that the resource it requested is actually there. This leaves the customer in a position where either their resources are duplicated during the next deploy (possibly unknowingly, causing costs/vulnerabilities) or the deployment is in a permanently failed state until manual intervention (import what was created/partially created, or delete that resource). Please correct me if that isn't the case.

But overall, our expectation as a customer is that if any problem occurs, Pulumi verifies whether the resource is there or not and retries if appropriate (unless the failure is obvious, e.g. validation failed).

@zbuchheit
Author

Hi @mikocot, I believe the logs you shared didn't include the 504 GatewayTimeout occurrence, which is what @danielrbradley is asking for when referring to verbose logs, but I will make sure the ones you sent are passed along.

@mikocot

mikocot commented Apr 11, 2024

> Verbose logs, if possible

Not possible, as it's not an error that we can reproduce; maybe you can by mocking the requests from Azure. We've only done extensive logging for the later deployments where resources were already created and causing conflicts. We can't really generate 100MB+ logs for every run, especially since it's only part of the pipeline and also makes it slower :)

> The Pulumi stack state after the timeout was created - to see if a partial create was registered.

I destroyed the stack some days ago; it had already been open for about two weeks, with us paying for the resources both in Pulumi and Azure. That's unfortunately a big problem with many support cases; they cost us a lot in total.

> Console logs to see if there are any additional clues as to the operation order around the error.

I've sent the console logs directly to Zach, maybe that can help. I don't think they should be published here.

@mikocot

mikocot commented Apr 11, 2024

So from what you're saying, Pulumi should mark such a resource as 'potentially created' in its state if e.g. a network issue happens and it can't tell whether it was created or not? It doesn't lose track of it?
So either it interpreted the Azure response as 'it was not created', or there is a bug where it failed to add the resource to state.

What about cancellations? We've seen numerous times that when the Pulumi process was cancelled in some way (e.g. GitHub timeouts), the state became corrupted and created resources had to be removed to proceed (or even to destroy the stack, as they blocked dependencies from cleanup).

I'd assume a similar mechanism should protect Pulumi in both cases: some way of registering that a creation is/was ongoing, which is not updated until verified with Azure.

@pulumi-bot pulumi-bot added needs-triage Needs attention from the triage team and removed awaiting-feedback Blocked on input from the author labels Apr 11, 2024
@thomas11 thomas11 added resolution/by-design This issue won't be fixed because the functionality is working as designed and removed needs-triage Needs attention from the triage team labels Apr 16, 2024
@thomas11
Contributor

thomas11 commented Apr 16, 2024

Hi @mikocot, apologies for the delay. I verified that the provider retries on 504, for a total of three retries. Therefore, I don't believe there is more we can do about this particular error.

As to your question about Pulumi's approach to state on failure: yes, it can indeed happen that a resource creation seemingly fails and the resource doesn't exist in Pulumi's state, but on the service provider it was actually created despite the error return code. Similar for cancellations. Since Pulumi cannot know the real state on the service's side, it needs to make assumptions, which could be wrong either way (created or not created).

For pending Creates, refresh can help (also built-in to up as --refresh). It wouldn't help in your case, however, because the Create has already failed.

@thomas11
Contributor

thomas11 commented Apr 17, 2024

I've created an engine issue to track improving this situation, pulumi/pulumi#15958.

@thomas11 thomas11 added awaiting-upstream The issue cannot be resolved without action in another repository (may be owned by Pulumi). and removed resolution/by-design This issue won't be fixed because the functionality is working as designed labels Apr 17, 2024
@thomas11 thomas11 reopened this Apr 17, 2024
@danielrbradley
Member

Looking a little closer at the error message, it's actually two separate errors, which helps us narrow down the issue:

  1. azure-native:app/v20230501:ContainerApp redacted **creating failed** error:
  2. resource partially created but read failed autorest/azure:
  3. Service returned an error. Status=504 Code="GatewayTimeout" Message="The gateway did not receive a response from 'Microsoft.App' within the specified time period.":
  4. autorest/azure: Service returned an error. Status=504 Code="GatewayTimeout" Message="The gateway did not receive a response from 'Microsoft.App' within the specified time period."

This error is generated from this code:

return azure.AzureError(errors.Wrapf(err, "resource updated but read failed %s", getErr))

(3) is the getErr and (4) is the err from the create operation. Therefore what we're observing here is that we:

  1. Performed the PUT successfully.
  2. Started polling for the operation.
  3. The operation GET returned a 504.
  4. We attempted to fetch the latest state to include in the partial state.
  5. The GET for the state also failed with a 504
  6. Therefore we didn't return a partial state.

Proposal

If the GET for the state also fails, we should still return the partial create result, but with the state missing. This will allow the create to be easily fixed with a refresh as suggested by @thomas11 above.
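A rough sketch of that shape (a hypothetical helper, not the actual change that landed in the fix): if the follow-up read fails, keep the resource ID and return it alongside the error so the engine can checkpoint a partial create rather than dropping the resource from state.

package sketch

import (
    "context"
    "fmt"
)

// finishCreate returns the resource ID even when the post-create read fails, so
// the engine records a partial create that a later refresh can complete.
func finishCreate(
    ctx context.Context,
    id string,
    createErr error,
    read func(ctx context.Context, id string) (map[string]interface{}, error),
) (string, map[string]interface{}, error) {
    state, getErr := read(ctx, id)
    if getErr != nil {
        // Keep the ID; returning it with the error allows a partial-state checkpoint.
        return id, nil, fmt.Errorf("resource partially created but read failed: %v (create error: %v)", getErr, createErr)
    }
    return id, state, nil
}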

@danielrbradley
Member

> What about cancellations? We've seen numerous times that when the Pulumi process was cancelled in some way (e.g. GitHub timeouts), the state became corrupted and created resources had to be removed to proceed (or even to destroy the stack, as they blocked dependencies from cleanup).

This should also be handled already: as long as you're using SIGTERM, Pulumi should request the provider to shut down, which should result in a partial state. If you use SIGKILL, then it's expected that it might leave the state without some resources being fully updated.

danielrbradley added a commit that referenced this issue Apr 18, 2024
If a read fails when trying to return a partial result, ensure we still
return a partial result with the resource identifier so the operation
can be completed using refresh at a later point.

Fixes #3200 - where the partial state failed to be written resulting in
the resource being created but lost from state.
@pulumi-bot pulumi-bot added the resolution/fixed This issue was fixed label Apr 18, 2024