504 GatewayTimeout: Error while creating a ContainerApp caused Pulumi to fail and not add it to state despite the resource being created #3200
Comments
Thanks for writing this up @zbuchheit. It would certainly be useful to have debug logs here to narrow down the specific source of the 504 error, if possible. A 504 gateway timeout should certainly be retryable, but I'm unsure whether this is already implemented within the Azure client library we're using. I expect this is particularly difficult to reproduce, as it was likely caused by a transient error on Azure which was probably resolved in a matter of seconds.
Identification of source areas
After the initial PUT request, which should be fairly fast to respond before handing off to an awaitable operation, we check the response for whether it was successful.
If that check fails, the error is surfaced there: pulumi-azure-native/provider/pkg/provider/provider.go, lines 888 to 890 at e06cde1.
If the read after the operation has completed fails, then a warning should be logged but the create should be marked as successful, though it might lack some of the outputs: pulumi-azure-native/provider/pkg/provider/provider.go, lines 908 to 910 at e06cde1.
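A minimal sketch of that behaviour, using hypothetical names rather than the actual code at the lines referenced above:

```go
package provider

import (
	"context"
	"log"
)

// azureClient is a stand-in for the provider's Azure REST client (hypothetical).
type azureClient interface {
	Get(ctx context.Context, id string) (map[string]interface{}, error)
}

// finishCreate sketches the behaviour described above: once the async operation
// has completed, read the resource back; if that read fails, log a warning and
// still report the create as successful, returning partial outputs.
func finishCreate(ctx context.Context, client azureClient, id string) (map[string]interface{}, error) {
	outputs, err := client.Get(ctx, id)
	if err != nil {
		log.Printf("warning: resource %s was created, but reading its state back failed: %v", id, err)
		return map[string]interface{}{"id": id}, nil // partial outputs only
	}
	return outputs, nil
}
```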
A possible source of an issue could be when we're awaiting the operation: if one of the polling requests is not parsable, it could get interpreted as a creation failure and would therefore surface as the creation failing. However, the handling of all of these possible errors looks sound and will still return the original error: pulumi-azure-native/provider/pkg/azure/client.go, lines 314 to 325 at e06cde1.
It's also possible that we could just set a retry count on the client library, which might resolve this immediate issue of a 504, though it doesn't explain why the partial state isn't created. Currently, I would say it's inconclusive where the 504 is not being handled. Additional information which might help here would be verbose logs that capture the original 504 response.
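For the retry idea, a minimal sketch of what configuring retries could look like, assuming the provider's HTTP calls go through a go-autorest client; the actual wiring inside pulumi-azure-native may differ:

```go
package provider

import (
	"net/http"
	"time"

	"github.com/Azure/go-autorest/autorest"
)

// configureRetries is a sketch of how a go-autorest based client could be told
// to retry on transient gateway errors such as 504.
func configureRetries(c *autorest.Client) {
	c.RetryAttempts = 3
	c.RetryDuration = 5 * time.Second
	// Retry the request for the listed status codes, backing off between attempts.
	c.Sender = autorest.DecorateSender(&http.Client{},
		autorest.DoRetryForStatusCodes(c.RetryAttempts, c.RetryDuration,
			http.StatusBadGateway,         // 502
			http.StatusServiceUnavailable, // 503
			http.StatusGatewayTimeout,     // 504
		))
}
```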
@zbuchheit weren't the logs from the pulumi up we've provided sufficient? It was several hundred MB, AFAIR :) Anyway, 504 or not, the takeaway from this and similar issues we've encountered with the azure-native provider (or Azure in general) is that Pulumi tends to forget about the resources it had tried to create, instead of registering them and verifying on the next occasion. Cancel the process or interrupt the connection and Pulumi might be completely unaware that the resource it had requested is actually there. It leaves the customer in a position where either their resources are duplicated during the next deploy (possibly unknowingly, causing costs/vulnerabilities) or the deployment is in a permanently failed state until manual intervention (import what was created/partially created, or delete that resource). Please correct me if that isn't the case. But overall, our expectation as a customer is that if any problem occurs, Pulumi verifies whether the resource is there or not and retries (unless it's obvious it shouldn't, e.g. validation failed).
Hi @mikocot, I believe the logs you shared didn't include the 504 GatewayTimeout occurrence, which is what I believe @danielrbradley is after when referring to verbose logs, but I will make sure the ones you sent are passed along.
Not possible, as it's not an error that we can reproduce; maybe you can by mocking the requests from Azure. We've only done extensive logging for the later deployments where resources were already created and causing conflicts. We can't really generate 100 MB+ logs for every run, especially since it's only part of the pipeline, and it also makes it slower :)
I destroyed the stack some days ago; it had been open for about two weeks with us paying for the resources both in Pulumi and Azure. It's unfortunately a big problem with many support cases; they cost us a lot in total.
I've sent the console logs directly to Zach, maybe that can help. I don't think they should be published here.
So from what you're saying, Pulumi should mark such a resource as 'potentially created' in its state if, e.g., a network issue happens and it can't tell whether it was created or not? It doesn't lose track of it? What about cancellations: we've seen numerous times that when the Pulumi process was cancelled in some way (e.g. GitHub timeouts), the state became corrupted and created resources had to be removed to proceed (or even to destroy the stack, as they blocked dependencies from cleanup). I'd assume a similar mechanism should protect Pulumi in both cases: some way of registering that a creation is/was ongoing, which is not updated until verified with Azure.
Hi @mikocot, apologies for the delay. I verified that the provider retries on 504, for a total of three retries. Therefore, I don't believe there is more we can do about this particular error. As to your question about Pulumi's approach to state on failure: yes, it can indeed happen that a resource creation seemingly fails and the resource doesn't exist in Pulumi's state, but on the service provider's side it was actually created despite the error return code. Similar for cancellations. Since Pulumi cannot know the real state on the service's side, it needs to make assumptions, which could be wrong either way (created or not created). For pending Creates, refresh can help (also available via pulumi up --refresh).
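To illustrate the refresh suggestion, a minimal sketch using Pulumi's Go Automation API; the stack name "dev" and the project directory "./infra" are placeholders, and the equivalent from the CLI is pulumi refresh followed by pulumi up:

```go
package main

import (
	"context"
	"log"

	"github.com/pulumi/pulumi/sdk/v3/go/auto"
)

func main() {
	ctx := context.Background()

	// "dev" and "./infra" are placeholders for an existing stack and project directory.
	stack, err := auto.UpsertStackLocalSource(ctx, "dev", "./infra")
	if err != nil {
		log.Fatal(err)
	}

	// Refresh reconciles Pulumi's state with what actually exists in Azure,
	// which is how a pending create can be resolved.
	if _, err := stack.Refresh(ctx); err != nil {
		log.Fatal(err)
	}

	// Then apply the desired configuration as usual.
	if _, err := stack.Up(ctx); err != nil {
		log.Fatal(err)
	}
}
```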
I've created an engine issue to track improving this situation: pulumi/pulumi#15958.
Looking a little closer at the error message, it's actually two separate errors, which can help us narrow down the issue:
This error is generated from this code:
(3) is the error from the follow-up GET for the resource's state.
Proposal
If the GET for the state also fails, we should still return the partial create result, but with the state missing. This will allow the create to be easily fixed with a refresh, as suggested by @thomas11 above.
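A sketch of what returning that partial result could look like, assuming the gRPC provider surfaces partial state via pulumirpc.ErrorResourceInitFailed; the function name and wiring here are illustrative, not the provider's actual code:

```go
package provider

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/types/known/structpb"

	pulumirpc "github.com/pulumi/pulumi/sdk/v3/proto/go"
)

// partialCreateError sketches the proposal: even when the follow-up GET for the
// resource's state fails, return a partial-init error that still carries the
// resource ID (and any known properties) so the engine records the resource and
// a later refresh can complete it.
func partialCreateError(id string, props *structpb.Struct, cause error) error {
	detail := &pulumirpc.ErrorResourceInitFailed{
		Id:         id,
		Properties: props,
		Reasons:    []string{cause.Error()},
	}
	st, err := status.New(codes.Unknown, cause.Error()).WithDetails(detail)
	if err != nil {
		return cause // fall back to the plain error if details can't be attached
	}
	return st.Err()
}
```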
This should also be handled already: as long as you're using SIGTERM, Pulumi should request the provider to shut down, which should result in a partial state. If you use SIGKILL, then it's expected that it might leave the state without some resources being fully updated.
If a read fails when trying to return a partial result, ensure we still return a partial result with the resource identifier so the operation can be completed using refresh at a later point. Fixes #3200, where the partial state failed to be written, resulting in the resource being created but lost from state.
What happened?
While attempting to create a container app I encountered a 504 GatewayTimeout where creating the container apps timed out at 800s, but they ended up actually being created in Azure. A subsequent pulumi up results in a 'resource already exists' error. This requires some manual intervention where I need to import the resource or, alternatively, delete the resource in the portal (resulting in downtime) and recreate it. I would consider this to be a transient error and would prefer Pulumi to be able to recover from a scenario like this.
Error Message
Example
Container App Code
I don't have a consistent repro as this is a network-related error.
Output of pulumi about
Additional context
In doing discovery in the logs, I see the workflow for creating this resource is as follows: a GET to check if the resource exists; if not, a PUT on the Azure endpoint to create the resource, which provides an Azure-AsyncOperation URL to perform GET calls against and watch for the operation to complete. It then polls this until it successfully completes. I don't have a verbose copy of the logs to verify whether the 504 comes from the PUT or the subsequent GET calls, but I wonder if there is a possibility of adding some handling for a scenario like this: when a 504 is encountered, another GET is fired off with a retry timer to check the resource, and if the resource doesn't exist after a reasonable threshold, bubble up the error (a sketch of what that could look like follows below). Much like #1266, this causes some friction in use of this provider and causes users to lose confidence in it.
Contributing
Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).