
DynamoDB returns operation error use of closed network #1825

Closed
satorunooshie opened this issue Aug 31, 2022 · 31 comments
Labels
bug This issue is a bug. needs-reproduction This issue needs reproduction. p2 This is a standard priority issue

Comments

@satorunooshie

Describe the bug

The SDK often returns the error below after we updated the SDK version.

Actual error:

operation error DynamoDB: Query, $tableName: https response error StatusCode: 200, RequestID: 123456, deserialization failed, failed to decode response body, read tcp $src->$dst: use of closed network connection

Expected Behavior

No errors

Current Behavior

operation error DynamoDB: Query, $tableName: https response error StatusCode: 200, RequestID: 123456, deserialization failed, failed to decode response body, read tcp $src->$dst: use of closed network connection

Reproduction Steps

	// defaultCfg is loaded via config.LoadDefaultConfig and expr is built with the
	// feature/dynamodb/expression package (builders omitted here).
	ddb := dynamodb.NewFromConfig(defaultCfg)
	result, err := ddb.Query(
		ctx,
		&dynamodb.QueryInput{
			TableName:                 aws.String(tableName),
			ExpressionAttributeNames:  expr.Names(),
			ExpressionAttributeValues: expr.Values(),
			KeyConditionExpression:    expr.KeyCondition(),
		},
	)
	// err intermittently contains the "use of closed network connection" error above.

Possible Solution

No response

Additional Information/Context

No response

AWS Go SDK V2 Module Versions Used

	github.com/aws/aws-sdk-go-v2 v1.16.8
	github.com/aws/aws-sdk-go-v2/config v1.15.15
	github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.9.8
	github.com/aws/aws-sdk-go-v2/feature/dynamodb/expression v1.4.14
	github.com/aws/aws-sdk-go-v2/service/dynamodb v1.15.10
	github.com/aws/smithy-go v1.12.0

Compiler and Version used

1.19

Operating System and version

Linux

@satorunooshie satorunooshie added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 31, 2022
@RanVaknin RanVaknin self-assigned this Aug 31, 2022
@RanVaknin
Contributor

Hi @satorunooshie ,

I saw that you opened 2 issues, but I'm going to answer them both here:

The reason you are seeing a status 200 followed by a deserialization error is that the DynamoDB server is able to send back a valid response, but the connection errors out before the client has finished reading the body of the successful response. This can happen for many reasons and is difficult to pinpoint, especially when the pattern is inconsistent, as in your case.

Regarding your other ticket and PR:
We cannot accept your PR to override the Retryer behavior because not every API operation is idempotent. The service API models don't carry information to determine whether an operation is idempotent, so the SDKs cannot assume it is safe to retry.

Proposed solution:
You can implement your own Retryer at the client level or at the operation level and configure it to retry when the connection fails.
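For example, a minimal sketch of a per-operation override might look like this (the wrapper type, the table name, and the surrounding ddb/ctx variables are placeholders, not part of the SDK):

import (
	"errors"
	"net"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
)

// retryNetErrClosed wraps the client's default Retryer and additionally
// treats net.ErrClosed as retryable.
type retryNetErrClosed struct {
	aws.Retryer
}

func (r retryNetErrClosed) IsErrorRetryable(err error) bool {
	return errors.Is(err, net.ErrClosed) || r.Retryer.IsErrorRetryable(err)
}

// Override the retryer for this single Query call via the per-call functional options.
out, err := ddb.Query(ctx, &dynamodb.QueryInput{
	TableName: aws.String("example-table"),
	// ... key condition expression and attribute maps as usual ...
}, func(o *dynamodb.Options) {
	o.Retryer = retryNetErrClosed{o.Retryer}
})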

Let me know if you have any other questions and I can answer them here 🙂.

Thank you!
~Ran

@RanVaknin RanVaknin added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed needs-triage This issue or PR still needs to be triaged. labels Aug 31, 2022
@github-actions

github-actions bot commented Sep 3, 2022

This issue has not received a response in 1 week. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added closing-soon This issue will automatically close in 4 days unless further comments are made. closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Sep 3, 2022
@github-actions github-actions bot closed this as completed Sep 8, 2022
@SaikyoSaru

We are seeing the same issue as well; can we please resurrect this issue?

@psmarcin

psmarcin commented Sep 29, 2022

I was able to fix this issue, but I wasn't able to reproduce the error 100% of the time, so I can't create a PR.

Before deploying my fix we had ~6 errors per minute; after the deploy there are no errors.

Here is my fork of smithy-go: https://github.com/psmarcin/smithy-go. There is only one small change: removing the conditional body closing.

You can give it a try by simply adding this to your go.mod file:

replace github.com/aws/smithy-go => github.com/psmarcin/smithy-go v0.0.0-20220923084642-9e31178f1156

@cwd-nial

cwd-nial commented Dec 8, 2022

@RanVaknin or any other contributor, can you review the solution @psmarcin is suggesting? We are having the same issue mentioned above:

operation error DynamoDB: Query, https response error StatusCode: 200, RequestID: 1263a6b7-ebec-4a51-9594-b647545da4fa, deserialization failed, failed to decode response body, read tcp 172.17.0.12:36206->172.17.0.15:8000: use of closed network connection\n

@RanVaknin
Contributor

Hi everyone,

I did not see the activity on this thread after it was closed.
My last attempt at reproducing this was not fruitful, but I will take another look at it and review the fix @psmarcin suggested.

Thank you,
Ran~

@RanVaknin RanVaknin reopened this Dec 9, 2022
@RanVaknin RanVaknin added needs-review This issue or pull request needs review from a core team member. p2 This is a standard priority issue and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. closed-for-staleness labels Dec 9, 2022
@RanVaknin RanVaknin assigned isaiahvita and unassigned RanVaknin Dec 12, 2022
@psmarcin

@RanVaknin I have been running my fork in a high-load environment without any problems for about 3 months now. Would love to know the root cause.

@cwd-nial

cwd-nial commented Dec 15, 2022

In our case, it is a weird combination of the smithy version and the Go version that causes the issue:
go 1.19 + smithy 1.10.0 = OK
go 1.19 + smithy 1.11.1 = fails sometimes
go 1.19 + smithy 1.13.5 = fails sometimes

The behavior is not consistent; it sometimes fails and sometimes does not, so these are not proven OK/fail combinations. But before our upgrade from Go 1.17 to 1.19 we had no problems (and we were using smithy v1.13.4...).

So anyway, it looks to me like @psmarcin's fix of taking out the closing block could work, but what would the downsides of this missing Close call be?

In the original commit that introduced it, the code reads:

// HTTP RoundTripper *should* close the request body. But this may not happen in a timely manner.
// So instead Smithy *Request Build wraps the body to be sent in a safe closer that will clear the
// stream reference so that it can be safely reused.
if builtRequest.Body != nil {
	_ = builtRequest.Body.Close()
}

So I'm not sure what effect removing it would have...

@sugymt

sugymt commented Jan 2, 2023

We probably have the same problem with SQS DeleteMessage.
(Request ID and source IP are masked.)

operation error SQS: DeleteMessage, https response error StatusCode: 200, RequestID: xxxx-xxxx-xxxx-xxxx-xxxx, deserialization failed, failed to discard response body, read tcp xxx.xxx.xxx.xxx:xxx->18.183.37.45:443: use of closed network connection

go 1.19.3, aws-sdk-go-v2 v1.17.2

@RanVaknin RanVaknin added p1 This is a high priority issue and removed p2 This is a standard priority issue labels Jan 4, 2023
@a-h

a-h commented Jan 5, 2023

I've seen this issue sporadically (4-5 times) in the last few weeks in a failing unit test which connects to DynamoDB Local running in Docker, but I haven't seen it in production.

https response error StatusCode: 200, RequestID: 94f949c6-46fe-47ad-9a8a-6227339b616c, deserialization failed, failed to decode response body, read tcp 172.18.0.2:57392->172.18.0.3:8000: use of closed network connection

I have tried to reproduce it locally, but with no success. I've tried multithreading calls to DynamoDB Local, stressing it, etc.

In production, I've got 1M+ daily DynamoDB calls using github.com/aws/aws-sdk-go-v2/service/dynamodb v1.17.8 running without issue.

@iwata

iwata commented Jan 19, 2023

I am using this package with AWS Rekognition and get the same error. I am on the latest version; when will this be fixed? Or is there any workaround?

	github.com/aws/aws-sdk-go-v2 v1.17.3
	github.com/aws/aws-sdk-go-v2/config v1.18.8
	github.com/aws/aws-sdk-go-v2/service/rekognition v1.23.0
	github.com/aws/smithy-go v1.13.5

@lucix-aws lucix-aws added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Aug 24, 2023
@github-actions github-actions bot added closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Aug 29, 2023
@jcarter3

jcarter3 commented Oct 9, 2023

This is just one of a thousand paper cuts caused by the use of smithy-generated code. We've already had to "downgrade" a few services to aws-sdk-go v1 to work around a bunch of S3 edge cases. Is downgrading the preferred solution for all of these issues?

drshriveer added a commit to drshriveer/smithy-go that referenced this issue Jan 30, 2024
@BananZG

BananZG commented Feb 19, 2024

I just upgraded my project from SDK v1 to the latest SDK v2 and started seeing this error.

github.com/aws/aws-sdk-go-v2/service/dynamodb v1.28.1
github.com/aws/smithy-go v1.20.0

@drshriveer

We are also having significant developer pain when running tests with GitHub Actions. Once we reach a critical mass of tests, they begin to fail consistently, not intermittently. As others have suggested, we tried forking the repo and commenting out the suggested block of code. I also concluded (as did @isaiahvita) that the commented-out block should not have an impact on this issue. But with nothing else to go on, it was worth a try. And removing the block of code did stop failures (at least until we built up another critical mass of tests -- so for a month or so, a beautiful, consistent month).

I spent more time attempting to debug the issue in smithy itself, since I had a path to reproduce it 100% of the time, and quickly learned that any extra infrastructure I added to help debug (with or without the commented-out block) tended to "fix" the issue.

To me this seems to indicate there is a real race condition somewhere in the code, and that simply adding or subtracting work shuffles load just enough that the issue is less likely to occur. The fact that the issue only occurs after upgrading to SDK v2 should also be a clear sign of this. Alas, I did not have time to continue my investigation, and now the error is back.

Please re-open the issue.

@zapisanchez

go 1.21
github.com/aws/aws-sdk-go-v2 v1.25.2
github.com/aws/aws-sdk-go-v2/config v1.27.5
github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.13.7
github.com/aws/aws-sdk-go-v2/feature/dynamodb/expression v1.7.7
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.30.2
github.com/aws/aws-sdk-go-v2/service/s3 v1.51.2
github.com/aws/smithy-go v1.20.1

Well... another one here. It is incredible that after 2 years there is still no official fix.

We are facing the same problem. We are trying to read the response body, but we get a "use of closed network connection" error instead.

It's clear that closing the body, as @psmarcin and @isaiahvita indicate, causes a race condition.

It would be nice to have a better solution than maintaining our own fork with those lines commented out.

@stan-stately

I'm also seeing this. It is making my CI basically unusable.

@RanVaknin RanVaknin reopened this Apr 5, 2024
@RanVaknin
Contributor

Hi everyone,

Since more and more people are reporting being impacted, I have re-opened this issue.

In our previous reproduction attempts we were not able to recreate the reported behavior. The next step is to ask one of the customers here on the thread to provide a minimal GitHub repository with code that can reliably reproduce the behavior (even if it means seeing intermittent errors reliably).

That would give us a little more to go off of.

Thanks again,
Ran~

@RanVaknin RanVaknin added bug This issue is a bug. needs-reproduction This issue needs reproduction. p2 This is a standard priority issue and removed guidance Question that needs advice or information. labels Apr 5, 2024
@CurryFishBalls9527

We are also seeing this issue happen intermittently in all of our environments.

@mcblair

mcblair commented Apr 8, 2024

We have been seeing this intermittently in all of our environments too. I'm curious whether, in the short term, avoiding concurrent use of the DynamoDB client would reduce the frequency of occurrence. We will try to put together a minimal example to reproduce the behavior, but I'm not sure when we can.
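Something like this minimal sketch is what I have in mind for that experiment (the type and field names are placeholders; this is not a proposed fix, just a way to test whether concurrency is a factor):

import (
	"context"
	"sync"

	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
)

// serializedDDB funnels every Query through a mutex so the shared client is
// never used concurrently.
type serializedDDB struct {
	mu  sync.Mutex
	ddb *dynamodb.Client
}

func (s *serializedDDB) Query(ctx context.Context, in *dynamodb.QueryInput) (*dynamodb.QueryOutput, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.ddb.Query(ctx, in)
}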

@csepulveda

I get the same issue using github.com/aws/aws-sdk-go-v2/service/sqs v1.31.4.
The issue showed up after we upgraded from SDK v1 to SDK v2.

@RanVaknin
Contributor

Hi @CurryFishBalls9527 , @mcblair , @csepulveda ,

Please provide a complete reproduction code so we can re-attempt to reproduce this behavior.

#1825 (comment)

Thanks,
Ran~

@lucix-aws
Contributor

To those affected by this issue, two things:

retrying these errors

You can make the underlying error net.ErrClosed retryable in your client with the following:

import (
	"errors"
	"net"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

type retryErrClosed struct {
	aws.Retryer
}

func (r *retryErrClosed) IsErrorRetryable(err error) bool {
	return errors.Is(err, net.ErrClosed) || r.Retryer.IsErrorRetryable(err)
}

// cfg comes from config.LoadDefaultConfig (omitted here).
svc := s3.NewFromConfig(cfg, func(o *s3.Options) {
	o.Retryer = &retryErrClosed{o.Retryer}
})

This should be safe to do for get-type operations such as dynamodb.Query.

root-causing

As of this writing we have been unable to reproduce this in-house whatsoever. I'm going to ask that anyone affected by this issue implement some basic net.Conn tracing to help grant more visibility into where the underlying Close() calls are coming from. The following DialContext implementation, when provided to a transport, will save the stack trace on a Close() call, and dump it to stdout when a Read() on the underlying conn catches this.

If you are affected by this issue and are in a position to do so, please implement this tracing logic in your clients and share any of the stack traces produced from it:

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"runtime/debug"

	awshttp "github.com/aws/aws-sdk-go-v2/aws/transport/http"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

type dialContext func(ctx context.Context, network, addr string) (net.Conn, error)

func traceConn(dc dialContext) dialContext {
	return func(ctx context.Context, network, addr string) (net.Conn, error) {
		conn, err := dc(ctx, network, addr)
		return &tracedConn{Conn: conn}, err
	}
}

type tracedConn struct {
	net.Conn
	closeTrace string
}

func (c *tracedConn) Read(p []byte) (int, error) {
	n, err := c.Conn.Read(p)
	if len(c.closeTrace) > 0 {
		fmt.Println()
		fmt.Println("!!! READ ON CLIENT-CLOSED CONN !!!")
		fmt.Println(c.closeTrace)
		fmt.Println("!!! -(end trace)-------------- !!!")
		fmt.Println()
	}
	return n, err
}

func (c *tracedConn) Close() error {
	c.closeTrace = string(debug.Stack())
	return c.Conn.Close()
}

func main() {
	// ...

	svc := s3.NewFromConfig(cfg, func(o *s3.Options) {
		client := o.HTTPClient.(*awshttp.BuildableClient)
		o.HTTPClient = client.WithTransportOptions(func(t *http.Transport) {
			t.DialContext = traceConn(t.DialContext)
		})
	})

	// ...
}
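The same transport hook can be attached to the DynamoDB client (which most reports in this thread involve); a minimal sketch, assuming the dynamodb service package is imported and cfg is loaded as above:

ddb := dynamodb.NewFromConfig(cfg, func(o *dynamodb.Options) {
	client := o.HTTPClient.(*awshttp.BuildableClient)
	o.HTTPClient = client.WithTransportOptions(func(t *http.Transport) {
		t.DialContext = traceConn(t.DialContext)
	})
})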

@minhquang4334

minhquang4334 commented Jun 23, 2024

I'm also seeing this when using aws-sdk-go-v2 v1.27.2 and aws-sdk-go-v2/service/dynamodb v1.32.8

@cenapguldh

We also see these errors (but in SQS ReceiveMessage) in all of our environments.

operation error SQS: ReceiveMessage, https response error StatusCode: 200, RequestID: xxxx-xxx-xxxxx--xxxx, deserialization failed, failed to decode response body, read tcp src->dst: use of closed network connection

github.com/aws/aws-sdk-go-v2/service/sqs v1.33.1
github.com/aws/aws-sdk-go-v2 v1.30.0
github.com/aws/smithy-go v1.20.2

@RanVaknin
Contributor

Hi @iwata @csepulveda @a-h @CurryFishBalls9527 @drshriveer @cenapguldh @minhquang4334 @satorunooshie @cwd-nial @mikkael131 @BananZG @mcblair @sugymt @SaikyoSaru @zapisanchez @jcarter3 @psmarcin @drshriveer and anyone else I might have missed here.

First off, thank you all for your continued patience and detailed contributions to this thread. We recognize that this issue has been persistent and challenging.

Despite our efforts, we still haven’t been able to reproduce this issue internally, which complicates our ability to pinpoint the exact cause and develop a fix. Both @lucix-aws and I have requested additional information to aid in this investigation, but it seems we need to make these requests more visible.

To ensure that all necessary details are clearly visible and to help guide new people running into this issue, we are going to lock this thread, and I'm also going to pin it. We ask that anyone still experiencing this problem please open a new issue and include all the details outlined in @lucix-aws's recent comment. This information is crucial for us to attempt a detailed reproduction of the problem.

Thank you once again for your efforts and cooperation.
Ran~

@RanVaknin RanVaknin closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 11, 2024
@aws aws locked as off-topic and limited conversation to collaborators Jul 11, 2024
@RanVaknin RanVaknin pinned this issue Jul 11, 2024
@lucix-aws lucix-aws unpinned this issue Sep 10, 2024