Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ostree-fetcher-curl: handle non 404 errors as G_IO_ERROR_TIMED_OUT #2843

Merged
merged 1 commit into from
Oct 18, 2023

Conversation

jmarrero
Copy link
Member

@jmarrero jmarrero commented Apr 3, 2023

This introduces the retry-all-network-errors option which
is enabled by default. This is a behavior change as now
ostree will retry on requests that fail except when
they fail with NOT_FOUND. It also introduces the options
low-speed-limit-bytes and low-speed-time-seconds these
map to CURL options only at the moment. Which have defaults
set following librepo:
https://github.com/rpm-software-management/librepo/blob/7c9af219abd49f8961542b7622fc82cfdaa572e3/librepo/handle.h#L90
https://github.com/rpm-software-management/librepo/blob/7c9af219abd49f8961542b7622fc82cfdaa572e3/librepo/handle.h#L96
Currently this changes only apply when using libcurl.

Closes: #2570

@jmarrero jmarrero marked this pull request as draft April 3, 2023 23:13
Copy link
Member

@dbnicholson dbnicholson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems wrong. There are lots of possible curl errors and they have nothing to do with HTTP errors such as 404. See https://curl.se/libcurl/c/libcurl-errors.html. Many of those errors should not be retried. I suppose in most cases there's no harm in retrying, but I think it would make more sense to match specific errors that should be retried.

@cgwalters
Copy link
Member

See the context in this function; this code is only run when libcurl fails for a non-HTTP reason (e.g. TCP level errors). We already have specific error handling for HTTP codes just below this.

@jmarrero
Copy link
Member Author

jmarrero commented Apr 3, 2023

I likely misguided Dan with my commit message talking 404s and http. Thanks for clarifying, will fix the commit msg.

Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though indeed looking at the list, it does look like there's quite a few things where it doesn't make sense to retry. I think ideally I agree we'd special-case those that do, but also agree it probably doesn't hurt either to be more brute-force.

@@ -329,7 +329,7 @@ check_multi_info (OstreeFetcher *fetcher)
}
else
{
g_task_return_new_error (task, G_IO_ERROR, G_IO_ERROR_FAILED,
g_task_return_new_error (task, G_IO_ERROR, G_IO_ERROR_TIMED_OUT,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably shouldn't retry if is_file is set?

@jmarrero jmarrero force-pushed the retry branch 3 times, most recently from 0a2d4ec to 435ae57 Compare April 6, 2023 09:21
Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cgwalters
Copy link
Member

How did testing go for this?

@jmarrero
Copy link
Member Author

jmarrero commented May 1, 2023

I have not had time to test this, in the last two weeks sorry. Will give it some time this week if the other bugs allow.

@cgwalters cgwalters added difficulty/medium medium complexity/difficutly issue reward/high Fixing this will result in significant benefit triaged This issue has been evaluated and is valid labels May 30, 2023
@cgwalters
Copy link
Member

I've rebased 🏄 this on main
And I added an official Closes: https://github.com/ostreedev/ostree/issues/2570
/test all

This issue just came up again in https://gitlab.apertis.org/infrastructure/apertis-issues/-/issues/349

@cgwalters
Copy link
Member

I was playing with
iptables -A OUTPUT -p tcp --dport 8000 -m statistic --mode random --probability 0.005 -j REJECT --reject-with tcp-reset
but this seems to have the strange semantic that at some point, the pull just hangs. And digging in briefly, the reason seems to be that curl has a "connection cache" and at some point it reaches "too many closed connections to this host" and just stops trying to make new ones. That's kind of annoying.

@cgwalters
Copy link
Member

I was also trying iptables -F OUTPUT; iptables -A OUTPUT -p tcp --dport 8000 -m statistic --mode random --probability 0.05 -j DROP but the problem with this is that TCP papers over the problem and we don't see errors at the HTTP layers (where this code would mainly come into play)...it's just really slow.

To really test this and reproduce the originating issue I think we need a HTTP/2 webserver which is intentionally returning errors. I'm sure it can be done, but I'm having trouble in a quick search finding anything...I would think one of the clients would have a setup for this.

Dunno...I was originally going to say we should merge this, but the "pulls just hang on too many errors" effect I discovered is concerning. With this we may potentially switch from erroring out too easily to just hanging indefinitely, which is definitely worse.

@champtar
Copy link

@cgwalters if it hangs forever it's just a bug to be fixed in curl

For the HTTP/2 webserver it was an nginx running on Fedora (latest version available when opening #2570), no http2 specific tunings, don't know if it's still happening with newer versions

@lucas-akira
Copy link

If more concrete results are needed before approving this pull request:

Among other code changes, I've incorporated this commit to do OSTree updates with aktualizr in an unreliable and low bandwidth connection i.e. download speeds around 10 kB/s with disconnection periods lasting a few minutes.

I've incorporated this commit in OSTree tag 2020.3. Before applying it the update would quickly fail after the first few libcurl errors e.g. CURLE_PARTIAL_FILE. After applying it the update doesn't fail mid-way through anymore, even with the same libcurl errors being thrown during download. I assume the retry logic is now being executed when those happen.

With it I was able to successfully download a 190 MB OSTree update with the low bandwidth connection described above, taking around 9h to complete.

One other thing I added was a timeout (CURLOPT_TIMEOUT) for the libcurl transfers to avoid having the update hanging indefinitely in case the network disconnects completely.

Hope this information is useful.

@jmarrero
Copy link
Member Author

@lucas-akira this is very helpful, thank you. I will resume working on this PR, hopefully next week.

@cgwalters
Copy link
Member

Hmm, well...one thing we could do perhaps is chicken out and make this a runtime option, via e.g. an environment variable. That would allow people to choose which bugs they want...if other environments aren't hitting the curl bug I was seeing on (then current) Fedora they could opt-in e.g.

What may also help is for us to add some sort of giant timeout on top of the entire pull process.

@jmarrero
Copy link
Member Author

jmarrero commented Sep 5, 2023

I re-tested with both:

iptables -A OUTPUT -p tcp --dport 8000 -m statistic --mode random --probability 0.005 -j REJECT --reject-with tcp-reset

and

iptables -F OUTPUT; iptables -A OUTPUT -p tcp --dport 8000 -m statistic --mode random --probability 0.05 -j DROP

on fedora and don't see the infinite hangs described on: #2843 (comment)

The process completes successfully and we see the retries:

(/run/hostsrv/github/ostreedev/ostree/.libs/ostree pull:3149): OSTree-DEBUG: 21:16:43.282: write of a823ebad56c043dd9bd17560d65938dc36e79b40c4dfdf76e37349b72dd92aac.file complete
(/run/hostsrv/github/ostreedev/ostree/.libs/ostree pull:3149): OSTree-DEBUG: 21:16:43.297: fetch of b43c71d5ef6993bb1bba0b005159f6dedf3446af97f5bf7cde8947b12ef44a76.file complete
(/run/hostsrv/github/ostreedev/ostree/.libs/ostree pull:3149): OSTree-DEBUG: 21:16:43.298: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5
(/run/hostsrv/github/ostreedev/ostree/.libs/ostree pull:3149): OSTree-DEBUG: 21:16:43.300: fetch of b5736b0253c681c70bea3f68dc4f6bd76aee325c0d6694cba99d6d1579903d47.file complete
(/run/hostsrv/github/ostreedev/ostree/.libs/ostree pull:3149): OSTree-DEBUG: 21:16:43.300: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5

...
b43c71d5ef6993bb1bba0b005159f6dedf3446af97f5bf7cde8947b12ef44a76.file complete
(/run/hostsrv/github/ostreedev/ostree/.libs/ostree pull:3149): OSTree-DEBUG: 21:16:43.303: pull: idle, exiting mainloop
(/run/hostsrv/github/ostreedev/ostree/.libs/ostree pull:3149): OSTree-DEBUG: 21:16:43.303: Committing transaction in repository 0x66c620
(/run/hostsrv/github/ostreedev/ostree/.libs/ostree pull:3149): OSTree-DEBUG: 21:16:43.736: txn commit staging-db81f3da-910d-418d-b019-fb3f6ac5ce4a-hUkJjC
(/run/hostsrv/github/ostreedev/ostree/.libs/ostree pull:3149): OSTree-DEBUG: 21:16:43.747: Popping lock non-blocking with timeout 30
(/run/hostsrv/github/ostreedev/ostree/.libs/ostree pull:3149): OSTree-DEBUG: 21:16:43.747: Pop lock: state=shared, depth=1
(/run/hostsrv/github/ostreedev/ostree/.libs/ostree pull:3149): OSTree-DEBUG: 21:16:43.747: Unlocking repo
1180 metadata, 6047 content objects fetched; 106853 KiB transferred in 25 seconds; 118.1 MB content written                                               

Curl version is:

curl-8.0.1-2.fc38.x86_64.fc38.x86_64

Going to try centos stream now and see if I can replicate the issue.

@jmarrero
Copy link
Member Author

jmarrero commented Sep 5, 2023

I guess that is not seeing the re-tries... It's just the debug output. mmm I am not seeing error: set to anything other than unset... @cgwalters Is there something specific I should be seeing here?

quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request Apr 17, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request May 16, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request May 17, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request May 17, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request May 20, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request May 28, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request May 29, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request May 31, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request Jun 11, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request Jun 11, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request Jun 26, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request Jun 28, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request Jul 2, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request Jul 15, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request Jul 15, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request Jul 16, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request Jul 23, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request Aug 19, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request Aug 22, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request Sep 16, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request Sep 23, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request Sep 23, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request Oct 8, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request Oct 9, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request Oct 14, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request Oct 16, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to foundriesio/meta-lmp that referenced this pull request Oct 28, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-lmp that referenced this pull request Oct 31, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
quaresmajose added a commit to quaresmajose/meta-updater that referenced this pull request Nov 7, 2024
Fixed Upstream in PR2843 and available v2024.5
ostreedev/ostree#2843

Signed-off-by: Jose Quaresma <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty/medium medium complexity/difficutly issue reward/high Fixing this will result in significant benefit triaged This issue has been evaluated and is valid
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Failure when receiving data from from the peer" is not retried
7 participants