Retry on known flakey errors #211

Open
2 of 7 tasks
tmcgilchrist opened this issue Mar 2, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@tmcgilchrist
Member

tmcgilchrist commented Mar 2, 2023

The base image builder regularly fails with this transient error:

failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: rpc error: code = Unknown desc = ocurrent/opam-staging@sha256:90f036ba79b70d23d08aad99dcaaf9594bc953c9b9a16dd0aa7cee4894939512: failed to do request: Head "https://registry-1.docker.io/v2/ocurrent/opam-staging/manifests/sha256:90f036ba79b70d23d08aad99dcaaf9594bc953c9b9a16dd0aa7cee4894939512": dial tcp: lookup registry-1.docker.io: Temporary failure in name resolution
docker-build failed with exit-code 1
2023-03-02 15:27.29: Job failed: Failed: Build failed

It would be useful to immediately retry on known flakey errors.
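
A minimal sketch of what matching on known flakey errors could look like, written in Python purely for illustration (the builder itself is OCaml). The patterns are drawn from the error messages quoted in this issue; the names `FLAKEY_PATTERNS` and `is_known_flakey` are hypothetical, and a real list would need curating.

```python
import re

# Hypothetical pattern list, drawn from the error messages quoted in this issue.
FLAKEY_PATTERNS = [
    re.compile(r"Temporary failure in name resolution"),
    re.compile(r"error parsing HTTP 400 response body"),
    re.compile(r"frontend grpc server closed unexpectedly"),
    re.compile(r"Curl failed"),
    re.compile(r"RPC failed; curl \d+"),
]


def is_known_flakey(log: str) -> bool:
    """Return True if the job log contains any known flakey error pattern."""
    return any(pattern.search(log) for pattern in FLAKEY_PATTERNS)


if __name__ == "__main__":
    log = "dial tcp: lookup registry-1.docker.io: Temporary failure in name resolution"
    print(is_known_flakey(log))  # prints True
```

A job whose log matches could then be rescheduled instead of being reported as failed; anything that does not match is treated as a genuine failure.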

Prerequisite

Known flakey errors

Flakey errors on docker-build:

Flakey errors on docker-push:

  • error parsing HTTP 400 response body: invalid character '<' looking for beginning of value: "<html><body><h1>400 Bad request</h1>\nYour browser sent an invalid request.\n</body></html>\n\n" @ Retry on known flakey errors #211 (comment)

Flakey errors on docker authentication:

@tmcgilchrist tmcgilchrist added the bug Something isn't working label Mar 2, 2023
@shonfeder shonfeder self-assigned this May 5, 2024
@shonfeder shonfeder added the good first issue Good for newcomers label May 22, 2024
@shonfeder shonfeder removed their assignment May 27, 2024
@shonfeder shonfeder changed the title Retry on docker registry lookup error Retry on know flakey errors Jul 7, 2024
@shonfeder shonfeder changed the title Retry on know flakey errors Retry on known flakey errors Jul 8, 2024
@shonfeder
Contributor

shonfeder commented Jul 10, 2024

The most frequent category of error I've seen in my brief time monitoring this so far is:

#66 exporting to image
#66 sha256:e8c613e07b0b7ff33893b694f7759a10d42e180f2b4dc349fb57dc6b71dcab00
#66 exporting layers
#66 exporting layers 21.8s done
#66 writing image sha256:03412a710050776f1862e9fd56b91e966e2858a71dc9f1dd821ffaf2aacc48f5 done
#66 DONE 21.8s
Pushing "sha256:184fd11abc04659d6ab0071aa1737ca3bcacee1f6b612807a6e6d5c937ece74b" to "ocurrent/opam-staging:ubuntu-20.04-opam-amd64" as user "ocurrentbuilder"
Login Succeeded
The push refers to repository [docker.io/ocurrent/opam-staging]
f3ef22358981: Preparing
error parsing HTTP 400 response body: invalid character '<' looking for beginning of value: "<html><body><h1>400 Bad request</h1>\nYour browser sent an invalid request.\n</body></html>\n\n"
docker-push failed with exit-code 1
2024-07-10 22:04.15: Job failed: Failed: Build failed
2024-07-10 22:04.15: Log analysis:
2024-07-10 22:04.15: >>> docker-push failed (score = 20)
2024-07-10 22:04.15: docker-push failed

@shonfeder shonfeder self-assigned this Jul 11, 2024
@shonfeder shonfeder removed the good first issue Good for newcomers label Jul 11, 2024
@shonfeder
Contributor

2024-07-10 22:06.27: Will push staging image to ocurrent/opam-staging:debian-11-ocaml-4.03-i386
...
2024-07-10 22:06.27: Using cache hint "4.03.0-i386-ocurrent/opam-staging@sha256:0d421a01a2b832eaedec31c05dd0a87c337f036465a21e2b2e8af3f119b7578f"
2024-07-10 22:06.27: Waiting for resource in pool OCluster
2024-07-10 22:06.27: Waiting for worker…
2024-07-10 22:31.06: Got resource from pool OCluster
Building on x86-bm-c19.sw.ocaml.org
#2 [internal] load .dockerignore
#2 sha256:76716ffcb3cd99c3c374f52e5a45d9687189bdc321ad01196ed7d303fd040a64
#2 transferring context: 2B done
#2 DONE 0.4s

#1 [internal] load build definition from Dockerfile
#1 sha256:d1bbe7c7ab4dfa90070df180f90f841aeea20b486293a65facddf4ce6a55344f
#1 transferring dockerfile: 615B done
#1 DONE 0.3s

#3 resolve image config for docker.io/docker/dockerfile:1
#3 sha256:ac072d521901222eeef550f52282877f196e16b0247844be9ceb1ccc1eac391d
#3 DONE 1.7s

#4 docker-image://docker.io/docker/dockerfile:1@sha256:e87caa74dcb7d46cd820352bfea12591f3dba3ddc4285e19c7dcd13359f7cefd
#4 sha256:971261c9ec3d04b863c2e7e2301e85e136e954ddc12cdaba999b549fa96d15de
#4 CACHED
failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: frontend grpc server closed unexpectedly
docker-build failed with exit-code 1
2024-07-10 22:32.14: Job failed: Failed: Build failed

@shonfeder
Contributor

Network issues when fetching sources are another source of flakey failures. See https://github.com/tarides/infrastructure/issues/338#issuecomment-2229229672

This error happens while opam is executing, e.g.:

#9 [2/5] RUN opam switch create 4.12 --packages=ocaml-variants.4.12.1+options,ocaml-options-only-fp
#9 sha256:274702d28af2649859867b3e2c572ebe7f008e65afcd884b035e062145beeafa
#9 7.654 
#9 7.654 <><> Gathering sources ><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
#9 8.320 [ocaml-config.2/gen_ocaml_config.ml.in] downloaded from https://raw.githubusercontent.com/ocaml/opam-source-archives/main/patches/ocaml-config/gen_ocaml_config.ml.in.2
#9 25.79 [ocaml-variants.4.12.1+options] downloaded from https://github.com/ocaml/ocaml/archive/4.12.1.tar.gz
#9 27.11 [ocaml-variants.4.12.1+options/alt-signal-stack.patch] downloaded from https://github.com/ocaml/ocaml/commit/1eeb0e7fe595f5f9e1ea1edbdf785ff3b49feeeb.patch?full_index=1
#9 27.32 [ocaml-variants.4.12.1+options/ocaml-variants.install] downloaded from https://raw.githubusercontent.com/ocaml/opam-source-archives/main/patches/ocaml-variants/ocaml-variants.install
#9 27.32 Switch initialisation failed: clean up? ('n' will leave the switch partially installed) [Y/n] y
#9 27.33 [ERROR] The sources of the following couldn't be obtained, aborting:
#9 27.33           - ocaml-config.2: Curl failed
#9 27.33 
#9 ERROR: executor failed running [/bin/sh -c opam switch create 4.12 --packages=ocaml-variants.4.12.1+options,ocaml-options-only-fp]: exit code: 40
------
 > [2/5] RUN opam switch create 4.12 --packages=ocaml-variants.4.12.1+options,ocaml-options-only-fp:
------
executor failed running [/bin/sh -c opam switch create 4.12 --packages=ocaml-variants.4.12.1+options,ocaml-options-only-fp]: exit code: 40
docker-build failed with exit-code 1
2024-07-15 15:24.03: Job failed: Failed: Build failed
2024-07-15 15:24.03: Log analysis:
2024-07-15 15:24.03: >>> The sources of the following couldn't be obtained, aborting:
#9 27.33           - ocaml-config.2: Curl failed (score = 50)
2024-07-15 15:24.03: Source download failed for ocaml-config.2: Curl failed

@shonfeder
Contributor

shonfeder commented Jul 15, 2024

Notes from a discussion with @mtelvers today:

  • We probably don't need to examine the error message of docker pushes, as any failure is likely a networking or service error. Though it would be good to consider possible counter-examples.
    • One possibility: authentication errors due to bad perms on our side.
  • We can look at https://github.com/ocurrent/ocaml-docs-ci/blob/611be4997108b53a1af4db348ae44e717f9f95bf/src/lib/retry.ml#L4, and maybe adopt that code. However, it works only at the level of Lwt and would only help with currents we implement ourselves, whereas the failures we are seeing here happen in currents provided by OCurrent.
  • It seems possible to hack in retries at the current level, e.g., as in A non-ideal (and probably erroneous) approach to retrying on failed pushes #292
  • However, we are currently thinking that the right fix would either be to
    • extend the ocurrent functions for docker operations with optional retry logic, or
    • add a retry mechanism to one of the current term combinators.

So our next step here is to open an issue upstream to discuss and evaluate those two options.
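
To make the retry behaviour under discussion concrete, here is a rough sketch in Python, for illustration only. It is not OCurrent's API, nor the `retry.ml` code linked above; `retry`, `is_retryable`, `attempts`, `base_delay`, and `sleep` are all hypothetical names. It retries any failure by default (as suggested for pushes) with exponential backoff, and lets the caller opt out for errors, such as bad permissions, that a retry cannot fix.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    op: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    is_retryable: Callable[[Exception], bool] = lambda exn: True,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `op`, retrying on failure with exponential backoff.

    Gives up after `attempts` tries, or immediately if `is_retryable`
    rejects the exception; in both cases the last error is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exn:
            last_try = attempt == attempts - 1
            if last_try or not is_retryable(exn):
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Where this logic would live, inside the docker operations or in a term combinator, is exactly the open question; the sketch only pins down the behaviour we would want from either.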

shonfeder added a commit to shonfeder/ocurrent that referenced this issue Aug 21, 2024
shonfeder added a commit to shonfeder/ocurrent that referenced this issue Aug 22, 2024
@shonfeder
Contributor

The most frequent case of this that we have been coping with has been solved: going by this week's builds, all of them (afaik) completed without any need for restarts or intervention, save for the known issues on OCaml <4.08 for some distros.

I'm going to let this fall back into the backlog, then, until we are troubled by new problems.

@shonfeder shonfeder removed their assignment Sep 5, 2024
@shonfeder
Contributor

shonfeder commented Sep 26, 2024

Authentication errors due to networking issues or transient server-side problems are another class of failure that would benefit from retries (see https://github.com/tarides/infrastructure/issues/397).

Sep 25 00:27:15 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:15.646462710Z" level=info msg="Attempting next endpoint for pull after error: errors:\nunauthorized: authentication required\nunauthorized: authent>
Sep 25 00:27:15 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:15.646505824Z" level=info msg="Ignoring extra error returned from registry: unauthorized: authentication required"
Sep 25 00:27:15 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:15.648370147Z" level=error msg="Handler for POST /v1.41/images/create returned error: unauthorized: authentication required"
Sep 25 00:27:17 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:17.787262578Z" level=info msg="Attempting next endpoint for pull after error: errors:\nunauthorized: authentication required\nunauthorized: authent>
Sep 25 00:27:17 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:17.787333895Z" level=info msg="Ignoring extra error returned from registry: unauthorized: authentication required"
Sep 25 00:27:17 x86-bm-c8.sw.ocaml.org dockerd[1862]: time="2024-09-25T00:27:17.790197382Z" level=error msg="Handler for POST /v1.41/images/create returned error: unauthorized: authentication required"

@shonfeder
Contributor

Failed to docker-login as "ocurrentbuilder"
2024-10-17 12:37.25: Job failed: Failed: Build failed

@shonfeder
Contributor

shonfeder commented Oct 23, 2024

Looks like a network error led to a failing git clone:

#20 [stage-0  7/11] RUN git clone https://github.com/ocaml/opam /tmp/opam && cd /tmp/opam && cp -P -R -p . ../opam-sources && git checkout f0ba0c2fb1145fc596c9cf6d997db7d91e36c432 && env MAKE='make -j' shell/bootstrap-ocaml.sh && make -C src_ext cache-archives
#20 sha256:143a48a50230ff4af4200833a0e8faa70323ef685dc6aa7d4e98ae2fb3bfc91d
#20 60.60 error: RPC failed; curl 92 HTTP/2 stream 5 was not closed cleanly: CANCEL (err 8)
#20 60.60 error: 291 bytes of body are still expected
#20 60.60 fetch-pack: unexpected disconnect while reading sideband packet
#20 60.60 fatal: early EOF
#20 60.61 fatal: fetch-pack: invalid index-pack output
#20 ERROR: executor failed running [/bin/sh -c git clone https://github.com/ocaml/opam /tmp/opam && cd /tmp/opam && cp -P -R -p . ../opam-sources && git checkout f0ba0c2fb1145fc596c9cf6d997db7d91e36c432 && env MAKE='make -j' shell/bootstrap-ocaml.sh && make -C src_ext cache-archives]: exit code: 128
------
 > [stage-0  7/11] RUN git clone https://github.com/ocaml/opam /tmp/opam && cd /tmp/opam && cp -P -R -p . ../opam-sources && git checkout f0ba0c2fb1145fc596c9cf6d997db7d91e36c432 && env MAKE='make -j' shell/bootstrap-ocaml.sh && make -C src_ext cache-archives:
------
executor failed running [/bin/sh -c git clone https://github.com/ocaml/opam /tmp/opam && cd /tmp/opam && cp -P -R -p . ../opam-sources && git checkout f0ba0c2fb1145fc596c9cf6d997db7d91e36c432 && env MAKE='make -j' shell/bootstrap-ocaml.sh && make -C src_ext cache-archives]: exit code: 128
docker-build failed with exit-code 1
2024-10-23 18:04.43: Job failed: Failed: Build failed

https://images.ci.ocaml.org/job/2024-10-23/174748-ocluster-build-bd1fff
