chore: bump seed timeout #2911

Merged 7 commits from bc/increase-seed-timeout into main on Sep 16, 2024

Conversation

BlairCurrey (Contributor)

Changes proposed in this pull request

  • bumps seed timeout from about 4 seconds to about 30 seconds

Context

Noticed on a slower-running machine that the seed was timing out, and those of us on M1 Macs have noticed it occasionally as well.

Checklist

  • Related issues linked using fixes #number
  • Tests added/updated
  • Documentation added
  • Make sure that all checks pass
  • Bruno collection updated


@@ -325,7 +325,7 @@ const callWithRetry: CallableFunction = async (
  try {
    return await fn()
  } catch (e) {
    if (depth > 7) {
Contributor

I think we can remove this one and the backend one; this is the configuration for the migration files.
We are just worried about the ASE in this change, if I understand correctly.

Contributor Author

yes, removed

@mkurapov (Contributor)

@BlairCurrey

What do you think, instead of increasing the retry timeout, we add a healthcheck to the backend service (like you do in the integration tests), and then add a depends_on with service_healthy to the mock ASEs? Then we know for sure the seeding will wait until the service is ready.

@BlairCurrey (Contributor Author)

> @BlairCurrey
>
> What do you think, instead of increasing the retry timeout, we add a healthcheck to the backend service (like you do in the integration tests), and then add a depends_on with service_healthy to the mock ASEs? Then we know for sure the seeding will wait until the service is ready.

I'll try that - sounds better.
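
A minimal sketch of the suggested wiring, combining the healthcheck added later in this thread with a service_healthy condition on the mock ASE (service names and the port follow the localenv compose files discussed here; unrelated settings are omitted):

    services:
      cloud-nine-backend:
        # ...existing build/environment settings...
        healthcheck:
          test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]
          start_period: 90s
          start_interval: 5s
          interval: 5m
          retries: 10
          timeout: 3s

      cloud-nine-mock-ase:
        # ...existing settings...
        depends_on:
          cloud-nine-backend:
            condition: service_healthy   # do not start (and seed) until the backend reports healthy

With service_healthy, Compose only starts the mock ASE once the backend's healthcheck passes, so the seed no longer races the backend startup.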

Comment on lines 82 to 88
healthcheck:
  test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]
  start_period: 60s
  start_interval: 5s
  interval: 5m
  retries: 10
  timeout: 3s
@BlairCurrey (Contributor Author) commented Sep 10, 2024

Some things to note here:

First, the healthcheck does not stop after it succeeds. I noticed this after seeing lots of logs for this gql operation after startup and successful seeding (a Stack Overflow post observes the same thing).

This is different from what I expected and is why I specified the start period/interval. It checks more frequently up front, then backs off to less frequent checks (since we really only care about the startup here).

Second, we use wget because there is no curl available, and the endpoint is what Apollo Server suggests for a healthcheck. I initially added a new /healthz path to the Koa server that we mount the Apollo server on, but figured checking the Apollo server itself was a better, more direct check.

Contributor

Is it a log like this, to check that it's working?

{"level":30,"time":1726148987820,"pid":42,"hostname":"cloud-nine-wallet-backend","requestId":"53de7860-30f7-46bb-8548-aca479dfb9de","operation":null}

@mkurapov (Contributor) commented Sep 12, 2024

So it keeps making the requests even if it succeeded the first time during the start_period?

Contributor

Should we decrease the interval to something like 30s? 5m seems potentially long.

Contributor Author

Yeah, that's the log.

> So it keeps making the requests even if it succeeded the first time during the start_period?

And yes, it keeps making the requests.

30s would be fine, I just figured we didn't really care about it after the initial one gating the MASEs (hence I made it much longer).

Contributor Author

What does that accomplish? After it succeeds it will still send the healthcheck each interval. Lowering the retries just lowers the threshold for signaling that it failed to start.

Contributor

Is it possible that we don't have any successful starts during the start_period, and then we have to wait 5 mins for the next check?

Contributor Author

Normally no, because we're doing 10 retries and checking every 5s (50s plus the time to make the HTTP request for each), but looking at it again, yes, maybe.

The start interval is 5s and the retries are 10, so within the 60s start period it should fail or succeed as long as each healthcheck takes no more than 2s. That should normally be the case (if the backend is not up it's an ECONNREFUSED, which fails fast rather than hanging). Currently the timeout is 3s though, so if it hung for 3s each time I think that could happen.

Contributor Author

I bumped the start period to 90s, so even if all 10 healthchecks take 8 seconds (interval + timeout), it will fail within the start period.

Contributor

Looks like the number of retries is not considered during the startup time (doc):

> start_period provides initialization time for containers that need time to bootstrap. Probe failure during that period will not be counted towards the maximum number of retries.

This means there will be 10 retries after the startup time of 90s, so we just need to determine how long we should keep trying the check after the startup time.
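
For reference, a sketch of how the timing fields interact, per the Docker healthcheck behavior quoted above (values taken from this PR; the inline notes just summarize this thread):

    healthcheck:
      test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]
      start_period: 90s    # bootstrap grace window; failed probes here do not count toward retries
      start_interval: 5s   # probe spacing while still inside start_period
      interval: 5m         # probe spacing once start_period is over (the check keeps running even after it succeeds)
      timeout: 3s          # a probe that takes longer than this counts as a failure
      retries: 10          # consecutive failures after start_period before the container is marked unhealthy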

Comment on lines +30 to +31
cloud-nine-mock-ase:
  condition: service_started
Contributor

I wonder why we had the cloud-nine-mock-ase dependency originally. Do we need it?

@raducristianpopa (Member) commented Sep 13, 2024

This was introduced in #1812. All happy-life services depend on the cloud-nine ones to avoid building the same images twice. If we remove the depends_on, on a fresh localenv startup all the images will build twice since they are not found locally.

Contributor

Got it, thanks @raducristianpopa!

@@ -69,6 +71,13 @@ services:
      KEY_ID: 53f2d913-e98a-40b9-b270-372d0547f23d
    depends_on:
      - cloud-nine-backend
    healthcheck:
      test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]
Contributor

To confirm, this is just making an empty GQL request? And what does this header do?

Contributor Author

> To confirm, this is just making an empty GQL request? And what does this header do?

Yes, an empty GQL request. The header is to ensure CSRF protection doesn't block it. I took this request from the Apollo Server docs, which mention the header: https://www.apollographql.com/docs/apollo-server/monitoring/health-checks/
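
For context, an annotated version of that healthcheck command (the notes just spell out what's described above):

    healthcheck:
      # GET the Apollo Server endpoint with the URL-encoded query %7B__typename%7D,
      # which decodes to {__typename}, a minimal GraphQL query.
      # The apollo-require-preflight header keeps Apollo Server's CSRF prevention
      # from rejecting this simple GET request; wget is used because curl is not
      # available in the container.
      test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]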

@BlairCurrey merged commit 86e0a96 into main on Sep 16, 2024
30 of 42 checks passed
@BlairCurrey deleted the bc/increase-seed-timeout branch on September 16, 2024 at 16:30