chore: bump seed timeout #2911

Merged 7 commits from bc/increase-seed-timeout into main on Sep 16, 2024

Conversation

BlairCurrey (Contributor)

Changes proposed in this pull request

  • bumps seed timeout from about 4 seconds to about 30 seconds

Context

Noticed on a slower-running machine that the seed was timing out, and those of us on M1 Macs have noticed it occasionally as well.

Checklist

  • Related issues linked using fixes #number
  • Tests added/updated
  • Documentation added
  • Make sure that all checks pass
  • Bruno collection updated


@@ -325,7 +325,7 @@ const callWithRetry: CallableFunction = async (
  try {
    return await fn()
  } catch (e) {
    if (depth > 7) {
Contributor

I think we can remove this one and the backend one; this is the configuration for the migration files.
We are just worried about the ASE in this change, if I understand correctly.

Contributor Author

yes, removed

@mkurapov (Contributor)

@BlairCurrey

What do you think, instead of increasing the retry timeout, we add a healthcheck to the backend service (like you do in the integration tests), and then add a depends_on with service_healthy to the mock ASEs? Then we know for sure the seeding will wait until the service is ready.

@BlairCurrey (Contributor Author)

> @BlairCurrey
>
> What do you think, instead of increasing the retry timeout, we add a healthcheck to the backend service (like you do in the integration tests), and then add a depends_on with service_healthy to the mock ASEs? Then we know for sure the seeding will wait until the service is ready.

I'll try that - sounds better.
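
A minimal sketch of the suggested wiring, combining the healthcheck added later in this thread with a service_healthy condition on the mock ASE (service names and the port follow the localenv compose files discussed here; unrelated settings are omitted):

    services:
      cloud-nine-backend:
        # ...existing build/environment settings...
        healthcheck:
          test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]
          start_period: 90s
          start_interval: 5s
          interval: 5m
          retries: 10
          timeout: 3s

      cloud-nine-mock-ase:
        # ...existing settings...
        depends_on:
          cloud-nine-backend:
            condition: service_healthy   # do not start (and seed) until the backend reports healthy

With service_healthy, Compose only starts the mock ASE once the backend's healthcheck passes, so the seed no longer races the backend startup.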

Comment on lines 82 to 88
healthcheck:
  test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]
  start_period: 60s
  start_interval: 5s
  interval: 5m
  retries: 10
  timeout: 3s
@BlairCurrey (Contributor Author) commented Sep 10, 2024

Some things to note here:

First, the healthcheck does not stop after it succeeds. I noticed this after seeing lots of logs for this gql operation after startup and successful seeding (a Stack Overflow post observes the same thing).

This is different from what I expected and is why I specified the start period/interval. It checks more frequently up front, then backs off to less frequent checks (since we really only care about the startup here).

Second, we use wget because there is no curl available, and the endpoint is what Apollo Server suggests for a healthcheck. I initially added a new /healthz path to the Koa server that we mount the Apollo server on, but figured checking the Apollo server itself was a better, more direct check.

Contributor

Is it a log like this, to check that it's working?

{"level":30,"time":1726148987820,"pid":42,"hostname":"cloud-nine-wallet-backend","requestId":"53de7860-30f7-46bb-8548-aca479dfb9de","operation":null}

@mkurapov (Contributor) commented Sep 12, 2024

So it keeps making the requests even if it succeeded the first time during the start_period?

Contributor

Should we decrease the interval to something like 30s? 5m seems potentially long.

Contributor Author

Yeah, that's the log.

> So it keeps making the requests even if it succeeded the first time during the start_period?

And yes, it keeps making the requests.

30s would be fine, I just figured we didn't really care about it after the initial one gating the MASEs (hence I made it much longer).

Contributor Author

What does that accomplish? After it succeeds it will still send the healthcheck each interval. Lowering the retries just lowers the threshold for signaling that it failed to start.

Contributor

Is it possible that we don't have any successful starts during the start_period, and then we have to wait 5 mins for the next check?

Contributor Author

Normally no, because we're doing 10 retries and checking every 5s (50s plus the time to make the HTTP request for each), but looking at it again, yes, maybe.

The start interval is 5s and the retries are 10, so within the 60s start period it should fail or succeed as long as each healthcheck takes no more than 2s. That should normally be the case (if the backend is not up it's an ECONNREFUSED, which fails fast rather than hanging). Currently the timeout is 3s though, so if it hung for 3s each time I think that could happen.

Contributor Author

I bumped the start period to 90s, so even if all 10 healthchecks take 8 seconds (interval + timeout), it will fail within the start period.

Contributor

Looks like the number of retries is not considered during the startup time (doc):

> start_period provides initialization time for containers that need time to bootstrap. Probe failure during that period will not be counted towards the maximum number of retries.

This means there will be 10 retries after the startup time of 90s, so we just need to determine how long we should keep trying the check after the startup time.
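
For reference, a sketch of how the timing fields interact, per the Docker healthcheck behavior quoted above (values taken from this PR; the inline notes just summarize this thread):

    healthcheck:
      test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]
      start_period: 90s    # bootstrap grace window; failed probes here do not count toward retries
      start_interval: 5s   # probe spacing while still inside start_period
      interval: 5m         # probe spacing once start_period is over (the check keeps running even after it succeeds)
      timeout: 3s          # a probe that takes longer than this counts as a failure
      retries: 10          # consecutive failures after start_period before the container is marked unhealthy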

Comment on lines +30 to +31
cloud-nine-mock-ase:
  condition: service_started
Contributor

I wonder why we had the cloud-nine-mock-ase dependency originally. Do we need it?

@raducristianpopa (Member) commented Sep 13, 2024

This was introduced in #1812. All happy-life services depend on the cloud-nine ones to avoid building the same images twice. If we remove the depends_on, on a fresh localenv startup all the images will build twice since they are not found locally.

Contributor

Got it, thanks @raducristianpopa!

@@ -69,6 +71,13 @@ services:
      KEY_ID: 53f2d913-e98a-40b9-b270-372d0547f23d
    depends_on:
      - cloud-nine-backend
    healthcheck:
      test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]
Contributor

To confirm, this is just making an empty GQL request? And what does this header do?

Contributor Author

> To confirm, this is just making an empty GQL request? And what does this header do?

Yes, an empty GQL request. The header is to ensure CSRF protection doesn't block it. I took this request from the Apollo Server docs, which mention the header: https://www.apollographql.com/docs/apollo-server/monitoring/health-checks/
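
For context, an annotated version of that healthcheck command (the notes just spell out what's described above):

    healthcheck:
      # GET the Apollo Server endpoint with the URL-encoded query %7B__typename%7D,
      # which decodes to {__typename}, a minimal GraphQL query.
      # The apollo-require-preflight header keeps Apollo Server's CSRF prevention
      # from rejecting this simple GET request; wget is used because curl is not
      # available in the container.
      test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]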

@BlairCurrey merged commit 86e0a96 into main on Sep 16, 2024
30 of 42 checks passed
@BlairCurrey deleted the bc/increase-seed-timeout branch on September 16, 2024 at 16:30