chore: bump seed timeout #2911
Conversation
- from about 4 seconds to about 30 seconds
packages/auth/src/index.ts (outdated)
@@ -325,7 +325,7 @@ const callWithRetry: CallableFunction = async (
  try {
    return await fn()
  } catch (e) {
    if (depth > 7) {
I think we can remove this one and the backend one; this is the configuration for the migration files. We are only concerned with the ASE in this change, if I understand correctly.
yes, removed
What do you think about adding a healthcheck to the backend service (like you do in the integration tests) instead of increasing the retry timeout, and then adding a depends_on with condition: service_healthy to the mock ASEs? Then we know for sure the seeding will wait until the service is ready.
I'll try that - sounds better.
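A rough sketch of what that could look like on the mock ASE side, assuming the backend service is named cloud-nine-backend (names taken from the compose snippets below; exact placement in the localenv files may differ):

cloud-nine-mock-ase:
  depends_on:
    cloud-nine-backend:
      condition: service_healthy   # seeding only starts once the backend healthcheck passes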
healthcheck:
  test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]
  start_period: 60s
  start_interval: 5s
  interval: 5m
  retries: 10
  timeout: 3s
Some things to note here:
First, the healthcheck does not stop after it succeeds. I noticed this after seeing lots of logs for this GQL operation after startup and successful seeding (a Stack Overflow post observes the same thing). This is different from what I expected, and it is why I specified the start period/interval: it checks more frequently up front and then backs off to less frequent checks (since we really only care about startup here).
Second, we use wget because there is no curl in the image, and the endpoint is what Apollo Server suggests for a healthcheck. I initially added a new /healthz path to the Koa server that the Apollo server is mounted on, but figured checking the Apollo server itself was a better, more direct check.
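For comparison, a sketch of what that considered /healthz variant might have looked like; the path is hypothetical and would have required adding a matching route to the Koa server, which is part of why the Apollo endpoint was used instead:

healthcheck:
  # hypothetical: checks a dedicated health route instead of the GraphQL endpoint
  test: ["CMD", "wget", "-q", "-O", "/dev/null", "http://localhost:3001/healthz"]
  start_period: 60s
  start_interval: 5s
  interval: 5m
  retries: 10
  timeout: 3s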
Is it a log like this, to check that it's working?
{"level":30,"time":1726148987820,"pid":42,"hostname":"cloud-nine-wallet-backend","requestId":"53de7860-30f7-46bb-8548-aca479dfb9de","operation":null}
So it keeps making the requests even if it succeeded the first time during the start_period?
Should we decrease the interval to something like 30s? 5m seems potentially long.
Yeah, that's the log.
"So it keeps making the requests even if it succeeded the first time during the start_period?"
And yes, it keeps making the requests.
30s would be fine; I just figured we didn't really care about it after the initial check gating the MASEs (hence I made it much longer).
What does that accomplish? After it succeeds it will still send the healthcheck on each interval. Lowering the retries just lowers the threshold for sending out a signal that it failed to start.
Is it possible that we don't get any successful checks during the start_period, and then we have to wait 5 minutes for the next one?
Normally no, because we're doing 10 retries and checking every 5s (50s plus the time to make the HTTP request each attempt), but looking at it again, yes, maybe. The start interval is 5s and retries are 10, so within the 60s start period it should fail or succeed as long as each healthcheck doesn't take more than about 2s. That should normally be the case: if the server isn't up, it's an ECONNREFUSED, which fails fast rather than hanging. The timeout is currently 3s though, so if each check hung for the full 3s, I think that could happen.
I bumped the start period to 90s, so even if all 10 healthchecks take 8 seconds each (interval + timeout), it will fail within the start period.
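For reference, the adjusted block would look something like this (a sketch; only start_period changes from the snippet above):

healthcheck:
  test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]
  start_period: 90s    # 10 attempts x (5s start_interval + 3s timeout) = 80s, which fits inside 90s
  start_interval: 5s
  interval: 5m
  retries: 10
  timeout: 3s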
Looks like the number of retries is not considered during the start period (docs):
"start period provides initialization time for containers that need time to bootstrap. Probe failure during that period will not be counted towards the maximum number of retries."
This means there will be 10 retries after the 90s start period, so we just need to determine how long the check should keep retrying after startup.
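Worked out with the values above (illustrative arithmetic, not part of the compose file), once the start period is over the container is only flagged unhealthy after retries consecutive failures spaced interval apart, so roughly:

# approximate time to be flagged unhealthy after the 90s start period:
#   interval: 5m  -> 10 retries * 5m  = ~50 minutes
#   interval: 30s -> 10 retries * 30s = ~5 minutes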
cloud-nine-mock-ase:
  condition: service_started
I wonder why we had the cloud-nine-mock-ase dependency originally. Do we need it?
This was introduced in #1812. All happy-life services depend on the cloud-nine ones to avoid building the same images twice. If we remove the depends_on, then on a fresh localenv startup all the images will build twice since they are not found locally.
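A hedged illustration of that arrangement (build details and the mock-ase image tag are simplified, not copied from the actual localenv files):

cloud-nine-mock-ase:
  build:
    context: .               # illustrative: builds the shared mock ASE image once
  image: mock-ase            # illustrative tag that both services resolve to

happy-life-mock-ase:
  image: mock-ase            # reuses the image built above instead of rebuilding it
  depends_on:
    cloud-nine-mock-ase:
      condition: service_started   # matches the condition shown in the diff above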
Got it, thanks @raducristianpopa !
@@ -69,6 +71,13 @@ services:
      KEY_ID: 53f2d913-e98a-40b9-b270-372d0547f23d
    depends_on:
      - cloud-nine-backend
    healthcheck:
      test: ["CMD", "wget", "--header=apollo-require-preflight: true", "http://localhost:3001/graphql?query=%7B__typename%7D", "-O", "/dev/null"]
To confirm, this is just making an empty GQL request? What does this header do?
"To confirm, this is just making an empty GQL request? What does this header do?"
Yes, it's an empty GQL request. The header is there to ensure CSRF protection doesn't block it. I took this request from the Apollo Server health check docs, which mention the header: https://www.apollographql.com/docs/apollo-server/monitoring/health-checks/
Changes proposed in this pull request
Context
Noticed on a slower-running machine that the seeding was timing out, and those of us on M1 Macs have noticed it occasionally as well.
Checklist
fixes #number