Hard time limit (300s) exceeded for task.steve_jobs.process_message #2512
Comments
Since I am not able to comment here (the comment is too long), I link a public gist with the summary of my findings.
Can't comment on the gist 👀 thx GitHub :D anyways, `Hard time limit (900s)`
I wouldn't be surprised if it got stuck on the GitHub status check, but it's still weird that we see the clean up and then the timeout exception. OTOH the time difference between these two messages is ~15 minutes 👀

`packit-service/packit_service/worker/handlers/abstract.py`, lines 285 to 287 in b58b3b7

into

`packit-service/packit_service/worker/handlers/abstract.py`, lines 225 to 228 in b58b3b7
> Hard time limit (300s)
Just to clarify, I've adjusted the timeout; the issue itself is not related to either Redict or Valkey, the short-running pods leak memory from the concurrent threads :/ Ideally this should only be a temporary solution, because the workers (whether the cause is Celery, gevent, or Celery × gevent) have the issue anyway; the only difference is that Valkey currently cleans up “dead” (long-idle) connections, so we don't run out (see the sketch below).

By this point, from what I see in your investigation, there are 3 occurrences where I see a GitHub-related action before a gap and then a timeout right away. There is also one more that is related to TF, which could point to a network issue. All in all, SLO1-related problems.

From the last log 👀 how many times do we need to parse the comment? It is repeated a lot IMO
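For reference, a minimal sketch of how the server-side idle-connection timeout can be checked and adjusted, assuming a Valkey/Redict instance reachable at `localhost:6379` (the host/port and the 300 s value are only examples, not our deployment config):

```python
import redis  # redis-py also talks to Valkey and Redict

r = redis.Redis(host="localhost", port=6379)

# `timeout` is the number of seconds after which an idle client
# connection is closed by the server; 0 means "never close".
print(r.config_get("timeout"))
r.config_set("timeout", 300)
```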
Good point, I didn't see the time difference between the status check action and the cleaning. I think an exception is thrown there and we are not logging it. I will add some code to log it.
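A minimal sketch of the kind of logging meant here; `clean()` is only a stand-in for the real clean-up step, not the actual packit-service code:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def clean():
    # stand-in for the real clean-up step that may raise
    raise RuntimeError("clean-up failed")

def clean_up_after_message():
    try:
        clean()
    except Exception:
        # logger.exception records the traceback as well,
        # so the failure is no longer swallowed silently
        logger.exception("Clean-up after processing the message failed")

clean_up_after_message()
```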
I just noticed that after the switch to Valkey the
There are already 5 `max_retries` set for the `HTTPAdapter` class, so we should be good to go.
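For completeness, a minimal sketch of what mounting an `HTTPAdapter` with `max_retries` on a `requests` session looks like (the URL is just an example; with a plain integer only connection errors are retried, not error responses):

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(max_retries=5)  # retry failed connections up to 5 times
session.mount("https://", adapter)
session.mount("http://", adapter)

response = session.get("https://api.github.com")
print(response.status_code)
```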
I agree, there are two different kinds of problems.
Fixes related to packit-service slowness. Should partially fix #2512. Reviewed-by: Matej Focko. Reviewed-by: Laura Barcziová.
We agreed on keeping this open for a while and seeing later what is fixed and what is not, and on opening new follow-up issues if needed.
Closing this in favour of #2603
I realized that starting from the 2nd of June 2024 we have this exception raised quite often in the same day (5-10 occurrences).

It makes no sense for the `process_message` function, unless it is somehow related to the communication with the `pushgateway`.

If we can't find the reason for the slowness we should at least increase the `hard_time_limit` for this task again.

See also #2519 for follow-ups.
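For illustration, a minimal sketch of how a Celery task's time limits can be raised; the app name, broker URL and the 900 s value are only examples, not the actual packit-service configuration:

```python
from celery import Celery

app = Celery("worker", broker="redis://localhost:6379/0")

# time_limit: the worker kills the task after this many seconds (hard limit);
# soft_time_limit: SoftTimeLimitExceeded is raised inside the task a bit
# earlier, giving it a chance to clean up before the hard limit hits.
@app.task(time_limit=900, soft_time_limit=870)
def process_message(event):
    ...
```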