
Hard time limit (300s) exceeded for task.steve_jobs.process_message #2512

Closed
majamassarini opened this issue Aug 30, 2024 · 5 comments · Fixed by #2555 or packit/ogr#862
Labels
area/general Related to whole service, not a specific part/integration. blocked We are blocked! complexity/single-task Regular task, should be done within days. kind/internal Doesn't affect users directly, may be e.g. infrastructure, DB related.

Comments

majamassarini (Member) commented Aug 30, 2024

I realized that, starting from the 2nd of June 2024, we have had this exception raised quite often within a single day (5-10 occurrences).

It makes no sense for the process_message function, unless it is somehow related to the communication with the pushgateway.

If we can't find the reason for the slowness, we should at least increase the hard_time_limit for this task again.
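
For reference, raising the limit would look roughly like this in Celery; this is only a sketch, and the app object and decorator placement are assumptions, not the actual packit-service code:

```python
# Hedged sketch: raising the Celery hard time limit for one task.
# The app name and the task body are placeholders.
from celery import Celery

app = Celery("packit_worker")

# Per-task override: hard limit and soft limit, in seconds.
@app.task(name="task.steve_jobs.process_message", time_limit=900, soft_time_limit=840)
def process_message(event):
    ...

# Or a global default for every task:
app.conf.task_time_limit = 900
```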

See also #2519 for follow-ups.

@majamassarini majamassarini added complexity/single-task Regular task, should be done within days. area/general Related to whole service, not a specific part/integration. kind/internal Doesn't affect users directly, may be e.g. infrastructure, DB related. labels Aug 30, 2024
@nforro nforro moved this from new to priority-backlog in Packit Kanban Board Sep 5, 2024
@nforro nforro moved this from priority-backlog to refined in Packit Kanban Board Sep 5, 2024
@majamassarini majamassarini moved this from refined to in-progress in Packit Kanban Board Sep 25, 2024
@majamassarini majamassarini self-assigned this Sep 25, 2024
majamassarini (Member, Author) commented:

Since I am not able to comment here (the comment is too long), I am linking a public gist with a summary of my findings.

mfocko (Member) commented Oct 1, 2024

Can't comment on gist 👀 thx GitHub :D anyways

Hard time limit (900s)

"2024-09-26T10:51:51.546708534+00:00 stderr F [2024-09-26 10:51:51,546: INFO/MainProcess] task.run_copr_build_end_handler[d47aa906-d996-4745-9ac1-378a6b0f51f1] Cleaning up the mess.","2024-09-26T10:51:51.546+0000"
"2024-09-26T10:36:52.820075409+00:00 stderr F [2024-09-26 10:36:52,820: DEBUG/MainProcess] task.run_copr_build_end_handler[d47aa906-d996-4745-9ac1-378a6b0f51f1] Setting Github status check 'success' for check 'rpm-build:fedora-40-aarch64:fedora-rawhide': RPMs were built successfully.","2024-09-26T10:36:52.820+0000"

I wouldn't be surprised if it got stuck on the GitHub status check, but it's still weird that we see the clean up and then timeout exception. OTOH the time difference between these two messages is ~15 minutes 👀

def clean(self):
"""clean up the mess once we're done"""
logger.info("Cleaning up the mess.")

which leads into

if not getenv("KUBERNETES_SERVICE_HOST"):
logger.debug("This is not a kubernetes pod, won't clean.")
return
logger.debug("Removing contents of the PV.")

Hard time limit (300s)

Redis connection problems have already been resolved after switching from Redict to Valkey

Just to clarify, I've adjusted the timeout; the issue itself is not related to either Redict or Valkey, the short-running pods leak memory from the concurrent threads :/

Ideally this should be only a temporary solution, because the workers (whether due to Celery, gevent, or Celery × gevent) have the issue anyway; the only difference is that Valkey currently cleans up “dead” (idle for a long time) connections, so we don't run out.
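
If the dead-connection cleanup is what actually helps here, a possible client-side counterpart is to enable socket timeouts/keepalive on the broker (and result backend) connections, so stale sockets get detected without relying on the server's idle timeout. A hedged sketch only; the broker URL and concrete values are assumptions, not our deployment config:

```python
# Hedged sketch: client-side handling of stale Redis/Valkey connections.
from celery import Celery

app = Celery("packit_worker", broker="redis://valkey:6379/0")

app.conf.broker_transport_options = {
    "socket_timeout": 30,      # fail fast instead of hanging on a dead socket
    "socket_keepalive": True,  # let TCP keepalive notice half-open connections
    "max_connections": 100,    # cap the pool so leaked connections surface early
}
# Same idea for the Redis/Valkey result backend, if one is configured:
app.conf.redis_socket_timeout = 30
app.conf.redis_socket_keepalive = True
```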


By this point, from what I see in your investigation, there are 3 occurrences where I see a GitHub-related action before a gap and then a timeout right away. There is also one more that is related to TF, which could point to a network issue.

All in all, a timeout on requests sounds good, but we should be able to retry (not sure if we could, by any chance, spam the external services like that somehow) :/

SLO1 related problems

From the last log 👀 how many times do we need to parse the comment? it is repeated a lot IMO

majamassarini (Member, Author) commented Oct 2, 2024

I wouldn't be surprised if it got stuck on the GitHub status check, but it's still weird that we see the clean up and then timeout exception. OTOH the time difference between these two messages is ~15 minutes 👀

Good point, I didn't notice the time difference between the status check action and the cleaning. I think an exception is thrown there and we are not logging it. I will add some code to log it.
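
Something along these lines, for illustration only; the helper name is hypothetical and this is not the actual packit-service code:

```python
# Hedged sketch: log (and re-raise) whatever blows up during cleanup, so the
# gap before the hard time limit is explained in the worker logs.
# `_remove_pv_contents` is a hypothetical stand-in for the real cleanup steps.
import logging

logger = logging.getLogger(__name__)

class Handler:
    def clean(self):
        """Clean up the mess once we're done, logging anything that goes wrong."""
        logger.info("Cleaning up the mess.")
        try:
            self._remove_pv_contents()
        except Exception:
            logger.exception("Cleanup failed.")  # full traceback in the logs
            raise

    def _remove_pv_contents(self):
        logger.debug("Removing contents of the PV.")
```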

Hard time limit (300s)

Redis connection problems have already been resolved after switching from Redict to Valkey

Just to clarify, I've adjusted the timeout; the issue itself is not related to either Redict or Valkey, the short-running pods leak memory from the concurrent threads :/

Ideally this should be only a temporary solution, because the workers (whether due to Celery, gevent, or Celery × gevent) have the issue anyway; the only difference is that Valkey currently cleans up “dead” (idle for a long time) connections, so we don't run out.

I just noticed that after the switch to Valkey the "hard time limit (300s)..." exceptions are on average 10 times less frequent. So if we cannot find the root cause, I will keep it ;)

By this point, from what I see in your investigation, there are 3 occurrences where I see a GitHub-related action before a gap and then a timeout right away. There is also one more that is related to TF, which could point to a network issue.

All in all, a timeout on requests sounds good, but we should be able to retry (not sure if we could, by any chance, spam the external services like that somehow) :/

max_retries is already set to 5 for the HTTPAdapter class, so we should be good to go.
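
For reference, retries and an explicit timeout combine roughly like this with requests; this is a sketch, and the retry policy and URL are illustrative rather than the exact packit/ogr setup:

```python
# Hedged sketch: HTTPAdapter retries plus a per-request timeout, so a stuck
# GitHub call fails (and is retried) instead of eating the task's hard time limit.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=5,                               # matches the max_retries mentioned above
    backoff_factor=1,                      # 1s, 2s, 4s, ... between attempts
    status_forcelist=(500, 502, 503, 504),
)
session.mount("https://", HTTPAdapter(max_retries=retries))

# The timeout is what actually prevents an indefinite hang: (connect, read) seconds.
response = session.get("https://api.github.com", timeout=(3.05, 30))
```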

SLO1 related problems

From the last log 👀 how many times do we need to parse the comment? it is repeated a lot IMO

I agree, there are two different kinds of problems:

  • We really have too much work to do to match the labels, and it takes us a lot of time.
  • We don't give feedback in the meantime, so we are definitely too slow in giving feedback to the user.

@github-project-automation github-project-automation bot moved this from in-progress to done in Packit Kanban Board Oct 3, 2024
softwarefactory-project-zuul bot added a commit that referenced this issue Oct 3, 2024
Fixes related with packit service slowliness

Should partially fix #2512

Reviewed-by: Matej Focko
Reviewed-by: Laura Barcziová
@majamassarini majamassarini reopened this Oct 3, 2024
majamassarini (Member, Author) commented:

We agreed on keeping this open for a while and seeing later what is fixed and what is not, opening new follow-up issues if needed.

majamassarini (Member, Author) commented:

Closing this in favour of #2603

@github-project-automation github-project-automation bot moved this from in-review to done in Packit Kanban Board Oct 25, 2024