Pull timeout #2418
base: main
Conversation
dbg("eish, i don't love this. what if there was some other reason for them getting lost?") | ||
dbg("like, what if they did start, and did the work, and all that, but we never heard back from them because of network issues?") | ||
dbg("i wouldn't want to re-do the work.") | ||
dbg("i wish there was some way to only mark them as claimed once we know that the worker actually got the run.") |
@taylordowns2000 If I may chip in (uninvited):
As you mentioned yesterday, the janitor approach is a band-aid until a better fix is available. If I were applying the band-aid, I would not tie it to the timeout value; I would make it a large multiple of the timeout. E.g. if your timeout is 20 sec, maybe wait 10 min (after all, you wait 2 South African hours to determine if something is lost, right?). And notify Sentry whenever this happens, so that we have an opportunity to take a look at what is going on and tune things. Wrt the concern of potentially redoing work, you could make this user-controllable (i.e. if a Run is 'LostAfterClaimed', would you like us to automagically resubmit it, or would you rather do that manually?).
A longer-term fix (which I can only assume would mean a fair amount of work) would be to allow a run to be claimed by multiple worker processes but only started by one (whoever gets there first). This obviously creates inefficiency elsewhere. Personally, I would do the band-aid if it helps us satisfy the immediate requirements and then identify the underlying issue, e.g. maybe it is as simple as scaling up web pods alongside worker pods?
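As a rough sketch of that band-aid (not the actual Lightning code; `Sketch.Janitor`, `@lost_after_factor`, `report/1` and `resubmit/1` are all hypothetical names), a janitor could only treat a claimed run as lost after a large multiple of the claim timeout, report it so someone can investigate, and leave resubmission up to a user-controlled option:

```elixir
defmodule Sketch.Janitor do
  @claim_timeout_seconds 20
  # wait a large multiple of the claim timeout (30 x 20s = 10 minutes)
  # before declaring a claimed-but-never-heard-from run lost
  @lost_after_factor 30

  def sweep(claimed_runs, opts \\ []) do
    now = DateTime.utc_now()
    cutoff = @claim_timeout_seconds * @lost_after_factor

    for run <- claimed_runs,
        DateTime.diff(now, run.claimed_at) > cutoff do
      report(run)

      if Keyword.get(opts, :auto_resubmit, false) do
        # user opted in to automatic resubmission
        resubmit(run)
      else
        # otherwise just flag it and let a human decide
        %{run | state: :lost}
      end
    end
  end

  # stand-in for a Sentry notification, so someone can take a look and tune things
  defp report(run), do: IO.warn("run #{run.id} appears lost after being claimed")

  # stand-in for re-enqueueing the run for another worker
  defp resubmit(run), do: %{run | state: :available, claimed_at: nil}
end
```

`sweep/2` would run at the janitor's normal cadence; `auto_resubmit` is the user-controllable knob mentioned above.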
@rorymckinley, @stuartc, do you know if this code is still needed? I would love to shut down this PR if it's not helpful (or if it has been implemented elsewhere in response to #2465 or #2416).
@taylordowns2000 I have not implemented anything that addresses this - did changing the timeout values make the original problem go away?
this is a concept. @stuartc, we're trying to:
lost
(@josephjclark is on it.)