Unexpected "Thread killed by timeout manager" in 5.3.6 #153

edsko · 2024-11-19T10:22:56Z

I'm seeing threads die with "Thread killed by timeout manager" in 5.3.6, but not in 5.3.5 or 5.3.4. Apologies for not noticing this sooner (when we discussed #151); I had missed it because our main test suite did not catch this error, only our stress tests do. This feels like a regression of #136 / #137, but I do not know for sure. I will dig and let you know.

edsko · 2024-11-19T11:21:55Z

It's possible that this is related to #150.

kazu-yamamoto · 2024-11-19T21:22:06Z

Ah, I'm sure that this is relating to ef79679.
We should not re-throw the asynchronous exception thrown by the WAI time manager.
I will check this today.

kazu-yamamoto · 2024-11-20T00:03:34Z

FYI: yesodweb/wai#1014

edsko · 2024-11-20T11:51:20Z

I tried reverting ef79679 but I'm still seeing this happening.

kazu-yamamoto · 2024-11-21T01:01:01Z

This is very strange.

TimeoutThread cab be thrown if registerKillThread is used.
registerKillThread is used only in withTimeout.
withTimeout is used inside of forkManaged which ignores all exceptions.

@edsko Are you using WAI timeout manger in your client side by yourself?

kazu-yamamoto · 2024-11-21T01:02:51Z

I pushed c8453f6 which explicitly catches TimeoutThread in withTimeout according to rule 2 in yesodweb/wai#1014.

edsko · 2024-11-21T13:09:49Z

We initialize the WAI time manager, because the http2 config wants it, but don't otherwise use it. I tried with c8453f6 but no difference. I think the issue is more subtle. I will continue to dig and let you know.

edsko · 2024-11-23T06:44:34Z

I think the fundamental problem here is that threads are registered with two managers: the WAI manager and the H2 manager, which can sometimes lead to race conditions. I think however that we might be able to just accept this and resolve this our end; if and when I can confirm this, I will close this issue. I can't confirm it just now because I have discovered a memory leak in http-5.3.6 also (#154).

Details of the race condition:

The overall structure of the server is something like this:

System.TimeManager.withManager .. $ \tmgr ->   
  Network.Run.TCP.runTCPServerWithSocket listenSock $ \clientSock ->
    -- runs in a new thread!
    ..
    let config = .. tmgr .. clientSock ..
    HTTP2.run serverConfig config server

For our purposes we here we can abstract HTTP2.run as

mgr <- Network.HTTP2.H2.Manager.start tmgr -- through 'setup', 'newContext'
..
stopAfter mgr runBackgroundThreads $ ..

Server threads get registered two both of these managers:

runServer :: HasCallStack => Config -> Server -> Launch
runServer conf server ctx@Context{..} strm req = 
    forkManaged threadManager label $
        withTimeout threadManager $ \th -> ..

If the server now shuts down, we leave the scope of the HTTP2.run as well as the scope of withManager. At this point two both managers will enter their cleanup phase; the WAI manager will throw TimeoutThread exceptions, and the H2 manager will throw KilledByHttp2ThreadManager exceptions. Notice however that these run in different threads: the H2 manager runs in the main server thread, but the H2 manager runs in a per-connection thread (spawned by Network.Run.TCP.runTCPServerWithSocket. This results in a race condition.

We previously had some logic (#136, #137) to avoid this race condition, but I guess this logic has now been removed.

Like I said , however, we might just be able to work around this our side, I'm not completely sure yet.

edsko · 2024-11-23T08:15:36Z

Ok, have confirmed that we can deal with this our side. I will close this issue.

kazu-yamamoto · 2024-11-25T02:48:29Z

withTimeout is now using T.withHandleKillThread which ignores KillThread.
So, if @edsko's analysis is correct, this issue should not happen v5.3.7.

kazu-yamamoto · 2024-11-25T06:31:07Z

We previously had some logic (#136, #137) to avoid this race condition, but I guess this logic has now been removed.

Even with the current code, cancel is called before throwing KilledByHttp2ThreadManager.
So, I don't understand why this race happens yet.

edsko · 2024-12-04T15:07:38Z

I think the race condition is gone as of the refactoring that uses time-manager-0.2.

edsko closed this as completed Nov 23, 2024

kazu-yamamoto mentioned this issue Nov 26, 2024

Improve manager #156

Merged

edsko mentioned this issue Nov 28, 2024

allocSimpleConfig should take a timeout parameter, and timeouts should be optional #112

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected "Thread killed by timeout manager" in 5.3.6 #153

Unexpected "Thread killed by timeout manager" in 5.3.6 #153

edsko commented Nov 19, 2024

edsko commented Nov 19, 2024

kazu-yamamoto commented Nov 19, 2024

kazu-yamamoto commented Nov 20, 2024

edsko commented Nov 20, 2024

kazu-yamamoto commented Nov 21, 2024

kazu-yamamoto commented Nov 21, 2024

edsko commented Nov 21, 2024

edsko commented Nov 23, 2024 •

edited

Loading

edsko commented Nov 23, 2024

kazu-yamamoto commented Nov 25, 2024

kazu-yamamoto commented Nov 25, 2024

edsko commented Dec 4, 2024

Unexpected "Thread killed by timeout manager" in 5.3.6 #153

Unexpected "Thread killed by timeout manager" in 5.3.6 #153

Comments

edsko commented Nov 19, 2024

edsko commented Nov 19, 2024

kazu-yamamoto commented Nov 19, 2024

kazu-yamamoto commented Nov 20, 2024

edsko commented Nov 20, 2024

kazu-yamamoto commented Nov 21, 2024

kazu-yamamoto commented Nov 21, 2024

edsko commented Nov 21, 2024

edsko commented Nov 23, 2024 • edited Loading

edsko commented Nov 23, 2024

kazu-yamamoto commented Nov 25, 2024

kazu-yamamoto commented Nov 25, 2024

edsko commented Dec 4, 2024

edsko commented Nov 23, 2024 •

edited

Loading