Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected "Thread killed by timeout manager" in 5.3.6 #153

Closed
edsko opened this issue Nov 19, 2024 · 12 comments
Closed

Unexpected "Thread killed by timeout manager" in 5.3.6 #153

edsko opened this issue Nov 19, 2024 · 12 comments

Comments

@edsko
Copy link
Collaborator

edsko commented Nov 19, 2024

I'm seeing threads die with "Thread killed by timeout manager" in 5.3.6, but not in 5.3.5 or 5.3.4. Apologies for not noticing this sooner (when we discussed #151); I had missed it because our main test suite did not catch this error, only our stress tests do. This feels like a regression of #136 / #137, but I do not know for sure. I will dig and let you know.

@edsko
Copy link
Collaborator Author

edsko commented Nov 19, 2024

It's possible that this is related to #150.

@kazu-yamamoto
Copy link
Owner

Ah, I'm sure that this is relating to ef79679.
We should not re-throw the asynchronous exception thrown by the WAI time manager.
I will check this today.

@kazu-yamamoto
Copy link
Owner

FYI: yesodweb/wai#1014

@edsko
Copy link
Collaborator Author

edsko commented Nov 20, 2024

I tried reverting ef79679 but I'm still seeing this happening.

@kazu-yamamoto
Copy link
Owner

This is very strange.

  • TimeoutThread cab be thrown if registerKillThread is used.
  • registerKillThread is used only in withTimeout.
  • withTimeout is used inside of forkManaged which ignores all exceptions.

@edsko Are you using WAI timeout manger in your client side by yourself?

@kazu-yamamoto
Copy link
Owner

I pushed c8453f6 which explicitly catches TimeoutThread in withTimeout according to rule 2 in yesodweb/wai#1014.

@edsko
Copy link
Collaborator Author

edsko commented Nov 21, 2024

We initialize the WAI time manager, because the http2 config wants it, but don't otherwise use it. I tried with c8453f6 but no difference. I think the issue is more subtle. I will continue to dig and let you know.

@edsko
Copy link
Collaborator Author

edsko commented Nov 23, 2024

I think the fundamental problem here is that threads are registered with two managers: the WAI manager and the H2 manager, which can sometimes lead to race conditions. I think however that we might be able to just accept this and resolve this our end; if and when I can confirm this, I will close this issue. I can't confirm it just now because I have discovered a memory leak in http-5.3.6 also (#154).

Details of the race condition:

The overall structure of the server is something like this:

System.TimeManager.withManager .. $ \tmgr ->   
  Network.Run.TCP.runTCPServerWithSocket listenSock $ \clientSock ->
    -- runs in a new thread!
    ..
    let config = .. tmgr .. clientSock ..
    HTTP2.run serverConfig config server

For our purposes we here we can abstract HTTP2.run as

mgr <- Network.HTTP2.H2.Manager.start tmgr -- through 'setup', 'newContext'
..
stopAfter mgr runBackgroundThreads $ ..

Server threads get registered two both of these managers:

runServer :: HasCallStack => Config -> Server -> Launch
runServer conf server ctx@Context{..} strm req = 
    forkManaged threadManager label $
        withTimeout threadManager $ \th -> ..

If the server now shuts down, we leave the scope of the HTTP2.run as well as the scope of withManager. At this point two both managers will enter their cleanup phase; the WAI manager will throw TimeoutThread exceptions, and the H2 manager will throw KilledByHttp2ThreadManager exceptions. Notice however that these run in different threads: the H2 manager runs in the main server thread, but the H2 manager runs in a per-connection thread (spawned by Network.Run.TCP.runTCPServerWithSocket. This results in a race condition.

We previously had some logic (#136, #137) to avoid this race condition, but I guess this logic has now been removed.

Like I said , however, we might just be able to work around this our side, I'm not completely sure yet.

@edsko
Copy link
Collaborator Author

edsko commented Nov 23, 2024

Ok, have confirmed that we can deal with this our side. I will close this issue.

@edsko edsko closed this as completed Nov 23, 2024
@kazu-yamamoto
Copy link
Owner

withTimeout is now using T.withHandleKillThread which ignores KillThread.
So, if @edsko's analysis is correct, this issue should not happen v5.3.7.

@kazu-yamamoto
Copy link
Owner

We previously had some logic (#136, #137) to avoid this race condition, but I guess this logic has now been removed.

Even with the current code, cancel is called before throwing KilledByHttp2ThreadManager.
So, I don't understand why this race happens yet.

@edsko
Copy link
Collaborator Author

edsko commented Dec 4, 2024

I think the race condition is gone as of the refactoring that uses time-manager-0.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants