Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Post-fork hang after atfork handlers in glibc (Linux, Gevent) #38020

Open
suntropez opened this issue Oct 29, 2024 · 7 comments
Open

Post-fork hang after atfork handlers in glibc (Linux, Gevent) #38020

suntropez opened this issue Oct 29, 2024 · 7 comments

Comments

@suntropez
Copy link

suntropez commented Oct 29, 2024

What version of gRPC and what language are you using?

1.56.0 with gevent around 22.10.2-3

What operating system (Linux, Windows,...) and version?

Linux 6+

What runtime / compiler are you using (e.g. python version or version of gcc)

Python 3.9.*/3.11.* from multiple sources (Debian upstream or non-special builds of vanilla CPython)

What did you do?

Forked to subprocess.

Details

This is the meat of the stack frames, upwards from the at fork handlers.

#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007ff55f780706 in PyObject_TypeCheck (type=<optimized out>, ob=0xfffffffffffffe00) at /usr/local/include/python3.11/object.h:133
#2  __Pyx_PyObject_CallOneArg (arg=0x44b82fa09b5a53, func=0xfffffffffffffe00) at src/python/grpcio/grpc/_cython/cygrpc.cpp:61956
#3  __pyx_pf_4grpc_7_cython_6cygrpc_9AioServer_4add_insecure_port (__pyx_v_self=<optimized out>, __pyx_v_self=<optimized out>, __pyx_v_address=0x44b82fa09b5a53) at src/python/grpcio/grpc/_cython/cygrpc.cpp:29407
#4  __pyx_pw_4grpc_7_cython_6cygrpc_9AioServer_5add_insecure_port (__pyx_v_self=<optimized out>, __pyx_v_address=0x44b82fa09b5a53) at src/python/grpcio/grpc/_cython/cygrpc.cpp:29369
#5  0x00007ff55f780632 in Py_XDECREF (op=0x7ff2d2ab89cc) at /usr/local/include/python3.11/object.h:601
#6  __pyx_pf_4grpc_7_cython_6cygrpc_9AioServer_6add_secure_port (__pyx_v_self=<optimized out>, __pyx_v_self=<optimized out>, __pyx_v_server_credentials=<optimized out>, __pyx_v_address=<optimized out>) at src/python/grpcio/grpc/_cython/cyg
#7  __pyx_pw_4grpc_7_cython_6cygrpc_9AioServer_7add_secure_port (__pyx_v_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>) at src/python/grpcio/grpc/_cython/cygrpc.cpp:29502
#8  0x00007ff55f781aed in __Pyx_IsSubtype (b=0x0, a=0x80) at src/python/grpcio/grpc/_cython/cygrpc.cpp:132026
#9  __Pyx_PyObject_CallNoArg (func=0x7ff2d2ab891c) at src/python/grpcio/grpc/_cython/cygrpc.cpp:62413
#10 __pyx_pf_4grpc_7_cython_6cygrpc_20_SyncServicerContext_20peer (__pyx_v_self=<optimized out>, __pyx_v_self=<optimized out>) at src/python/grpcio/grpc/_cython/cygrpc.cpp:18105
#11 __pyx_pw_4grpc_7_cython_6cygrpc_20_SyncServicerContext_21peer (__pyx_v_self=<optimized out>, unused=<optimized out>) at src/python/grpcio/grpc/_cython/cygrpc.cpp:18067
#12 0x00007ff55f782801 in PyCFunction_GET_SELF (func_obj=0x55731d5a6518) at /usr/local/include/python3.11/cpython/methodobject.h:50
#13 __Pyx_PyObject_CallMethO (arg=0x8, func=0x55731d5a6518) at src/python/grpcio/grpc/_cython/cygrpc.cpp:61924
#14 __Pyx_PyObject_CallOneArg (arg=0x8, func=0x55731d5a6518) at src/python/grpcio/grpc/_cython/cygrpc.cpp:61958
#15 __pyx_pf_4grpc_7_cython_6cygrpc_20_SyncServicerContext_10set_code (__pyx_v_self=<optimized out>, __pyx_v_self=<optimized out>, __pyx_v_code=0x8) at src/python/grpcio/grpc/_cython/cygrpc.cpp:17709
#16 __pyx_pw_4grpc_7_cython_6cygrpc_20_SyncServicerContext_11set_code (__pyx_v_self=<optimized out>, __pyx_v_code=0x8) at src/python/grpcio/grpc/_cython/cygrpc.cpp:17672
#17 0x00007ff55f782f05 in __pyx_pf_4grpc_7_cython_6cygrpc_16_ServicerContext_43time_remaining (__pyx_v_self=<optimized out>, __pyx_v_self=<optimized out>) at src/python/grpcio/grpc/_cython/cygrpc.cpp:81969
#18 __pyx_pw_4grpc_7_cython_6cygrpc_16_ServicerContext_44time_remaining (__pyx_v_self=<optimized out>, unused=<optimized out>) at src/python/grpcio/grpc/_cython/cygrpc.cpp:16368
#19 0x00007ff55f2d17c2 in absl::lts_20230125::base_internal::SpinLock::SlowLock (this=0x55731bb93be0) at third_party/abseil-cpp/absl/base/internal/spinlock.cc:126
#20 0x00007ff55f554995 in grpc_call_stack_init (channel_stack=0x55731d5a6510, initial_refs=<optimized out>, destroy=<optimized out>, destroy_arg=<optimized out>, elem_args=0x55731bb93b70) at src/core/lib/channel/channel_stack.cc:202
#21 0x00007ff55f548ae5 in std::__shared_ptr<grpc_core::AVL<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, grpc_core::ChannelArgs::Value>::Node, (__gnu_cxx::_Lock_policy)2>::get (this=0x55731a23fbb8 <opcode
    at /usr/include/c++/10/bits/shared_ptr_base.h:1324
#22 grpc_core::AVL<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, grpc_core::ChannelArgs::Value>::ForEachImpl<grpc_core::ChannelArgs::UnionWith(grpc_core::ChannelArgs) const::<lambda(const string&, const g
    n=<optimized out>) at ./src/core/lib/avl/avl.h:167
#23 grpc_core::AVL<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, grpc_core::ChannelArgs::Value>::ForEach<grpc_core::ChannelArgs::UnionWith(grpc_core::ChannelArgs) const::<lambda(const string&, const grpc_
    this=<optimized out>) at ./src/core/lib/avl/avl.h:57
#24 grpc_core::ChannelArgs::UnionWith (this=<optimized out>, other=...) at src/core/lib/channel/channel_args.cc:295
#25 0x00007ff55f549efc in absl::lts_20230125::variant_internal::VariantCopyBaseNontrivial<int, std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, grpc_core::ChannelArgs::Pointer>::Constr
    this=<optimized out>) at /usr/include/c++/10/new:175
#26 absl::lts_20230125::base_internal::Callable::Invoke<absl::lts_20230125::variant_internal::VariantCopyBaseNontrivial<int, std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, grpc_core:
egral_constant<unsigned long, 0ul> > (f=...) at third_party/abseil-cpp/absl/base/internal/invoke.h:185
#27 absl::lts_20230125::base_internal::invoke<absl::lts_20230125::variant_internal::VariantCopyBaseNontrivial<int, std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, grpc_core::ChannelAr
tant<unsigned long, 0ul> > (f=...) at third_party/abseil-cpp/absl/base/internal/invoke.h:212
#28 absl::lts_20230125::variant_internal::ReachableSwitchCase<absl::lts_20230125::variant_internal::VariantCopyBaseNontrivial<int, std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, grpc
l>::Run (op=...) at third_party/abseil-cpp/absl/types/internal/variant.h:295
#29 absl::lts_20230125::variant_internal::VisitIndicesSwitch<3ul>::Run<absl::lts_20230125::variant_internal::VariantCopyBaseNontrivial<int, std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > con
truct> (i=7, op=...) at third_party/abseil-cpp/absl/types/internal/variant.h:360
#30 absl::lts_20230125::variant_internal::VariantCopyBaseNontrivial<int, std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, grpc_core::ChannelArgs::Pointer>::VariantCopyBaseNontrivial (o
    at third_party/abseil-cpp/absl/types/internal/variant.h:1380
#31 absl::lts_20230125::variant_internal::VariantCopyAssignBaseNontrivial<int, std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, grpc_core::ChannelArgs::Pointer>::VariantCopyAssignBaseN
    at third_party/abseil-cpp/absl/types/internal/variant.h:1436
#32 absl::lts_20230125::variant<int, std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, grpc_core::ChannelArgs::Pointer>::variant (other=..., this=0x7ffefa4886e0) at third_party/abseil-c
#33 grpc_core::ChannelArgs::Value::Value (this=0x7ffefa4886e0) at ./src/core/lib/channel/channel_args.h:284
#34 operator() (value=..., key=..., __closure=0x1) at src/core/lib/channel/channel_args.cc:313
#35 grpc_core::AVL<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, grpc_core::ChannelArgs::Value>::ForEachImpl<grpc_core::ChannelArgs::FuzzingReferenceUnionWith(grpc_core::ChannelArgs) const::<lambda(const 
e&)> > (f=..., n=<optimized out>) at ./src/core/lib/avl/avl.h:166
#36 grpc_core::AVL<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, grpc_core::ChannelArgs::Value>::ForEachImpl<grpc_core::ChannelArgs::FuzzingReferenceUnionWith(grpc_core::ChannelArgs) const::<lambda(const 
e&)> > (f=..., n=<optimized out>) at ./src/core/lib/avl/avl.h:165
#37 grpc_core::AVL<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, grpc_core::ChannelArgs::Value>::ForEachImpl<grpc_core::ChannelArgs::FuzzingReferenceUnionWith(grpc_core::ChannelArgs) const::<lambda(const 
e&)> > (f=..., n=<optimized out>) at ./src/core/lib/avl/avl.h:165
#38 grpc_core::AVL<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, grpc_core::ChannelArgs::Value>::ForEachImpl<grpc_core::ChannelArgs::FuzzingReferenceUnionWith(grpc_core::ChannelArgs) const::<lambda(const 
e&)> > (f=..., n=<optimized out>) at ./src/core/lib/avl/avl.h:165
#39 grpc_core::AVL<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, grpc_core::ChannelArgs::Value>::ForEachImpl<grpc_core::ChannelArgs::FuzzingReferenceUnionWith(grpc_core::ChannelArgs) const::<lambda(const 
e&)> > (f=..., n=<optimized out>) at ./src/core/lib/avl/avl.h:165
#40 grpc_core::AVL<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, grpc_core::ChannelArgs::Value>::ForEachImpl<grpc_core::ChannelArgs::FuzzingReferenceUnionWith(grpc_core::ChannelArgs) const::<lambda(const 
e&)> > (f=..., n=<optimized out>) at ./src/core/lib/avl/avl.h:165
#41 grpc_core::AVL<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, grpc_core::ChannelArgs::Value>::ForEachImpl<grpc_core::ChannelArgs::FuzzingReferenceUnionWith(grpc_core::ChannelArgs) const::<lambda(const 
e&)> > (f=..., n=<optimized out>) at ./src/core/lib/avl/avl.h:165
#42 grpc_core::AVL<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, grpc_core::ChannelArgs::Value>::ForEachImpl<grpc_core::ChannelArgs::FuzzingReferenceUnionWith(grpc_core::ChannelArgs) const::<lambda(const 
e&)> >(const grpc_core::AVL<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, grpc_core::ChannelArgs::Value>::Node *, struct {...} &&) (n=<optimized out>, f=...) at ./src/core/lib/avl/avl.h:165
#43 0x00007ff55f52abb5 in std::char_traits<char>::copy (__n=0, __s2=0x7ffefa4887b0 "h\375\063\035sU", __s1=0x7ff2d2ab8940 "") at /usr/include/c++/10/bits/char_traits.h:402
#44 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy (__n=0, __s=0x7ffefa4887b0 "h\375\063\035sU", __d=0x7ff2d2ab8940 "") at /usr/include/c++/10/bits/basic_string.h:351
--Type <RET> for more, q to quit, c to continue without paging--c
#45 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy (__n=0, __s=0x7ffefa4887b0 "h\375\063\035sU", __d=0x7ff2d2ab8940 "") at /usr/include/c++/10/bits/basic_string.h:346
#46 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator= (__str=..., this=0x55731d33fd70) at /usr/include/c++/10/bits/basic_string.h:722
#47 grpc_core::metadata_detail::GetStringValueHelper<grpc_metadata_batch>::Found<grpc_core::GrpcRetryPushbackMsMetadata> (this=0x55731d33fd68) at ./src/core/lib/transport/metadata_batch.h:704

New tracebacks based on a debug build:

Child main thread

#1  0x00007f74945d23d6 in absl::lts_20230125::synchronization_internal::FutexImpl::WaitUntil (t=..., val=0, v=0x7f720788c940) at third_party/abseil-cpp/absl/synchronization/internal/futex.h:104
#2  absl::lts_20230125::synchronization_internal::Waiter::Wait (this=this@entry=0x7f720788c940, t=t@entry=...) at third_party/abseil-cpp/absl/synchronization/internal/waiter.cc:93
#3  0x00007f74945d2302 in AbslInternalPerThreadSemWait_lts_20230125 (t=...) at third_party/abseil-cpp/absl/synchronization/internal/per_thread_sem.cc:89
#4  0x00007f74945d2e18 in absl::lts_20230125::synchronization_internal::PerThreadSem::Wait (t=...) at third_party/abseil-cpp/absl/synchronization/internal/per_thread_sem.h:107
#5  0x00007f74945d3823 in absl::lts_20230125::Mutex::Block (this=this@entry=0x560341e544a8, s=0x7f720788c900) at third_party/abseil-cpp/absl/synchronization/mutex.cc:1160
#6  0x00007f74945d4516 in absl::lts_20230125::Mutex::LockSlowLoop (this=0x560341e544a8, waitp=0x7ffc5ed42050, flags=<optimized out>) at third_party/abseil-cpp/absl/synchronization/mutex.cc:2062
#7  0x00007f74945d4ab4 in absl::lts_20230125::Mutex::LockSlowWithDeadline (this=0x560341e544a8, how=0x7f7494846ec0 <absl::lts_20230125::kExclusiveS>, cond=0x0, t=..., flags=0) at third_party/abseil-cpp/absl/synchronization/mutex.cc:1924
#8  0x00007f74940c7c48 in absl::lts_20230125::Mutex::LockSlow (this=<optimized out>, how=how@entry=0x7f7494846ec0 <absl::lts_20230125::kExclusiveS>, cond=cond@entry=0x0, flags=flags@entry=0) at third_party/abseil-cpp/absl/synchronization/mutex.cc:1815
#9  0x00007f74945d4d8d in absl::lts_20230125::Mutex::Lock (this=this@entry=0x560341e544a8) at third_party/abseil-cpp/absl/synchronization/mutex.cc:1518
#10 0x00007f7494365d55 in gpr_mu_lock (mu=mu@entry=0x560341e544a8) at src/core/lib/gpr/sync_abseil.cc:50
#11 0x00007f7494368495 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::Start (this=0x560341e544a0) at src/core/lib/gprpp/posix/thd.cc:173
#12 0x00007f74943570da in grpc_event_engine::experimental::TimerManager::RestartPostFork (this=0x560341e67580) at src/core/lib/event_engine/posix_engine/timer_manager.cc:165
#13 0x00007f749433cab1 in grpc_event_engine::experimental::PostforkChild () at src/core/lib/event_engine/forkable.cc:68
#14 0x00007f7496431f83 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x00007f7496478b28 in fork () from /lib/x86_64-linux-gnu/libc.so.6

Child another thread

#1  0x00007f7494584305 in AbslInternalSpinLockDelay_lts_20230125 (w=w@entry=0x7f7494a73738 <absl::lts_20230125::synchronization_internal::freelist_lock>, value=value@entry=9) at third_party/abseil-cpp/absl/base/internal/spinlock_linux.inc:63
#2  0x00007f74940c5a00 in absl::lts_20230125::base_internal::SpinLockDelay (scheduling_mode=absl::lts_20230125::base_internal::SCHEDULE_KERNEL_ONLY, loop=1, value=9, w=0x7f7494a73738 <absl::lts_20230125::synchronization_internal::freelist_lock>) at third_party/abseil-cpp/absl/base/internal/spinlock_wait.h:92
#3  absl::lts_20230125::base_internal::SpinLock::SlowLock (this=this@entry=0x7f7494a73738 <absl::lts_20230125::synchronization_internal::freelist_lock>) at third_party/abseil-cpp/absl/base/internal/spinlock.cc:162
#4  0x00007f74940c726b in absl::lts_20230125::base_internal::SpinLock::Lock (this=0x7f7494a73738 <absl::lts_20230125::synchronization_internal::freelist_lock>) at third_party/abseil-cpp/absl/base/internal/spinlock.h:75
#5  absl::lts_20230125::base_internal::SpinLockHolder::SpinLockHolder (l=0x7f7494a73738 <absl::lts_20230125::synchronization_internal::freelist_lock>, this=<synthetic pointer>) at third_party/abseil-cpp/absl/base/internal/spinlock.h:196
#6  absl::lts_20230125::synchronization_internal::NewThreadIdentity () at third_party/abseil-cpp/absl/synchronization/internal/create_thread_identity.cc:105
#7  absl::lts_20230125::synchronization_internal::CreateThreadIdentity () at third_party/abseil-cpp/absl/synchronization/internal/create_thread_identity.cc:133
#8  0x00007f74945d4eac in absl::lts_20230125::synchronization_internal::GetOrCreateCurrentThreadIdentity () at third_party/abseil-cpp/absl/synchronization/internal/create_thread_identity.h:47
#9  absl::lts_20230125::Synch_GetPerThread () at third_party/abseil-cpp/absl/synchronization/mutex.cc:568
#10 absl::lts_20230125::Synch_GetPerThreadAnnotated (mu=0x560341e544a8) at third_party/abseil-cpp/absl/synchronization/mutex.cc:576
#11 absl::lts_20230125::CondVar::WaitCommon (this=0x560341e544b0, mutex=0x560341e544a8, t=...) at third_party/abseil-cpp/absl/synchronization/mutex.cc:2610
#12 0x00007f7494365e7f in gpr_cv_wait (cv=cv@entry=0x560341e544b0, mu=mu@entry=0x560341e544a8, abs_deadline=...) at src/core/lib/gpr/sync_abseil.cc:78
#13 0x00007f74943685b0 in operator() (__closure=0x0, v=<optimized out>) at src/core/lib/gprpp/posix/thd.cc:136
#14 _FUN () at src/core/lib/gprpp/posix/thd.cc:150
#15 0x00007f74966dbea7 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007f74964acacf in clone () from /lib/x86_64-linux-gnu/libc.so.6

This happens around once every many thousand times. It seems to be timing sensitive, as reproduction ease varies across environments, as well as seemingly being sensitive to prints/logging added. At the hang point, a futex call, 7 or 8 threads have already been restarted (the grpc threadpool) and also hang together, so post-fork handlers did run, to some extent, on the child. Gevent 20+ is heavily used in the project. I don't have a really well isolated reproducer. We did not try very recent versions. The GRPC_ENABLE_FORK (or something like that) variable was experimented with at runtime to no avail. In this project we don't actually care about forking without soon exec'ing anyway. There's seemingly nothing wrong in the parent process.

The reproduction context -- trying to not add uncertain things or misinformation in general - seems to be "fork a lot while having >1 thread running on the parent".

@suntropez suntropez changed the title Post-fork hang after atfork handlers in libc Post-fork hang after atfork handlers in glibc (Linux, Gevent) Oct 29, 2024
@sreenithi
Copy link
Contributor

Hi @suntropez, thanks for reporting this. 1.56.0 looks like quite an old version, can you also try running with a recent version of gRPC and see if it helps.
Also, it would be helpful to get the output of strace and multiple stacktraces to investigate this issue further.

@suntropez
Copy link
Author

multiple stacktraces to investigate this issue further.

Just added two new tracebacks we got yesterday from a debug build.

old version

Indeed. We have a pile of things on top, so currently updating (and reproducing in the original environment) would be a whole endeavor. AFAIK there was no luck reproducing with a very small project (just grpc installed for example).

@suntropez
Copy link
Author

Some good news. Based on those traces, we tried disabling GRPC_POSIX_FORK_ALLOW_PTHREAD_ATFORK and GRPC_ENABLE_FORK_SUPPORT (build time) and so far there were no hangs.

@sreenithi
Copy link
Contributor

Hi @suntropez, thanks for the update. I am marking the issue as closed as you've managed to resolve it. Feel free to reopen if there are any further issues.

@suntropez
Copy link
Author

I understand if closing it would be preferred in the end (given forking is known to be delicate and not really supported) but I wanna voice out that the workaround we found is just that. Applying the build flags to not support the forking can be intensely annoying depending on how complex the project is, given the so many gotchas of Python tooling and grpcio really being an a second level dependency in my case. If feasible, I'd suggest a runtime flag. I understand it'd be great to have a experimental evidence on whether using the latest grpcio version would fix or minimize this, however for now at least that's too far from reach for the projects I'm handling this issue for.

@sreenithi
Copy link
Contributor

@eugeneo do we currently support setting these fork support flags at runtime or using environment variables?

@sreenithi sreenithi reopened this Nov 18, 2024
@eugeneo
Copy link
Contributor

eugeneo commented Nov 18, 2024

I am currently working on a fork support for future gRPC versions, starting from a clean slate. I am not familiar with the current implementation...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants