chore(profiling): turn off forking test #11559
base: main
Benchmarks

Benchmark execution time: 2024-11-26 22:12:43

Comparing candidate commit 857c936 in PR branch

Found 0 performance improvements and 0 performance regressions! Performance is the same for 388 metrics, 2 unstable metrics.
```
@@ -35,5 +35,7 @@ endfunction()
dd_wrapper_add_test(test_initialization test_initialization.cpp)
dd_wrapper_add_test(test_api test_api.cpp)
dd_wrapper_add_test(test_threading test_threading.cpp)
dd_wrapper_add_test(test_forking test_forking.cpp)
# TODO: Re-enable test_forking. It was turned off as it started to fail with
# v14.3.1 in https://github.com/DataDog/dd-trace-py/pull/11547.
```
From the failures in that PR (example: https://gitlab.ddbuild.io/DataDog/apm-reliability/dd-trace-py/-/jobs/721327091) it's not immediately clear to me what's going wrong. I'm hesitant to skip these tests as they have helped me catch at least one real problem (a deadlock after forking with the libdatadog string table). Could you say more about why we should skip the tests? Are the failures not indicating a real problem in libdatadog or our library?
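The fork deadlock mentioned above belongs to a well-known class: if another thread holds a lock at the moment of `fork()`, the child inherits the lock in its held state but not the thread that would release it. Below is a minimal, generic C++ sketch of the standard `pthread_atfork` mitigation. It is an illustration only, not libdatadog's or dd_wrapper's code; `table_lock`, `table_entries`, and `fork_while_contended` are hypothetical names.

```cpp
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <atomic>
#include <cassert>
#include <thread>

namespace fork_demo {

// Hypothetical shared state standing in for something like a string table
// that worker threads mutate under a plain pthread mutex.
pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
long table_entries = 0;

// Fork handlers: take the lock before fork() so no other thread can hold it
// while the address space is copied, then release it in parent and child.
void prefork() { pthread_mutex_lock(&table_lock); }
void postfork_parent() { pthread_mutex_unlock(&table_lock); }
void postfork_child() { pthread_mutex_unlock(&table_lock); }

// Forks `rounds` children while a background thread keeps taking the lock.
// Each child takes the lock once and exits. Returns how many children
// exited cleanly with status 0.
int fork_while_contended(int rounds) {
    // Register the handlers exactly once (thread-safe static initialization).
    static const int registered = pthread_atfork(prefork, postfork_parent, postfork_child);
    (void)registered;

    std::atomic<bool> done{false};
    std::thread writer([&done] {
        while (!done.load()) {
            pthread_mutex_lock(&table_lock);
            ++table_entries;
            pthread_mutex_unlock(&table_lock);
            std::this_thread::yield();  // give the forking thread a chance at the lock
        }
    });

    int clean_exits = 0;
    for (int i = 0; i < rounds; ++i) {
        pid_t pid = fork();
        if (pid < 0) break;
        if (pid == 0) {
            // Child: only the forking thread survives. Without the handlers,
            // the lock could be left held by `writer`, which does not exist
            // here, and this call would block forever.
            pthread_mutex_lock(&table_lock);
            ++table_entries;
            pthread_mutex_unlock(&table_lock);
            _exit(0);
        }
        int status = 0;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            ++clean_exits;
        }
    }
    done.store(true);
    writer.join();
    return clean_exits;
}

}  // namespace fork_demo
```

Acquiring every lock in the prepare handler and releasing it in both post-fork handlers is the pattern described in the POSIX `pthread_atfork` rationale. A test that forks while other threads are busy, as this sketch does, is exactly the kind of check that catches a missing handler.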
I couldn't reproduce the failures with plain `cmake` + `ctest`, but I was able to reproduce them by running via `scripts/ddtest` plus `riot`, building from #11547. I had to add logging to see what was happening (TODO: send a PR), but in `ForkDeathTest.SampleInThreadsAndForkManyFast` and other tests, child processes are dying with segfaults. I built with debug info and got some core dumps. Here's one of the crashes:
```
(gdb) bt
#0 atomic_load<usize> () at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/sync/atomic.rs:3288
#1 load () at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/sync/atomic.rs:2396
#2 push_or_else<*mut core::ffi::c_void, crossbeam_queue::array_queue::{impl#4}::push::{closure_env#0}<*mut core::ffi::c_void>> ()
    at /go/src/github.com/DataDog/apm-reliability/libddprof-build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/crossbeam-queue-0.3.11/src/array_queue.rs:134
#3 push<*mut core::ffi::c_void> ()
    at /go/src/github.com/DataDog/apm-reliability/libddprof-build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/crossbeam-queue-0.3.11/src/array_queue.rs:206
#4 {closure#0} () at ddcommon-ffi/src/array_queue.rs:145
#5 ddog_ArrayQueue_push () at ddcommon-ffi/src/array_queue.rs:143
#6 0x00007f20d6780012 in Datadog::SynchronizedSamplePool::return_sample (this=<optimized out>, sample=<optimized out>)
    at /root/project/ddtrace/internal/datadog/profiling/dd_wrapper/src/synchronized_sample_pool.cpp:63
#7 0x00007f20d678042f in Datadog::SampleManager::drop_sample (sample=<optimized out>, sample=<optimized out>) at /usr/include/c++/8/bits/unique_ptr.h:342
#8 0x000055c456cc8054 in send_sample (id=<optimized out>) at /root/project/ddtrace/internal/datadog/profiling/dd_wrapper/test/test_utils.hpp:153
#9 0x000055c456cc819f in emulate_sampler (id=0, sleep_time_ns=10000, done=...) at /root/project/ddtrace/internal/datadog/profiling/dd_wrapper/test/test_utils.hpp:165
#10 0x00007f20d712ab2f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007f20d6eb8fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#12 0x00007f20d6dea06f in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) p $_siginfo._sifields._sigfault.si_addr
$1 = (void *) 0x80
```
Seems like a legitimate issue? Will continue investigating, but I definitely don't want to just turn off these tests at this point.
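The faulting address in the backtrace, `0x80`, is a tiny offset from zero, which typically means a field was read through a null base pointer. The arithmetic below illustrates that reading under an assumption that has not been checked against crossbeam-queue 0.3.11: that `ArrayQueue` keeps its head and tail counters in separate 128-byte cache-padded slots and that `push` loads the tail first. Under that assumption, a null queue pointer would fault at exactly `0x80`, which would point at the queue handle used by `SynchronizedSamplePool::return_sample` in the crashing process. This is an inference, not a diagnosis; `PaddedQueueHeader` and `tail_address` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout: head and tail counters on separate 128-byte-aligned
// cache lines, similar in spirit to crossbeam's CachePadded on x86_64.
// The real Rust struct layout is chosen by the compiler and may differ;
// this only illustrates the offset arithmetic.
struct PaddedQueueHeader {
    alignas(128) std::size_t head;
    alignas(128) std::size_t tail;
};

// Address that a load of `tail` would touch if the queue pointer were `base`.
std::uintptr_t tail_address(std::uintptr_t base) {
    return base + offsetof(PaddedQueueHeader, tail);
}

static_assert(offsetof(PaddedQueueHeader, tail) == 0x80,
              "tail sits one 128-byte line after head");
```

With `base == 0` (a null queue pointer), `tail_address` yields `0x80`, matching `si_addr` above. With a valid base it yields an ordinary heap address.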
Edit to add: the `Error uploading (failed ddog_prof_Exporter_send: error trying to connect: received corrupt message of type InvalidContentType: received corrupt message of type InvalidContentType)` lines are probably noise. As far as I can tell, the test isn't set up to point at an agent that actually exists, and it doesn't check that uploads succeed; those stderr logs only show up because the test fails.
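Seeing that the children were segfaulting required extra logging. As a hedged sketch of what such logging could look like, the parent can `waitpid()` on each child and report whether it exited normally or was killed by a signal. The helper `run_in_child` is hypothetical; it is not the logging mentioned above and not part of dd_wrapper.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <csignal>
#include <cstring>
#include <functional>
#include <string>

// Hypothetical helper: run `body` in a forked child and describe how the
// child terminated, so a crash in the child shows up in the parent's output
// even when nothing else checks for it.
std::string run_in_child(const std::function<void()>& body) {
    pid_t pid = fork();
    if (pid < 0) {
        return "fork failed: " + std::string(std::strerror(errno));
    }
    if (pid == 0) {
        body();
        _exit(0);  // skip atexit handlers and static destructors inherited from the parent
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return "waitpid failed";
    }
    if (WIFEXITED(status)) {
        return "exited with status " + std::to_string(WEXITSTATUS(status));
    }
    if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        return "killed by signal " + std::to_string(sig) + " (" + strsignal(sig) + ")";
    }
    return "unknown termination";
}
```

In a fork test, `body` would be the code that starts the sampler threads in the child. On Linux with glibc, a segfaulting child would be reported as `killed by signal 11 (Segmentation fault)`, which surfaces the crash directly instead of leaving only unrelated upload errors on stderr.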
To unblock "fix(crashtracking): resolve issue with zombie processes being behind" before the re:Invent code freeze.