Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

assert() causes OSv to hang in abort if relevant part of the app ELF not populated #1237

Open
wkozaczuk opened this issue Jun 7, 2023 · 3 comments

Comments

@wkozaczuk
Copy link
Collaborator

When writing and trying a new test tst-brk.cc I discovered that OSv hangs instead of printing normally in this case Abortion failed ... and the corresponding backtrace. After connecting with gdb I would see following stack trace:

(gdb) bt
#0  abort (fmt=fmt@entry=0x405ac310 "Assertion failed: %s (%s: %s: %d)\n") at runtime.cc:113
#1  0x0000000040238949 in __assert_fail (expr=<optimized out>, file=<optimized out>, line=<optimized out>, func=<optimized out>) at runtime.cc:153
#2  0x00000000402faf97 in page_fault (ef=0x400000f3f078) at arch/x64/mmu.cc:38
#3  <signal handler called>
#4  0x00000000403fa550 in memchr (src=src@entry=0x100000007277, c=c@entry=0, n=n@entry=2147483647) at musl/src/string/memchr.c:16
#5  0x00000000403fb567 in strnlen (s=s@entry=0x100000007277 <error: Cannot access memory at address 0x100000007277>, n=n@entry=2147483647) at musl/src/string/strnlen.c:5
#6  0x00000000403c633c in printf_core (f=f@entry=0x200000200cc0, fmt=fmt@entry=0x405ac310 "Assertion failed: %s (%s: %s: %d)\n", ap=ap@entry=0x200000200b28, 
    nl_arg=nl_arg@entry=0x200000200bc0, nl_type=nl_type@entry=0x200000200b40) at libc/stdio/vfprintf.c:603
#7  0x00000000403c6df0 in vfprintf (f=f@entry=0x200000200cc0, fmt=0x405ac310 "Assertion failed: %s (%s: %s: %d)\n", ap=<optimized out>) at libc/stdio/vfprintf.c:692
#8  0x00000000403c7065 in vsnprintf (s=<optimized out>, n=<optimized out>, fmt=<optimized out>, ap=<optimized out>) at libc/stdio/vsnprintf.c:49
#9  0x00000000402388d5 in abort (fmt=fmt@entry=0x405ac310 "Assertion failed: %s (%s: %s: %d)\n") at runtime.cc:125
#10 0x0000000040238949 in __assert_fail (expr=expr@entry=0x100000007277 <error: Cannot access memory at address 0x100000007277>, 
    file=file@entry=0x100000007000 <error: Cannot access memory at address 0x100000007000>, line=line@entry=33, 
    func=func@entry=0x1000000072ba <__func__.0> <error: Cannot access memory at address 0x1000000072ba>) at runtime.cc:153
#11 0x000010000000651c in main () at /home/wkozaczuk/projects/osv-true-master/tests/tst-brk.c:33
#12 0x00000000403973dd in osv::application::run_main (this=0x600000bd9c10) at core/app.cc:416
#13 0x00000000403975d9 in operator() (app=<optimized out>, __closure=0x0) at core/app.cc:236
#14 _FUN () at core/app.cc:238
#15 0x00000000403ca2b6 in operator() (__closure=0x600000d58700) at libc/pthread.cc:116
#16 std::__invoke_impl<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__f=...)
    at /usr/include/c++/11/bits/invoke.h:61
#17 std::__invoke_r<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__fn=...)
    at /usr/include/c++/11/bits/invoke.h:154
#18 std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/11/bits/std_function.h:290
#19 0x0000000040366aaa in sched::thread::main (this=0x400000f3a040) at core/sched.cc:1409
#20 sched::thread_main_c (t=0x400000f3a040) at arch/x64/arch-switch.hh:350
#21 0x00000000402fae43 in thread_main () at arch/x64/entry.S:116

After a bit of analysis, I concluded that the corresponding part of the text segment of the application ELF where expr or file or func is located most likely has not been populated yet when this code was being executed:

// __assert_fail() is used by the assert() macros
OSV_LIBC_API
void __assert_fail(const char *expr, const char *file, unsigned int line, const char *func)
{
    abort("Assertion failed: %s (%s: %s: %d)\n", expr, file, func, line);
}

The abort() called down the stream turns off the interrupts but when the printf tries to read that text from memory it triggers the page fault that causes nested abort (interrupts cannot be disabled when handling the faults):

assert(ef->rflags & processor::rflags_if);

I am not 100% sure what the right solution is but pre-fetching this text data in __assert_fail before calling abort seems like one way to solve this problem:

diff --git a/runtime.cc b/runtime.cc
index 5f67e79b..4d78b5b6 100644
--- a/runtime.cc
+++ b/runtime.cc
@@ -150,7 +150,10 @@ void abort(const char *fmt, ...)
 OSV_LIBC_API
 void __assert_fail(const char *expr, const char *file, unsigned int line, const char *func)
 {
-    abort("Assertion failed: %s (%s: %s: %d)\n", expr, file, func, line);
+    if (expr && expr[0] && file && file[0] && func && func[0])
+       abort("Assertion failed: %s (%s: %s: %d)\n", expr, file, func, line);
+    else 
+       abort("Assertion failed!\n");
 }

But may be there is a better solution to this problem.

@nyh
Copy link
Contributor

nyh commented Jun 7, 2023

Can abort itself turn off interrupts only after formatting the string?

@wkozaczuk
Copy link
Collaborator Author

wkozaczuk commented Jun 8, 2023

Possibly we can do that. I wonder what the consequence of it would be. I found the commit that added the line to disable interrupts along with printing the backtrace - d022927. The motivation was to disable the context switch. Printing backtrace may also trigger page faults. Ideally, we should delay disabling the interrupts until after printing the backtrace.

I think the concern is that the abort in general may be called when the system is not stable enough to handle page faults at this time. But maybe this is an acceptable risk. We do want to get as much information as possible about the original condition that triggered the abort on the other hand.

@wkozaczuk
Copy link
Collaborator Author

I found the discussion on the mailing list about this commit - https://groups.google.com/g/osv-dev/c/lntd5UBlMvg/m/AZRs5MHsQCcJ. There was this concern about making the backtrace printing not use malloc():

"When I debugging the patch, I saw the situation that even my print_backtrace() code doesn't call malloc(), another threads calling malloc().

And I started worrying another running thread breaks system before print_backtrace() finishes print out backtrace.
Also it looks meaning-less to make print_backtrace() malloc-free, when other threads actually continuing to call malloc()."

But I wonder why this was a concern. Disabling the interrupts prevents the context switch only on the current cpu, so threads on other CPUs could still call malloc. And why calling malloc by other threads would be bad.

So what would be the consequence of moving the arch::irq_disable(); right after constructing and printing the backtrace?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants