baseline stack overrun in Twitch livestream frames #607
Comments
You've pretty much hit the nail on the head -- stack exhaustion is exactly what's going on here. Unfortunately, we already dedicate a full GB to the CPU stack, which is half the available 32-bit address space. Disabling the JIT by default doesn't seem like a winning strategy, and since this doesn't work anyway, one way around it might be to unconditionally blacklist one of the Twitch components so it doesn't load.
What mechanism would be the most appropriate for this? With more verification today, Twitch.tv has a frontpage livestream and the same crash happens there, so this almost entirely prevents Twitch from being browsed at all. The crashing script(s) seem to reside on static.twitchcdn.net, and blocking that CDN simply prevents Twitch domains from loading, so it's not obvious how this should be handled. The Intel people and/or others tracking the repo changesets likely will not want this, so it smells like I can put up a PR and do some testing for the next beta (if it's okay with you); just let me know where the blacklist entry should go.
Sure, feel free to take a whack at it. If you could get a PR up by the end of this week, that will give me enough time to pull it in. I think we should do it like #469 but in a block before it, and gate it with a different pref (which should not exist yet). I bet there will be other hosts in this category eventually, so let's have a systematic solution for it.
…t can crash the browser, but have no obvious workaround
(Sorry about the above, I should stop force-pushing my tree.) So I'm using a browser built off NapalmSauce@cf9bf7d and it's working as intended. I've added a logging function and a logging pref as well. The string-matching macro name is BLOC (vs BLOK) to make it easy to tell the blacklist apart from the adblock, should it grow larger. I think it's ready for you to review, so I'll put up my PR shortly.
…sh the browser, but have no obvious workaround (#609)
Mitigated by #609, but leaving open in case I figure out what's really going on.
First tag to crash is 45.5.0b1 (45.4.0 survives); #453 had a common regression range. I lack the background here to tell whether it has to do with hybrid typed-array endianness, but *-leopard-webkit doesn't seem to crash on Twitch. Probably not that good a comparison; I can just reverse the JIT portion of 312362 in changesets-20160923 (see comment 4) and let you know what happens. It won't cost me much, and should better document what might be going on.
Backing out that portion of 312362 doesn't change anything about the crash, so maybe the scripts are accessing floats as ints or something else sketchy.
No, I don't think it's an endian bug, because it would show up in the interpreter too. I think either there is a JIT-specific bug, or (more likely) the interpreter just correctly terminates when the stack exhausts and the JIT doesn't. |
I'm homing in on |
Could also be |
We'd need some prints in there to see a) if the limit is at the right address and b) that we actually are correctly extracting the stack pointer from the BaselineFrame. |
45.4 and below only had the bug wallpapered, since there's no recursion abort in the console. In >45.5b1 with Baseline disabled, there's a recursion error for the script (a warning on the 1.7MB static.twitchcdn.net/assets/worker.min-a76eeba0bf85a2f41428.js). There are emscripten-related names in there, so 45.4 simply halts the script before getting to the crash. I'm doing a debug build to start investigating the recursion limits.
I think I've found an explanation of how the overflow happens. In tenfourfox/dom/workers/RuntimeService.cpp Line 98 in f89941f
Reducing the stack limit there to 1MB still crashes in C++ code, probably because there's still not enough stack space to set up new frames, but the last build I did had a 512KB limit, didn't get any stack overflow, and correctly aborted. Where might the JSRuntime and the VM mapping disagree on stack space?
https://github.com/classilla/tenfourfox/blob/master/dom/workers/WorkerThread.cpp#L29 The stack size for worker threads per /dom/workers/WorkerThread.cpp is 1MB. I also tested enlarging the stack for worker threads to 4MB + 512KB (giving 4MB for the JS stack), and the script still aborts with a recursion error. I don't know at this point whether there might be infinite recursion going on, but at the very least it's crash-free. Let me know what stack size (and JS stack size) worker threads should be given; then I can put up a new PR that will fix the crash without the need for a blocklist. I can also wrap the stack sizes within an |
That's a good question. I will look at both of these things; let me do some thinking. Good detective work! |
Fine by me :P Thanks! |
Something's a little odd here.
What values worked for you? |
By "didn't work" I assume you mean it crashed on loading Twitch? What did you set kWorkerStackSize in WorkerThread.cpp to? (I think I'm just not explaining myself too well.) tenfourfox/dom/workers/WorkerThread.cpp Line 29 in 2b07a7a
Right now it's set to 1MB, so worker threads are spawned with a 1MB stack segment that's enforced by VM permissions.
Doesn't work, since the JIT code can use up the whole 1MB stack before aborting in the recursion checks, then call into C++ code which sets up extra stack frames, goes past the limit, and raises SIGBUS.
Works, since the JS stack uses at most 512KB and always aborts with a safe extra 512KB between the JS stack limit and the real stack limit as enforced by the VM, so C++ code can set up stack frames without crashing. The 512KB safe zone is somewhat arbitrary, but it's the same "extra" as 32-bit Mozilla builds of Firefox 45, and it sounds like a pretty big "extra" zone. So what I'm saying is |
Derp. You know, I didn't get a lot of sleep last night. So, I tried
This still doesn't load any videos, but the chat, incredibly, works, and it doesn't crash. There is still a recursion error or two. How does this perform on your system? (This was my maxed-out Quad G5.) |
I haven't done an opt build since (the last is an -Og debug build), so it's not very fast, but once Twitch has loaded it's usable on my 2x1.5GHz MDD. Olga's FPR4 build does play Twitch streams, and Ken's FPR23 build does not error out and reaches the point where it complains H.264 is unsupported (just a quick test, haven't tried the MPEG-4 enabler), so if the recursion error can be avoided... Twitch might just work (!)
Interesting. Even with Baseline off, I still get recursion errors, so whatever's going on is either an interpreter bug or an endian one (or an endian interpreter bug). Anyway, I set it to 64MB for giggles to see if this unbreaks anything else. I wonder if it should be even higher. |
Just for laughs, I revved it to 512MB. No more recursion error! The video thread doesn't start, though. It seems to be a failure in NSPR. Investigating that further. |
Wow, impressive. How long does the script take to complete? If it takes this much stack space to complete, that might still be an endian bug, no? Versus a mere 512KB-1MB on Intel... I did a new opt build for my MDD with some testing flags to reduce code size, and the Twitch chat works decently; it just slows down because the chat gets spammed on front-page channels.
Surprisingly, not all that long (admittedly on a G5, but this is a debug build). However, the problem is that If I can't sort it out, I might just ship 512MB as the limit (half the 1GB stack). It may make other things start working, at least, and it fixes the crash here even if it doesn't fix the site. |
It could also be trying to spawn another worker thread, which the 1GB stack region obviously can't house simultaneously. Shipping a 512MB stack for worker threads would, in that sense, break any site that uses multiple workers :/
That's a possible explanation too. Maybe let's stick with 64MB and see what shakes out. I can't see anything to twiddle for boosting the system thread max, at least not in |
You might want to consider something small like 3MB or 4MB, as no site before Twitch has used the full 1MB stack space (otherwise we would have gotten this crash much earlier). 64MB looks like overkill if it doesn't make Twitch functional.
No, we need large frames (that's why we have the 1GB stack segment). I don't think a 64MB limit is that excessive under those circumstances. It's an easy change to back out if the beta has problems. |
Good point! (Off-topic:) A little while ago I compiled my own gcc, and I didn't realize before, but I think gcc from MacPorts is built at -O0. My -O2 gcc build takes about 2hrs 33mins to complete an opt TenFourFox build on a 2GHz dual-core G5. It took about 5hrs before with gcc-5 from MacPorts!
Does that extend to |
It does. I even did builds with custom |
I can make gcc on MacPorts build pretty much any way we want, within reason. Iain Sandoe tells us not to be too strict with the optimization on the throwaway bootstrap gcc compiler (hence the weak optimization there) for a bunch of reasons -- the system bootstrap compiler can't handle it, the bootstrap build takes much longer, and the optimization used in building the final compiler has nothing to do with the opts we set for the bootstrap compiler. As a heads-up, there is a BIG problem with an ABI break in libgcc 7.5.0 that we did not see with libgcc 7.4.0. Iain either can't fix it or finds it too time-consuming to fix -- but some of you guys might be able to.
I didn't really notice before because I was looking at #613. I see in the head commit that the worker stack size is bumped to 64MB (and remains 1MB for Intel), but the worker-context JS stack limit is still defined as 2MB for both PowerPC and Intel in /dom/workers/RuntimeService.cpp line 98: tenfourfox/dom/workers/RuntimeService.cpp Line 98 in c530265
So I'm just asking, is the change to |
No ticket in MacPorts to generate -O3 builds of gcc or libgcc at present, no.
Isn't it (I do need to push; I was really tired last night and went to bed early, NOT COVID-19 @kencu. But that change isn't in there.)
So the JS can use up to
Both are relevant. The browser won't be unstable without this; it just won't be able to use the stack according to the size you gave it.
Then what value do you think it should be, so that we're agreed on it? I don't think I want to do that now, since it could enable functionality that isn't being tested (unless I end up spinning another beta for another reason). I could shovel that into FPR26.
Since I'm a bit confused right now: I've spent most of my time on this bug reading how the interfaces for the recursion checks work, and I'm just wondering whether I'm missing anything important. I'm wondering if Twitch didn't throw a recursion error on your 512MB stack test because it couldn't even spawn a single worker thread. I did a test on Twitch with 'kWorkerStackSize' = 4.5MB and Am I missing something? That said, there's no problem if you're busy; I'm just trying to help out.
Bumped to 60MB for PPC. That should leave ample stack for bailing out. |
Anything more to do here?
Unless something else comes up, this looks complete to me. All that's left is for twitch to fix their client, and who knows if that will happen. :) |
Unfortunately, yes. Thanks for doing this and figuring it out. |
(Would Tenderapp have been a better place for this?)
That didn't happen before, but Twitch frames now bomb the browser. Example URL with a bogus channel that still crashes (for me, at the very least):
https://embed.twitch.tv/?autoplay=true&channel=Whatever&height=480
If I have Baseline enabled (with or without Ion), the above makes TFx go down. This can be a nuisance, since you might not expect this to be embedded on some sites.
From crashreporter, running TFxDebug-fpr23:
(huge) backtrace for thread 35:
looking at coredump:
(bad access is 0xf1a19ff0, r1 is 0xf1a1a040 )
(Guard page?)
(thread's stack segment?)
The backtrace makes it look like it's just stack exhaustion from excessive recursion.
(re-checked with gdb768 to see if things are consistent; they are)