-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable hardware stack zeroing. #111
base: main
Are you sure you want to change the base?
Conversation
beq \top, \scratch, 2f | ||
// If the current zeroing range is a subset of range that we're about to | ||
// zero, don't bother waiting, just zero from the start. | ||
ctestsubset \top, c\base, c\scratch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be the only use of ctestsubset
in the RTOS. I think at one point we were thinking of removing the instruction in hopes of simplifying the (micro)architecture?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do it with checks on the top and bottom, since we already have checks that CSP is sane (correct permissions, correctly aligned top and bottom) on entry into the compartment switcher. I'm not sure if we could do that without adding another temporary register, which would require some small tweaks to the users of this macro.
1f5aefd
to
8f22cdc
Compare
This runs the
We'll see if it survives the full test harness! |
It does survive the full test harness, huzzah! Reported cycle numbers for 452d404 + this PR, without and with the zeroizer support enabled:
|
8f22cdc
to
cfcc489
Compare
sdk/core/switcher/entry.S
Outdated
// Derive the capability to the range that should be zeroed. | ||
// This is stored in base, leaving top as a second scratch register to use | ||
csub \top, c\top, c\base | ||
csetboundsexact c\base, c\base, \top |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per discussions out of band, this should probably be
csetboundsexact c\base, c\base, \top | |
csetbounds c\base, c\base, \top | |
cincoffset c\base, c\base, \top |
- We need to set the address to the top of the region to be zeroed, since the state machine counts down.
- Using inexact bounds allows us to tolerate large (> 2KiB) stacks without risk, and...
- Because the stack is guaranteed to be a superset representable region, we'll never try to grow beyond it
- Because the state machine counts down from the (precise) address, it won't zeroize parts of the stack in use
- It might zero a little bit past the HWM, but that's fine.
EXPORT_ASSEMBLY_OFFSET(TrustedStack, mshwmb, (17 * 8) + 4) | ||
#ifdef CHERIOT_HAS_ZTOP | ||
EXPORT_ASSEMBLY_OFFSET(TrustedStack, ztop, 16 * 8) | ||
# define TSTACK_HAS_ZTOP 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is extremely minor, but the differing "types" of CHERIOT_HAS_ZTOP
and TSTACK_HAS_ZTOP
is bugging me. Maybe TSTACK_ZTOP_WORDS
or such?
3b4b670
to
c1d69b5
Compare
This replaces the software stack zeroing path with offload to a hardware engine, when available. The hardware zeroing pipeline is controlled by the ZTOP SCR (currently assumed to be number 27). This zeroes from the top of the capability in ZTOP to the bottom. We assume that short-lived cross-compartment calls or returns from deeply nested calls will result in ranges where each is a subset of another, so we try to coalesce. If we're about to zero a region that is a superset of the current zeroing region, we skip it, otherwise we wait for the previous zeroing region to finish. This is a fairly simple approach but was chosen because adding a lot of instructions in this path can rapidly offset the benefits. 0.3% speedup on Sonata. 2% speedup on the Ibex SAFE simulator. The difference here is largely due to the UART at 115,200 b/s being *very* slow in comparison to the CPU and so the test suite is mostly waiting to write debug messages. The simulator UART can write one character per cycle. The allocator portion of the test suite (which does a lot of work per message) is around 1% faster. The test suite crashes on the Arty A7, which may be a bug in the A7 version but may be a bug in this code and so we shouldn't merge it yet.
Test failure is interesting. The SHWM is reporting a lot more stack usage in this mode, but it shouldn't. There may be a hardware bug in the SHWM's interaction with the zeroing engine. |
This replaces the software stack zeroing path with offload to a hardware engine, when available. The hardware zeroing pipeline is controlled by the ZTOP SCR (currently assumed to be number 27). This zeroes from the top of the capability in ZTOP to the bottom.
We assume that short-lived cross-compartment calls or returns from deeply nested calls will result in ranges where each is a subset of another, so we try to coalesce. We could probably do better aggregating overlapping but not subset ranges.