EC firmware contains many unbounded while...{} loops that could block the control loop
#3
Comments
All of the If you are willing to accept that the i2c timeout structure is done correctly, that leaves two types of unbounded loops:
UART TX FULL could use a timeout, but a timeout will lead to lost characters if the receiving computer is slow or has not started up. The workaround was to not print anything when not debugging, which has the added benefit of improving the EC's performance, since executing print statements adds significant lag. But sure, some sort of timeout could make sense there. I think I may have seen a PR you made to fix that anyway.
Wifi STATUS TIP: here it is less clear whether a timeout has any benefit. Small wifi commands send in about 16 CPU cycles, or about 8 instructions, so by the time you run through one iteration of a checked loop you would pretty much be moving on. So it's more of a performance concern. If the wifi link ends up being unstable, then adding a timeout whose error handler resets the wifi chip makes a lot of sense. However, simply timing out and "moving on" without noting that a wifi command had failed may improve the responsiveness of the EC, but would probably also make debugging network connectivity issues a lot harder. In a way, having it hang on or near the offending wifi command may be a feature, not a bug: at least we can stop EC execution, look at the current address in GDB (assuming you've left the debug bridge compiled into the gateware), and determine what the offending sequence of events was, as opposed to only figuring out that something broke much later and trying to trace back through that. |
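To make the wifi trade-off concrete, here is a rough sketch of a wait that times out, resets the chip, and still records which command failed so the failure is not silently dropped. Everything in it is an assumption for illustration: `WifiLink`, `WifiTimeout`, `wait_for_command`, and `FakeLink` are hypothetical names, not part of the actual EC HAL, and the poll budget is arbitrary.

```rust
/// Hypothetical abstraction over the wifi chip (NOT the real EC HAL).
pub trait WifiLink {
    /// True while a command transfer is still in progress.
    fn transfer_in_progress(&mut self) -> bool;
    /// Hard-reset the wifi chip.
    fn reset(&mut self);
}

/// Record of a command that never completed, kept so the failure
/// can be reported instead of being silently skipped.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct WifiTimeout {
    pub command_id: u8,
    pub polls: u32,
}

/// Wait for the in-flight command, polling at most `max_polls` times.
/// On timeout, reset the chip and return which command failed.
pub fn wait_for_command<W: WifiLink>(
    link: &mut W,
    command_id: u8,
    max_polls: u32,
) -> Result<u32, WifiTimeout> {
    for polls in 0..max_polls {
        if !link.transfer_in_progress() {
            return Ok(polls); // number of busy polls seen before completion
        }
    }
    link.reset();
    Err(WifiTimeout { command_id, polls: max_polls })
}

/// Fake link for exercising the sketch: reports busy for `busy_polls` polls.
pub struct FakeLink {
    pub busy_polls: u32,
    pub resets: u32,
}

impl WifiLink for FakeLink {
    fn transfer_in_progress(&mut self) -> bool {
        if self.busy_polls == 0 {
            false
        } else {
            self.busy_polls -= 1;
            true
        }
    }

    fn reset(&mut self) {
        self.resets += 1;
        self.busy_polls = 0;
    }
}
```

The caller gets the command id back in the error, so it can log it or pass it up the COM bus. The cost is one counter compare per poll, which matters given that a small command finishes in roughly one loop iteration.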
maybe you could turn this into a PR? https://github.com/samblenny/betrusted-ec/commit/f56c1719e775e61dc478f596560790f74864ca3c |
For context, I'm thinking of rule number 2 from Gerard J. Holzmann's Power of 10 rules for developing safety critical code:
The problem is that the functions in the conditionals for the while loops are not guaranteed to ever return the ending value. Some unexpected thing could happen (perhaps a loose GPIO wire at the top FPC pads?) to invalidate the assumptions that normally made the while loop safe, and then the firmware would get wedged in an infinite loop.
Perhaps the while loops could be moved inside of the timeout checks? Or, switching to something like
But, failing that safety check would need to trigger some type of handler for serious errors. Maybe a reset and retry, logging an error code if that was possible under the circumstances, or perhaps something as extreme as halting the CPU.
A related data point is that, based on my experiments with serial port stuff, it seems that tight
That said, I've been reading the code for wrapping the hard I2C block, and I see that messing with the timing around the I2C calls in any way would probably be very problematic. I'm not excited about opening that can of worms.
For the serial TX buffer stuff, I think that commit is still incomplete. It seems like some of the hal drivers use another copy of
Retaining the ability to inform Xous about errors over the COM bus might be the biggest incentive to avoid getting stuck in loops. If the EC can talk to Xous, Xous can show an error message on the screen. That will be helpful for dealing with errors that happen in the wild rather than strictly on a workbench.
If you're interested in a PR that's limited to better error recovery around full TX buffers, I can look into that. |
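As a rough illustration of the bounded-loop idea applied to a full TX buffer, the sketch below replaces an open-ended wait with a `for` loop that has a fixed upper bound and returns an error when the bound is hit. All names here (`poll_bounded`, `BoundExceeded`, `FakeUart`, `TX_POLL_LIMIT`) are hypothetical and the limit is an arbitrary placeholder. This is not the code from the referenced commit or from the EC HAL.

```rust
/// Returned when a condition did not become true within the loop bound.
#[derive(Debug, PartialEq, Eq)]
pub struct BoundExceeded {
    pub limit: u32,
}

/// Poll `ready` at most `limit` times instead of spinning forever.
/// Returns the number of failed polls that preceded success.
pub fn poll_bounded<F: FnMut() -> bool>(limit: u32, mut ready: F) -> Result<u32, BoundExceeded> {
    for attempt in 0..limit {
        if ready() {
            return Ok(attempt);
        }
    }
    Err(BoundExceeded { limit })
}

/// Assumed poll budget for one character; a real value would need tuning.
pub const TX_POLL_LIMIT: u32 = 1_000;

/// Hypothetical stand-in for the debug UART: the TX buffer stays full
/// for `full_for` polls, then has room.
pub struct FakeUart {
    pub full_for: u32,
    pub sent: u32,
    pub dropped: u32,
}

impl FakeUart {
    /// True once the simulated TX buffer has room for another byte.
    pub fn tx_has_room(&mut self) -> bool {
        if self.full_for == 0 {
            true
        } else {
            self.full_for -= 1;
            false
        }
    }

    /// Send one byte, or count it as dropped if the buffer never drains.
    pub fn putc(&mut self, _byte: u8) -> Result<(), BoundExceeded> {
        let outcome = poll_bounded(TX_POLL_LIMIT, || self.tx_has_room());
        match outcome {
            Ok(_) => {
                self.sent += 1;
                Ok(())
            }
            Err(e) => {
                // Keep a count so the loss can be reported later.
                self.dropped += 1;
                Err(e)
            }
        }
    }
}
```

Counting dropped characters keeps the control loop moving while leaving a trace that something went wrong, which could later be reported to Xous over the COM bus. Whether to retry, reset, or halt on `BoundExceeded` is left to the caller.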
I think that's the big question -- what, then, do you do if the loop times out? Simply returning to the waiting function doesn't make things better. There is a WDT function, you could have a timed-out loop just reset the firmware, I suppose. But keep in mind you're running on like...a 5MIPS MCU, and error handling code has a real performance penalty.
Sure...if you're excited about writing an error handling and reporting stack for the EC, all the more power to you! But doing this is probably beyond my ability. I am still not terribly comfortable with Rust's built-in error handling mechanisms.
Very likely. Wishbone-tool has a lot of stability issues, but it's a debugging tool and those tend to be held together with tape and glue. I usually just get them to work only well enough to debug the problem and no further.
Yes, targeted PRs are great. Small incremental changes that are easy to merge are the best. If even the commit you referenced was a PR that'd be great. I guess I could also cherry-pick the commit out but generally PRs are preferred to cherry-picked commits, they reduce the friction to incorporate code. |
Okay, I will try to work up a PR that specifically addresses the TX buffer thing. I will also try to figure out if I can reasonably consolidate the two copies of |
I expect I may need to build something along those lines in order to have any hope of making reasonable progress with the network stack. Serial debug might be enough. Not sure. |
TX buffer flow control PR is here: Add TX buffer flow control to EC serial debug #4 |
thanks a ton for that. I really appreciate the extra effort to format the patch! |
You're welcome. |
As far as I'm concerned this can be closed, unless you want to keep it around. Seems like the EC UART stability issues have been fixed and the I2C code appears to work reliably as-is. |
The EC firmware has a lot of unbounded while loops that could block the control loop in the event of a peripheral getting into an unanticipated state.
Converting the while loops to for loops with timeout error handling seems like a potentially fruitful opportunity to improve the stability and responsiveness of the EC firmware control loop.
As an example of what I mean by unbounded loops: