Review error and exception handling #1601
Comments
… and address errors is in the scope of issue #1601
Related: #1635 |
Marking this as
|
So I looked at our current error handling. To summarize what we have now:
Here are my suggestions on how we might fix this: It is OK if the balances nodes have of each other are not an exact match. It might even out over time, and even if it doesn't, it's fine as long as the discrepancy doesn't get larger than In addition to the This should give us the following properties:
On all of the state store io errors we should either drop the peer (using In
Cashout (outline, this is not thought through yet):
I have some poc code for the changes to Some open questions I have:
Any comments on these or the suggestions described above are welcome. |
Thanks for a very in-depth analysis @ralph-pichler. Adding my replies as I read:
Per the issue you are referring to, at some point we should be adding code to shut down the node if an I/O error occurs (after the handshake, but during an accounting function). Depending on how we implement this, it might affect connection establishment in a way that reflects this behavior. If not, I would leave the handshake as it is. (in summary: I agree)
Blocking is definitely not an option, so that part is OK. However: couldn't we return routine errors through channels? We would possibly be dealing with them while other go routines are executing, but this is probably preferable to no error handling at all.
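A minimal sketch of the channel idea, for illustration only. `handleMsg`, `runHandlers` and `errStore` are hypothetical names and not part of the swap package; the sketch only shows how errors from go routines could be collected instead of ignored.

```go
package main

import (
	"errors"
	"fmt"
)

// errStore is a hypothetical error standing in for a state store failure.
var errStore = errors.New("state store write failed")

// handleMsg is a hypothetical stand-in for work done inside a handler go routine.
// Odd message ids fail so that the sketch produces some errors.
func handleMsg(id int) error {
	if id%2 == 1 {
		return fmt.Errorf("message %d: %w", id, errStore)
	}
	return nil
}

// runHandlers starts one go routine per message and collects every result
// through a buffered channel, so no go routine blocks on send.
func runHandlers(ids []int) []error {
	errc := make(chan error, len(ids))
	for _, id := range ids {
		go func(id int) { errc <- handleMsg(id) }(id)
	}
	var errs []error
	for range ids {
		if err := <-errc; err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	for _, err := range runHandlers([]int{1, 2, 3}) {
		fmt.Println("handler error:", err)
	}
}
```

Here the caller waits for all results to keep the example short. In a message loop that must not block, the receive loop would presumably run in its own go routine.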
It's ignored because it's inside the
I think we got rid of the confirmation message because we didn't need it (we assumed messages were delivered if the function didn't We can add it back though. I would like to think about how it can solve this problem without adding additional ones.
Here I would like to re-iterate the idea of not ignoring errors while still keeping the go routines, if possible.
To add to this: we should not log and then return an error. We should log errors and handle them in the same place.
What about implementing our own timeout? It might be tough to decide on a value, but still might be useful.
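As a rough sketch of what such a timeout could look like, assuming the blocking call accepts a `context.Context`. `waitWithTimeout`, `neverFinishes` and `shortTimeout` are hypothetical names, and the value of the timeout is arbitrary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// shortTimeout is an arbitrary value chosen for this sketch.
const shortTimeout = 50 * time.Millisecond

// waitWithTimeout runs op in a go routine and gives up after d.
// op is a hypothetical stand-in for a blocking call.
func waitWithTimeout(d time.Duration, op func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()

	// Buffered so the go routine can always deliver its result and exit,
	// even if we have already returned because of the timeout.
	done := make(chan error, 1)
	go func() { done <- op(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// neverFinishes blocks until the context is cancelled.
func neverFinishes(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// succeeds returns immediately without error.
func succeeds(ctx context.Context) error { return nil }

func main() {
	err := waitWithTimeout(shortTimeout, neverFinishes)
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
	fmt.Println("fast call error:", waitWithTimeout(shortTimeout, succeeds))
}
```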
I'm trying to verify this, but I'm not sure where the code goes after
I think it's time we tackled this one. But it's probably more of an edge case than it feels like.
A difference in balances (being out of sync) isn't necessarily a serious problem, and if it is, it's better to drop the peer.
Agreed.
I am assuming here that all of this is per peer—meaning, we have either In principle I like this idea, but I must ask: what happens if the node that receives the cheque updates its balance, but the If so: wouldn't this cause the receiving node to update the balance twice? 🤔 Also: it's not entirely clear to me based on this paragraph at exactly which times we register the
To make sure I understand this correctly: if a
Just in case: can you please refer to the line this honey calculation is on?
OK, so this sort of addresses my previous question. But how does a node know it has already processed a cheque before?
A node drops a peer when the balance relative to it is over the threshold and the node is about to process a message which would increase the debt the peer owes to it. Right?
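The rule as phrased in this question could be sketched as follows. This is only an illustration of the wording above, not the current swap code; `checkDebt`, `errDisconnect` and the sign convention (a positive value means the peer owes us) are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// errDisconnect is a hypothetical error signalling that the peer should be dropped.
var errDisconnect = errors.New("peer is over the disconnect threshold")

// checkDebt returns errDisconnect when the peer already owes more than the
// threshold and the message about to be processed would increase that debt.
// owedByPeer: what the peer currently owes us (assumed sign convention).
// delta: change caused by the message; positive increases the peer's debt.
func checkDebt(owedByPeer, delta, threshold int64) error {
	if owedByPeer > threshold && delta > 0 {
		return errDisconnect
	}
	return nil
}

func main() {
	fmt.Println(checkDebt(150, 10, 100))  // over threshold, debt would grow
	fmt.Println(checkDebt(150, -10, 100)) // over threshold, debt would shrink
	fmt.Println(checkDebt(50, 10, 100))   // under threshold
}
```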
Can you give an example of a case where a nonce issue occurs?
Great! Keep us posted 👍
Oh, I see what you mean now. I have asked this question before but have not got a clear answer yet. It essentially boils down to: can SWAP peers serve any purpose even if they are over-indebted to a node?
I will raise this question during tomorrow's meeting. For now, I would go with option 1.
Good question for core, I think.
Seems like a good start to me. I think we need some sort of flowchart or diagram to fully understand and review the logic. I don't mind drawing it as long as I have clear-cut info. |
I don't think it's necessary as long as we handle all the errors within the called go routine. Other protocols I've looked at do it that way too.
Yes, delivered but not necessarily processed. This is about getting confirmation that the cheque was indeed accepted (e.g. honey verification worked, balance update, cheque was saved).
I thought about that too, but as far as I can see there is no alternative. There are factors that can cause a cheque to be rejected which the issuer has no control over, and without an additional message the issuer cannot find out about that. We removed the old
The error will already be handled within the cashing logic, therefore there is no need to handle it here.
Yes, we will need one for sure. But doing this right can be very complicated and requires deeper study of go-ethereum's transaction sending code. (see the outline at the bottom).
Yes. A peer has a last sent cheque, a last received cheque and potentially a pending cheque.
The receiving node will only update the balance once. If the same cheque is received again we resend the confirmation message but without adjusting balances or saving anything. There is no "waiting" because the receiving node cannot know whether the
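A simplified sketch of the "update once, resend confirmation on repeat" behaviour described here. The `Cheque` and `Peer` types, their fields and the sign convention (`owedByPeer` shrinks when a cheque is received) are assumptions for illustration and do not mirror the real swap types.

```go
package main

import "fmt"

// Cheque is a simplified, hypothetical cheque; only the cumulative payout matters here.
type Cheque struct {
	CumulativePayout uint64
}

// Peer holds the per-peer state used in this sketch.
type Peer struct {
	lastReceived *Cheque
	owedByPeer   int64 // assumed convention: what the peer owes us
	confirmsSent int   // counts confirmation messages sent
}

// handleCheque applies a cheque at most once. A repeat of the last received
// cheque only triggers another confirmation, without touching the balance.
func (p *Peer) handleCheque(c Cheque) error {
	var prev uint64
	if p.lastReceived != nil {
		prev = p.lastReceived.CumulativePayout
		if c.CumulativePayout == prev {
			p.confirmsSent++ // resend confirmation, no state change
			return nil
		}
		if c.CumulativePayout < prev {
			return fmt.Errorf("stale cheque: %d < %d", c.CumulativePayout, prev)
		}
	}
	p.owedByPeer -= int64(c.CumulativePayout - prev)
	p.lastReceived = &c
	p.confirmsSent++
	return nil
}

func main() {
	p := &Peer{owedByPeer: 100}
	_ = p.handleCheque(Cheque{CumulativePayout: 40})
	_ = p.handleCheque(Cheque{CumulativePayout: 40}) // duplicate
	fmt.Println(p.owedByPeer, p.confirmsSent)
}
```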
Immediately before setting the pending cheque to
No, there is no balance adjustment when creating a cheque. Only once the
What I mean here is https://github.com/ethersphere/swarm/blob/master/swap/swap.go#L383 where we check the difference in cumulative amount against the expected amount for the given amount of honey.
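The check described here could be sketched roughly like this. The fixed `price` per unit of honey and the function name `verifyAmount` are assumptions made for the example; the linked code is the authoritative version.

```go
package main

import "fmt"

// verifyAmount compares the increase in cumulative amount between two cheques
// with the amount expected for the given honey at a hypothetical fixed price.
// Overflow of honey*price is not handled in this sketch.
func verifyAmount(prevCumulative, newCumulative, honey, price uint64) error {
	if newCumulative <= prevCumulative {
		return fmt.Errorf("cumulative amount did not increase: %d <= %d", newCumulative, prevCumulative)
	}
	actual := newCumulative - prevCumulative
	expected := honey * price
	if actual != expected {
		return fmt.Errorf("unexpected amount: got %d, expected %d", actual, expected)
	}
	return nil
}

func main() {
	fmt.Println(verifyAmount(100, 160, 3, 20)) // 60 == 3*20
	fmt.Println(verifyAmount(100, 150, 3, 20)) // 50 != 60
}
```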
The receiver knows it because the incoming cheque will already be stored as the
Right now it doesn't even disconnect under all such circumstances (e.g. the chunk delivery example described above). But that would be one possible approach; the open question is whether we want that or not.
I think it happens when many cheques are received in rapid succession. Then the transaction api picks the old nonce and there's a conflict. I've seen it several times while running nodes. |
Ralph, thanks for the thorough research and good explanation of the error-handling issue. I think your proposal to introduce a pending cheque and an extra confirm message is neat and I am looking forward to seeing your implementation. @mortelli, could you please update your comment above such that it does not display with a scrolling bar? I think the discussion you are having is valuable, but it is difficult to follow now because of this. After this is fixed, I will have a look again at this conversation and add my comments (after verifying they are not already addressed). Just one thing which I can directly ask now:
What does returning an error mean in this context? I tried to follow the error up, but could not figure out where it exactly ends up. If it means that the node shuts down, I don't think it is a good idea, as in this case, it will be easy for an attacker to shut down the whole network. |
Done.
I had the same problem. Once I figure it out I will make it explicit here. |
Afaik it just doesn't set up this protocol for this peer. It shouldn't affect any other protocol or peer. |
Related info from the last research call:
When we say disconnect do we want to
How should a node behave if it cannot send any more cheques? (without risking that those cheques bounce).
|
Current status on what's still missing:
|
With the upcoming refactoring of the protocols package, where handlers could return custom errors, it seems the strategy is to allow for custom handling of issues before dropping - e.g. allowing other protocols to still interact in some cases. Thus, I don't think we should enforce a
This also will be affected by the redesign of the protocols package so I agree no change for now - regarding returning errors. However, we need to make sure that errors inside the go routine can't get us into a broken state or something. In other words, errors must be handled.
Then we can ignore this
This should be somehow "bookmarked"; if we do something right now, it may change due to the upcoming new pricing design, making the change obsolete. So there should be a low-priority issue for this.
I think no special treatment for malicious attacks
A crashing or terminating node issue should be handled IMHO. It is bad design if this leads to problems. There should be some sanity check, for example when rebooting the node, which recovers the state. I believe this should be analyzed and tackled, even if not with the highest priority. I suggest creating an issue for this.
I think it relates to the previous point. We should always be able to recover the state in an error condition.
I agree, this part will be probably heavily redesigned as we also will probably allow users a flexible cashout strategy. Just make sure no error is "lost".
Ignore for now due to the protocols package redesign.
This may also change due to the protocols package redesign, so hold on. For now though I do suggest a peer drop.
If this happens, do we have consistent balances? If not, that should be handled.
I agree, and if it should, then we should do so. Atomicity is a very important feature in fintech.
We got rid of confirmation messages because they constituted an attack vector and introduced Nash equilibrium states. I agree with @mortelli that if we think of this, we should make sure that it does not re-introduce new problems. My current thinking and suggestion is to contemplate something different: design a secure protocol which is added to the handshake which checks on existing state(s) between the nodes (balances, cheques) and syncs them if needed. But this might be more wishful thinking than actually possible. Needs investigation. I suggest opening an issue for this if one does not exist already.
I believe that the cashing strategy should be completely left to the user. Whatever we take on as "automatic" will just create more burden for us. Maybe we could think of a "default" strategy (cash-in immediately?) which is some minimal implementation. Of course, designing a user-defined cash-in strategy is a lot of work but could lead to cleaner handling. Which means this needs an issue.
Of course. But that is the nature of an incentivized Swarm. If you don't pay, you are not playing by the rules and thus you should be disconnected. However, there might be "free" protocols, but this has not been designed properly and never explicitly documented and requested as a feature.
Well using channels just shifts the burden of handling the error, and makes it probably more complex (loss of context). It is more important that the error is handled. It can even happen in the same routine.
Agree, although in this case the important point is to design the cashing strategy well. Wrapping up:
|
This was already addressed with the pending cheques anyway. After the new pricing we can reexamine whether to drop confirm again or at least relax the conditions under which new cheques are allowed.
I made an issue for this (#2004). An easier solution to this though might be to add batching support to the state store abstraction, then it should be able to support writing to different keys atomically. Anyway, pending cheques address most of these issues already. Nodes will always agree on the same last cheques again, only balances can go out of sync in case of crashes which isn't too big of a problem.
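To illustrate the batching idea, here is a minimal in-memory sketch. `Store`, `Batch` and the key names are hypothetical and are not the existing state store API; a real implementation would presumably delegate to the batch write of the underlying database to also be safe across crashes.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Store is a hypothetical in-memory stand-in for the state store.
type Store struct {
	mu   sync.RWMutex
	data map[string]string
}

// Batch collects writes that must be applied together.
type Batch struct {
	keys   []string
	values []string
}

// NewStore returns an empty store.
func NewStore() *Store { return &Store{data: make(map[string]string)} }

// Put queues a write; nothing reaches the store until Write is called.
func (b *Batch) Put(key, value string) {
	b.keys = append(b.keys, key)
	b.values = append(b.values, value)
}

// Write validates the whole batch first and only then applies it under a
// single lock, so readers see either none or all of the writes.
func (s *Store) Write(b *Batch) error {
	for _, k := range b.keys {
		if k == "" {
			return errors.New("batch rejected: empty key")
		}
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	for i, k := range b.keys {
		s.data[k] = b.values[i]
	}
	return nil
}

// Get reads a single key.
func (s *Store) Get(key string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[key]
	return v, ok
}

func main() {
	s := NewStore()
	b := &Batch{}
	b.Put("balance_peer1", "60")                    // hypothetical key
	b.Put("received_cheque_peer1", "cumulative=40") // hypothetical key
	if err := s.Write(b); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	v, _ := s.Get("balance_peer1")
	fmt.Println("balance:", v)
}
```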
We don't. This is documented in #2003 .
I don't think this can work. A node has every incentive to downplay its debt to the other one. The worst thing another node can do with a confirmation message is intentionally not sending it. In that case we won't send any new cheques (which is good as this peer is clearly malicious).
Why not? If it is clear that it is malicious there is no point in wasting further time with this peer. If it is not malicious, disconnecting immediately seems excessive. (That being said, in this particular case with the pending cheque mechanism, this failing always indicates a malicious attack.) |
Every issue raised here that is still open should have its own issue now. Closing this. |
Currently we have a working "happy-day" code: Everything related to Swap works for the normal operation scenario.
We need to review error cases and unexpected flow situations and handle them.