
Review error and exception handling #1601

Closed
holisticode opened this issue Jul 24, 2019 · 14 comments

@holisticode
Contributor

Currently we have working "happy-day" code: everything related to Swap works in the normal operation scenario.

We need to review error cases and unexpected flow situations and handle them.

@mortelli
Contributor

mortelli commented Jul 30, 2019

Making a note of @zelig's comment on swarm/swap/types.go:53:

This is a mistake I often made and resented.
Try to avoid idle code, and "we may need it in future" complexities.

Originally posted in #1554

diegomasini added a commit that referenced this issue Jul 30, 2019
@Eknir
Contributor

Eknir commented Aug 2, 2019

Related: #1635

holisticode pushed a commit that referenced this issue Aug 9, 2019
holisticode pushed a commit that referenced this issue Aug 15, 2019
holisticode pushed a commit that referenced this issue Aug 20, 2019
holisticode pushed a commit that referenced this issue Aug 26, 2019
holisticode pushed a commit that referenced this issue Aug 27, 2019
@mortelli
Contributor

mortelli commented Sep 6, 2019

Marking this as on hold since we are waiting for:

  1. #1725 (swap: refactor lastReceivedCheque, lastSentCheque, balances to peer) to be finished first
  2. error flow specification and/or diagram to be elaborated

@ralph-pichler
Member

So I looked at our current error handling. To summarize what we have now:

  • On connection establishment, it fails on a failed handshake, failed blockchain access, or if the peer data cannot be loaded. Technically the last point is an I/O failure which, according to #1515 (shut down node when I/O error occurs during accounting), should be treated as fatal. But since no (Swap) connection is established if this fails, it probably doesn't matter here. This seems mostly OK. If we want to enforce a Swap-only network, maybe we should peer.Drop here.

  • In handleMsg we always return nil as the error. There is really no alternative, as the message handlers run as goroutines and we don't want to block. Also, returning an error here would terminate the protocol, which we probably don't want. This also doesn't need changing.

  • handleEmitChequeMsg is where the error handling starts being really broken. For now it either ignores errors or returns them (where they are then ignored because it's a goroutine).

    • it aborts and returns (an ignored) error if verifyChequeProperties (through processAndVerifyCheque) fails. The only way this fails is if the sender sent a malicious or malformed cheque.
    • it aborts and returns (an ignored) error if getPrice on the oracle (through processAndVerifyCheque) fails. This one is a problem, as the recipient of a cheque rejects it but the sender is not at fault. Worse, because we got rid of confirmation messages the nodes will start disagreeing about the last cheque, leading to future problems.
    • it aborts and returns (an ignored) error if verifyChequeAgainstLast fails. This can happen for two reasons:
      • the cumulativePayout is lower: this is clearly a malicious sender. This should be treated the same as with verifyChequeProperties.
      • the expected amount is not equal to the actual amount (difference between the cumulatives): this might be malicious (intentionally sending less than supposed to for the specified honey amount) or not (if a cheque got lost due to the node crashing or terminating at the wrong moment).
    • it aborts if setLastReceivedCheque fails (io error). The next time a cheque is sent the nodes will disagree about the actualAmount, leading to the same problem as in the previous point.
    • same for setting the balance. Here we have the additional problem that we already saved the lastReceivedCheque in the previous step. Those two write ops really should happen in one atomic operation (afaik this is only possible with batch writes in leveldb, which the state store abstraction doesn't provide). However, the probability of io failing in between those two steps is probably quite low, so this might not be the biggest issue.
    • it also aborts if contract.InstanceAt fails, but this is actually impossible as it could only happen on an invalid JSON ABI and the ABI is hardcoded. So we can ignore this one.
    • finally we start cashing the cheque in a goroutine and ignore any errors. We should keep it that way since we want to decouple cashing from getting the cheque anyway.
  • In cashCheque we don't do anything with errors except logging

    • if cashing failed (e.g. reverted, network issues) we log an error
    • if cashing bounced we log an error
    • if the tx remains pending forever (e.g. too low gas price) the goroutine hangs forever
  • In Add we return all kinds of errors that can lead to different outcomes depending on how Add was called.

    • if Add is called because a protocol is sending something, the error will be returned to the point where that message is sent (using protocols.Peer.Send). In most cases (at least in retrieval) this error is just logged and then ignored. Therefore no peer will be disconnected even if Add returns an error (contradicting the documentation in accounting.go). This means that when the disconnect threshold is hit during delivering a chunk, we simply don't deliver the chunk but don't actually disconnect unless we get another message from the peer.
    • if Add is called because a protocol is receiving something, the error will cause Peer.run of that protocol (and no other) to terminate and log the error.
    • if we don't have a record for that enode in swap we return an error. That's ok because we want to terminate / not handle this connection. (That being said, if we never receive protocol messages from that node and only send, the protocol will never be terminated. If we want to enforce a swap-only network, using .Drop here might be useful.)
    • same if we reach the disconnect threshold (as pointed out in #1506 (threshold handling during message processing?), this has the consequence that we don't allow balance-decreasing messages after the threshold has been reached).
    • if updating the balance fails we return the error. According to #1515 (shut down node when I/O error occurs during accounting) we should treat this as fatal. Here we also have the problem that the peers will be out of sync from now on.
    • if above the payment threshold we send a cheque and return whatever that does
      • this errors if the actual creation of the cheque fails. This means that if the cheque cannot be created, the request will not be sent (or the protocol will be terminated in case of receive) even though only the payment threshold was reached and the disconnect threshold might be much higher.
      • with setLastSentCheque and updateBalance there is once again the problem that those two things should really be atomic.
      • finally the actual message is sent. If this fails (or the other peer fails in handleEmitChequeMsg) this also leads to the problem described further above.

Here are my suggestions on how we might fix this:

It is OK if the balances nodes have of each other are not an exact match. It might even out over time, and even if it doesn't, it's fine as long as the discrepancy doesn't get larger than disconnectThreshold - paymentThreshold. What is important, though, is that nodes agree on what the latest cheque was, as the difference of the cumulatives is used as the value that the amount of honey is verified against. For this I propose to reintroduce a confirmation message, although its function would be a bit different from the old one.

In addition to the lastSentCheque we should also store a pendingCheque, which is persisted to the state store like the other one. When a cheque is created it is saved only as the pendingCheque, but the base values for creating cheques (e.g. the latest cumulativePayout) are still taken from the lastSentCheque. No balance adjustment takes place at this time. Instead, the receiving node (in handleEmitCheque) sends a ConfirmChequeMsg containing the cheque after it has saved the cheque and updated the balance in its store. When the cheque-sending node receives the ConfirmChequeMsg, it verifies that it matches the pendingCheque, saves it as the lastSentCheque, updates the balance according to the honey amount and finally sets the pendingCheque to nil. The idea is that once a pendingCheque has been created (is != nil) we won't create any other cheque for that peer until it has been confirmed. Instead, the old pendingCheque is resent if there was no confirmation for a while, or if the connection is established anew (e.g. because the sending node went down or the receiving node disconnected). If the receiving node receives the EmitChequeMsg for the same cheque a second time and it had already successfully processed it, it will send the same ConfirmChequeMsg again. If we get too close to the disconnect threshold while waiting for a cheque to be confirmed, we could block all paid requests for that peer for a while.
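A minimal, self-contained sketch of this flow; the types and method names are illustrative and do not match the actual swap package:

```go
package swap

// Minimal model of the proposed pending-cheque flow. All types and
// method names here are illustrative, not the real swap package API.

// Cheque is reduced to the two fields relevant for this sketch.
type Cheque struct {
	CumulativePayout uint64
	Honey            uint64
}

// Peer keeps the last confirmed cheque, an optional pending cheque and
// the (not yet adjusted) balance.
type Peer struct {
	lastSentCheque *Cheque
	pendingCheque  *Cheque // issued but not yet confirmed; nil if none
	balance        int64
}

type EmitChequeMsg struct{ Cheque *Cheque }
type ConfirmChequeMsg struct{ Cheque *Cheque }

// nextCheque issues a new cheque only if none is pending; otherwise the
// old pendingCheque is resent so both nodes keep the same view.
func (p *Peer) nextCheque(amount, honey uint64) *EmitChequeMsg {
	if p.pendingCheque != nil {
		return &EmitChequeMsg{Cheque: p.pendingCheque} // resend, don't create
	}
	var base uint64
	if p.lastSentCheque != nil {
		base = p.lastSentCheque.CumulativePayout // base values from lastSentCheque
	}
	// Persisted as pendingCheque only; no balance adjustment yet.
	p.pendingCheque = &Cheque{CumulativePayout: base + amount, Honey: honey}
	return &EmitChequeMsg{Cheque: p.pendingCheque}
}

// handleConfirm completes the exchange on the issuing side.
func (p *Peer) handleConfirm(msg *ConfirmChequeMsg) {
	if p.pendingCheque == nil || *msg.Cheque != *p.pendingCheque {
		return // confirmation for an unknown cheque: ignore it
	}
	// Order matters: store the new base cheque, clear the pending one,
	// then adjust the balance by the honey amount (sign convention assumed).
	p.lastSentCheque = msg.Cheque
	p.pendingCheque = nil
	p.balance += int64(msg.Cheque.Honey)
}
```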

This should give us the following properties:

  • If the issuing node fails between setting the pendingCheque and sending it, it doesn't matter, as it will resend the same cheque the next time, and in the meantime we still have the same view on the balance and the last cheque as our counterparty.
  • If the receiving node doesn't accept the cheque it will never confirm, and the balances remain in sync. If the receiving node is malicious this doesn't matter; otherwise this would only happen in case of an io error, network error or oracle access error. In all of these cases updating anything is dangerous, so not doing anything and retrying later seems like the most sensible approach.
  • If the receiving node fails during saving the new balance, it has already saved the cheque. The next time it receives this cheque it will just confirm it. The balances will now diverge, but this should be tolerable as long as the paymentThreshold and disconnectThreshold are far enough apart. More importantly, once the other node has processed the ConfirmChequeMsg, both nodes will agree again on what the latest cheque was and the honey calculation will continue to work.
  • If the receiving node fails during sending the ConfirmChequeMsg, the balances and last cheques will be out of sync. Eventually the issuing node will resend the EmitChequeMsg; the receiving node, having already processed the cheque, will reply again with the ConfirmChequeMsg, and once that is processed both nodes agree again.
  • If the issuing node fails during handleConfirmChequeMsg before setting the pendingCheque to nil, it will resend the pendingCheque at some point, triggering the scenario from the previous point.
  • If the issuing node fails after setting the pendingCheque to nil but before updating the balance, then both nodes still agree on what the last cheque was. The balance will now diverge slightly, but once again this is tolerable as long as it is a rare event.
  • Once the issuing node has updated the balance (or crashed while doing so), the cheque sending is complete and new cheques can be issued.

handleEmitCheque should never return an error, as it cannot be handled elsewhere (a sketch of the resulting policy follows this list):

  • if verifyChequeProperties fails we should log the error and drop the peer from all protocols using .Drop (or something similar to drop accounted protocols, which would require changes to p2p), as this is clearly malicious behaviour.
  • the same applies for cheques with a lower cumulativePayout than the previous one.
  • in case of the oracle price call failing, we should just log it and return. The oracle will be tried again when the EmitChequeMsg is resent.
  • if the expected amount is not equal to the actual amount, we should probably also disconnect, since if both nodes agree on the last cheque this difference can only come from the honey amount not matching the cheque amount in value. The alternative would be to send the price as meta-information instead of the honey amount. That would require some changes to the oracle interface.
  • finally it should send the cheque to some cashout routine
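A compact way to express this policy; the error and action names below are illustrative, not the real swap code:

```go
package swap

// Sketch of the error policy proposed above for handleEmitChequeMsg:
// every failure is classified into an action handled locally, so the
// handler itself never returns an error. All names are illustrative.

type chequeError int

const (
	errInvalidCheque   chequeError = iota // verifyChequeProperties failed
	errLowerCumulative                    // cumulativePayout lower than last
	errAmountMismatch                     // expected != actual amount
	errOracleFailed                       // getPrice on the oracle failed
	errStoreIO                            // state store write failed
)

type action int

const (
	dropPeer     action = iota // malicious: drop from all protocols
	retryLater                 // log only; wait for the EmitChequeMsg resend
	shutdownNode               // fatal I/O per #1515 (or drop the peer)
)

func policy(e chequeError) action {
	switch e {
	case errInvalidCheque, errLowerCumulative, errAmountMismatch:
		return dropPeer
	case errOracleFailed:
		return retryLater
	case errStoreIO:
		return shutdownNode
	default:
		return retryLater
	}
}
```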

On all of the state store io errors we should either drop the peer (using .Drop like some other protocols do) or shut down entirely. (Presumably using Swarm.Stop, although I'm not sure that's the right way to do it.)

In Add:

  • We should just drop the peer (at least for accounted protocols) if it's not registered with swap. This would also prevent an issue experienced during the devcon workshop where nodes tried to retrieve chunks from non-swap nodes (which always returns an error in that case).
  • As already proposed in #1506 (threshold handling during message processing?), we should only do the threshold check if our balance with the counterparty increases, but always allow messages which decrease it (see the sketch after this list). That way a node can at least still offer services even if above the threshold (e.g. because a cheque hasn't been confirmed yet because of an unreachable oracle). However, this raises the question under what circumstances we actually disconnect peers if they don't pay (and disconnect them from what? only paid protocols or everything?).
  • If cheque creation fails we should not return an error in Add (except perhaps when we are already close to the disconnect threshold). Instead just log and try again next time (if pendingCheque is not set).
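A minimal sketch of this Add behaviour, under an assumed sign convention (a positive amount increases what the peer owes us); all names are illustrative:

```go
package swap

import "errors"

// Sketch of the proposed Add behaviour. Sign convention assumed here:
// a positive amount increases the peer's debt to us.

var ErrDisconnectThreshold = errors.New("peer exceeded disconnect threshold")

type peerBalance struct {
	balance             int64
	disconnectThreshold int64
}

func (p *peerBalance) add(amount int64) error {
	// Messages that decrease the peer's debt are always allowed, so a
	// peer above the threshold can still earn its balance back.
	if amount > 0 && p.balance+amount > p.disconnectThreshold {
		return ErrDisconnectThreshold
	}
	p.balance += amount
	// A cheque-creation failure here would be logged and retried on the
	// next call rather than returned (unless close to the threshold).
	return nil
}
```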

Cashout (outline; this is not thought through yet; a sketch follows the list):

  • Ideally we should never have multiple cashout operations going on at the same time (e.g. starting another one before the previous one has finished), to avoid nonce issues (which already happen now from time to time) and similar things.
  • When a new cheque comes in we should compare it to the last cashed cheque for this peer (which we should also store, to avoid having to do blockchain lookups at this point); based on the difference we can decide whether we want to trigger a cashout transaction. If we decide to cash it, it should go into some pending-cashouts queue (this queue could also be persisted; alternatively, we can recheck all cheques on startup). If there is already a cheque for a particular chequebook in the queue, it could be replaced instead of having it in there twice.
  • Some other background go routine takes cheques from the queue and cashes them. In a loop it would:
    • Take a cheque from the queue (or wait for one)
    • Send the cashout transaction
    • If the sending fails (as in it was never sent to the eth node or got rejected there), log an error, put it back into the queue and retry later.
    • Save this transaction as the current transaction
    • Save the cheque as cashed (1)
    • Wait for the outcome (2)
    • If the cashing reverted, log a critical error and treat the cheque as cashed. This would indicate a fatal flaw either in the Go code base or the smart contract, so there is little point in retrying.
    • If the cashing bounced, drop the peer (ideally blacklist it, if there is such a functionality in the p2p layer).
    • If the cashing was successful don't do anything
    • Set the current transaction to nil
  • If the node restarts and we have a current transaction, start the loop at (1)
  • (2) might never terminate so we would want a timeout there. In that case we might want to mark the cheque as uncashed again or do nothing (will have to check how the go bindings handle nonce and pending transactions).
  • This doesn't handle reorganisations. Doing this and nonce handling properly will probably have to be much more complex and requires in-depth study on how the go bindings work.
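A sketch of that loop under stated assumptions; the queue, outcome values and callbacks are all illustrative, not the real swap code:

```go
package swap

import (
	"log"
	"time"
)

// Sketch of the cashout loop outlined above.

type outcome int

const (
	success  outcome = iota // cheque cashed
	reverted                // tx mined but reverted
	bounced                 // chequebook had insufficient funds
	timedOut                // tx still pending after our timeout
)

type cashoutJob struct{ serial uint64 } // stands in for a cheque

func cashoutLoop(queue chan *cashoutJob,
	send func(*cashoutJob) error,
	wait func(*cashoutJob) outcome,
	dropPeer func()) {
	for job := range queue { // take a cheque from the queue (or wait)
		if err := send(job); err != nil {
			// Never reached the eth node or was rejected there:
			// log it, put it back into the queue and retry later.
			log.Printf("cashout send failed, retrying later: %v", err)
			go func(j *cashoutJob) { time.Sleep(time.Minute); queue <- j }(job)
			continue
		}
		// (1) the tx and the cheque would be saved as current/cashed here;
		// (2) then we wait for the outcome, with a timeout.
		switch wait(job) {
		case reverted:
			// Indicates a fatal flaw in code or contract: don't retry.
			log.Printf("CRITICAL: cashout reverted for cheque %d", job.serial)
		case bounced:
			dropPeer() // ideally also blacklist, if the p2p layer allows it
		case timedOut:
			log.Printf("cashout tx still pending for cheque %d", job.serial)
		case success:
			// nothing to do
		}
		// The current transaction would be set to nil here.
	}
}
```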

I have some PoC code for the changes to handleEmitChequeMsg, ConfirmChequeMsg and Add, but tests are still failing. Will continue to work on that next week.

Some open questions I have:

  • When do we really want to disconnect peers and when do we just not want to answer paid requests (in the latter case is it really a "disconnect" threshold?)?
  • When we say disconnect do we want to 1) disconnect protocols individually when they trigger Add, 2) disconnect all protocols (using .Drop), 3) disconnect only paid protocols?
  • Is there any problem with the kademlia structure if we disconnect peers that don't pay?

Any comments on these or the suggestions described above are welcome.

@mortelli
Contributor

mortelli commented Oct 14, 2019

Thanks for a very in-depth analysis @ralph-pichler.

Adding my replies as I read:

On connection establishment, it fails on a failed handshake, failed blockchain access, or if the peer data cannot be loaded. Technically the last point is an I/O failure which, according to #1515, should be treated as fatal. But since no (Swap) connection is established if this fails, it probably doesn't matter here. This seems mostly OK. If we want to enforce a Swap-only network, maybe we should peer.Drop here.

Per the issue you are referring to, at some point we should be adding code to shut down the node if an I/O error occurs (after the handshake, but during an accounting function).

Depending on how we implement this, it might affect connection establishment in a way that reflects this behavior. If not, I would leave the handshake as it is.

(in summary: I agree)

In handleMsg we always return nil as the error. There is really no alternative, as the message handlers run as goroutines and we don't want to block. Also, returning an error here would terminate the protocol, which we probably don't want. This also doesn't need changing.

Blocking is definitely not an option, so that part is OK.

However: couldn't we return routine errors through channels? We would possibly be dealing with them while other goroutines are executing, but this is probably preferable to no error handling at all.
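For illustration, a minimal sketch of that idea; handleMsg and process here are hypothetical stand-ins, not the real swap handlers:

```go
package main

import (
	"fmt"
	"time"
)

// Sketch of the channel idea: the handler still runs as a goroutine,
// but its error is sent back on a channel instead of being dropped.

func handleMsg(errc chan<- error) {
	go func() {
		// ... message handling would happen here ...
		if err := process(); err != nil {
			errc <- err // surfaced instead of silently ignored
		}
	}()
}

func process() error { return fmt.Errorf("simulated handler failure") }

func main() {
	errc := make(chan error, 1)
	handleMsg(errc)
	select {
	case err := <-errc:
		fmt.Println("handled asynchronously:", err)
	case <-time.After(time.Second):
		fmt.Println("no error from handler")
	}
}
```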

handleEmitChequeMsg is where the error handling starts being really broken. For now it either ignores errors or returns them (where they are then ignored because it's a goroutine).

it aborts and returns (an ignored) error if verifyChequeProperties (through processAndVerifyCheque) fails. The only way this fails is if the sender sent a malicious or malformed cheque.

It's ignored because it's inside the handleMsg call, right? All the more reason to make an error channel here.

it aborts and returns (an ignored) error if getPrice on the oracle (through processAndVerifyCheque) fails. This one is a problem, as the recipient of a cheque rejects it but the sender is not at fault. Worse, because we got rid of confirmation messages the nodes will start disagreeing about the last cheque, leading to future problems.

I think we got rid of the confirmation message because we didn't need it (we assumed messages were delivered if the function didn't panic).

We can add it back though. I would like to think about how it can solve this problem without adding additional ones.

finally we start cashing the cheque in a goroutine and ignore any errors. We should keep it that way since we want to decouple cashing from getting the cheque anyway.

Here I would like to reiterate the idea of not ignoring errors while still keeping the goroutines, if possible.

In cashCheque we don't do anything with errors except logging

if cashing failed (e.g. reverted, network issues) we log an error
if cashing bounced we log an error

To add to this: we should not log and then return an error. We should log errors and handle them in the same place.

if the tx remains pending forever (e.g. too low gas price) the goroutine hangs forever

What about implementing our own timeout? It might be tough to decide on a value, but still might be useful.
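A minimal sketch of such a timeout, assuming the cashout uses go-ethereum's bind.WaitMined to wait for the receipt; the 10-minute deadline is a placeholder:

```go
package swap

import (
	"context"
	"time"

	"github.com/ethereum/go-ethereum/accounts/abi/bind"
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/ethclient"
)

// bind.WaitMined respects context cancellation, so a deadline stops the
// goroutine from hanging forever on a stuck transaction.
func waitMinedWithTimeout(client *ethclient.Client, tx *types.Transaction) (*types.Receipt, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()
	// Returns ctx.Err() if the tx is still pending when the deadline
	// hits; the caller can then requeue or mark the cheque uncashed.
	return bind.WaitMined(ctx, client, tx)
}
```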

In Add we return all kinds of errors that can lead to different outcomes depending on how Add was called.

if Add is called because a protocol is sending something, the error will be returned to the point where that message is sent (using protocols.Peer.Send). In most cases (at least in retrieval) this error is just logged and then ignored. Therefore no peer will be disconnected even if Add returns an error (contradicting the documentation in accounting.go). This means that when the disconnect threshold is hit during delivering a chunk, we simply don't deliver the chunk but don't actually disconnect unless we get another message from the peer.
if Add is called because a protocol is receiving something, the error will cause Peer.run of that protocol (and no other) to terminate and log the error.

I'm trying to verify this, but I'm not sure where the code goes after Peer.Run (from protocol.go). It's probably hidden behind an abstraction that I haven't figured out yet.

same if we reach the disconnect threshold (as pointed out in #1506, this has the consequence that we don't allow balance-decreasing messages after the threshold has been reached).

I think it's time we tackled this one. But it's probably more of an edge case than it feels like.

if updating the balance fails we return the error. According to #1515 we should treat this as fatal. Here we also have the problem that the peers will be out of sync from now on.

A difference in balances (being out of sync) isn't necessarily a serious problem, and if it is, it's better to drop the peer.

with setLastSentCheque and updateBalance there is once again the problem that those two things should really be atomic.

Agreed.

Here are my suggestions on how we might fix this:

It is OK if the balances nodes have of each other are not an exact match. It might even out over time, and even if it doesn't, it's fine as long as the discrepancy doesn't get larger than disconnectThreshold - paymentThreshold. What is important, though, is that nodes agree on what the latest cheque was, as the difference of the cumulatives is used as the value that the amount of honey is verified against. For this I propose to reintroduce a confirmation message, although its function would be a bit different from the old one.

In addition to the lastSentCheque we should also store a pendingCheque, which is persisted to the state store like the other one. When a cheque is created it is saved only as the pendingCheque, but the base values for creating cheques (e.g. the latest cumulativePayout) are still taken from the lastSentCheque. No balance adjustment takes place at this time. Instead, the receiving node (in handleEmitCheque) sends a ConfirmChequeMsg containing the cheque after it has saved the cheque and updated the balance in its store. When the cheque-sending node receives the ConfirmChequeMsg, it verifies that it matches the pendingCheque, saves it as the lastSentCheque, updates the balance according to the honey amount and finally sets the pendingCheque to nil. The idea is that once a pendingCheque has been created (is != nil) we won't create any other cheque for that peer until it has been confirmed. Instead, the old pendingCheque is resent if there was no confirmation for a while, or if the connection is established anew (e.g. because the sending node went down or the receiving node disconnected). If the receiving node receives the EmitChequeMsg for the same cheque a second time and it had already successfully processed it, it will send the same ConfirmChequeMsg again. If we get too close to the disconnect threshold while waiting for a cheque to be confirmed, we could block all paid requests for that peer for a while.

I am assuming here that all of this is per peer—meaning, we have either nil or a single cheque as pending per peer.

In principle I like this idea, but I must ask: what happens if the node that receives the cheque updates its balance, but the ConfirmChequeMsg doesn't make its way to the other node? Do we wait for it to re-send the cheque?

If so: wouldn't this cause the receiving node to update the balance twice? 🤔

Also: it's not entirely clear to me based on this paragraph at exactly which times we register the lastSentCheque.

This should give us the following properties:

If the issuing node fails between setting the pendingCheque and sending it, it doesn't matter as it will resend the same check the next time and in the meantime we still have the same view on the balance and the last cheque as our counterparty.

To make sure I understand this correctly: if a pendingCheque isn't confirmed, a node won't issue any more cheques for that peer. Instead, it will re-issue its last cheque after processing another message in its counterparty's favour. However... will it still update its balance while doing so?

If the receiving node fails during saving the new balance, it has already saved the cheque. The next time it receives this cheque it will just confirm it. The balances will now diverge, but this should be tolerable as long as the paymentThreshold and disconnectThreshold are far enough apart. More importantly, once the other node has processed the ConfirmChequeMsg, both nodes will agree again on what the latest cheque was and the honey calculation will continue to work.

Just in case: can you please refer to the line this honey calculation is on?

If the receiving node fails during sending the ConfirmChequeMsg, the balances and last cheques will be out of sync. Eventually the issuing node will resend the EmitChequeMsg; the receiving node, having already processed the cheque, will reply again with the ConfirmChequeMsg, and once that is processed both nodes agree again.

OK, so this sort of addresses my previous question. But how does a node know it has already processed a cheque before?

As already proposed in #1506, we should only do the threshold check if our balance with the counterparty increases, but always allow messages which decrease it. That way a node can at least still offer services even if above the threshold (e.g. because a cheque hasn't been confirmed yet because of an unreachable oracle). However, this raises the question under what circumstances we actually disconnect peers if they don't pay (and disconnect them from what? only paid protocols or everything?).

A node drops a peer when the balance relative to it is over the threshold and the node is about to process a message which would increase the debt the peer owes to it. Right?

Ideally we should never have multiple cashout operations going on at the same time (e.g. starting another one before the previous one has finished), to avoid nonce issues (which already happen now from time to time) and similar things.

Can you give an example of a case where a nonce issue occurs?

I have some PoC code for the changes to handleEmitChequeMsg, ConfirmChequeMsg and Add, but tests are still failing. Will continue to work on that next week.

Great! Keep us posted 👍

Some open questions I have:

When do we really want to disconnect peers and when do we just not want to answer paid requests (in the latter case is it really a "disconnect" threshold?)?

Oh, I see what you mean now. I have asked this question before but have not got a clear answer yet.

It essentially boils down to: can SWAP peers serve any purpose even if they are over-indebted to a node?

When we say disconnect do we want to 1) disconnect protocols individually when they trigger Add, 2) disconnect all protocols (using .Drop), 3) disconnect only paid protocols?

I will raise this question during tomorrow's meeting. For now, I would go with option 1.

Is there any problem with the kademlia structure if we disconnect peers that don't pay?

Good question for core, I think.

Any comments on these or the suggestions described above are welcome.

Seems like a good start to me.

I think we need some sort of flowchart or diagram to fully understand and review the logic.

I don't mind drawing it as long as I have clear-cut info.

@ralph-pichler
Member

However: couldn't we return routine errors through channels? We would possibly be dealing with them while other goroutines are executing, but this is probably preferable to no error handling at all.
It's ignored because it's inside the handleMsg call, right? All the more reason to make an error channel here.

I don't think it's necessary as long as we handle all the errors within the called goroutine. Other protocols I've looked at do it that way too.

I think we got rid of the confirmation message because we didn't need it (we assumed messages were delivered if the function didn't panic).

Yes, delivered but not necessarily processed. This is about getting confirmation that the cheque was indeed accepted (e.g. honey verification worked, balance update, cheque was saved).

We can add it back though. I would like to think about how it can solve this problem without adding additional ones.

I thought about that too, but as far as I can see there is no alternative. There are factors leading to the rejection of a cheque that the issuer has no control over, and without an additional message the issuer cannot find out about them. We removed the old ConfirmationMsg because it was useless; this one serves an actual purpose.

  * finally we start cashing the cheque in a goroutine and ignore any errors. We should keep it that way since we want to decouple cashing from getting the cheque anyway.

Here I would like to reiterate the idea of not ignoring errors while still keeping the goroutines, if possible.

The error will already be handled within the cashing logic, therefore there is no need to handle it here.

  * if the tx remains pending forever (e.g. too low gas price) the goroutine hangs forever

What about implementing our own timeout? It might be tough to decide on a value, but still might be useful.

Yes, we will need one for sure. But doing this right can be very complicated and requires deeper study of go-ethereum's transaction sending code. (see the outline at the bottom).

I am assuming here that all of this is per peer—meaning, we have either nil or a single cheque as pending per peer.

Yes. A peer has a last sent cheque, a last received cheque and potentially a pending cheque.

In principle I like this idea, but I must ask: what happens if the node that receives the cheque updates its balance, but the ConfirmChequeMsg doesn't make its way to the other node? Do we wait for it to re-send the cheque?

If so: wouldn't this cause the receiving node to update the balance twice? 🤔

The receiving node will only update the balance once. If the same cheque is received again we resend the confirmation message, but without adjusting balances or saving anything. There is no "waiting" because the receiving node cannot know whether the ConfirmChequeMsg was accepted.

Also: it's not entirely clear to me based on this paragraph at exactly which times we register the lastSentCheque.

Immediately before setting the pending cheque to nil. Guess I forgot to mention that.

To make sure I understand this correctly: if a pendingCheque isn't confirmed, a node won't issue any more cheques for that peer. Instead, it will re-issue its last cheque after processing another message in its counterparty's favour. However... will it still update its balance while doing so?

No, there is no balance adjustment when creating a cheque. Only once the ConfirmChequeMsg is received is the balance adjusted. (This does mean that the balance remains above the payment threshold while waiting, but this won't matter if we don't send cheques while there is a pending cheque. Regular accounting continues during that time.)

* If the receiving node fails during saving the new balance, it has already saved the cheque. The next time it receives this cheque it will just confirm it. The balances will now diverge, but this should be tolerable as long as the `paymentThreshold` and `disconnectThreshold` are far enough apart. More importantly, once the other node has processed the `ConfirmChequeMsg` both nodes will agree again on what the latest cheque was and the honey calculation will continue to work.

Just in case: can you please refer to the line this honey calculation is on?

What I mean here is https://github.com/ethersphere/swarm/blob/master/swap/swap.go#L383 where we check the difference in cumulative amount against the expected amount for the given amount of honey.
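In other words, a simplified sketch of that check; the function signature and the flat honey-to-amount conversion are illustrative, not the real code:

```go
package swap

import "errors"

// Sketch of the referenced check: the difference of the cumulative
// payouts must equal the expected amount for the transferred honey.
func verifyChequeAgainstLast(lastCumulative, newCumulative, honey, price uint64) error {
	if newCumulative <= lastCumulative {
		return errors.New("cumulative payout did not increase") // clearly malicious
	}
	actual := newCumulative - lastCumulative
	expected := honey * price // what the oracle says the honey is worth
	if actual != expected {
		return errors.New("cheque amount does not match honey amount")
	}
	return nil
}
```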

* If the receiving node fails during sending the `ConfirmChequeMsg`, the balances and last cheques will be out of sync. Eventually the issuing node will resend the `EmitChequeMsg`; the receiving node, having already processed the cheque, will reply again with the `ConfirmChequeMsg`, and once that is processed both nodes agree again.

OK, so this sort of addresses my previous question. But how does a node know it has already processed a cheque before?

The receiver knows it because the incoming cheque will already be stored as the lastReceivedCheque and the issuer knows it because the pendingCheque is nil (also it will already be stored as the lastSentCheque).
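A minimal sketch of these two duplicate checks (names illustrative):

```go
package swap

type Cheque struct{ CumulativePayout uint64 }

// Receiver side: an incoming cheque equal to the stored
// lastReceivedCheque was already processed, so only the
// ConfirmChequeMsg needs to be resent.
func receiverAlreadyProcessed(incoming, lastReceived *Cheque) bool {
	return lastReceived != nil && *incoming == *lastReceived
}

// Issuer side: a confirmation is a duplicate if nothing is pending and
// the confirmed cheque is already the stored lastSentCheque.
func issuerAlreadyProcessed(confirmed, pending, lastSent *Cheque) bool {
	return pending == nil && lastSent != nil && *confirmed == *lastSent
}
```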

A node drops a peer when the balance relative to it is over the threshold and the node is about to process a message which would increase the debt the peer owes to it. Right?

Right now it doesn't even disconnect under all such circumstances (e.g. the chunk delivery example described above). But that would be one possible approach; the open question is whether we want that or not.

Can you give an example of a case where a nonce issue occurs?

I think it happens when many cheques are received in rapid succession. The transaction API then picks the old nonce and there's a conflict. I've seen it several times while running nodes.

@Eknir
Contributor

Eknir commented Oct 17, 2019

Ralph, thanks for the thorough research and good explanation of the error-handling issue. I think your proposal to introduce a pending cheque and an extra confirm message is neat, and I am looking forward to seeing your implementation.

@mortelli, could you please update your comment above so that it does not display with a scroll bar? I think the discussion you are having is valuable, but it is difficult to follow now because of this.

After this is fixed, I will have a look again at this conversation and add my comments (after verifying they are not already addressed). Just one thing which I can directly ask now:

On connection establishement it fails on a failed handshake, failed blockchain access or if the peer data cannot be loaded. Technically the last point is an io failure which according to #1515 should be treated as fatal. But since no (Swap) connection is established if this fails it probably doesn't matter here. This seems mostly ok. If we want to enforce a Swap-only network maybe we should peer.Drop here.

What does returning an error mean in this context? I tried to follow the error up, but could not figure out where exactly it ends up. If it means that the node shuts down, I don't think it is a good idea, as in this case it would be easy for an attacker to shut down the whole network.

@mortelli
Contributor

mortelli commented Oct 17, 2019

@mortelli, could you please update your comment above so that it does not display with a scroll bar? I think the discussion you are having is valuable, but it is difficult to follow now because of this.

Done.

What does returning an error mean in this context? I tried to follow the error up, but could not figure out where exactly it ends up. If it means that the node shuts down, I don't think it is a good idea, as in this case it would be easy for an attacker to shut down the whole network.

I had the same problem. Once I figure it out I will make it explicit here.

@ralph-pichler
Member

What does returning an error mean in this context? I tried to follow the error up, but could not figure out where exactly it ends up. If it means that the node shuts down, I don't think it is a good idea, as in this case it would be easy for an attacker to shut down the whole network.

Afaik it just doesn't set up this protocol for this peer. It shouldn't affect any other protocol or peer.

@ralph-pichler
Member

Related info from the last research call:
Bad SWAP peers

  • Is it desirable to drop them? is there any use for keeping these peers around?
  • A: dropping shouldn’t happen if everyone behaves nicely, but is an adequate response to a node not paying up.

When we say disconnect, do we want to

  • disconnect protocols individually when they hit accounting,
  • disconnect all protocols (using .Drop), or
  • disconnect only paid protocols?
  • A: drop from all protocols.

If dropped: should they be blacklisted?

  • A: yes.

Is there some existing mechanism for this?

  • A: check with core, but probably not.

How should a node behave if it cannot send any more cheques? (without risking that those cheques bounce).
Proposed answers:

  • A) (more crude) shut down the node. A: not this one, as it can earn back the balance.

  • B) (less crude) stop the SWAP process. A: a quick-and-dirty alternative would be shutting down the HTTP interface.

  • C) (more elegant) low-balance "mode": stop any action that would incur costs that would make the node cross the payment threshold with its peers. → SWAP runs normally until the payment threshold comes close for one peer, in which case actions that would cause accounting to be called with a negative amount are no longer allowed, until the balance is sufficiently restored (by reciprocal service). Accounting would return to normal after a deposit.

    • A: this is the better option: do not consume resources, only provide them.
    • note: being over the payment threshold should not be a problem. being over the disconnect threshold is.
      • A: nodes should not willfully go over the payment threshold, it’s a buffer to deal with balance discrepancies.

@ralph-pichler
Member

ralph-pichler commented Nov 19, 2019

Current status on what's still missing:

@holisticode
Contributor Author

On connection establishment, it fails on a failed handshake, failed blockchain access, or if the peer data cannot be loaded. Technically the last point is an I/O failure which, according to #1515, should be treated as fatal. But since no (Swap) connection is established if this fails, it probably doesn't matter here. This seems mostly OK. If we want to enforce a Swap-only network, maybe we should peer.Drop here.

With the upcoming refactoring of the protocols package, where handlers could return custom errors, it seems the strategy is to allow for custom handling of issues before dropping - e.g. allowing other protocols to still interact in some cases. Thus, I don't think we should enforce a Drop right here, so agreed: keep it as-is for now.

In handleMsg we always return nil as the error. There is really no alternative, as the message handlers run as goroutines and we don't want to block. Also, returning an error here would terminate the protocol, which we probably don't want. This also doesn't need changing.

This will also be affected by the redesign of the protocols package, so I agree: no change for now regarding returning errors. However, we need to make sure that errors inside the goroutine can't get us into a broken state. In other words, errors must be handled.

it aborts and returns (an ignored) error if verifyChequeProperties (through processAndVerifyCheque) fails. The only way this fails is if the sender sent a malicious or malformed cheque.

Then we can ignore this.

it aborts and returns (an ignored) error if getPrice on the oracle (through processAndVerifyCheque) fails. This one is a problem, as the recipient of a cheque rejects it but the sender is not at fault. Worse, because we got rid of confirmation messages the nodes will start disagreeing about the last cheque, leading to future problems.

This should be "bookmarked" somehow; if we do something right now, the upcoming pricing design may make the change obsolete. So there should be a low-priority issue for this.

it aborts and returns (an ignored) error if verifyChequeAgainstLast fails. This can happen for two reasons:
the cumulativePayout is lower: this is clearly a malicious sender. This should be treated the same as with verifyChequeProperties.

I think no special treatment is needed for malicious attacks.

the expected amount is not equal to the actual amount (difference between the cumulatives): this might be malicious (intentionally sending less than supposed to for the specified honey amount) or not (if a cheque got lost due to the node crashing or terminating at the wrong moment).

A crashing or terminating node issue should be handled, IMHO. It is bad design if this leads to problems. There should be some sanity check, for example when rebooting the node, which recovers the state. I believe this should be analyzed and tackled, even if not with the highest priority. I suggest creating an issue for this.

it aborts if setLastReceivedCheque fails (io error). The next time a cheque is sent the nodes will disagree about the actualAmount, leading to the same problem as in the previous point.
same for setting the balance. Here we have the additional problem that we already saved the lastReceivedCheque in the previous step. Those two write ops really should happen in one atomic operation (afaik this is only possible with batch writes in leveldb, which the state store abstraction doesn't provide). However, the probability of io failing in between those two steps is probably quite low, so this might not be the biggest issue.

I think it relates to the previous point. We should always be able to recover the state in an error condition.

finally we start cashing the cheque in a goroutine and ignore any errors. We should keep it that way since we want to decouple cashing from getting the cheque anyway.

I agree; this part will probably be heavily redesigned, as we will probably also allow users a flexible cashout strategy. Just make sure no error is "lost".

if the tx remains pending forever (e.g. too low gas price) the goroutine hangs forever

Ideally there should be some timeout here. But as we are going to redesign it, as said, I vote to ignore it for now.

if Add is called because a protocol is sending something, the error will be returned to the point where that message is sent (using protocols.Peer.Send). In most cases (at least in retrieval) this error is just logged and then ignored. Therefore no peer will be disconnected even if Add returns an error (contradicting the documentation in accounting.go). This means that when the disconnect threshold is hit during delivering a chunk, we simply don't deliver the chunk but don't actually disconnect unless we get another message from the peer.

Ignore for now due to the protocols package redesign.

if we don't have a record for that enode in swap we return an error. That's ok because we want to terminate / not handle this connection. (That being said, if we never receive protocol messages from that node and only send, the protocol will never be terminated. If we want to enforce a swap-only network, using .Drop here might be useful.)

This may also change due to the protocols package redesign, so hold on. For now, though, I do suggest a peer drop.

this errors if the actual creation of the cheque fails. This means that if the cheque cannot be created, the request will not be sent (or the protocol will be terminated in case of receive) even though only the payment threshold was reached and the disconnect threshold might be much higher.

If this happens, do we have consistent balances? If not, that should be handled.

with setLastSentCheque and updateBalance there is once again the problem that those two things should really be atomic.

This relates to the comment above about consistent states. I agree, and if they should be atomic, then we should make them so. Atomicity is a very important feature in fintech.

Finally the actual message is sent. If this fails (or the other peer fails in handleEmitChequeMsg) this also leads to the problem described further above.

For this I propose to reintroduce a confirmation message, although its function would be a bit different from the old one.

We got rid of confirmation messages because they constituted an attack vector and introduced Nash equilibrium states. I agree with @mortelli that if we consider this, we should make sure it does not introduce new problems. My current thinking and suggestion is to contemplate something different: design a secure protocol, added to the handshake, which checks the existing state between the nodes (balances, cheques) and syncs it if needed. But this might be more wishful thinking than actually possible. It needs investigation. I suggest opening an issue for this if one does not exist already.

When a new cheque comes in we should compare it to the last cashed cheque for this peer (which we should also store, to avoid having to do blockchain lookups at this point); based on the difference we can decide whether we want to trigger a cashout transaction. If we decide to cash it, it should go into some pending-cashouts queue (this queue could also be persisted; alternatively, we can recheck all cheques on startup). If there is already a cheque for a particular chequebook in the queue, it could be replaced instead of having it in there twice.

I believe that the cashing strategy should be completely left to the user. Whatever we take on as "automatic" will just create more burden for us. Maybe we could think of a "default" strategy (cash in immediately?) as some minimal implementation. Of course, designing a user-defined cash-in strategy is a lot of work, but it could lead to cleaner handling. Which means this needs an issue.

Is there any problem with the kademlia structure if we disconnect peers that don't pay?

Of course. But that is the nature of an incentivized Swarm: if you don't pay, you are not playing by the rules and thus you should be disconnected. However, there might be "free" protocols, but these have never been properly designed, explicitly documented, or requested as a feature.

@mortelli

However: couldn't we return routine errors through channels? We would possibly be dealing with them while other goroutines are executing, but this is probably preferable to no error handling at all.

Well, using channels just shifts the burden of handling the error, and probably makes it more complex (loss of context). It is more important that the error is handled; that can even happen in the same routine.

Here I would like to reiterate the idea of not ignoring errors while still keeping the goroutines, if possible.

Agree, although in this case the important point is to design the cashing strategy well.

Wrapping up:

  • All errors should be handled. No errors should be "lost" or "shadowed".
  • We should always strive for a consistent state. Where that is not possible, we need to think about recovery.
  • For every problem described here that we are going to handle, there should be a separate issue. This ticket has become too big and impossible to manage; we should close it.

@ralph-pichler
Member

This should be "bookmarked" somehow; if we do something right now, the upcoming pricing design may make the change obsolete. So there should be a low-priority issue for this.

This was already addressed with the pending cheques anyway. After the new pricing we can re-examine whether to drop the confirmation message again, or at least relax the conditions under which new cheques are allowed.

A crashing or terminating node issue should be handled, IMHO. It is bad design if this leads to problems. There should be some sanity check, for example when rebooting the node, which recovers the state. I believe this should be analyzed and tackled, even if not with the highest priority. I suggest creating an issue for this.

I made an issue for this (#2004). An easier solution might be to add batching support to the state store abstraction; then it should be able to write to different keys atomically. Anyway, pending cheques address most of these issues already. Nodes will always agree on the same last cheques again; only balances can go out of sync in case of crashes, which isn't too big of a problem.
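For reference, this is roughly what that batching would look like directly against goleveldb, which backs the state store; the keys and values are placeholders:

```go
package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
)

// A goleveldb Batch applies multiple writes atomically, so the cheque
// and the balance could be persisted in one operation if the state
// store abstraction exposed this.
func main() {
	db, err := leveldb.OpenFile("/tmp/statestore-demo", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	batch := new(leveldb.Batch)
	batch.Put([]byte("swap-cheque-<peer>"), []byte("<encoded cheque>"))
	batch.Put([]byte("swap-balance-<peer>"), []byte("<encoded balance>"))

	// Either both keys persist or neither does.
	if err := db.Write(batch, nil); err != nil {
		log.Fatal(err)
	}
}
```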

If this happens, do we have consistent balances? If not, that should be handled.

We don't. This is documented in #2003.

design a secure protocol, added to the handshake, which checks the existing state between the nodes (balances, cheques) and syncs it if needed. But this might be more wishful thinking than actually possible. It needs investigation. I suggest opening an issue for this if one does not exist already.

I don't think this can work. A node has every incentive to downplay its debt to the other one. The worst thing another node can do with a confirmation message is intentionally not sending it. In that case we won't send any new cheques (which is good as this peer is clearly malicious).

I think no special treatment is needed for malicious attacks.

Why not? If it is clear that the behaviour is malicious, there is no point in wasting further time with this peer. If it is not malicious, disconnecting immediately seems excessive. (That being said, in this particular case, with the pending cheque mechanism, this failure always indicates a malicious attack.)

@ralph-pichler
Member

Every issue raised here that is still open should have its own issue now. Closing this.
