[Docs] upgrade/chain halt recovery #837

Open · wants to merge 41 commits into base: `main`

Commits (41)
bd12a8b
--wip-- [skip ci]
okdas Sep 24, 2024
4ea092e
document learnings and more checks
okdas Sep 24, 2024
341b987
Merge branch 'main' into dk-upgrade-learnings
okdas Sep 24, 2024
3d86aef
ca-certs are needed for relayminer
okdas Sep 25, 2024
94d8af3
Merge branch 'main' into dk-upgrade-learnings
okdas Oct 14, 2024
cb073bd
LocalNet upgrade procedure
Oct 15, 2024
3fe1fea
--wip-- [skip ci]
Oct 15, 2024
525eb57
--wip-- [skip ci]
Oct 17, 2024
4e8c7dd
spell checking
Oct 17, 2024
9d5696d
Merge branch 'main' into dk-upgrade-learnings
okdas Oct 17, 2024
dd631e6
Empty commit
okdas Oct 17, 2024
9d0bc9c
Merge with main
Olshansk Oct 21, 2024
83a24aa
Partial review
Olshansk Oct 21, 2024
50cc08e
Partial review
Olshansk Oct 21, 2024
3bddc15
Partial review
Olshansk Oct 21, 2024
498d9d8
Partial review
Olshansk Oct 21, 2024
7ecf403
Merge branch 'main' into dk-upgrade-learnings
okdas Oct 22, 2024
be35f1a
requested changes
okdas Oct 24, 2024
55ca1dc
Merge remote-tracking branch 'origin/main' into dk-upgrade-learnings
okdas Nov 18, 2024
7ce8194
change localnet upgrade docs
okdas Nov 19, 2024
25b8407
Merge branch 'main' into dk-upgrade-learnings
okdas Nov 19, 2024
a2a03ba
more requested changes
okdas Nov 19, 2024
d3133ad
Merge branch 'main' into dk-upgrade-learnings
Olshansk Nov 19, 2024
f5a6d0e
WIP
Olshansk Nov 19, 2024
7113cde
WIP review
Olshansk Nov 20, 2024
739a6d3
Merge with main
Olshansk Nov 20, 2024
d8c448c
Merge branch 'main' into dk-upgrade-learnings
Olshansk Nov 26, 2024
ec4314c
Update docusaurus/docs/develop/developer_guide/recovery_from_chain_ha…
Olshansk Nov 26, 2024
9886220
Update docusaurus/docs/protocol/upgrades/contigency_plans.md
Olshansk Nov 26, 2024
80c3e38
Update docusaurus/docs/develop/developer_guide/recovery_from_chain_ha…
Olshansk Nov 26, 2024
d6f6ced
Apply suggestions from code review
Olshansk Nov 26, 2024
a18a197
Fix merge conflict
Olshansk Nov 26, 2024
b7e5151
requested changes
okdas Nov 27, 2024
ad2b0c0
Merge branch 'dk-upgrade-learnings' of github.com:pokt-network/poktro…
okdas Nov 27, 2024
fc679d6
Merge remote-tracking branch 'origin/main' into dk-upgrade-learnings
okdas Nov 27, 2024
18479fc
Merge branch 'main' into dk-upgrade-learnings
Olshansk Dec 12, 2024
8e4ef7b
Review docusaurus/docs/develop/developer_guide/recovery_from_chain_ha…
Olshansk Dec 12, 2024
2a1b717
Merge branch 'main' into dk-upgrade-learnings
Olshansk Dec 12, 2024
58a7493
WIP review on docusaurus/docs/protocol/upgrades/contigency_plans.md
Olshansk Dec 12, 2024
9661c94
Finished reviewing docusaurus/docs/protocol/upgrades/contigency_plans.md
Olshansk Dec 12, 2024
2ce4945
Finish reviewing docusaurus/docs/protocol/upgrades/upgrade_procedure.md
Olshansk Dec 12, 2024
1 change: 0 additions & 1 deletion Dockerfile.release
@@ -8,7 +8,6 @@ RUN apt-get update && \
apt-get install -y --no-install-recommends ca-certificates && \
rm -rf /var/lib/apt/lists/*


# Use `1025` G/UID so users can switch between this and `heighliner` image without a need to chown the files.
RUN groupadd -g 1025 pocket && useradd -u 1025 -g pocket -m -s /sbin/nologin pocket

10 changes: 10 additions & 0 deletions app/upgrades/historical.go
@@ -17,6 +17,7 @@ import (
"github.com/cosmos/cosmos-sdk/types/module"
consensusparamtypes "github.com/cosmos/cosmos-sdk/x/consensus/types"

cosmostypes "github.com/cosmos/cosmos-sdk/types"
"github.com/pokt-network/poktroll/app/keepers"
)

@@ -29,6 +30,8 @@ func defaultUpgradeHandler(
configurator module.Configurator,
) upgradetypes.UpgradeHandler {
return func(ctx context.Context, plan upgradetypes.Plan, vm module.VersionMap) (module.VersionMap, error) {
logger := cosmostypes.UnwrapSDKContext(ctx).Logger()
logger.Info("Starting the migration in defaultUpgradeHandler")
return mm.RunMigrations(ctx, configurator, vm)
}
}
@@ -87,3 +90,10 @@ var Upgrade_0_0_4 = Upgrade{
// No changes to the KVStore in this upgrade.
StoreUpgrades: storetypes.StoreUpgrades{},
}

// Upgrade_0_0_9 is a small upgrade on TestNet.
var Upgrade_0_0_9 = Upgrade{
PlanName: "v0.0.9",
CreateUpgradeHandler: defaultUpgradeHandler,
StoreUpgrades: storetypes.StoreUpgrades{},
}
docusaurus/docs/develop/developer_guide/chain_halt_troubleshooting.md
@@ -8,13 +8,15 @@ title: Chain Halt Troubleshooting
- [Understanding Chain Halts](#understanding-chain-halts)
- [Definition and Causes](#definition-and-causes)
- [Impact on Network](#impact-on-network)
- [Troubleshooting Process](#troubleshooting-process)
- [Troubleshooting `wrong Block.Header.AppHash`](#troubleshooting-wrong-blockheaderapphash)
- [Step 1: Identifying the Issue](#step-1-identifying-the-issue)
- [Step 2: Collecting Node Data](#step-2-collecting-node-data)
- [Step 3: Analyzing Discrepancies](#step-3-analyzing-discrepancies)
- [Step 4: Decoding and Interpreting Data](#step-4-decoding-and-interpreting-data)
- [Step 5: Comparing Records](#step-5-comparing-records)
- [Step 6: Investigation and Resolution](#step-6-investigation-and-resolution)
- [Troubleshooting `wrong Block.Header.LastResultsHash`](#troubleshooting-wrong-blockheaderlastresultshash)
- [Syncing from genesis](#syncing-from-genesis)

## Understanding Chain Halts

@@ -40,7 +42,7 @@ Chain halts can have severe consequences for the network:

Given these impacts, swift and effective troubleshooting is crucial to maintain network health and user trust.

## Troubleshooting Process
## Troubleshooting `wrong Block.Header.AppHash`

### Step 1: Identifying the Issue

@@ -94,3 +96,20 @@ Based on the identified discrepancies:
2. Develop a fix or patch to address the issue.
3. If necessary, initiate discussions with the validator community to reach social consensus on how to proceed.
4. Implement the agreed-upon solution and monitor the network closely during and after the fix.

## Troubleshooting `wrong Block.Header.LastResultsHash`

Errors like the following can occur when a node uses the wrong binary version for the given height:

```bash
reactor validation error: wrong Block.Header.LastResultsHash.
```

The solution is to sync the full node using the correct binary version for each range of heights.

Tools like [Cosmovisor](https://docs.cosmos.network/v0.45/run-node/cosmovisor.html) make it easier
to sync a node from genesis by automatically using the appropriate binary for each range of block heights.
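For illustration, below is a minimal sketch of such a Cosmovisor setup; the home directory, upgrade name, and environment variables reflect common defaults and are assumptions rather than canonical values for this network.

```bash
# Sketch of a Cosmovisor home layout (paths and names are illustrative):
#
#   $DAEMON_HOME/cosmovisor/
#   ├── genesis/bin/poktrolld           # binary used from genesis
#   └── upgrades/
#       └── v0.0.9/bin/poktrolld        # binary used from the v0.0.9 upgrade height
#
# Cosmovisor switches to the binary under upgrades/<plan-name>/bin when the
# on-chain upgrade plan with that name is reached.
export DAEMON_NAME=poktrolld
export DAEMON_HOME=$HOME/.poktroll
cosmovisor run start
```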

## Syncing from genesis

If you encounter any of the errors mentioned above while syncing historical blocks, make sure you are running the correct version of the binary for the current height, per the [Upgrade List](../../protocol/upgrades/upgrade_list.md) table.
196 changes: 196 additions & 0 deletions docusaurus/docs/develop/developer_guide/recovery_from_chain_halt.md
@@ -0,0 +1,196 @@
---
sidebar_position: 7
title: Chain Halt Recovery
---

## Chain Halt Recovery <!-- omit in toc -->

This document describes how to recover from a chain halt.

It assumes that the cause of the chain halt has been identified, and that the
new release has been created and verified to function correctly.

:::tip

See [Chain Halt Troubleshooting](./chain_halt_troubleshooting.md) for more information on identifying the cause of a chain halt.

:::

- [Background](#background)
- [Resolving halts during a network upgrade](#resolving-halts-during-a-network-upgrade)
- [Manual binary replacement (preferred)](#manual-binary-replacement-preferred)
- [Rollback, fork and upgrade](#rollback-fork-and-upgrade)
- [Troubleshooting](#troubleshooting)
- [Data rollback - retrieving snapshot at a specific height (step 5)](#data-rollback---retrieving-snapshot-at-a-specific-height-step-5)
- [Validator Isolation - risks (step 6)](#validator-isolation---risks-step-6)

## Background

Pocket Network is built on top of `cosmos-sdk`, which uses the CometBFT consensus engine.
Comet's Byzantine Fault Tolerant (BFT) consensus algorithm requires that **more than** 2/3 of Validators
(by voting power) are online and voting for the same block to reach consensus. To maintain liveness
and avoid a chain halt, we need this supermajority (> 2/3) of Validators to participate
and use the same version of the software.

## Resolving halts during a network upgrade

If the halt was caused by a network upgrade, the solution can be as simple as
skipping the upgrade (i.e. `unsafe-skip-upgrade`) and creating a new (fixed) upgrade.

Read more about [upgrade contingency plans](../../protocol/upgrades/contigency_plans.md).

### Manual binary replacement (preferred)

:::note

This is the preferred way of resolving consensus-breaking issues.

**Significant side effect**: this breaks the ability to sync from genesis **without manual intervention**.
For example, when a consensus-breaking issue occurs on a node that is syncing from the first block, node operators need
to manually replace the binary with the new one. There are efforts underway to mitigate this issue, including
configuration for `cosmovisor` that could automate the process.

<!-- TODO_MAINNET(@okdas): Add links to Cosmovisor documentation on how the new UX can be used to automate syncing from genesis without human input. -->

:::

Since the chain is not moving, **it is impossible** to issue an automatic upgrade with an upgrade plan. Instead,
we need **social consensus** to manually replace the binary and get the chain moving.

The steps to doing so are:

1. Prepare and verify a new binary that addresses the consensus-breaking issue.
2. Reach out to the community and validators so they can upgrade the binary manually; a minimal sketch of a manual replacement follows below.
3. Update [the documentation](../../protocol/upgrades/upgrade_list.md) to include the range of heights at which the binary needs
   to be replaced.
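
For step 2, a minimal sketch of the manual replacement on a single node is shown below; the systemd service name, binary path, and release URL are illustrative assumptions that depend on how the node was installed.

```bash
# Sketch: manually replace the binary on a halted node (systemd assumed).
# Service name, paths, and the release URL are illustrative assumptions.
sudo systemctl stop poktrolld

# Back up the current binary and install the verified fixed one.
cp "$(which poktrolld)" /tmp/poktrolld.bak
curl -L -o /tmp/poktrolld "https://github.com/pokt-network/poktroll/releases/download/<fixed-version>/poktrolld_linux_amd64"
chmod +x /tmp/poktrolld
sudo mv /tmp/poktrolld "$(which poktrolld)"

poktrolld version   # verify the fixed version before restarting
sudo systemctl start poktrolld
```

Once more than 2/3 of the voting power is running the fixed binary, consensus can resume and the chain starts producing blocks again.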

:::warning

TODO_MAINNET(@okdas):

1. **For step 2**: Investigate if the CometBFT rounds/steps need to be aligned as in Morse chain halts. See [this ref](https://docs.cometbft.com/v1.0/spec/consensus/consensus).
2. **For step 3**: Add `cosmovisor` documentation so it is configured to automatically replace the binary when syncing from genesis.

:::

```mermaid
sequenceDiagram
participant DevTeam
participant Community
participant Validators
participant Documentation
participant Network

DevTeam->>DevTeam: 1. Prepare and verify new binary
DevTeam->>Community: 2. Announce new binary and instructions
DevTeam->>Validators: 2. Notify validators to upgrade manually
Validators->>Validators: 2. Manually replace the binary
Validators->>Network: 2. Restart nodes with new binary
DevTeam->>Documentation: 3. Update documentation (GitHub Release and Upgrade List to include instructions)
Validators-->>Network: Network resumes operation

```

### Rollback, fork and upgrade

:::info

These instructions are only relevant to Pocket Network's Shannon release.

We do not currently use `x/gov` or on-chain voting for upgrades.
Instead, all participants in our DAO vote on upgrades off-chain, and the Foundation
executes transactions on their behalf.

:::

:::warning

This approach should be avoided unless it has been tested much more thoroughly. In our tests, the full nodes kept
propagating the existing blocks signed by the Validators, making a rollback difficult.

:::

**Performing a rollback is analogous to forking the network at the older height.**

However, if necessary, the instructions to follow are:

1. Prepare & verify a new binary that addresses the consensus-breaking issue.
2. [Create a release](../../protocol/upgrades/release_process.md).
3. [Prepare an upgrade transaction](../../protocol/upgrades/upgrade_procedure.md#writing-an-upgrade-transaction) to the new version.
4. Disconnect the `Validator set` from the rest of the network **3 blocks** prior to the height of the chain halt. For example:
- Assume an issue at height `103`.
- Revert the `validator set` to height `100`.
- Submit an upgrade transaction at `101`.
- Upgrade the chain at height `102`.
- Avoid the issue at height `103`.
5. Ensure all validators have rolled back to the same height and are using the same snapshot ([how to get a snapshot](#data-rollback---retrieving-snapshot-at-a-specific-height-step-5))
- The snapshot should be imported into each Validator's data directory.
- This is necessary to ensure data continuity and prevent forks.
6. Isolate the `validator set` from full nodes - ([why this is necessary](#validator-isolation---risks-step-6)).
- This is necessary to prevent full nodes from gossiping blocks that have been rolled back.
- This may require using a firewall or a private network.
- Validators should only be permitted to gossip blocks amongst themselves.
7. Start the `validator set` and perform the upgrade. For example, reiterating the process above:
- Start all Validators at height `100`.
- On block `101`, submit the `MsgSoftwareUpgrade` transaction with `Plan.height` set to `102` (a sketch of such a transaction follows this list).
- `x/upgrade` will perform the upgrade in the `EndBlocker` of block `102`.
- The node will then halt with an error, waiting for the upgrade to be performed.
- Cosmovisor deployments automatically replace the binary.
- Manual deployments will require a manual replacement at this point.
- Start the node back up.
8. Wait for the network to reach the height of the previous ledger (`104`+).
9. Allow validators to open their network to full nodes again.
- **Note**: full nodes will need to perform the rollback or use a snapshot as well.
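
For step 7, a sketch of what the `MsgSoftwareUpgrade` transaction could look like is shown below; the authority address, key name, and file name are placeholders, and the authoritative procedure lives in [writing an upgrade transaction](../../protocol/upgrades/upgrade_procedure.md#writing-an-upgrade-transaction).

```bash
# Sketch only: addresses, key names, and the `info` payload are placeholders.
cat > upgrade_tx.json <<'EOF'
{
  "body": {
    "messages": [
      {
        "@type": "/cosmos.upgrade.v1beta1.MsgSoftwareUpgrade",
        "authority": "<authority-address>",
        "plan": {
          "name": "<new-plan-name>",
          "height": "102",
          "info": "<checksums and download URLs for the new binaries>"
        }
      }
    ]
  }
}
EOF

# The Foundation, authorized via x/authz, executes the message on behalf of
# the authority once the validator set is back up at height 101.
poktrolld tx authz exec upgrade_tx.json --from=<grantee-key>
```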

```mermaid
sequenceDiagram
participant DevTeam
participant Foundation
participant Validators
participant FullNodes
%% participant Network

DevTeam->>DevTeam: 1. Prepare & verify new binary
DevTeam->>DevTeam: 2 & 3. Create a release & prepare upgrade transaction
Validators->>Validators: 4 & 5. Roll back to height before issue or import snapshot
Validators->>Validators: 6. Isolate from Full Nodes
Foundation->>Validators: 7. Distribute upgrade transaction
Validators->>Validators: 7. Start network and perform upgrade

break
Validators->>Validators: 8. Wait until previously problematic height elapses
end

Validators-->FullNodes: 9. Open network connections
FullNodes-->>Validators: 9. Sync with updated network
note over Validators,FullNodes: Network resumes operation
```

### Troubleshooting

#### Data rollback - retrieving snapshot at a specific height (step 5)

There are two ways to get a snapshot from a prior height:

1. Execute

```bash
poktrolld rollback --hard
```

repeatedly, until the command reports the desired block height (see the loop sketch after this list).

2. Use a snapshot from below the halt height (e.g. `100`) and start the node with the `--halt-height=100` parameter so it only syncs up to that height and then
gracefully shuts down. Add this argument to `poktrolld start` like this:

```bash
poktrolld start --halt-height=100
```
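
As a convenience for option 1, the repeated rollback can be scripted; this sketch assumes `poktrolld rollback --hard` prints the height it rolled back to (as the stock `cosmos-sdk` command does), so verify your binary's output first.

```bash
# Sketch: roll back one block at a time until reaching the target height.
# Stop the node before running this; height parsing assumes the stock
# cosmos-sdk output ("Rolled back state to height N ...").
TARGET_HEIGHT=100

while :; do
  OUT="$(poktrolld rollback --hard)"
  echo "$OUT"
  HEIGHT="$(echo "$OUT" | grep -o '[0-9]\+' | head -n 1)"
  [ "$HEIGHT" -le "$TARGET_HEIGHT" ] && break
done
```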

#### Validator Isolation - risks (step 6)

Having even one node with knowledge of the forked ledger can jeopardize the whole process. In particular, the
following errors in the logs are a sign of nodes syncing blocks from the wrong fork:

- `found conflicting vote from ourselves; did you unsafe_reset a validator?`
- `conflicting votes from validator`
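
For illustration, one possible way to implement the isolation from step 6 is sketched below; the peer addresses, ports, and the choice of `iptables` are assumptions, and a firewall appliance or private network achieves the same goal.

```bash
# Sketch: isolate a validator at the CometBFT p2p layer during a rollback.
# Peer IDs, IPs, and ports are illustrative assumptions.

# In $HOME/.poktroll/config/config.toml, under [p2p]:
#   pex = false                                                   # disable peer exchange
#   persistent_peers = "<validator-node-id>@<validator-ip>:26656" # validators only

# Additionally, drop inbound p2p traffic from everything but other validators:
sudo iptables -A INPUT -p tcp --dport 26656 -s <validator-ip> -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 26656 -j DROP
```
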
100 changes: 100 additions & 0 deletions docusaurus/docs/protocol/upgrades/contigency_plans.md
@@ -0,0 +1,100 @@
---
title: Failed upgrade contingency plan
sidebar_position: 5
---

:::tip

This documentation covers contingency plans for failed upgrades of `poktroll`, a `cosmos-sdk`-based chain.

While this can be helpful for other blockchain networks, it is not guaranteed to work for other chains.

:::

## Contingency plans <!-- omit in toc -->

There's always a chance the upgrade will fail.

This document is intended to help you recover without significant downtime.

- [Option 0: The bug is discovered before the upgrade height is reached](#option-0-the-bug-is-discovered-before-the-upgrade-height-is-reached)
- [Option 1: The migration didn't start (i.e. migration halt)](#option-1-the-migration-didnt-start-ie-migration-halt)
- [Option 2: The migration is stuck (i.e. incomplete/partial migration)](#option-2-the-migration-is-stuck-ie-incompletepartial-migration)
- [Option 3: The migration succeeded but the network is stuck (i.e. migration had a bug)](#option-3-the-migration-succeeded-but-the-network-is-stuck-ie-migration-had-a-bug)
- [MANDATORY Checklist of Documentation \& Scripts to Update](#mandatory-checklist-of-documentation--scripts-to-update)

### Option 0: The bug is discovered before the upgrade height is reached

**Cancel the upgrade plan!**

See the instructions on [how to do that here](./upgrade_procedure.md#cancelling-the-upgrade-plan).


### Option 1: The migration didn't start (i.e. migration halt)

**This is unlikely to happen.**

Possible causes include an upgrade handler whose name differs from the one
specified in the upgrade plan, or an upgrade plan that points to the wrong binary.

If the nodes on the network stopped at the upgrade height and the migration did not
start yet (i.e. there are no logs indicating the upgrade handler and store migrations are being executed),
we **MUST** gather social consensus to restart validators with the `--unsafe-skip-upgrade=$upgradeHeightNumber` flag.

This will skip the upgrade process, allowing the chain to continue and the protocol team to plan another release.

`--unsafe-skip-upgrade` simply skips the upgrade handler and store migrations.
The chain continues as if the upgrade plan was never set.
The upgrade needs to be fixed, and then a new plan needs to be submitted to the network.

:::caution

`--unsafe-skip-upgrade` needs to be documented in the list of upgrades and added
to the scripts so that the next time somebody syncs the network from genesis,
they will automatically skip the failed upgrade; see the
[MANDATORY checklist of documentation & scripts to update](#mandatory-checklist-of-documentation--scripts-to-update).

<!-- TODO_MAINNET(@okdas): new cosmovisor UX can simplify this -->

:::

### Option 2: The migration is stuck (i.e. incomplete/partial migration)

If the migration is stuck, the upgrade handler was likely executed on-chain as scheduled, but the store migration did not complete.

In such a case, we need:

- **All full nodes and validators**: roll back to the pre-upgrade backup

- A snapshot is taken by `cosmovisor` automatically prior to upgrade when `UNSAFE_SKIP_BACKUP` is set to `false` (the default recommended value;
[more information](https://docs.cosmos.network/main/build/tooling/cosmovisor#command-line-arguments-and-environment-variables))

- **All full nodes and validators**: skip the upgrade

- Add the `--unsafe-skip-upgrade=$upgradeHeightNumber` argument to the `poktrolld start` command like so (a consolidated sketch follows this list):

```bash
poktrolld start --unsafe-skip-upgrade=$upgradeHeightNumber # ... the rest of the arguments
```

- **Protocol team**: Resolve the issue with an upgrade and schedule a new plan.

- The upgrade needs to be fixed, and then a new plan needs to be submitted to the network.

- **Protocol team**: document the failed upgrade

- Document and add `--unsafe-skip-upgrade=$upgradeHeightNumber` to the scripts (such as docker-compose and cosmovisor installer)
- The next time somebody tries to sync the network from genesis, they will automatically skip the failed upgrade; see the [MANDATORY checklist of documentation & scripts to update](#mandatory-checklist-of-documentation--scripts-to-update)

<!-- TODO_MAINNET(@okdas): new cosmovisor UX can simplify this -->
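
Putting the node-operator bullets above together, a consolidated sketch of the restore-and-skip sequence follows; the service name and backup directory name are assumptions based on `cosmovisor` defaults and should be verified on each host.

```bash
# Sketch: restore the automatic pre-upgrade backup, then restart while
# skipping the failed upgrade. The backup directory name is an assumption
# based on cosmovisor defaults; verify it before deleting anything.
sudo systemctl stop poktrolld

DAEMON_HOME=$HOME/.poktroll
mv "$DAEMON_HOME/data" "$DAEMON_HOME/data-broken"
cp -r "$DAEMON_HOME"/data-backup-<date> "$DAEMON_HOME/data"

poktrolld start --unsafe-skip-upgrade=$upgradeHeightNumber
```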

### Option 3: The migration succeeded but the network is stuck (i.e. migration had a bug)

This should be treated as a consensus or non-determinism bug that is unrelated to the upgrade. See [Recovery From Chain Halt](../../develop/developer_guide/recovery_from_chain_halt.md) for more information on how to handle such issues.

### MANDATORY Checklist of Documentation & Scripts to Update

- [ ] The [upgrade list](./upgrade_list.md) should reflect the failed upgrade and provide the range of heights served by each version.
- [ ] The systemd service should include the `--unsafe-skip-upgrade=$upgradeHeightNumber` argument in its start command [here](https://github.com/pokt-network/poktroll/blob/main/tools/installer/full-node.sh); see the sketch after this list.
- [ ] The [Helm chart](https://github.com/pokt-network/helm-charts/blob/main/charts/poktrolld/templates/StatefulSet.yaml) should point to the latest version; consider exposing it via a `values.yaml` file.
- [ ] The [docker-compose](https://github.com/pokt-network/poktroll-docker-compose-example/tree/main/scripts) examples should point to the latest version
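
For the systemd checklist item, a hedged sketch of a drop-in override is shown below; the service name and `ExecStart` path are assumptions based on the installer script and must be adapted to the actual unit file.

```bash
# Sketch: append the skip flag via a systemd drop-in. The service name
# ("poktrolld") and ExecStart path are illustrative assumptions.
sudo systemctl edit poktrolld
# In the editor, clear and re-set ExecStart with the flag appended, e.g.:
#   [Service]
#   ExecStart=
#   ExecStart=/usr/local/bin/cosmovisor run start --unsafe-skip-upgrade=$upgradeHeightNumber
sudo systemctl restart poktrolld
```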