Offchain runtime upgrades #102

eskimor · 2024-07-13T17:44:12Z

One step closer to making a reduction of PVF storage deposits feasible, and to generally improving performance and reliability for parachains.

bkchr · 2024-07-14T21:54:17Z

text/0102-offchain-parachain-runtime-upgrades.md

+### Introduce a new UMP message type `RequestCodeUpgrade`
+
+As part of elastic scaling we are already planning to increase flexibility of [UMP
+messages](https://github.com/polkadot-fellows/RFCs/issues/92#issuecomment-2144538974), we can now use this to our advantage and introduce another UMP message:


"We just need this hack for one thing and will not use it for anything else" ;)

It does indeed feel like it is creeping into other features/changes like this one, but it offers a lot of advantages in the short term. I would not call it a hack, but more of a generalisation of the UMP queue. The alternative is PVF versioning, which I believe is the long-term solution that we'll likely develop in 2025.

I mean, for CoreIndex it is clearly a hack. RequestCodeUpgrade is an actual message that is sort of fine to be passed here. However, this raises the question: should we add it to XCM? Should we make UMP messages generic, where one variant is XCM and the others are more UMP related?

We don't want to add it to XCM, instead we will have a UMP queue separator between regular XCM messages and the possible additional ones for CoreIndex and the RequestCodeUpgrade. I will soon post the RFC which explains the UMP changes in detail.

> We don't want to add it to XCM, instead we will have a UMP queue separator between regular XCM messages and the possible additional ones for CoreIndex and the RequestCodeUpgrade

I know what the plan was/is. However, this doesn't really invalidate what I said above.

Ah, I see. I would prefer to make the UMP messages more generic in this case, having two variants, one wrapping XCM and the other UMPSignal as defined here. Sounds much better than using a separator. If we agree on this, I will also update it in #103.
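The two-variant shape suggested in this thread can be sketched roughly as follows. This is a minimal Python illustration, not the actual encoding: the thread only names an XCM variant, a `UMPSignal` variant, and the `CoreIndex` and `RequestCodeUpgrade` signals. The class names, field names, and the `split_queue` helper are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Xcm:
    """Variant 1: wraps a regular, opaque XCM message."""
    payload: bytes


@dataclass(frozen=True)
class CoreIndexSignal:
    """Hypothetical UMP signal carrying a core index."""
    core_index: int


@dataclass(frozen=True)
class RequestCodeUpgradeSignal:
    """Hypothetical UMP signal asking for an upgrade to the code with this hash."""
    code_hash: bytes


# Variant 2: everything that is UMP related but not XCM.
UmpSignal = Union[CoreIndexSignal, RequestCodeUpgradeSignal]
UmpMessage = Union[Xcm, UmpSignal]


def split_queue(messages: List[UmpMessage]) -> Tuple[List[Xcm], List[UmpSignal]]:
    """Route each message by its variant instead of relying on a queue separator."""
    xcms = [m for m in messages if isinstance(m, Xcm)]
    signals = [m for m in messages if not isinstance(m, Xcm)]
    return xcms, signals
```

With typed variants, XCM messages and signals may be interleaved in the queue and can still be told apart, which is the property a separator only provides by position.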

burdges · 2024-07-17T08:46:21Z

text/0102-offchain-parachain-runtime-upgrades.md

+Change the upgrade process of a parachain runtime upgrade to become an off-chain
+process with regards to the relay chain. Upgrades are still contained in
+parachain blocks, but will no longer need to end up in relay chain blocks nor in
+relay chain state.


Yes, off-chain upgrades make sense: I mildly pushed for PVF upgrades to live in parablocks early on, but we decided on upgrades on the relay chain since all validators need the data eventually anyway. It's true, however, that validator set churn makes off-chain an optimization, and that being on-chain incurs extra costs, like repeated downloads.

burdges · 2024-07-17T08:47:47Z

text/0102-offchain-parachain-runtime-upgrades.md

+
+In case they received the collation via PoV distribution instead of from the
+collator itself, they will use the exact same message to fetch from the validator
+they got the PoV from.


Why not make the code upgrade simply be the parachain block? Isn't that how Substrate worked from the beginning?

If the code were bigger than a block, then you could incrementally build the PVF in parachain state, and incrementally hash it. Or do some special larger code block type.
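The incremental variant mentioned here could look roughly like the sketch below: code arrives in block-sized pieces, is accumulated, and is hashed as it arrives, so the final hash matches a hash over the whole blob. The names `split_code` and `IncrementalCode` are hypothetical, and the use of BLAKE2b with a 32-byte digest is an assumption for illustration, not a statement about what the relay chain uses.

```python
import hashlib
from typing import List


def split_code(code: bytes, max_chunk: int) -> List[bytes]:
    """Cut a code blob into pieces that each fit into one (hypothetical) block."""
    return [code[i:i + max_chunk] for i in range(0, len(code), max_chunk)]


class IncrementalCode:
    """Accumulates code pieces and hashes them as they arrive."""

    def __init__(self) -> None:
        # Assumption: BLAKE2b-256 stands in for whatever hash the chain uses.
        self._hasher = hashlib.blake2b(digest_size=32)
        self._parts: List[bytes] = []

    def append(self, piece: bytes) -> None:
        self._parts.append(piece)
        self._hasher.update(piece)

    def code(self) -> bytes:
        return b"".join(self._parts)

    def code_hash(self) -> bytes:
        return self._hasher.digest()
```

Because the hasher is fed piece by piece, no single block ever has to carry or hash the complete code.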

burdges · 2024-07-17T09:00:58Z

text/0102-offchain-parachain-runtime-upgrades.md

+Then on each further candidate from that chain that counter gets decremented.
+Validators which have not yet succeeded fetching will now try again. This game
+continues until the counter reached `0`. Now it is mandatory to have the code in
+order to sign a `1` in the bitfield.


You've just pushed the availability into the last of these fake blocks here. I guess this works, but I'm not convinced this is better than doing some big block availability variant:

We'd process the code availability in a single big parachain block, which only provides data but never gets executed. This takes as long as it takes, maybe running at some lower priority. It occupies the availability core for that whole time, exactly like this scheme does.

After that runs, we have code available on chain so everyone must fetch it and build the artifact. We must delay the PVF upgrade being usable until those builds succeed, which could be done either by a second fake parablock type, or else by some message of the sort discussed here.
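For reference, the countdown scheme under discussion in this hunk can be sketched as below. This is only an illustration of the rule quoted from the RFC text (decrement per candidate, code becomes mandatory at `0`); `PendingUpgrade`, `availability_bit`, and their parameters are hypothetical names, and the fetching itself is left out.

```python
from dataclasses import dataclass


@dataclass
class PendingUpgrade:
    """Tracks how many further candidates remain before the code is mandatory."""
    counter: int

    def on_candidate(self) -> None:
        # Each further candidate from the chain decrements the counter.
        if self.counter > 0:
            self.counter -= 1

    def code_required(self) -> bool:
        return self.counter == 0


def availability_bit(upgrade: PendingUpgrade, has_chunk: bool, has_new_code: bool) -> bool:
    """Decide whether a validator signs a 1 for the current candidate."""
    if not has_chunk:
        return False
    if upgrade.code_required():
        # Counter reached 0: the chunk alone is no longer enough.
        return has_new_code
    # While counting down, validators keep retrying the fetch in the background.
    return True
```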

burdges · 2024-07-17T09:01:42Z

text/0102-offchain-parachain-runtime-upgrades.md

+Validators in availability distribution will be changed to only sign a `1` in
+the bitfield of a candidate if they not only have the chunk, but also the
+currently active PVF. They will fetch it from backers in case they don't have it
+yet.


Yeah this makes sense regardless.

burdges · 2024-07-17T09:04:57Z

text/0102-offchain-parachain-runtime-upgrades.md

+But the majority of validators should always keep the latest code of any
+parachain and only prune the previous one, once the first candidate using the
+new code got finalized. This ensures that disputes will always be able to
+resolve.


Yeah, 1 is an improvement here. Previously I'd envisioned parachains doing code reuploads once per day, just so the code stays in availability.
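The pruning rule from the quoted hunk (keep the latest code, drop the previous one only once the first candidate using the new code is finalized) might be sketched like this. `CodeStore` and its method names are made up for the example; real validators would key this per parachain and persist it to disk.

```python
from typing import Dict, Optional


class CodeStore:
    """Keeps the current code and, during a transition, the previous one."""

    def __init__(self, current_hash: bytes, current_code: bytes) -> None:
        self.blobs: Dict[bytes, bytes] = {current_hash: current_code}
        self.current: bytes = current_hash
        self.previous: Optional[bytes] = None

    def on_upgrade_enacted(self, new_hash: bytes, new_code: bytes) -> None:
        if new_hash == self.current:
            return  # nothing changes, nothing to prune later
        self.blobs[new_hash] = new_code
        self.previous = self.current
        self.current = new_hash

    def on_finalized_candidate(self, code_hash_used: bytes) -> None:
        # Only a finalized candidate that ran on the new code makes the old
        # code unnecessary; until then disputes may still need it.
        if code_hash_used == self.current and self.previous is not None:
            self.blobs.pop(self.previous, None)
            self.previous = None
```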

sandreim · 2024-07-17T14:30:39Z

text/0102-offchain-parachain-runtime-upgrades.md

+
+1. They received a collation sending `RequestCodeUpgrade`.
+2. They received a collation, but they don't yet have the code that was
+   previously registered on the relaychain. (E.g. disk pruned, new validator)


Is it still feasible to prepare PVFs in advance (when a node becomes a validator in the next session)?

sandreim · 2024-07-17T15:52:39Z

text/0102-offchain-parachain-runtime-upgrades.md

+
+1. Fetching can happen over a longer period of time with low priority. E.g. if
+   we waited for the PVF at the very first availability distribution, this might
+   actually affect liveness of other chains on the same core. Distributing


Don't we still starve the next parachain if inclusion is delayed until the code has been fetched by 2/3 of validators? I mean, if we treat these as low priority this can be an issue.

That's why we have a configurable amount of parachain blocks to do the fetching. If we ever run into availability problems we can:

1. Increase that amount of blocks we have time to fetch the PVF.
2. Limit the amount of runtime upgrades we are willing to do in a timespan and add priority fees (already planned) to requests, to secure a spot in case of competition.

Note, however, that right now we distribute those upgrades twice within a single relay chain slot: once via statement distribution, then via the relay chain block. In the new scheme, if we set the number of required parachain blocks to 10, we distribute the code once over 10 blocks instead of twice in one slot, and thus reduce pressure 20 times. Thus I doubt it will be a problem in practice, and if it ever were, we have means to fix it.
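The factor of 20 follows from simple arithmetic, spelled out here as a small sketch. `per_slot_load` is a hypothetical helper, and treating "pressure" as code transfers per relay chain slot is a simplifying assumption.

```python
from fractions import Fraction


def per_slot_load(distributions: int, slots: int) -> Fraction:
    """Average number of full code transfers per relay chain slot."""
    return Fraction(distributions, slots)


# Today: statement distribution plus the relay chain block, both in one slot.
current = per_slot_load(distributions=2, slots=1)

# Proposed: one distribution, spread over 10 required parachain blocks.
proposed = per_slot_load(distributions=1, slots=10)

reduction = current / proposed
```

Here `reduction` comes out as 20, matching the figure in the comment; doubling the number of required blocks would double it again.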

sandreim · 2024-07-17T16:11:58Z

text/0102-offchain-parachain-runtime-upgrades.md

+order to sign a `1` in the bitfield.
+
+PVF pre-checking will happen after the candidate which brought the counter to
+`0` has been successfully included and thus is also able to assume that 2/3 of


Is there an expiry date by which the parachain needs to reach 0, after which the code upgrade is dropped?

Good point. Will add a section.

eskimor · 2024-07-17T17:15:48Z

For fees with this proposal, given that storage cost is now essentially limited to validator disk space, we should be able to bring down deposit costs significantly. E.g. if we assume that 1TB costs 100 Euro and such a disk lives for 3 years, we have yearly costs of 33 Euro per TB (all very roughly). Everything we store is stored on a thousand validators, thus 1MB needs 1GB of storage. This means storage costs of roughly 3.3 cents per year for a 1MB PVF, so roughly 17 cents for a full-blown PVF of 5MB. With 10% staking rewards, this means we would only need to lock up tokens worth roughly 2 Euro. Even if we make it 10x or even 100x that, we would still be way lower than the current on-chain storage.

Obviously this is just a back-of-the-envelope calculation, but assuming I missed a few cost factors (like electricity, ...), going 10x my calculation would still be cheap (20 Euro worth of tokens locked).
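The back-of-the-envelope numbers above can be reproduced with a few lines. The function names are hypothetical and the defaults are just the assumptions stated in the comment (100 Euro per TB, 3 year disk lifetime, 1000 validators, 10% staking rewards). The unrounded results differ slightly from the rounded ones in the text.

```python
def yearly_storage_cost_eur(
    pvf_mb: float,
    validators: int = 1000,
    disk_eur_per_tb: float = 100.0,
    disk_lifetime_years: float = 3.0,
) -> float:
    """Yearly disk cost of keeping one PVF on every validator."""
    stored_tb = pvf_mb * validators / 1_000_000  # 1 TB = 1,000,000 MB
    return stored_tb * disk_eur_per_tb / disk_lifetime_years


def required_lockup_eur(yearly_cost_eur: float, staking_return: float = 0.10) -> float:
    """Deposit whose forgone staking rewards cover the yearly cost."""
    return yearly_cost_eur / staking_return


cost_1mb = yearly_storage_cost_eur(1)   # about 0.033 Euro, i.e. 3.3 cents
cost_5mb = yearly_storage_cost_eur(5)   # about 0.17 Euro
lockup_5mb = required_lockup_eur(cost_5mb)  # about 1.7 Euro, "roughly 2 Euro"
```

Multiplying `lockup_5mb` by 10 to account for missed cost factors gives roughly 17 Euro, in line with the "20 Euro" ballpark.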

I love it and I will extend the RFC a bit more to at least have everything prepared for smart contract storage.

bkchr · 2024-07-17T19:13:09Z

> For fees with this proposal, given that storage cost is essentially now limited to validator disk space, we should be able to bring down deposit costs significantly.

Even before, this was already taking validators into account. This is a decentralized network and you have no control over how many nodes are running, i.e. how many copies exist. Thus, the previous model also could not include any kind of storage costs from random nodes in the network.

Your biggest argument last week was also the costs for compiling the code. I don't see how this RFC changes the cost for compiling the PVFs.

eskimor · 2024-07-18T12:10:56Z

> Your biggest argument last week was also the costs for compiling the code. I don't see how this RFC changes the cost for compiling the PVFs.

The biggest concern was actually the blockspace used on the relay chain, which is fully solved by this proposal. For preparation, indeed nothing changed. The best solution to this problem would be PolkaVM. Until we have that or some other solution, you indeed brought up a good argument to not go that low with fees. Although we should probably differentiate between storage used for the PVF, which needs to be prepared, and additional storage offered (e.g. for smart contracts), which doesn't impose a cost on PVF compilation. 🤔

burdges · 2024-07-18T13:02:00Z

All my above comments can be summarized like:

Why is this availability voting countdown hack better than simply occupying one availability core for longer?

We're not going to starve the system of cores of course. A priori, we do not really care how long an availability core stays occupied since they never delay finality.

Are you worried there are parablocks which must be assigned to one particular core?

If this were the concern, then we could solve this in other ways, some of which are maybe more "orthogonal" in some sense. We could have a "code upgrade" system parachain into which all parachains post their code. It'd be "virtual" in that it has no state, no collators, and no PVF of its own, but it takes arbitrarily large blocks.

You want this countdown for billing perhaps? I'd buy that reasoning, not much point having a whole separate billing system.

burdges · 2024-07-18T13:11:23Z

I noticed "chunk" twice in this document. If you envision ever doing reconstruction from erasure coded data, then you need approval checkers who check the erasure coding, otherwise someone could replace some chunks with garbage.

Instead, you could have some notion of a mirroring/code core, or a state of an availability core, in which validators only sign the bit once they've fetched the whole data block. This saves some nodes re-encoding the PVF since everyone wants the PVF eventually anyway.

eskimor · 2024-07-18T13:59:55Z

> Why is this availability voting countdown hack better than simply occupying one availability core for longer?

Because that would affect block times of that parachain and, if it was core sharing, even block times of other chains on that core.

The counter is just an easy solution to:

1. Give it more time than normal availability - so we can re-use availability for this, without introducing another protocol.
2. Indeed it is also an easy way to do some billing. Not perfect, but kind of apt for the purpose. Updating your runtime affects all validators and is therefore rather costly, hence the worst case of wasting a bit of coretime for empty blocks seems fine (low volume, on-demand chain) and we charge for it anyway.

(1) is more important. My biggest concern is usability issues, but should be fine as well with good documentation and emitting events about counter state.

Virtual cores are an interesting idea, although I think this actually adds more complexity, both to code and to cognitive load. It would be complexity we don't need to expose though, thus maybe good. Will think about it.

In fact I plan on using the coretime chain for the initial upload of the PVF (parachain registration).

> I noticed "chunk" twice in this document. If you envision ever doing reconstruction from erasure coded data, then you need approval checkers who check the erasure coding, otherwise someone could replace some chunks with garbage.

I don't think it makes sense to chunk the data given that all validators need the full data anyway.

eskimor · 2024-10-09T08:33:34Z

text/0102-offchain-parachain-runtime-upgrades.md

+continues until the counter reached `0`. Now it is mandatory to have the code in
+order to sign a `1` in the bitfield.
+
+PVF pre-checking will happen after the candidate which brought the counter to


Question: Do we need to use availability bitfields here or can we rely on pre-checking only?

Bitfields offer the advantage that we have an incentive for backers (at least for the last one), and they avoid having to impose the work of pre-checking without the "attacker" having paid their bill (produced enough blocks).

Other things to consider:

1. To remove the wart for on-demand chains of having to produce n blocks, we could introduce a backwards compatible "fast-track" fee. With this you either produce n blocks (backwards compatible with existing chains) or you pay the fast-track fee, which removes this restriction and will also remove the 2 session delay: we just have pre-checking succeed only if either those two sessions have passed, or validators have seen the including block finalized and the fast-track fee has been paid. Backers can then be incentivized to provide the code by getting a cut of that fast-track fee iff pre-checking succeeds, which it obviously only will if validators were able to fetch the code.

2. We could see a stop-gap solution until we go full off-chain by doing the following:
2.1. Introduce a requirement to be eligible for a runtime upgrade, by having produced n blocks since the last one, with n being something like 1000. This will hardly be noticeable by existing chains (backwards compatible), but will rate-limit upgrades for on-demand chains and ramp up the cost. The effective rate limit with x cores available is x/n upgrades per relay chain block. So with n = 1000 and 100 cores, this would be 1/10, meaning that by fully utilizing 100 cores someone could trigger a runtime upgrade every 10 blocks, causing 10% service degradation in the worst case. We can get even better by either fully implementing this RFC or by increasing n further.
2.2. Have the above fast-track fee to cater to legit on-demand chains and also allow for secure fast-tracking of upgrades in general.
2.3. A relay chain block containing a candidate which contains a runtime upgrade is illegal if the parachain has not produced n blocks and is not paying the fast-track fee. Note: This might be problematic, as depending on n this might no longer be that backwards compatible and, more importantly, a parachain could end up permanently DoSing itself.
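The rules in this list can be condensed into a short sketch. All function and parameter names are hypothetical, and the pre-check condition deliberately ignores whether validators actually managed to fetch and compile the code, which would still be required.

```python
from fractions import Fraction


def max_upgrade_rate(cores: int, n: int) -> Fraction:
    """Worst-case upgrades per relay chain block if every core farms upgrades."""
    return Fraction(cores, n)


def upgrade_allowed(blocks_since_last_upgrade: int, n: int, fast_track_fee_paid: bool) -> bool:
    """Rule 2.3: a candidate carrying an upgrade is legal only in these cases."""
    return fast_track_fee_paid or blocks_since_last_upgrade >= n


def precheck_may_succeed(
    sessions_since_request: int,
    including_block_finalized: bool,
    fast_track_fee_paid: bool,
    required_sessions: int = 2,
) -> bool:
    """Item 1: either wait out the session delay, or pay to fast-track."""
    waited = sessions_since_request >= required_sessions
    fast_tracked = including_block_finalized and fast_track_fee_paid
    return waited or fast_tracked
```

For example, `max_upgrade_rate(100, 1000)` gives 1/10, the "one upgrade every 10 blocks" worst case mentioned in 2.1; raising n lowers that rate proportionally.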

eskimor added 5 commits July 13, 2024 19:42

Offchain runtime upgrades

9ec6a42

Add note about storage deposits and future extensibility.

b5098e1

Generalize req/res protocol

ce9d5f3

Clarifications.

c8cd560

Some refinements on future directions.

127d3ec

bkchr reviewed Jul 14, 2024

More clarifications

95f9763

burdges reviewed Jul 17, 2024

sandreim reviewed Jul 17, 2024

burdges mentioned this pull request Jul 19, 2024

Reduce storage deposit for parachain PVFs paritytech/polkadot-sdk#5012

Open

eskimor commented Oct 9, 2024
