Initiative: Improvements we should introduce to the mainnet upgrade process - OP stack #452

BlocksOnAChain · 2024-11-06T17:06:54Z

BlocksOnAChain
Nov 6, 2024
Collaborator

I wanted to start a discussion and share specific ideas for improving our release flow for the OP stack.

Our motivation for improving the process:
For the past year, we've been shipping hardforks on a regular schedule.
As we grow the network, we need to involve more core developers who are building with us in the entire process. Currently, we may lack sufficient public documentation to help everyone understand the steps and procedures behind our release flow.
Additionally, we aim to continuously introduce improvements and refine the hardfork process over time.

To start improving things and to act on the ideas and feedback we received about the current upgrade process, I'm proposing the following:

Improvements we should make

The product teams need to be involved in the hardfork process and help plan features and deliverables for the upcoming hardforks.
WHY: We need product managers to share their views around:

Prioritization of the upcoming features/deliverables, to help the team determine our hardfork scope better.
Validation of whether our hardfork plans meet the overall acceptance criteria or achieve the team/organizational roadmap goals.

We should have a cut-off period in which we “lock” the scope for each hardfork we plan to do. We suggest doing it as soon as we start working on the engineering tasks and after we merge all the relevant specifications for that specific hardfork.
We should try to execute a hardfork release every 3 months or even sooner, so we can keep regularly shipping improvements to our users.
The FMA document (good example) should be created and reviewed as part of the specification reviews. By that point, we know the scope and the specific items we plan to ship, so we can create a draft FMA from them. Starting the FMA document sooner rather than later is one of the big learnings from the last two or three hardforks.
We should start tracking metrics around the following development activities:

The time we need from the code complete stage to the actual activation on mainnet.
An estimate of the amount of work we are taking on for each hardfork, so we can determine in the early stages of the project whether we need to cut the scope or move some items to the next hardfork.
The time we need to test the entire implementation at the devnet/testnet level.
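
A minimal sketch of how the timing metrics above could be recorded per hardfork. The class, field names, and dates are illustrative assumptions, not an existing tool or real hardfork data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class HardforkTimeline:
    """Milestone dates for one hardfork (all field names are hypothetical)."""
    name: str
    code_complete: date
    testing_started: date
    testing_finished: date
    mainnet_activation: date

    def code_complete_to_activation_days(self) -> int:
        # Time from the code complete stage to activation on mainnet.
        return (self.mainnet_activation - self.code_complete).days

    def testing_days(self) -> int:
        # Time spent testing the implementation on devnets/testnets.
        return (self.testing_finished - self.testing_started).days


# Placeholder dates, not taken from any real hardfork.
example = HardforkTimeline(
    name="example-fork",
    code_complete=date(2024, 2, 19),
    testing_started=date(2024, 2, 20),
    testing_finished=date(2024, 3, 12),
    mainnet_activation=date(2024, 4, 1),
)

print(example.code_complete_to_activation_days())  # 42
print(example.testing_days())  # 21
```

Recording the same fields for every hardfork would make it possible to compare these durations across releases.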

We should have a dedicated hardfork Discord channel that will allow everyone to participate and explain their ideas, specs, and current tasks. #Holocene-HF is a good example of what this should look like.
As part of our planning and scoping efforts, we should build in buffer time for proper scope and specification reviews, since this is the stage in which ideas are turned into actual future work.
Every hardfork should have its own retrospective call, to which all core development teams that participated will be invited. We will use this dedicated call to share ideas, learn, and improve the delivery process over time.

BlocksOnAChain · 2024-11-06T17:39:01Z

BlocksOnAChain
Nov 6, 2024
Collaborator Author

Related to this topic, and potentially useful context for other core devs working on the OP stack, I want to share a good example of what our current release timeline looks like and how much time we usually spend on each major release step:

0 replies

roberto-bayardo · 2024-11-06T18:04:17Z

roberto-bayardo
Nov 6, 2024

I think teams should try to have a prototype in progress in parallel with the spec update, which will allow us to work out the kinks more quickly. For example, we ended up pivoting a few times on the configurable EIP-1559 parameter design because of challenges that were only uncovered once we started implementing.

0 replies

ajsutton · 2024-11-20T01:01:04Z

ajsutton
Nov 20, 2024
Collaborator

Makes sense to me basically. I definitely agree with doing prototypes and showing things working. We often spend a ton of time writing specs and design docs and then discover it doesn't work as soon as we start writing code.

I'm also a little concerned about the hard fork every 3 months schedule. That is a very fast schedule for hard forks, which means spending a huge amount of time coordinating the fork. I love incremental delivery and small change sets, but given that a hard fork requires something like 6 weeks of coordination, and we actually need to be spending more time doing things like devnets and cross-client compatibility testing after we reach code complete, doing one every 3 months is a lot of overhead and leaves extremely little time for actual development. Yes, theoretically we can pipeline, but practically not so much, and it hugely increases time pressure because if one fork slips there are others downstream that are impacted (see the Hokey Pokey happening with Isthmus changes). Personally I'd focus on getting the hard fork process smooth when doing one fork at a time before trying to pipeline stuff.
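
As a rough worked version of the arithmetic in this comment: the sketch below assumes a 3 month cadence is about 13 weeks, takes the roughly 6 weeks of coordination mentioned above, and assumes no pipelining between forks. The function names are illustrative.

```python
def remaining_weeks(cadence_weeks: int, coordination_weeks: int) -> int:
    """Weeks left per cycle once coordination time is subtracted."""
    return cadence_weeks - coordination_weeks


def overhead_fraction(cadence_weeks: int, coordination_weeks: int) -> float:
    """Share of each cycle spent on coordination."""
    return coordination_weeks / cadence_weeks


CADENCE_WEEKS = 13       # assumption: 3 months is roughly 13 weeks
COORDINATION_WEEKS = 6   # "something like 6 weeks", per the comment above

print(remaining_weeks(CADENCE_WEEKS, COORDINATION_WEEKS))  # 7
print(round(overhead_fraction(CADENCE_WEEKS, COORDINATION_WEEKS), 2))  # 0.46
```

Under these assumptions, nearly half of each cycle goes to coordination, before accounting for the additional devnet and cross-client testing time the comment asks for.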

5 replies

mds1 Nov 20, 2024
Collaborator

I am less familiar with the hard fork process than everyone else involved in this discussion so far, so I would like to better understand where the large amount of time spent coordinating the fork comes from, so we can reduce the overhead. How much of the current overhead is inherent and can never be removed, and how much can we reduce by improving our processes?

ajsutton Nov 20, 2024
Collaborator

Hard forks are always hard to coordinate and time consuming. You have a heap of different chains that need to coordinate upgrades, cross-client testing, governance processes (and answering the questions etc.), and multiple required releases. Basically you're requiring all users to take action by a certain date, and that will always be expensive to coordinate. And core consensus changes are the ones that need to be tested most carefully, which takes time and effort as well. When we try to streamline things, it's generally that testing time that gets squeezed out first (e.g. Holocene was supposed to have cross-client devnets but is already scheduled to activate on Sepolia without that testing, which I get, but it shows why time pressure is so dangerous).

mds1 Nov 20, 2024
Collaborator

So is your stance that the proposed 3 month schedule can become attainable with process improvements, but is likely close to the max fork cadence? (I am not advocating to fork more frequently or doubting your reply, just trying to get a better understanding and calibrate my own expectations)

ajsutton Nov 20, 2024
Collaborator

I'm not sure what the most rapid fork cadence we could realistically handle is. I would be surprised if it really is worth doing a hard fork every 3 months because of the inherent cost of coordinating with every chain and node operator in the ecosystem. It's also likely undesirable from a customer's perspective because it means they have more work to do to keep things up to date.

Mostly I think we should stop being so timeline driven. Saying we'll hard fork every 3 months just creates a ton of time pressure which encourages us to prioritise shipping today vs actually investing in the tooling and process improvements that will let us ship more efficiently in the future. If we say we'll invest in one thing each hard fork to make the next hard fork smoother and faster, we might wind up at a point where we're shipping hard forks every 3 months. If we say we'll ship a hard fork every 3 months, we'll be so busy trying to do that, that I don't believe we'll ever make the investments required to make that sustainable and safe, so we'll likely wind up delaying a lot of hard forks and complaining about how things aren't getting better.

mds1 Nov 20, 2024
Collaborator

> If we say we'll invest in one thing each hard fork to make the next hard fork smoother and faster, we might wind up at a point where we're shipping hard forks every 3 months. If we say we'll ship a hard fork every 3 months, we'll be so busy trying to do that, that I don't believe we'll ever make the investments required to make that sustainable and safe, so we'll likely wind up delaying a lot of hard forks and complaining about how things aren't getting better.

I like this framing a lot

Thanks for your insights here :)

mds1 · 2024-11-20T03:12:54Z

mds1
Nov 20, 2024
Collaborator

Overall the improvements suggested here sound great to me. The way this is written sounds like it applies strictly to hard forks. Is this correct, or does it apply to contract changes as well? Sometimes hard forks also require contract upgrades, sometimes they have no associated contract upgrades (aside from the required dispute game changes), and sometimes contract upgrades are required without an associated hard fork. We should clarify how contracts fit into this proposal.

0 replies

mds1 · 2024-11-20T03:20:53Z

mds1
Nov 20, 2024
Collaborator

I've noticed we spend a lot of time discussing the same things for each hard fork and contract upgrade. Below is a checklist (that is probably missing some things) covering everything I could think of that's involved in shipping to production. Not all items are always applicable.

I'm curious to get thoughts around standardizing something like this to help reduce a lot of the discussions we always have. The idea is:

We finalize a set of features and fixes for a fork/upgrade.
Each feature owner makes a copy of this checklist. They start by populating target completion dates (TCDs) for each item, so everyone knows their schedule upfront. TCDs get updated as needed throughout the development process.
As they develop their feature, they replace the TCDs with info or links to the appropriate artifact, signaling completion of a checklist item.
A completed checklist is all we need to know a feature is ready to ship.
We can include the checklists for each feature in governance proposals.

Checklist: Shipping Features and Fixes to Production

- Target governance cycle: <cycle number and dates>
- Design docs: <link to design docs>
- Specs: <link to specs>
- FMA: <link to FMA>
- Audit report(s): <link to audit report(s)>
- Update to security reviews folder: <PR to update https://github.com/ethereum-optimism/optimism/tree/develop/docs/security-reviews>
- Standard rollup charter (standard config) changes: <link to doc detailing suggested changes and open questions>
- Superchain registry changes: <link to suggested superchain registry changes>
- New incident response runbooks: <link to new runbooks>
- Existing incident response modifications: <link to doc detailing diff of existing runbooks>
- Monitoring/alerting requirements: <link to monitoring/alerting requirements>
- Draft governance post: <link to relevant portion of governance post>
- Deployment/Rollout plan: <for all chains we hold keys for, consider devnet, sepolia, mainnet>
- Cut op-geth release candidate: <links to op-geth release>
- Cut op-node release candidate: <links to op-node release>
- Cut op-program release candidate: <links to op-program release>
- Cut op-contracts release candidate: <links to op-contracts release>
- <etc. for any missing op-* package release candidates that are needed>
- <repeat of the above for the final releases>
- Devnet superchain-ops upgrade: <link to draft playbook>
- Communication of devnet upgrade to appropriate teams
- Sepolia superchain-ops upgrade: <link to draft playbook>
- Communication of sepolia upgrade to appropriate teams
- Mainnet superchain-ops upgrade: <link to draft playbook>
- Communication of mainnet upgrade to appropriate teams
- Docs: <link to docs PRs>
- Draft email for foundation approval: <link to draft email>
- Infrastructure rollouts: <what's needed>
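
A minimal sketch of the workflow described above, in which each item starts with a TCD and is marked complete once an artifact replaces it. The class, field names, dates, and example.com links are hypothetical placeholders, not an existing tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ChecklistItem:
    title: str
    target_completion_date: Optional[date] = None  # TCD while work is pending
    artifact: Optional[str] = None  # link or info once the item is complete
    applicable: bool = True  # not all items are always applicable

    def is_done(self) -> bool:
        # An item counts as done if it does not apply or has an artifact.
        return (not self.applicable) or self.artifact is not None


def outstanding(items: List[ChecklistItem]) -> List[str]:
    """Titles of applicable items that still only have a TCD."""
    return [item.title for item in items if not item.is_done()]


def ready_to_ship(items: List[ChecklistItem]) -> bool:
    """A completed checklist means the feature is ready to ship."""
    return not outstanding(items)


# Placeholder checklist for a hypothetical feature.
checklist = [
    ChecklistItem("Specs", artifact="https://example.com/specs"),
    ChecklistItem("FMA", target_completion_date=date(2025, 1, 15)),
    ChecklistItem("Superchain registry changes", applicable=False),
]

print(outstanding(checklist))    # ['FMA']
print(ready_to_ship(checklist))  # False
```

Once the FMA entry gets its artifact link, `ready_to_ship` returns `True` for this example.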

1 reply

geoknee Dec 2, 2024
Collaborator

Could we also have a system where every PR in a key repository (monorepo, op-geth, etc.) gets an hf-Holocene or similar label? It should be easy to track down all the code changes related to the upgrade.
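
A minimal sketch of how such labels could be used to collect the PRs for each upgrade across repositories. The `hf-` prefix follows the example above; the record layout, repository names, and PR numbers are hypothetical placeholders and do not come from any real API response.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_fork_label(
    prs: List[dict], prefix: str = "hf-"
) -> Dict[str, List[Tuple[str, int]]]:
    """Map each fork label to the (repo, PR number) pairs that carry it."""
    grouped = defaultdict(list)
    for pr in prs:
        for label in pr["labels"]:
            # Compare case-insensitively so hf-Holocene and hf-holocene match.
            if label.lower().startswith(prefix):
                grouped[label.lower()].append((pr["repo"], pr["number"]))
    return dict(grouped)


# Placeholder PR records, not real pull requests.
prs = [
    {"repo": "monorepo", "number": 1, "labels": ["hf-Holocene", "bug"]},
    {"repo": "op-geth", "number": 2, "labels": ["hf-holocene"]},
    {"repo": "monorepo", "number": 3, "labels": ["docs"]},
]

print(group_by_fork_label(prs))
# {'hf-holocene': [('monorepo', 1), ('op-geth', 2)]}
```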

zhiqiangxu · 2024-11-20T10:29:31Z

zhiqiangxu
Nov 20, 2024

I think the auditing progress or plan for HFs is also of interest to lots of users.

0 replies
