SRE as an enabling team (contrary to the Google approach). Long live DevOps as a culture and an approach!
Running software in production:
- Design documents should include aspects of reliability,
- Production Readiness and Pre-mortems
- Runbooks
- Post-mortems
- SLA/SLO/SLI/uptime
- Error budget
- OnCall Duty
- Classification of incident/issue/bug
- Incident Management
Continuous (even the smallest) improvements.
A design document is not necessary for every feature; its main purpose is to get feedback from other engineers on how a given piece of functionality is going to be built.
Production Readiness - a checklist to verify that new code is production-ready (a small gate sketch follows the list):
- Do we have enough observability for onCall and for operating the component?
  - monitoring dashboards,
  - alerts,
  - logging,
  - training.
- A pre-mortem should be included as well: 2-3 scenarios for things that may go wrong,
- New onCallers will use this document to understand how the component works in production.
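As a trivial illustration of treating this checklist as a gate before handover to onCall (the item names mirror the list above; the script is a sketch, not an existing tool):

```python
# Illustrative production-readiness gate: every checklist item must be ticked
# before the component is handed over to onCall.
CHECKLIST = {
    "monitoring dashboards": True,
    "alerts": True,
    "logging": True,
    "onCall training": False,
    "pre-mortem (2-3 failure scenarios)": True,
}

missing = [item for item, done in CHECKLIST.items() if not done]
if missing:
    print("Not production ready, missing:", ", ".join(missing))
else:
    print("Production ready.")
```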
Best practices:
- Do hands-on exercises, dry runs, etc.
Types:
- manual
- semi-manual
- automatic
Tools:
- Ansible, a custom Kubernetes operator, manual instructions (with copy&paste commands)
- https://backstage.io/
- https://www.rundeck.com/open-source
- platforms for continuous deployment of IaC, e.g., Spacelift
- platforms for continuous deployment of software, e.g., gitlab
- docs/internal wiki, e.g., archbee or notion.so
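For the semi-manual type, a thin wrapper that prints each runbook step and asks the operator to confirm before executing it can be enough. A minimal sketch, assuming a kubectl-managed component (the steps and commands are illustrative):

```python
import subprocess

# Illustrative runbook: each step is a description plus the shell command to run.
RUNBOOK = [
    ("Check that the pods are healthy", "kubectl get pods -n my-service"),
    ("Restart the deployment", "kubectl rollout restart deployment/my-service -n my-service"),
    ("Watch the rollout until it finishes", "kubectl rollout status deployment/my-service -n my-service"),
]

def run_semi_manual(runbook):
    """Print every step, ask the operator to confirm, then execute it."""
    for i, (description, command) in enumerate(runbook, start=1):
        print(f"Step {i}: {description}\n  $ {command}")
        answer = input("Run this step? [y/N/q] ").strip().lower()
        if answer == "q":
            print("Aborting runbook.")
            return
        if answer != "y":
            print("Skipped.")
            continue
        subprocess.run(command, shell=True, check=False)

if __name__ == "__main__":
    run_semi_manual(RUNBOOK)
```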
Best practices:
- Have a template document for all your postmortems,
- Pick a facilitator, preferably a person who was not involved,
- Do the postmortem within 1 week (max 1.5),
- Do a blameless postmortem,
- Assign a responsible person to each of the action items (sketched below).
see: https://www.atlassian.com/incident-management/postmortem/blameless
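To keep action items from getting lost, one option is to make the owner a required field on every item; a minimal sketch (the field names are my own, not a standard template):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str   # every action item must have a responsible person
    due: date

@dataclass
class Postmortem:
    incident_title: str
    facilitator: str  # preferably someone not involved in the incident
    timeline: list = field(default_factory=list)      # blameless sequence of events
    root_causes: list = field(default_factory=list)
    action_items: list = field(default_factory=list)  # list of ActionItem

pm = Postmortem(incident_title="2024-01-01 API outage", facilitator="engineer from another team")
pm.action_items.append(ActionItem("Add an alert on queue depth", owner="team A", due=date(2024, 1, 15)))
```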
When your service works and when it does not:
Abbr | Name | Description |
---|---|---|
SLA | Service Level Agreement | what we promise to customer, possibly a penalty if not met |
SLO | Service Level Objective | what we promise to ourselves, what we see as our objective, SLO >= SLA |
SLI | Service Level Indicator | What we measure to say our system is available |
Example:
99.95% of well-formed queries correctly processed under 500 ms
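A minimal sketch of how such an SLI could be computed from raw counts (the counter names and numbers are illustrative):

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI = fraction of well-formed queries processed correctly under the latency threshold."""
    if total_events == 0:
        return 1.0  # no traffic: treating it as fully available is a policy choice
    return good_events / total_events

# Example: in the measurement window, 999_600 of 1_000_000 well-formed queries
# were processed correctly in under 500 ms.
sli = availability_sli(good_events=999_600, total_events=1_000_000)
SLO = 0.9995  # 99.95%, the objective from the example above
print(f"SLI = {sli:.4%}, SLO met: {sli >= SLO}")
```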
Remember:
Measure at the system boundaries and focus on the customer/consumer experience.
- Prometheus/Grafana
- Datadog / Newrelic
- statuscake
- Opsgenie
- https://cloud.google.com/blog/products/devops-sre/availability-part-deux-cre-life-lessons
- SRE fundamentals 2021: SLIs vs SLAs vs SLOs
- how to define SLI and SLO by newrelic
- https://www.atlassian.com/incident-management/kpis/sla-vs-slo-vs-sli
Goal: go fast but slow down when there is a tight corner.
- Shall we speed up? We did not have any incidents, so we can move faster.
- Shall we slow down? E.g., after two incidents, when we are close to violating or not delivering our SLA, we slow down.
We track SLA vs SLIs to tell us where we are.
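A minimal sketch of the error-budget arithmetic behind that speed-up/slow-down decision (the window, thresholds, and numbers are illustrative):

```python
def error_budget_report(slo: float, total_requests: int, failed_requests: int) -> None:
    """Compare consumed errors against the error budget allowed by the SLO."""
    allowed_failures = (1 - slo) * total_requests  # the whole error budget for the window
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    print(f"Error budget consumed: {consumed:.0%}")
    if consumed < 0.5:
        print("Plenty of budget left -> we can move faster (ship riskier changes).")
    elif consumed < 1.0:
        print("Budget is getting tight -> slow down, prioritise reliability work.")
    else:
        print("Budget exhausted / SLA at risk -> freeze risky changes, fix reliability first.")

# Example: a 30-day window, 10M requests, SLO of 99.95% -> 5,000 allowed failures.
error_budget_report(slo=0.9995, total_requests=10_000_000, failed_requests=3_200)
```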
Small / medium company:
- Level1
- Level2
- Whole engineering team
flowchart TD
a(alert) -- wakes up -->l1(L1)
l1 -- 20 min without ack / or triggered by L1 --> l2(L2)
l2 -- 20 min without ack / or triggered --> l3(Escalation)
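The same escalation chain expressed as data plus a tiny lookup; a sketch that mirrors the flowchart above (the exact semantics of your paging tool will differ):

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    target: str           # who gets paged at this step
    ack_timeout_min: int  # escalate to the next step after this many minutes without ACK

# Mirrors the flowchart above: alert -> L1 -> L2 -> whole engineering team.
POLICY = [
    EscalationStep("L1 on-caller", 20),
    EscalationStep("L2 on-caller", 20),
    EscalationStep("whole engineering team", 0),  # last step, no further escalation
]

def paged_now(minutes_without_ack: int) -> str:
    """Who should be paged, given how long the alert has gone unacknowledged."""
    deadline = 0
    for step in POLICY[:-1]:
        deadline += step.ack_timeout_min
        if minutes_without_ack < deadline:
            return step.target
    return POLICY[-1].target

assert paged_now(5) == "L1 on-caller"
assert paged_now(25) == "L2 on-caller"
assert paged_now(45) == "whole engineering team"
```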
Difference:
- time to ACK
- time to ACT
- time to inform customers, e.g., with the statuspage
- interval of updates on the progress
Important:
- Training
- Runbooks
- Dry runs or hands-on drills
- Resilience/failover/graceful degradation is a part of the feature design
- Implement the action items from post-mortems
Tools:
- Opsgenie
- Pagerduty
- Status.io or statuspage, e.g., https://spacelift.statuspage.io
see The Practice of Cloud System Administration, vol. 2.
Impact | Description |
---|---|
P1 | Business down situation or high financial impact, client unable to operate |
P2 | A major component of the clients' ability to operate is affected. Some aspects of the business can continue, but it's a major problem. |
P3 | The issue is affecting efficient operation of one or more people. Core functionality is not affected. |
P4 | An inconvenience or annoyance; a workaround exists |
- P1 - if detected, it needs to be addressed immediately
- P2 - if detected, addressing it might start during business hours
- P3
- P4
Commitments to customers (common in contracts with enterprises, sketched after the list):
- How fast they are notified about the incident,
- Time to stop the bleeding / find a workaround,
- Time to fix the issue,
- Response time to customer tickets
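These commitments typically differ per priority level; one way to keep them checkable is a small table keyed by priority. All numbers below are made-up placeholders, not values from any real contract:

```python
# Illustrative only: each priority maps to the commitments listed above (in hours).
# notify = time to notify the customer, workaround = time to stop the bleeding,
# fix = time to resolve, ticket_response = response time to customer tickets.
COMMITMENTS_HOURS = {
    "P1": {"notify": 0.5, "workaround": 4,    "fix": 24,   "ticket_response": 1},
    "P2": {"notify": 2,   "workaround": 24,   "fix": 72,   "ticket_response": 4},
    "P3": {"notify": 8,   "workaround": 72,   "fix": 240,  "ticket_response": 8},
    "P4": {"notify": 24,  "workaround": None, "fix": None, "ticket_response": 24},
}

def within_commitment(priority: str, metric: str, hours_elapsed: float) -> bool:
    """Check whether we are still within the contracted time for a given metric."""
    limit = COMMITMENTS_HOURS[priority][metric]
    return limit is None or hours_elapsed <= limit

print(within_commitment("P1", "workaround", hours_elapsed=3))  # True: still inside the 4h window
```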
Hint: here it helps to release certain functionality as alpha, as an RC, or behind feature flags to the most friendly customers, to be able to move faster.
Incident management skills and practices exist to channel the energies of enthusiastic individuals.
flowchart
I(Incident) -. Well defined process .->l
l(Incident\nCommander) --> H(Communication\n Manager)
l --> D(Direct\nContributor/s)
l --> C[[Communication\nChannel]]
Roles:
- Incident Commander (IC) - "the commander holds all positions that they have not delegated. If appropriate, they can remove roadblocks that prevent Ops from working most effectively" (SRE book). The IC has super powers.
- Communication Manager (CM) - manages communications, periodically updating the other teams, stakeholders, and clients (directly or indirectly) about the incident. The communication can be done by email, Slack, or a statuspage.
- Direct Contributor (DC) - a person or persons who work on solving the problem.
Procedure:
- Alert -> the onCaller or person on duty becomes the IC
- IC creates a channel to coordinate the incident (a call, slack channel, etc.)
- IC names CM and DCs
- CM handles the external communication and stakeholder management
- DCs work on the solution; the IC uses their super powers to ensure we can resolve the incident.
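A minimal sketch of the roles as a data structure, mirroring the procedure above (the names and the channel are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    title: str
    commander: str                   # IC: holds every role they have not delegated
    communication_manager: str = ""  # CM: updates teams, stakeholders, and clients
    direct_contributors: list = field(default_factory=list)  # DCs: work on the fix
    channel: str = ""                # where coordination happens (call, Slack channel, ...)

# The onCaller becomes the IC, opens a channel, then names the CM and DCs.
incident = Incident(title="API error rate spike", commander="onCaller (IC)")
incident.channel = "#inc-api-error-rate"        # illustrative coordination channel
incident.communication_manager = "teammate A"   # illustrative
incident.direct_contributors = ["teammate B", "teammate C"]
```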
After:
- The IC schedules a postmortem meeting and finds a facilitator
- (optional) DCs with the IC check which clients were affected
- The CM with the IC publishes the postmortem results to clients if needed
Best Practices:
- Prioritise: Stop the bleeding, restore service, and preserve the evidence for root-causing,
- Clear communication and hand-offs,
- see: the SRE books on managing incidents.
- Business Continuity Plan
- Disaster recovery (see RTO and RPO):
- RTO - Recovery Time Objective
- RPO - Recovery Point Objective
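A short sketch of what the two objectives mean in practice (the four-hour RTO and one-hour RPO are illustrative targets, not recommendations):

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=4)  # Recovery Time Objective: max acceptable time to restore service
RPO = timedelta(hours=1)  # Recovery Point Objective: max acceptable data loss, i.e. backup age

def disaster_recovery_ok(outage_start: datetime, service_restored: datetime,
                         last_backup: datetime) -> bool:
    """Did the recovery meet both objectives?"""
    recovery_time = service_restored - outage_start  # compared against RTO
    data_loss_window = outage_start - last_backup    # compared against RPO
    return recovery_time <= RTO and data_loss_window <= RPO

now = datetime(2024, 1, 1, 12, 0)
print(disaster_recovery_ok(outage_start=now,
                           service_restored=now + timedelta(hours=3),
                           last_backup=now - timedelta(minutes=30)))  # True: within both objectives
```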