SRE as an enabling team (contrary to the Google approach). Long live DevOps as a culture and an approach!
Running software in production:
- Design documents should include aspects of reliability,
- Production Readiness and Pre-mortems
- Runbooks
- Post-mortems
- SLA/SLO/SLI/uptime
- Error budget
- OnCall Duty
- Classification of incident/issue/bug
- Incident Management
Continuous (even the smallest) improvements.
A design document is not necessary for every feature; its main purpose is to get feedback from other engineers on how a given piece of functionality is going to be built.
Production Readiness - a checklist to verify that new code is production-ready (a small gate sketch follows the list):
- Do we have enough observability for onCall and for operating the component?
  - monitoring dashboards,
  - alerts,
  - logging,
  - training.
- A pre-mortem should be included as well: 2-3 scenarios for things that may go wrong,
- New onCallers will use this document to understand how the component works in production.
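As a trivial illustration of treating this checklist as a gate before handover to onCall (the item names mirror the list above; the script is a sketch, not an existing tool):

```python
# Illustrative production-readiness gate: every checklist item must be ticked
# before the component is handed over to onCall.
CHECKLIST = {
    "monitoring dashboards": True,
    "alerts": True,
    "logging": True,
    "onCall training": False,
    "pre-mortem (2-3 failure scenarios)": True,
}

missing = [item for item, done in CHECKLIST.items() if not done]
if missing:
    print("Not production ready, missing:", ", ".join(missing))
else:
    print("Production ready.")
```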
Best practices:
- Do hands-on exercises, dry runs, etc.
Types:
- manual
- semi-manual
- automatic
Tools:
- Ansible, a custom Kubernetes operator, manual instructions (with copy&paste commands)
- https://backstage.io/
- https://www.rundeck.com/open-source
- platforms for continuous deployment of IaC, e.g., Spacelift
- platforms for continuous deployment of software, e.g., gitlab
- docs/internal wiki, e.g., archbee or notion.so
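For the semi-manual type, a thin wrapper that prints each runbook step and asks the operator to confirm before executing it can be enough. A minimal sketch, assuming a kubectl-managed component (the steps and commands are illustrative):

```python
import subprocess

# Illustrative runbook: each step is a description plus the shell command to run.
RUNBOOK = [
    ("Check that the pods are healthy", "kubectl get pods -n my-service"),
    ("Restart the deployment", "kubectl rollout restart deployment/my-service -n my-service"),
    ("Watch the rollout until it finishes", "kubectl rollout status deployment/my-service -n my-service"),
]

def run_semi_manual(runbook):
    """Print every step, ask the operator to confirm, then execute it."""
    for i, (description, command) in enumerate(runbook, start=1):
        print(f"Step {i}: {description}\n  $ {command}")
        answer = input("Run this step? [y/N/q] ").strip().lower()
        if answer == "q":
            print("Aborting runbook.")
            return
        if answer != "y":
            print("Skipped.")
            continue
        subprocess.run(command, shell=True, check=False)

if __name__ == "__main__":
    run_semi_manual(RUNBOOK)
```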
Best practices:
- Have a template document for all your postmortems,
- Pick a facilitator, preferably a person who was not involved,
- Do the postmortem within 1 week (max 1.5),
- Do a blameless postmortem,
- Assign a responsible person to each of the action items (sketched below).
see: https://www.atlassian.com/incident-management/postmortem/blameless
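To keep action items from getting lost, one option is to make the owner a required field on every item; a minimal sketch (the field names are my own, not a standard template):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str   # every action item must have a responsible person
    due: date

@dataclass
class Postmortem:
    incident_title: str
    facilitator: str  # preferably someone not involved in the incident
    timeline: list = field(default_factory=list)      # blameless sequence of events
    root_causes: list = field(default_factory=list)
    action_items: list = field(default_factory=list)  # list of ActionItem

pm = Postmortem(incident_title="2024-01-01 API outage", facilitator="engineer from another team")
pm.action_items.append(ActionItem("Add an alert on queue depth", owner="team A", due=date(2024, 1, 15)))
```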
When your service works and when it does not:
Abbr | Name | Description |
---|---|---|
SLA | Service Level Agreement | what we promise to customer, possibly a penalty if not met |
SLO | Service Level Objective | what we promise to ourselves, what we see as our objective, SLO >= SLA |
SLI | Service Level Indicator | What we measure to say our system is available |
Example:
99.95% of well-formed queries correctly processed under 500 ms
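A minimal sketch of how such an SLI could be computed from raw counts (the counter names and numbers are illustrative):

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI = fraction of well-formed queries processed correctly under the latency threshold."""
    if total_events == 0:
        return 1.0  # no traffic: treating it as fully available is a policy choice
    return good_events / total_events

# Example: in the measurement window, 999_600 of 1_000_000 well-formed queries
# were processed correctly in under 500 ms.
sli = availability_sli(good_events=999_600, total_events=1_000_000)
SLO = 0.9995  # 99.95%, the objective from the example above
print(f"SLI = {sli:.4%}, SLO met: {sli >= SLO}")
```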
Remember:
Measure at the system boundaries and focus on the customer/consumer experience.
- Prometheus/Grafana
- Datadog / Newrelic
- statuscake
- Opsgenie
- https://cloud.google.com/blog/products/devops-sre/availability-part-deux-cre-life-lessons
- SRE fundamentals 2021: SLIs vs SLAs vs SLOs
- how to define SLI and SLO by newrelic
- https://www.atlassian.com/incident-management/kpis/sla-vs-slo-vs-sli
Goal: go fast but slow down when there is a tight corner.
- Shall we speed up? We did not have any incidents, so we can move faster.
- Shall we slow down? E.g., after two incidents, when we are close to violating or not delivering our SLA, we slow down.
We track SLA vs SLIs to tell us where we are.
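A minimal sketch of the error-budget arithmetic behind that speed-up/slow-down decision (the window, thresholds, and numbers are illustrative):

```python
def error_budget_report(slo: float, total_requests: int, failed_requests: int) -> None:
    """Compare consumed errors against the error budget allowed by the SLO."""
    allowed_failures = (1 - slo) * total_requests  # the whole error budget for the window
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    print(f"Error budget consumed: {consumed:.0%}")
    if consumed < 0.5:
        print("Plenty of budget left -> we can move faster (ship riskier changes).")
    elif consumed < 1.0:
        print("Budget is getting tight -> slow down, prioritise reliability work.")
    else:
        print("Budget exhausted / SLA at risk -> freeze risky changes, fix reliability first.")

# Example: a 30-day window, 10M requests, SLO of 99.95% -> 5,000 allowed failures.
error_budget_report(slo=0.9995, total_requests=10_000_000, failed_requests=3_200)
```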
Small / medium company:
- Level1
- Level2
- Whole engineering team
flowchart TD
a(alert) -- wakes up -->l1(L1)
l1 -- 20 min without ack / or triggered by L1 --> l2(L2)
l2 -- 20 min without ack / or triggered --> l3(Escalation)
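The same escalation chain expressed as data plus a tiny lookup; a sketch that mirrors the flowchart above (the exact semantics of your paging tool will differ):

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    target: str           # who gets paged at this step
    ack_timeout_min: int  # escalate to the next step after this many minutes without ACK

# Mirrors the flowchart above: alert -> L1 -> L2 -> whole engineering team.
POLICY = [
    EscalationStep("L1 on-caller", 20),
    EscalationStep("L2 on-caller", 20),
    EscalationStep("whole engineering team", 0),  # last step, no further escalation
]

def paged_now(minutes_without_ack: int) -> str:
    """Who should be paged, given how long the alert has gone unacknowledged."""
    deadline = 0
    for step in POLICY[:-1]:
        deadline += step.ack_timeout_min
        if minutes_without_ack < deadline:
            return step.target
    return POLICY[-1].target

assert paged_now(5) == "L1 on-caller"
assert paged_now(25) == "L2 on-caller"
assert paged_now(45) == "whole engineering team"
```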
Difference:
- time to ACK
- time to ACT
- time to inform customers, e.g., with the statuspage
- interval of updates on the progress
Important:
- Training
- Runbooks
- Dry runs or hands-on drills
- Resilience/failover/graceful degradation is a part of the feature design
- Implement the action items from post-mortems
Tools:
- Opsgenie
- Pagerduty
- Status.io or statuspage, e.g., https://spacelift.statuspage.io
see The Practice of Cloud System Administration, vol. 2.
Impact | Description |
---|---|
P1 | Business down situation or high financial impact, client unable to operate |
P2 | A major component of the clients' ability to operate is affected. Some aspects of the business can continue, but it's a major problem. |
P3 | The issue is affecting efficient operation of one or more people. Core functionality is not affected. |
P4 | An inconvenience or annoyance; a workaround exists |
- P1 - if detected, it needs to be addressed immediately
- P2 - if detected, addressing it might start during business hours
- P3
- P4
Commitments to customers (common in contracts with enterprises, sketched after the list):
- How fast they are notified about the incident,
- Time to stop the bleeding / find a workaround,
- Time to fix the issue,
- Response time to customer tickets
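These commitments typically differ per priority level; one way to keep them checkable is a small table keyed by priority. All numbers below are made-up placeholders, not values from any real contract:

```python
# Illustrative only: each priority maps to the commitments listed above (in hours).
# notify = time to notify the customer, workaround = time to stop the bleeding,
# fix = time to resolve, ticket_response = response time to customer tickets.
COMMITMENTS_HOURS = {
    "P1": {"notify": 0.5, "workaround": 4,    "fix": 24,   "ticket_response": 1},
    "P2": {"notify": 2,   "workaround": 24,   "fix": 72,   "ticket_response": 4},
    "P3": {"notify": 8,   "workaround": 72,   "fix": 240,  "ticket_response": 8},
    "P4": {"notify": 24,  "workaround": None, "fix": None, "ticket_response": 24},
}

def within_commitment(priority: str, metric: str, hours_elapsed: float) -> bool:
    """Check whether we are still within the contracted time for a given metric."""
    limit = COMMITMENTS_HOURS[priority][metric]
    return limit is None or hours_elapsed <= limit

print(within_commitment("P1", "workaround", hours_elapsed=3))  # True: still inside the 4h window
```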
Hint: here it helps to release certain functionality as alpha, as an RC, or behind feature flags to the most friendly customers, to be able to move faster.
Incident management skills and practices exist to channel the energies of enthusiastic individuals.
flowchart
I(Incident) -. Well defined process .->l
l(Incident\nCommander) --> H(Communication\n Manager)
l --> D(Direct\nContributor/s)
l --> C[[Communication\nChannel]]
Roles:
- Incident Commander (IC) - "the commander holds all positions that they have not delegated. If appropriate, they can remove roadblocks that prevent Ops from working most effectively" (SRE book). The IC has super powers.
- Communication Manager (CM) - manages communications, periodically updating the other teams, stakeholders, and clients (directly or indirectly) about the incident. The communication can be done by email, Slack, or a statuspage.
- Direct Contributor (DC) - a person or persons who work on solving the problem.
Procedure:
- Alert -> the onCaller or person on duty becomes the IC
- IC creates a channel to coordinate the incident (a call, slack channel, etc.)
- IC names CM and DCs
- CM handles the external communication and stakeholder management
- DCs work on the solution; the IC uses their super powers to ensure we can resolve the incident.
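A minimal sketch of the roles as a data structure, mirroring the procedure above (the names and the channel are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    title: str
    commander: str                   # IC: holds every role they have not delegated
    communication_manager: str = ""  # CM: updates teams, stakeholders, and clients
    direct_contributors: list = field(default_factory=list)  # DCs: work on the fix
    channel: str = ""                # where coordination happens (call, Slack channel, ...)

# The onCaller becomes the IC, opens a channel, then names the CM and DCs.
incident = Incident(title="API error rate spike", commander="onCaller (IC)")
incident.channel = "#inc-api-error-rate"        # illustrative coordination channel
incident.communication_manager = "teammate A"   # illustrative
incident.direct_contributors = ["teammate B", "teammate C"]
```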
After:
- The IC schedules a postmortem meeting and finds a facilitator
- (optional) DCs with the IC check which clients were affected
- The CM with the IC publishes the postmortem results to clients if needed
Best Practices:
- Prioritise: Stop the bleeding, restore service, and preserve the evidence for root-causing,
- Clear communication and hand-offs,
- see: the SRE books on managing incidents.
- Business Continuity Plan
- Disaster recovery (see RTO and RPO):
- RTO - Recovery Time Objective
- RPO - Recovery Point Objective
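A short sketch of what the two objectives mean in practice (the four-hour RTO and one-hour RPO are illustrative targets, not recommendations):

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=4)  # Recovery Time Objective: max acceptable time to restore service
RPO = timedelta(hours=1)  # Recovery Point Objective: max acceptable data loss, i.e. backup age

def disaster_recovery_ok(outage_start: datetime, service_restored: datetime,
                         last_backup: datetime) -> bool:
    """Did the recovery meet both objectives?"""
    recovery_time = service_restored - outage_start  # compared against RTO
    data_loss_window = outage_start - last_backup    # compared against RPO
    return recovery_time <= RTO and data_loss_window <= RPO

now = datetime(2024, 1, 1, 12, 0)
print(disaster_recovery_ok(outage_start=now,
                           service_restored=now + timedelta(hours=3),
                           last_backup=now - timedelta(minutes=30)))  # True: within both objectives
```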