Site Reliability Engineering (SRE) Learning Path (4 📚)

Risk management, SLAs, SLIs, SLOs, error budgets, toil, and outage postmortems are just a few of the concepts in a site reliability engineer's daily life. The focus of this path is to acquire the skills needed to keep systems alive and optimize their performance. To do that you need automation, observable systems, monitoring, service lifecycle management skills, and more. Major companies in the field apply interesting techniques to debug, fix, and prevent outages across hundreds of services used by millions of users simultaneously. Learn how they achieve such performance!

Order Cover Info Description
1 img Site Reliability Engineering: How Google Runs Production Systems
Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
Published in 2016
552 pages
🐕 📗 🆙
Born at the core of Google, Site Reliability Engineering has spread as a set of practices for running production systems, automating them, and deciding what to do when they fail. In this book the authors tell how those practices came to be, and how the company had to invent a new way to manage hundreds of services at the same time while keeping up the pace of feature releases in a bustling market. Automation, team organization, and the development of new systems that make managing services at scale possible are some of the things you will find in this book.
2 img The Site Reliability Workbook: Practical Ways to Implement SRE
Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, Stephen Thorne
Published in 2018
512 pages
🐕 📙
Site Reliability Engineering has been adopted by many organizations over the last few years to improve the operational aspects of running production services. This book is intended as a guide to help such organizations and their engineers understand the different practices and integrate them into their workflow.
3 img Observability Engineering: Achieving Production Excellence
Charity Majors, Liz Fong-Jones, George Miranda
Published in 2022
318 pages
🐕 📙
Every running system is bound to suffer an outage sooner or later. Once you are in the critical zone, you need to assess the state of your system as quickly as you can in order to repair it and minimize downtime. The authors show deep knowledge of how to make a system observable by exposing metrics and traces, how to store them, and of course how to use them to reflect the system state at every moment. Outages aside, observability is also useful for gaining insight into how you may want to improve your services or adapt them to new needs.
4 img Building Secure & Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems
Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield
Published in 2020
555 pages
🐅 📙
Once you are familiar with SRE basics and principles, this book will teach you how to build systems that aim for reliability and security. Inside you will find the principles to apply, the best practices to follow, and the patterns to implement when designing systems to make them stable, scalable, reliable, and secure.

The following paths are now open to you. Choose wisely:

Bonus quest: learn about these related concepts! 📍 🔰 💎

#risk-management #sla #sli #slo #error-budget #toil #post-mortems #cascading-failures


Last modified 2024-03-25
