Risk management, SLAs, SLIs, SLOs, error budgets, toil, and outage post-mortems are just a few of the concepts involved in a site reliability engineer's daily life. The focus of this path is to acquire the skills needed to keep systems alive and optimize their performance. To do that you need automation, observable systems, monitoring, service lifecycle management skills, and more. Major companies in the field apply interesting techniques to debug, fix, and prevent outages in hundreds of services consumed by millions of users simultaneously. Learn how they achieve such performance!
Order | Cover | Info | Description |
---|---|---|---|
1 | | Site Reliability Engineering: How Google Runs Production Systems. Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Murphy. Published in 2017. 552 pages. 🐕 📗 🆙 | Born in the core of Google, Site Reliability Engineering has spread as a set of practices for running production systems, automating them, and handling their failures. In this book the authors tell how those practices came to be, and how the company had to invent a new way to manage hundreds of services at the same time while keeping up the pace of feature releases in a buzzing market. Automation, team organization, and the development of new systems that make managing services at scale possible are some of the things you will find in this book. |
2 | | The Site Reliability Workbook: Practical Ways to Implement SRE. Betsy Beyer, Niall Murphy, David K. Rensin, Kent Kawahara, Stephen Thorne. Published in 2018. 512 pages. 🐕 📙 | Site Reliability Engineering has been adopted by many organizations in recent years to improve the operational aspects of running production services. This book is intended as a guide to help such organizations and their engineers understand the different practices and integrate them into their workflow. |
3 | | Observability Engineering: Achieving Production Excellence. Charity Majors, Liz Fong-Jones, George Miranda. Published in 2022. 318 pages. 🐕 📙 | Every running system is bound to suffer an outage sooner or later. Once you are in the critical zone, you need to assess the state of your system as quickly as you can to repair it and minimize downtime. The authors show deep knowledge of how to make a system observable by exposing metrics and traces, how to store them, and of course how to use them to reflect the system state at every moment. Aside from outages, observability also helps you gain insights into your services, whether to improve them or to adapt them to emerging needs. |
4 | | Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems. Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield. Published in 2020. 555 pages. 🐅 📙 | Once you are familiar with SRE basics and principles, this book will teach you how to build systems that aim for reliability and security. Inside you will find the principles to apply, the best practices to follow, and the patterns to implement when designing systems to make them stable, scalable, reliable, and secure. |
The following paths are open to you now. Choose wisely:
Bonus quest: learn about these related concepts! 📍 🔰 💎
#risk-management #sla #sli #slo #error-budget #toil #post-mortems #cascading-failures
Last modified 2024-03-25