
Site Reliability Engineering: How Google Runs Production Systems


The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient--lessons directly applicable to your organization. This book is divided into four sections:

Introduction--Learn what site reliability engineering is and why it differs from conventional IT industry practices
Principles--Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
Practices--Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems
Management--Explore Google's best practices for training, communication, and meetings that your organization can use




30 reviews for Site Reliability Engineering: How Google Runs Production Systems

  1. 4 out of 5

    Simon Eskildsen

    Much of the information on running production systems effectively from Google has been extremely important to how I have changed my thinking about the SRE role over the years—finally, there's one piece that has all of what you previously had to look long and hard for in various talks, papers and abstracts: error budgets, the SRE role definition, scaling, etc. That said, this book suffers from a classic problem of having too many authors write independent chapters. Much is repeated, and each chapter stands too much on its own—building from first principles each time, instead of leveraging the rest of the book. This makes the book much longer than it needs to be. Furthermore, it tries to be both technical and non-technical—this confuses the narrative of the book, and it ends up not excelling at either of them. I would love to see two books: SRE the technical parts, and SRE the non-technical parts. Overall, this book is still a goldmine of information worthy of a 5/5—but it is exactly that, a goldmine that you'll have to put a fair amount of effort into dissecting to retrieve the most value from, because the book's structure doesn't hand it to you—that's why we land at a 3/5. When recommending this book to coworkers, which I will, it will be chapters from the book—not the book at large.

  2. 4 out of 5

    Mircea

    Boring as F. The main message is: oh look at us, we have super hard problems and like saying 99.999% a lot. And oh yeah... SREs are developers. We don't spend more than 50% on "toil" work. Pleeeease. The book has some interesting stories, and if you are good at reading between the lines you might learn something. Everything else is BS. Does every chapter need to start by telling us who edited the chapter? I don't give a f. The book also seems to be the product of multiple individuals (a lot of them actually) whose sole connection is that they wrote a chapter for this book. F the reader, F structure, F focusing on the core of the issue. Let's just dump a stream-of-consciousness kind of junk and after that tell everyone how hard it is and how much we care about work-life balance. Again, boring, and in general you're gonna waste your time reading this (unless you want to know what Borg, Chubby and Bigtable are).

  3. 5 out of 5

    Michael Scott

    Site Reliability Engineering, or Google's claim to fame re: technology and concepts developed more than a decade ago by the grid computing community, is a collection of essays on the design and operation of large-scale datacenters, with the goal of making them simultaneously scalable, robust, and efficient. Overall, despite (willing?) ignorance of the history of distributed systems and in particular (grid) datacenter technology, this is an excellent book that teaches us how Google thinks (or used to think, a few years back) about its datacenters. If you're interested in this topic, you have to read this book. Period.
Structure
The book is divided into four main parts, each composed of several essays. Each essay is authored by what I assume is a Google engineer, and edited by one of Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. (I just hope that what I didn't like about the book can be attributed to the editors, because I really didn't like some stuff in here.) In Part I, Introduction, the authors introduce Google's Site Reliability Engineering (SRE) approach to managing global-scale IT services running in datacenters spread across the entire world. (Truly impressive achievement, no doubt about it!) After a discussion about how SRE is different from DevOps (another hot term of the day), this part introduces the core elements and requirements of SRE, which include the traditional Service Level Objectives (SLOs) and Service Level Agreements (SLAs), management of changing services and requirements, demand forecasting and capacity, provisioning and allocation, etc. Through a simple service, Shakespeare, the authors introduce the core concepts of running a workflow, which is essentially a collection of IT tasks that have inter-dependencies, in the datacenter. In Part II, Principles, the book focuses on operational and reliability risks, SLO and SLA management, the notion of toil (mundane work that scales linearly (why not super-linearly as well?!?!) with services, yet can be automated) and the need to eliminate it (through automation), how to monitor the complex system that is a datacenter, a process for automation as seen at Google, the notion of engineering releases, and, last, an essay on the need for simplicity. This rather disparate collection of notions is very useful, explained for the layman but still with enough technical content to be interesting even for the expert (practitioner or academic). In Parts III and IV, Practices and Management, respectively, the book discusses a variety of topics, from time-series analysis for anomaly detection, to the practice and management of people on-call, to various ways to prevent and address incidents occurring in the datacenter, to postmortems and root-cause analysis that could help prevent future disasters, to testing for reliability (a notoriously difficult issue), to software engineering in the SRE team, to load-balancing and overload management (resource management and scheduling 101), communication between SRE engineers,
etc., until the predictable call for everyone to use SRE as early as possible and as often as possible. Overall, palatable material, but spread too thin and with too much overlap with prior related work of a decade ago, especially academic, and not much new insight.
What I liked
I especially liked Part II, which in my view is one of the best introductions to datacenter management available today to the students of this and related topics (e.g., applied distributed systems, cloud computing, grid computing, etc.). Some of the topics addressed, such as risk and team practices, are rather new for many in the business. I liked the approach proposed in this book, which seemed to me above and beyond the current state of the art. Topics in reliability (correlated failures, root-cause analysis) and scheduling (overload management, load balancing, architectural issues, etc.) are currently open in both practice and academia, and this book emphasizes in my view the dearth of good solutions for all but the simplest of problems. Many of the issues related to automated monitoring and incident detection could lead in the future to better technology and much innovation, so I liked the prominence given to these topics in this book.
What I didn't like
I thoroughly disliked the statements claiming by omission that Google has invented most of the concepts presented in the book, which of course in the academic world would have been promptly sent to the reject pile. As an anecdote, consider the sentence "Ben Treynor Sloss, Google's VP for 24/7 Operations, originator of the term SRE, claims that reliability is the most fundamental feature of any product: a system isn't very useful if nobody can use it!" I'll skip the discussion about who is the originator of the term SRE, and focus on the meat of this statement. By omission, it makes the reader think that Google, through its Ben Treynor Sloss, is the first to understand the importance of reliability for datacenter-related systems. In fact, this has been long known in the grid computing community. I found in just a few minutes explicit references from Geoffrey Fox (in 2005, on page 317 of yet another grid computing anthology, "service considers reliable delivery to be more important than timely delivery") and Alexandru Iosup (in 2007, on page 5 of this presentation, and again in 2009, in this course, "In today's grids, reliability is more important than performance!"). Of course, this notion has been explored for the general case of services much earlier... anyone familiar with air and especially space flight? The list of concepts actually not invented at Goog but about which the book implies the contrary goes on and on... I also did not like some of the exaggerated claims of having found solutions for the general problems. Much remains to be done, as hiring at Google in these areas continues unabated. (There's also something called computer science, whose state of the art indicates the same.)

  4. 5 out of 5

    Dimitrios Zorbas

    I have so many bookmarks in this book and consider it an invaluable read. While not every project / company needs to operate at Google scale, it helps to streamline the process of defining SLOs / SLAs for the occasion and to establish communication channels and practices to achieve them. It helped me wrap my head around concepts for which I used to rely on intuition. I've shaped processes and created template documents (postmortem / launch coordination checklist) for work based on this book.

  5. 5 out of 5

    Michael Koltsov

    I don't normally buy paper books, which means that in the course of the last few years I've bought only one paper book even though I've read hundreds of books during that period of time. This book is the second one I've bought so far, which means a lot to me. Not to mention that Google provides it on the Internet free of charge. For me, personally, this book is a basis on which a lot of my past assumptions could be argued as viable solutions at the scale of Google. This book is not revealing any of Google's secrets (do they really have any secrets?), but it's a great start even if you don't need the scale of Google and simply want to write robust and failure-resilient apps. Technical solutions, dealing with user-facing issues, finding peers, on-call support, post-mortems, incident-tracking systems – this book has it all, though, as the chapters have been written by different people, some aspects are more emphasized than others. I wish some of the chapters had more gory production-based details than they do now. My score is 5/5.

  6. 4 out of 5

    Alexander Yakushev

    This book is great on multiple levels. First of all, it packs great content — a detailed explanation of how and why Google has internally established what we now call "the DevOps culture." The rationale, coupled with a hands-on implementation guide, provides incredible insight into creating and running an SRE team in your own company. The text quality is top-notch; the book is written with clarity in mind and thoroughly edited. I'd rate the content itself at four stars. But the book deserves the fifth star because it is a superb example of material that gives you a precise understanding of how a company (or one of its divisions) operates inside. Apparently, Google can afford to expose such secrets while not many other companies can, but we need more low-BS, to-the-point books like this to share and exchange the experience of running the most complex systems (that is, human organizations) efficiently.

  7. 4 out of 5

    James Stewart

    Loads of interesting ideas and thoughts, but a bit of a slog to get through. The approach of having different members of the team write different sections probably worked really well for engaging everyone, but it made for quite a bit of repetition. It also ends up feeling like a few books rolled into one, with one on distributed systems design, another on SRE culture and practices, and maybe another on management.

  8. 4 out of 5

    Alex Palcuie

    I think this is the best engineering book in the last decade.

  9. 5 out of 5

    Regis Hattori

    This book is divided into five parts: Introduction, Principles, Practices, Management, and Conclusions. I see a lot of value in the first two parts for anyone involved in software development. It convinces us of the importance of the subject with very good arguments, no matter whether you are a software engineer, a product manager or even a user. This part deserves 5 stars. After some chapters of the Practices part, the conclusion I reached is that this part of the book may only be useful if you are facing a specific problem and are looking for some insights, not as something to read end-to-end. Some examples are too specific to Google or similar companies; most others don't have the same budget, skills, and prerequisites. In general, 3 stars is fair, but I will rate it 4 because I really liked the first 2 parts.

  10. 4 out of 5

    Tomas Varaneckas

    This was a really hard read, in a bad sense. The first couple of dozen pages were really promising, but the book turned out to be an unnecessarily long, incredibly boring, repetitive and inconsistent gang bang of random blog posts and often trivial information. It has roughly 10% of valuable content, and would greatly benefit from being reduced to a 50-pager. In its current state it seems that it was a corporate collaborative ego-trip, to show potential employees how cool Google SRE is, and how majestic their scale happens to be. After reading this book, I am absolutely sure I would never ever want to work for Google.

  11. 5 out of 5

    Chris

    There's a ton of great information here, and we refer to it regularly as we're trying to change the culture at work. I gave it a 4 instead of a 5 because it does suffer a little from the style – think collection of essays rather than a unified arc – but it's really worth reading even if it requires some care to transfer to more usual environments.

  12. 4 out of 5

    Bjoern Rochel

    A little disclaimer: my review here is more about the concept and organizational parts than the pure technical aspects, mostly because I manage engineering teams nowadays and these areas are the more important ones for me. This book also contains a lot of technical information on how to implement SRE that I would highly recommend to interested software engineers. One aspect I liked in particular about SRE is the Error Budget concept, Google's way to manage the age-old conflict between product and engineering on how to distribute development efforts between non-functional requirements (and especially technical debt) on one side and new features on the other side. The data-driven approach and the consequent depersonalization of this debate seem very sane and professional to me. I also liked their emphasis on training, simulation and careful on-boarding for SREs. For me this is still an area where the majority of the industry has plenty of room for improvement. Looking at what Google does here makes the rest of us look like f***ing amateurs. Another thing that I'm almost guaranteed to steal is the idea of establishing a Production Readiness Review to ensure reliability of new products and features from multiple angles (design, security, capacity, etc.). What I'm still trying to wrap my head around is whether having dedicated SRE teams is a good idea (in contrast to a you-build-it-you-run-it approach where every delivery team effectively owns the responsibility to reach the defined SLAs/SLOs). A principle that I like a lot is to give engineers a lot of freedom but to also make them accountable for their decisions and the software they produce. Separating out production-fitness into a separate group/team sounds like it goes in the opposite direction. I can imagine that several factors play into this (standardization, active tech/stack management, skill availability, etc.) and certainly Google has carefully evolved it to where it is now, but my initial reaction to this idea was negative. Overall a very good resource that I will come back to.

  13. 5 out of 5

    Liviu Costea

    A lot of food for thought, from a book that became a reference in the field. The only problem is the wide coverage: you might find some chapters very niche, as not everybody cares how to build a layer 4 load balancer. Highly recommended if you are following DevOps approaches.

  14. 4 out of 5

    Vít Listík

    I like the fact that it is written by multiple authors. Everything stated in the book seems so obvious but it is so sad to read it because it is not yet an industry standard. A must read for every SRE.

  15. 5 out of 5

    Amir Sarabadani

    It's basically a looong advertisement for Google with some useful information inside, while it should be the other way around.

  16. 4 out of 5

    Ahmad hosseini

    What is SRE? Site Reliability Engineering (SRE) is Google's approach to service management. An SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s). Typical SRE activities fall into the following approximate categories:
    • Software engineering: Involves writing or modifying code, in addition to any associated design and documentation work.
    • System engineering: Involves configuring production systems, modifying configurations, or documenting systems in a way that produces lasting improvements from a one-time effort.
    • Toil: Work directly tied to running a service that is repetitive, manual, etc.
    • Overhead: Administrative work not tied directly to running a service.
    Quotes
    "Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn't work." – Brian Redman
    "Ways in which things go right are special cases of the ways in which things go wrong." – John Allspaw
    About the book
    This book is a series of essays written by members and alumni of Google's Site Reliability Engineering organization. It's much more like conference proceedings than it is like a standard book by an author or a small number of authors. Each chapter is intended to be read as a part of a coherent whole, but a good deal can be gained by reading on whatever subject particularly interests you.
    "Essential reading for anyone running highly available web services at scale." – Adrian Cockcroft, Battery Ventures, former Netflix Cloud Architect

  17. 4 out of 5

    Luke Amdor

    Some really great chapters especially towards the beginning and the end. However, I feel like it could have been edited better. It meanders a lot.

  18. 5 out of 5

    David

    The book seems largely to be a collection of essays written by disparate people within Google's SRE organization. It's as well-organized and coherent as that can be (and I think it's a good format for this -- far better than if they'd tried to create something with a more unified narrative). But it's very uneven: some chapters are terrific while some seem rather empty. I found the chapters on risk, load balancing, overload, distributed consensus, and (surprisingly) launches to be among the most useful. On the other hand, the chapter on simplicity was indeed simplistic, and the chapter on data integrity was (surprisingly) disappointing. The good: there's a lot of excellent information in this book. It's a comprehensive, thoughtful overview for anybody entering the world of distributed systems, cloud infrastructure, or network services. Despite a few misgivings, I'm pretty on board with Google's approach to SRE. It's a very thoughtful approach to the problems of operating production services, covering topics ranging from time management, prioritization, and onboarding to all the technical challenges in distributed systems. The bad: the book gets religious (about Google) at times, and some of it's pretty smug. This isn't a big deal, but it's likely to turn off people who've seen from experience how frustrating and unproductive it can be when good ideas about building systems become religion.

  19. 4 out of 5

    Scott Maclellan

    A fantastic and in-depth resource. Great for going deeper and maturing how a company builds and runs software at scale. Touches on the specific tactical actions your team can take to build more reliable products. The extended sections on culture slowed me down a lot, but have led to some very interesting conversations at work.

  20. 4 out of 5

    Tadas Talaikis

    "Boring" (at least from the outside world perspective, ok with me), basically can be much shorter. Culture, automation of everything, load balancing, monitoring, like everywhere else, except maybe Borg thing. "Boring" (at least from the outside world perspective, ok with me), basically can be much shorter. Culture, automation of everything, load balancing, monitoring, like everywhere else, except maybe Borg thing.

  21. 4 out of 5

    Luca

    There's interesting content for sure. But the writing isn't engaging (the book is long, so that becomes boring kinda fast) and some aspects of the Google culture are really creepy (best example: "humans are imperfect machines" while talking about people management...).

  22. 5 out of 5

    Mengyi

    This is a complete collection of everything about building an SRE team, from their practices to how to onboard a new SRE onto the team. I am personally really inspired by the concept of the error budget and the share-by-default culture fostered by practices such as the blameless postmortem.

  23. 5 out of 5

    David Robillard

    A must read for anyone involved with online services.

  24. 5 out of 5

    Gary Boland

    A useful checklist for production engineering is tarnished by the undercurrent of marketing/recruiting. Still deserves its place on the shelf if you deliver software for a living

  25. 5 out of 5

    Sundarraj Kaushik

    A wonderful book to learn how to manage websites so that they are reliable. Some good random extracts from the book.
Site Reliability Engineering
1. Operations personnel should spend 50% of their time in writing automation scripts and programs.
2. The decision to stop releases for the remainder of the quarter once an error budget is depleted (a small illustrative sketch of this error-budget gating appears at the end of these extracts).
3. An SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
4. Codified rules of engagement and principles for how SRE teams interact with their environment—not only the production environment, but also the product development teams, the testing teams, the users, and so on.
5. Operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.
6. There are three kinds of valid monitoring output: Alerts: Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation. Tickets: Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result. Logging: No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
7. Resource use is a function of demand (load), capacity, and software efficiency. SREs predict demand, provision capacity, and can modify the software. These three factors are a large part (though not the entirety) of a service's efficiency.
SLI - Service Level Indicator - Indicators used to measure the health of a service. Used to determine the SLO and SLA.
SLO - Service Level Objective - The objective that must be met by the service.
SLA - Service Level Agreement - The agreement with the client with respect to the services rendered to them.
Don't overachieve
Users build on the reality of what you offer, rather than what you say you'll supply, particularly for infrastructure services. If your service's actual performance is much better than its stated SLO, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google's Chubby service introduced planned outages in response to being overly available), throttling some requests, or designing the system so that it isn't faster under light loads.
"If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow."
Four Golden Signals of Monitoring
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
Latency: The time it takes to service a request.
It's important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it's important to track error latency, as opposed to just filtering out errors.
Traffic: A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.
Errors: The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error"). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you're serving the wrong content.
Saturation: How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential. In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., "Give me a nonce" or "I need a globally unique monotonic integer") that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation. Finally, saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its hard drive in 4 hours."
If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring.
Why is it important to have control over the software one is using? Why and when does it make sense to roll out one's own framework and/or platform?
Another argument in favor of automation, particularly in the case of Google, is our complicated yet surprisingly uniform production environment, described in The Production Environment at Google, from the Viewpoint of an SRE. While other organizations might have an important piece of equipment without a readily accessible API, software for which no source code is available, or another impediment to complete control over production operations, Google generally avoids such scenarios. We have built APIs for systems when no API was available from the vendor. Even though purchasing software for a particular task would have been much cheaper in the short term, we chose to write our own solutions, because doing so produced APIs with the potential for much greater long-term benefits. We spent a lot of time overcoming obstacles to automatic system management, and then resolutely developed that automatic system management itself. Given how Google manages its source code, the availability of that code for more or less any system that SRE touches also means that our mission to "own the product in production" is much easier because we control the entirety of the stack.
When developed in-house, the platform/framework can be designed to manage any failures automatically. There is no external observer required to manage this. One of the negatives of automation is that humans forget how to do a task when required. This may not always be good.
Google cherry picks features for release. Should we do the same?
"All code is checked into the main branch of the source code tree (mainline). However, most major projects don't release directly from the mainline. Instead, we branch from the mainline at a specific revision and never merge changes from the branch back into the mainline. Bug fixes are submitted to the mainline and then cherry picked into the branch for inclusion in the release. This practice avoids inadvertently picking up unrelated changes submitted to the mainline since the original build occurred. Using this branch and cherry pick method, we know the exact contents of each release."
Note that the cherry picking is into specific release branches; changes are never merged from the branch back into the mainline.
Surprises vs. boring
"Unlike just about everything else in life, "boring" is actually a positive attribute when it comes to software! We don't want our programs to be spontaneous and interesting; we want them to stick to the script and predictably accomplish their business goals. In the words of Google engineer Robert Muth, "Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code." Surprises in production are the nemeses of SRE."
Commenting or flagging code
"Because engineers are human beings who often form an emotional attachment to their creations, confrontations over large-scale purges of the source tree are not uncommon. Some might protest, "What if we need that code later?" "Why don't we just comment the code out so we can easily add it again later?" or "Why don't we gate the code with a flag instead of deleting it?" These are all terrible suggestions.
Source control systems make it easy to reverse changes, whereas hundreds of lines of commented code create distractions and confusion (especially as the source files continue to evolve), and code that is never executed, gated by a flag that is always disabled, is a metaphorical time bomb waiting to explode, as painfully experienced by Knight Capital, for example (see "Order In the Matter of Knight Capital Americas LLC" [Sec13])."
Writing a blameless RCA
Pointing fingers: "We need to rewrite the entire complicated backend system! It's been breaking weekly for the last three quarters and I'm sure we're all tired of fixing things onesy-twosy. Seriously, if I get paged one more time I'll rewrite it myself…"
Blameless: "An action item to rewrite the entire backend system might actually prevent these annoying pages from continuing to happen, and the maintenance manual for this version is quite long and really difficult to be fully trained up on. I'm sure our future on-callers will thank us!"
Establishing a strong testing culture
One way to establish a strong testing culture is to start documenting all reported bugs as test cases. If every bug is converted into a test, each test is supposed to initially fail because the bug hasn't yet been fixed. As engineers fix the bugs, the software passes testing and you're on the road to developing a comprehensive regression test suite.
Project vs. Support
Dedicated, noninterrupted, project work time is essential to any software development effort. Dedicated project time is necessary to enable progress on a project, because it's nearly impossible to write code—much less to concentrate on larger, more impactful projects—when you're thrashing between several tasks in the course of an hour. Therefore, the ability to work on a software project without interrupts is often an attractive reason for engineers to begin working on a development project. Such time must be aggressively defended.
Managing Loads
Round Robin vs. Weighted Round Robin (Round Robin, but taking into consideration the number of tasks pending at the server). Overload of the system has to be avoided by use of load testing. If despite this the system is overloaded, then any retries have to be well controlled. A retry at a higher level can cascade into retries at the lower level. Use jittered retries (retry at random intervals) and exponential backoff (exponentially increase the time between the retries) and fail quickly to prevent overload on the already overloaded system. (A minimal sketch of this retry pattern appears after the launch checklist below.) If queuing is used to prevent overloading of the server, then sometimes FIFO may not be a good option, as the user waiting for the tasks at the head of the queue may have left the system, not expecting a response. If a task is split into multiple pipelined tasks, then it is good to check at each stage whether there is sufficient time for performing the rest of the tasks, based on the expected time that will be taken by the remaining tasks in the pipeline. Implement deadline propagation.
Safeguarding the data
Three levels of guard against data loss:
1. Soft delete (visible to the user in the recycle bin).
2. Back up (incremental and full) before actual deletion and test the ability to restore. Replicate live and backed-up data.
3. Purge data (can be recovered only from backup now).
Out-of-band data validation to prevent surprising data loss. It is important to:
1. Continuously test the recovery process as part of your normal operations.
2. Set up alerts that fire when a recovery process fails to provide a heartbeat indication of its success.
Launch Coordination Checklist
This is Google's original Launch Coordination Checklist, circa 2005, slightly abridged for brevity:
1. Architecture: Architecture sketch, types of servers, types of requests from clients
2. Programmatic client requests
3. Machines and datacenters
4. Machines and bandwidth, datacenters, N+2 redundancy, network QoS
5. New domain names, DNS load balancing
6. Volume estimates, capacity, and performance
7. HTTP traffic and bandwidth estimates, launch "spike," traffic mix, 6 months out
8. Load test, end-to-end test, capacity per datacenter at max latency
9. Impact on other services we care most about
10. Storage capacity
11. System reliability and failover: What happens when a machine dies, a rack fails, or a cluster goes offline; when the network fails between two datacenters. For each type of server that talks to other servers (its backends): how to detect when backends die, and what to do when they die; how to terminate or restart without affecting clients or users; load balancing, rate-limiting, timeout, retry and error handling behavior; data backup/restore, disaster recovery
12. Monitoring and server management: Monitoring internal state, monitoring end-to-end behavior, managing alerts; monitoring the monitoring; financially important alerts and logs; tips for running servers within a cluster environment; don't crash mail servers by sending yourself email alerts in your own server code
13. Security: Security design review, security code audit, spam risk, authentication, SSL; prelaunch visibility/access control, various types of blacklists
14. Automation and manual tasks: Methods and change control to update servers, data, and configs; release process, repeatable builds, canaries under live traffic, staged rollouts
15. Growth issues: Spare capacity, 10x growth, growth alerts; scalability bottlenecks, linear scaling, scaling with hardware, changes needed; caching, data sharding/resharding
16. External dependencies: Third-party systems, monitoring, networking, traffic volume, launch spikes; graceful degradation, how to avoid accidentally overrunning third-party services; playing nice with syndicated partners, mail systems, services within Google
17. Schedule and rollout planning: Hard deadlines, external events, Mondays or Fridays; standard operating procedures for this service, for other services
As mentioned, you might encounter responses such as "Why me?" This response is especially likely when a team believes that the postmortem process is retaliatory. This attitude comes from subscribing to the Bad Apple Theory: the system is working fine, and if we get rid of all the bad apples and their mistakes, the system will continue to be fine. The Bad Apple Theory is demonstrably false, as shown by evidence [Dek14] from several disciplines, including airline safety. You should point out this falsity. The most effective phrasing for a postmortem is to say, "Mistakes are inevitable in any system with multiple subtle interactions. You were on-call, and I trust you to make the right decisions with the right information. I'd like you to write down what you were thinking at each point in time, so that we can find out where the system misled you, and where the cognitive demands were too high."
"The best designs and the best implementations result from the joint concerns of production and the product being met in an atmosphere of mutual respect."
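The retry guidance quoted above (control retries, add jitter, back off exponentially, fail quickly, and propagate deadlines) could look roughly like the following minimal Python sketch. All function and parameter names here are illustrative assumptions, not code from the book or from Google.

```python
import random
import time


def call_with_retries(operation, deadline_s, max_attempts=5,
                      base_delay_s=0.1, max_delay_s=5.0):
    """Retry an idempotent operation with capped exponential backoff and full jitter.

    The caller's deadline is propagated to each attempt and checked before
    sleeping, so an already overloaded backend is not hammered with retries
    that can no longer produce a useful answer. Illustrative sketch only.
    """
    start = time.monotonic()
    last_error = None
    for attempt in range(max_attempts):
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            break  # fail fast: the overall deadline has passed
        try:
            # Deadline propagation: hand the remaining time budget to the callee.
            return operation(timeout=remaining)
        except (ConnectionError, TimeoutError) as err:
            last_error = err
            # Exponential backoff capped at max_delay_s, with full jitter to
            # avoid synchronized retry storms against an overloaded server.
            delay = random.uniform(0, min(max_delay_s, base_delay_s * 2 ** attempt))
            if (time.monotonic() - start) + delay >= deadline_s:
                break  # no point sleeping past the deadline
            time.sleep(delay)
    raise TimeoutError("request failed within its deadline") from last_error
```

A real client would also distinguish retriable from non-retriable errors and respect any server-side pushback; the point here is only the jittered, deadline-aware backoff described in the extract.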
Postmortem Culture
Corrective and preventative action (CAPA) is a well-known concept for improving reliability that focuses on the systematic investigation of root causes of identified issues or risks in order to prevent recurrence. This principle is embodied by SRE's strong culture of blameless postmortems. When something goes wrong (and given the scale, complexity, and rapid rate of change at Google, something inevitably will go wrong), it's important to evaluate all of the following:
What happened
The effectiveness of the response
What we would do differently next time
What actions will be taken to make sure a particular incident doesn't happen again
This exercise is undertaken without pointing fingers at any individual. Instead of assigning blame, it is far more important to figure out what went wrong, and how, as an organization, we will rally to ensure it doesn't happen again. Dwelling on who might have caused the outage is counterproductive. Postmortems are conducted after incidents and published across SRE teams so that all can benefit from the lessons learned. Decisions should be informed rather than prescriptive, and are made without deference to personal opinions—even that of the most senior person in the room, whose opinion Eric Schmidt and Jonathan Rosenberg dub the "HiPPO," for "Highest-Paid Person's Opinion."
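As a rough illustration of the error-budget gating mentioned in these extracts (stop releases for the remainder of the quarter once the budget is depleted), here is a minimal Python sketch. The 99.9% SLO, the request counts, and the helper names are made-up assumptions for the example, not Google's actual policy or tooling.

```python
def error_budget_remaining(slo_availability, total_requests, failed_requests):
    """Fraction of the error budget left for the current window.

    With a 99.9% availability SLO, the error budget is 0.1% of all requests;
    a value <= 0.0 means the budget is exhausted. Illustrative sketch only.
    """
    allowed_failures = (1.0 - slo_availability) * total_requests
    if allowed_failures <= 0:
        return 0.0  # a 100% SLO leaves no budget at all
    return 1.0 - (failed_requests / allowed_failures)


def releases_allowed(slo_availability, total_requests, failed_requests):
    """Gate feature releases on the remaining error budget."""
    return error_budget_remaining(slo_availability, total_requests, failed_requests) > 0.0


# Example: 10,000,000 requests this quarter at a 99.9% SLO allows 10,000 failures.
# 7,500 failures so far leaves 25% of the budget, so releases may proceed;
# at 10,001 failures the budget is spent and releases would be halted.
print(releases_allowed(0.999, 10_000_000, 7_500))   # True
print(releases_allowed(0.999, 10_000_000, 10_001))  # False
```

In practice the measurement window, burn-rate alerting, and the exact release policy are negotiated between product and SRE teams; the arithmetic above is only the core of the idea.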

  26. 5 out of 5

    Jeremy

    This is the kind of book that can be quite hard to digest in one go, cover to cover. It took me more than two years to (casually) read it! Of course, not everything can be applied everywhere. Not every organization is the size of Google, or has the same amount of resources to apply the principles. Still, there is good advice in the book which can come in handy in many situations.

  27. 5 out of 5

    Amr

    The book is great in terms of getting more understanding of Google's SRE culture. But I got to a point where it became irrelevant for me to continue the book, so I decided to drop it.

  28. 4 out of 5

    Mark Hillick

    Having worked in tech for many years, at a fairly large scale but not Google-scale, I'm probably the ideal audience for this book. I've never had so many bookmarks or so much follow-up reading from a book; awesome knowledge-sharing from the Google SRE team. Although this book itself is not overly technical, its subject is very technical and this book is undoubtedly well worth reading for all engineers, even if you don't operate at scale. You can learn what works and what doesn't, and then incorporate the various best practices, and possibly technologies, into your day job (in a controlled fashion with a clear strategy). There are so many good things that are worth calling out about this book; a short summary of highlights would include:
    - What makes a good SRE, and it ain't all technical
    - Everything about toil, what it is and why it is bad, particularly for the team's health, its success and the individual growth of team members
    - How to successfully build an SRE team (discipline), engage/embed with other teams, and bring in new team members
    - The links to further reading or external papers, especially when the book didn't have enough space to dive into things technically (e.g. the Maglev load balancer)
    - I love that the book called out burnout, ensuring that team members still do tedious but necessary work, while still having time to take a break and ensuring they can have dedicated blocks for more interesting or project work
    - The templates for on-call, triage, incident response, and postmortems are excellent (I love that they called out the "no-blame" approach and the usefulness of checklists)
    Some things I'd have liked to see:
    - Better flow in the earlier sections (2 & 3), particularly around alerting and monitoring. At times, reading was a drag here.
    - There's often repetition, probably caused by the changes in authors, with "first principles" suffering the most from it (removing it would've clearly reduced the length of the book and made it easier to read)
    - At times, I felt there could have been more detail and meat to some of the internal tools and incidents, though the fact that Google have published this book in the first place and been honest that they've screwed up at times is amazing and quite unique in the tech industry.
    I want to recommend this book to colleagues but I will probably recommend specific chapters as opposed to the whole book, due mainly to the repetition mentioned above. Lastly, I work in InfoSec and I sincerely hope those in InfoSec read this book in order to understand how the SRE team came into existence at Google and became such a success that they have to turn Product Development teams away when asked for 100% engagement support. Sadly many InfoSec teams are in an echo chamber in their corner as their company scales.

  29. 5 out of 5

    Tim O'Hearn

    "Perfect algorithms may not have perfect implementations." And perfect books may not have perfect writers. Site Reliability Engineering is an essay collection that can be rickety at times but is steadfast in its central thesis. Google can claim credit for inventing Site Reliability Engineering and, in this book, a bunch of noteworthy engineers share their wisdom from the trenches. When it comes to software architecture and product development, I've found delight in reading about how startups' products are built because the stories are digestible. It's possible for a founder, lead engineer, or technical writer to lay down the blueprint of a small-scale product and even get into the nuts and bolts. When it comes to large tech companies, this is impossible from a technical point of view and improbable from a compliance standpoint. This is beside the purpose of the book, but arrangements like this one help bridge the gap between one's imagination and the inner-workings of tech giants. There are plenty of (good!) books that tell you all about how Google the business works, but this one happens to be the best insight into how the engineering side operates. Sure, you have to connect some dots and bring with you some experience, but the result is priceless--you start to feel like you get it. The essays are almost all useful. If you haven't spent at least an internship's worth of time in the workforce, you should probably table this one until you have a bit more experience. I would have enjoyed this book as an undergraduate, no doubt, but most of it wouldn't have clicked. The Practices section--really, the meat of the book--is where the uninitiated might struggle. When I emerged on the other side I had a list of at least twenty topics that I needed to explore in more detail if I was to become truly great at what I do. I highly recommend this book to anyone on the SRE/DevOps spectrum as well as those trying to understand large-scale tech companies as a whole. See this review and others on my blog

  30. 5 out of 5

    Moses

    When I started working on software infrastructure at large companies, I was struck by how little of what I was working on had been covered in school, and how little I could find in academia. Talking to friends in industry, many of us were facing the same problems, but there didn't seem to be any literature on what we were doing. Everything we learned, we learned either through the school of hard knocks, or from more experienced folks. This book fills a much-needed gap. Furthermore, since many companies have evolved their processes in silos, even engineers who already have a pretty good idea of how to increase 9s will learn something new, since Google's history has probably led them down a different evolutionary path than what your company followed. Because of this, I hope that folks don't consider the matter of reliability open and shut now that this book has come out. In truth, this book is in many ways a history book about how Google handles reliability, and is not the end-all, be-all of reliability in distributed systems. This book is a good starting place, but not all of their practices or ideas are right for all systems, and we should remember that we're in a nascent field, and there's still work to be done. With that said, this book comes with the same problems that many books that are collections of essays have. There isn't a cohesive narrative, it often repeats itself, and the essays are uneven. Some of them are radiant, and some of them are not. Even considering the flaws of this book, I highly recommend it for anyone who is trying to make distributed systems reliable within a large engineering organization.
