Debugging Incidents in Google's Distributed Systems
Google has published two books about Site Reliability Engineering (SRE) principles, best practices, and practical applications [1, 2]. In the heat of the moment when handling a production incident, however, a team's actual response and debugging approaches often differ from ideal best practices. This article covers the outcomes of research performed in 2019 on how engineers at Google debug production issues, including the types of tools, high-level strategies, and low-level tasks that engineers combine in varying ways to debug effectively. It describes the research approach used to capture data, summarizes the common engineering journeys for production investigations, and shares examples of how experts debug complex distributed systems. Finally, the article generalizes the Google-specific findings of this research into practical strategies that you can apply in your organization.

When this study began, its focus was on developing an empirical understanding of the debugging process, with the overarching goal of creating product solutions that meet the needs of Google engineers. We wanted to capture the data that engineers need when debugging, when they need it, the communication process among the teams involved, and the types of mitigations that are successful.
Sep-24-2020, 04:11:01 GMT