
The Best DevOps Conference of 2018!


Only 57 seats left as of August 9, 2018. The complexity of managing and delivering the high level of reliability expected of web-based, cloud-hosted systems today, together with the expectation of continuous delivery of new features, has led to the evolution of a new field of service reliability engineering tailored to such systems. Google, a pioneer in this field, calls it Site Reliability Engineering (SRE). While it would be more aptly named Service Reliability Engineering, the name has caught on. The seminal work documenting Google's approach and practices is Google's book of the same name (commonly referred to as the 'SRE book'), which has become the de facto standard on how to adopt SRE in an organization. This session will cover adopting SRE as a practice in organizations that are also adopting DevOps, address the challenges that large traditional enterprises face in adopting SRE, and explain how to overcome them.

Google - Site Reliability Engineering


Imagine a situation where your services report healthy and serving but you receive multiple user reports of poor availability. How are these users accessing your service? Most likely, they are using your service through a client application, such as a mobile application on their phone. SRE traditionally has only supported systems and services run in datacenters rather than the code running on the client, and this can lead to issues that go unnoticed until it is too late. This report explains the importance of client-side reliability, describes the challenges of working in such an environment, and provides a useful set of SRE concepts and potential tools to apply to your own client applications.

Introduction to Machine Learning Reliability Engineering


Machine Learning Reliability Engineering (MLRE) is an emerging specialization of Site Reliability Engineering (SRE). In this article, I'll explain why specialization is required in SRE and describe some of the other specializations that already exist. I'll also talk about the roles and responsibilities of an MLRE and provide brief insight into how different engineering functions will interact with this new role. Throughout this article, I'll use MLRE to refer both to the field of Machine Learning Reliability Engineering and to a Machine Learning Reliability Engineer. Google first came up with the idea of SRE by applying the principles of software engineering to DevOps more than 15 years ago.

Metrics That Matter

Communications of the ACM

Site reliability engineering, or SRE, is a software-engineering specialization that focuses on the reliability and maintainability of large systems. In its experience in the field, Google has found some critical but oft-neglected metrics that are important for running reliable services. This article, based on Ben Treynor's talk at the Google Cloud Next 2017 conference [7], addresses those metrics, specifically for product development and SRE teams, managers of such teams, and anyone else who cares about the reliability of Web products or infrastructure. To further explain its approach to product reliability, Google has published Site Reliability Engineering: How Google Runs Production Systems [1] (hereafter referred to as the SRE book) and The Site Reliability Workbook: Practical Ways to Implement SRE [2] (hereafter referred to as the SRE workbook).

Why SRE Documents Matter

Communications of the ACM

Site Reliability Engineering (SRE) is a job function, a mind-set, and a set of engineering approaches for making Web products and services run reliably. SREs operate at the intersection of software development and systems engineering to solve operational problems and engineer solutions to design, build, and run large-scale distributed systems scalably, reliably, and efficiently. SREs focus on the life cycle of services--from inception and design, through deployment, operation, refinement, and eventual decommissioning. Before services go live, SREs support them through activities such as system design consulting, developing software platforms and frameworks and capacity plans, and conducting launch reviews. Once services reach end of life, SREs decommission them in a predictable fashion with clear messaging and documentation. A mature SRE team likely has well-defined bodies of documentation associated with many SRE functions. If you manage an SRE team or intend to start one, this article will help you understand the types of documents your team needs to write and why each type is needed, allowing you to plan for and prioritize documentation work along with other team projects. Before discussing the nuances of SRE documentation, let's examine a night and day in the life of Zoë, a new SRE. Zoë is on her second on-call shift as an SRE for Acme Inc.'s flagship AcmeSale product.