Swiss Digital Network

Site Reliability Engineering – An Overview

If you let a software engineer come up with an ideal set-up for application operations, you will most likely hear about a concept very similar to Site Reliability Engineering (SRE) [1], as described by Ben Treynor, VP of Engineering at Google and founder of Google’s SRE team.

The motivation to introduce and define such a role is based on conflicting goals in modern operating models. Two examples of such conflicts are: 

  • Trade-off between release speed, release quality and operational stability. This typically arises when the organisation is split into silos:
    – Operational teams that strive for stability.
    – Quality assurance teams that want to find as many issues as possible.
    – Development teams that are required to release new features with high velocity. 
  • Economic and cost efficiency goals that conflict with the increasing service complexity (e.g. virtualisation, containerisation, microservices) and go against the classical specialist system administrator set-up that was used to streamline standardised operating environments. 
Typical conflicts from siloed goals.

The solution principles outlined below are some of the key patterns of Site Reliability Engineering that companies in any industry should consider when adopting SRE.

Reliability Engineering vs. System Administration 

Traditionally, Systems Engineering departments take over the commissioning of applications and hand them over to the operations departments, including operating instructions. The classic system administrator therefore takes over operation based on the operating manual. 

In contrast, the Site Reliability Engineer is involved throughout the application life cycle and develops the operational goals in coordination with product management and application development. 

In other words, the SRE is never disconnected from operations and can follow the application at every step of its life cycle. Communication skills are therefore just as important as engineering skills.

While the traditional system administrator has mainly concentrated on operating the system according to the operating manual, the SRE brings the stability and availability of the system in line with the more frequent releases required by product management today. 

Reliability Engineering vs. Operational Activities 

An SRE strives to automate tedious and repetitive tasks as far as possible, such as cleaning up file systems or handling specialist tasks like erroneous payments, following the principle “automate wherever possible and sensible”.

In doing so, the SRE strives to minimize the errors that often occur during the operation of applications. Those errors can be of a functional nature (e.g. payments that the system could not process) or of a non-functional nature (e.g. hard disks running full).
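The file-system clean-up mentioned above could be automated along the following lines. This is only a minimal sketch: the function names, the age-based rule and the dry-run default are assumptions made for illustration, not something prescribed by SRE practice or by the book.

```python
import time
from pathlib import Path


def stale_files(directory, max_age_days, now=None):
    """Return the files directly inside `directory` older than `max_age_days`.

    Hypothetical helper: "old" is judged by last modification time only.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def clean_up(directory, max_age_days, dry_run=True):
    """Delete stale files and return the list of affected paths.

    With dry_run=True (the default) nothing is deleted, so the
    result can be reviewed before the job is trusted to run unattended.
    """
    victims = stale_files(directory, max_age_days)
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims
```

A scheduled job could call `clean_up(some_directory, 7, dry_run=False)` once the dry-run output has been checked; defaulting to a dry run is a deliberate choice so that the automation cannot remove data by accident.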

When incidents recur, SREs are expected not only to fix the immediate problem but also to eliminate its cause, either by adapting the operating infrastructure or by developing (engineering) a solution themselves, possibly in dialogue with the developers.

These continuous improvements simplify the operation of the application and reduce costs. 

To take on these engineering tasks, SREs need the support of the organization so that they can ideally spend 50% or more of their time on them.

Service Level Objectives vs. Service Level Agreements 

Service Level Agreements (SLAs) are contractual: they specify which goals must be met and, above all, what penalties apply in the event of non-compliance. Because they are legally binding, they are too rigid and too slow to change to serve as internal goals.

When working out Service Level Objectives (SLOs), SREs work together with product management to define the availability goals and how they are measured. Because SREs already know the architecture of the application, they can define goals that are both meaningful and measurable.

Relation between SLA, SLO and SLI.

Commonly used Service Level Indicators (SLIs) for measuring SLOs are, for example, request rate (requests/s), error rate and duration (RED). We will detail these in one of the subsequent blog posts.
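To make the three RED indicators concrete, here is a minimal sketch that derives them from a list of request records collected over a time window. The `Request` record, the `red_metrics` function and the choice of the 95th percentile for duration are assumptions made for this example; real monitoring systems compute these values from their own data models.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # False if the request ended in an error


def red_metrics(requests, window_seconds):
    """Compute rate, error rate and a duration percentile for one window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_rate": 0.0, "p95_ms": 0.0}
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile, using integer arithmetic for ceil(0.95 * total).
    rank = (95 * total + 99) // 100
    return {
        "rate_per_s": total / window_seconds,  # Rate
        "error_rate": errors / total,          # Errors
        "p95_ms": durations[rank - 1],         # Duration
    }


# Example: 20 requests in a 10-second window, two of them failed.
sample = [Request(duration_ms=10.0 * i, ok=(i > 2)) for i in range(1, 21)]
metrics = red_metrics(sample, window_seconds=10)
print(metrics)
```

An SLO would then be expressed as a threshold on one of these values, for instance an error rate that must stay below an agreed limit over the measurement period.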

Error Budget vs. SLOs 

At first sight, the desire for high availability and stability conflicts with a high change frequency. Product management wants new features to be released quickly and, at the same time, requires a high level of stability.

The SRE actively works on this discrepancy and negotiates an error budget with product management, which is derived from the availability target.
For example, if the service should have an availability of 99.99% in a quarter, the SRE may stop the service for about 13.1 minutes (0.01% of three months) to install new features. These features can be new functional requirements/improvements or non-functional capabilities/improvements (e.g. performance, scalability, security, automation, monitoring, etc.).
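The arithmetic behind the 13.1 minutes can be reproduced with a few lines of Python. This is an illustrative sketch: the function names are hypothetical, and the quarter is assumed to be 91 days long, which is the assumption under which the figure comes out at roughly 13.1 minutes.

```python
from datetime import timedelta


def error_budget(slo, window):
    """Return the downtime allowed within `window` for an availability target `slo`."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(slo, window, downtime_so_far):
    """Return how much of the error budget is left; negative means it is overspent."""
    return error_budget(slo, window) - downtime_so_far


quarter = timedelta(days=91)            # assumption: one quarter = 91 days
budget = error_budget(0.9999, quarter)  # 0.01% of the quarter
print(round(budget.total_seconds() / 60, 1))  # prints 13.1

# After a 5-minute outage, this much budget is left for further changes.
left = budget_remaining(0.9999, quarter, timedelta(minutes=5))
print(round(left.total_seconds() / 60, 1))
```

Once the remaining budget approaches zero, the agreement with product management would typically shift the focus from releasing features to stabilising the service until the next period starts.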

How to adopt SRE for your environment? 

Adopting SRE is quite challenging for organizations outside the high-tech sector, despite Google’s major contributions (such as basic training and several publications) to adapting the original Site Reliability Engineering concepts beyond the highly scalable online platforms for which they were developed.

The founders of Digital Architects, with 20+ years’ experience in industrialising and architecting adjacent disciplines such as Unified Monitoring and Performance Engineering, have recognized these SRE adoption challenges and have developed the concept of “Effective SRE” [2].

Effective SRE’s main intention is to industrialize the SRE approach and make it accessible to any organisation looking to build The Digital Highway for Continuous and Reliable Software Delivery while leveraging Google SRE best practices and AI-Driven CI/CD and AIOps Patterns.  

The secret sauce of Digital Architects’ industrialized “Effective SRE” approach will be explained in more detail in further blog posts.

References 

  1. Site Reliability Engineering. Beyer, Jones, Petoff, Murphy, 2016. 
  2. Effective SRE. Digital Architects Zurich, 2020. 
