Site Reliability Engineers: Living Under High Pressure - InformationWeek


4/29/2020
07:00 AM
Nithyanand Mehta, Executive Vice President, Technical Services and GM, Catchpoint

Site Reliability Engineers: Living Under High Pressure

Why the role of site reliability engineer is so stressful and what can be done about it.

Image: Pixabay

There’s rarely a dull moment in the life of a site reliability engineer. When applications and services are down, SREs get the call. If thousands of users or millions of dollars are on the line and the clock is ticking, all eyes turn to the SRE to save the day.

The downside to carrying this kind of responsibility: a huge amount of stress. Late nights, high pressure, and constant demands to swoop in and fix problems (even ones that don’t necessarily fall under the SRE role) are all common complaints. And the problem doesn’t seem to be improving.

Why is the SRE role so hard on the people doing these jobs? And what can we do to make it better?

Evolution of the SRE

The SRE role evolved in response to changing methods of building digital products and services. In recent years, as more companies have embraced agile software methodologies and DevOps, they have been moving faster than ever to push out new code. When things inevitably break, it’s often the SRE’s job to fix them, regardless of whether the SRE was involved in the development and rollout processes.

In principle, SREs are not supposed to be constantly putting out fires. Rather, as Google originally defined the job, they should spend a significant portion of their time on proactive, strategic tasks like increasing system reliability, optimizing capacity planning, and improving documentation. When an incident arises, SREs don’t just bring services back online. Ideally, they conduct extensive post-mortems. They identify why the issue arose, share knowledge about the incident, and build systems and automation to prevent it from happening again.

Unfortunately, many SREs say the reactive aspects of the job end up taking most of their time. That imbalance puts more pressure on SREs than they should be asked to bear. Worse, the steps that could reduce that stress -- increasing system reliability, automating problem resolution, and improving documentation -- are the very things that get pushed aside.

Navigating SRE challenges

Several factors contribute to the stress and frustration: 

  • Poorly defined job responsibilities: Because the SRE role is still relatively new, there’s a lot of variation in, and misunderstanding about, what exactly the job entails. Too often, the lines between SREs and delivery and operations teams get blurred. As one SRE told us, “Because the SRE role changes from organization to organization, there can be confusion about the SRE role versus pre-existing operations roles. This creates extra work for SREs, as we end up having to do tasks that may not be under our scope or having to push back on requests from people who don’t understand our role.”
  • Outsized focus on reactive incident remediation: Along those lines, many SREs see their roles effectively morph into “ultra sysadmin.” They spend so much time detecting and solving problems that there’s little bandwidth left to focus on building systems that are more reliable, efficient, and automated.
  • High-pressure scenarios: SREs often feel like the control-booth technician at a big conference. When a presenter’s slides won’t load, all eyes immediately turn to the booth. For every minute that goes by in silence, the anxiety grows. SREs tell us that while they appreciate being trusted with so much responsibility, what they’d really like is some empathy.

Reimagining SRE roles

Too many organizations struggle to maintain the well-being and job satisfaction of their SREs. If we’re going to realize the benefits that drove the creation of the SRE role in the first place -- if companies want to be able to scale up more quickly without sacrificing reliability -- we need to make this function work better. Here are two steps to consider:

  1. Implement firm timetables for the different parts of the SRE job: There’s no point in bringing in SREs if they end up spending all their time on troubleshooting and operations. Organizations have to consciously carve out time for SREs to devote to building systems and working on proactive initiatives, and then enforce those timetables. And to reduce the time SREs spend debugging and fixing problems, get them involved earlier in the development life cycle.
  2. Focus on the right metrics: A lot of companies collect data on how long it takes to resolve problems but don’t track how long it takes to detect those problems in the first place, or how long an incident runs before the business is impacted. These measures are just as important.
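
The three measures named in the second step can be derived from a handful of timestamps per incident. The following is a minimal Python sketch rather than a prescribed method: the `Incident` class, its field names, and the sample data are all hypothetical, and it assumes your incident tracker records when a fault began, when it was detected, when the business was first affected, and when service was restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical field names; map them to whatever your tracker records.
    started_at: datetime    # when the fault actually began
    detected_at: datetime   # when monitoring or a person first noticed it
    impacted_at: datetime   # when the business was first affected
    resolved_at: datetime   # when service was restored

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_impact(self) -> timedelta:
        return self.impacted_at - self.started_at

    @property
    def time_to_resolve(self) -> timedelta:
        # Measured from detection, so slow detection is not hidden in this number.
        return self.resolved_at - self.detected_at


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2020, 4, 1, 9, 0), datetime(2020, 4, 1, 9, 20),
             datetime(2020, 4, 1, 9, 10), datetime(2020, 4, 1, 9, 50)),
    Incident(datetime(2020, 4, 8, 14, 0), datetime(2020, 4, 8, 14, 40),
             datetime(2020, 4, 8, 14, 30), datetime(2020, 4, 8, 15, 0)),
]

mttd = mean_minutes([i.time_to_detect for i in incidents])
mtti = mean_minutes([i.time_to_impact for i in incidents])
mttr = mean_minutes([i.time_to_resolve for i in incidents])

print(f"mean time to detect:  {mttd:.0f} min")
print(f"mean time to impact:  {mtti:.0f} min")
print(f"mean time to resolve: {mttr:.0f} min")
```

In this made-up data, the business is affected before the problem is even detected in both incidents, which is exactly the kind of gap that tracking resolution time alone would miss.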

It’s time to take better care of SREs

As the guardians of an organization’s critical services, SREs will always shoulder a big responsibility. That’s just the nature of the job. But there’s no reason the role has to come with so much stress and frustration. Organizations can do a better job of empathizing with SREs and making sure that everyone understands what their role is, and what it’s not. They can also make sure they’re giving SREs the time, tools, and visibility they need to be proactive in their jobs.

By taking these steps, organizations can help SREs detect and solve problems more quickly. That in turn creates more time for SREs to focus on proactive initiatives. Ultimately, we can transform the SRE role into a virtuous circle of ongoing improvement and automation. As we do, we’ll end up with a lot less stress and frustration -- among SREs, the broader company, customers, and end users.

Nithyanand Mehta is Executive Vice President, Technical Services and GM at Catchpoint. Mehta leads the global Catchpoint Technical Services teams, which include Professional Services, Sales Engineers, and Support.

 
