There’s rarely a dull moment in the life of a site reliability engineer. When applications and services are down, SREs get the call. If thousands of users or millions of dollars are on the line and the clock is ticking, all eyes turn to the SRE to save the day.
The downside to carrying this kind of responsibility: a huge amount of stress. Late nights, high pressure, and constant demands to swoop in and fix problems (even ones that don’t necessarily fall under the SRE role) are all common complaints. And the problem doesn’t seem to be improving.
Why is the SRE role so hard on the people doing these jobs? And what can we do to make it better?
Evolution of the SRE
The role of the SRE evolved in response to changing methods of building digital products and services. In recent years, as more companies have embraced agile software methodologies and DevOps, they have been pushing out new code faster than ever. When things inevitably break, it’s often the SRE’s job to fix them, whether or not the SRE was involved in the development and rollout processes.
In principle, SREs are not supposed to be constantly putting out fires. Rather, as Google originally defined the job, they should spend a significant portion of their time on proactive, strategic tasks like increasing system reliability, optimizing capacity planning, and improving documentation. When an incident arises, SREs don’t just bring services back online. Ideally, they conduct extensive post-mortems. They identify why the issue arose, share knowledge about the incident, and build systems and automation to prevent it from happening again.
Unfortunately, many SREs say the reactive aspects of the job end up taking most of their time. That imbalance puts more pressure on SREs than they should be asked to bear. Worse, the steps that could reduce that stress -- increasing system reliability, automating problem resolution, and improving documentation -- are the very things that get pushed aside.
Navigating SRE challenges
Several factors contribute to the stress and frustration:
Reimagining SRE roles
Too many organizations struggle to maintain the well-being and job satisfaction of their SREs. If we’re going to realize the benefits that drove the creation of the SRE role in the first place -- if companies want to be able to scale up more quickly without sacrificing reliability -- we need to make this function work better. Here are two steps to consider:
It’s time to take better care of SREs
As the guardians of an organization’s critical services, SREs will always shoulder a big responsibility. That’s just the nature of the job. But there’s no reason the role has to come with so much stress and frustration. Organizations can do a better job of empathizing with SREs and making sure that everyone understands what their role is, and what it’s not. They can also make sure they’re giving SREs the time, tools, and visibility they need to be proactive in their jobs.
By taking these steps, organizations can help SREs detect and solve problems more quickly. That in turn creates more time for SREs to focus on proactive initiatives. Ultimately, we can transform the SRE role into a virtuous circle of ongoing improvement and automation. As we do, we’ll end up with a lot less stress and frustration -- among SREs, the broader company, customers, and end users.
Nithyanand Mehta is Executive Vice President, Technical Services and GM at Catchpoint. Mehta leads the global Catchpoint Technical Services teams, which include Professional Services, Sales Engineers, and Support.
The InformationWeek community brings together IT practitioners and industry experts with IT advice, education, and opinions. We strive to highlight technology executives and subject matter experts and use their knowledge and experiences to help our audience of IT ...