How to Reduce Toil with SRE and Automation
Toil is a term coined by Google to describe tedious, repetitive tasks associated with running a production environment. For Site Reliability Engineering (SRE) teams, the aim is to reduce or even eliminate toil in order to maximize the time spent on engineering and innovation. But how can they do that?
In this blog post, we’ll dive into the role toil plays in SRE and what operational teams can do to reduce the time spent on it.
Make sure you watch our on-demand webinar with Credit Suisse, where Credit Suisse’s VP of SRE shares the strategies the organization employed to reduce toil with test automation, boost efficiency and enable operational teams to focus on more satisfying, high-value work.
First, some definitions are in order.
What is SRE?
SRE, or Site Reliability Engineering, is a term coined by Google to describe a set of practices and a culture, as well as a job role (Site Reliability Engineer).
The concept is not vastly different from DevOps, as its core focus is to bring together the sometimes contradictory aims of Development and Operations. In short, the aim of an SRE is to push development forward to improve systems quickly, while ensuring high-quality, reliable and flexible production environments and applications.
Learn more about SRE and the differences and similarities between SRE and DevOps here.
In order to bring together reliability and speed, SREs must eliminate as much manual work as possible so that they can focus on the actual engineering. This makes the introduction of automation essential.
What is toil and why should it be reduced?
Google define toil as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
Tasks described as toil are relatively easy to execute, but don’t provide a lot of value. They don’t require an engineer’s skill or human judgment. Rather, they keep the engineer from progressing with product and service development.
If teams spend the majority of their time on these types of tasks, they have less time for high-value work. As a consequence, operational costs rise and the focus becomes more reactive than proactive. This inhibits innovation.
The benefit of reducing toil is that the time freed up can be reinvested in engineering work. As Michael Jones, VP of SRE Automation at Credit Suisse, explains in the webinar:
“We don’t want this to be done as an exercise to remove people. It’s more to empower our people to become more engineering focused, move away from mundane tasks, and remove operational risk.”
What’s more, focusing on reducing toil can help prevent future toil from emerging. In that sense, it’s a positive spiral. Google explain how:
“Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect.”
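To make the 50% goal in the quote above concrete, here is a minimal Python sketch of how a team might check logged hours against that ceiling. The `WeekLog` record, its field names, and the choice to flag only values strictly above the ceiling are assumptions made for illustration; they are not taken from Google or Credit Suisse.

```python
from __future__ import annotations

from dataclasses import dataclass

# The ceiling quoted above: toil should not exceed half of an SRE's time.
TOIL_CEILING = 0.50


@dataclass
class WeekLog:
    """Hypothetical weekly time log for one engineer."""
    engineer: str
    toil_hours: float      # manual, repetitive operational work
    project_hours: float   # engineering project work


def toil_fraction(log: WeekLog) -> float:
    """Return the share of logged time spent on toil (0.0 if nothing was logged)."""
    total = log.toil_hours + log.project_hours
    if total == 0:
        return 0.0
    return log.toil_hours / total


def over_budget(logs: list[WeekLog], ceiling: float = TOIL_CEILING) -> list[str]:
    """Return the engineers whose toil share is strictly above the ceiling."""
    return [log.engineer for log in logs if toil_fraction(log) > ceiling]
```

Tracking this number over time is one way to see whether automation efforts are actually freeing up engineering hours.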
Strategies for reducing toil
So what can operational teams do to reduce toil and boost productivity and innovation?
We’ve summed up some of the strategies Credit Suisse used to transition towards an SRE culture. Watch the webinar with Credit Suisse here for full insight into how they reduced toil and boosted productivity.
Make automation a part of the DNA
Key to reducing toil is the introduction of automation.
Toil is by definition automatable, making automation an obvious area of focus for SRE organizations.
As Google explain: “If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil. If human judgment is essential for the task, there’s a good chance it’s not toil.”
Credit Suisse introduced no-code Robotic Process Automation (RPA) to reduce toil. They started out by automating 10% of toil and increased that share to 45% within one year. Their aim for 2021 was to automate 55%, and at the time of the webinar (May 2021) they had already reached 50%.
This was made possible because of their determination to make automation part of each team’s DNA.
And how did they manage that? By decentralizing automation efforts.
Decentralize automation and reduce dependency
Despite common belief, automation doesn’t need to be controlled by centralized teams.
Rather, automation should be accessible to each team.
By bringing automation closer to the people with the strongest understanding of the business applications instead of reserving it for technical specialists, dependencies are reduced and productivity can be boosted.
This requires the right tools and the right governance.
Find the right tools
No-code automation is a key ingredient in creating fully self-sufficient SRE teams.
With automation tools that are no-code, and therefore don’t require specialist technical knowledge, business experts become self-sufficient and can optimize their own workflows and processes.
In order to ensure optimal adoption of tools as well as transparency and overview, a tool inventory can be implemented.
This inventory should explain what, why, how, where, and when each tool should be used for the various types of toil. Each tool should have a tool evangelist with the responsibility for communications around it.
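As an illustration, a tool inventory of this kind could be kept as structured records and checked for gaps. The sketch below is only one possible shape: the `ToolEntry` class and its field names are hypothetical, chosen to mirror the what, why, how, where, and when questions and the evangelist role described above.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class ToolEntry:
    """Hypothetical inventory record for one automation tool."""
    name: str
    what: str        # what the tool does
    why: str         # why it is in the toolbox
    how: str         # how to use or request it
    where: str       # where (which systems or teams) it applies
    when: str        # when (for which types of toil) to reach for it
    evangelist: str  # person responsible for communications around the tool


def incomplete_fields(entry: ToolEntry) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name).strip()]
```

Running `incomplete_fields` over every entry gives a quick list of tools that still lack documentation or an evangelist.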
Ensure governance for minimal risk
With decentralized responsibility comes risk. However, with strong governance, automation will only reduce risk, not introduce it.
A governance team that keeps an overview of all automated robots can manage this risk effectively.
The team’s role should be to review all automation bots, understand what they do, which applications they interface with, and who is responsible for them.
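One way a governance team might keep that overview is a simple bot registry. The following Python sketch assumes a hypothetical `BotRecord` with the three things the review should establish (what the bot does, which applications it touches, who owns it); the names and checks are illustrative, not a description of Credit Suisse’s setup.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BotRecord:
    """Hypothetical registry entry for one automation bot."""
    name: str
    purpose: str = ""                                      # what the bot does
    applications: list[str] = field(default_factory=list)  # systems it interfaces with
    owner: str = ""                                        # who is responsible for it


def review_findings(bot: BotRecord) -> list[str]:
    """Return the governance gaps found for a single bot."""
    findings = []
    if not bot.purpose.strip():
        findings.append("purpose undocumented")
    if not bot.applications:
        findings.append("no applications listed")
    if not bot.owner.strip():
        findings.append("no owner assigned")
    return findings


def bots_touching(registry: list[BotRecord], application: str) -> list[str]:
    """Return the names of all bots that interface with the given application."""
    return [bot.name for bot in registry if application in bot.applications]
```

A lookup like `bots_touching` is useful before changing an application, since it shows which bots could be affected and who to contact.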
The outcome of reducing toil? More time for engineering
As a result of reducing toil through strategic automation, more time can be made for engineering, with the same level of staffing.
This can drastically improve a business’s ability to innovate and improve systems quickly.
What’s more, automation can reduce the human error that typically occurs in connection with tedious, repetitive work. And, just as importantly, by removing these tedious tasks, people can do more satisfying work that they actually enjoy.
Learn more about reducing toil with SRE and automation
Watch the on-demand webinar with Credit Suisse to learn how they used RPA and self-service tools to boost efficiency, so operational teams can focus more on higher-value work and innovation.
Additional topics discussed in the webinar:
- How to manage automation resources
- How to manage data and create data transparency
- Where to start with SRE
- How to align DevOps and SRE
- ...and much more