I don’t know about you, but lately, I’ve been hearing quite a lot about SREs (or, non-acronymized, Site Reliability Engineers). Now, there are probably a dozen different meanings for this role, and it varies from company to company. I’m going to talk about what we had in the Agoda Homes team and the impact it had on both morale and the actual reliability of our platform. Basically, for my definition, an SRE is an engineer within the team tasked with monitoring the reliability of the product, investigating the cause of bugs, and determining their priority.
The Job No Engineer Wanted
Initially, we created the role within our product because we were almost feature-complete and we had traffic. So, we needed someone (or a team) to monitor how our production environment was performing and determine which bugs were critical to the success of the project and what their actual impact was. I can tell you now that if you create this role out of thin air, your engineers will probably hate you. I’m being dramatic (of course), but in the end, no engineer wanted to take on the role. It was rotated every sprint (we figured a week was too short and a month was probably too long).
First a Team
As I kind of alluded to above, we first started out by assigning this SRE role to a team. We’d reduce the number of stories the team needed to deliver and give them free rein over which bugs to tackle and how to determine impact. Now, as I said, the point of the SRE is not to solve the bugs but to investigate them and determine priority. Can you already guess where I’m heading? Rather than investigating and determining priority, the team would usually investigate and solve. That sounds nice, until the team is spending a significant amount of time on bugs that probably aren’t a high priority while we have features that need to be completed.
In the end, though, assigning the SRE role to a team led to decreased morale (within the team chasing bugs) and very low productivity. We didn’t really change the reliability of our product, and we ended up hurting our velocity. With bugs being reported all the time, the team was constantly dropping product work and context switching within a sprint. The cost of this constant ramp-up (think: where did I get to with that story?) was too great.
Then — a Single Engineer
Right, so the team as an SRE role didn’t work. We also tried having a single engineer from the product act as SRE every sprint. This was better but still not good. Basically, the one poor software engineer ended up being named the bug buster. Or bug boy. Or any play on the word bug you could imagine. Now, what happened is that this single engineer would need a one- to three-day handover from the previous bug boy. That’s a lot of time spent just getting to know what the bugs are in the system. Remember, this software engineer was not meant to solve the bugs, but to figure out where they were happening and how high a priority each should be. That’s hard.
We had a rotating roster. We didn’t ask for volunteers; it was mandatory. Also not great for culture. But it worked. People got on with their jobs. But the bug boy was left isolated and alone. They were no longer part of the team (even though they came to stand-ups and meetings). They had different priorities from the rest of the engineers. What we found was that this role became very inefficient. So much time was spent each sprint on ramping up and knowledge transfer that bugs were left on our radar for weeks at a time because they were not reproducible (which should mean low priority, right?).
We also found that engineers who were the SRE didn’t necessarily come back with knowledge of the different parts of the system (as you might expect). What ended up happening was that a high-priority bug would come through from the PO (Product Owner), the QAs (Quality Assurance/Testers), or customer feedback; the SRE would have to drop the bug they were working on and figure out the new one. So their knowledge was reduced to the high-profile bug.
For the rest of the engineers, there were no more distractions. This was what we wanted, right? No POs nagging us and product work pushing ahead full steam. But having a member away from your sprint meant that the teams became disconnected. Knowledge of bugs was passed from SRE to SRE rather than shared among the team. It was like a “rite of passage” to be an SRE. No one looked forward to the role.
What We Do Now
We no longer have SREs within the Agoda Homes team. The toll the role took on the people and the effectiveness of the teams was too great. We still get high-priority bugs. We still investigate bugs. But it’s more like a Product task now. The PO chats with the QAs. QAs help determine how much of an impact the bug has on the product. The PO weighs up product and bug work and determines what will bring the most business value. It’s not perfect, but as engineers, we work together as a team again.
Originally published at www.alexaitken.nz on July 23, 2018.
