ALTERNATE UNIVERSE DEV

Page It to the Limit

On-Call Nightmares With Jay Gordon

“All these conversations at the bar…why is nobody recording them?” - Jay Gordon, the host of the popular On-Call Nightmares podcast, talking about where the idea for the show came from.

One of the biggest myths is that on-call is just an extra part of an SRE's or sysadmin’s job, not really a big part of their duties. It has been treated as just a thing you do, and it hasn’t always been taken seriously, especially its impact on the individual who is on call.

Remember - on-call isn’t just for ops or SRE. Andrew Clay Shafer used to describe himself as a “conscientious developer”, even before the ideas of DevOps. Thinking about things this way made him a better developer, and it heavily contributed to the foundation of the DevOps movement.

Software engineers are often resistant to being on-call because of what they think it means, based on the horror stories they hear from their coworkers and friends who work in Ops.

How has on-call changed?

Jay: “Automation has made so much of the difference”

Well-documented automation makes it easier to track down what might be contributing to issues. There are now tools watching what is going on throughout the deployment process. We have a greater ability to spin up replacement systems, too.

We are moving away from a model in which one team is on-call for everything inside the business; now it is more about selected domain experts being on call for the thing they know really well. As a developer on call, you know you are only being called about things you know about. Additionally, the more people who go on call, the smaller the actual impact on each person who is on-call. So the experience is a lot different. “We’ve reduced the individual blast radius by distributing it” - Jay.
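The point about distributing the load can be made concrete with a little arithmetic. The sketch below is only an illustration, not something from the episode: it assumes shifts are split evenly across the rotation and that a year has 52 one-week shifts, and the function names are hypothetical.

```python
from fractions import Fraction


def on_call_share(rotation_size: int) -> Fraction:
    """Fraction of the time each person is on call, assuming an even split."""
    if rotation_size < 1:
        raise ValueError("rotation_size must be at least 1")
    return Fraction(1, rotation_size)


def weeks_on_call_per_year(rotation_size: int, weeks_per_year: int = 52) -> float:
    """Weeks per year each person spends on call (assumes one-week shifts)."""
    return float(on_call_share(rotation_size) * weeks_per_year)


if __name__ == "__main__":
    # Compare a single always-on-call person with progressively larger rotations.
    for size in (1, 2, 4, 8):
        print(f"{size} people: each on call {weeks_on_call_per_year(size)} weeks/year")
```

Under these assumptions, going from a rotation of two to a rotation of eight takes each person from 26 weeks a year on call down to 6.5.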

“The beautiful thing about going on-call is you get to go off-call. If you aren’t on-call, I have news for you - you’re always on-call” - Matt. It’s a real relief to know you are not on call, because you don’t have to worry that someone will call you. “Trust me - your ops team knows how to find you, and they will” - Matt.

On-call requirements are different

Not every company or service requires 24-hour on-call support. Consider this when you are thinking about where you want to work. That said, if you do work for an organization that provides a service around the clock, on-call is likely a part of that job, and everyone should consider it part of their service ownership. But ultimately, choose the role that works for you. It’s less about the title or role than it is about the type of company or organization and what they need. As Jay points out, “in the end, we are all just people, and we have basic requirements - like eating, having water, getting enough sleep, and spending time with people we like. On-call should still let you do these things”.

A good question to ask when getting into a role that has an on-call component is “how are incident responders rotated off of an incident?” Responders stop being effective after a couple of hours. Questions like “what’s the size of the rotation?” and “what are the expectations of a responder during an incident?” are much more important than “how often will I get paged?”

How to avoid having an on-call nightmare

Jay: “It always comes down to tech debt. It’s amazing how much tech debt comes down to a lack of documentation. It becomes one of those scary parts that if it falls down, nobody will know what to do”

Additional Resources

Episode source