We need to discuss something. There are a lot of things about the SRE career path that we don't talk about at happy hours or with colleagues. Let me share my 5 confessions about the job.
Preventing outages is why companies hire SREs, but there's a tiny part of us that lives for the thrill of things going wrong. When that pager goes off, our hearts race and our palms sweat as we frantically work to pinpoint the root cause and remediate the issue. We'd never admit it, but we secretly enjoy the adrenaline rush of an outage and the dopamine boost of finding the root cause.
Don't get me wrong, the SRE job is rewarding. But sometimes we look over at developers who spend their days blissfully coding away, no pager duty or 3am alerts in sight. We wish we could trade the stress of keeping the lights on for just writing features and fixing bugs. The grass is always greener!
We love our complex distributed systems - so many moving parts, so resilient! Yet human mistakes can take everything down, no matter how robust we designed those systems to be. A little "fat finger" here, a lapse in judgement there, and six hours into the outage resolution you're questioning your career choices.
It's easy for us to run fire drills for little incidents like config errors or minor outages. But we shy away from truly war-gaming catastrophic scenarios like a data center power outage, a network partition, or restoring a database from backup. We tell ourselves these disasters are too unlikely to worry about. In reality, we know we should be more prepared, but other things usually take priority.
Site changes are the #1 cause of outages. As much as we try to anticipate problems, we know most outages boil down to some code or config that got pushed. Outages and alerts at my company are down 70-90% WoW during periods with no code deploys (e.g. holidays, company shutdowns, etc.) - a statistic you can't ignore.
That's why SREs are often perceived as "gatekeepers". We require change reviews, rollback plans, and testing for a reason - because nothing wreaks havoc on reliability like change.
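To make the "gatekeeper" idea concrete, here is a minimal sketch of what such a pre-deploy gate could look like. Everything in it is hypothetical and for illustration only: the `ChangeRequest` record, its field names, and the `missing_requirements` and `can_deploy` helpers are assumptions, not any particular tool's API.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    """Hypothetical record describing a proposed production change."""
    title: str
    reviewed: bool = False       # has someone else reviewed the change?
    rollback_plan: str = ""      # free-text description of how to undo it
    tests_passed: bool = False   # did the change pass its tests?


def missing_requirements(change: ChangeRequest) -> list[str]:
    """Return the gate requirements this change has not met yet."""
    missing = []
    if not change.reviewed:
        missing.append("change review")
    if not change.rollback_plan.strip():
        missing.append("rollback plan")
    if not change.tests_passed:
        missing.append("testing")
    return missing


def can_deploy(change: ChangeRequest) -> bool:
    """A change may ship only when nothing is missing."""
    return not missing_requirements(change)
```

Returning the list of missing items, rather than a bare yes/no, lets the gate tell the author exactly what is still needed instead of just saying "no".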
So there you have it. Share your SRE confession on Twitter and tag us at @codereliant - we want to learn from you!