In our previous discussion, we covered the initial five lessons from the eleven lessons shared by Google's Site Reliability Engineering (SRE) team, reflecting on their two decades of expertise.
Today, we will explore the remaining six lessons from their post. Our aim is to expand upon each lesson, offering our insights and perspectives to enhance understanding.
Additionally, we aim to connect theoretical concepts with practical application, offering a more knowledgeable approach to addressing SRE challenges. We will demonstrate how these lessons can be applied to maintain the health of your systems, using real-world examples to deepen your understanding of these principles and help you improve your methods for ensuring system uptime and operational efficiency.
It's crucial to have multiple reliable channels of communication, especially during incidents and outages. This was evident when AWS's us-east-1 region experienced downtime, affecting Slack and numerous systems of companies relying on it as their primary chat platform. In such scenarios, not only do we face the challenge of an outage, but also the inability to communicate as a team to address or resolve the issue effectively.
COMMUNICATION CHANNELS! AND BACKUP CHANNELS!! AND BACKUPS FOR THOSE BACKUP CHANNELS!!!
Here are some tips on communication backups:
- Have both in-band (email, chat) and out-of-band (SMS, phone) communication channels set up for critical teams. Don't rely solely on one channel.
- Ensure contact information is kept up-to-date across all systems. Stale contacts amplify chaos.
- Document and test backup contacts and escalation procedures. Know who to reach if the first contact doesn't respond.
- Geo-distribute and fail-over communication systems so there is no single point of regional failure.
- Regularly test backup communication channels to ensure they are functioning, not forgotten. Rotate testing.
- Audit access permissions and multi-factor authentication to avoid getting locked out of channels.
- Support communication across multiple teams like technical, leadership, PR, customer support. Coordinate flows.
- Have a common lexicon for major incidents so terminology is consistent. Helps reduce confusion.
- Quickly establish incident chat channels but also document decisions in durable ticket systems.
Redundant comms are indispensable for coordination during chaos. Invest in multiple channels, rigorous testing, and distribution to keep conversations flowing when systems fail.
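The "rotate testing" tip above can be scripted. Here is a minimal sketch of a channel-verification loop; the channel names and the `send` callables are hypothetical placeholders, assuming each channel integration exposes a function that raises on failure:

```python
import logging
from typing import Callable, Dict


def verify_channels(channels: Dict[str, Callable[[str], None]]) -> Dict[str, bool]:
    """Send a test message through every channel; record which ones work."""
    results = {}
    for name, send in channels.items():
        try:
            send("Scheduled comms test -- please ignore.")
            results[name] = True
        except Exception:
            logging.exception("Backup channel %r failed its test", name)
            results[name] = False
    return results


def broken_phone_bridge(msg: str) -> None:
    # Stands in for a backup channel that has silently rotted.
    raise TimeoutError("no response from phone bridge")


# Hypothetical channels: in practice these would wrap your chat, SMS,
# and phone-bridge integrations.
channels = {
    "chat": lambda msg: None,   # primary: pretend it works
    "sms": lambda msg: None,    # backup: pretend it works
    "phone_bridge": broken_phone_bridge,
}

print(verify_channels(channels))  # the broken channel shows up as False
```

Running a script like this on a schedule turns "forgotten backup channel" from a surprise during an incident into a routine alert.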
Degradation Mode In Production
If your customers or systems are used to a specific level of performance, they come to rely on it, and it becomes a hidden service level objective distinct from the SLO you have actually promised them. This is known as Hyrum's Law:
With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
Therefore, it's important to operate not only in various performance modes in production but also to run in a degraded mode when possible. Doing so helps you understand the limits of your systems and challenges your assumptions about potential outcomes under different conditions.
Intentionally degraded performance modes
Here are some ways to intentionally build in degraded performance modes to add resilience:
- Feature flags to quickly disable or throttle non-critical functions to preserve core services.
- Config parameters to ratchet down cache times, request timeouts, concurrent users.
- Load balancer rules to route a percentage of traffic to lightweight static/backup pages.
- Switch to less resource-intensive algorithms, compression, transmission modes.
- Allow manual invocation of "lightweight mode" to run services in reduced complexity.
- Create "soft failure" points that gracefully degrade performance vs hard failures.
- Size spare capacity to allow systems to throttle down to sustainable levels.
- Categorize service tiers and have plans to deprioritize lower tiers during surges.
- Build circuit breakers that trigger when thresholds are exceeded to restrict loads.
- Cluster/shard architectures that can selectively route requests to subsets.
- Rate limiters, quota systems, and queues to smooth traffic spikes.
- Fuse designs that cut non-vital subsystems off before cascading failures.
The goal is to architect resiliency by design rather than rely solely on excess capacity margins. Plan degraded modes in advance.
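Several of the mechanisms above (circuit breakers, fuses, soft-failure points) share the same core pattern: detect repeated failure and switch to a cheaper fallback. Here is a minimal sketch; the threshold and the fallback behavior are illustrative choices, not prescriptions from the original post:

```python
class CircuitBreaker:
    """Trip to a degraded fallback after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def call(self, primary, fallback):
        if self.open:
            return fallback()      # degraded mode: skip the failing dependency
        try:
            result = primary()
            self.failures = 0      # success resets the counter
            return result
        except Exception:
            self.failures += 1
            return fallback()      # soft failure instead of an error page


def flaky_backend():
    raise ConnectionError("backend overloaded")


breaker = CircuitBreaker(threshold=2)
for _ in range(4):
    page = breaker.call(flaky_backend, lambda: "cached static page")
print(breaker.open)  # True: later calls no longer touch the backend
```

A production breaker would also re-probe the primary after a cool-down period, but even this skeleton shows the key property: users get a degraded page instead of an error, and the struggling backend gets breathing room.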
Test Disaster Resilience
Testing for disaster resilience is important because it prepares your systems and organization for unforeseen events and failures. By simulating disasters, you can identify weaknesses in your infrastructure, allowing you to address them proactively.
This testing ensures that your systems can withstand and recover from various types of disruptions, minimizing downtime and potential data loss. It's not just about avoiding immediate technical failures, but also about preserving business continuity.
Disaster resilience tests don't necessarily need to occur in a production environment; they can be structured as a tabletop exercise built around a series of thought-provoking "what if" scenarios. This approach allows teams to mentally simulate emergencies and brainstorm responses in a controlled, low-risk setting. By discussing hypothetical disasters and their implications, teams can develop robust contingency plans and strategies without the pressure and risk of real-world failures, enhancing preparedness and response capabilities.
Here are some best practices for testing disaster resilience:
- Simulate different disaster scenarios - power outages, database corruptions, DNS failures, cloud availability zone downtime, etc.
- Conduct tests at random times without warning to evaluate preparedness for surprises.
- Inject real faults into systems like shutting off power, not just simulating them in code. Test actual recovery capabilities.
- Evaluate worst case scenarios - e.g. outage during peak traffic, on critical day like Cyber Monday.
- Analyze interdependence risks - what happens if your cloud provider depends on same disrupted regional infrastructure as you?
- Assess contingency plans for key suppliers and partners - can they maintain service levels you depend on?
- Test redundancy of backups - database replicas, failover locations, content delivery networks.
- Validate communications strategies - ability to reach critical personnel, security access to key systems.
- Check disaster recovery practices like rebuilding from backups, redeploying to alternate sites.
- Examine incident response team effectiveness - do participants understand roles and coordinate well?
- Ensure early warning and monitoring systems correctly detect anomalies and trigger alerts.
- Document lessons learned from each test and implement improvements to continue raising resilience.
Regular rigorous testing builds confidence in your ability to withstand disasters and quickly restore critical services.
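Fault injection doesn't have to start with pulling power cables; it can begin as a unit-level harness that breaks a dependency and asserts the service still answers. A sketch, with entirely hypothetical service and data names:

```python
import contextlib


class Service:
    """Toy service that falls back to a cache when its database is down."""

    def __init__(self):
        self.db_up = True
        self.cache = {"user:1": "Ada"}

    def lookup(self, key):
        if self.db_up:
            return f"db:{key}"
        return self.cache.get(key, "unavailable")


@contextlib.contextmanager
def inject_db_outage(service):
    """Simulate a database outage for the duration of the block."""
    service.db_up = False
    try:
        yield
    finally:
        service.db_up = True  # always restore, even if the test itself fails


svc = Service()
with inject_db_outage(svc):
    answer = svc.lookup("user:1")
print(answer)  # the service degrades to the cache instead of erroring
```

The same context-manager pattern scales up: swap the flag flip for an API call that actually stops a container or blackholes a network route, and the harness becomes a small chaos experiment.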
Automate your mitigations
Have you ever navigated a lengthy Runbook to perform a mitigation, only to discover at the end that a command is non-functional, or the Runbook is outdated? This is a scenario we've encountered numerous times, and it's not just limited to Runbooks. We've faced similar issues with readme files and installation or setup documentation as well.
To circumvent such issues, automation is the key solution. The scope of automation can range from basic scripts to advanced autonomous systems capable of operating independently.
Examples of what to automate include auto-scaling capacity, terminating compromised instances, database failover, traffic shifting, and isolating unhealthy nodes.
Automating mitigation responses to incidents and outages provides huge advantages:
- Speed: Automated actions execute within milliseconds versus manual responses taking minutes or hours. Critical for reducing blast radius.
- Consistency: The same steps get followed precisely every time, unaffected by human panic.
- Scalability: Automated mitigations can respond to surges in events that would overwhelm humans.
- Reliability: Automation reduces dependence on availability and judgement of on-call staff.
- Compliance: Hardcoded responses follow security and compliance protocols. Less risk of human error.
- Logging: Automated responses provide an audit trail of timestamps and actions taken for analysis.
- Testing: Automated mitigations are easier to properly test/simulate before putting into production.
- Flexibility: Automation can integrate with monitoring systems to provide context-specific responses.
The more mitigations and recoveries that can be encoded into systems versus manual intervention, the greater the resilience. Automation brings speed, consistency, and scalability.
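As a concrete illustration of the speed, consistency, and logging advantages above, here is a minimal sketch of an automated "isolate unhealthy nodes" mitigation. The node names and the `health_check`/`drain` callables are hypothetical; in practice they would wrap your monitoring and load-balancer APIs:

```python
def auto_mitigate(nodes, health_check, drain):
    """Drain traffic from nodes that fail their health check, logging each action."""
    actions = []
    for node in nodes:
        if not health_check(node):
            drain(node)                        # automated mitigation, no human in the loop
            actions.append(f"drained {node}")  # audit trail for later analysis
    return actions


# Hypothetical fleet in which node "web-2" is unhealthy.
fleet = ["web-1", "web-2", "web-3"]
drained = []
log = auto_mitigate(
    fleet,
    health_check=lambda n: n != "web-2",
    drain=drained.append,
)
print(log)      # ['drained web-2']
print(drained)  # ['web-2']
```

Because the mitigation is a plain function of observed state, it runs identically at 3 a.m. as at 3 p.m., and the returned action log doubles as the audit trail mentioned above.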
Frequent deployments play a crucial role in testing these pipelines and maintaining up-to-date knowledge about deployment systems. Engaging in continuous deployment brings additional reliability benefits, such as ensuring rapid rollout of updates, quicker feedback loops for identifying issues, and fostering a culture of regular, incremental improvements. This approach can significantly enhance system stability and operational efficiency.
Reduce the time between rollouts, to decrease the likelihood of the rollout going wrong
We would recommend:
- Focus on reducing cycle time through automation, not cutting corners.
- Monitor key quality metrics per release to catch regressions early.
- Feature flags and canary releases allow testing changes safely before full rollout.
- Good test coverage and rehearsals mitigate rollout risks regardless of velocity.
- Plan substantive releases with care around deprecation schedules, user training, marketing, etc.
The goal should be sustainable delivery velocity and stability, not maximizing speed alone. Thoughtful pacing, testing, and monitoring are key.
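The canary-release bullet above hinges on one small mechanism: deterministically assigning each user to a bucket so the same user always sees the same version while the rollout percentage widens. A sketch of that bucketing, assuming nothing beyond the standard library:

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically bucket users: the same user always gets the same answer,
    and raising `percent` only ever adds users, never removes them."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

# Roll the new version out to 5% of users first, watch metrics, then widen.
canary_users = [u for u in (f"user-{i}" for i in range(1000)) if in_canary(u, 5)]
print(len(canary_users))  # roughly 5% of the 1000 users
```

Because the bucket is a hash rather than a random draw, the canary population is stable across requests and across servers, which keeps the metrics you compare between canary and control meaningful.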
A single global hardware version is a single point of failure
Relying on a single global hardware version creates a single point of failure: if that hardware encounters a problem, every system or service depending on it is at risk, because there is no diversity or redundancy in the setup. Without alternatives or backups, such a failure can cause significant operational disruption and make continuous service hard to maintain, which highlights the importance of varied hardware configurations and backup systems to ensure reliability and minimize risk.
Below is a list of how this can impact availability:
- A fault in that hardware model could knock out your entire fleet globally at once. No layer of isolation.
- Supply chain disruptions affecting that hardware model impact everything simultaneously, e.g. chip shortages.
- Hardware bugs and security vulnerabilities present in the model affect the whole fleet.
- No heterogeneity makes attacks against hardware easier to scale. Common exploits paralyze more.
- Hardware end-of-life and replacements are a massive synchronized effort vs gradual.
- Harder to test OS, app and configuration changes safely across differing environments.
Mitigations to consider:
- Use multiple server models and hardware configurations to segment risk.
- Phase in hardware upgrades gradually region by region vs all at once.
- Create redundancy across sites with different hardware to limit blast radius of outages.
- Maintain a shelf stock of older models as spares to handle supply chain issues.
- Place risky systems like databases on more fault tolerant hardware.
- Standardize at the OS and software level, not the hardware level.
A diversity of hardware mitigates risk and strengthens resilience against hardware-specific issues emerging globally.
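The "segment risk" and "limit blast radius" mitigations above amount to a placement constraint: spread replicas so no single hardware model hosts the whole fleet. A toy sketch of such a scheduler (the replica and pool names are invented for illustration):

```python
from collections import Counter
from itertools import cycle


def place_replicas(replicas, hardware_pools):
    """Round-robin replicas across hardware models so that a fault in any
    one model takes out only a fraction of the fleet."""
    placement = {}
    pools = cycle(hardware_pools)
    for replica in replicas:
        placement[replica] = next(pools)
    return placement


# Hypothetical fleet: 6 database replicas spread over 3 hardware generations.
placement = place_replicas(
    [f"db-{i}" for i in range(6)],
    ["gen1", "gen2", "gen3"],
)
print(placement)
print(Counter(placement.values()))  # each generation hosts 2 replicas
```

Real schedulers weigh capacity, locality, and cost as well, but the invariant is the same: a hardware-model-specific bug now degrades at most one third of this fleet instead of all of it.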
Looking at the final six lessons from Google's SRE team, drawn from their extensive 20-year experience, has been a great addition for us, and hopefully for you as well.
Don't forget to subscribe to stay updated and receive similar insightful posts that explore key SRE topics in depth, delivered directly to your inbox.