
Learning From the Google SRE Team (Part 1)


Recently, the Google Site Reliability Engineering team published a blog post called "Lessons Learned from Twenty Years of Site Reliability Engineering." The post highlights lessons that have proven valuable over time and through challenges; they are derived from incidents affecting some of Google's major systems and products, like YouTube and Google Calendar.

In this blog post, we aim to expand on the first 5 lessons shared by Google's Site Reliability Engineering team, offering a closer look at practical implementation examples. Our objective is to illustrate how you can apply these lessons to ensure the robustness and health of your systems. By drawing on real-world scenarios, we hope to provide a richer understanding of these insights, empowering you to enhance your strategies for maintaining system uptime and overall operational efficiency. Also, we aspire to bridge the gap between theory and practice, facilitating a more informed approach to tackling challenges in the world of SRE.

Outage severity guides risk appetite

The first lesson from their two decades of experience:

The riskiness of a mitigation should scale with the severity of the outage

Several questions come to mind when you want to apply this lesson; here are a few thoughts that should help:

  • The more severe outages generally warrant riskier mitigation actions, since the potential damage of not acting is greater. However, the riskiness of the mitigation should be proportional to the severity, not necessarily scale directly.
  • Consider both the likelihood and impact of the mitigation action failing or backfiring. A very risky action that has a high likelihood of failure may not be justified even for a very severe outage.
  • The risk appetite of the organization should also be considered. More risk-averse cultures may prefer relatively cautious mitigations even for severe issues. Other organizations may be willing to take bigger risks.
  • Outage severity isn't the only factor; also, consider duration, affected users, damage to reputation, etc. A short but severe outage may warrant a different response than a moderate but prolonged one.
  • Have contingencies ready for if the risky mitigation fails or makes things worse. The mitigation plan B could be less risky but take more time, for example.
  • Consult experts when assessing the risks of mitigation options, when feasible. Experienced engineers may know the likelihood of failure or side effects better than non-technical managers.

Examples:

This lesson may not have a direct practical implementation, as it is closely tied to specific incidents and the solutions available at the time. Nonetheless, below we provide examples of incidents of varying severities along with potential solutions, to offer an idea of how this lesson might be contextualized and applied in different scenarios.

  • In the case of an internal DDoS (Distributed Denial of Service) attack affecting multiple services, a viable solution could be to halt traffic from the offending service until the issue is resolved. While this service might be crucial for delivering a particular feature, stopping its traffic temporarily can prevent further damage and allow for a controlled, focused resolution. This measure underlines the importance of assessing the risk and reacting proportionally, to ensure that a bigger problem isn't created while trying to solve a smaller one.
  • A memory leak causing instances to restart, impacting uptime and causing a 5% increase in error rate. In this scenario, increasing the memory slice size, boosting capacity, or proactively restarting instances when memory usage hits 90% are practical steps to manage the situation (see the sketch after this list). These measures serve as temporary cushions, allowing for continued operation while the underlying bug causing the memory leak is identified and resolved. This example underscores the value of having flexible, responsive strategies in place to maintain system health and uptime, especially when dealing with unforeseen issues that require a more thorough investigation and fix.
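
To make the second example concrete, below is a minimal sketch of the "proactively restart at 90% memory" mitigation. It assumes a Node.js service running under an orchestrator (such as Kubernetes) that replaces exited instances; the threshold, check interval, and drain delay are illustrative, not prescriptive.

// memory-watchdog.js
// Periodically check heap usage and exit gracefully when it crosses a
// threshold, relying on the orchestrator to replace this instance before
// the leak causes an unplanned crash.

const THRESHOLD = 0.9;            // restart when heap usage exceeds 90%
const CHECK_INTERVAL_MS = 30000;  // check every 30 seconds

function checkMemory() {
  const { heapUsed, heapTotal } = process.memoryUsage();
  const usage = heapUsed / heapTotal;

  if (usage >= THRESHOLD) {
    console.error(`Heap usage at ${(usage * 100).toFixed(1)}%, restarting instance`);
    // Give in-flight requests a moment to drain, then exit so the
    // orchestrator brings up a fresh instance.
    setTimeout(() => process.exit(1), 5000);
  }
}

// unref() so the watchdog never keeps an otherwise-finished process alive.
setInterval(checkMemory, CHECK_INTERVAL_MS).unref();

Keep in mind this is a stopgap: it buys uptime while the leak itself is being tracked down, which is exactly the spirit of the lesson.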

Untested recovery is nonexistent recovery

The real concern during a major data loss scenario isn't just the ability to create backups, but ensuring a seamless process to restore and run your database from one of the latest backups. Discovering corrupt, empty, or non-functional backups at such a critical time can worsen the situation. Hence, the focus should be on the entire backup and restore procedure, ensuring it works flawlessly when needed.

Recovery mechanisms should be fully tested before an emergency

Therefore, it's important to stress not only the recovery mechanisms but also their testing. Whenever we decide on a recovery process, the cost of its testing should be factored into the decision-making.

Examples:

  • Data backups: This should be tested continuously using production backups, but in an isolated environment, to ensure the recovery processes work accurately without affecting the live systems.
  • Load shedding: It should be included as part of a load test that is conducted either periodically or continuously, to ascertain the performance and reliability of the recovery mechanisms under varying levels of demand.
  • Traffic shift: This process should be effortless to execute, and ideally, it should be run weekly, if not daily, to ensure the utmost readiness and reliability of the recovery mechanisms in place.
  • Retries: Injecting errors between services should be an integral part of the chaos engineering tests conducted to evaluate the tolerance of services to transient errors (a minimal sketch follows this list). Through this practice, you can better understand the resilience of your systems and identify areas for improvement to ensure smoother operation even under unpredictable conditions.
  • Circuit Breakers: Testing the mechanism that halts the flow of traffic to failing systems to prevent further degradation. Similar to retries, this should be part of resiliency testing that is conducted continuously.
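
As referenced in the retries bullet, here is a minimal, self-contained sketch of error injection in a test. The names makeFlakyService and withRetries are hypothetical; the point is that the test deliberately injects transient failures and asserts that the retry path actually recovers.

const assert = require('assert');

// Hypothetical flaky dependency: fails the first `failures` calls, then succeeds.
function makeFlakyService(failures) {
  let calls = 0;
  return async () => {
    calls += 1;
    if (calls <= failures) throw new Error('transient error');
    return 'ok';
  };
}

// Retry wrapper under test: retries with a simple linear backoff.
async function withRetries(fn, attempts = 3, backoffMs = 100) {
  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffMs * i));
    }
  }
}

// The "chaos" part: inject two transient failures and verify the caller recovers.
(async () => {
  const flaky = makeFlakyService(2);
  const result = await withRetries(flaky);
  assert.strictEqual(result, 'ok');
  console.log('retry logic tolerated the injected transient errors');
})().catch((err) => {
  console.error('recovery mechanism failed under injected errors:', err);
  process.exit(1);
});

The same shape works for the other mechanisms on the list: exercise the recovery path on purpose, on a schedule, so you find out it is broken in a test run rather than during an outage.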

Limit Risk With Incremental Canaries

Have you ever witnessed a change that took a system down or impacted users simply because it did not go through a canary strategy? We have witnessed this countless times during our careers; we have even worked for companies where no change would ship without a canary and gradual rollout approved by the SRE team. It seems the Google SRE team has experienced outages in both YouTube and Google Calendar due to changes that were not canaried; hence, their third lesson is:

Canary all changes

What do they mean by all changes? They mean all of them, no matter how small or big the change is.

From the environment variable that no one cares about to the latest release you are about to deploy. Below are some notes to help you understand this better:

  • Roll out changes to a small subset of users or servers first. Monitor carefully before deploying more widely.
  • Choose canary groups that represent a cross-section of users, geographies, use cases etc, to maximize detection of potential issues.
  • Define success criteria upfront for canary deployments. This includes metrics to monitor and thresholds that indicate a successful or failed canary.
  • Have an automated rollback process ready to go for canary groups if issues arise. Minimize manual steps to speed up rollback.
  • Gradually increase the size of the canary pool over time if no problems emerge. Go slowly at first though - 1% to 5% to 10% to 25% of users/servers, then 50% etc.
  • Monitor canary pools for a sufficient length of time during each ramp up. Issues like memory leaks may take time to emerge.
  • Perform canarying across multiple stages - dev, test, staging environments first before production canaries.
  • Have a plan for partial rollbacks if needed - reverting just the canary users versus a full revert.
  • Analyze canary issues deeply to understand root cause before moving forward with wider rollout.
  • Consider canarying database/config changes separately from app changes, to limit scope of issues.

In summary, canary slowly and carefully with good monitoring to limit outage risks from changes. But have automated rollback plans ready just in case.
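
To illustrate, here is a sketch of what a gradual rollout loop could look like. setCanaryWeight, getErrorRate, and rollback are placeholders for hooks into your own deployment and monitoring systems, and the stages, threshold, and soak time are illustrative.

// canary-rollout.js
// Ramp the new version through increasing traffic percentages, watch a
// health signal at each stage, and roll back automatically on regression.

const STAGES = [1, 5, 10, 25, 50, 100];  // percent of traffic on the new version
const ERROR_RATE_THRESHOLD = 0.01;       // abort if error rate exceeds 1%
const SOAK_TIME_MS = 15 * 60 * 1000;     // watch each stage for 15 minutes

async function rolloutWithCanary({ setCanaryWeight, getErrorRate, rollback }) {
  for (const percent of STAGES) {
    await setCanaryWeight(percent);
    console.log(`Canary at ${percent}% of traffic, soaking...`);

    await new Promise((resolve) => setTimeout(resolve, SOAK_TIME_MS));

    const errorRate = await getErrorRate();
    if (errorRate > ERROR_RATE_THRESHOLD) {
      console.error(`Error rate ${errorRate} above threshold, rolling back`);
      await rollback();
      return false;
    }
  }
  console.log('Canary healthy at every stage, rollout complete');
  return true;
}

module.exports = { rolloutWithCanary };

In practice the health signal would be richer than a single error rate (latency, saturation, business metrics), but the structure stays the same: small steps, explicit success criteria, automated rollback.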

Emergency & generic mitigations

Big Red Stop Button

If every service or piece of software had a big red button we could push to avoid an incident, our jobs would be much easier. That is, however, something we can build, and before any change we should always strive to ask ourselves the question below:

Do I Have a "Big Red Button"?

"Big Red" buttons aren't one-size-fits-all solutions. Rather, they serve as generic mitigations we have ready for a range of issues that could come up while managing a service or system. Each button addresses a set of known problems, acting as a quick-response mechanism to maintain stability during unforeseen situations.

Having a "big red button" and generic mitigations available can be useful strategies when responding to outages:

  • The "big red button" refers to having a single, well-designed way to quickly take a system offline or roll back in an emergency. This removes delays in reacting.
  • It should fully roll back or disable the problem system to stop damage, even if the root cause is unknown. Speed is key.
  • Implement with caution because you don't want it to be too easy to trigger accidentally. Safeguards should be in place.
  • The button itself can be literal or figurative; an actual switch, master kill script, or just a widely understood shorthand for emergency procedures.
  • Generic mitigations are pre-defined actions that can quickly stop common outage causes or symptoms. For example, adding server capacity, disabling non-essential features, restarting overloaded components.
  • These provide options to try immediately, before you have full insight into the outage. They buy time.
  • Have playbooks documenting the generic mitigations ready to go. Include criteria for when to use each one.
  • Retire mitigations that are no longer relevant or get superseded by better responses. Keep the list fresh.

Below are examples of "Big Red Buttons" that you can implement and that we think should be handy for most SREs:

  • Rollback Button: To revert recent changes and return to the previous stable version.
  • Traffic Shifting Button: To divert traffic away from problematic areas to maintain service availability.
  • Service Toggling Button: To disable non-essential services to preserve resources.
  • Emergency Stop Button: To halt processes that are causing or exacerbating an issue.
  • Cache Clearing Button: To resolve issues stemming from stale or corrupted cache data.
  • Load Shedding Button: To drop non-critical requests and ease the load on the system during a spike in traffic.

Each of these buttons can provide a swift action to mitigate ongoing or potential issues.
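
As one concrete sketch, the load shedding button could be as simple as a feature flag checked by a middleware. The example below uses Express; the flag store, routes, and the notion of "critical" paths are illustrative, and in a real system the flag would live in a config or feature-flag service that operators can flip.

// load-shedding-button.js
const express = require('express');

const flags = { shedNonCritical: false };  // the "button"

function loadSheddingMiddleware(req, res, next) {
  const critical = req.path.startsWith('/checkout') || req.path.startsWith('/health');
  if (flags.shedNonCritical && !critical) {
    // Fail fast with a retryable status instead of overloading the backend.
    res.set('Retry-After', '30');
    return res.status(503).send('Temporarily shedding non-critical traffic');
  }
  next();
}

const app = express();
app.use(loadSheddingMiddleware);

// Operator-only endpoint that flips the button (protect it properly in real systems).
app.post('/admin/shed-load/:state', (req, res) => {
  flags.shedNonCritical = req.params.state === 'on';
  res.json({ shedNonCritical: flags.shedNonCritical });
});

app.get('/health', (req, res) => res.send('ok'));

app.listen(3000);

The other buttons follow the same pattern: a pre-built, well-tested switch plus a documented playbook entry describing when to press it.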

Integration Testing > Unit Testing

Let's define what unit and integration testing are:

  • Unit testing: testing individual components or units of a software to ensure they work as intended independently.
  • Integration Testing: testing the interactions between different components or units to ensure they work as intended when integrated.

Now that we understand both: as SREs, if we have to choose one or the other, we should always prefer integration testing over unit testing. If we can have both, that's even better.

The significance of integration testing comes into play especially when all unit tests pass successfully, yet there could still be bugs lurking in the interactions between these units. Integration testing is designed to catch such bugs by validating how different units work together. Moreover, an integration test can encompass a whole product feature, ensuring not only the correct interaction between units but also the correct functioning of the feature as a whole, providing a more holistic verification of the system's behavior. This dual approach of unit and integration testing ensures a more robust and reliable software product.
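
To make the contrast concrete, here is a minimal unit test sketch using Node's built-in test runner (available in Node 18+). The validateEmail helper is hypothetical; the point is that the test exercises one function in isolation, with no browser, network, or database involved, unlike the integration test shown at the end of this section.

const test = require('node:test');
const assert = require('assert');

// Hypothetical unit under test: a small, pure validation helper.
function validateEmail(email) {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}

// Unit tests check the helper in isolation.
test('accepts a well-formed address', () => {
  assert.strictEqual(validateEmail('test@example.com'), true);
});

test('rejects an address without a domain', () => {
  assert.strictEqual(validateEmail('test@'), false);
});

Both tests can pass while the sign-up flow as a whole is still broken, which is exactly the gap integration testing is meant to close.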

Unit tests alone are not enough - integration testing is also needed

Relying solely on unit tests is not sufficient for robust testing; integration testing is also crucial:

  • Unit tests validate individual functions/classes in isolation, but not how they interact with other components.
  • Integration testing verifies interfaces between modules, proper system architecture, and end-to-end workflows.
  • Mocking between units can hide issues that appear when integrated. Testing real integrated environments uncovers edge cases.
  • Performance problems like latency, race conditions, and bottlenecks only manifest at the integration level; unit tests pass regardless.
  • Major bugs are often integration issues like configurations, APIs, database schemas, network protocols. Unit tests alone miss these.
  • UI flows, user scenarios, authorization also require integration testing as they involve many parts working together.
  • Test coverage metrics should include integration test paths, not just unit test lines.
  • When bugs occur, carefully examine if new integration tests should be added to prevent regressions.
  • Leverage automated integration testing as much as possible to supplement in-depth manual testing.

Example:

Below is an example of an integration test that uses Playwright to test the sign-up feature of a hypothetical website named codereliant.io.

const { test, expect } = require('@playwright/test');

test('Sign-up Test for codereliant.io', async ({ page }) => {
  // Navigate to the sign-up page
  await page.goto('https://codereliant.io/signup');

  // Fill out the sign-up form
  await page.fill('input[name="username"]', 'testuser');
  await page.fill('input[name="email"]', 'test@example.com');
  await page.fill('input[name="password"]', 'testpassword');

  // Click the sign-up button
  await page.click('button[type="submit"]');

  // Check for a success message or redirection to a new page
  const successMessage = await page.textContent('.success-message');
  expect(successMessage).toBe('Sign-up successful! Welcome to CodeReliant.');
});

Playwright navigates to the sign-up page of codereliant.io, fills out the sign-up form, clicks the sign-up button, and checks for a success message to confirm the sign-up process was successful.

Conclusion

The initial five lessons from Google's SRE team's 20-year journey have been enlightening for us, and hopefully for you too, given that Google operates some of the largest systems out there.
We've looked at important SRE topics, so stay tuned and subscribe to receive an in-depth exploration of the next five lessons in your inbox.