13 min read

5 Ways to Use AI to be a Better SRE

5 Ways to Use AI to be a Better SRE
Photo by Mohamed Nohassi / Unsplash

Site reliability engineering (SRE) is all about making systems more scalable, performant, and fault-tolerant. As an SRE, you're constantly looking for ways to improve reliability, troubleshoot issues faster, and build more resilient infrastructure.

We always looked for ways to use AI in the SRE world; however, the latest wave of AI seems to revive our hunger to use AI and discover new ways to make our work easier and efficient.

In this post, we'll explore five ways AI can help SREs work smarter:

Read on to learn how AI can give your SRE skills a major upgrade across many areas.

Speedy Code Reviews

Manually reviewing code changes lines by line is tedious and time-consuming. AI can help accelerate code reviews by automatically scanning pull requests for potential bugs, security flaws, and design anti-patterns. Advanced AI code review tools like Amazon CodeGuru can detect logical issues, resource leaks, and exceptions in Java and Python code.

By leveraging AI's pattern recognition capabilities, SREs can focus code reviews on improving reliability, availability, and scalability rather than getting bogged down in style nitpicks. AI code analysis provides a safety net for catching common coding mistakes so reviewers can concentrate on the architecture and fault tolerance aspects.

For example, AI can flag redundant or idle resources, memory management issues like leaks or caching problems, exception handling oversights, and anything that might affect stability under load. This allows SREs to ensure all changes are in line with SRE best practices around redundancy, decoupling, throttling, backpressure, and graceful degradation.

The massive time savings from AI-augmented code reviews means SREs can review 2X or 3X more pull requests in the same time frame. More rigorous reviews mean higher quality, resilient code gets deployed. AI is every SRE's secret weapon for maintaining velocity without sacrificing reliability.

Example:

Given the golang code below:

func processOrder(orderId int) {

  order := getOrderFromDatabase(orderId)

  charge := chargeCreditCard(order.CreditCard)

  if charge.Success {
    shipOrder(order)
  } else {
    log.Printf("Failed to charge credit card for order %d", orderId)
  }

}

We asked Claude AI from Anthropic to review it from and SRE point of view, and be focused on resiliency and reliability, and Claude didn't disapoint:

Claude AI code review
Claude AI review code

Troubleshooting

Outages and system failures are inevitable, but AI can help SREs troubleshoot and restore service faster. AI-powered tools excel at quickly sifting through massive log datasets and identifying patterns that point to the root cause of an incident.

For example, Mezmo uses ML algorithms to group and correlate similar server logs to surface relevant events. This rapidly narrows down where an outage originated. AI assistants like ChatGPT can also take an SRE's logs and error messages and immediately suggest debugging steps, potential failure points, and hypotheses to investigate based on learned knowledge.

Some AI troubleshooting tools like Moogsoft even automate parts of the triage process. They autodetect anomalies, match incidents to known patterns, and pull relevant events from disparate monitoring systems to give SREs a head start on diagnosing the problem. This reduces the chaos and confusion when resolving time-sensitive issues.

AI dramatically cuts down the time it takes SREs to pinpoint the source of failures. Bugs that took hours or days to uncover manually can be identified in minutes with AI. This speeds up recovery and helps SRE teams continuously improve system resilience based on insights from post-incident analysis.

Example:

We will launch a pod on top of a kubernetes cluster, the pod definition below is taken from kubernetes documentation; specifically, their "Configure Liveness, Readiness and Startup Probes" page.

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: registry.k8s.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

We waited until the pod has showed that it was being restarted:

Pod Commands

Then we took the output of kubectl describe pods my-pod and just paste it into Claude AI and asked it to troubleshoot the pod. Below is a screen shot from Claude AI chat:

Pod Restart Troubleshooting
Troubleshooting Restarting Pod

I run this experiment with chatgpt as well, and it gave clear instructions and explanation:

Chatgpt Answer

Based on the provided kubectl describe output, it appears that the pod named "my-pod" is experiencing issues with its liveness probe. The liveness probe is defined as an exec action that tries to run the command "cat /tmp/healthy" inside the container every 5 seconds (period) with a timeout of 1 second. If this command succeeds (exit code 0) at least once within three consecutive attempts (#success=1 #failure=3), the container is considered healthy. Otherwise, it is considered unhealthy and may be restarted.

Here are the key points to troubleshoot the pod:

  1. Liveness Probe Failure: The liveness probe is failing with the following error message:bashCopy codeLiveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
    The liveness probe is expecting to find a file named /tmp/healthy, but it seems that the file is not present in the container, which is causing the liveness probe to fail.
  2. Last State: Terminated with Error: The container was previously terminated with an error. The relevant information is provided here:yamlCopy code
    Last State: Terminated
    Reason: Error
    Exit Code: 137
    Started: Sun, 30 Jul 2023 19:36:14 -0400
    Finished: Sun, 30 Jul 2023 19:37:28 -0400
    The container exited with code 137, which usually indicates a termination due to a process receiving a SIGKILL signal. The termination occurred at Sun, 30 Jul 2023 19:37:28 -0400.
  3. Restart Count: 1: The container has been restarted once already, likely due to the failed liveness probe.

Now, let's proceed with the troubleshooting steps:

  1. Check Liveness Probe Command: The liveness probe is trying to execute the command cat /tmp/healthy. Ensure that the file /tmp/healthy is present within the container. If it's not present or located in a different path, you need to adjust the liveness probe command accordingly.
  2. Verify Image Compatibility: Ensure that the "registry.k8s.io/busybox" image used in the container is compatible with your cluster's Kubernetes version and architecture.
  3. Review Container Logs: Check the container logs to get more information about why it terminated with an error (Exit Code 137). You can use the following command to see the logs for the "liveness" container. kubectl logs my-pod
  4. Check Resource Limits: If the container requires more resources (CPU/memory) than what is allocated to it, it might get terminated due to resource constraints. Review the pod's resource requests and limits.
  5. Update Liveness Probe: If the container doesn't actually require the liveness probe, you can consider removing it from the pod spec or modifying it to suit the container's specific requirements.
  6. Recreate the Pod: After making any necessary changes, delete and recreate the pod to apply the changes:
    kubectl delete pod my-pod
    kubectl create -f path/to/pod-definition.yaml

Remember that troubleshooting Kubernetes issues can be complex, and additional information from other sources (e.g., cluster logs, events, and configuration files) might be required to fully diagnose and resolve the problem. The information provided in the kubectl describe output is just a starting point for investigation.

Parsing Logs and Generating Queries

Logs contain a wealth of performance and diagnostic data. But making sense of massive log volumes can be a nightmare without intelligent assistance. AI comes to the rescue by automatically parsing unstructured log data and allowing SREs to query it with natural language.

Tools like Mezmo use ML to structure messy log entries into consistent fields like timestamp, app name, severity level, trace ID, etc. This metadata makes the logs searchable. SREs can simply ask questions in plain English like "show me all the debug messages from the payment service today with latency above 2 seconds."

The AI parser understands the intent and translates it into a filtering query on structured log fields. It returns the most relevant log entries matching the criteria.

For security events, AI can quickly sift through firewall and IDS logs to locate suspicious patterns like large file transfers or brute force attacks. AI makes searching massive logs as easy as having a conversation.

When diagnosing issues, the AI can also suggest high-value queries based on detected anomalies. For example, it can automatically flag a spike in HTTP 500 errors and suggest querying for associated server-side stack traces.

With AI, SREs spend less time parsing data and more time uncovering insights. AI handles the messy work of turning raw machine data into structured, queryable knowledge.

In addition, AI assistants like Chatgpt and Claude AI can generate queries/commands for logging tools like Elasticsearch, Splunk, or even Ripgerp, from natural languages.

Example:

Given a sample of Nginx access logs from this dataset, we will ask Chatgpt to generate ripgrep command to filter the 404 requests that were returned to bing:

54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/60844/productModel/200x200 HTTP/1.1" 200 5667 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/61474/productModel/200x200 HTTP/1.1" 200 5379 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
40.77.167.129 - - [22/Jan/2019:03:56:17 +0330] "GET /image/14925/productModel/100x100 HTTP/1.1" 200 1696 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
91.99.72.15 - - [22/Jan/2019:03:56:17 +0330] "GET /product/31893/62100/%D8%B3%D8%B4%D9%88%D8%A7%D8%B1-%D8%AE%D8%A7%D9%86%DA%AF%DB%8C-%D9%BE%D8%B1%D9%86%D8%B3%D9%84%DB%8C-%D9%85%D8%AF%D9%84-PR257AT HTTP/1.1" 200 41483 "-" "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0)Gecko/16.0 Firefox/16.0" "-"
40.77.167.129 - - [22/Jan/2019:03:56:17 +0330] "GET /image/23488/productModel/150x150 HTTP/1.1" 200 2654 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"

Prompt:

Given the nginx access logs above as a sample generate a ripgrep command to filter all the 404 requests that were returned to bing.

Chatgpt 4 Response:

ripgrep tutorial using chatgpt
ripgrep command for logs

Performance

SREs are constantly looking for ways to squeeze more performance out of systems. AI can help profile workloads, predict usage spikes, and optimize configurations.

Tools like AWS SageMaker Autopilot automate constructing machine learning models to forecast future demand based on historical traffic patterns. This allows SREs to rightsize capacity and avoid over or under-provisioning.

AI performance profiling examines metrics and traces to detect bottlenecks. For example, it can identify a frequently called API endpoint that needs caching. Or determine when more read replicas are needed to reduce database load.

Some AI solutions do constant real-time optimization. They dynamically tune Linux kernel settings, adjust thread pools, and balance workloads to adapt to changing conditions.

AI can also synthesize system load tests tailored to uncover performance issues. Engineers only need to specify broad scenarios like peak demand. The AI generates and runs targeted stress tests, uncovering optimizations like adding indexes or parallelism.

By constantly monitoring, predicting, and tuning performance, AI allows SREs to maximize throughput and minimize costs. AI handles the optimization grunt work at scale, freeing engineers to focus on design and reliability.

LLMs AI assistants can also help troubleshoot performance issues; for instance, if we have a slow query, we can feed AI Assistant the mysql table definition + output of Explain of query, and let LLM handle analyzing the query plan to identify missing indexes, table scans, or suboptimal join criteria and provide specific optimization recommendations to speed up the query.

Example:

We will use a mysql table and run a query that won't perform well, then ask chatgpt to improve the performance.
Table definition:

CREATE TABLE `product_catalog` (
  `id` INT(11) NOT NULL AUTO_INCREMENT,
  `name` VARCHAR(255) NOT NULL,
  `description` TEXT,
  `price` DECIMAL(10,2) NOT NULL,
  `sku` VARCHAR(100) UNIQUE NOT NULL,
  `category` VARCHAR(255),
  `quantity` INT(11) NOT NULL,
  `created_at` TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  `updated_at` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
);

Slow query:

SELECT * FROM `product_catalog` WHERE quantity > 10 and quantity < 50

Explain Analyze:

EXPLAIN Analyze SELECT * FROM `product_catalog` WHERE quantity > 10 and quantity < 50;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: ((product_catalog.quantity > 10) and (product_catalog.quantity < 50))  (cost=520 rows=550) (actual time=0.125..5.53 rows=1945 loops=1)
    -> Table scan on product_catalog  (cost=520 rows=4953) (actual time=0.081..4.91 rows=5000 loops=1)
 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Chatgpt Response:

Chatgpt Improve Query Performance Based on Explain Analyze
Chatgpt Improve Query Performance Based on Explain Analyze

From the screenshot above, we can see that Chatgpt was able to pinpoint where the issues that caused the performance, how to mitigate/solve it, and the downside that comes with indexes.

System Design Improvement

Humans have limited ability to model complex systems and interactions in their heads. AI can crunch massive amounts of performance data to detect suboptimal design patterns and suggest improvements.

For example, an AI architect tool could identify frequent throttling and faults associated with a monolithic codebase. It may recommend fault tolerance design and code improvement.

AI can also evaluate tradeoffs around consistency, latency, throughput, and cost for database architectures. It can simulate migrations to optimized NoSQL designs that remove bottlenecks at scale.

For large operations teams, AI can analyze organizational charts and project workflows to recommend cross-training, splitting siloed teams, and streamlining handoff processes to improve velocity.

An AI system design assistant could take high-level requirements like user count and QPS targets and generate cloud architecture diagrams, resource/cost estimates, and implementation guidelines customized to leverage SRE best practices.

By continuously optimizing architecture as systems evolve, AI helps SREs move closer to ideals like self-healing, automated scaling, and zero-touch operations. AI turns human-unfriendly complexity into strategic advantage.

Example:

In this example we want to simply estimate the hardware needed for an e-commerce site that will have 1000 active users per hour.

Prompt:

estimate the hardware/capacity that will be needed for an e-commerce site that will have 1000 active user per hour

Claude AI Response:

Here is an estimate of the hardware and capacity needed to support an e-commerce site with 1000 active users per hour:
Web/App Servers: To handle 1000 concurrent users, I'd estimate you need around 5-10 web/app servers. Using 2 core/4GB RAM VMs, this would mean 10-20 cores and 20-40GB RAM for the web/app tier.
Database Server: The database needs to support peak load of ~1000 writes/reads per second. For this I'd recommend a 4 core/16GB RAM MySQL or PostgreSQL database server. Can scale up or out as needed.
Caching Servers: 1 or 2 memcached/Redis servers with 4 cores/16GB RAM each to cache requests and session data.
Search Servers: If using Solr/Elasticsearch for search, allocate 2-3 servers with 8 cores/32GB RAM each.
Storage: User uploads/content and product data may need 1-2TB or more of storage using a SAN/NAS. RDBMS storage around 500GB+ on fast disks.
Bandwidth: Estimate 50Mbps downstream and 15Mbps upstream per 100 concurrent users, so around 500Mbps downstream and 150Mbps upstream.
Load Balancer: Use L7 load balancer to distribute load across web servers.
Backups/Logging: Ensure adequate disk space for server and database backups and application logs.
Actual sizing will depend on traffic patterns, product catalog size, and performance requirements. Start small and scale based on load tests and growth. Use auto-scaling on cloud platforms.

Above we can see how Claude AI provided a detailed hardware estimation that we can start from and modify as we see fit. and we can even ask followup question & requests like:

  • How much will this hardware estimation cost on GCP, Azure, and AWS?
  • What platforms can I use to not deal with hardware at all?
  • Can sqlite be used for this e-commerce site?
  • generate a system design for this e-commerce site based on the estimate provided

Conclusion:

he rapid advances in artificial intelligence open exciting new possibilities for how SRE teams can build, run, and manage complex systems. As this post illustrated, AI can help SREs work smarter and more efficiently across a range of activities like code reviews, troubleshooting, log analysis, performance testing, and architecture design.

Adopting AI is a force multiplier for SRE velocity and effectiveness. System reliability engineering will become increasingly AI-assisted over time. SREs who embrace this shift will find themselves freer to innovate, solve bigger problems, and design future-proof systems to delight their users. The machines are here to help.


If you enjoyed this, you will also enjoy all the content we have in the making!
We are working on a similar blog but without leaving the code editor! we will focus on Github Copilot and Cody.