SRE Interview Prep Plan (Week 2)

Oct 12, 2023

Week 1: Fundamentals of SRE
Week 2: Automation & Scripting (This post)
Week 3: Monitoring, Logging, and Alerting
Week 4: Incident Management Lifecycle
Week 5: Scalability, Performance, & System Design
Week 6: Mock Interviews and Revision

As we step into the second week of our 6-Week SRE Interview Preparation Plan, it’s time to build upon the solid foundation laid during the first week. Week 1 was your initiation into the core essence of Site Reliability Engineering (SRE), where you navigated through the fundamentals of Linux/Unix systems and networking. You’ve grasped the basics; now it’s time to answer a question well known within the SRE community:

To Leetcode or not to Leetcode? That is the Question.

Leetcode's relevance often depends on the specific companies you're interviewing with. It's recommended to begin with Leetcode's easy to medium levels. Python is a favored choice due to its user-friendliness and widespread popularity.

This week is dedicated to providing you with the skills and knowledge to automate routine tasks, create scripts to solve complex problems, and manage infrastructure as code. As we look at scripting languages like Python and Bash, and explore the various Infrastructure as Code (IaC) platforms like OpenTofu and Ansible, you'll discover how automation forms the backbone of SRE's capability to manage large-scale, reliable services. The upcoming days are going to be a blend of learning, practicing, and grokking the art of automation and scripting.

Each day of this week brings you one step closer to not only acing your SRE interviews but also becoming the SRE who can leverage code & infrastructure to perfect systems reliability.

Days 1-3: Introduction to Automation and Scripting

Automation and scripting are fundamental to the Site Reliability Engineering (SRE) practice, providing several significant benefits:

Reducing Toil: automation significantly reduces toil, which is the repetitive and mundane work that SREs might otherwise have to perform manually. By automating routine tasks, SREs can spend more time on strategic, high-impact projects.
Enhancing Reliability: automation ensures that processes are carried out consistently and accurately, reducing the likelihood of human error, which in turn enhances the reliability of systems.
Increasing Efficiency: scripting allows for the rapid execution of tasks that might take a human operator much longer to perform manually. This increases operational efficiency and allows for faster response times, especially during incidents.
Scaling Operations: as systems grow in size and complexity, manual operations become unsustainable. Automation and scripting enable SREs to manage large-scale, complex systems effectively.
Improving Incident Management: automated monitoring, alerting, and remediation scripts help in quicker detection and resolution of incidents, minimizing downtime and improving system availability.
Infrastructure as Code (IaC): automation enables the practice of IaC, allowing SREs to manage infrastructure using code and automation tools, which ensures consistent and repeatable deployments.
Enhancing Security: automated processes can enforce security best practices consistently across the infrastructure, reducing the risk of security breaches.
Providing Measurement and Monitoring: automation tools provide better tracking, logging, and monitoring of system events and changes, which is crucial for post-incident reviews and continuous improvement.
Accelerating Development and Deployment: automation and scripting streamline the deployment process, making it faster and more reliable, which in turn supports more rapid development cycles.
Facilitating Continuous Improvement: automation provides a framework for continuously measuring, monitoring, and improving system performance and reliability, aligning with the SRE principle of continuous improvement.
Knowledge Sharing and Collaboration: scripts and automation tools encapsulate knowledge about system management and operations, facilitating knowledge sharing and collaboration among SREs and other teams.

Automation and scripting are not just tools but a philosophy in the SRE culture that encourages solving problems with code, promoting efficiency, scalability, and reliability in the systems being managed.

Resources:

Automate the Boring Stuff with Python (book)
Eliminating Toil (workbook)
The Evolution of Automation at Google
The Hitchhiker's Guide to Python (book)
- Command-Line Applications (chapter)
- System Administration (chapter)
- Networking (chapter)
- Continuous Integration (chapter)

Questions:

Describe a scenario where you automated a routine task using scripting. What language did you use and what was the outcome?
How would you approach automating the deployment of a new service in a cloud environment? Mention any tools or frameworks you would use.
Discuss an instance where automation significantly improved a process or resolved a problem in your previous work experience.
What are some important considerations when writing scripts for automation in an SRE context?
How do you ensure the reliability and error-handling of scripts or automation workflows you create?
Describe your experience with any configuration management tools like Ansible, Puppet, or Chef. How do they aid in automation?
Explain the concept of idempotence and its importance in automation scripts.
How would you automate the monitoring and alerting for a distributed system? Discuss any tools or platforms you would leverage.
Discuss a complex scripting or automation project you've worked on. What challenges did you face and how did you overcome them?
Should the use of scripts be temporary or permanent? Why?
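
To make the idempotence question above concrete, here is a minimal sketch in Python. The helper name `ensure_line` and the demo file name are hypothetical, chosen only for illustration. The idea it tries to show: an idempotent operation describes the state you want, so running it a second time changes nothing.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already correct.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already holds, so do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config = Path(tmp) / "demo.conf"  # hypothetical file, for the demo only
        print(ensure_line(config, "vm.swappiness = 10"))  # first run adds the line
        print(ensure_line(config, "vm.swappiness = 10"))  # second run is a no-op
```

A naive version that always appends would grow the file on every run. Checking first is what makes it safe to run the script from cron or a retry loop.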

Days 4-5: Infrastructure as Code

Infrastructure as Code (IaC) is an approach in which infrastructure configuration and provisioning are managed and automated through code, rather than through traditional manual processes. Using descriptive language files, IaC allows developers and IT professionals to automatically set up, modify, and version the entire infrastructure or individual components, ensuring consistency, repeatability, and scalability.

By treating infrastructure as a software system, organizations can apply software development best practices, such as version control, testing, and continuous integration, to their infrastructure, thus bridging the gap between development and operations. This paradigm shift not only accelerates deployment but also reduces the risk of human errors, fostering a more resilient and efficient operational environment.
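
One way to picture the declarative idea described above is a toy "plan" step: compare the state you declared with the state that exists, and list the actions needed to close the gap. The sketch below is a simplified illustration in Python, not how Terraform, OpenTofu, or Ansible are implemented. The function name `plan`, the resource names, and the attributes are all made up for the example.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Compare desired and actual state and return the actions needed."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical resources: what the code declares vs. what currently exists.
desired = {
    "web-1": {"size": "small", "image": "ubuntu"},
    "web-2": {"size": "small", "image": "ubuntu"},
}
actual = {
    "web-1": {"size": "medium", "image": "ubuntu"},
    "old-db": {"size": "large", "image": "ubuntu"},
}

for action, name in plan(desired, actual):
    print(action, name)
# prints:
# update web-1
# create web-2
# delete old-db
```

When the actual state already matches the declaration, the plan is empty. That is the property that makes repeated runs safe. Real tools add much more on top, such as dependency ordering and persisted state.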

Resources:

Questions:

What is Infrastructure as Code (IaC) and how does it differ from traditional infrastructure management?
Discuss the advantages and potential challenges of implementing IaC in a large-scale organization.
How do tools like Terraform ensure idempotence in infrastructure provisioning, and why is it important?
Describe a scenario where you used Ansible to automate a specific task or process. What modules did you use and what was the outcome?
Explain the principle of declarative vs. imperative infrastructure, providing examples of tools or languages that embody each approach.
How would you handle sensitive information, like passwords or API keys, when using IaC tools?
Discuss the role of state management in Terraform. How does it help in managing and modifying infrastructure?
Describe a complex infrastructure setup you've provisioned using IaC. What challenges did you encounter and how did you address them?
In Ansible, how do you ensure that tasks are executed in a specific order, especially when there are dependencies?
How do you handle versioning and collaboration when working on IaC projects with a team? Discuss any best practices or tools you employ.

Days 6-7: Practice scripting and automation tasks

The last part of Week 2, Days 6 and 7, is dedicated to a hands-on immersion in scripting and automation tasks, strengthening the theoretical foundations laid earlier in the week. During these days, the focus should be on practical challenges, ranging from writing intricate scripts and automating routine tasks to designing complex automation workflows that mimic real-world scenarios. By confronting and resolving these challenges, you will not only refine your scripting skills but also gain a deeper understanding of how automation integrates into the SRE ecosystem.

This hands-on approach ensures that learners are not just equipped with theoretical knowledge but also possess the practical expertise required to excel in real-world SRE roles.

Resources:

Questions/Practice Problems:

Shell Scripting: Write a bash script that monitors a specific directory for any new files. If a new file appears, the script should automatically back it up to a designated backup directory.
Python Automation: Develop a Python script that fetches real-time CPU and memory usage of a system, and if either exceeds 80%, sends an alert email to the administrator.
Ansible: Create an Ansible playbook that sets up a basic LAMP (Linux, Apache, MySQL, PHP) stack on a remote server.
Terraform: Write a Terraform configuration to provision an AWS EC2 instance, ensuring it is of type t2.micro, running Ubuntu, and has a specific security group attached.
Regular Expressions: Script a solution that scans a log file and extracts all IP addresses that have made more than 100 requests within an hour.
Automation Workflow: Design a workflow that automatically deploys a web application from a Git repository to a staging environment whenever a new commit is pushed to the main branch.
Performance Scripting: Create a script that identifies the top 5 CPU-consuming processes on a system and outputs the results in a readable format.
Database Automation: Draft a script that automatically takes daily backups of a database, compresses them, and stores them in a designated backup location.
Networking Script: Develop a script that pings a list of servers from an input file and reports back which servers are unreachable.
Configuration Validation: Using your preferred automation tool, set up a task that ensures all machines in an environment have a specific version of a software package installed and, if not, automatically updates them.
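
As a starting point for the Regular Expressions problem above, here is one possible sketch in Python. It rests on several assumptions that your own log may not meet: lines follow the Common Log Format (IPv4 address first, timestamp in square brackets), entries are in chronological order, and "within an hour" means any sliding 60-minute window. The function name `busy_ips` is hypothetical.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone

# Assumed format: 10.0.0.1 - - [12/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512
LINE_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ \[([^\]]+)\]")
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def busy_ips(lines, threshold=100, window=timedelta(hours=1)):
    """Return the set of IPs with more than `threshold` requests in any window."""
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not look like access-log entries
        ip, raw_time = match.groups()
        when = datetime.strptime(raw_time, TIME_FORMAT)
        times = recent[ip]
        times.append(when)
        # Drop requests that are an hour or more older than the current one.
        while when - times[0] >= window:
            times.popleft()
        if len(times) > threshold:
            flagged.add(ip)
    return flagged


if __name__ == "__main__":
    start = datetime(2023, 10, 12, 13, 0, 0, tzinfo=timezone.utc)
    sample = []
    for i in range(101):  # 101 requests, one second apart, from one address
        stamp = (start + timedelta(seconds=i)).strftime(TIME_FORMAT)
        sample.append(f'10.0.0.1 - - [{stamp}] "GET / HTTP/1.1" 200 512')
    print(busy_ips(sample))
```

To run it against a real file, pass the open file object, for example `busy_ips(open("access.log"))`. If your log uses a different layout, `LINE_RE` and `TIME_FORMAT` are the two things to adjust.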

As we conclude our second week, it's evident that Site Reliability Engineering is a mix of theory and hands-on expertise. Through practical challenges and real-world scenarios, you've seen firsthand how automation stands as the backbone of operating large-scale systems.

Peeking into Week 3, a whole new side of SRE awaits. You will focus on monitoring and alerting, the heartbeats of system reliability, and discover how SREs keep the pulse of systems strong by detecting anomalies before they escalate.

Codereliant’s Substack
