Cloud Dialogues

Technology

The Art and Science of Site Reliability Engineering with Liz Fong-Jones

2024-10-10


In this exciting episode of Cloud Dialogues, we are joined by Liz Fong-Jones, Field CTO at Honeycomb and former Google SRE, to explore the fascinating world of Site Reliability Engineering (SRE)—a game-changer for scaling and automating large systems.

What We Covered:

1. Meet Liz Fong-Jones: Liz brings over a decade of SRE experience from her time at Google and Honeycomb, helping companies revolutionize how they manage reliability and automation.

2. The Origin Story: SRE actually predates the cloud! Born at Google in the early 2000s, SRE started as a way to automate manual system administration tasks and has since evolved into its own discipline, running parallel to DevOps.

3. SRE at Its Core:
- Minimize repetitive work (aka "toil") by automating everything you can.
- Use Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure and maintain reliability.

4. Different SRE Models: There are different ways to implement SRE:
- Tools-based within platform teams
- Consultative SREs parachuting in to help teams
- Embedded SREs integrated within every team

5. The SRE Mindset: Curiosity and empathy are essential for SREs. Teams need a culture of psychological safety where concerns can be raised without fear.

6. The Magic of SLOs and SLIs: SLOs set reliability targets (like aiming for 99.5% uptime), while SLIs are the measurements that show how the system is actually performing against those targets. Together, they tell you whether your systems are as reliable as you intend them to be.

7. FinOps Meets SRE: Liz explains how SREs can help balance reliability, performance, and costs using SLOs to allocate resources more efficiently.

8. Disaster Testing: Want proof SREs are ready for anything? Honeycomb regularly tests its disaster recovery by taking down an entire availability zone—on purpose!

9. Pro Tips for Executives: Thinking about implementing SRE at your company? Liz suggests starting with your biggest challenges, offering executive support, and setting clear, achievable SLOs.

10. Why Observability Matters: Observability is the backbone of SRE. Having real-time, actionable data is key for setting and managing effective SLOs.
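
To make the SLO and SLI relationship from point 6 concrete, here is a minimal Python sketch of the arithmetic. It is an illustration under stated assumptions, not something taken from the episode: the function names, the event counts, and the choice of a request-based SLI are all hypothetical, and only the 99.5% target comes from the notes above.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service Level Indicator: fraction of events that were 'good'."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the target
    return good_events / total_events


def error_budget_remaining(good_events: int, total_events: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # nothing served, so nothing spent
    allowed_bad = (1.0 - slo_target) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events          # failures observed
    return 1.0 - actual_bad / allowed_bad


SLO_TARGET = 0.995                # the 99.5% example target from point 6
good, total = 997_000, 1_000_000  # made-up numbers for illustration

measured = sli(good, total)
budget_left = error_budget_remaining(good, total, SLO_TARGET)

print(f"SLI: {measured:.4f} (meets SLO: {measured >= SLO_TARGET})")
print(f"Error budget remaining: {budget_left:.0%}")
```

With these made-up numbers the service measures 99.7% against a 99.5% target, so it has used 3,000 of its 5,000 allowed failures and roughly 40% of the error budget is left. A remaining budget like this is one way a team might decide whether to spend on more reliability or hold back, which is the kind of trade-off point 7 describes.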

Plus, Liz talks about her favorite ARM processors (for cost and environmental savings) and shares insights from her book Observability Engineering.

This episode is a deep dive into SRE, filled with actionable insights and strategies for leaders looking to supercharge their reliability game. You won’t want to miss it!