Have you ever wondered how Meta makes config rollouts safe at scale? In this episode, Pascal sits down with Ishwari and Joe to discuss Meta's approach to propagating changes across services in seconds and why that speed increases the need for strong safeguards. Catch the episode to learn about canarying and progressive rollouts, the health checks and monitoring signals used to catch regressions early, and how incident reviews focus on improving systems rather than blaming people. We also hear how data and early AI/ML efforts are slashing alert noise and speeding up bisecting when something goes wrong.
Got feedback? Send it to us on Threads (https://threads.net/@metatechpod) or Instagram (https://instagram.com/metatechpod), and don't forget to follow our host Pascal (https://mastodon.social/@passy, https://threads.net/@passy_). Fancy working with us? Check out https://www.metacareers.com/.
Links
FFmpeg at Meta: Media Processing at Scale - https://engineering.fb.com/2026/03/02/video-engineering/ffmpeg-at-meta-media-processing-at-scale/
Reliably Changing Configuration @ Scale - https://atscaleconference.com/reliably-changing-configuration-scale/
Timestamps
Intro 0:06
Introduction and Overview of Configuration Changes 2:31
Understanding Configurations in Distributed Systems 4:02
Meta's Configuration Management Systems 6:43
Safeguards and Incident Prevention 9:22
Deployment Mechanisms: Canary and Progressive Rollouts 12:06
Challenges in Configuration Consumption 14:39
Health Checks and Incident Response 17:13
Mitigation Strategies for Configuration Issues 19:18
Balancing Developer Velocity and Configuration Safety 21:09
Data-Driven Improvements in Incident Management 22:12
Leveraging AI for Change Detection 26:05
Challenges in Deployment and Testing 28:21
Reinventing Change Safety Strategies 30:24
War Stories: Learning from Past Incidents 32:59
Outro 36:10