Corey Martin leads the discussion with two developers about production incidents they were personally involved in. Their goal is to inform listeners on how they discovered these issues, how they resolved them, and what they learned along the way.
Ifat Ribon is a Senior Developer at LaunchPad Lab, a web and mobile application development agency headquartered in Chicago. For one of their clients, they developed an application to assist with the scheduling of janitorial services. It was built with a fairly simple Ruby on Rails backend, leveraging Sidekiq to process background jobs. As part of its feature set, the app would send text messages to let employees know their schedule for the week; these schedules were assembled by querying the database several times, fetching frequencies and availabilities of workers. Unfortunately, a client noticed a discrepancy between how many notices were being sent out, versus how many jobs they knew they had: of the 400 jobs total, only 150 had notifications. It turned out that all of the available database connections were being exhausted--but that was only half of the issue. Sidekiq was attempting to process far too many jobs at once, and each of these jobs were responsible for connecting to the database, exhausting the available pool. The solution Ifat settled on was to reduce the number of parallel jobs processed while increasing the number of connections to the database. From this experience, she also learned the importance of understanding how all these different systems interconnect.
Christopher Ostrowski is Chief Technology Officer at Dutchie an e-commerce platform for the cannabis industry. One Christmas Eve, while celebrating with his family, Chris began receiving pager notifications warning him about some sluggish API response times. Since it didn't really have any significant end user impact, he ignored it and went back to the festivities. As the night went on, the warnings became significant alerts, and he pulled together a response team with colleagues to figure out what was going on. By all accounts, the website was functioning, but curiously, the rate of orders began to drop off. Through some investigation, they realized what was going on. Customers' order numbers were assigned a random, non-sequential six digit numbers. Dutchie was about to track its one-millionth order, a huge milestone. Before any orders are created, though, the app generates a six digit number, and tries to create one that doesn't already exist. The database was constantly being hit, as less and less six digit numbers were available for use. The solution ended up being rather simple: the order number limit was increased to nine digits. Although they had monitoring in place, the data was set up as an aggregate reporting; even though the "create order" API was slow, all of the others were low, keeping the average within tolerable levels. Christopher's solution to avoid this in the future was to set up more groupings for "essential" API endpoints, to alert the team sooner for latency issues on core business functionality.
Links from this episode118. Why Writing Matters for Engineers
117. Open Source with Jim Jagielski
116. Success From Anywhere
115. Demystifying the User Experience with Performance Monitoring
114. Beyond Root Cause Analysis in Complex Systems
113. Principles of Pragmatic Engineering
112. Managing Public Key Infrastructure within an Enterprise
111. Gift Cards for Small Businesses
110. Scaling a Bernie Meme
109. Meditation for the Curious Skeptic
108. Building Community with the Wicked CoolKit
107. How to Write Seriously Good Software
106. Growing a Self-Funded Company
105. Event Sourcing and CQRS
104. The Evolution of Service Meshes
103. Chaos Engineering
102. Whether or Not to Repeat Yourself: DRY, DAMP, or WET
101. Cloud Native Applications
100. Math for Programmers
Create your
podcast in
minutes
It is Free
Insight Story: Tech Trends Unpacked
Zero-Shot
Fast Forward by Tomorrow Unlocked: Tech past, tech future
Black Wolf Feed (Chapo Premium Feed Bootleg)
Bannon`s War Room