Chaos Engineering

2020-05-25

Chaos Engineering is a set of techniques to build a more proactive and resilient production environment and organization.

In this episode of the DevOps Dojo, I introduce you to the basics of Chaos Engineering.

Sources

https://principlesofchaos.org/

https://github.com/Netflix/chaosmonkey

https://chaostoolkit.org/

Transcript

Chaos Engineering. The art and practice of randomly introducing failures into our production environment. It seems so counterintuitive that we intentionally break our infrastructure, but Chaos Engineering is a structured and responsible way of approaching exactly this.

Modern distributed systems are so complex that it is difficult to say anything about their behaviour in normal circumstances. This means it is nigh impossible to predict how they will behave when pushed to their limits in a hostile environment. Chaos Engineering allows us to probe our running systems, in order to build confidence in their performance and resilience. I am Johan Abildskov. Join me in the dojo to learn.

The first time I heard about Chaos Engineering was when I learned about the tool Chaos Monkey. Chaos Monkey is a tool created by Netflix that they run in their production environment, randomly restarting systems and servers. My first reaction was awe. How was this a sane thing to do? It just sounds so wrong. However, I have come to see that this is the only way that you can keep improving your systems in a structured way. Rather than having the unforeseen happen because of user behaviour, you inject tension in the system to discover weaknesses before they cause user-facing problems.

This is the natural continuation of the lean practice of artificially introducing tension in your production lines to continuously optimize productivity.

There are two types of companies. There are those that are only able to react, whether it is to the market or to the competition. And there are those that are able to disrupt themselves and be proactive in the market.

My money is on the success of the proactive organizations.

But back to Chaos Engineering. To me, the simplest way of explaining Chaos Engineering is that it is applying the scientific method to our production systems. We formulate hypotheses and conduct Chaos Experiments to investigate them.

A chaos experiment has four parts.

First, we define a steady state as some measurable condition of a system that means it is working as normal. This could be the distribution of failure rates, or the response time at some percentile of the requests. It is key that we have something measurable.

Second, we hypothesize that our system will maintain this steady state under some hostile conditions. This can be server restarts, disk failures or degraded performance in network or in services that we depend on.

Third, we have the execution of the experiment, done with tools like Chaos Monkey or Chaos Toolkit. Here we run the experiment trying to disprove our hypothesis.

Fourth, we collect data and analyse the results. The harder it is to disprove our hypothesis, the more confidence we have in the behaviour of our complex distributed system.

So the four components of a Chaos Experiment are the steady state, our hypothesis, the execution and the analysis.
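The four parts can be sketched in a few lines of Python. This is a minimal illustration, not how Chaos Monkey or Chaos Toolkit work internally: `run_experiment`, `ExperimentResult` and `FakeService` are hypothetical names made up for this example, and the steady state is assumed to be "the 95th percentile latency stays at or below a threshold".

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    baseline_ok: bool          # was the system in steady state before we started?
    hypothesis_held: bool      # did the steady state survive the hostile condition?
    observations: List[float]  # latencies collected for the analysis


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list (pct between 1 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def run_experiment(
    sample_latency_ms: Callable[[], float],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    threshold_ms: float,
    samples: int = 100,
) -> ExperimentResult:
    # 1. Steady state: something measurable that means "working as normal".
    def steady(latencies: List[float]) -> bool:
        return percentile(latencies, 95) <= threshold_ms

    baseline = [sample_latency_ms() for _ in range(samples)]
    if not steady(baseline):
        # No point in experimenting on a system that is already unhealthy.
        return ExperimentResult(False, False, baseline)

    # 2. Hypothesis: the steady state still holds while the fault is active.
    # 3. Execution: inject the fault, observe, and always roll back.
    inject_fault()
    try:
        observed = [sample_latency_ms() for _ in range(samples)]
    finally:
        rollback()

    # 4. Analysis: was the hypothesis disproved?
    return ExperimentResult(True, steady(observed), observed)


class FakeService:
    """Hypothetical stand-in for a real system, used only to run the sketch."""

    def __init__(self, degraded_latency_ms: float) -> None:
        self.degraded = False
        self.degraded_latency_ms = degraded_latency_ms

    def sample_latency_ms(self) -> float:
        return self.degraded_latency_ms if self.degraded else 20.0

    def inject_fault(self) -> None:
        self.degraded = True

    def rollback(self) -> None:
        self.degraded = False
```

In a real experiment the three callables would wrap your monitoring and your fault injection tooling. The `try`/`finally` is there so that the system is restored even if the measurement itself fails.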

But wait, you’ll say. You will tell me I am crazy, that you can’t just do such horrible things to your production environment. And you’ll likely be right. Just as we need to build confidence in our production systems, we need to build confidence in our ability to conduct chaos experiments.

If you are completely new to Chaos Engineering, and perhaps are not even practising disaster recovery, then disaster recovery training would be a really good place to start. This helps build our muscle and gets us used to manipulating environments. These trainings are commonly called game days. In some circles, they practice the “wheel of misfortune”, where you randomly select an incident and walk through that.

My best suggestion for a starting disaster recovery training is to test whether you are able to restore your backups to a testing system. It is both a healthy exercise and something that you should be practising regularly.
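One way to check such a restore is to compare checksums between the original data and what ended up on the testing system. The sketch below assumes both are reachable as directories on the same machine. `checksums` and `restore_mismatches` are hypothetical helper names, and a real backup tool may offer its own verification.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def checksums(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def restore_mismatches(source: Path, restored: Path) -> List[str]:
    """Relative paths that are missing, extra, or different after the restore."""
    want, got = checksums(source), checksums(restored)
    return sorted(k for k in want.keys() | got.keys() if want.get(k) != got.get(k))
```

An empty list means every file came back byte for byte. Anything else is a finding for your game day.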

But let’s say that you are ready to start your chaos engineering journey. I still do not recommend that you just set Chaos Monkey loose in your production environment, and have it crash servers, and see what happens. Remember, Chaos Engineering is a cool name for a boring practice.

Your first chaos experiments can be run in a completely segregated environment, ideally provisioned for this experiment only. It is likely that you will uncover many findings in even such a small environment. Keep repeating your experiment and raising the resilience of your applications. You should see that your service is already becoming more resilient in your production environments as a consequence of these experiments and remediations.

When you become more confident in your experimentation, you escalate the chaos experiment to be running in your staging environment. This should lead to more learning and more confidence in your capabilities. At this point, it becomes obvious that you must be able to abort a running chaos experiment, should it exceed some predefined boundary conditions.
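The abort mechanism can be as simple as watching one metric while the fault is active and rolling back the moment it crosses a limit. This is a minimal sketch under that assumption. `run_with_abort` is a hypothetical name, and the error rates stand in for whatever your monitoring reports.

```python
from typing import Callable, Iterable


def run_with_abort(
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    error_rates: Iterable[float],
    abort_threshold: float,
) -> bool:
    """Run an experiment; return True if it completed, False if it was aborted."""
    inject_fault()
    try:
        for rate in error_rates:
            # Boundary condition: stop as soon as the limit is exceeded.
            if rate > abort_threshold:
                return False
        return True
    finally:
        # The fault is removed whether we finished, aborted, or crashed.
        rollback()
```

Passing the measurements as an iterable means the same function works with a fixed list in a test and with a generator that polls a live system.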

When you are ready, you can run your supervised chaos experiments in production. This is often scary, but it is where the learning is maximized and the uncertainty is greatest. This will again lead to a new set of discovered weaknesses and remedies applied.

You truly have graduated in your Chaos Engineering when you are able to take your supervised Chaos Experiments, and run them unsupervised automatically in your production environments.

When the first Chaos Experiment has been successfully completed, it is time to build your hypothesis backlog and continue your experiments in order to introduce increasing tension in your systems, and out-learn your competition.

Chaos Engineering helps us build a culture where failure is considered normal. A culture where we accept that any sufficiently complex system at all times will be running in a degraded but functional state. A culture where we don’t ask what happens if this fails, but rather what happens when this fails.

This has been the DevOpsDojo on Chaos Engineering. You can follow me on Twitter @randomsort. You can find show notes and more at dojo.fm. Support the show by leaving a review, sharing this episode with a friend or colleague, or subscribing to the DevOpsDojo on your favourite podcast platform. Until next time, keep learning. Thank you for listening.