Break Your Own Systems: Chaos Engineering

Use chaos engineering to improve the resilience of your systems by breaking things deliberately in a controlled way

Modern software applications and the infrastructure they run on are becoming increasingly complex. A web of multiple microservices integrated via APIs or message queues, each with its own database. Running inside containers that can spin up and down several times a day depending on the amount of traffic. Deployed on cloud infrastructure, in different geographic regions. All spitting out logs, metrics, and traces, and used by a potentially vast number of users constantly, day and night. And all of this comes with the additional need for product teams to make changes and add new features more frequently, with zero downtime and zero impact on users.

These complex systems can also fail in increasingly complex ways. Applications can have weird and unexpected bugs, API versions can introduce breaking changes, message queues can crash, data can become corrupted, and whole regions of cloud infrastructure can even go offline. Changes made by one team can impact a separate service running somewhere else, owned by another team, with no warning. Massive spikes in traffic around events like Black Friday can overwhelm one part of the system and impact your customer experience. Or something entirely unexpected and seemingly random can happen at any time with catastrophic effects.

So how do you make sense of the madness? How do you avoid a lifetime of sleepless nights worrying that this fragile house of cards is going to collapse at any moment? This is where Chaos Engineering can help.

What Is Chaos Engineering?

The official definition states:

“Chaos Engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions.” — Wikipedia

In other words, Chaos Engineering involves purposefully causing things to fail in production environments, in a controlled way, with the intention of learning how those failures manifest, what their impact is, and whether the system can detect and recover from those failures as expected.

There are a couple of key points to highlight there:

  • Production — True Chaos Engineering involves breaking the production environment, as scary as that might seem at first.
  • Controlled — We are talking about taking a scientific approach to experiments and limiting their impact.
  • Expected — We are validating that what we expect to happen does happen. It’s not just a case of “What does this big red button do?”
  • Learning — The objective should be to further understand our system so we can implement improvements.

Essentially, it’s about accepting that in any sufficiently complex system, something is either already broken or is about to break at any moment. It’s pretty much impossible to prevent every possible failure from occurring, so instead let’s focus our efforts on constantly testing what happens when different failures occur to increase our confidence in our system’s resiliency and ability to handle failure gracefully.

It’s an ongoing journey — not a one-off activity — that should become a key part of your engineering culture and practices.

Less Chaos, More Engineering

When our teams first started venturing into Chaos Engineering, we were somewhat naive in how we approached it.

We would hold what we called “game days,” where a couple of us planned, in secret, to cause some kind of failure and watch the outcome. Our first couple of game days involved, for example, stopping a database instance whilst at the same time hiding a couple of the key engineers with the knowledge to fix it in a room on another floor of the building. We’d then sit back and observe the carnage of the team running around with their hair on fire trying to fix it themselves.

This was skewed too heavily towards the chaos and not enough towards a sensible, controlled engineering methodology. I would not recommend following this approach! It was useful for confirming some of the single points of failure in our team (i.e. only one person knows how to fix this problem, and we can’t cope very well if they ever happen to find themselves locked in a meeting room), but we ultimately learned very little.

So please don’t let the name fool you. It doesn’t mean just causing chaos by pulling cables out of the wall or switching things off. We’re talking about taking an engineering approach to validating and learning about our systems.

Chaos Experiments and Applying Some Science

The correct way to approach Chaos Engineering is to first create a hypothesis that you want to validate, implement an experiment in a controlled way, observe the results, and compare them to what your hypothesis suggested would happen. From there, you can identify any gaps and implement further automation or process changes to mitigate those particular issues in the future.

Let’s run through the different steps involved in running a Chaos Experiment in a bit more detail.

Build a hypothesis

You first need to understand what normal behaviour in the system looks like and what failure you want to test for. Doing this as a team is really effective. Bring together a bunch of software engineers, testers, SREs, and anybody else who can provide insight into how the system operates. Ask yourselves what could go wrong, identify the priority areas of the system to focus on, and list the different failure scenarios that could occur.

Once you’ve identified the failure to target as part of your experiment, your hypothesis should state what the expected behaviour will be when that failure occurs.

You should also ensure you know the metrics and signals to look for to indicate the behaviour of the system. Do you expect an alert to fire somewhere? Do you expect to see a spike in traffic on a particular dashboard?

For example, when a microservice crashes, we expect an alert to fire in our monitoring tool and for the microservice to spin back up again, connecting to the database and message queue with no manual intervention.
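One way to make a hypothesis like this concrete is to write the expected steady state down as a simple check that can be run before, during, and after the experiment. The sketch below is a minimal, hypothetical example in Python; the health endpoint URL and thresholds are placeholders I have invented for illustration, not part of any particular tool.

```python
# A minimal sketch of a hypothesis expressed as a steady-state check.
# The URL and thresholds are hypothetical placeholders, not a prescribed setup.
import requests

STEADY_STATE = {
    "health_url": "https://orders.example.com/health",  # assumed health endpoint
    "max_error_rate": 0.01,                              # no more than 1% failed requests
    "max_recovery_seconds": 120,                         # service should recover within 2 minutes
}

def steady_state_ok() -> bool:
    """Return True if the system currently matches our definition of 'normal'."""
    try:
        resp = requests.get(STEADY_STATE["health_url"], timeout=5)
        return resp.status_code == 200
    except requests.RequestException:
        return False
```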

Simulate the failure and bring the chaos

Run your experiment and simulate the real-world failure you identified. That could mean stopping a database instance or Kubernetes pod, DDoS-ing yourself with high network traffic, adding latency to requests, causing a spike in CPU or memory, and so on.

There are tools available to help inject these failures, or it could be as simple as manually running commands and scripts on a machine.
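As a flavour of what a do-it-yourself injection can look like, here is a minimal sketch that simulates a crash by deleting a Kubernetes pod with the official kubernetes Python client. The namespace and label selector are hypothetical, and a dedicated chaos tool will usually do this more safely.

```python
# Sketch: simulate a crash by deleting one pod that matches a label selector.
# The namespace and label are hypothetical; requires the 'kubernetes' Python
# client and a kubeconfig with permission to delete pods.
import random
from kubernetes import client, config

def kill_random_pod(namespace: str = "orders", label: str = "app=orders-service") -> str:
    config.load_kube_config()  # use load_incluster_config() when running inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label).items
    if not pods:
        raise RuntimeError("No matching pods found - nothing to kill")
    victim = random.choice(pods)
    v1.delete_namespaced_pod(victim.metadata.name, namespace)
    return victim.metadata.name
```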

Observe system behaviour and validate the hypothesis

Now you can validate that what you’re seeing is what you expected to see. Use the metrics and monitoring tools identified as part of your hypothesis.

Remember to also look for any additional unexpected outcomes. Watch for alerts firing in other areas of the system that warrant further investigation and be wary of a cascading failure that has had a wider impact than anticipated.

There are two possible outcomes to your experiment:

  1. The system behaved as expected.
  2. You have disproved your hypothesis and something different happened.
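Either way, recording the result is much easier if the check itself is automated. A hedged sketch, reusing the hypothetical steady_state_ok() check and STEADY_STATE thresholds from the hypothesis example earlier:

```python
# Sketch: poll the steady-state check until the system recovers or the expected
# recovery window passes. Reuses the hypothetical steady_state_ok() and
# STEADY_STATE from the earlier hypothesis sketch.
import time

def validate_hypothesis() -> bool:
    deadline = time.time() + STEADY_STATE["max_recovery_seconds"]
    while time.time() < deadline:
        if steady_state_ok():
            print("Hypothesis held: the system recovered within the expected window")
            return True
        time.sleep(5)
    print("Hypothesis disproved: the system did not recover in time - investigate")
    return False
```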

Remediate any issues in the system

Ensure you follow up and put in place any actions to prevent unexpected failures from recurring in the future. That might mean adding more automated tests into a CI/CD pipeline, scaling up to a cluster of databases rather than a single instance, adding a message queue between two microservices to increase decoupling, or even making changes to your monitoring and observability if you didn’t see an alert that you expected.

Best Practices and Other Considerations

Remember, we’re talking about taking an engineering approach — not deliberately causing genuine uncontrolled chaos. There are a number of ways to ensure your experiments are as safe as possible.

Metrics, monitoring, and observability

Ensure you know how you will detect if something is going wrong. Dashboards, metrics, logs, traces, and alerts need to be in place. Otherwise, you are flying blind and will have no idea how your system is behaving.

Spend time also learning what “normal” looks like in your system before you start the experiment so you don’t misinterpret a signal. For example, the CPU may run at 40% during normal operations, but if you didn’t know that, it might look like quite a high value during your experiment. Building up this kind of baseline knowledge is exactly what good observability practice encourages.
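If Prometheus happens to be part of your stack (purely as an example; substitute your own monitoring tool), capturing that baseline can be a simple query run before the experiment starts. The server URL and PromQL expression below are hypothetical placeholders:

```python
# Sketch: capture a CPU baseline from Prometheus before the experiment starts.
# The server URL and PromQL query are hypothetical placeholders for your own stack.
import requests

PROMETHEUS_URL = "http://prometheus.example.com/api/v1/query"
CPU_QUERY = 'avg(rate(container_cpu_usage_seconds_total{namespace="orders"}[5m]))'

def capture_cpu_baseline() -> float:
    resp = requests.get(PROMETHEUS_URL, params={"query": CPU_QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0
```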

Minimise blast radius

It’s probably best to avoid experiments that take down the whole production environment. Focus the scope of your testing on specific areas of the system and mitigate as much as you can against any cascading failure — or at least be able to quickly detect and recover from it.
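One practical way to bound the blast radius is to build the limits into the injection itself, for example only ever targeting pods that have explicitly opted in, and never more than a fixed fraction of them. A hypothetical guard that could sit in front of the pod-killing sketch above:

```python
# Sketch: guard rails that bound the blast radius of a pod-killing experiment.
# The opt-in label and 25% cap are hypothetical policy choices, not a standard.
OPT_IN_LABEL = "chaos=enabled"   # only list pods carrying this label when selecting targets
MAX_FRACTION = 0.25              # never touch more than a quarter of the opted-in pods

def select_targets(pods: list) -> list:
    """Pick at most MAX_FRACTION of the pods, and always leave at least one running."""
    if len(pods) < 2:
        return []  # refuse to act if it would take the service down completely
    limit = max(1, int(len(pods) * MAX_FRACTION))
    return pods[:min(limit, len(pods) - 1)]
```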

Have an escape plan

Make sure you always have a way of breaking out of the experiment if you need to, and that you can immediately roll back any changes to recover to a working state. Automated tools can help here again, as they usually have a way of immediately ceasing any failure injection and allowing the system to recover.
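Abort conditions are one way to build that escape hatch in from the start: if a key metric breaches a limit you did not expect, the experiment halts itself and the injection is rolled back. A hypothetical sketch, where inject(), rollback(), and error_rate() stand in for whatever your tooling provides:

```python
# Sketch: a watchdog that halts a chaos experiment early if an abort condition
# is hit. inject(), rollback(), and error_rate() are hypothetical hooks onto
# whatever you use to start/stop the failure and read your key metric.
import time

def run_with_escape_hatch(inject, rollback, error_rate,
                          duration_s=300, abort_threshold=0.05, poll_s=10):
    inject()
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            if error_rate() > abort_threshold:
                print("Abort condition hit - stopping the experiment early")
                break
            time.sleep(poll_s)
    finally:
        rollback()  # always stop the injection and let the system recover
```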

Run experiments in production environments

This often sounds scary to teams who are new to Chaos Engineering, but it’s the right thing to do. Production is where your customers live, and it’s where you want to test for resiliency and reliability.

Production environments are always unique, however much you use Infrastructure-as-Code or have copies of databases. There is more traffic, different usage patterns, and other subtle differences you never even considered.

Of course, it makes a lot of sense to practise Chaos Engineering in a non-production environment first rather than diving straight in, if that helps to give your team (and your manager!) more confidence.

Automate chaos experiments and run them continuously

When you’ve run a successful experiment, look to automate it to run more frequently. This allows for continuous verification of any mitigations you put in place and helps detect any recurrence of the issues you previously observed.

Some teams even have this automated chaos happening continuously, at random, in their production environment. Netflix famously has tools in place that can kill microservices and even whole regions of cloud infrastructure randomly during certain time windows — the ultimate sign of confidence in your resiliency and ability to recover from failures.
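A hedged sketch of what that continuous, scheduled chaos might look like, reusing the hypothetical kill_random_pod() helper from earlier and restricting it to working hours so people are around to respond:

```python
# Sketch: run the hypothetical kill_random_pod() on a loose schedule, only
# during working hours when people are available to respond. The time window
# and interval are illustrative choices, not a recommendation.
import random
import time
from datetime import datetime

def continuous_chaos(interval_s: int = 3600):
    while True:
        now = datetime.now()
        if now.weekday() < 5 and 9 <= now.hour < 16:  # weekdays, 09:00-16:00 only
            victim = kill_random_pod()
            print(f"{now.isoformat()} - killed pod {victim}")
        time.sleep(interval_s + random.randint(0, 600))  # add jitter so it is not predictable
```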

Tools and Technologies

With interest and maturity in this space growing, there is an increasingly large array of tools and technologies that can help support you in implementing Chaos Engineering.

For premium offerings, consider Gremlin, which provides a full suite of tools for injecting failures into your systems and validating hypotheses. Or if you’re running on AWS, they now even offer “Chaos Engineering as a Service” through their Fault Injection Simulator.

If you’d prefer to go down the open-source route, then there is always the well-known Chaos Monkey developed by Netflix. Or take a look at Litmus, which is a sandbox project under the Cloud Native Computing Foundation (CNCF).

There are even some slightly wackier options if you want to make Chaos Engineering even more fun than it already is! KubeInvaders is a Space Invaders-inspired interface for killing Kubernetes pods. Or there’s Kube Doom, which does the same thing with the classic video game Doom.


Conclusion

Chaos Engineering is a great way to test and improve the reliability and resiliency of your systems, and ultimately enhance the experience of your customers.

As systems grow in complexity and change more frequently, so must our approach to protecting against failures and their adverse effects evolve.

Approach it methodically, run experiments frequently, reap the benefits of increased confidence in your systems, and have fun!
