Chaos Engineering

Last week I had the pleasure of reading Chaos Engineering: Building Confidence in System Behavior through Experiments, a book written by a few engineers involved in Chaos Engineering at Netflix. As a software engineer doing similar work in an enterprise software context, I am very thankful to the authors of this book for content that helped me think through formulating a strategy to ensure the reliability of our product.

In chaos theory, the butterfly effect is the sensitive dependence on initial conditions in which a small change in one state of a deterministic nonlinear system can result in large differences in a later state.

In this blog post I will summarize a simple process for adopting Chaos Engineering.

Under Agile development methodologies, we follow different practices to ensure that our applications do what they are supposed to do. This could start at unit-level testing with a Test Driven Development framework, or could be scaled up to component-level integration testing. In all these cases, we are testing what we know about our application, or what we expect from it. Once we deploy our applications to production, our support teams have to battle with a different set of problems.

Chaos engineering is a discipline that allows us to understand our system's behavior in the production environment by simulating faults in it. The Principles of Chaos Engineering define it as:

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Here are the steps involved in designing experiments:

  1. Create a hypothesis
  2. Define the scope of experiments
  3. Identify the metrics
  4. Notify the organization
  5. Run the experiments
  6. Analyze the results
  7. Increase the scope
  8. Automate

Create a hypothesis

Failures can happen for various reasons: hardware failures, functional bugs, network latency or communication barriers, inconsistent state transitions, and so on. What is important at this stage is to select an impactful event that can change the system. Let's say we have observed that API traffic from one region is increasing; we could test our load-balancing functionality.
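As an illustrative sketch (the structure and field names here are my own, not from any particular chaos tooling), a hypothesis for the load-balancing example above could be written down as a small, explicit data structure so the team agrees on what is being tested:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """A falsifiable statement about steady-state behavior under a fault."""
    steady_state: str            # what "normal" looks like for the system
    fault: str                   # the impactful event we will inject
    expectation: str             # what we believe will happen
    metrics: list = field(default_factory=list)  # how we will measure it

# Hypothesis for the load-balancing example above
lb_hypothesis = Hypothesis(
    steady_state="p99 latency stays below 300 ms in every region",
    fault="double the request rate coming from one region",
    expectation="the load balancer spreads the extra traffic evenly",
    metrics=["requests per server", "p99 response time"],
)
```

Writing the hypothesis down like this forces it to be concrete and measurable before any experiment runs.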

Define the scope of the experiments

It would be great if we could run experiments on our hypothesis in production, but at first we could choose a less impactful environment and gradually move towards production as our confidence in the experiments grows over time.

Identify the metrics

Once the hypothesis and scope are defined, we can decide what metrics we are going to use to evaluate the outcome. In the load-balancing scenario, the evenness of traffic distribution across servers or the time taken to return a response to the client could be used.
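As a rough sketch (the function and numbers are illustrative, not a standard metric definition), the evenness of traffic distribution can be quantified with, for example, the coefficient of variation of per-server request counts:

```python
from statistics import mean, pstdev

def traffic_imbalance(requests_per_server):
    """Coefficient of variation of per-server request counts.
    0.0 means perfectly even traffic; larger values mean more skew."""
    return pstdev(requests_per_server) / mean(requests_per_server)

# Even traffic across four servers -> no imbalance
print(traffic_imbalance([250, 250, 250, 250]))            # 0.0

# One server taking most of the load -> high imbalance
print(round(traffic_imbalance([700, 100, 100, 100]), 2))  # 1.04
```

A single number like this is easy to alert on during an experiment, which matters for the abort mechanism discussed below.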

Notify the organization

It is necessary to keep all stakeholders informed about the experiments and to take their input on how the experiments should be designed in order to get maximum insight.

Run the experiments

Lights, Camera, Action! Now we can run the experiments, but at this point it is necessary to keep an eye on the metrics. If the experiments are causing harm to the system, they must be aborted, and a mechanism for that should be in place before we start.
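A minimal sketch of such an abort mechanism (the function names and thresholds are hypothetical, not from any real chaos framework) could look like this: inject the fault, then watch a metric and stop as soon as it crosses a safety threshold:

```python
def run_experiment(inject_fault, read_metric, threshold, max_steps=10):
    """Run an experiment step by step, aborting if the watched metric
    crosses the safety threshold at any point."""
    inject_fault()
    for step in range(max_steps):
        value = read_metric()
        if value > threshold:
            # Blast radius is too large: stop the experiment immediately
            return ("aborted", step)
    return ("completed", max_steps)

# Toy example: the error rate stays safe, so the experiment completes
status, _ = run_experiment(
    inject_fault=lambda: None,   # stand-in for a real fault injection
    read_metric=lambda: 0.01,    # observed 1% error rate
    threshold=0.05,              # abort above 5%
)
print(status)  # completed
```

In a real setup, the abort path would also revert the injected fault; the point is that the kill switch is defined before the experiment starts, not improvised during it.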

Analyze the results

Once the results are available, we can validate the correctness of our hypothesis and communicate the results to the relevant teams. If the problem is with load balancing, maybe the network infrastructure team has to work a bit more on load balancing across the system.

Increase the scope

Once we have grown our confidence by experimenting on smaller-scale problems, we can start extending the scope of the experiments. Increasing the scope can reveal a different set of systemic problems. For example, failures in load balancing can cause timeouts and inconsistent states in different services, which could cause our system to fall apart at peak times.

Automate

Don't repeat yourself as you gain confidence in your experiments. Start automating what you have already experimented with, and look for other areas in which to build confidence.

Finally, a question that naturally comes to mind is: how wise is it to shut down or disrupt your system in production? Well, Chaos Engineering is certainly not playing around with your system. It is based on the same empirical process that is used to test new drugs, so whatever work we do here is for the betterment of our own products.
