Last week I had the pleasure of reading “Chaos Engineering: Building Confidence in System Behavior through Experiments”, a book written by engineers involved in Chaos Engineering at Netflix. As a software engineer doing similar work in an enterprise software context, I am grateful to the authors of this book for content that helped me think about formulating a strategy to ensure the reliability of our product.
In this blog post I will summarize a simple process for adopting Chaos Engineering.
In the spirit of Agile development methodologies, we follow different practices to ensure that our applications do what they are supposed to do. This could start with unit-level testing in a Test Driven Development framework and scale up to component-level integration testing. In all these cases we are testing what we know about our application, or what we expect from it. Once we deploy our applications to production, our support teams have to battle a different set of problems.
Chaos Engineering is a discipline that allows us to understand our system’s behavior in the production environment by simulating faults in it. “Principles of Chaos Engineering” defines it as:
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Here are the steps involved in designing experiments:
- Create a hypothesis
- Define the scope of experiments
- Identify the metrics
- Notify the organization
- Run the experiments
- Analyze the results
- Increase the scope
Create a hypothesis
Failures can happen for various reasons: hardware failures, functional bugs, network latency or communication barriers, inconsistent state transitions, and so on. What is important at this stage is to select an impactful event that can change the system. Let’s say we have observed that traffic to one region of our APIs is increasing; we could test our load balancing functionality.
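It helps to write the hypothesis down as data before touching anything. Here is a minimal sketch for the load balancing example, using hypothetical field names (tools such as Chaos Toolkit describe experiments with a broadly similar structure):

```python
# A chaos hypothesis written down as plain data. All field values here are
# illustrative assumptions, not output from any real system.
hypothesis = {
    "title": "Load balancer spreads regional traffic evenly",
    "steady_state": {
        "metric": "requests_per_server",
        "tolerance": "no server handles more than 40% of total traffic",
    },
    "method": "inject a 2x traffic spike from the busiest region",
    "rollback": "stop the synthetic traffic generator",
}

for key, value in hypothesis.items():
    print(f"{key}: {value}")
```

Writing the steady state, the injected event, and the rollback in one place forces us to decide up front what “normal” looks like and how we get back to it.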
Define the scope of the experiments
It would be great if we could run experiments on our hypothesis in production, but at first we can choose a less impactful environment and gradually move towards production as confidence in our experiments grows over time.
Identify the metrics
Once the hypothesis and scope are defined, we can decide which metrics we will use to evaluate the outcome. In the load balancing scenario, we could use the distribution of traffic across the servers or the time taken to return a response to the client.
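The “equal distribution” metric can be made precise in a few lines. This is a sketch with made-up function names, assuming we have a log of which server handled each request:

```python
from collections import Counter

def traffic_shares(request_log):
    """Fraction of requests handled by each server."""
    counts = Counter(request_log)
    total = sum(counts.values())
    return {server: count / total for server, count in counts.items()}

def is_balanced(shares, tolerance=0.1):
    """True if every server's share is within `tolerance` of an even split."""
    even = 1 / len(shares)
    return all(abs(share - even) <= tolerance for share in shares.values())

# Toy log: three servers, each handling a third of the requests.
log = ["a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c"]
print(is_balanced(traffic_shares(log)))  # → True
```

Having the metric as a function with an explicit tolerance also gives us a ready-made pass/fail check for the experiment later.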
Notify the organization
It is necessary to keep all stakeholders informed about the experiments and to take their input on how the experiments should be designed in order to get maximum insight.
Run the experiments
Lights, Camera, Action! Now we can run the experiments, but at this point it is necessary to keep an eye on the metrics. If the experiments are causing harm to the system, they must be aborted, and a mechanism for that should be in place.
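The abort mechanism can be sketched as a kill switch that watches a metric while the experiment runs. In this simplified sketch the error-rate readings are passed in as a list and the threshold is an assumed value; in a real setup the readings would come from your monitoring system and aborting would also trigger a rollback:

```python
# Assumed threshold for this sketch: abort if more than 5% of requests fail.
ERROR_RATE_ABORT_THRESHOLD = 0.05

def run_experiment(error_rate_samples, threshold=ERROR_RATE_ABORT_THRESHOLD):
    """Step through error-rate readings, aborting as soon as one crosses the threshold."""
    for step, rate in enumerate(error_rate_samples):
        if rate > threshold:
            # In a real experiment: stop the fault injection and roll back here.
            return f"aborted at step {step}"
    return "completed"

print(run_experiment([0.01, 0.02, 0.09, 0.01]))  # → aborted at step 2
print(run_experiment([0.01, 0.02, 0.03]))        # → completed
```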
Analyze the results
Once the results are available, we can validate the correctness of the hypothesis and communicate the results to the relevant teams. If the problem is with load balancing, maybe the network infrastructure team has to work a bit more on load balancing across the system.
Increase the scope
Once we have grown confident experimenting on smaller-scale problems, we can start extending the scope of the experiments. Increasing the scope can reveal a different set of systemic problems. For example, failures in load balancing can cause timeouts and inconsistent states in different services, which could cause our system to fall apart at peak times.
Don’t repeat yourself as you gain confidence in your experiments. Start automating what you have already experimented with, and look for other areas in which to build confidence.
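One way to automate verified experiments is to collect them in a registry that a scheduler runs periodically. A sketch with hypothetical experiment names:

```python
# Registry of experiments that have already been verified manually.
automated_experiments = []

def register(experiment):
    """Add a manually verified experiment to the automated suite."""
    automated_experiments.append(experiment)
    return experiment

@register
def kill_one_api_instance():
    """Placeholder: terminate a single API instance and watch recovery."""

@register
def inject_regional_traffic_spike():
    """Placeholder: replay a 2x traffic spike from one region."""

def nightly_run():
    """What a scheduled job would execute; here it just lists the suite."""
    return [experiment.__name__ for experiment in automated_experiments]

print(nightly_run())  # → ['kill_one_api_instance', 'inject_regional_traffic_spike']
```

The point of the registry is that once an experiment has paid for itself manually, adding it to the automated suite is a one-line change.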
Finally, a question that naturally comes to mind: how good an idea is it to shut down or play around with your system in production? Well, Chaos Engineering is certainly not playing around with your system. It is based on the same empirical process that is used to test new drugs, so whatever work we are doing here is for the betterment of our own products.