Chaos Engineering 🐒️
Posted on 07 Mar, 2021
Chaos engineering is the practice of studying a system to build confidence in its ability to withstand harsh conditions. This is done by deliberately breaking the system in controlled experiments.
Although chaos engineering may seem to benefit only large-scale distributed systems, its principles can also be applied to smaller systems to make them more robust.
- 1. Define the steady state of the system that specifies normal or expected behaviour (e.g. latency, throughput).
- 2. Segregate environments for experimenting:
  - Normal Group (main production app)
  - Experimental Group
- 3. Introduce variables that reflect real-world events (server crash, hard-drive malfunction, large payloads, limited network connectivity).
The harder it is to disrupt the steady state, the more confidence we have in the behaviour of our system.
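The three steps above can be sketched as a small simulation. This is only an illustrative sketch: the `SteadyState` thresholds, the simulated latencies, and the helper names `simulate_requests` and `within_steady_state` are all assumptions made for the example, not part of any real tool.

```python
import random
from dataclasses import dataclass


@dataclass
class SteadyState:
    # Illustrative thresholds, not values taken from any real system.
    max_p95_latency_ms: float
    min_success_rate: float


def simulate_requests(n, fault_rate, rng):
    """Return n simulated (latency_ms, succeeded) samples.

    fault_rate is the fraction of requests hit by an injected fault
    (e.g. a crashed server); a faulty request is slow and fails.
    """
    samples = []
    for _ in range(n):
        faulty = rng.random() < fault_rate
        latency = max(rng.gauss(50, 5), 0.0) + (200.0 if faulty else 0.0)
        samples.append((latency, not faulty))
    return samples


def within_steady_state(samples, steady):
    """Check p95 latency and success rate against the steady-state definition."""
    latencies = sorted(latency for latency, _ in samples)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    success_rate = sum(ok for _, ok in samples) / len(samples)
    return (p95 <= steady.max_p95_latency_ms
            and success_rate >= steady.min_success_rate)


if __name__ == "__main__":
    rng = random.Random(42)
    steady = SteadyState(max_p95_latency_ms=100.0, min_success_rate=0.99)
    # Step 2: one group left alone, one group that receives the fault.
    normal = simulate_requests(1000, fault_rate=0.0, rng=rng)
    experimental = simulate_requests(1000, fault_rate=0.3, rng=rng)
    print("normal group ok:", within_steady_state(normal, steady))
    print("experimental group ok:", within_steady_state(experimental, steady))
```

In a real experiment the samples would come from monitoring data rather than a random generator; the point is only that the steady state is written down as a measurable check before any fault is introduced.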
- 1. Run experiments in prod 💣️: Systems behave differently depending on the environment, and since an experimental group may not have exactly the same usage metrics, it becomes necessary to experiment in production.
- 2. Automate experiments to run continuously: Running experiments manually can be time-consuming and costly.
- 3. Minimize blast radius: Experimenting in production has the potential to cause unnecessary customer pain, so the fallout from each experiment should be kept small and contained.
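One way to picture automation combined with a limited blast radius is an experiment runner that touches only a small fraction of instances and aborts as soon as the steady state is violated. The sketch below is hypothetical: `pick_experiment_targets`, `run_experiment`, and the callbacks `inject_fault`, `rollback`, and `steady_state_ok` are names invented for illustration, and a real setup would wire them to actual infrastructure and monitoring.

```python
import random


def pick_experiment_targets(instances, blast_radius, rng):
    """Pick a random subset covering at most `blast_radius` of the instances.

    blast_radius is a fraction in (0, 1]; at least one instance is chosen
    when any exist, so the experiment is never empty.
    """
    if not 0 < blast_radius <= 1:
        raise ValueError("blast_radius must be in (0, 1]")
    if not instances:
        return []
    count = max(1, int(len(instances) * blast_radius))
    return rng.sample(instances, count)


def run_experiment(targets, inject_fault, rollback, steady_state_ok):
    """Inject faults one target at a time, aborting on the first violation.

    Returns True if every fault was injected without breaking the steady
    state, False if the experiment was aborted. Injected faults are always
    rolled back, in reverse order, even if an exception is raised.
    """
    injected = []
    try:
        for target in targets:
            inject_fault(target)
            injected.append(target)
            if not steady_state_ok():
                return False  # stop widening the blast radius
        return True
    finally:
        for target in reversed(injected):
            rollback(target)
```

Running `run_experiment` on a schedule (for example from a cron job or CI pipeline) would cover the "run continuously" principle; the early return keeps customer impact bounded to the targets already touched.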
- Low-level fault injection: It is difficult to simulate high-level fault types (e.g. failure codes, exceptions of a particular type).
- Probabilistic: The random nature of the approach provides few guarantees about the application's tolerance to failure.
- Netflix's Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures.
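To make the "probabilistic" limitation above concrete, here is a minimal sketch of uniform random victim selection. It is not Chaos Monkey's actual implementation; `pick_victim` and `probability_never_selected` are hypothetical helpers, and the model assumes one instance is picked uniformly and independently per round.

```python
import random


def pick_victim(instances, rng):
    """Choose one instance uniformly at random to terminate.

    Hypothetical: a real tool would then call the cloud provider's API.
    """
    if not instances:
        raise ValueError("no instances to choose from")
    return rng.choice(instances)


def probability_never_selected(num_instances, rounds):
    """Chance that one specific instance is never picked in `rounds` rounds.

    Each round picks one of `num_instances` uniformly and independently,
    so the chance of being missed every time is (1 - 1/n) ** rounds.
    """
    if num_instances < 1 or rounds < 0:
        raise ValueError("need num_instances >= 1 and rounds >= 0")
    return (1 - 1 / num_instances) ** rounds
```

Under this model the probability of missing a given instance shrinks as rounds accumulate but is never zero for more than one instance, which is why random experiments build confidence rather than prove tolerance.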