Chaos Engineering 🐒️

Posted on 07 Mar, 2021

Chaos engineering is the practice of studying a system in order to build confidence in its ability to withstand harsh conditions. This is done by deliberately breaking (experimenting on) the system.

Although chaos engineering may seem useful only to large-scale distributed operations, its principles can be applied to smaller systems as well to make them more resilient.

These "experiments" follow a few steps:

  1. Define the steady state of the system, that is, measurable output that indicates normal or expected behaviour (e.g., latency, throughput).

  2. Segregate environments for experimenting

    • Normal Group (Main production app)

    • Experimental Group

  3. Introduce variables that reflect real world events (server crash, hard-drive malfunction, sending large payload data, limited network connection).

The harder it is to break the steady state, the more confidence we have in the behaviour of our system.
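The steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: `call_backend`, `handle_request`, the 20% injected failure rate, and the 99% success threshold are all hypothetical and stand in for a real service and its real steady-state metric.

```python
import random


def call_backend(rng, failure_rate):
    """Hypothetical backend call that fails with the given probability."""
    if rng.random() < failure_rate:
        raise ConnectionError("injected fault")
    return "ok"


def handle_request(rng, failure_rate, retries=3):
    """Hypothetical service under test: retries the backend a few times."""
    for _ in range(retries + 1):
        try:
            call_backend(rng, failure_rate)
            return True
        except ConnectionError:
            continue
    return False


def success_rate(rng, failure_rate, requests=10_000):
    """Steady-state metric: fraction of requests that succeed."""
    ok = sum(handle_request(rng, failure_rate) for _ in range(requests))
    return ok / requests


def run_experiment(seed=42, injected_failure_rate=0.2, threshold=0.99):
    """Compare a control group (no faults) with an experimental group."""
    rng = random.Random(seed)
    control = success_rate(rng, failure_rate=0.0)
    experimental = success_rate(rng, failure_rate=injected_failure_rate)
    return {
        "control": control,
        "experimental": experimental,
        # The hypothesis holds if the metric stays above the threshold
        # even while faults are being injected.
        "steady_state_held": experimental >= threshold,
    }


if __name__ == "__main__":
    print(run_experiment())
```

If `steady_state_held` comes back false (for example with a much higher injected failure rate), the experiment has found a weakness worth fixing before it shows up on its own.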

Advanced principles:

  1. Run experiments in prod 💣️: Systems behave differently depending on the environment, and since an experimental group may not have exactly the same usage metrics, it becomes necessary to experiment in production.

  2. Automate experiments to run continuously: Running experiments manually can be time-consuming & costly.

  3. Minimize blast radius: Experimenting in production has the potential to cause unnecessary customer pain, so the fallout from each experiment should be kept small and contained.
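One way to keep the blast radius small is to target only a fraction of instances and to stop as soon as a health metric degrades. The sketch below assumes hypothetical `inject`, `rollback`, and `error_rate` callables supplied by the caller; it is not taken from any of the tools listed in this post.

```python
import random


def pick_targets(instances, fraction, rng, max_targets=None):
    """Select a small random subset of instances to limit the blast radius."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not instances:
        return []
    # Always pick at least one target, but never more than the cap.
    count = max(1, int(len(instances) * fraction))
    if max_targets is not None:
        count = min(count, max_targets)
    return rng.sample(instances, count)


def run_with_abort(targets, inject, rollback, error_rate, abort_threshold):
    """Inject faults one target at a time and abort if errors climb too high."""
    injected = []
    for target in targets:
        inject(target)
        injected.append(target)
        if error_rate() > abort_threshold:
            # Undo every injected fault, most recent first.
            for done in reversed(injected):
                rollback(done)
            return {"aborted": True, "injected": injected}
    return {"aborted": False, "injected": injected}
```

A function like `run_with_abort` could be called from a scheduler to cover the automation principle as well; the abort check is what makes unattended runs tolerable.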

Cons

  • Low-level fault injection: It is difficult to simulate high-level fault types (e.g., failure codes, exceptions of a particular type).

  • Probabilistic: The random nature of the approach provides few guarantees about an application's tolerance to failure.
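For contrast, a high-level fault can be injected deterministically at the call site, which is the kind of fault that instance-level chaos has trouble producing. The sketch below uses Python's standard `unittest.mock` to force one specific exception type; `PaymentClient` and `checkout` are hypothetical names made up for this example and do not represent the API of any tool mentioned here.

```python
from unittest import mock


class PaymentClient:
    """Hypothetical downstream client."""

    def charge(self, amount):
        return {"status": 200, "charged": amount}


def checkout(client, amount):
    """Hypothetical caller that should degrade gracefully on a timeout."""
    try:
        return client.charge(amount)
    except TimeoutError:
        return {"status": 503, "charged": 0, "retry_later": True}


def checkout_under_fault(exc):
    """Run checkout while the downstream call raises the given exception."""
    client = PaymentClient()
    # side_effect makes the patched method raise exc on every call.
    with mock.patch.object(client, "charge", side_effect=exc):
        return checkout(client, 25)
```

Because the fault is chosen rather than random, each failure type can be tested on every run, at the cost of having to enumerate the faults that matter.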

Tools

  • Netflix's chaosmonkey is a resiliency tool that helps applications tolerate random instance failures.

  • pumba is a chaos testing, network emulation and stress testing tool for containers.

  • Chaos Mesh is a chaos engineering platform hosted under the CNCF.

  • Filibuster is a resiliency testing tool based on Service-level Fault Injection Testing.
