Define the steady state of the system, i.e. the measurable output that indicates normal or expected behaviour (e.g. latency, throughput).
Segregate environments for experimenting
Normal Group (Main production app)
Introduce variables that reflect real world events (server crash, hard-drive malfunction, sending large payload data, limited network connection).
The harder it is to break the steady state, the more confidence we have in the behaviour of our system.
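The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SteadyState` thresholds, the `simulate_requests` helper and the injected 500 ms delay are all hypothetical stand-ins for real traffic and real fault injection.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Hypothetical thresholds that describe 'normal' behaviour."""
    max_p95_latency_ms: float
    max_error_rate: float


def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]


def within_steady_state(latencies_ms, failures, steady):
    """True if both latency and error rate stay inside the thresholds."""
    error_rate = sum(failures) / len(failures)
    return (p95(latencies_ms) <= steady.max_p95_latency_ms
            and error_rate <= steady.max_error_rate)


def simulate_requests(n, rng, fault_rate=0.0):
    """Simulated traffic (assumption): a fault adds delay and may fail the request."""
    latencies, failures = [], []
    for _ in range(n):
        latency = rng.uniform(20, 80)
        failed = False
        if rng.random() < fault_rate:
            latency += 500  # injected delay, e.g. a limited network connection
            failed = rng.random() < 0.5
        latencies.append(latency)
        failures.append(failed)
    return latencies, failures


if __name__ == "__main__":
    rng = random.Random(42)
    steady = SteadyState(max_p95_latency_ms=100.0, max_error_rate=0.01)
    control = simulate_requests(1000, rng)
    experiment = simulate_requests(1000, rng, fault_rate=0.5)
    print("control within steady state:", within_steady_state(*control, steady))
    print("experiment within steady state:", within_steady_state(*experiment, steady))
```

In a real experiment the two groups would be fed by production metrics rather than a simulator, and a difference between them is what disproves the steady-state hypothesis.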
Run Experiments in prod 💣️:
Systems behave differently depending on the environment, and since an experimental group may not have exactly the same usage metrics as production, it becomes necessary to experiment in production.
Automate experiments to run continuously:
Running experiments manually can be time-consuming & costly.
Minimize Blast Radius:
Experimenting in production has the potential to cause unnecessary customer pain.
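One way to combine automation with a small blast radius is sketched below. It is a minimal illustration, not the method of any particular tool: `run_experiment`, `choose_targets` and the `inject`, `rollback` and `is_healthy` callbacks are hypothetical names, and the caller is assumed to supply the real fault injection and health check.

```python
import random


def choose_targets(instances, blast_radius, rng):
    """Pick a random subset covering at most `blast_radius` of the instances."""
    if not 0 < blast_radius <= 1:
        raise ValueError("blast_radius must be in (0, 1]")
    if not instances:
        return []
    # Always touch at least one instance, never more than the given fraction.
    count = max(1, int(len(instances) * blast_radius))
    return rng.sample(instances, count)


def run_experiment(instances, inject, rollback, is_healthy,
                   blast_radius=0.1, rng=None):
    """Inject faults one target at a time and abort once health degrades."""
    rng = rng or random.Random()
    targets = choose_targets(instances, blast_radius, rng)
    injected = []
    aborted = False
    try:
        for target in targets:
            inject(target)
            injected.append(target)
            if not is_healthy():
                aborted = True  # stop before more customers are affected
                break
    finally:
        # Roll back every injected fault, even if an exception occurred.
        for target in injected:
            rollback(target)
    return {"targets": targets, "injected": injected, "aborted": aborted}
```

A function like this could be invoked on a schedule (for example from a cron job or a CI pipeline) so that experiments run continuously without manual effort.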
Low-level fault injection: it is difficult to simulate high-level fault types (e.g. a failure code, exceptions of a particular type).
Probabilistic: the random nature of the approach provides few guarantees about the application's tolerance to failure.
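The probabilistic limitation can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each run picks exactly one of `n_targets` components uniformly and independently; real tools may select targets differently, and both function names are hypothetical.

```python
import math


def prob_fault_missed(n_targets, runs):
    """Probability that one specific target is never selected in `runs` runs."""
    return (1 - 1 / n_targets) ** runs


def runs_needed(n_targets, confidence):
    """Smallest number of runs that hits a specific target with `confidence`."""
    if n_targets == 1:
        return 1
    return math.ceil(math.log(1 - confidence) / math.log(1 - 1 / n_targets))


if __name__ == "__main__":
    print(round(prob_fault_missed(10, 10), 4))  # 0.3487
    print(runs_needed(10, 0.95))                # 29
```

Under these assumptions, ten random runs against ten components still leave roughly a one-in-three chance that any given component was never exercised.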
Netflix's chaosmonkey is a resiliency tool that helps applications tolerate random instance failures.
Pumba is a chaos testing, network emulation and stress testing tool for containers.
Chaos Mesh is a chaos engineering platform hosted by the CNCF.
Filibuster is a resiliency testing tool based on Service-level Fault Injection Testing.