What is Chaos Engineering?
Your API works perfectly 99% of the time. But what about that 1%? Chaos engineering deliberately breaks things in controlled ways to find weaknesses before customers do.
Instead of hoping your system handles failures gracefully, you force failures and see what happens:
- Inject latency to simulate slow networks
- Simulate service outages
- Cause database failures
- Create resource exhaustion
- Introduce error conditions
The goal: find problems in testing, not in production.
Types of Chaos Tests
Latency Injection
Add artificial delays to requests:
// Mock API with latency
const delayedAPI = {
  getUser: (id) => {
    return new Promise((resolve) => {
      setTimeout(() => {
        resolve({ id, name: 'John' });
      }, 2000); // 2 second delay
    });
  }
};
// Does your UI show a loading state?
// Does your request time out?
// Do you retry after a timeout?
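One way to answer those questions in a test is to wrap an existing call with a delay and put a deadline on the caller. The sketch below is only an illustration: `withLatency` and `loadUser` are hypothetical helpers, not part of any library, and the status strings are assumptions.

```javascript
// Hypothetical helper: wraps an async function so every call is delayed by delayMs.
function withLatency(fn, delayMs) {
  return (...args) =>
    new Promise((resolve) => {
      setTimeout(() => resolve(), delayMs);
    }).then(() => fn(...args));
}

// Hypothetical caller: stops waiting after deadlineMs and reports what happened.
async function loadUser(api, id, deadlineMs) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), deadlineMs);
  });
  try {
    const user = await Promise.race([api.getUser(id), deadline]);
    return { status: 'ok', user };
  } catch (error) {
    const status = error.message === 'Timeout' ? 'timed-out' : 'failed';
    return { status, user: null };
  } finally {
    clearTimeout(timer); // avoid leaving the deadline timer behind
  }
}
```

With a 2000 ms injected delay and a 500 ms deadline, `loadUser` should resolve with status `'timed-out'`, which is the signal a UI would use to leave its loading state and offer a retry.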
Error Rate Injection
Make requests fail randomly:
function maybeError(successRate = 0.9) {
  if (Math.random() > successRate) {
    throw new Error('Random failure');
  }
}
const chaosAPI = {
  getUser: (id) => {
    maybeError(0.9); // 10% error rate
    return { id, name: 'John' };
  }
};
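Random failures are hard to assert on, so it can help to make the random source injectable. The following is a minimal sketch under that assumption; `createChaosAPI` and `measureFailures` are hypothetical names introduced here for illustration.

```javascript
// Hypothetical factory: `random` defaults to Math.random but can be replaced in tests.
function createChaosAPI(successRate, random = Math.random) {
  return {
    getUser: (id) => {
      if (random() > successRate) {
        throw new Error('Random failure');
      }
      return { id, name: 'John' };
    }
  };
}

// Counts how many of `calls` requests fail when the caller catches each error.
function measureFailures(api, calls) {
  let failures = 0;
  for (let i = 0; i < calls; i++) {
    try {
      api.getUser(i);
    } catch (error) {
      failures++;
    }
  }
  return failures;
}
```

Passing a fixed sequence instead of `Math.random` makes the failure pattern reproducible, so a test can check exactly how the caller behaves on the failing calls.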
Dependency Failure
Simulate external service outages:
const paymentAPI = {
  charge: () => {
    // Simulate payment service down
    throw new Error('Payment service unavailable');
  }
};
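A fake that is always down only covers half the story. A sketch with an outage switch lets the same test exercise both the failure and the recovery; `createFakePaymentAPI` and `checkout` are hypothetical names, and the shape of the returned result is an assumption.

```javascript
// Hypothetical fake of the payment dependency with a switch for simulating an outage.
function createFakePaymentAPI() {
  let down = false;
  return {
    setDown(value) {
      down = value;
    },
    async charge(amount) {
      if (down) throw new Error('Payment service unavailable');
      return { charged: amount };
    }
  };
}

// Hypothetical caller: turns the outage into a result the UI can act on.
async function checkout(paymentAPI, amount) {
  try {
    const receipt = await paymentAPI.charge(amount);
    return { ok: true, receipt };
  } catch (error) {
    return { ok: false, reason: error.message };
  }
}
```

The point of the check is that `checkout` resolves with `ok: false` during the outage instead of letting the error escape, and goes back to `ok: true` once the switch is flipped off.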
Circuit Breaker Pattern
Prevent cascading failures with circuit breakers:
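class CircuitBreaker {
  constructor(failureThreshold = 5, resetTimeout = 60000) {
    this.failureCount = 0;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout; // ms to stay OPEN before allowing a trial call
    this.openedAt = null; // timestamp of the most recent transition to OPEN
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.resetTimeout) {
        throw new Error('Circuit breaker is OPEN');
      }
      // Reset timeout elapsed: let a trial call through
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    // A failed trial call in HALF_OPEN re-opens the circuit immediately
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
    }
  }
}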
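Testing a breaker against the real clock means waiting out the reset timeout. As a sketch of one way around that, here is a compact function-based variant of the same idea with an injectable clock. `createBreaker` and its `now` option are hypothetical, introduced only so a test can move time forward by hand.

```javascript
// Hypothetical function-based breaker; `now` is injectable so tests control time.
function createBreaker({ failureThreshold = 5, resetTimeout = 60000, now = Date.now } = {}) {
  let failures = 0;
  let openedAt = null; // null while the circuit is closed

  return async function execute(fn) {
    if (openedAt !== null && now() - openedAt < resetTimeout) {
      throw new Error('Circuit breaker is OPEN'); // fail fast, skip the dependency
    }
    // Either closed, or the reset timeout elapsed and this is the trial call.
    try {
      const result = await fn();
      failures = 0;
      openedAt = null;
      return result;
    } catch (error) {
      failures++;
      if (openedAt !== null || failures >= failureThreshold) {
        openedAt = now(); // open, or re-open after a failed trial call
      }
      throw error;
    }
  };
}
```

A chaos test can then drive the breaker past its threshold, confirm that further calls are rejected without touching the dependency, advance the fake clock past `resetTimeout`, and confirm that a successful trial call closes the circuit again.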
Testing Resilience Patterns
Retry with Exponential Backoff
async function retryWithBackoff(fn, maxRetries = 3) {
  // maxRetries is the total number of attempts, including the first call
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error; // out of attempts
      // Delay doubles each time; with 3 attempts the waits are 100ms, then 200ms
      const delay = Math.pow(2, i) * 100;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
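To check the backoff schedule without actually sleeping, the wait can be made injectable. This is a sketch under that assumption, not a drop-in replacement: `retry`, its options object, and `failNTimes` are hypothetical names introduced for illustration.

```javascript
function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical variant: `sleep` is injectable so a test can record delays instead of waiting.
async function retry(fn, { maxAttempts = 3, baseDelayMs = 100, sleep = defaultSleep } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts - 1) throw error; // out of attempts
      await sleep(baseDelayMs * 2 ** attempt); // 100, 200, 400, ...
    }
  }
}

// Hypothetical flaky dependency: fails `failures` times, then succeeds.
function failNTimes(failures) {
  let calls = 0;
  return async () => {
    calls++;
    if (calls <= failures) throw new Error(`Failure ${calls}`);
    return { calls };
  };
}
```

With `failNTimes(2)` and a recording `sleep`, the call should succeed on the third attempt after waits of 100 and 200. With `failNTimes(3)` the third error is rethrown to the caller.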
Timeout Handling
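function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), timeoutMs);
  });
  // Clear the timer once either promise settles so it doesn't linger
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}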
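Racing a promise against a timer only stops the caller from waiting; the underlying operation keeps running. Where the work itself should stop, an `AbortController` can carry the cancellation. The sketch below swaps in that technique; `slowOperation` and `withAbortTimeout` are hypothetical names, and the operation is a stand-in for any call that accepts an abort signal.

```javascript
// Hypothetical abortable operation: resolves after workMs unless the signal aborts first.
function slowOperation(workMs, signal) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve('done'), workMs);
    signal.addEventListener(
      'abort',
      () => {
        clearTimeout(timer); // stop the work itself, not just the waiting
        reject(new Error('Aborted'));
      },
      { once: true }
    );
  });
}

// Runs a task that accepts an AbortSignal and aborts it after timeoutMs.
async function withAbortTimeout(task, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}
```

For example, `withAbortTimeout((signal) => slowOperation(2000, signal), 500)` should reject with `'Aborted'` after roughly 500 ms, and the 2-second timer inside the operation is cleared rather than left running.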
Tools for Chaos Engineering
Chaos Toolkit
Open-source chaos engineering toolkit. Experiments are declared in JSON or YAML files and run from the command line.
Gremlin
Commercial chaos engineering platform with:
- User-friendly UI
- Pre-built chaos experiments
- Real-time observability
- Integration with monitoring tools
Locust with Chaos
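Locust is an open-source load-testing tool written in Python. Running it while faults are injected shows how the system behaves under load and failure at the same time.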
Chaos Engineering Best Practices
- Start small - Test one component, not the entire system
- Automate - Integrate chaos tests into CI/CD
- Run regularly - Don't wait for production failures
- Have runbooks - Know how to respond to failures
- Document findings - Learn from chaos experiments
- Involve the team - Chaos testing is a shared responsibility
- Define blast radius - Tests should not impact customers
Real-World Chaos Experiment
Hypothesis: Order service should handle payment service outages gracefully
Experiment Plan:
1. Start load: 100 requests/sec to order endpoint
2. Baseline metrics: 95% success rate, 50ms avg latency
3. Inject chaos: Disable payment service
4. Observe: System should queue orders and return 503 with a Retry-After header
5. Verify: No data loss, orders processed when service recovers
6. Rollback: Re-enable payment service
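The plan above can be rehearsed against an in-memory model before touching real services. The sketch below is only that: `createOrderService` and `runExperiment` are hypothetical, the 201 status and the `Retry-After` value of 30 seconds are assumptions, and the load-generation step is reduced to a handful of orders.

```javascript
// Hypothetical in-memory order service used only to illustrate the experiment.
function createOrderService(paymentAPI) {
  const queue = []; // orders waiting for the payment service
  const processed = []; // ids of orders that were charged successfully

  return {
    queue,
    processed,
    async placeOrder(order) {
      try {
        await paymentAPI.charge(order);
        processed.push(order.id);
        return { status: 201 };
      } catch (error) {
        queue.push(order); // keep the order so it is not lost
        return { status: 503, headers: { 'Retry-After': '30' } }; // assumed value
      }
    },
    async drainQueue() {
      while (queue.length > 0) {
        await paymentAPI.charge(queue[0]); // throws if still down; order stays queued
        processed.push(queue.shift().id);
      }
    }
  };
}

async function runExperiment() {
  let paymentDown = false;
  const paymentAPI = {
    charge: async () => {
      if (paymentDown) throw new Error('Payment service unavailable');
    }
  };
  const service = createOrderService(paymentAPI);

  const baseline = await service.placeOrder({ id: 1 }); // steady state
  paymentDown = true; // inject chaos
  const during = await Promise.all([2, 3, 4].map((id) => service.placeOrder({ id })));
  paymentDown = false; // rollback
  await service.drainQueue(); // recovery

  return {
    baselineStatus: baseline.status,
    statusesDuringOutage: during.map((response) => response.status),
    processed: service.processed,
    stillQueued: service.queue.length
  };
}
```

The hypothesis holds in this model if every order placed during the outage gets a 503, nothing is left in the queue after recovery, and all four order ids appear in `processed`.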
Monitoring During Chaos
Watch these metrics during chaos tests:
- Error rate
- Response time (p50, p95, p99)
- CPU and memory usage
- Database connections
- Queue depth (if using queues)
- User experience indicators
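If your monitoring stack doesn't already report them, error rate and latency percentiles can be computed from raw request results. This is a minimal sketch using the nearest-rank method; `percentile`, `summarize`, and the `{ ok, latencyMs }` result shape are assumptions made for the example.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('No samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarizes an array of { ok: boolean, latencyMs: number } request results.
function summarize(results) {
  const latencies = results.map((result) => result.latencyMs);
  const errors = results.filter((result) => !result.ok).length;
  return {
    errorRate: errors / results.length,
    p50: percentile(latencies, 50),
    p95: percentile(latencies, 95),
    p99: percentile(latencies, 99)
  };
}
```

Comparing a summary taken before the fault is injected with one taken during it gives a concrete measure of the impact. The tail percentiles usually move before the median does.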
Conclusion
Chaos engineering isn't about breaking things for fun; it's about finding weaknesses in controlled environments. Netflix, Amazon, and Google all use chaos engineering. If you manage critical systems, you should too. Start with simple latency injection, add error rates, then test complex failure scenarios. Your users will thank you for the reliability improvements.