Chaos Engineering: API Resilience Testing & Failure Injection

What is Chaos Engineering?

Your API works perfectly 99% of the time. But what about that 1%? Chaos engineering deliberately breaks things in controlled ways to find weaknesses before customers do.

Instead of hoping your system handles failures gracefully, you force failures and see what happens:

Inject latency to simulate slow networks
Simulate service outages
Cause database failures
Create resource exhaustion
Introduce error conditions

The goal: find problems in testing, not in production.

Types of Chaos Tests

Latency Injection

Add artificial delays to requests:

// Mock API with latency
const delayedAPI = {
  getUser: (id) => {
    return new Promise((resolve) => {
      setTimeout(() => {
        resolve({ id, name: 'John' });
      }, 2000); // 2 second delay
    });
  }
};

// Does your UI show a loading state?
// Does your request time out?
// Do you retry after a timeout?
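
The mock above hard-codes one delay into one method. A reusable wrapper makes it easier to slow down any call by a variable amount. The following is a minimal sketch, not a definitive implementation: `withLatency`, its option names, and the stand-in `getUser` are all hypothetical names introduced for illustration.

```javascript
// Hypothetical helper: wraps any async function and delays each call by a
// random amount between minMs and maxMs before invoking it.
// `random` is injectable so tests can control the delay.
function withLatency(fn, { minMs = 0, maxMs = 1000, random = Math.random } = {}) {
  return async (...args) => {
    const delayMs = minMs + random() * (maxMs - minMs);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    return fn(...args);
  };
}

// Stand-in for a real API call (assumption for this sketch)
const getUser = async (id) => ({ id, name: 'John' });

// Same call, now 50-150ms slower on every request
const slowGetUser = withLatency(getUser, { minMs: 50, maxMs: 150 });

async function timeCall() {
  const start = Date.now();
  const user = await slowGetUser(1);
  console.log(`got user ${user.id} after ${Date.now() - start}ms`);
}

timeCall();
```

Because the wrapper leaves the wrapped function's signature unchanged, it can be swapped in wherever the real client is used in a test.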

Error Rate Injection

Make requests fail randomly:

function maybeError(successRate = 0.9) {
  if (Math.random() > successRate) {
    throw new Error('Random failure');
  }
}

const chaosAPI = {
  getUser: (id) => {
    maybeError(0.9); // 10% error rate
    return { id, name: 'John' };
  }
};
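
To check how callers cope with a given error rate, it helps to wrap an existing function instead of editing it, and to count what actually failed. This sketch assumes an async API; `withErrorRate` and `countFailures` are hypothetical helpers, and the random source is injectable so a test can make the failures repeatable.

```javascript
// Hypothetical wrapper: rejects a fraction of calls to any async function.
// `random` defaults to Math.random but can be replaced in tests.
function withErrorRate(fn, errorRate, random = Math.random) {
  return async (...args) => {
    if (random() < errorRate) {
      throw new Error('Injected failure');
    }
    return fn(...args);
  };
}

// Calls `fn` `total` times and reports how many of those calls rejected.
async function countFailures(fn, total) {
  let failures = 0;
  for (let i = 0; i < total; i++) {
    try {
      await fn(i);
    } catch {
      failures++;
    }
  }
  return failures;
}

// Stand-in API call (assumption), failing roughly 10% of the time
const flakyGetUser = withErrorRate(async (id) => ({ id, name: 'John' }), 0.1);

countFailures(flakyGetUser, 1000).then((failures) => {
  console.log(`${failures} of 1000 calls failed`);
});
```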

Dependency Failure

Simulate external service outages:

const paymentAPI = {
  charge: () => {
    // Simulate payment service down
    throw new Error('Payment service unavailable');
  }
};
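
A stub that always throws cannot show recovery. One possible approach is a switch that takes a dependency down and brings it back during a test. The sketch below assumes the dependency is a plain object of async methods; `createOutageSwitch`, `takeDown`, and `restore` are hypothetical names.

```javascript
// Hypothetical fault switch: wraps every method of `service` so that calls
// reject while the switch is "down" and pass through otherwise.
function createOutageSwitch(service) {
  let down = false;
  const wrapped = {};

  for (const [name, method] of Object.entries(service)) {
    wrapped[name] = async (...args) => {
      if (down) {
        throw new Error(`${name}: service unavailable (injected)`);
      }
      return method.apply(service, args);
    };
  }

  return {
    service: wrapped,
    takeDown: () => { down = true; },
    restore: () => { down = false; },
  };
}

// Stand-in payment client (assumption for this sketch)
const realPaymentAPI = {
  charge: async (amount) => ({ charged: amount }),
};

const outage = createOutageSwitch(realPaymentAPI);

async function demo() {
  outage.takeDown();
  try {
    await outage.service.charge(25);
  } catch (err) {
    console.log('during outage:', err.message);
  }
  outage.restore();
  console.log('after recovery:', await outage.service.charge(25));
}

demo();
```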

Circuit Breaker Pattern

Prevent cascading failures with circuit breakers:

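class CircuitBreaker {
  constructor(failureThreshold = 5, resetTimeout = 60000) {
    this.failureCount = 0;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout; // ms to stay OPEN before allowing a trial call
    this.openedAt = null; // timestamp of the most recent transition to OPEN
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      // Fail fast until resetTimeout has elapsed, then allow a trial call
      if (Date.now() - this.openedAt < this.resetTimeout) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF_OPEN';
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    // A failed trial call in HALF_OPEN reopens the circuit immediately
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
    }
  }
}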
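
A chaos test for a breaker should confirm that it stops calling a dead dependency. The helper below is a sketch that works with any object exposing an `execute(fn)` method like the class above; `probeBreaker` is a hypothetical name.

```javascript
// Hypothetical probe: sends `attempts` calls through the breaker to a
// dependency that always fails, then reports how many calls actually
// reached the dependency versus how many were rejected overall.
async function probeBreaker(breaker, attempts) {
  let reachedDependency = 0;
  let rejected = 0;

  const alwaysFails = async () => {
    reachedDependency++;
    throw new Error('Dependency down (injected)');
  };

  for (let i = 0; i < attempts; i++) {
    try {
      await breaker.execute(alwaysFails);
    } catch {
      rejected++;
    }
  }

  return { reachedDependency, rejected };
}
```

Calling `probeBreaker(new CircuitBreaker(5, 60000), 20)` should report that only 5 of the 20 calls reached the dependency while all 20 were rejected. If `reachedDependency` equals `attempts`, the breaker never opened.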

Testing Resilience Patterns

Retry with Exponential Backoff

// Calls fn up to maxRetries times in total, doubling the wait after each failure
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error; // out of attempts

      const delay = Math.pow(2, i) * 100; // 100ms, then 200ms with the default of 3 attempts
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
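
Random failures make retry logic hard to test, because the outcome changes from run to run. A deterministic fake that fails a fixed number of times is one way around this. `failNTimes` below is a hypothetical helper for illustration.

```javascript
// Hypothetical test double: returns an async function that rejects on its
// first `n` calls and resolves with `value` on every call after that.
function failNTimes(n, value) {
  let calls = 0;

  const fn = async () => {
    calls++;
    if (calls <= n) {
      throw new Error(`Injected failure ${calls}`);
    }
    return value;
  };

  // Expose the call count so a test can check how many attempts were made
  fn.callCount = () => calls;
  return fn;
}
```

With the default of three attempts, `retryWithBackoff(failNTimes(2, 'ok'))` should resolve with `'ok'` after three calls, while `retryWithBackoff(failNTimes(3, 'ok'))` should reject with the third injected error.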

Timeout Handling

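function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), timeoutMs);
  });
  // Clear the timer once either promise settles so it cannot keep the process alive
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}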
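
Promise.race only stops waiting. The losing operation keeps running in the background. Where the operation supports cancellation, an AbortController can stop the work itself. This is a sketch under that assumption; `withAbortableTimeout` and `slowOperation` are hypothetical names, and the approach only helps if the task honors the signal it receives.

```javascript
// Hypothetical helper: runs `task(signal)` and aborts the signal after
// timeoutMs. The task must react to the signal for this to have any effect.
async function withAbortableTimeout(task, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// Stand-in for a slow, cancellable operation (assumption for this sketch)
function slowOperation(signal) {
  return new Promise((resolve, reject) => {
    const work = setTimeout(() => resolve('finished'), 500);
    signal.addEventListener('abort', () => {
      clearTimeout(work); // stop the underlying work, not just the wait
      reject(new Error('Aborted after timeout'));
    });
  });
}

withAbortableTimeout(slowOperation, 100)
  .then((result) => console.log(result))
  .catch((err) => console.log(err.message));
```

`fetch` accepts a `signal` option, so a request can be passed in the same way: `withAbortableTimeout((signal) => fetch(url, { signal }), 1000)`, where `url` is whatever endpoint is under test.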

Tools for Chaos Engineering

Chaos Toolkit

Open-source chaos engineering toolkit driven from the command line, with experiments declared in JSON or YAML files.

Gremlin

Commercial chaos engineering platform with:

User-friendly UI
Pre-built chaos experiments
Real-time observability
Integration with monitoring tools

Locust with Chaos

Locust is an open-source load-testing tool. Running a load test while injecting faults shows how the system behaves when failures happen under realistic traffic.

Chaos Engineering Best Practices

Start small - Test one component, not the entire system
Automate - Integrate chaos tests into CI/CD
Run regularly - Don't wait for production failures
Have runbooks - Know how to respond to failures
Document findings - Learn from chaos experiments
Involve the team - Chaos testing is a shared responsibility
Define blast radius - Tests should not impact customers

Real-World Chaos Experiment

Hypothesis: Order service should handle payment service outages gracefully

Experiment Plan:
1. Start load: 100 requests/sec to order endpoint
2. Baseline metric: 95% success rate, 50ms avg latency
3. Inject chaos: Disable payment service
4. Observe: System should queue orders and return 503 with a Retry-After header
5. Verify: No data loss, orders processed when service recovers
6. Rollback: Re-enable payment service
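
The behavior the plan expects in steps 4 and 5 can be sketched as a small order handler. Everything here is an assumption made for illustration: `createOrderService`, the response shape, the 201 status on success, and the 30 second Retry-After value. The queue is held in memory, so a real service would need durable storage to meet the "no data loss" requirement.

```javascript
// Hypothetical order service: charges through paymentAPI and queues the
// order when the payment service is unavailable.
function createOrderService(paymentAPI) {
  const pending = []; // in-memory only; not durable across restarts

  return {
    async placeOrder(order) {
      try {
        await paymentAPI.charge(order);
        return { status: 201, headers: {}, body: { orderId: order.id, state: 'paid' } };
      } catch {
        pending.push(order);
        return {
          status: 503,
          headers: { 'Retry-After': '30' },
          body: { orderId: order.id, state: 'queued' },
        };
      }
    },

    // Retries queued orders in arrival order and stops at the first failure.
    // An order leaves the queue only after its charge succeeds.
    async drainQueue() {
      let processed = 0;
      while (pending.length > 0) {
        try {
          await paymentAPI.charge(pending[0]);
        } catch {
          break;
        }
        pending.shift();
        processed++;
      }
      return processed;
    },

    pendingCount: () => pending.length,
  };
}
```

A test for the experiment would place orders while the payment stub throws, assert that each response is a 503 and that `pendingCount()` matches the number of orders, then restore the stub and assert that `drainQueue()` processes all of them.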

Monitoring During Chaos

Watch these metrics during chaos tests:

Error rate
Response time (p50, p95, p99)
CPU and memory usage
Database connections
Queue depth (if using queues)
User experience indicators
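
Averages hide the slow tail that chaos tests tend to produce, which is why the list above asks for percentiles. As a rough sketch, the nearest-rank method computes them from raw latency samples. `percentile` and `summarize` are hypothetical helpers, and `summarize` assumes one latency sample per request.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) {
    throw new Error('No samples');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarizes one test run. Assumes latenciesMs holds one entry per request.
function summarize(latenciesMs, errorCount) {
  return {
    errorRate: errorCount / latenciesMs.length,
    p50: percentile(latenciesMs, 50),
    p95: percentile(latenciesMs, 95),
    p99: percentile(latenciesMs, 99),
  };
}

// Made-up sample data for illustration only
console.log(summarize([12, 15, 11, 240, 13, 14, 900, 16, 12, 13], 1));
```

Comparing a summary taken before the fault is injected with one taken during it gives a concrete measure of how much the failure degraded the service.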

Conclusion

Chaos engineering isn't about breaking things for fun. It's about finding weaknesses under controlled conditions before they surface in production. Netflix, Amazon, and Google all use chaos engineering. If you manage critical systems, you should too. Start with simple latency injection, add error rates, then test complex failure scenarios. Your users will thank you for the reliability improvements.