Chaos Engineering for APIs: Test Resilience and Failure

Learn chaos engineering for APIs. Inject latency, errors, and timeouts to test how your system handles failures. Build resilient API systems.

What is Chaos Engineering?

Your API works perfectly 99% of the time. But what about that 1%? Chaos engineering deliberately breaks things in controlled ways to find weaknesses before customers do.

Instead of hoping your system handles failures gracefully, you force failures and see what happens:

  • Inject latency to simulate slow networks
  • Simulate service outages
  • Cause database failures
  • Create resource exhaustion
  • Introduce error conditions

The goal: find problems in testing, not in production.

Types of Chaos Tests

Latency Injection

Add artificial delays to requests:

// Mock API with latency
const delayedAPI = {
  getUser: (id) => {
    return new Promise((resolve) => {
      setTimeout(() => {
        resolve({ id, name: 'John' });
      }, 2000); // 2 second delay
    });
  }
};

// Does your UI show a loading state?
// Does your request time out?
// Do you retry after a timeout?
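
Hard-coding a delay into every mock gets repetitive. One option is a small wrapper that adds latency to any function. The sketch below is only an illustration: `withLatency`, `minMs`, and `maxMs` are hypothetical names, and the random range is an assumption about what "slow network" means for your system.

```javascript
// Hypothetical helper: wraps any function so each call waits a random
// delay between minMs and maxMs before the wrapped function runs.
function withLatency(fn, { minMs = 0, maxMs = 2000 } = {}) {
  return async (...args) => {
    const delay = minMs + Math.random() * (maxMs - minMs);
    await new Promise((resolve) => setTimeout(resolve, delay));
    return fn(...args);
  };
}

// Example: a fast mock becomes a slow one without changing its body.
const getUser = async (id) => ({ id, name: 'John' });
const slowGetUser = withLatency(getUser, { minMs: 500, maxMs: 2000 });
```

Because the delay is applied from the outside, the same mock can be reused with and without chaos.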

Error Rate Injection

Make requests fail randomly:

function maybeError(successRate = 0.9) {
  if (Math.random() > successRate) {
    throw new Error('Random failure');
  }
}

const chaosAPI = {
  getUser: (id) => {
    maybeError(0.9); // 10% error rate
    return { id, name: 'John' };
  }
};
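
Random failures can make the chaos test itself flaky. A possible refinement, sketched here under the assumption that your API calls are async, is to make the random source injectable so a test can control exactly which calls fail. `withErrorRate` and `countFailures` are hypothetical helpers, not part of any library.

```javascript
// Hypothetical helper: wraps an async function so a fraction of calls reject.
// `random` is injectable so a test can make the failures deterministic.
function withErrorRate(fn, errorRate = 0.1, random = Math.random) {
  return async (...args) => {
    if (random() < errorRate) {
      throw new Error('Injected failure');
    }
    return fn(...args);
  };
}

// Runs `total` calls one after another and counts how many rejected.
async function countFailures(fn, total) {
  let failures = 0;
  for (let i = 0; i < total; i++) {
    try {
      await fn(i);
    } catch {
      failures++;
    }
  }
  return failures;
}
```

In a real run you would leave `random` at its default and compare the observed failure count with what your error handling is expected to absorb.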

Dependency Failure

Simulate external service outages:

const paymentAPI = {
  charge: () => {
    // Simulate payment service down
    throw new Error('Payment service unavailable');
  }
};
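
The interesting part is what the caller does when the dependency is down. Here is a minimal sketch of one graceful response: queue the order and tell the client to retry. `placeOrder`, `pendingQueue`, the response shape, and the 30-second `Retry-After` value are all assumptions made for illustration.

```javascript
// Hypothetical order handler: if the payment call fails, the order is
// queued for later processing instead of being lost.
async function placeOrder(order, paymentAPI, pendingQueue) {
  try {
    await paymentAPI.charge(order);
    return { status: 201, body: { orderId: order.id, state: 'paid' } };
  } catch (error) {
    pendingQueue.push(order);
    return {
      status: 503,
      headers: { 'Retry-After': '30' }, // seconds; the value is an assumption
      body: { orderId: order.id, state: 'pending_payment' },
    };
  }
}
```

Calling `placeOrder({ id: 1 }, paymentAPI, queue)` with the failing `paymentAPI` mock above should return a 503 and leave the order in `queue`.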

Circuit Breaker Pattern

Prevent cascading failures with circuit breakers:

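class CircuitBreaker {
  constructor(failureThreshold = 5, resetTimeout = 60000) {
    this.failureCount = 0;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout; // ms to stay OPEN before a trial call
    this.openedAt = null;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
  }
  
  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.resetTimeout) {
        throw new Error('Circuit breaker is OPEN');
      }
      // Reset timeout has elapsed: let a trial call through
      this.state = 'HALF_OPEN';
    }
    
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  
  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }
  
  onFailure() {
    this.failureCount++;
    // A failed trial call reopens the circuit immediately
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
    }
  }
}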
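
A chaos test for the breaker can check that it actually shields the failing dependency. The helper below is a sketch; `callsReachingDependency` is a hypothetical name, and it assumes only that the breaker exposes an async `execute(fn)` method like the class above.

```javascript
// Hypothetical test helper: drives `attempts` calls through a breaker that
// wraps an always-failing dependency and counts how many calls reach it.
async function callsReachingDependency(breaker, attempts) {
  let reached = 0;
  const failingCall = async () => {
    reached++;
    throw new Error('Dependency down');
  };
  for (let i = 0; i < attempts; i++) {
    try {
      await breaker.execute(failingCall);
    } catch {
      // Expected: either the dependency error or the breaker's own rejection.
    }
  }
  return reached;
}
```

With `new CircuitBreaker(5)`, 20 attempts should reach the dependency only 5 times. The remaining 15 are rejected by the open circuit without touching the failing service.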

Testing Resilience Patterns

Retry with Exponential Backoff

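// Calls fn up to maxRetries times in total, doubling the wait after each failure
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error; // Out of attempts: surface the last error
      
      const delay = Math.pow(2, i) * 100; // 100ms, then 200ms (no wait after the final attempt)
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}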
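
To test retry logic you need a dependency that fails a known number of times. One way to get that is a small test double. `failTimes` is a hypothetical helper, sketched here for illustration.

```javascript
// Hypothetical test double: returns an async function that rejects on its
// first `failures` calls and resolves with `value` afterwards.
// The number of calls made so far is exposed as `fn.calls`.
function failTimes(failures, value) {
  const fn = async () => {
    fn.calls++;
    if (fn.calls <= failures) {
      throw new Error(`Injected failure ${fn.calls}`);
    }
    return value;
  };
  fn.calls = 0;
  return fn;
}
```

Passing `failTimes(2, 'ok')` to `retryWithBackoff` with the default `maxRetries` should resolve to `'ok'` on the third call, while `failTimes(3, 'ok')` should exhaust the attempts and throw.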

Timeout Handling

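function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), timeoutMs);
  });
  // Clear the timer once either promise settles so it cannot fire later
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}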
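
One limitation of `Promise.race`: the slow request keeps running after the timeout wins. If the underlying work accepts an `AbortSignal` (as `fetch` does), you can cancel it instead. This is a sketch under that assumption; `withAbortTimeout` is a hypothetical name, and a task that ignores the signal will not be interrupted by it.

```javascript
// Sketch: a timeout that signals cancellation to the underlying work.
// `task` receives an AbortSignal and is expected to reject when it aborts.
async function withAbortTimeout(task, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await task(controller.signal);
  } finally {
    clearTimeout(timer); // Do not leave the timer running after completion
  }
}

// Usage with fetch (url is a placeholder):
// withAbortTimeout((signal) => fetch(url, { signal }), 1000);
```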

Tools for Chaos Engineering

Chaos Toolkit

Open-source chaos engineering toolkit. Experiments are declared in JSON or YAML files and run from the command line.

Gremlin

Commercial chaos engineering platform with:

  • User-friendly UI
  • Pre-built chaos experiments
  • Real-time observability
  • Integration with monitoring tools

Locust with Chaos

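Locust is an open-source load testing tool written in Python. Combining its load tests with chaos injection shows how your API handles failures under realistic traffic, not just when idle.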

Chaos Engineering Best Practices

  • Start small - Test one component, not the entire system
  • Automate - Integrate chaos tests into CI/CD
  • Run regularly - Don't wait for production failures
  • Have runbooks - Know how to respond to failures
  • Document findings - Learn from chaos experiments
  • Involve the team - Chaos testing is a shared responsibility
  • Define the blast radius - Limit what an experiment can affect so tests do not impact customers

Real-World Chaos Experiment

Hypothesis: Order service should handle payment service outages gracefully

Experiment Plan:
1. Start load: 100 requests/sec to order endpoint
2. Baseline metric: 95% success rate, 50ms avg latency
3. Inject chaos: Disable payment service
4. Observe: System should queue orders and return 503 with a Retry-After header
5. Verify: No data loss, orders processed when service recovers
6. Rollback: Re-enable payment service
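
The plan above can be sketched as a small runner. Everything here is hypothetical: `runExperiment` and its five callbacks are names made up for illustration, and you would supply the real load, fault injection, and checks yourself. Recovery is verified after the rollback, since orders can only be processed once the payment service is back.

```javascript
// Hypothetical experiment runner. All callbacks are supplied by the caller;
// nothing here is tied to a particular chaos tool.
async function runExperiment({
  checkSteadyState, // steps 1-2: is the baseline met under load?
  inject,           // step 3: e.g. disable the payment service
  observe,          // step 4: does the system degrade as hypothesized?
  rollback,         // step 6: re-enable the payment service
  verifyRecovery,   // step 5: no data loss, queued orders processed
}) {
  // Refuse to start if the system is already unhealthy.
  if (!(await checkSteadyState())) {
    return { ran: false, passed: false, reason: 'baseline not met' };
  }

  let degradedAsExpected = false;
  try {
    await inject();
    degradedAsExpected = await observe();
  } finally {
    // Always restore the system, even if inject or observe throws.
    await rollback();
  }

  const recovered = await verifyRecovery();
  return { ran: true, passed: degradedAsExpected && recovered };
}
```

Putting the rollback in `finally` keeps a failed observation from leaving the fault in place.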

Monitoring During Chaos

Watch these metrics during chaos tests:

  • Error rate
  • Response time (p50, p95, p99)
  • CPU and memory usage
  • Database connections
  • Queue depth (if using queues)
  • User experience indicators
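
Percentiles matter because an average hides the slow tail that chaos tests tend to create. If your monitoring tool does not compute them for you, here is a minimal sketch using the nearest-rank method. `percentile` and `summarize` are hypothetical helpers, and other tools may use interpolating definitions that give slightly different numbers.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('No samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarizes response times (in ms) collected during a chaos run.
function summarize(latenciesMs) {
  return {
    p50: percentile(latenciesMs, 50),
    p95: percentile(latenciesMs, 95),
    p99: percentile(latenciesMs, 99),
  };
}
```

For the samples `[10, 20, 30, 40, 1000]` this gives a p50 of 30 but a p95 and p99 of 1000, while the average (220) describes none of the actual requests.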

Conclusion

Chaos engineering isn't about breaking things for fun. It's about finding weaknesses in controlled environments. Netflix, Amazon, and Google all use chaos engineering. If you manage critical systems, you should too. Start with simple latency injection, add error rates, then test complex failure scenarios. Your users will thank you for the reliability improvements.