Fault Injection
Khaos uses chaos engineering principles to test how your AI agents behave under real-world failure conditions. By injecting faults at the SDK level, Khaos creates production-realistic failures that reveal how your agent handles errors, timeouts, and edge cases.
How Fault Injection Works
Khaos injects faults by patching the SDKs your agent uses—OpenAI, Anthropic, httpx, requests, and MCP clients. This means faults are indistinguishable from real production failures:
- Production-realistic exceptions — Exact error codes, headers, and message structures
- SDK-level interception — No mocking; your actual error handling code is tested
- Probabilistic or deterministic — Random injection for stress testing, or scheduled for reproducibility
- Automatic capability detection — Faults only apply to capabilities your agent actually uses
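The sketch below illustrates the idea behind SDK-level interception: a patched SDK method raises the same exception class, status code, and headers the real API would return, so whatever error handling your agent already has is what gets exercised. This is a simplified illustration, not Khaos's actual shim code, and the constructor details assume a recent (>=1.x) version of the `openai` Python SDK.

```python
# Simplified illustration of SDK-level fault injection (not the actual Khaos shim).
# Assumes openai>=1.x; exception constructor details vary across SDK versions.
import random

import httpx
import openai
from openai.resources.chat.completions import Completions

_original_create = Completions.create

def _patched_create(self, *args, **kwargs):
    if random.random() < 0.2:  # probability taken from the fault config
        request = httpx.Request("POST", "https://api.openai.com/v1/chat/completions")
        response = httpx.Response(429, headers={"retry-after": "30"}, request=request)
        # Same exception type, status code, and header the real API would return
        raise openai.RateLimitError("Rate limit exceeded", response=response, body=None)
    return _original_create(self, *args, **kwargs)

Completions.create = _patched_create  # agent code calling the SDK is otherwise unaffected
```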
Fault Categories
Khaos ships with 21 built-in fault types across 6 categories. Each fault creates the exact exception your agent would see in production.
LLM Provider Faults
Simulate failures from OpenAI, Anthropic, and other LLM providers. These faults match the exact HTTP status codes, Retry-After headers, and error structures returned by real APIs.
| Fault | Description | Real-World Scenario |
|---|---|---|
| `llm_rate_limit` | 429 rate limit with Retry-After header | API quota exceeded during peak usage |
| `llm_model_unavailable` | 404 model not found error | Model deprecated or access revoked |
| `llm_response_timeout` | `APITimeoutError` exception | Network issues or overloaded API |
| `llm_token_quota_exceeded` | 429 quota exhaustion error | Monthly token limit reached |
| `model_fallback_forced` | Silent model substitution | Primary model unavailable, tests fallback behavior |
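Because `llm_rate_limit` carries a real Retry-After header, an agent can honor it exactly as it would in production. A minimal sketch using the `openai` Python SDK (the model name and retry policy are placeholders):

```python
# Minimal retry loop that honors the injected Retry-After header (openai>=1.x).
import time

import openai

client = openai.OpenAI()

def ask(prompt: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except openai.RateLimitError as err:
            # Injected 429s expose the same Retry-After header as real ones
            wait_s = float(err.response.headers.get("retry-after", 2 ** attempt))
            time.sleep(wait_s)
        except openai.APITimeoutError:
            continue  # llm_response_timeout: retry, or fall back to a cached answer
    raise RuntimeError("LLM unavailable after retries")
```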
```yaml
# LLM fault configuration
faults:
  - type: llm_rate_limit
    config:
      retry_after_ms: 30000   # Retry-After header value
      probability: 0.2        # 20% of LLM calls affected
```

HTTP/Network Faults
Test how your agent handles network-level failures affecting any HTTP calls—API requests, webhooks, external services.
| Fault | Description | Real-World Scenario |
|---|---|---|
| `http_latency` | Add configurable delay to requests | Slow network, congested APIs |
| `http_error` | Return HTTP error codes (4xx, 5xx) | Service outages, authorization failures |
| `timeout` | Request timeout before response | Network partition, unresponsive server |
| `malformed_payload` | Invalid JSON, truncated, or garbage responses | Proxy errors, encoding issues |
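For HTTP-level faults, the key is bounding each call and degrading gracefully when a service stays unhealthy. A sketch using `httpx` (the URL, timeout, and backoff policy are illustrative):

```python
# Defensive HTTP call: bounds injected latency, retries on errors, and
# tolerates malformed_payload responses instead of crashing the agent.
import time

import httpx

def fetch_json(url: str, attempts: int = 3) -> dict:
    for attempt in range(attempts):
        try:
            resp = httpx.get(url, timeout=5.0)  # caps http_latency / timeout faults
            resp.raise_for_status()             # surfaces injected 4xx/5xx as exceptions
            return resp.json()                  # raises ValueError on malformed_payload
        except (httpx.HTTPError, ValueError):
            time.sleep(2 ** attempt)            # simple exponential backoff
    return {}  # degrade gracefully rather than failing the whole task
```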
```yaml
# HTTP fault configuration
faults:
  - type: http_latency
    config:
      delay_ms: 2000     # 2 second delay
      jitter_ms: 500     # ±500ms random jitter
      probability: 0.3   # 30% of requests
  - type: http_error
    config:
      status_code: 503   # Service Unavailable
      probability: 0.1   # 10% of requests
```

Tool/Function Call Faults
When your agent uses tools (function calling), these faults test error handling at the tool execution layer.
| Fault | Description | Real-World Scenario |
|---|---|---|
| `tool_call_failure` | Tool execution fails with error | External API down, permission denied |
| `tool_response_corruption` | Tool returns malformed data | Schema changes, encoding errors |
| `tool_latency_spike` | Tool response significantly delayed | Slow database, network congestion |
| `tool_output_injection` | Inject attack payload into tool output | Compromised tool, indirect prompt injection |
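These faults reward a thin defensive wrapper around tool execution: retry transient failures, validate the output shape before it reaches the model, and flag abnormally slow calls. The wrapper below is a generic sketch rather than a Khaos API; `tool_fn` stands in for whatever function your framework dispatches to, and note that shape checks alone do not protect against `tool_output_injection`, whose content should still be treated as untrusted.

```python
# Generic tool-execution guard (illustrative sketch, not a Khaos API).
import time

def call_tool_safely(tool_fn, args: dict, retries: int = 2, slow_after_s: float = 10.0) -> dict:
    for attempt in range(retries + 1):
        started = time.monotonic()
        try:
            result = tool_fn(**args)
        except Exception as err:                       # tool_call_failure
            if attempt == retries:
                return {"error": f"tool failed: {err}"}
            time.sleep(1.0)
            continue
        if time.monotonic() - started > slow_after_s:  # tool_latency_spike
            return {"error": "tool responded too slowly; using fallback"}
        if not isinstance(result, dict):               # tool_response_corruption
            return {"error": "unexpected tool output format"}
        return result
```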
MCP (Model Context Protocol) Faults
For agents using MCP servers, these faults test the MCP protocol layer specifically.
| Fault | Description | Real-World Scenario |
|---|---|---|
| `mcp_tool_latency` | Add delay to MCP tool calls | Slow MCP server, network issues |
| `mcp_tool_failure` | MCP tool execution errors | Server-side tool failures |
| `mcp_tool_corruption` | Corrupted MCP tool responses | Protocol errors, serialization bugs |
| `mcp_server_unavailable` | MCP server connection refused | Server crash, deployment issues |
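Handling these mostly mirrors the tool-level patterns above: treat the MCP server as an unreliable dependency. The sketch below shows only the shape of that handling; `mcp_session.call_tool(...)` is a placeholder for however your agent talks to its MCP server, not a specific SDK API.

```python
# Shape-only sketch: tolerate MCP server and tool failures.
# `mcp_session` is a placeholder, not a particular MCP client library.
async def call_mcp_tool(mcp_session, name: str, arguments: dict) -> dict:
    try:
        return await mcp_session.call_tool(name, arguments)  # mcp_tool_latency / failure
    except ConnectionError:
        # mcp_server_unavailable: fall back to a degraded, tool-free answer
        return {"error": f"MCP server unreachable; skipping tool '{name}'"}
    except Exception as err:
        # mcp_tool_failure / mcp_tool_corruption: report rather than crash
        return {"error": f"MCP tool '{name}' failed: {err}"}
```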
RAG/Retrieval Faults
Test how your agent handles failures in retrieval-augmented generation pipelines.
| Fault | Description | Real-World Scenario |
|---|---|---|
| `rag_retrieval_corruption` | Return wrong or irrelevant documents | Index corruption, embedding drift |
| `rag_document_poisoning` | Inject malicious content into retrieved docs | Data poisoning, compromised sources |
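A common mitigation is to validate retrieved documents before they reach the prompt: restrict sources and drop chunks that look like instructions rather than content. The allowlist and marker strings below are illustrative placeholders, not a complete defense against poisoning.

```python
# Illustrative pre-prompt filter for retrieved documents.
SUSPICIOUS_MARKERS = ("ignore previous instructions", "disregard the system prompt")
TRUSTED_SOURCES = {"kb.example.com", "docs.internal"}  # placeholder allowlist

def filter_documents(docs: list[dict]) -> list[dict]:
    safe = []
    for doc in docs:
        text = str(doc.get("text", "")).lower()
        if doc.get("source") not in TRUSTED_SOURCES:
            continue  # rag_retrieval_corruption: drop docs from unexpected sources
        if any(marker in text for marker in SUSPICIOUS_MARKERS):
            continue  # rag_document_poisoning: drop instruction-like content
        safe.append(doc)
    return safe
```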
Context/Memory Faults
Simulate context window and memory-related failures that affect long-running conversations.
| Fault | Description | Real-World Scenario |
|---|---|---|
| `context_window_overflow` | Context window limit exceeded | Long conversations, large documents |
| `conversation_history_dropped` | Conversation history loss | Session timeout, state corruption |
| `embedding_service_failure` | Embedding API unavailable | Embedding service outage |
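A typical recovery path for `context_window_overflow` is to catch the provider's context-length error, trim the oldest turns while keeping the system prompt, and retry once. The sketch below assumes the `openai` Python SDK, where an oversized request surfaces as a 400-class `BadRequestError`; the model name and trimming policy are placeholders.

```python
# Sketch: recover from context_window_overflow by trimming history and retrying once.
import openai

client = openai.OpenAI()

def chat_with_trimming(messages: list[dict]) -> str:
    try:
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    except openai.BadRequestError:
        # Assume a context-length error: keep the system prompt plus the newest turns
        trimmed = messages[:1] + messages[-6:]
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=trimmed)
    return resp.choices[0].message.content
```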
Fault Scheduling
Khaos supports two approaches to fault injection, depending on your testing goals:
Probabilistic Injection
Random fault injection for stress testing. Each request has a probability of being affected.
```yaml
# Random injection across all inputs
faults:
  - type: llm_rate_limit
    config:
      probability: 0.2   # 20% chance per LLM call
  - type: http_latency
    config:
      delay_ms: 500
      probability: 0.3   # 30% chance per HTTP request
```

Deterministic Scheduling
Map specific faults to specific test inputs for reproducible testing. This is how the built-in evaluation packs work—ensuring consistent, comparable results across runs.
```yaml
# Built-in packs use fault schedules for reproducibility
resilience:
  fault_schedule:
    # This specific input gets LLM timeout
    math_simple:
      - type: llm_response_timeout
        config:
          delay_ms: 100
    # This input gets rate limiting
    instruction_format_json:
      - type: llm_rate_limit
        config:
          retry_after_ms: 1000
    # This input gets HTTP latency
    knowledge_capital:
      - type: http_latency
        config:
          delay_ms: 500
```

Using the same `--seed` value with deterministic fault schedules produces identical injection sequences. This enables meaningful before/after comparisons when testing agent changes.

Fault Configuration Options
Each fault type accepts specific configuration options:
```yaml
faults:
  # LLM rate limiting with retry header
  - type: llm_rate_limit
    config:
      retry_after_ms: 30000   # Retry-After header value (ms)
      probability: 0.2        # Injection probability

  # HTTP latency with jitter
  - type: http_latency
    config:
      delay_ms: 1000     # Base delay
      jitter_ms: 200     # Random ±200ms
      probability: 0.5

  # HTTP errors with specific status code
  - type: http_error
    config:
      status_code: 503   # HTTP status code
      message: "Service temporarily unavailable"
      probability: 0.1

  # Tool execution failure
  - type: tool_call_failure
    config:
      error_type: "execution_error"
      error_message: "Tool execution failed: connection refused"

  # MCP tool targeting specific tools
  - type: mcp_tool_latency
    config:
      tool_name: "search"   # Target specific tool (or "*" for all)
      latency_ms: 3000

  # Malformed payload types
  - type: malformed_payload
    config:
      corruption_type: "invalid_json"   # invalid_json, truncated, binary_garbage
      probability: 0.15
```

Running Fault Injection Tests
Fault injection is built into the standard evaluation workflow:
```bash
# Standard evaluation includes resilience testing
khaos run <agent-name>

# Comprehensive evaluation with extensive fault coverage
khaos run <agent-name> --eval full-eval

# Quick evaluation with focused fault testing
khaos run <agent-name> --eval quickstart

# Baseline only (no faults) for comparison
khaos run <agent-name> --eval baseline
```

Understanding Resilience Results
After running an evaluation, you'll see resilience metrics that show how your agent performed under fault conditions:
```text
Resilience Score: 85/100 (B)

Fault Coverage:
  LLM Faults:  3/3 tested (llm_rate_limit, llm_timeout, llm_quota)
  HTTP Faults: 2/2 tested (http_latency, http_error)
  Tool Faults: 2/2 tested (tool_failure, tool_corruption)
  MCP Faults:  N/A (agent doesn't use MCP)

Recovery Analysis:
  ├─ llm_rate_limit:       Recovered (retry successful)
  ├─ llm_response_timeout: Recovered (fallback response)
  ├─ http_error:           Degraded (partial functionality)
  └─ tool_call_failure:    Failed (no error handling)

Recommendations:
  → Add retry logic for tool_call_failure scenarios
  → Consider fallback when HTTP services are unavailable
```

Custom Fault Definitions
Create custom evaluation configurations with specific fault schedules:
```yaml
# my-resilience-test.yaml
inputs:
  - id: api_integration
    text: "Fetch the current weather for San Francisco"
    goal:
      contains_any: ["weather", "temperature", "forecast"]
  - id: calculation_task
    text: "Calculate the compound interest on $1000 at 5% for 3 years"
    goal:
      contains: "1157"

phases:
  baseline:
    runs: 1
    faults: []
  resilience:
    runs: 1
    fault_schedule:
      api_integration:
        - type: http_latency
          config:
            delay_ms: 3000
        - type: http_error
          config:
            status_code: 503
            probability: 0.5
      calculation_task:
        - type: llm_response_timeout
          config:
            delay_ms: 5000
```

```bash
# Run custom evaluation
khaos run <agent-name> --eval my-resilience-test.yaml
```

Fault Injection Architecture
Understanding how Khaos injects faults helps you debug and extend the system:
| Layer | Mechanism | Fault Types |
|---|---|---|
| LLM Shim | Patches OpenAI/Anthropic SDKs | llm_rate_limit, llm_timeout, llm_quota, model_fallback |
| HTTP Shim | Patches requests/httpx | http_latency, http_error, timeout, malformed_payload |
| MCP Proxy | Intercepts MCP protocol | mcp_tool_latency, mcp_tool_failure, mcp_corruption |
| Tool Interceptor | Wraps tool execution | tool_call_failure, tool_corruption, tool_latency |
Faults are configured via environment variables that the shims read at runtime:
- `KHAOS_LLM_FAULTS` — JSON array of LLM fault configurations
- `KHAOS_HTTP_FAULTS` — JSON array of HTTP fault configurations
- `KHAOS_MCP_FAULTS` — JSON array of MCP fault configurations
- `KHAOS_FAULT_SEED` — Random seed for reproducible injection
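The snippet below sketches how a shim could consume these variables at runtime. It is illustrative only; the actual parsing lives inside Khaos's shims.

```python
# Illustrative only: reading fault configuration the way a shim might.
import json
import os
import random

llm_faults = json.loads(os.environ.get("KHAOS_LLM_FAULTS", "[]"))
rng = random.Random(os.environ.get("KHAOS_FAULT_SEED"))  # same seed -> same decisions

def should_inject(fault: dict) -> bool:
    probability = fault.get("config", {}).get("probability", 1.0)
    return rng.random() < probability

active_faults = [fault for fault in llm_faults if should_inject(fault)]
```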
Best Practices
- Start with built-in packs — Use `quickstart` or `full-eval` before writing custom tests
- Test recovery, not just failure — A good agent recovers gracefully; check for retry logic and fallbacks
- Use deterministic scheduling for CI — Reproducible tests catch regressions reliably
- Compare baseline vs resilience — The gap between scores reveals error handling quality
- Don't over-inject — 10-30% probability is realistic; 100% masks normal behavior