Fault Injection

Khaos uses chaos engineering principles to test how your AI agents behave under real-world failure conditions. By injecting faults at the SDK level, Khaos creates production-realistic failures that reveal how your agent handles errors, timeouts, and edge cases.

How Fault Injection Works

Khaos injects faults by patching the SDKs your agent uses—OpenAI, Anthropic, httpx, requests, and MCP clients. This means faults are indistinguishable from real production failures:

  • Production-realistic exceptions — Exact error codes, headers, and message structures
  • SDK-level interception — No mocking; your actual error handling code is tested
  • Probabilistic or deterministic — Random injection for stress testing, or scheduled for reproducibility
  • Automatic capability detection — Faults only apply to capabilities your agent actually uses

Zero-code instrumentation
Fault injection is automatic. You don't need to modify your agent code: Khaos patches the underlying SDKs at runtime, and the patched SDKs read their fault configuration from environment variables.
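
To make SDK-level interception concrete, here is a minimal sketch of the general technique: wrapping a client method at runtime so that it sometimes raises a realistic exception. This is not Khaos's actual implementation; `FakeLLMClient`, `FakeRateLimitError`, and `patch_with_fault` are hypothetical names used only for illustration.

```python
import functools
import random


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for an SDK's rate-limit exception."""

    def __init__(self, retry_after_s: float):
        super().__init__(f"429 Too Many Requests, retry after {retry_after_s}s")
        self.retry_after_s = retry_after_s


class FakeLLMClient:
    """Hypothetical stand-in for a real SDK client."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


def patch_with_fault(cls, method_name, probability, rng, make_error):
    """Replace cls.method_name with a wrapper that raises with the given probability."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        if rng.random() < probability:
            raise make_error()
        return original(self, *args, **kwargs)

    setattr(cls, method_name, wrapper)
    return original  # returned so the caller can restore it later


rng = random.Random(42)
original = patch_with_fault(
    FakeLLMClient, "complete", 0.5, rng, lambda: FakeRateLimitError(30.0)
)

client = FakeLLMClient()  # agent code is unchanged; it just sees exceptions
outcomes = []
for i in range(10):
    try:
        client.complete(f"q{i}")
        outcomes.append("ok")
    except FakeRateLimitError:
        outcomes.append("fault")

FakeLLMClient.complete = original  # undo the patch
```

Because the wrapper sits on the class itself, the calling code exercises its real error-handling path rather than a mock.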

Fault Categories

Khaos ships with 22 built-in fault types across 6 categories, listed in the tables below. Each fault creates the exact exception your agent would see in production.

LLM Provider Faults

Simulate failures from OpenAI, Anthropic, and other LLM providers. These faults match the exact HTTP status codes, Retry-After headers, and error structures returned by real APIs.

Fault | Description | Real-World Scenario
llm_rate_limit | 429 rate limit with Retry-After header | API quota exceeded during peak usage
llm_model_unavailable | 404 model not found error | Model deprecated or access revoked
llm_response_timeout | APITimeoutError exception | Network issues or overloaded API
llm_token_quota_exceeded | 429 quota exhaustion error | Monthly token limit reached
model_fallback_forced | Silent model substitution | Primary model unavailable, tests fallback behavior

YAML
# LLM fault configuration
faults:
  - type: llm_rate_limit
    config:
      retry_after_ms: 30000    # Retry-After header value
      probability: 0.2          # 20% of LLM calls affected
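
On the agent side, the behavior this fault is meant to exercise is a retry that honors the advertised delay. The sketch below is one way to write it; `RateLimitError` and its `retry_after_ms` attribute are hypothetical stand-ins for whatever exception your provider SDK raises.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider SDK's 429 exception."""

    def __init__(self, retry_after_ms: int):
        super().__init__("429: rate limit exceeded")
        self.retry_after_ms = retry_after_ms


def call_with_retry(call, max_attempts=3, max_wait_ms=60_000, sleep=time.sleep):
    """Call `call()`; on RateLimitError, wait for the advertised delay and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimitError as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            wait_ms = min(err.retry_after_ms, max_wait_ms)  # cap very long waits
            sleep(wait_ms / 1000)


# Usage: a fake call that is rate limited twice, then succeeds.
attempts = {"n": 0}


def flaky_call():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise RateLimitError(retry_after_ms=30_000)
    return "ok"


waits = []
result = call_with_retry(flaky_call, sleep=waits.append)  # record waits instead of sleeping
```

Injecting `sleep` as a parameter keeps the example fast to run and makes the wait durations easy to assert on in tests.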

HTTP/Network Faults

Test how your agent handles network-level failures affecting any HTTP calls—API requests, webhooks, external services.

Fault | Description | Real-World Scenario
http_latency | Add configurable delay to requests | Slow network, congested APIs
http_error | Return HTTP error codes (4xx, 5xx) | Service outages, authorization failures
timeout | Request timeout before response | Network partition, unresponsive server
malformed_payload | Invalid JSON, truncated, or garbage responses | Proxy errors, encoding issues

YAML
# HTTP fault configuration
faults:
  - type: http_latency
    config:
      delay_ms: 2000            # 2 second delay
      jitter_ms: 500            # ±500ms random jitter
      probability: 0.3          # 30% of requests

  - type: http_error
    config:
      status_code: 503          # Service Unavailable
      probability: 0.1          # 10% of requests
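
As a rough illustration of what `delay_ms` combined with `jitter_ms` means, the sketch below draws a delay uniformly from the range delay ± jitter. The uniform distribution and the `jittered_delay_ms` helper are assumptions; Khaos may compute jitter differently.

```python
import random


def jittered_delay_ms(delay_ms, jitter_ms, rng):
    """Return delay_ms shifted by a uniform offset in [-jitter_ms, +jitter_ms], never negative."""
    offset = rng.uniform(-jitter_ms, jitter_ms)
    return max(0.0, delay_ms + offset)


rng = random.Random(7)  # seeded so the samples are repeatable
samples = [jittered_delay_ms(2000, 500, rng) for _ in range(1000)]
print(f"min={min(samples):.0f}ms max={max(samples):.0f}ms")
```

With the configuration above, every injected delay falls between 1500 ms and 2500 ms, so a client timeout below 2.5 seconds should expect some of the affected requests to time out.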

Tool/Function Call Faults

When your agent uses tools (function calling), these faults test error handling at the tool execution layer.

Fault | Description | Real-World Scenario
tool_call_failure | Tool execution fails with error | External API down, permission denied
tool_response_corruption | Tool returns malformed data | Schema changes, encoding errors
tool_latency_spike | Tool response significantly delayed | Slow database, network congestion
tool_output_injection | Inject attack payload into tool output | Compromised tool, indirect prompt injection
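
A defensive tool wrapper is the kind of code these faults are designed to exercise. The following is a minimal sketch, assuming tools return JSON strings; `run_tool_safely` and the three example tools are hypothetical.

```python
import json


def run_tool_safely(tool, args, required_keys):
    """Run a tool and return (ok, payload) so a tool fault never crashes the agent loop."""
    try:
        raw = tool(**args)
    except Exception as err:  # covers tool_call_failure-style faults
        return False, {"error": f"tool failed: {err}"}
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # covers tool_response_corruption-style faults
        return False, {"error": "tool returned malformed JSON"}
    if not isinstance(data, dict) or not set(required_keys).issubset(data):
        return False, {"error": "tool response missing required fields"}
    return True, data


def good_tool(city):
    return json.dumps({"city": city, "temp_c": 18})


def broken_tool(city):
    raise ConnectionError("connection refused")


def corrupt_tool(city):
    return '{"city": "San Fr'  # truncated JSON


args = {"city": "San Francisco"}
results = [
    run_tool_safely(t, args, {"city", "temp_c"})
    for t in (good_tool, broken_tool, corrupt_tool)
]
```

Returning a structured error instead of raising lets the agent decide whether to retry, fall back, or report the failure to the user.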

MCP (Model Context Protocol) Faults

For agents using MCP servers, these faults test the MCP protocol layer specifically.

Fault | Description | Real-World Scenario
mcp_tool_latency | Add delay to MCP tool calls | Slow MCP server, network issues
mcp_tool_failure | MCP tool execution errors | Server-side tool failures
mcp_tool_corruption | Corrupted MCP tool responses | Protocol errors, serialization bugs
mcp_server_unavailable | MCP server connection refused | Server crash, deployment issues

RAG/Retrieval Faults

Test how your agent handles failures in retrieval-augmented generation pipelines.

Fault | Description | Real-World Scenario
rag_retrieval_corruption | Return wrong or irrelevant documents | Index corruption, embedding drift
rag_document_poisoning | Inject malicious content into retrieved docs | Data poisoning, compromised sources

Context/Memory Faults

Simulate context window and memory-related failures that affect long-running conversations.

Fault | Description | Real-World Scenario
context_window_overflow | Context window limit exceeded | Long conversations, large documents
conversation_history_dropped | Conversation history loss | Session timeout, state corruption
embedding_service_failure | Embedding API unavailable | Embedding service outage

Automatic Capability Detection
Khaos automatically detects which capabilities your agent uses (LLM, tools, MCP, RAG) and applies only the relevant faults. If your agent doesn't use MCP, MCP faults are skipped and don't affect your resilience score.
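
The filtering step can be pictured as follows. This sketch covers only the "skip irrelevant faults" part, not how detection itself works, and the prefix-to-capability mapping is an assumption made for illustration.

```python
# Hypothetical mapping from fault-name prefix to the capability it requires.
PREFIX_TO_CAPABILITY = {"llm_": "llm", "tool_": "tools", "mcp_": "mcp", "rag_": "rag"}


def required_capability(fault_type):
    """Return the capability a fault depends on, or None if it applies to every agent."""
    for prefix, capability in PREFIX_TO_CAPABILITY.items():
        if fault_type.startswith(prefix):
            return capability
    return None  # e.g. http_latency: not tied to a capability in this sketch


def applicable_faults(faults, detected):
    """Keep only the faults whose required capability was detected."""
    kept = []
    for fault in faults:
        capability = required_capability(fault["type"])
        if capability is None or capability in detected:
            kept.append(fault)
    return kept


faults = [
    {"type": "llm_rate_limit"},
    {"type": "mcp_tool_failure"},
    {"type": "http_latency"},
]
kept = applicable_faults(faults, detected={"llm", "tools"})
```

Here the agent was not detected as using MCP, so `mcp_tool_failure` is dropped and the other two faults remain.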

Fault Scheduling

Khaos supports two approaches to fault injection, depending on your testing goals:

Probabilistic Injection

Random fault injection for stress testing. Each request has a probability of being affected.

YAML
# Random injection across all inputs
faults:
  - type: llm_rate_limit
    config:
      probability: 0.2    # 20% chance per LLM call

  - type: http_latency
    config:
      delay_ms: 500
      probability: 0.3    # 30% chance per HTTP request
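
Conceptually, probabilistic injection is an independent draw per call. The sketch below shows that idea with a seeded generator; `injection_plan` is a hypothetical helper, not part of Khaos.

```python
import random


def injection_plan(seed, probability, n_calls):
    """Decide, call by call, whether a fault fires; the same seed gives the same plan."""
    rng = random.Random(seed)
    return [rng.random() < probability for _ in range(n_calls)]


plan_a = injection_plan(seed=1234, probability=0.2, n_calls=50)
plan_b = injection_plan(seed=1234, probability=0.2, n_calls=50)

print(f"{sum(plan_a)} of {len(plan_a)} calls would be faulted")
print("identical across runs:", plan_a == plan_b)
```

The fraction of affected calls only approaches the configured probability over many calls, so short runs can see noticeably more or fewer faults than the nominal rate.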

Deterministic Scheduling

Map specific faults to specific test inputs for reproducible testing. This is how the built-in evaluation packs work—ensuring consistent, comparable results across runs.

YAML
# Built-in packs use fault schedules for reproducibility
resilience:
  fault_schedule:
    # This specific input gets LLM timeout
    math_simple:
      - type: llm_response_timeout
        config:
          delay_ms: 100

    # This input gets rate limiting
    instruction_format_json:
      - type: llm_rate_limit
        config:
          retry_after_ms: 1000

    # This input gets HTTP latency
    knowledge_capital:
      - type: http_latency
        config:
          delay_ms: 500

Reproducibility
Using the same --seed value with deterministic fault schedules produces identical injection sequences. This enables meaningful before/after comparisons when testing agent changes.
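
A fault schedule is essentially a lookup from input ID to a list of faults. The sketch below mirrors the YAML above as a Python dictionary; `faults_for` is a hypothetical helper, and treating unscheduled inputs as fault-free is an assumption.

```python
fault_schedule = {
    "math_simple": [
        {"type": "llm_response_timeout", "config": {"delay_ms": 100}},
    ],
    "instruction_format_json": [
        {"type": "llm_rate_limit", "config": {"retry_after_ms": 1000}},
    ],
    "knowledge_capital": [
        {"type": "http_latency", "config": {"delay_ms": 500}},
    ],
}


def faults_for(input_id, schedule):
    """Return the faults scheduled for one input; unscheduled inputs get none."""
    return schedule.get(input_id, [])


for input_id in ("math_simple", "knowledge_capital", "not_scheduled"):
    types = [fault["type"] for fault in faults_for(input_id, fault_schedule)]
    print(input_id, "->", types or "no faults")
```

Because the mapping is fixed, each input meets the same fault on every run, which is what makes results comparable across agent versions.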

Fault Configuration Options

Each fault type accepts specific configuration options:

YAML
faults:
  # LLM rate limiting with retry header
  - type: llm_rate_limit
    config:
      retry_after_ms: 30000     # Retry-After header value (ms)
      probability: 0.2          # Injection probability

  # HTTP latency with jitter
  - type: http_latency
    config:
      delay_ms: 1000            # Base delay
      jitter_ms: 200            # Random ±200ms
      probability: 0.5

  # HTTP errors with specific status code
  - type: http_error
    config:
      status_code: 503          # HTTP status code
      message: "Service temporarily unavailable"
      probability: 0.1

  # Tool execution failure
  - type: tool_call_failure
    config:
      error_type: "execution_error"
      error_message: "Tool execution failed: connection refused"

  # MCP tool targeting specific tools
  - type: mcp_tool_latency
    config:
      tool_name: "search"       # Target specific tool (or "*" for all)
      latency_ms: 3000

  # Malformed payload types
  - type: malformed_payload
    config:
      corruption_type: "invalid_json"   # invalid_json, truncated, binary_garbage
      probability: 0.15
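
Before running an evaluation, it can help to sanity-check a fault list yourself. The sketch below is a hypothetical pre-flight check, not Khaos's own validation; its rules are inferred from the options shown above (probabilities between 0 and 1, 4xx/5xx status codes, and the three listed corruption types).

```python
ALLOWED_CORRUPTION_TYPES = {"invalid_json", "truncated", "binary_garbage"}


def validate_fault(fault):
    """Return a list of problems in one fault entry; an empty list means it looks sane."""
    problems = []
    config = fault.get("config", {})
    if "type" not in fault:
        problems.append("missing 'type'")
    probability = config.get("probability")
    if probability is not None and not 0.0 <= probability <= 1.0:
        problems.append(f"probability {probability} is outside [0, 1]")
    if fault.get("type") == "http_error":
        status = config.get("status_code")
        if status is not None and not 400 <= status <= 599:
            problems.append(f"status_code {status} is not a 4xx/5xx code")
    if fault.get("type") == "malformed_payload":
        corruption = config.get("corruption_type")
        if corruption is not None and corruption not in ALLOWED_CORRUPTION_TYPES:
            problems.append(f"unknown corruption_type {corruption!r}")
    return problems


good = {"type": "http_error", "config": {"status_code": 503, "probability": 0.1}}
bad = {"type": "http_error", "config": {"status_code": 200, "probability": 1.5}}
print(validate_fault(good))
print(validate_fault(bad))
```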

Running Fault Injection Tests

Fault injection is built into the standard evaluation workflow:

Terminal
# Standard evaluation includes resilience testing
khaos run <agent-name>

# Comprehensive evaluation with extensive fault coverage
khaos run <agent-name> --eval full-eval

# Quick evaluation with focused fault testing
khaos run <agent-name> --eval quickstart

# Baseline only (no faults) for comparison
khaos run <agent-name> --eval baseline

Understanding Resilience Results

After running an evaluation, you'll see resilience metrics that show how your agent performed under fault conditions:

TEXT
Resilience Score: 85/100 (B)

Fault Coverage:
  LLM Faults:     3/3 tested (llm_rate_limit, llm_response_timeout, llm_token_quota_exceeded)
  HTTP Faults:    2/2 tested (http_latency, http_error)
  Tool Faults:    2/2 tested (tool_call_failure, tool_response_corruption)
  MCP Faults:     N/A (agent doesn't use MCP)

Recovery Analysis:
  ├─ llm_rate_limit:      Recovered (retry successful)
  ├─ llm_response_timeout: Recovered (fallback response)
  ├─ http_error:          Degraded (partial functionality)
  └─ tool_call_failure:   Failed (no error handling)

Recommendations:
  → Add retry logic for tool_call_failure scenarios
  → Consider fallback when HTTP services are unavailable
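
If you want to gate a CI job on the score, one option is to parse it out of the text report. This sketch assumes the report contains a line shaped exactly like the "Resilience Score" line in the sample above; `parse_resilience_score` is a hypothetical helper.

```python
import re

# Matches lines such as "Resilience Score: 85/100 (B)".
SCORE_RE = re.compile(r"Resilience Score:\s*(\d+)/(\d+)\s*\(([A-F][+-]?)\)")


def parse_resilience_score(report_text):
    """Return (score, maximum, grade) from a report, or None if no score line is found."""
    match = SCORE_RE.search(report_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


sample = "Resilience Score: 85/100 (B)\n\nFault Coverage:\n"
parsed = parse_resilience_score(sample)

MIN_SCORE = 80  # hypothetical threshold for this example
passed = parsed is not None and parsed[0] >= MIN_SCORE
```

Treating a missing score line as a failure, rather than a pass, avoids silently accepting a run that produced no report.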

Custom Fault Definitions

Create custom evaluation configurations with specific fault schedules:

YAML
# my-resilience-test.yaml
inputs:
  - id: api_integration
    text: "Fetch the current weather for San Francisco"
    goal:
      contains_any: ["weather", "temperature", "forecast"]

  - id: calculation_task
    text: "Calculate the compound interest on $1000 at 5% for 3 years"
    goal:
      contains: "1157"

phases:
  baseline:
    runs: 1
    faults: []

  resilience:
    runs: 1
    fault_schedule:
      api_integration:
        - type: http_latency
          config:
            delay_ms: 3000
        - type: http_error
          config:
            status_code: 503
            probability: 0.5

      calculation_task:
        - type: llm_response_timeout
          config:
            delay_ms: 5000

Terminal
# Run custom evaluation
khaos run <agent-name> --eval my-resilience-test.yaml
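
The goal for calculation_task expects "1157" because $1000 compounded annually at 5% for 3 years grows to 1000 × 1.05³ = 1157.625. The sketch below checks that arithmetic and shows one plausible reading of the `contains` and `contains_any` goals; the case-insensitive substring matching in `goal_met` is an assumption, not documented Khaos behavior.

```python
def goal_met(output, goal):
    """Check an agent output against a goal using case-insensitive substring matching."""
    text = output.lower()
    if "contains" in goal and goal["contains"].lower() not in text:
        return False
    if "contains_any" in goal and not any(
        candidate.lower() in text for candidate in goal["contains_any"]
    ):
        return False
    return True


# Final balance with annual compounding: principal * (1 + rate) ** years
amount = 1000 * (1 + 0.05) ** 3
answer = f"The balance after 3 years is ${amount:.2f}"

calc_ok = goal_met(answer, {"contains": "1157"})
weather_ok = goal_met(
    "Current temperature in San Francisco is 18°C",
    {"contains_any": ["weather", "temperature", "forecast"]},
)
```

Note that an agent reporting only the interest earned (about $157.63) would not satisfy this goal, so the prompt and the goal should agree on which figure is expected.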

Fault Injection Architecture

Understanding how Khaos injects faults helps you debug and extend the system:

Layer | Mechanism | Fault Types
LLM Shim | Patches OpenAI/Anthropic SDKs | llm_rate_limit, llm_response_timeout, llm_token_quota_exceeded, model_fallback_forced
HTTP Shim | Patches requests/httpx | http_latency, http_error, timeout, malformed_payload
MCP Proxy | Intercepts MCP protocol | mcp_tool_latency, mcp_tool_failure, mcp_tool_corruption
Tool Interceptor | Wraps tool execution | tool_call_failure, tool_response_corruption, tool_latency_spike

Faults are configured via environment variables that the shims read at runtime:

  • KHAOS_LLM_FAULTS — JSON array of LLM fault configurations
  • KHAOS_HTTP_FAULTS — JSON array of HTTP fault configurations
  • KHAOS_MCP_FAULTS — JSON array of MCP fault configurations
  • KHAOS_FAULT_SEED — Random seed for reproducible injection
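
For debugging, it can be useful to see what such an environment might look like. The sketch below serializes fault lists into the variables named above; the shape of each JSON element (mirroring the YAML `type`/`config` entries) is an assumption, since only "JSON array" is specified here, and `build_fault_env` is a hypothetical helper.

```python
import json
import os

llm_faults = [
    {"type": "llm_rate_limit", "config": {"retry_after_ms": 30000, "probability": 0.2}},
]
http_faults = [
    {"type": "http_latency", "config": {"delay_ms": 2000, "jitter_ms": 500, "probability": 0.3}},
]


def build_fault_env(llm, http, seed, base_env=None):
    """Return a copy of the environment with fault settings added as JSON strings."""
    env = dict(os.environ if base_env is None else base_env)
    env["KHAOS_LLM_FAULTS"] = json.dumps(llm)
    env["KHAOS_HTTP_FAULTS"] = json.dumps(http)
    env["KHAOS_FAULT_SEED"] = str(seed)
    return env


env = build_fault_env(llm_faults, http_faults, seed=42, base_env={})
for name in sorted(env):
    print(f"{name}={env[name]}")
```

In normal use Khaos sets these variables for you; building them by hand is mainly useful for inspecting or reproducing a specific run.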

Best Practices

  • Start with built-in packs — Use quickstart or full-eval before writing custom tests
  • Test recovery, not just failure — A good agent recovers gracefully; check for retry logic and fallbacks
  • Use deterministic scheduling for CI — Reproducible tests catch regressions reliably
  • Compare baseline vs resilience — The gap between scores reveals error handling quality
  • Don't over-inject — 10-30% probability is realistic; 100% masks normal behavior
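
To see why modest probabilities already give good coverage, consider the chance that at least one call in a run is faulted. This is a back-of-the-envelope sketch that assumes each call is affected independently; `p_at_least_one_fault` is a hypothetical helper.

```python
def p_at_least_one_fault(probability, n_calls):
    """Chance that at least one of n independent calls is hit by a fault."""
    return 1 - (1 - probability) ** n_calls


for p in (0.1, 0.3, 1.0):
    print(f"probability={p:.1f}, 10 calls: {p_at_least_one_fault(p, 10):.2f}")
```

For a task that makes 10 calls, a 10% per-call probability already produces at least one fault in roughly 65% of runs, and 30% does so in roughly 97% of runs, while still leaving most individual calls healthy.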