Fault Injection
Khaos uses chaos engineering principles to test how your AI agents behave under real-world failure conditions. By injecting faults at the SDK level, Khaos creates production-realistic failures that reveal how your agent handles errors, timeouts, and edge cases.
How Fault Injection Works
Khaos injects faults by patching the SDKs your agent uses—OpenAI, Anthropic, httpx, requests, and MCP clients. This means faults are indistinguishable from real production failures:
- Production-realistic exceptions — Exact error codes, headers, and message structures
- SDK-level interception — No mocking; your actual error handling code is tested
- Probabilistic or deterministic — Random injection for stress testing, or scheduled for reproducibility
- Automatic capability detection — Faults only apply to capabilities your agent actually uses
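The sketch below illustrates the idea behind SDK-level interception: a patched SDK method raises the same exception class, status code, and headers the real API would return, so whatever error handling your agent already has is what gets exercised. This is a simplified illustration, not Khaos's actual shim code, and the constructor details assume a recent (>=1.x) version of the `openai` Python SDK.

```python
# Simplified illustration of SDK-level fault injection (not the actual Khaos shim).
# Assumes openai>=1.x; exception constructor details vary across SDK versions.
import random

import httpx
import openai
from openai.resources.chat.completions import Completions

_original_create = Completions.create

def _patched_create(self, *args, **kwargs):
    if random.random() < 0.2:  # probability taken from the fault config
        request = httpx.Request("POST", "https://api.openai.com/v1/chat/completions")
        response = httpx.Response(429, headers={"retry-after": "30"}, request=request)
        # Same exception type, status code, and header the real API would return
        raise openai.RateLimitError("Rate limit exceeded", response=response, body=None)
    return _original_create(self, *args, **kwargs)

Completions.create = _patched_create  # agent code calling the SDK is otherwise unaffected
```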
Fault Categories
Khaos ships with 21 built-in fault types across 6 categories. Each fault creates the exact exception your agent would see in production.
LLM Provider Faults
Simulate failures from OpenAI, Anthropic, and other LLM providers. These faults match the exact HTTP status codes, Retry-After headers, and error structures returned by real APIs.
| Fault | Description | Real-World Scenario |
|---|---|---|
| `llm_rate_limit` | 429 rate limit with Retry-After header | API quota exceeded during peak usage |
| `llm_model_unavailable` | 404 model not found error | Model deprecated or access revoked |
| `llm_response_timeout` | `APITimeoutError` exception | Network issues or overloaded API |
| `llm_token_quota_exceeded` | 429 quota exhaustion error | Monthly token limit reached |
| `model_fallback_forced` | Silent model substitution | Primary model unavailable, tests fallback behavior |
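Because `llm_rate_limit` carries a real Retry-After header, an agent can honor it exactly as it would in production. A minimal sketch using the `openai` Python SDK (the model name and retry policy are placeholders):

```python
# Minimal retry loop that honors the injected Retry-After header (openai>=1.x).
import time

import openai

client = openai.OpenAI()

def ask(prompt: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except openai.RateLimitError as err:
            # Injected 429s expose the same Retry-After header as real ones
            wait_s = float(err.response.headers.get("retry-after", 2 ** attempt))
            time.sleep(wait_s)
        except openai.APITimeoutError:
            continue  # llm_response_timeout: retry, or fall back to a cached answer
    raise RuntimeError("LLM unavailable after retries")
```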
```yaml
# LLM fault configuration
faults:
  - type: llm_rate_limit
    config:
      retry_after_ms: 30000   # Retry-After header value
      probability: 0.2        # 20% of LLM calls affected
```

HTTP/Network Faults
Test how your agent handles network-level failures affecting any HTTP calls—API requests, webhooks, external services.
| Fault | Description | Real-World Scenario |
|---|---|---|
| `http_latency` | Add configurable delay to requests | Slow network, congested APIs |
| `http_error` | Return HTTP error codes (4xx, 5xx) | Service outages, authorization failures |
| `timeout` | Request timeout before response | Network partition, unresponsive server |
| `malformed_payload` | Invalid JSON, truncated, or garbage responses | Proxy errors, encoding issues |
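For HTTP-level faults, the key is bounding each call and degrading gracefully when a service stays unhealthy. A sketch using `httpx` (the URL, timeout, and backoff policy are illustrative):

```python
# Defensive HTTP call: bounds injected latency, retries on errors, and
# tolerates malformed_payload responses instead of crashing the agent.
import time

import httpx

def fetch_json(url: str, attempts: int = 3) -> dict:
    for attempt in range(attempts):
        try:
            resp = httpx.get(url, timeout=5.0)  # caps http_latency / timeout faults
            resp.raise_for_status()             # surfaces injected 4xx/5xx as exceptions
            return resp.json()                  # raises ValueError on malformed_payload
        except (httpx.HTTPError, ValueError):
            time.sleep(2 ** attempt)            # simple exponential backoff
    return {}  # degrade gracefully rather than failing the whole task
```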
```yaml
# HTTP fault configuration
faults:
  - type: http_latency
    config:
      delay_ms: 2000     # 2 second delay
      jitter_ms: 500     # ±500ms random jitter
      probability: 0.3   # 30% of requests
  - type: http_error
    config:
      status_code: 503   # Service Unavailable
      probability: 0.1   # 10% of requests
```

Tool/Function Call Faults
When your agent uses tools (function calling), these faults test error handling at the tool execution layer.
| Fault | Description | Real-World Scenario |
|---|---|---|
| `tool_call_failure` | Tool execution fails with error | External API down, permission denied |
| `tool_response_corruption` | Tool returns malformed data | Schema changes, encoding errors |
| `tool_latency_spike` | Tool response significantly delayed | Slow database, network congestion |
| `tool_output_injection` | Inject attack payload into tool output | Compromised tool, indirect prompt injection |
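These faults reward a thin defensive wrapper around tool execution: retry transient failures, validate the output shape before it reaches the model, and flag abnormally slow calls. The wrapper below is a generic sketch rather than a Khaos API; `tool_fn` stands in for whatever function your framework dispatches to, and note that shape checks alone do not protect against `tool_output_injection`, whose content should still be treated as untrusted.

```python
# Generic tool-execution guard (illustrative sketch, not a Khaos API).
import time

def call_tool_safely(tool_fn, args: dict, retries: int = 2, slow_after_s: float = 10.0) -> dict:
    for attempt in range(retries + 1):
        started = time.monotonic()
        try:
            result = tool_fn(**args)
        except Exception as err:                       # tool_call_failure
            if attempt == retries:
                return {"error": f"tool failed: {err}"}
            time.sleep(1.0)
            continue
        if time.monotonic() - started > slow_after_s:  # tool_latency_spike
            return {"error": "tool responded too slowly; using fallback"}
        if not isinstance(result, dict):               # tool_response_corruption
            return {"error": "unexpected tool output format"}
        return result
```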
MCP (Model Context Protocol) Faults
For agents using MCP servers, these faults test the MCP protocol layer specifically.
| Fault | Description | Real-World Scenario |
|---|---|---|
| `mcp_tool_latency` | Add delay to MCP tool calls | Slow MCP server, network issues |
| `mcp_tool_failure` | MCP tool execution errors | Server-side tool failures |
| `mcp_tool_corruption` | Corrupted MCP tool responses | Protocol errors, serialization bugs |
| `mcp_server_unavailable` | MCP server connection refused | Server crash, deployment issues |
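Handling these mostly mirrors the tool-level patterns above: treat the MCP server as an unreliable dependency. The sketch below shows only the shape of that handling; `mcp_session.call_tool(...)` is a placeholder for however your agent talks to its MCP server, not a specific SDK API.

```python
# Shape-only sketch: tolerate MCP server and tool failures.
# `mcp_session` is a placeholder, not a particular MCP client library.
async def call_mcp_tool(mcp_session, name: str, arguments: dict) -> dict:
    try:
        return await mcp_session.call_tool(name, arguments)  # mcp_tool_latency / failure
    except ConnectionError:
        # mcp_server_unavailable: fall back to a degraded, tool-free answer
        return {"error": f"MCP server unreachable; skipping tool '{name}'"}
    except Exception as err:
        # mcp_tool_failure / mcp_tool_corruption: report rather than crash
        return {"error": f"MCP tool '{name}' failed: {err}"}
```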
RAG/Retrieval Faults
Test how your agent handles failures in retrieval-augmented generation pipelines.
| Fault | Description | Real-World Scenario |
|---|---|---|
| `rag_retrieval_corruption` | Return wrong or irrelevant documents | Index corruption, embedding drift |
| `rag_document_poisoning` | Inject malicious content into retrieved docs | Data poisoning, compromised sources |
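A common mitigation is to validate retrieved documents before they reach the prompt: restrict sources and drop chunks that look like instructions rather than content. The allowlist and marker strings below are illustrative placeholders, not a complete defense against poisoning.

```python
# Illustrative pre-prompt filter for retrieved documents.
SUSPICIOUS_MARKERS = ("ignore previous instructions", "disregard the system prompt")
TRUSTED_SOURCES = {"kb.example.com", "docs.internal"}  # placeholder allowlist

def filter_documents(docs: list[dict]) -> list[dict]:
    safe = []
    for doc in docs:
        text = str(doc.get("text", "")).lower()
        if doc.get("source") not in TRUSTED_SOURCES:
            continue  # rag_retrieval_corruption: drop docs from unexpected sources
        if any(marker in text for marker in SUSPICIOUS_MARKERS):
            continue  # rag_document_poisoning: drop instruction-like content
        safe.append(doc)
    return safe
```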
Context/Memory Faults
Simulate context window and memory-related failures that affect long-running conversations.
| Fault | Description | Real-World Scenario |
|---|---|---|
| `context_window_overflow` | Context window limit exceeded | Long conversations, large documents |
| `conversation_history_dropped` | Conversation history loss | Session timeout, state corruption |
| `embedding_service_failure` | Embedding API unavailable | Embedding service outage |
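A typical recovery path for `context_window_overflow` is to catch the provider's context-length error, trim the oldest turns while keeping the system prompt, and retry once. The sketch below assumes the `openai` Python SDK, where an oversized request surfaces as a 400-class `BadRequestError`; the model name and trimming policy are placeholders.

```python
# Sketch: recover from context_window_overflow by trimming history and retrying once.
import openai

client = openai.OpenAI()

def chat_with_trimming(messages: list[dict]) -> str:
    try:
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    except openai.BadRequestError:
        # Assume a context-length error: keep the system prompt plus the newest turns
        trimmed = messages[:1] + messages[-6:]
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=trimmed)
    return resp.choices[0].message.content
```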
Fault Scheduling
Khaos supports two approaches to fault injection, depending on your testing goals:
Probabilistic Injection
Random fault injection for stress testing. Each request has a probability of being affected.
```yaml
# Random injection across all inputs
faults:
  - type: llm_rate_limit
    config:
      probability: 0.2   # 20% chance per LLM call
  - type: http_latency
    config:
      delay_ms: 500
      probability: 0.3   # 30% chance per HTTP request
```

Deterministic Scheduling
Map specific faults to specific test inputs for reproducible testing. This is how the built-in evaluation packs work—ensuring consistent, comparable results across runs.
```yaml
# Built-in packs use fault schedules for reproducibility
resilience:
  fault_schedule:
    # This specific input gets LLM timeout
    math_simple:
      - type: llm_response_timeout
        config:
          delay_ms: 100
    # This input gets rate limiting
    instruction_format_json:
      - type: llm_rate_limit
        config:
          retry_after_ms: 1000
    # This input gets HTTP latency
    knowledge_capital:
      - type: http_latency
        config:
          delay_ms: 500
```

Using the same `--seed` value with deterministic fault schedules produces identical injection sequences. This enables meaningful before/after comparisons when testing agent changes.

Fault Configuration Options
Each fault type accepts specific configuration options:
```yaml
faults:
  # LLM rate limiting with retry header
  - type: llm_rate_limit
    config:
      retry_after_ms: 30000   # Retry-After header value (ms)
      probability: 0.2        # Injection probability

  # HTTP latency with jitter
  - type: http_latency
    config:
      delay_ms: 1000     # Base delay
      jitter_ms: 200     # Random ±200ms
      probability: 0.5

  # HTTP errors with specific status code
  - type: http_error
    config:
      status_code: 503   # HTTP status code
      message: "Service temporarily unavailable"
      probability: 0.1

  # Tool execution failure
  - type: tool_call_failure
    config:
      error_type: "execution_error"
      error_message: "Tool execution failed: connection refused"

  # MCP tool targeting specific tools
  - type: mcp_tool_latency
    config:
      tool_name: "search"   # Target specific tool (or "*" for all)
      latency_ms: 3000

  # Malformed payload types
  - type: malformed_payload
    config:
      corruption_type: "invalid_json"   # invalid_json, truncated, binary_garbage
      probability: 0.15
```

Running Fault Injection Tests
Fault injection is built into the standard evaluation workflow:
```bash
# Standard evaluation includes resilience testing
khaos run <agent-name>

# Comprehensive evaluation with extensive fault coverage
khaos run <agent-name> --eval full-eval

# Quick evaluation with focused fault testing
khaos run <agent-name> --eval quickstart

# Baseline only (no faults) for comparison
khaos run <agent-name> --eval baseline
```

Understanding Resilience Results
After running an evaluation, you'll see resilience metrics that show how your agent performed under fault conditions:
```text
Resilience Score: 85/100 (B)

Fault Coverage:
  LLM Faults:  3/3 tested (llm_rate_limit, llm_timeout, llm_quota)
  HTTP Faults: 2/2 tested (http_latency, http_error)
  Tool Faults: 2/2 tested (tool_failure, tool_corruption)
  MCP Faults:  N/A (agent doesn't use MCP)

Recovery Analysis:
  ├─ llm_rate_limit:       Recovered (retry successful)
  ├─ llm_response_timeout: Recovered (fallback response)
  ├─ http_error:           Degraded (partial functionality)
  └─ tool_call_failure:    Failed (no error handling)

Recommendations:
  → Add retry logic for tool_call_failure scenarios
  → Consider fallback when HTTP services are unavailable
```

Custom Fault Definitions
Create custom evaluation configurations with specific fault schedules:
```yaml
# my-resilience-test.yaml
inputs:
  - id: api_integration
    text: "Fetch the current weather for San Francisco"
    goal:
      contains_any: ["weather", "temperature", "forecast"]
  - id: calculation_task
    text: "Calculate the compound interest on $1000 at 5% for 3 years"
    goal:
      contains: "1157"

phases:
  baseline:
    runs: 1
    faults: []
  resilience:
    runs: 1
    fault_schedule:
      api_integration:
        - type: http_latency
          config:
            delay_ms: 3000
        - type: http_error
          config:
            status_code: 503
            probability: 0.5
      calculation_task:
        - type: llm_response_timeout
          config:
            delay_ms: 5000
```

```bash
# Run custom evaluation
khaos run <agent-name> --eval my-resilience-test.yaml
```

Fault Injection Architecture
Understanding how Khaos injects faults helps you debug and extend the system:
| Layer | Mechanism | Fault Types |
|---|---|---|
| LLM Shim | Patches OpenAI/Anthropic SDKs | llm_rate_limit, llm_timeout, llm_quota, model_fallback |
| HTTP Shim | Patches requests/httpx | http_latency, http_error, timeout, malformed_payload |
| MCP Proxy | Intercepts MCP protocol | mcp_tool_latency, mcp_tool_failure, mcp_corruption |
| Tool Interceptor | Wraps tool execution | tool_call_failure, tool_corruption, tool_latency |
Faults are configured via environment variables that the shims read at runtime:
- `KHAOS_LLM_FAULTS` — JSON array of LLM fault configurations
- `KHAOS_HTTP_FAULTS` — JSON array of HTTP fault configurations
- `KHAOS_MCP_FAULTS` — JSON array of MCP fault configurations
- `KHAOS_FAULT_SEED` — Random seed for reproducible injection
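The snippet below sketches how a shim could consume these variables at runtime. It is illustrative only; the actual parsing lives inside Khaos's shims.

```python
# Illustrative only: reading fault configuration the way a shim might.
import json
import os
import random

llm_faults = json.loads(os.environ.get("KHAOS_LLM_FAULTS", "[]"))
rng = random.Random(os.environ.get("KHAOS_FAULT_SEED"))  # same seed -> same decisions

def should_inject(fault: dict) -> bool:
    probability = fault.get("config", {}).get("probability", 1.0)
    return rng.random() < probability

active_faults = [fault for fault in llm_faults if should_inject(fault)]
```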
Best Practices
- Start with built-in packs — Use `quickstart` or `full-eval` before writing custom tests
- Test recovery, not just failure — A good agent recovers gracefully; check for retry logic and fallbacks
- Use deterministic scheduling for CI — Reproducible tests catch regressions reliably
- Compare baseline vs resilience — The gap between scores reveals error handling quality
- Don't over-inject — 10-30% probability is realistic; 100% masks normal behavior