Architecture Overview
Khaos is a modular evaluation engine for AI agents. It discovers agents, loads evaluation packs, injects faults and security attacks, scores outcomes, and produces structured reports. This page describes the high-level architecture, module map, data flow, and extension points.
High-Level Architecture
┌─────────────────────────────────────────────────────────────────────┐
│                              khaos.cli                              │
│           (discover · run · test · compare · ci · cloud)            │
└─────────────┬───────────────────────────────────────┬───────────────┘
              │                                       │
              ▼                                       ▼
┌─────────────────────────┐             ┌─────────────────────────────┐
│      khaos.engine       │             │         khaos.cloud         │
│  ┌───────────────────┐  │             │     (sync · auth · API)     │
│  │ SeededScheduler   │  │             └─────────────────────────────┘
│  │ FaultRegistry     │  │
│  │ AttackRegistry    │  │
│  └───────────────────┘  │
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                             khaos.packs                             │
│  PackRunner ─▶ Phase (BASELINE · RESILIENCE · SECURITY)             │
│  GoalCriteria · GoalEvaluator                                       │
└──────────┬──────────────────────────┬───────────────────────────────┘
           │                          │
           ▼                          ▼
┌──────────────────────┐  ┌──────────────────────────────────────────┐
│  khaos.transport     │  │  khaos.chaos                             │
│  AgentTransport      │  │  FaultHandler · ScenarioEngine           │
│  InProcessTransport  │  │  (http_latency · timeout · llm_*         │
│  TransportMessage    │  │   tool_* · mcp_* · rag_* · context_*)    │
└──────────────────────┘  └──────────────────────────────────────────┘
           │                          │
           ▼                          ▼
┌──────────────────────┐  ┌──────────────────────────────────────────┐
│  khaos.evaluator     │  │  khaos.security                          │
│  Scorer · Grader     │  │  AttackBundle · classifier · Outcome     │
│  MetricAggregator    │  │  (prompt_injection · jailbreak ·         │
└──────────────────────┘  │   exfiltration · indirect_injection ·    │
           │              │   tool_abuse · env_poisoning · ...)      │
           ▼              └──────────────────────────────────────────┘
┌──────────────────────┐
│  khaos.artifacts     │  ┌──────────────────────────────────────────┐
│  RunManifest · JSON  │  │  Supporting Modules                      │
│  Evidence export     │  │  khaos.pii          (PII detection)      │
└──────────────────────┘  │  khaos.capabilities (agent profiling)    │
                          │  khaos.sandbox      (isolation)          │
                          │  khaos.llm          (LLM telemetry)      │
                          └──────────────────────────────────────────┘
Module Map
| Module | Responsibility | Key Types |
|---|---|---|
| khaos.testing | Python-native test framework (@khaostest decorator, assertions, classification) | khaostest, AgentTestClient, AgentResponse, Outcome |
| khaos.packs | Evaluation pack loading, phase iteration, goal criteria | Pack, Phase, PhaseType, GoalCriteria, PackRunner |
| khaos.chaos | Fault injection scenarios and fault handlers | FaultHandler, Fault, ScenarioEngine |
| khaos.evaluator | Scoring, grading, and metric aggregation | Scorer, Grader, MetricAggregator, GoalEvaluator |
| khaos.engine | Orchestration: seeded scheduler, fault registry, attack registry | SeededScheduler, FaultRegistry, AttackRegistry |
| khaos.transport | Agent invocation protocol and message passing | AgentTransport, InProcessTransport, TransportMessage |
| khaos.security | Attack bundles, classification, tri-state outcomes | AttackBundle, AttackMetadata, Outcome, ClassificationResult |
| khaos.pii | PII detection and redaction in agent responses | PIIDetector, redact_text, redact_response |
| khaos.capabilities | Agent capability profiling for targeted test selection | CapabilityProfile, infer_capabilities |
| khaos.sandbox | Isolation for agent tool calls and code execution | Sandbox, SandboxConfig |
| khaos.artifacts | Run manifests, evidence export, result serialization | RunManifest, EvidenceExporter |
| khaos.cli | Command-line interface (khaos run, khaos test, khaos ci, etc.) | Click commands |
| khaos.cloud | Cloud sync, authentication, project management | CloudClient, authenticate |
| khaos.llm | LLM call interception, token counting, telemetry | LLMEvent, LLMInterceptor |
Data Flow
A Khaos evaluation follows a linear pipeline from agent discovery through to the final report. Each stage feeds into the next, with the engine orchestrating the overall flow.
Discovery ──▶ Pack Selection ──▶ Phase Execution ──▶ Fault Injection
                                                            │
                                                            ▼
Reporting ◀── Scoring ◀── Security Testing ◀── Agent Invocation
1. Discovery
khaos discover scans the project for @khaosagent-decorated functions and registers them in .khaos/agents.json. Each agent gets a name, version, framework hint, and entry point reference.
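As a rough illustration of decorator-based discovery, the sketch below registers decorated functions in an in-memory registry and serializes the result to JSON. The `khaosagent_sketch` decorator, the `REGISTRY` dict, and the record fields are assumptions made for this example; the real `@khaosagent` decorator and the `.khaos/agents.json` schema may differ.

```python
import json

# Hypothetical in-memory registry; not the actual Khaos implementation.
REGISTRY: dict[str, dict] = {}


def khaosagent_sketch(name: str, version: str = "0.1.0", framework: str = "custom"):
    """Illustrative stand-in for a discovery decorator (all names are assumptions)."""
    def decorator(func):
        REGISTRY[name] = {
            "name": name,
            "version": version,
            "framework": framework,
            # Entry point reference in "module:function" form.
            "entry_point": f"{func.__module__}:{func.__qualname__}",
        }
        return func  # the function itself is returned unchanged
    return decorator


@khaosagent_sketch(name="echo-agent", version="1.0.0")
def echo(text: str) -> str:
    return text


# Serialize the registry the way a discovery step might persist it.
agents_json = json.dumps(list(REGISTRY.values()), indent=2)
```

Returning the function unchanged keeps the agent callable as usual; discovery only records metadata about it.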
2. Pack Selection
The engine resolves the evaluation pack (e.g. quickstart, full-eval, or a custom YAML pack). The PackRunner loads the pack definition, resolves inputs, and prepares the phase sequence.
3. Phase Execution
Packs define one or more phases, each with a distinct purpose:
| Phase | Purpose | What Happens |
|---|---|---|
| BASELINE | Establish normal agent behaviour | Run agent with clean inputs, no faults, no attacks. Measure latency, token usage, task completion. |
| RESILIENCE | Test fault tolerance | Re-run the same inputs with faults injected (network errors, timeouts, corrupted tool responses). |
| SECURITY | Test attack resistance | Run adversarial inputs and inject attack payloads. Classify outcomes as BLOCKED/COMPROMISED/UNCERTAIN. |
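The phase table can be summarized in a few lines of Python. This is a minimal sketch: `PhaseTypeSketch`, `OutcomeSketch`, and `injections_for` are hypothetical names that mirror the table, not the definitions in `khaos.packs` or `khaos.security`.

```python
from enum import Enum


class PhaseTypeSketch(Enum):
    """Hypothetical mirror of the three phase types."""
    BASELINE = "baseline"
    RESILIENCE = "resilience"
    SECURITY = "security"


class OutcomeSketch(Enum):
    """Hypothetical mirror of the tri-state security outcome."""
    BLOCKED = "blocked"
    COMPROMISED = "compromised"
    UNCERTAIN = "uncertain"


def injections_for(phase: PhaseTypeSketch) -> dict[str, bool]:
    """Return which kinds of injection apply in a phase, following the table."""
    return {
        "faults": phase is PhaseTypeSketch.RESILIENCE,
        "attacks": phase is PhaseTypeSketch.SECURITY,
    }
```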
4. Fault Injection
During the RESILIENCE phase, the FaultRegistry maps fault names to FaultHandler implementations. The SeededScheduler determines fault timing and ordering using the configured seed for reproducibility.
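To illustrate why a seed makes fault schedules reproducible, here is a minimal sketch built on Python's `random.Random`. `SeededSchedulerSketch`, its `max_faults` parameter, and the `schedule` helper are assumptions for this example; the real SeededScheduler may choose faults quite differently.

```python
import random


class SeededSchedulerSketch:
    """Hypothetical scheduler: the same seed yields the same fault sequence."""

    def __init__(self, seed: int, fault_names: list[str]):
        # A private Random instance keeps the schedule independent of
        # any other code that uses the global random module.
        self._rng = random.Random(seed)
        self._fault_names = list(fault_names)

    def next_faults(self, max_faults: int = 2) -> list[str]:
        # Pick how many faults to inject for this case, then which ones.
        count = self._rng.randint(0, min(max_faults, len(self._fault_names)))
        return self._rng.sample(self._fault_names, count)


FAULTS = ["http_latency", "timeout", "http_error"]


def schedule(seed: int, n_cases: int) -> list[list[str]]:
    """Build the fault plan for n_cases test cases under a given seed."""
    scheduler = SeededSchedulerSketch(seed, FAULTS)
    return [scheduler.next_faults() for _ in range(n_cases)]
```

Calling `schedule(42, 5)` twice returns identical plans, which is the property that lets a failing run be replayed exactly.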
5. Security Testing
During the SECURITY phase, the AttackRegistry selects attack bundles based on the agent's CapabilityProfile. Attacks are injected via HTTP interception (tool responses, RAG documents) and direct adversarial prompts.
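Capability-based selection can be pictured as a subset check: an attack bundle applies only if the agent has every capability the bundle needs. The sketch below assumes that model; `AttackBundleSketch`, `CapabilityProfileSketch`, `select_attacks`, and the capability names `tools` and `rag` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AttackBundleSketch:
    """Hypothetical bundle: a name plus the capabilities it needs to apply."""
    name: str
    requires: frozenset = frozenset()


@dataclass
class CapabilityProfileSketch:
    """Hypothetical profile: the set of capabilities inferred for an agent."""
    capabilities: set = field(default_factory=set)


def select_attacks(profile, bundles):
    """Keep only bundles whose requirements the agent's profile satisfies."""
    return [b for b in bundles if b.requires <= profile.capabilities]


BUNDLES = [
    AttackBundleSketch("prompt_injection"),                      # applies to any agent
    AttackBundleSketch("tool_abuse", frozenset({"tools"})),      # needs tool calling
    AttackBundleSketch("indirect_injection", frozenset({"rag"})),  # needs retrieval
]

selected = select_attacks(CapabilityProfileSketch({"tools"}), BUNDLES)
selected_names = [b.name for b in selected]
```

Under these assumptions, an agent with tool calling but no retrieval is tested with `prompt_injection` and `tool_abuse`, and the retrieval-specific bundle is skipped.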
6. Scoring
The GoalEvaluator checks each result against the pack's GoalCriteria. The Scorer computes per-phase scores, and the MetricAggregator produces the final overall, security, and resilience scores (0-100).
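One plausible way to arrive at 0-100 scores is a weighted pass rate per phase followed by an average across phases. The sketch below assumes exactly that; the formulas and the `phase_score` and `aggregate` helpers are illustrative and are not taken from the Khaos Scorer or MetricAggregator.

```python
def phase_score(results: list[tuple[float, bool]]) -> float:
    """Weighted pass rate on a 0-100 scale.

    results holds (goal_weight, passed) pairs for one phase.
    """
    total_weight = sum(weight for weight, _ in results)
    if total_weight == 0:
        return 0.0  # no weighted goals: nothing to score
    passed_weight = sum(weight for weight, passed in results if passed)
    return 100.0 * passed_weight / total_weight


def aggregate(phase_scores: dict[str, float]) -> dict[str, float]:
    """Combine per-phase scores; the overall score is assumed to be a plain mean."""
    return {
        "resilience": phase_scores["RESILIENCE"],
        "security": phase_scores["SECURITY"],
        "overall": sum(phase_scores.values()) / len(phase_scores),
    }


scores = aggregate({
    "BASELINE": phase_score([(1.0, True), (1.0, True)]),
    "RESILIENCE": phase_score([(1.0, True), (1.0, False)]),
    "SECURITY": phase_score([(3.0, True), (1.0, False)]),
})
```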
7. Reporting
Results are written to a RunManifest in .khaos/runs/, including all scores, security findings, reproducibility metadata, and timing data. The manifest is used for comparisons, cloud sync, and CI output formats (JUnit XML, JSON, Markdown).
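A manifest of this kind can be modelled as a dataclass that serializes itself to JSON. The sketch below is an assumption-laden illustration: `RunManifestSketch`, its fields, and the `<run_id>.json` file naming are hypothetical, and the real RunManifest format may differ. It writes to a temporary directory so it can be run safely.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class RunManifestSketch:
    """Hypothetical manifest holding scores and reproducibility metadata."""
    run_id: str
    seed: int
    scores: dict
    findings: list = field(default_factory=list)

    def write(self, directory) -> Path:
        """Write the manifest as <run_id>.json and return the file path."""
        out_dir = Path(directory)
        out_dir.mkdir(parents=True, exist_ok=True)
        path = out_dir / f"{self.run_id}.json"
        # sort_keys gives stable output, which keeps diffs between runs readable.
        path.write_text(json.dumps(asdict(self), indent=2, sort_keys=True))
        return path


# Round-trip through a temporary directory standing in for .khaos/runs/.
with tempfile.TemporaryDirectory() as tmp:
    manifest = RunManifestSketch("run-001", seed=42, scores={"overall": 75.0})
    written = manifest.write(Path(tmp) / ".khaos" / "runs")
    written_name = written.name
    loaded = json.loads(written.read_text())
```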
Evaluation Pipeline
The evaluation pipeline is the core execution path. Here is how the key components interact:
# Simplified pseudocode of the evaluation pipeline

# 1. PackRunner loads the pack definition
runner = PackRunner(pack="quickstart", agent="my-agent")

# 2. Iterate phases
for phase in runner.phases:  # BASELINE, RESILIENCE, SECURITY
    scheduler = SeededScheduler(seed=config.seed, phase=phase)

    for case in phase.cases:
        # 3. Build transport message
        msg = TransportMessage(
            name="invoke",
            payload={"text": case.input},
            metadata={"run_id": run_id, "phase": phase.type},
        )

        # 4. Inject faults (RESILIENCE phase)
        if phase.type == PhaseType.RESILIENCE:
            faults = scheduler.next_faults()
            for fault in faults:
                FaultRegistry.get(fault.type).apply(msg)

        # 5. Inject attacks (SECURITY phase)
        if phase.type == PhaseType.SECURITY:
            attacks = AttackRegistry.select(
                profile=agent.capabilities,
                tier=case.tier,
            )
            for attack in attacks:
                msg = attack.inject(msg)

        # 6. Invoke agent via transport (send() is async in the AgentTransport protocol)
        response = await transport.send(msg)

        # 7. Evaluate goals
        for goal in phase.goals:
            GoalEvaluator.evaluate(response, goal)

# 8. Aggregate scores and write manifest
manifest = scorer.finalize()
manifest.write(".khaos/runs/")
Transport Layer
The transport layer decouples agent invocation from the evaluation engine. All communication flows through the AgentTransport protocol, whose central method is send(message: TransportMessage) -> TransportMessage; a companion close() method releases transport resources.
AgentTransport Protocol
from typing import Protocol

class AgentTransport(Protocol):
    """Protocol for agent communication."""

    async def send(self, message: TransportMessage) -> TransportMessage:
        """Send a message to the agent and receive a response."""
        ...

    async def close(self) -> None:
        """Clean up transport resources."""
        ...
TransportMessage
The universal message envelope used for all agent communication:
| Field | Type | Description |
|---|---|---|
| name | str | Message type identifier (e.g. "invoke", "health", "shutdown") |
| payload | dict | Message payload (e.g. {"text": "...", "tools": [...]}) |
| metadata | dict | Run context: run_id, phase, scenario, timestamp |
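To show how the envelope and the protocol fit together, the sketch below defines a stand-in message dataclass with the three fields from the table and a tiny transport that calls an async handler in the same process. `TransportMessageSketch`, `EchoTransportSketch`, the `shout` handler, and the `"response"` message name are assumptions for illustration, not the classes shipped in `khaos.transport`.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable


@dataclass
class TransportMessageSketch:
    """Hypothetical envelope with the three fields from the table."""
    name: str
    payload: dict[str, Any] = field(default_factory=dict)
    metadata: dict[str, Any] = field(default_factory=dict)


class EchoTransportSketch:
    """Hypothetical transport that satisfies send()/close() structurally."""

    def __init__(self, handler: Callable[[str], Awaitable[str]]):
        self._handler = handler

    async def send(self, message: TransportMessageSketch) -> TransportMessageSketch:
        text = await self._handler(message.payload.get("text", ""))
        # Copy the metadata so the reply carries the same run context.
        return TransportMessageSketch(
            name="response",  # assumed reply name
            payload={"text": text},
            metadata=dict(message.metadata),
        )

    async def close(self) -> None:
        return None  # nothing to release for an in-process call


async def shout(text: str) -> str:
    return text.upper()


async def demo() -> TransportMessageSketch:
    transport = EchoTransportSketch(shout)
    try:
        request = TransportMessageSketch("invoke", {"text": "hello"}, {"run_id": "run-001"})
        return await transport.send(request)
    finally:
        await transport.close()


reply = asyncio.run(demo())
```

Because `typing.Protocol` uses structural typing, a class like this does not need to inherit from the protocol; matching method signatures are enough.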
Built-in Transports
Khaos ships with InProcessTransport for local evaluation, which calls the agent handler function directly in the same process. Future transports will support HTTP and MCP-based invocation for remote agents.
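import asyncio
from khaos.transport import InProcessTransport, TransportMessage

async def main() -> None:
    # InProcessTransport wraps a @khaosagent handler;
    # my_agent_handler stands for your own decorated agent function.
    transport = InProcessTransport(handler=my_agent_handler)
    response = await transport.send(TransportMessage(
        name="invoke",
        payload={"text": "Hello, agent!"},
        metadata={"run_id": "run-001"},
    ))
    print(response.payload["text"])
    await transport.close()  # release transport resources

asyncio.run(main())
Extension Points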
Khaos is designed to be extended at three levels: custom faults, custom evaluators, and custom scenarios.
Custom Faults
Implement a custom FaultPlugin and register it with register_fault():
from khaos.engine.fault_plugins import FaultPlugin, register_fault

@register_fault("database_timeout")
class DatabaseTimeoutFault(FaultPlugin):
    async def inject(self, config: dict) -> dict:
        delay_s = float(config.get("delay_s", 5))
        await self.sleep(delay_s)
        return {
            "outcome": "database_timeout",
            "delay_s": delay_s,
            "error": "Connection timed out",
        }
Custom Evaluators
Implement the Evaluator protocol for custom scoring logic:
from khaos.evaluator import Evaluator

class FactualAccuracyEvaluator(Evaluator):
    """Score responses for factual accuracy using a reference dataset."""
    name = "factual_accuracy"

    def evaluate(self, response, context) -> float:
        expected = context.get("expected_answer", "")
        actual = response.payload.get("text", "")
        # Simple containment check (replace with your logic)
        return 1.0 if expected.lower() in actual.lower() else 0.0
Custom Scenarios
Define scenarios in YAML or build them programmatically. Custom scenarios can combine built-in faults, custom faults, and custom goals:
identifier: database-resilience
summary: Test agent under database failure conditions
tags: [database, resilience, custom]
faults:
  - type: database_timeout
    config:
      delay_s: 10
  - type: http_error
    config:
      status_code: 503
goals:
  - name: Graceful degradation
    weight: 1.0
    assertions:
      - type: exists
        target: response
      - type: not_contains
        target: response
        value: "Internal Server Error"
Next Steps
- Architecture Decision Records — Why the architecture is designed this way
- Evaluation Packs — Pack schema, built-in packs, and custom packs
- Fault Injection — Full list of faults and configuration
- Security Testing — Attack tiers, bundles, and classification
- Enriched Testing API — Classification, assertions, paired trials, and more
- Python API Reference — Programmatic evaluation and comparison