Architecture Overview

Khaos is a modular evaluation engine for AI agents. It discovers agents, loads evaluation packs, injects faults and security attacks, scores outcomes, and produces structured reports. This page describes the high-level architecture, module map, data flow, and extension points.

High-Level Architecture

TEXT

┌─────────────────────────────────────────────────────────────────────┐
│                          khaos.cli                                  │
│            (discover · run · test · compare · ci · cloud)           │
└─────────────┬───────────────────────────────────────┬───────────────┘
              │                                       │
              ▼                                       ▼
┌─────────────────────────┐             ┌─────────────────────────────┐
│      khaos.engine       │             │        khaos.cloud          │
│  ┌───────────────────┐  │             │   (sync · auth · API)       │
│  │  SeededScheduler   │  │             └─────────────────────────────┘
│  │  FaultRegistry     │  │
│  │  AttackRegistry    │  │
│  └───────────────────┘  │
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        khaos.packs                                  │
│   PackRunner  ─▶  Phase (BASELINE · RESILIENCE · SECURITY)         │
│   GoalCriteria · GoalEvaluator                                     │
└──────────┬──────────────────────────┬───────────────────────────────┘
           │                          │
           ▼                          ▼
┌──────────────────────┐   ┌──────────────────────────────────────────┐
│   khaos.transport    │   │             khaos.chaos                  │
│  AgentTransport      │   │   FaultHandler · ScenarioEngine          │
│  InProcessTransport  │   │   (http_latency · timeout · llm_*       │
│  TransportMessage    │   │    tool_* · mcp_* · rag_* · context_*)  │
└──────────────────────┘   └──────────────────────────────────────────┘
           │                          │
           ▼                          ▼
┌──────────────────────┐   ┌──────────────────────────────────────────┐
│   khaos.evaluator    │   │            khaos.security                │
│  Scorer · Grader     │   │   AttackBundle · classifier · Outcome    │
│  MetricAggregator    │   │   (prompt_injection · jailbreak ·        │
└──────────────────────┘   │    exfiltration · indirect_injection ·   │
           │               │    tool_abuse · env_poisoning · ...)     │
           ▼               └──────────────────────────────────────────┘
┌──────────────────────┐
│   khaos.artifacts    │   ┌──────────────────────────────────────────┐
│  RunManifest · JSON  │   │         Supporting Modules               │
│  Evidence export     │   │  khaos.pii        (PII detection)       │
└──────────────────────┘   │  khaos.capabilities (agent profiling)   │
                           │  khaos.sandbox     (isolation)           │
                           │  khaos.llm         (LLM telemetry)      │
                           └──────────────────────────────────────────┘

Module Map

Module	Responsibility	Key Types
`khaos.testing`	Python-native test framework (`@khaostest` decorator, assertions, classification)	`khaostest`, `AgentTestClient`, `AgentResponse`, `Outcome`
`khaos.packs`	Evaluation pack loading, phase iteration, goal criteria	`Pack`, `Phase`, `PhaseType`, `GoalCriteria`, `PackRunner`
`khaos.chaos`	Fault injection scenarios and fault handlers	`FaultHandler`, `Fault`, `ScenarioEngine`
`khaos.evaluator`	Scoring, grading, and metric aggregation	`Scorer`, `Grader`, `MetricAggregator`, `GoalEvaluator`
`khaos.engine`	Orchestration: seeded scheduler, fault registry, attack registry	`SeededScheduler`, `FaultRegistry`, `AttackRegistry`
`khaos.transport`	Agent invocation protocol and message passing	`AgentTransport`, `InProcessTransport`, `TransportMessage`
`khaos.security`	Attack bundles, classification, tri-state outcomes	`AttackBundle`, `AttackMetadata`, `Outcome`, `ClassificationResult`
`khaos.pii`	PII detection and redaction in agent responses	`PIIDetector`, `redact_text`, `redact_response`
`khaos.capabilities`	Agent capability profiling for targeted test selection	`CapabilityProfile`, `infer_capabilities`
`khaos.sandbox`	Isolation for agent tool calls and code execution	`Sandbox`, `SandboxConfig`
`khaos.artifacts`	Run manifests, evidence export, result serialization	`RunManifest`, `EvidenceExporter`
`khaos.cli`	Command-line interface (`khaos run`, `khaos test`, `khaos ci`, etc.)	Click commands
`khaos.cloud`	Cloud sync, authentication, project management	`CloudClient`, `authenticate`
`khaos.llm`	LLM call interception, token counting, telemetry	`LLMEvent`, `LLMInterceptor`

Data Flow

A Khaos evaluation follows a linear pipeline from agent discovery through to the final report. Each stage feeds into the next, with the engine orchestrating the overall flow.

TEXT

Discovery ──▶ Pack Selection ──▶ Phase Execution ──▶ Fault Injection
                                                          │
                                                          ▼
Reporting ◀── Scoring ◀── Security Testing ◀── Agent Invocation

1. Discovery

khaos discover scans the project for @khaosagent-decorated functions and registers them in .khaos/agents.json. Each agent gets a name, version, framework hint, and entry point reference.

2. Pack Selection

The engine resolves the evaluation pack (e.g. quickstart, full-eval, or a custom YAML pack). The PackRunner loads the pack definition, resolves inputs, and prepares the phase sequence.

3. Phase Execution

Packs define one or more phases, each with a distinct purpose:

Phase	Purpose	What Happens
`BASELINE`	Establish normal agent behaviour	Run agent with clean inputs, no faults, no attacks. Measure latency, token usage, task completion.
`RESILIENCE`	Test fault tolerance	Re-run the same inputs with faults injected (network errors, timeouts, corrupted tool responses).
`SECURITY`	Test attack resistance	Run adversarial inputs and inject attack payloads. Classify outcomes as BLOCKED/COMPROMISED/UNCERTAIN.

4. Fault Injection

During the RESILIENCE phase, the FaultRegistry maps fault names to FaultHandler implementations. The SeededScheduler determines fault timing and ordering using the configured seed for reproducibility.
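
To illustrate why a seed makes fault schedules reproducible, here is a minimal sketch that plans faults with a private `random.Random` instance. `PlannedFault`, `plan_faults`, and the `rate` parameter are hypothetical; the actual SeededScheduler may plan faults in a different way.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedFault:
    case_index: int   # which test case receives the fault
    fault_type: str   # name that a registry would resolve to a handler


def plan_faults(seed: int, fault_types: list[str], n_cases: int,
                rate: float = 0.5) -> list[PlannedFault]:
    """Return the same plan for the same seed and arguments."""
    rng = random.Random(seed)  # private RNG; global random state is untouched
    plan = []
    for i in range(n_cases):
        if rng.random() < rate:
            plan.append(PlannedFault(i, rng.choice(fault_types)))
    return plan
```

Because the generator is created from the seed on every call, two runs with the same seed inject the same faults into the same cases, which is what makes a failing run replayable.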

5. Security Testing

During the SECURITY phase, the AttackRegistry selects attack bundles based on the agent's CapabilityProfile. Attacks are injected via HTTP interception (tool responses, RAG documents) and direct adversarial prompts.
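
Capability-based selection can be sketched as a subset check: an attack is eligible only when the agent has every capability the attack needs. The capability names (`"tools"`, `"rag"`) and the `AttackSketch` and `select_attacks` names are assumptions made for this example, not the AttackRegistry API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackSketch:
    name: str
    requires: frozenset[str]  # capabilities the target agent must have


# Hypothetical catalog; attack names are taken from the architecture diagram.
CATALOG = [
    AttackSketch("prompt_injection", frozenset()),
    AttackSketch("tool_abuse", frozenset({"tools"})),
    AttackSketch("indirect_injection", frozenset({"rag"})),
]


def select_attacks(capabilities: set[str],
                   catalog: list[AttackSketch]) -> list[AttackSketch]:
    """Keep attacks whose requirements are all present in the agent's capabilities."""
    return [attack for attack in catalog if attack.requires <= capabilities]
```

An agent with no tools and no retrieval would only receive `prompt_injection` here, which avoids spending the run on attacks that cannot apply.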

6. Scoring

The GoalEvaluator checks each result against the pack's GoalCriteria. The Scorer computes per-phase scores, and the MetricAggregator produces the final overall, security, and resilience scores (0-100).
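
One plausible way to arrive at a 0-100 score is a weighted pass rate per phase, blended into an overall figure. The formulas below are assumptions for illustration only; the page does not specify how the Scorer or MetricAggregator weight results, and `phase_score` and `overall_score` are hypothetical helpers.

```python
def phase_score(results: list[tuple[float, bool]]) -> float:
    """Weighted pass rate on a 0-100 scale; each result is (goal weight, passed)."""
    total = sum(weight for weight, _ in results)
    if total == 0:
        return 0.0
    passed = sum(weight for weight, ok in results if ok)
    return 100.0 * passed / total


def overall_score(security: float, resilience: float,
                  security_weight: float = 0.5) -> float:
    """Blend the two phase scores; the 50/50 default is an assumption."""
    return security_weight * security + (1.0 - security_weight) * resilience
```

For example, three goals with weights 2, 1, and 1 where only the middle one fails give a phase score of 75.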

7. Reporting

Results are written to a RunManifest in .khaos/runs/, including all scores, security findings, reproducibility metadata, and timing data. The manifest is used for comparisons, cloud sync, and CI output formats (JUnit XML, JSON, Markdown).
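
As a rough picture of what a manifest could contain, the sketch below serializes a run record to JSON. The field names and the one-file-per-run layout are assumptions; `RunManifestSketch` is a hypothetical stand-in, not the real `RunManifest` type.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class RunManifestSketch:
    run_id: str
    seed: int                      # reproducibility metadata
    scores: dict[str, float]       # e.g. overall, security, resilience
    findings: list[dict] = field(default_factory=list)

    def write(self, directory: Path) -> Path:
        """Write the manifest as <run_id>.json and return the file path."""
        directory.mkdir(parents=True, exist_ok=True)
        path = directory / f"{self.run_id}.json"
        path.write_text(json.dumps(asdict(self), indent=2, sort_keys=True))
        return path


# Round trip through a temporary directory instead of a real .khaos/runs/ folder.
with tempfile.TemporaryDirectory() as tmp:
    manifest = RunManifestSketch("run-001", seed=7, scores={"overall": 82.5})
    written = manifest.write(Path(tmp) / "runs")
    loaded = json.loads(written.read_text())
```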

Evaluation Pipeline

The evaluation pipeline is the core execution path. Here is how the key components interact:

Python

# Simplified pseudocode of the evaluation pipeline

# 1. PackRunner loads the pack definition
runner = PackRunner(pack="quickstart", agent="my-agent")

# 2. Iterate phases
for phase in runner.phases:  # BASELINE, RESILIENCE, SECURITY
    scheduler = SeededScheduler(seed=config.seed, phase=phase)

    for case in phase.cases:
        # 3. Build transport message
        msg = TransportMessage(
            name="invoke",
            payload={"text": case.input},
            metadata={"run_id": run_id, "phase": phase.type},
        )

        # 4. Inject faults (RESILIENCE phase)
        if phase.type == PhaseType.RESILIENCE:
            faults = scheduler.next_faults()
            for fault in faults:
                FaultRegistry.get(fault.type).apply(msg)

        # 5. Inject attacks (SECURITY phase)
        if phase.type == PhaseType.SECURITY:
            attacks = AttackRegistry.select(
                profile=agent.capabilities,
                tier=case.tier,
            )
            for attack in attacks:
                msg = attack.inject(msg)

        # 6. Invoke agent via transport
        response = await transport.send(msg)  # send() is async on AgentTransport

        # 7. Evaluate goals
        for goal in phase.goals:
            GoalEvaluator.evaluate(response, goal)

# 8. Aggregate scores and write manifest
manifest = scorer.finalize()
manifest.write(".khaos/runs/")

Transport Layer

The transport layer decouples agent invocation from the evaluation engine. All communication flows through the AgentTransport protocol, which defines two async methods: send(message: TransportMessage) -> TransportMessage for the request/response exchange, and close() for releasing transport resources.

AgentTransport Protocol

Python

from typing import Protocol

class AgentTransport(Protocol):
    """Protocol for agent communication."""

    async def send(self, message: TransportMessage) -> TransportMessage:
        """Send a message to the agent and receive a response."""
        ...

    async def close(self) -> None:
        """Clean up transport resources."""
        ...

TransportMessage

The universal message envelope used for all agent communication:

Field	Type	Description
`name`	`str`	Message type identifier (e.g. `"invoke"`, `"health"`, `"shutdown"`)
`payload`	`dict`	Message payload (e.g. `{"text": "...", "tools": [...]}`)
`metadata`	`dict`	Run context: `run_id`, `phase`, `scenario`, `timestamp`

Built-in Transports

Khaos ships with InProcessTransport for local evaluation, which calls the agent handler function directly in the same process. Future transports will support HTTP and MCP-based invocation for remote agents.

Python

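import asyncio

from khaos.transport import InProcessTransport, TransportMessage


async def main() -> None:
    # InProcessTransport wraps a @khaosagent handler.
    # my_agent_handler is your decorated agent function, defined elsewhere.
    transport = InProcessTransport(handler=my_agent_handler)

    response = await transport.send(TransportMessage(
        name="invoke",
        payload={"text": "Hello, agent!"},
        metadata={"run_id": "run-001"},
    ))
    print(response.payload["text"])

    # Release transport resources once the exchange is done
    await transport.close()


asyncio.run(main())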

Why a Transport Abstraction?

The transport protocol means the evaluation engine never calls your agent directly. This enables local testing, remote evaluation, sandboxed execution, and future multi-agent topologies without changing the engine. See ADR 3 for the full rationale.
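
The benefit of the abstraction can be shown with stand-in types: code written against the protocol works unchanged whether it receives a plain transport or one wrapped with extra behaviour. Everything here (`Message`, `Transport`, `InProcess`, `Recording`, `run_case`) is a hypothetical sketch that mirrors the shapes described above; it does not import or reproduce the Khaos classes.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Protocol


@dataclass
class Message:
    name: str
    payload: dict
    metadata: dict = field(default_factory=dict)


class Transport(Protocol):
    async def send(self, message: Message) -> Message: ...
    async def close(self) -> None: ...


class InProcess:
    """Calls an async handler directly in the same process."""

    def __init__(self, handler: Callable[[dict], Awaitable[dict]]):
        self._handler = handler

    async def send(self, message: Message) -> Message:
        reply = await self._handler(message.payload)
        return Message("response", reply, message.metadata)

    async def close(self) -> None:
        pass


class Recording:
    """Wraps any transport and records the names of messages passing through."""

    def __init__(self, inner: Transport):
        self.inner = inner
        self.seen: list[str] = []

    async def send(self, message: Message) -> Message:
        self.seen.append(message.name)
        return await self.inner.send(message)

    async def close(self) -> None:
        await self.inner.close()


async def run_case(transport: Transport, text: str) -> str:
    """Engine-side code: it only knows about the protocol."""
    reply = await transport.send(Message("invoke", {"text": text}))
    return reply.payload["text"]


async def echo_handler(payload: dict) -> dict:
    return {"text": payload["text"].upper()}


direct = asyncio.run(run_case(InProcess(echo_handler), "hello"))
recorder = Recording(InProcess(echo_handler))
wrapped = asyncio.run(run_case(recorder, "hello"))
```

`run_case` is identical in both calls; only the transport passed in differs. A sandboxing or remote transport could be substituted in the same way.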

Extension Points

Khaos is designed to be extended at three levels: custom faults, custom evaluators, and custom scenarios.

Custom Faults

Implement a custom FaultPlugin and register it with register_fault():

Python

from khaos.engine.fault_plugins import FaultPlugin, register_fault

@register_fault("database_timeout")
class DatabaseTimeoutFault(FaultPlugin):
    async def inject(self, config: dict) -> dict:
        delay_s = float(config.get("delay_s", 5))
        await self.sleep(delay_s)
        return {
            "outcome": "database_timeout",
            "delay_s": delay_s,
            "error": "Connection timed out",
        }

Custom Evaluators

Implement the Evaluator protocol for custom scoring logic:

Python

from khaos.evaluator import Evaluator

class FactualAccuracyEvaluator(Evaluator):
    """Score responses for factual accuracy using a reference dataset."""

    name = "factual_accuracy"

    def evaluate(self, response, context) -> float:
        expected = context.get("expected_answer", "")
        actual = response.payload.get("text", "")
        # Simple containment check (replace with your logic)
        return 1.0 if expected.lower() in actual.lower() else 0.0

Custom Scenarios

Define scenarios in YAML or build them programmatically. Custom scenarios can combine built-in faults, custom faults, and custom goals:

scenarios/database-resilience.yaml

identifier: database-resilience
summary: Test agent under database failure conditions
tags: [database, resilience, custom]

faults:
  - type: database_timeout
    config:
      delay_s: 10
  - type: http_error
    config:
      status_code: 503

goals:
  - name: Graceful degradation
    weight: 1.0
    assertions:
      - type: exists
        target: response
      - type: not_contains
        target: response
        value: "Internal Server Error"
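
The two assertion types used in the goal above can be read as simple predicates over the agent's output. The sketch below shows one possible interpretation, with the goal written as a Python dict to stay dependency-free; `check_assertion` and `goal_passes` are hypothetical helpers, and the real assertion semantics may differ.

```python
def check_assertion(assertion: dict, data: dict) -> bool:
    """Evaluate one assertion against a mapping of target name to value."""
    value = data.get(assertion["target"])
    kind = assertion["type"]
    if kind == "exists":
        return value is not None
    if kind == "not_contains":
        return assertion["value"] not in str(value or "")
    raise ValueError(f"unsupported assertion type: {kind}")


def goal_passes(goal: dict, data: dict) -> bool:
    """A goal passes only when every one of its assertions holds."""
    return all(check_assertion(a, data) for a in goal["assertions"])


# Same goal as in the YAML scenario, expressed as a dict.
GOAL = {
    "name": "Graceful degradation",
    "weight": 1.0,
    "assertions": [
        {"type": "exists", "target": "response"},
        {"type": "not_contains", "target": "response",
         "value": "Internal Server Error"},
    ],
}
```

Under this reading, an agent that answers "Service unavailable, please retry" passes, while one that leaks a raw "Internal Server Error" or returns nothing fails.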

Next Steps

Architecture Decision Records — Why the architecture is designed this way
Evaluation Packs — Pack schema, built-in packs, and custom packs
Fault Injection — Full list of faults and configuration
Security Testing — Attack tiers, bundles, and classification
Enriched Testing API — Classification, assertions, paired trials, and more
Python API Reference — Programmatic evaluation and comparison
