Architecture Overview

Khaos is a modular evaluation engine for AI agents. It discovers agents, loads evaluation packs, injects faults and security attacks, scores outcomes, and produces structured reports. This page describes the high-level architecture, module map, data flow, and extension points.

High-Level Architecture

TEXT
┌─────────────────────────────────────────────────────────────────────┐
│                          khaos.cli                                  │
│            (discover · run · test · compare · ci · cloud)           │
└─────────────┬───────────────────────────────────────┬───────────────┘
              │                                       │
              ▼                                       ▼
┌─────────────────────────┐             ┌─────────────────────────────┐
│      khaos.engine       │             │        khaos.cloud          │
│  ┌───────────────────┐  │             │   (sync · auth · API)       │
│  │  SeededScheduler   │  │             └─────────────────────────────┘
│  │  FaultRegistry     │  │
│  │  AttackRegistry    │  │
│  └───────────────────┘  │
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        khaos.packs                                  │
│   PackRunner  ─▶  Phase (BASELINE · RESILIENCE · SECURITY)         │
│   GoalCriteria · GoalEvaluator                                     │
└──────────┬──────────────────────────┬───────────────────────────────┘
           │                          │
           ▼                          ▼
┌──────────────────────┐   ┌──────────────────────────────────────────┐
│   khaos.transport    │   │             khaos.chaos                  │
│  AgentTransport      │   │   FaultHandler · ScenarioEngine          │
│  InProcessTransport  │   │   (http_latency · timeout · llm_*       │
│  TransportMessage    │   │    tool_* · mcp_* · rag_* · context_*)  │
└──────────────────────┘   └──────────────────────────────────────────┘
           │                          │
           ▼                          ▼
┌──────────────────────┐   ┌──────────────────────────────────────────┐
│   khaos.evaluator    │   │            khaos.security                │
│  Scorer · Grader     │   │   AttackBundle · classifier · Outcome    │
│  MetricAggregator    │   │   (prompt_injection · jailbreak ·        │
└──────────────────────┘   │    exfiltration · indirect_injection ·   │
           │               │    tool_abuse · env_poisoning · ...)     │
           ▼               └──────────────────────────────────────────┘
┌──────────────────────┐
│   khaos.artifacts    │   ┌──────────────────────────────────────────┐
│  RunManifest · JSON  │   │         Supporting Modules               │
│  Evidence export     │   │  khaos.pii        (PII detection)       │
└──────────────────────┘   │  khaos.capabilities (agent profiling)   │
                           │  khaos.sandbox     (isolation)           │
                           │  khaos.llm         (LLM telemetry)      │
                           └──────────────────────────────────────────┘

Module Map

| Module | Responsibility | Key Types |
| --- | --- | --- |
| khaos.testing | Python-native test framework (@khaostest decorator, assertions, classification) | khaostest, AgentTestClient, AgentResponse, Outcome |
| khaos.packs | Evaluation pack loading, phase iteration, goal criteria | Pack, Phase, PhaseType, GoalCriteria, PackRunner |
| khaos.chaos | Fault injection scenarios and fault handlers | FaultHandler, Fault, ScenarioEngine |
| khaos.evaluator | Scoring, grading, and metric aggregation | Scorer, Grader, MetricAggregator, GoalEvaluator |
| khaos.engine | Orchestration: seeded scheduler, fault registry, attack registry | SeededScheduler, FaultRegistry, AttackRegistry |
| khaos.transport | Agent invocation protocol and message passing | AgentTransport, InProcessTransport, TransportMessage |
| khaos.security | Attack bundles, classification, tri-state outcomes | AttackBundle, AttackMetadata, Outcome, ClassificationResult |
| khaos.pii | PII detection and redaction in agent responses | PIIDetector, redact_text, redact_response |
| khaos.capabilities | Agent capability profiling for targeted test selection | CapabilityProfile, infer_capabilities |
| khaos.sandbox | Isolation for agent tool calls and code execution | Sandbox, SandboxConfig |
| khaos.artifacts | Run manifests, evidence export, result serialization | RunManifest, EvidenceExporter |
| khaos.cli | Command-line interface (khaos run, khaos test, khaos ci, etc.) | Click commands |
| khaos.cloud | Cloud sync, authentication, project management | CloudClient, authenticate |
| khaos.llm | LLM call interception, token counting, telemetry | LLMEvent, LLMInterceptor |

Data Flow

A Khaos evaluation follows a linear pipeline from agent discovery through to the final report. Each stage feeds into the next, with the engine orchestrating the overall flow.

TEXT
Discovery ──▶ Pack Selection ──▶ Phase Execution ──▶ Fault Injection
                                                          │
                                                          ▼
Reporting ◀── Scoring ◀── Security Testing ◀── Agent Invocation

1. Discovery

khaos discover scans the project for @khaosagent-decorated functions and registers them in .khaos/agents.json. Each agent gets a name, version, framework hint, and entry point reference.
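To make the registration step concrete, here is a minimal sketch of decorator-based discovery. The `khaosagent` function below is a stand-in written for this example; its parameters, the `REGISTRY` dict, and the `entry_point` string format are assumptions, and the real `khaos discover` command may scan source files rather than register at import time.

```python
import json

# Hypothetical in-memory registry; the real tool persists to .khaos/agents.json.
REGISTRY: dict[str, dict] = {}


def khaosagent(name: str, version: str = "0.1.0", framework: str = "custom"):
    """Stand-in decorator: records metadata and returns the function unchanged."""
    def decorate(func):
        REGISTRY[name] = {
            "name": name,
            "version": version,
            "framework": framework,
            # "module:qualname" is an assumed entry point format.
            "entry_point": f"{func.__module__}:{func.__qualname__}",
        }
        return func
    return decorate


@khaosagent(name="support-bot", version="1.2.0")
def handle(message: dict) -> dict:
    return {"text": f"received: {message.get('text', '')}"}


# Each registered agent carries a name, version, framework hint, and entry point.
print(json.dumps(REGISTRY, indent=2))
```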

2. Pack Selection

The engine resolves the evaluation pack (e.g. quickstart, full-eval, or a custom YAML pack). The PackRunner loads the pack definition, resolves inputs, and prepares the phase sequence.

3. Phase Execution

Packs define one or more phases, each with a distinct purpose:

| Phase | Purpose | What Happens |
| --- | --- | --- |
| BASELINE | Establish normal agent behaviour | Run agent with clean inputs, no faults, no attacks. Measure latency, token usage, task completion. |
| RESILIENCE | Test fault tolerance | Re-run the same inputs with faults injected (network errors, timeouts, corrupted tool responses). |
| SECURITY | Test attack resistance | Run adversarial inputs and inject attack payloads. Classify outcomes as BLOCKED/COMPROMISED/UNCERTAIN. |

4. Fault Injection

During the RESILIENCE phase, the FaultRegistry maps fault names to FaultHandler implementations. The SeededScheduler determines fault timing and ordering using the configured seed for reproducibility.
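The reproducibility property can be illustrated with a small sketch of seed-driven planning. This is not the SeededScheduler implementation: `PlannedFault`, `plan_faults`, and the `probability` parameter are hypothetical names, and the fault names are taken from elsewhere on this page purely for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedFault:
    """Hypothetical record: which fault fires on which case."""
    case_index: int
    fault_type: str


def plan_faults(seed: int, fault_types: list[str], num_cases: int,
                probability: float = 0.5) -> list[PlannedFault]:
    """Return a deterministic fault plan for the given seed."""
    rng = random.Random(seed)  # private RNG; global random state is untouched
    plan = []
    for case_index in range(num_cases):
        if rng.random() < probability:
            plan.append(PlannedFault(case_index, rng.choice(fault_types)))
    return plan


faults = ["http_latency", "timeout", "http_error"]
first = plan_faults(seed=42, fault_types=faults, num_cases=10)
second = plan_faults(seed=42, fault_types=faults, num_cases=10)
print(first == second)  # True: same seed, same plan
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the plan independent of any other code that draws random numbers during the run.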

5. Security Testing

During the SECURITY phase, the AttackRegistry selects attack bundles based on the agent's CapabilityProfile. Attacks are injected via HTTP interception (tool responses, RAG documents) and direct adversarial prompts.
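One way to picture capability-based selection is as a subset test: a bundle is relevant when the agent has every capability the bundle needs. The sketch below assumes that rule; the `requires` field, the capability names ("tools", "rag", "network"), and `select_attacks` are hypothetical and do not describe the actual AttackRegistry or CapabilityProfile API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackBundle:
    """Stand-in bundle: a name plus the capabilities it needs to be relevant."""
    name: str
    requires: frozenset[str] = frozenset()


BUNDLES = [
    AttackBundle("prompt_injection"),  # no requirements: applies to any agent
    AttackBundle("tool_abuse", frozenset({"tools"})),
    AttackBundle("indirect_injection", frozenset({"rag"})),
    AttackBundle("exfiltration", frozenset({"tools", "network"})),
]


def select_attacks(capabilities: set[str]) -> list[str]:
    """Keep bundles whose required capabilities the agent actually has."""
    return [b.name for b in BUNDLES if b.requires <= capabilities]


print(select_attacks({"tools"}))
# ['prompt_injection', 'tool_abuse']
print(select_attacks({"tools", "rag"}))
# ['prompt_injection', 'tool_abuse', 'indirect_injection']
```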

6. Scoring

The GoalEvaluator checks each result against the pack's GoalCriteria. The Scorer computes per-phase scores, and the MetricAggregator produces the final overall, security, and resilience scores (0-100).
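As a rough illustration of how goal results could roll up into 0-100 scores, the sketch below uses a weighted pass rate per phase and an unweighted mean across phases. Both formulas are assumptions made for this example, as are `GoalResult`, `phase_score`, and `aggregate`; the actual Scorer and MetricAggregator may weight things differently.

```python
from dataclasses import dataclass


@dataclass
class GoalResult:
    """Hypothetical outcome of checking one goal for one case."""
    phase: str      # "BASELINE", "RESILIENCE", or "SECURITY"
    weight: float
    passed: bool


def phase_score(results: list[GoalResult], phase: str) -> float:
    """Weighted pass rate for one phase, scaled to 0-100."""
    selected = [r for r in results if r.phase == phase]
    total = sum(r.weight for r in selected)
    if total == 0:
        return 0.0
    earned = sum(r.weight for r in selected if r.passed)
    return 100.0 * earned / total


def aggregate(results: list[GoalResult]) -> dict[str, float]:
    """Assumed aggregation: overall is the unweighted mean of phase scores."""
    phases = sorted({r.phase for r in results})
    per_phase = {p: phase_score(results, p) for p in phases}
    overall = sum(per_phase.values()) / len(per_phase) if per_phase else 0.0
    return {"overall": overall, **{p.lower(): s for p, s in per_phase.items()}}


results = [
    GoalResult("RESILIENCE", 1.0, True),
    GoalResult("RESILIENCE", 1.0, False),
    GoalResult("SECURITY", 2.0, True),
    GoalResult("SECURITY", 1.0, True),
]
scores = aggregate(results)
print(scores)  # {'overall': 75.0, 'resilience': 50.0, 'security': 100.0}
```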

7. Reporting

Results are written to a RunManifest in .khaos/runs/, including all scores, security findings, reproducibility metadata, and timing data. The manifest is used for comparisons, cloud sync, and CI output formats (JUnit XML, JSON, Markdown).
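The sketch below shows the general shape of persisting a run as JSON and reading it back. The manifest fields, the one-file-per-run naming, and `write_manifest` are hypothetical; the actual RunManifest schema is not specified on this page. A temporary directory stands in for the project root so the example leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(runs_dir: Path, run_id: str, manifest: dict) -> Path:
    """Serialize a manifest dict to <runs_dir>/<run_id>.json (assumed layout)."""
    runs_dir.mkdir(parents=True, exist_ok=True)
    path = runs_dir / f"{run_id}.json"
    # sort_keys gives stable output, which keeps diffs between runs readable.
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
    return path


manifest = {
    "run_id": "run-001",
    "seed": 42,
    "scores": {"overall": 75.0, "security": 100.0, "resilience": 50.0},
    "findings": [],
}

with tempfile.TemporaryDirectory() as tmp:
    path = write_manifest(Path(tmp) / ".khaos" / "runs", "run-001", manifest)
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded == manifest)  # True: the manifest round-trips through JSON
```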

Evaluation Pipeline

The evaluation pipeline is the core execution path. Here is how the key components interact:

Python
# Simplified pseudocode of the evaluation pipeline

# 1. PackRunner loads the pack definition
runner = PackRunner(pack="quickstart", agent="my-agent")

# 2. Iterate phases
for phase in runner.phases:  # BASELINE, RESILIENCE, SECURITY
    scheduler = SeededScheduler(seed=config.seed, phase=phase)

    for case in phase.cases:
        # 3. Build transport message
        msg = TransportMessage(
            name="invoke",
            payload={"text": case.input},
            metadata={"run_id": run_id, "phase": phase.type},
        )

        # 4. Inject faults (RESILIENCE phase)
        if phase.type == PhaseType.RESILIENCE:
            faults = scheduler.next_faults()
            for fault in faults:
                FaultRegistry.get(fault.type).apply(msg)

        # 5. Inject attacks (SECURITY phase)
        if phase.type == PhaseType.SECURITY:
            attacks = AttackRegistry.select(
                profile=agent.capabilities,
                tier=case.tier,
            )
            for attack in attacks:
                msg = attack.inject(msg)

        # 6. Invoke agent via transport (send() is async in the AgentTransport protocol)
        response = await transport.send(msg)

        # 7. Evaluate goals
        for goal in phase.goals:
            GoalEvaluator.evaluate(response, goal)

# 8. Aggregate scores and write manifest
manifest = scorer.finalize()
manifest.write(".khaos/runs/")

Transport Layer

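The transport layer decouples agent invocation from the evaluation engine. All communication flows through the AgentTransport protocol, which defines two async methods: send(message: TransportMessage) -> TransportMessage, which delivers a message and returns the agent's response, and close(), which releases transport resources.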

AgentTransport Protocol

Python
from typing import Protocol

class AgentTransport(Protocol):
    """Protocol for agent communication."""

    async def send(self, message: TransportMessage) -> TransportMessage:
        """Send a message to the agent and receive a response."""
        ...

    async def close(self) -> None:
        """Clean up transport resources."""
        ...

TransportMessage

The universal message envelope used for all agent communication:

| Field | Type | Description |
| --- | --- | --- |
| name | str | Message type identifier (e.g. "invoke", "health", "shutdown") |
| payload | dict | Message payload (e.g. {"text": "...", "tools": [...]}) |
| metadata | dict | Run context: run_id, phase, scenario, timestamp |
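For readers who want to experiment without installing Khaos, the envelope can be approximated with a dataclass. This is a stand-in with the three documented fields; the empty-dict defaults and the practice of copying metadata onto the response are assumptions, not confirmed behaviour of khaos.transport.TransportMessage.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TransportMessage:
    """Stand-in envelope with the three documented fields."""
    name: str
    payload: dict[str, Any] = field(default_factory=dict)
    metadata: dict[str, Any] = field(default_factory=dict)


request = TransportMessage(
    name="invoke",
    payload={"text": "Summarise this ticket"},
    metadata={"run_id": "run-001", "phase": "BASELINE"},
)

# A response reuses the same envelope; copying the metadata carries the
# run context forward without sharing a mutable dict with the request.
response = TransportMessage(
    name="invoke",
    payload={"text": "Summary: ..."},
    metadata=dict(request.metadata),
)
print(response.metadata["run_id"])  # run-001
```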

Built-in Transports

Khaos ships with InProcessTransport for local evaluation, which calls the agent handler function directly in the same process. Future transports will support HTTP and MCP-based invocation for remote agents.

Python
from khaos.transport import InProcessTransport, TransportMessage

# InProcessTransport wraps a @khaosagent handler
transport = InProcessTransport(handler=my_agent_handler)

response = await transport.send(TransportMessage(
    name="invoke",
    payload={"text": "Hello, agent!"},
    metadata={"run_id": "run-001"},
))

print(response.payload["text"])

Why a Transport Abstraction?

The transport protocol means the evaluation engine never calls your agent directly. This enables local testing, remote evaluation, sandboxed execution, and future multi-agent topologies without changing the engine. See ADR 3 for the full rationale.
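A small sketch of what this buys in practice: code written against the protocol works unchanged with any object that provides send() and close(). Everything below is a self-contained stand-in; `EchoTransport`, `RecordingTransport`, and `run_case` are hypothetical names, and the `TransportMessage` and `AgentTransport` definitions mirror the ones on this page rather than importing them from Khaos.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class TransportMessage:  # stand-in for khaos.transport.TransportMessage
    name: str
    payload: dict[str, Any] = field(default_factory=dict)
    metadata: dict[str, Any] = field(default_factory=dict)


class AgentTransport(Protocol):
    async def send(self, message: TransportMessage) -> TransportMessage: ...
    async def close(self) -> None: ...


class EchoTransport:
    """Hypothetical transport that answers without a real agent."""

    async def send(self, message: TransportMessage) -> TransportMessage:
        text = message.payload.get("text", "")
        return TransportMessage("invoke", {"text": f"echo: {text}"}, message.metadata)

    async def close(self) -> None:
        pass


class RecordingTransport:
    """Hypothetical wrapper that logs every exchange, then delegates."""

    def __init__(self, inner: AgentTransport) -> None:
        self.inner = inner
        self.log: list[tuple[TransportMessage, TransportMessage]] = []

    async def send(self, message: TransportMessage) -> TransportMessage:
        response = await self.inner.send(message)
        self.log.append((message, response))
        return response

    async def close(self) -> None:
        await self.inner.close()


async def run_case(transport: AgentTransport, text: str) -> str:
    """Engine-side code: depends only on the protocol, not on the transport type."""
    response = await transport.send(TransportMessage("invoke", {"text": text}))
    return response.payload["text"]


recorder = RecordingTransport(EchoTransport())
reply = asyncio.run(run_case(recorder, "ping"))
print(reply)  # echo: ping
```

Because `run_case` only knows about the protocol, swapping `EchoTransport` for a remote or sandboxed transport would not require touching it.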

Extension Points

Khaos is designed to be extended at three levels: custom faults, custom evaluators, and custom scenarios.

Custom Faults

Implement a custom FaultPlugin and register it with register_fault():

Python
from khaos.engine.fault_plugins import FaultPlugin, register_fault

@register_fault("database_timeout")
class DatabaseTimeoutFault(FaultPlugin):
    async def inject(self, config: dict) -> dict:
        delay_s = float(config.get("delay_s", 5))
        await self.sleep(delay_s)
        return {
            "outcome": "database_timeout",
            "delay_s": delay_s,
            "error": "Connection timed out",
        }

Custom Evaluators

Implement the Evaluator protocol for custom scoring logic:

Python
from khaos.evaluator import Evaluator

class FactualAccuracyEvaluator(Evaluator):
    """Score responses for factual accuracy using a reference dataset."""

    name = "factual_accuracy"

    def evaluate(self, response, context) -> float:
        expected = context.get("expected_answer", "")
        actual = response.payload.get("text", "")
        # Simple containment check (replace with your logic)
        return 1.0 if expected.lower() in actual.lower() else 0.0
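The containment check can be tried in isolation before wiring it into Khaos. The sketch below lifts the same logic into a plain function and uses `SimpleNamespace` as a stand-in for a response object with a `payload` dict; `containment_score` is a name chosen for this example only.

```python
from types import SimpleNamespace


def containment_score(response, context: dict) -> float:
    """Same check as FactualAccuracyEvaluator.evaluate, as a plain function."""
    expected = context.get("expected_answer", "")
    actual = response.payload.get("text", "")
    return 1.0 if expected.lower() in actual.lower() else 0.0


# SimpleNamespace stands in for a response object with a .payload dict.
response = SimpleNamespace(payload={"text": "The capital of France is Paris."})

hit = containment_score(response, {"expected_answer": "paris"})
miss = containment_score(response, {"expected_answer": "Lyon"})
# An absent expected_answer defaults to "", and "" is contained in every
# string, so the check passes vacuously.
vacuous = containment_score(response, {})
print(hit, miss, vacuous)  # 1.0 0.0 1.0
```

The third call is worth noting: a case with no `expected_answer` scores 1.0, so a production evaluator would likely want to treat a missing reference answer as an error or a skip.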

Custom Scenarios

Define scenarios in YAML or build them programmatically. Custom scenarios can combine built-in faults, custom faults, and custom goals:

scenarios/database-resilience.yaml
identifier: database-resilience
summary: Test agent under database failure conditions
tags: [database, resilience, custom]

faults:
  - type: database_timeout
    config:
      delay_s: 10
  - type: http_error
    config:
      status_code: 503

goals:
  - name: Graceful degradation
    weight: 1.0
    assertions:
      - type: exists
        target: response
      - type: not_contains
        target: response
        value: "Internal Server Error"
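To show how the two assertion types in this goal might be interpreted, here is a minimal evaluator over plain dicts. The semantics are assumptions: `exists` is taken to mean "present and non-empty" and `not_contains` to mean a substring test. `check_assertion` is a hypothetical helper, not part of Khaos.

```python
from typing import Any


def check_assertion(assertion: dict[str, Any], values: dict[str, Any]) -> bool:
    """Evaluate one assertion dict against named targets (assumed semantics)."""
    target = values.get(assertion["target"])
    kind = assertion["type"]
    if kind == "exists":
        return target is not None and target != ""
    if kind == "not_contains":
        return assertion["value"] not in str(target or "")
    raise ValueError(f"unsupported assertion type: {kind}")


# The assertions from the "Graceful degradation" goal, as plain dicts.
assertions = [
    {"type": "exists", "target": "response"},
    {"type": "not_contains", "target": "response", "value": "Internal Server Error"},
]

graceful = {"response": "The database is unavailable right now; please retry."}
crashed = {"response": "500 Internal Server Error"}

print(all(check_assertion(a, graceful) for a in assertions))  # True
print(all(check_assertion(a, crashed) for a in assertions))   # False
```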

Next Steps