Metrics

Khaos produces run-level summaries with interpretable scores designed for comparison across agents and over time.

Overall Score

The overall score (0-100) is a weighted combination of the security and resilience scores:

TEXT

Overall Score: 85/100
├── Security Score: 90/100 (50% weight)
└── Resilience Score: 80/100 (50% weight)

Weights vary by evaluation pack. The security pack weights security at 100%, while the baseline pack focuses on resilience.
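As a rough illustration of the weighting, the sketch below recomputes the example above. The function name and the normalization step are assumptions made for illustration, not Khaos's actual implementation.

```python
def overall_score(security: float, resilience: float,
                  security_weight: float = 0.5,
                  resilience_weight: float = 0.5) -> float:
    """Weighted combination of two component scores, each on a 0-100 scale.

    Hypothetical helper: Khaos's real weighting logic may differ.
    """
    total_weight = security_weight + resilience_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Divide by the weight sum so the result stays on the 0-100 scale
    weighted = security * security_weight + resilience * resilience_weight
    return weighted / total_weight


print(overall_score(90, 80))            # 85.0 (50/50 weighting, as in the example)
print(overall_score(90, 80, 1.0, 0.0))  # 90.0 (security weighted at 100%)
```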

Security Score

Measures resistance to adversarial attacks (0-100). Components:

Component	Description
Prompt Injection Defense	Resistance to instruction override attempts
Tool Validation	Proper handling and validation of tool outputs
Leakage Prevention	Protection of system prompts and sensitive data

See Security Testing for attack categories and scoring details.

Resilience Score

Measures behavior stability under fault conditions (0-100). Components:

Component	Description
Recovery Rate	Percentage of fault scenarios the agent recovered from
Goal Achievement	Whether the agent satisfied the scenario goals despite faults
Response Stability	Consistency of behavior across repeated runs

Baseline Metrics

Collected during no-fault runs to establish normal behavior:

Metric	Description
Task Completion Rate	Percentage of tasks completed successfully
Average Latency	Mean response time across all invocations
Latency P95	95th percentile response time
Token Usage	Total prompt + completion tokens consumed
Cost (USD)	Estimated API cost based on token pricing
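To make the latency metrics concrete, here is a minimal sketch that derives an average and a 95th percentile from a list of per-invocation latencies. It assumes the nearest-rank percentile method. Khaos may use a different interpolation, and the function name is hypothetical.

```python
import math
from statistics import mean


def latency_summary(latencies_ms):
    """Return the mean and nearest-rank P95 of a list of latencies (ms)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank: the smallest sample with at least 95% of samples at or below it
    rank = math.ceil(0.95 * len(ordered))
    return {"avg_ms": mean(ordered), "p95_ms": ordered[rank - 1]}


# Made-up samples: nine fast calls and one slow outlier
samples = [120, 130, 110, 125, 140, 135, 115, 128, 122, 400]
summary = latency_summary(samples)
print(summary)
```

With only ten samples the P95 lands on the slowest call, which is why the percentile is worth tracking alongside the mean.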

Terminal Output

After each run, Khaos displays a summary:

TEXT

┌─────────────────────────────────────────────────┐
│ Khaos Evaluation Results                        │
├─────────────────────────────────────────────────┤
│ Pack: quickstart v1.0                           │
│ Run ID: khaos-pack-20250101-abc12345            │
│ Seed: 12345                                     │
│ Overall Score: 85/100                           │
├─────────────────────────────────────────────────┤
│ Gate       Score  Threshold  Status             │
│ Security     90       80     PASS               │
│ Resilience   80       70     PASS               │
└─────────────────────────────────────────────────┘

JSON Output

Use --json for machine-readable output:

Terminal

khaos run <agent-name> --json

JSON

{
  "run_id": "khaos-scenario-20250101-abc12345",
  "seed": 12345,
  "config_hash": "a1b2c3d4e5f67890",
  "overall_score": 85,
  "security": {
    "score": 90,
    "attacks_tested": 10,
    "attacks_blocked": 9
  },
  "resilience": {
    "score": 80,
    "recovery_rate": 0.85,
    "latency_p95_ms": 250
  },
  "baseline": {
    "task_completion_rate": 0.95,
    "cost_usd": 0.0234
  }
}
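The JSON output can be consumed from a script, for example to fail a CI job when a score drops below a threshold. The sketch below parses the sample above. The thresholds are copied from the terminal example and passed in separately, because the JSON shown does not include them. The helper name is hypothetical.

```python
import json

# Sample payload copied from the JSON output above
raw = """
{
  "run_id": "khaos-scenario-20250101-abc12345",
  "seed": 12345,
  "config_hash": "a1b2c3d4e5f67890",
  "overall_score": 85,
  "security": {"score": 90, "attacks_tested": 10, "attacks_blocked": 9},
  "resilience": {"score": 80, "recovery_rate": 0.85, "latency_p95_ms": 250},
  "baseline": {"task_completion_rate": 0.95, "cost_usd": 0.0234}
}
"""


def failed_gates(report, thresholds):
    """Return the names of gates whose score is below its threshold."""
    return [name for name, minimum in thresholds.items()
            if report[name]["score"] < minimum]


report = json.loads(raw)
thresholds = {"security": 80, "resilience": 70}  # assumed, from the terminal example

security = report["security"]
block_rate = security["attacks_blocked"] / security["attacks_tested"]

print("failed gates:", failed_gates(report, thresholds))
print("attack block rate:", block_rate)
```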

Comparing Runs

Use khaos compare to track changes over time:

Terminal

# Compare two runs by ID
khaos compare khaos-pack-20250101-abc12345 khaos-pack-20250102-def67890

# Compare against stored baseline
khaos compare khaos-pack-20250101-abc12345 --baseline

The comparison shows deltas for all metrics and flags regressions.
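The idea behind the delta report can be sketched in a few lines. This is not how khaos compare is implemented. The metric names, the direction table, and the second run's values are all assumptions chosen for illustration.

```python
# Hypothetical direction table: True means a higher value is better
HIGHER_IS_BETTER = {
    "overall_score": True,
    "security_score": True,
    "resilience_score": True,
    "latency_p95_ms": False,
}


def compare_metrics(before, after):
    """Compute per-metric deltas and flag changes in the worse direction."""
    report = {}
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        delta = after[name] - before[name]
        regressed = delta < 0 if higher_is_better else delta > 0
        report[name] = {"delta": delta, "regressed": regressed}
    return report


before = {"overall_score": 85, "security_score": 90,
          "resilience_score": 80, "latency_p95_ms": 250}
after = {"overall_score": 83, "security_score": 90,   # made-up second run
         "resilience_score": 76, "latency_p95_ms": 310}

report = compare_metrics(before, after)
regressions = sorted(name for name, row in report.items() if row["regressed"])
print(regressions)
```

Treating latency as lower-is-better matters here: a positive delta is an improvement for scores but a regression for latency.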

Provenance Validation

Khaos validates that compared runs are compatible:

config_hash: Ensures both runs used the same pack configuration
seed: Recorded for reproducibility (use the same seed for deterministic comparisons)
khaos_version: Recorded for debugging compatibility issues

Deterministic comparisons

For meaningful comparisons, use the same --seed and evaluation pack. The config_hash ensures pack configurations match.
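A pre-flight check along these lines can catch incompatible runs before their scores are compared. It is a sketch only: the field names follow the list above, the version strings are invented, and Khaos's own validation may treat a mismatch differently (for example as a warning rather than an error).

```python
PROVENANCE_FIELDS = ("config_hash", "seed", "khaos_version")


def provenance_mismatches(run_a, run_b):
    """Return the provenance fields whose values differ between two runs."""
    return [field for field in PROVENANCE_FIELDS
            if run_a.get(field) != run_b.get(field)]


run_a = {"config_hash": "a1b2c3d4e5f67890", "seed": 12345,
         "khaos_version": "1.0.0"}   # version strings are made up
run_b = {"config_hash": "a1b2c3d4e5f67890", "seed": 12345,
         "khaos_version": "1.1.0"}

mismatches = provenance_mismatches(run_a, run_b)
print(mismatches or "runs are comparable")
```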

Session Cost Tracking

Khaos tracks LLM costs per session using a built-in cost model with rates for all major providers. Costs are estimated from token counts and model-specific pricing.

Metric	Description
Prompt Cost	Cost of input tokens, priced per 1K tokens
Completion Cost	Cost of output tokens, priced per 1K tokens
Cached Token Savings	Cost reduction from prompt caching
Total Session Cost	Sum across all LLM calls in the run
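The arithmetic behind these metrics is simple enough to sketch by hand. The rates and the model name below are placeholders, not real pricing, and the sketch leaves out cached-token savings because the discount model is not described here.

```python
# Placeholder USD rates per 1K tokens; not real provider pricing
RATES_PER_1K = {
    "example-model": {"prompt": 0.003, "completion": 0.015},
}


def call_cost(model, tokens_in, tokens_out, rates=RATES_PER_1K):
    """Prompt cost plus completion cost for one call, in USD."""
    rate = rates[model]
    prompt_cost = tokens_in / 1000 * rate["prompt"]
    completion_cost = tokens_out / 1000 * rate["completion"]
    return prompt_cost + completion_cost


def session_cost(calls, rates=RATES_PER_1K):
    """Total session cost: the sum over all LLM calls in the run."""
    return sum(call_cost(c["model"], c["tokens_in"], c["tokens_out"], rates)
               for c in calls)


calls = [
    {"model": "example-model", "tokens_in": 1500, "tokens_out": 500},
    {"model": "example-model", "tokens_in": 2000, "tokens_out": 1000},
]
total = session_cost(calls)
print(round(total, 4))
```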

Python

from khaos.costs import load_cost_table, estimate_cost_usd

# Load built-in cost rates
rates = load_cost_table()

# Estimate cost for a specific call
cost_usd, source = estimate_cost_usd(
    provider="anthropic",
    model="claude-sonnet-4-5-20250929",
    prompt_tokens=1500,
    completion_tokens=500,
    table=rates,
)
print(cost_usd, source)

Cost tracking is automatic when using @khaosagent. Override rates with custom pricing in Configuration.

LLM Telemetry

Every LLM call made during a Khaos run is captured with rich telemetry including model, provider, token counts, latency, cost, and parameters.

Field	Description
model / provider	Model name and provider (openai, anthropic, etc.)
tokens_in / tokens_out	Prompt and completion token counts
cached_tokens	Tokens served from prompt cache
cost_usd	Estimated cost for this call
latency_ms	Wall-clock time for the call
temperature / max_tokens	Generation parameters used

LLM events are stored in llm-events-*.jsonl files in the run artifacts directory. See LLM Observability for the full telemetry guide and Artifacts & Runs for artifact storage details.
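Because the events are stored as JSON Lines, they are easy to aggregate with the standard library. The sketch below assumes each line is a JSON object keyed by the field names in the table above. The actual schema may differ, and the sample events are invented.

```python
import json
from collections import defaultdict


def summarize_llm_events(lines):
    """Aggregate call counts, tokens, and cost per model from JSONL lines."""
    totals = defaultdict(lambda: {"calls": 0, "tokens_in": 0,
                                  "tokens_out": 0, "cost_usd": 0.0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        bucket = totals[event["model"]]
        bucket["calls"] += 1
        bucket["tokens_in"] += event.get("tokens_in", 0)
        bucket["tokens_out"] += event.get("tokens_out", 0)
        bucket["cost_usd"] += event.get("cost_usd", 0.0)
    return dict(totals)


# Invented sample events standing in for the contents of one file
sample_lines = [
    '{"model": "model-a", "provider": "openai", "tokens_in": 1000, "tokens_out": 200, "cost_usd": 0.01, "latency_ms": 800}',
    '{"model": "model-a", "provider": "openai", "tokens_in": 500, "tokens_out": 100, "cost_usd": 0.005, "latency_ms": 400}',
    '{"model": "model-b", "provider": "anthropic", "tokens_in": 1500, "tokens_out": 500, "cost_usd": 0.012, "latency_ms": 1200}',
]
summary = summarize_llm_events(sample_lines)
print(summary)
```

To read real artifacts, the same function can be fed the lines of each file matched by pathlib.Path(run_dir).glob("llm-events-*.jsonl"), where run_dir is the run artifacts directory.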
