Metrics

Khaos produces run-level summaries with interpretable scores designed for comparison across agents and over time.

Overall Score

The overall score (0-100) is a weighted combination of security and resilience scores:

TEXT
Overall Score: 85/100
├── Security Score: 90/100 (50% weight)
└── Resilience Score: 80/100 (50% weight)

Weights vary by evaluation pack. For example, the security pack weights security at 100%, while the baseline pack focuses on resilience.
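As a rough illustration of the weighting described above, the sketch below assumes the overall score is a plain weighted average of the two component scores. The function name `overall_score` and its parameters are hypothetical and not part of the Khaos API; the actual aggregation and rounding may differ.

```python
def overall_score(security: float, resilience: float,
                  security_weight: float = 0.5,
                  resilience_weight: float = 0.5) -> float:
    """Weighted average of two 0-100 scores (hypothetical helper, not a Khaos API)."""
    total = security_weight + resilience_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Divide by the weight total so the result stays on the 0-100 scale.
    return (security * security_weight + resilience * resilience_weight) / total


print(overall_score(90, 80))            # 50/50 weighting, as in the example above
print(overall_score(90, 80, 1.0, 0.0))  # security weighted at 100%
```

With the 50/50 weights from the example, 90 and 80 combine to 85.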

Security Score

Measures resistance to adversarial attacks (0-100). Components:

| Component | Description |
| --- | --- |
| Prompt Injection Defense | Resistance to instruction override attempts |
| Tool Validation | Proper handling and validation of tool outputs |
| Leakage Prevention | Protection of system prompts and sensitive data |

See Security Testing for attack categories and scoring details.

Resilience Score

Measures behavior stability under fault conditions (0-100). Components:

| Component | Description |
| --- | --- |
| Recovery Rate | Percentage of fault scenarios the agent recovered from |
| Goal Achievement | Did the agent satisfy scenario goals despite faults? |
| Response Stability | Consistency of behavior across repeated runs |

Baseline Metrics

Collected during no-fault runs to establish normal behavior:

| Metric | Description |
| --- | --- |
| Task Completion Rate | Percentage of tasks completed successfully |
| Average Latency | Mean response time across all invocations |
| Latency P95 | 95th percentile response time |
| Token Usage | Total prompt + completion tokens consumed |
| Cost (USD) | Estimated API cost based on token pricing |
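To make these definitions concrete, here is a minimal sketch that derives the completion rate, mean latency, and P95 latency from per-invocation samples. The sample data and the helper `percentile_nearest_rank` are invented for illustration, and the nearest-rank percentile method is an assumption; Khaos may compute percentiles differently.

```python
import math
from statistics import mean


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-invocation samples from a no-fault run.
latencies_ms = [120, 135, 110, 250, 140, 125, 130, 118, 122, 145]
task_results = [True, True, True, False, True, True, True, True, True, True]

summary = {
    "task_completion_rate": sum(task_results) / len(task_results),
    "average_latency_ms": mean(latencies_ms),
    "latency_p95_ms": percentile_nearest_rank(latencies_ms, 95),
}
print(summary)
```

Note how a single slow call (250 ms) dominates the P95 value while barely moving the mean, which is why both are reported.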

Terminal Output

After each run, Khaos displays a summary:

TEXT
┌─────────────────────────────────────────────────┐
│ Khaos Evaluation Results                        │
├─────────────────────────────────────────────────┤
│ Pack: quickstart v1.0                           │
│ Run ID: khaos-pack-20250101-abc12345            │
│ Seed: 12345                                     │
│ Overall Score: 85/100                           │
├─────────────────────────────────────────────────┤
│ Gate       Score  Threshold  Status             │
│ Security     90       80     PASS               │
│ Resilience   80       70     PASS               │
└─────────────────────────────────────────────────┘

JSON Output

Use --json for machine-readable output:

Terminal
khaos run <agent-name> --json

JSON
{
  "run_id": "khaos-scenario-20250101-abc12345",
  "seed": 12345,
  "config_hash": "a1b2c3d4e5f67890",
  "overall_score": 85,
  "security": {
    "score": 90,
    "attacks_tested": 10,
    "attacks_blocked": 9
  },
  "resilience": {
    "score": 80,
    "recovery_rate": 0.85,
    "latency_p95_ms": 250
  },
  "baseline": {
    "task_completion_rate": 0.95,
    "cost_usd": 0.0234
  }
}
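One way to consume this output in a script is sketched below. It parses a trimmed copy of the payload above and checks each score against a threshold. The `THRESHOLDS` mapping and the `check_gates` helper are hypothetical; the threshold values simply mirror the terminal summary shown earlier (80 for security, 70 for resilience).

```python
import json

# Trimmed copy of the sample payload; in practice this would be the
# captured stdout of `khaos run <agent-name> --json`.
raw = """
{
  "run_id": "khaos-scenario-20250101-abc12345",
  "overall_score": 85,
  "security": {"score": 90, "attacks_tested": 10, "attacks_blocked": 9},
  "resilience": {"score": 80, "recovery_rate": 0.85, "latency_p95_ms": 250}
}
"""

# Hypothetical thresholds, mirroring the terminal summary.
THRESHOLDS = {"security": 80, "resilience": 70}


def check_gates(summary, thresholds):
    """Return {gate: passed} for every gate that has a threshold."""
    return {gate: summary[gate]["score"] >= minimum
            for gate, minimum in thresholds.items()}


summary = json.loads(raw)
gates = check_gates(summary, THRESHOLDS)
print(gates)
```

A CI job could exit non-zero when any value in `gates` is false.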

Comparing Runs

Use khaos compare to track changes over time:

Terminal
# Compare two runs by ID
khaos compare khaos-pack-20250101-abc12345 khaos-pack-20250102-def67890

# Compare against stored baseline
khaos compare khaos-pack-20250101-abc12345 --baseline

The comparison shows deltas for all metrics and flags regressions.
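The idea of deltas and regression flags can be sketched as follows. This is not how khaos compare is implemented; the `compare_scores` helper, the `tolerance` parameter, and the two score dictionaries are assumptions made for illustration, and the sketch only covers metrics where higher is better.

```python
def compare_scores(baseline, candidate, tolerance=0.0):
    """Return per-metric deltas and the metrics that dropped by more than tolerance."""
    deltas = {name: candidate[name] - baseline[name]
              for name in baseline if name in candidate}
    regressions = sorted(name for name, delta in deltas.items() if delta < -tolerance)
    return deltas, regressions


# Hypothetical score summaries for two runs (higher is better for each metric).
run_a = {"overall_score": 85, "security": 90, "resilience": 80}
run_b = {"overall_score": 82, "security": 92, "resilience": 72}

deltas, regressions = compare_scores(run_a, run_b)
print(deltas)
print(regressions)
```

For metrics where lower is better, such as latency or cost, the sign of the comparison would need to be flipped.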

Provenance Validation

Khaos records provenance fields with each run and uses them to check that compared runs are compatible:

  • config_hash: Ensures both runs used the same pack configuration
  • seed: Recorded for reproducibility (use same seed for deterministic comparison)
  • khaos_version: Recorded for debugging compatibility issues

Deterministic comparisons: For meaningful comparisons, use the same --seed and evaluation pack for both runs. The config_hash check confirms that the pack configurations match.
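A compatibility check over these provenance fields might look like the sketch below. The `provenance_issues` helper and the run dictionaries are hypothetical, the version string is a placeholder, and the messages are not Khaos output.

```python
def provenance_issues(run_a, run_b):
    """List reasons two runs may not be directly comparable (illustrative only)."""
    issues = []
    if run_a.get("config_hash") != run_b.get("config_hash"):
        issues.append("config_hash differs: runs used different pack configurations")
    if run_a.get("seed") != run_b.get("seed"):
        issues.append("seed differs: comparison is not deterministic")
    if run_a.get("khaos_version") != run_b.get("khaos_version"):
        issues.append("khaos_version differs: check for compatibility changes")
    return issues


# Placeholder metadata; "x.y.z" stands in for a real version string.
first = {"config_hash": "a1b2c3d4e5f67890", "seed": 12345, "khaos_version": "x.y.z"}
second = {"config_hash": "a1b2c3d4e5f67890", "seed": 99999, "khaos_version": "x.y.z"}
print(provenance_issues(first, second))
```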

Session Cost Tracking

Khaos tracks LLM costs per session using a built-in cost model with rates for all major providers. Costs are estimated from token counts and model-specific pricing.

| Metric | Description |
| --- | --- |
| Prompt Cost | Cost of input tokens (priced per 1K tokens) |
| Completion Cost | Cost of output tokens (priced per 1K tokens) |
| Cached Token Savings | Cost reduction from prompt caching |
| Total Session Cost | Sum across all LLM calls in the run |

Python
from khaos.costs import load_cost_table, estimate_cost_usd

# Load built-in cost rates
rates = load_cost_table()

# Estimate cost for a specific call
cost_usd, source = estimate_cost_usd(
    provider="anthropic",
    model="claude-sonnet-4-5-20250929",
    prompt_tokens=1500,
    completion_tokens=500,
    table=rates,
)
print(cost_usd, source)

Cost tracking is automatic when using @khaosagent. Override rates with custom pricing in Configuration.
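The per-1K pricing arithmetic behind these estimates can be worked through by hand. The sketch below is not the khaos.costs implementation: `estimate_cost` is a hypothetical helper, the rates are made-up numbers chosen for easy arithmetic, and cached-token discounts are ignored.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  prompt_rate_per_1k, completion_rate_per_1k):
    """Token cost under per-1K pricing (illustrative; ignores cached-token discounts)."""
    prompt_cost = prompt_tokens / 1000 * prompt_rate_per_1k
    completion_cost = completion_tokens / 1000 * completion_rate_per_1k
    return round(prompt_cost + completion_cost, 6)


# Made-up rates for illustration only; real rates come from the cost table.
cost = estimate_cost(1500, 500, prompt_rate_per_1k=0.003, completion_rate_per_1k=0.015)
print(cost)
```

Here 1,500 prompt tokens at 0.003 per 1K contribute 0.0045 and 500 completion tokens at 0.015 per 1K contribute 0.0075, for a total of 0.012.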

LLM Telemetry

Khaos captures telemetry for every LLM call made during a run: model, provider, token counts, latency, cost, and generation parameters.

| Field | Description |
| --- | --- |
| model / provider | Model name and provider (openai, anthropic, etc.) |
| tokens_in / tokens_out | Prompt and completion token counts |
| cached_tokens | Tokens served from prompt cache |
| cost_usd | Estimated cost for this call |
| latency_ms | Wall-clock time for the call |
| temperature / max_tokens | Generation parameters used |

LLM events are stored in llm-events-*.jsonl files in the run artifacts directory. See LLM Observability for the full telemetry guide and Artifacts & Runs for artifact storage details.
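For quick offline analysis, the event files can be aggregated with a few lines of Python. The sketch below assumes each line of an llm-events-*.jsonl file is a JSON object whose keys match the field names in the table above; that layout is an assumption, and `summarize_llm_events` is a hypothetical helper rather than part of Khaos.

```python
import json
from pathlib import Path


def summarize_llm_events(artifacts_dir):
    """Aggregate token, cost, and latency totals across llm-events-*.jsonl files."""
    totals = {"calls": 0, "tokens_in": 0, "tokens_out": 0,
              "cost_usd": 0.0, "latency_ms": 0.0}
    for path in sorted(Path(artifacts_dir).glob("llm-events-*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                event = json.loads(line)
                totals["calls"] += 1
                # .get() with a default tolerates events that omit a field.
                totals["tokens_in"] += event.get("tokens_in", 0)
                totals["tokens_out"] += event.get("tokens_out", 0)
                totals["cost_usd"] += event.get("cost_usd", 0.0)
                totals["latency_ms"] += event.get("latency_ms", 0.0)
    return totals
```

Calling the function with the path to a run's artifacts directory returns a single dictionary of totals; dividing latency_ms by calls gives the mean latency per call.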