Metrics

Khaos produces run-level summaries with interpretable scores designed for comparison across agents and over time.

Overall Score

The overall score (0-100) is a weighted combination of security and resilience scores:

TEXT
Overall Score: 85/100
├── Security Score: 90/100 (50% weight)
└── Resilience Score: 80/100 (50% weight)

Weights vary by evaluation pack. For example, the security pack weights security at 100%, while the baseline pack focuses on resilience.
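The weighting described above can be sketched as a plain weighted average. This is a minimal illustration, not Khaos's implementation: the function name `overall_score` and the dictionary layout are assumptions made for the example, and only the 50/50 and 100% security weightings come from this page.

```python
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of component scores (hypothetical helper, not a Khaos API)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the total lets callers pass weights that do not sum to exactly 1.
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


scores = {"security": 90, "resilience": 80}

# 50/50 weighting, as in the example tree above.
default = overall_score(scores, {"security": 0.5, "resilience": 0.5})

# A pack that weights security at 100%.
security_only = overall_score(scores, {"security": 1.0, "resilience": 0.0})

print(default)        # 85.0
print(security_only)  # 90.0
```

With a 100% security weighting the overall score simply equals the security score, which is why scores from different packs are not directly comparable.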

Security Score

Measures resistance to adversarial attacks (0-100). Components:

Component                 Description
Prompt Injection Defense  Resistance to instruction override attempts
Tool Validation           Proper handling and validation of tool outputs
Leakage Prevention        Protection of system prompts and sensitive data

See Security Testing for attack categories and scoring details.

Resilience Score

Measures behavior stability under fault conditions (0-100). Components:

Component           Description
Recovery Rate       Percentage of fault scenarios the agent recovered from
Goal Achievement    Whether the agent satisfied scenario goals despite faults
Response Stability  Consistency of behavior across repeated runs

Baseline Metrics

Collected during no-fault runs to establish normal behavior:

Metric                Description
Task Completion Rate  Percentage of tasks completed successfully
Average Latency       Mean response time across all invocations
Latency P95           95th percentile response time
Token Usage           Total prompt + completion tokens consumed
Cost (USD)            Estimated API cost based on token pricing
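To make the metric definitions concrete, the sketch below derives them from a list of per-invocation records. Everything here is an assumption for illustration: the `Invocation` record, the nearest-rank percentile method, the flat per-1k-token price, and the simplification that each invocation counts as one task. Khaos may compute these values differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Invocation:
    """Hypothetical per-invocation record; not a Khaos data type."""
    succeeded: bool
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def baseline_metrics(invocations: list[Invocation], usd_per_1k_tokens: float) -> dict:
    if not invocations:
        raise ValueError("at least one invocation is required")
    n = len(invocations)
    latencies = [i.latency_ms for i in invocations]
    tokens = sum(i.prompt_tokens + i.completion_tokens for i in invocations)
    return {
        "task_completion_rate": sum(i.succeeded for i in invocations) / n,
        "avg_latency_ms": sum(latencies) / n,
        "latency_p95_ms": percentile_nearest_rank(latencies, 95),
        "total_tokens": tokens,
        "cost_usd": tokens / 1000 * usd_per_1k_tokens,
    }


runs = [
    Invocation(True, 100, 200, 50),
    Invocation(True, 120, 200, 50),
    Invocation(True, 140, 200, 50),
    Invocation(False, 240, 200, 50),
]
metrics = baseline_metrics(runs, usd_per_1k_tokens=0.002)
print(metrics)
```

Note how a single slow invocation leaves the average at 150 ms but sets the P95 to 240 ms, which is why both latency figures are reported.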

Terminal Output

After each run, Khaos displays a summary:

TEXT
┌─────────────────────────────────────────────────┐
│ Khaos Evaluation Results                        │
├─────────────────────────────────────────────────┤
│ Pack: quickstart v1.0                           │
│ Run ID: khaos-pack-20250101-abc12345            │
│ Seed: 12345                                     │
│ Overall Score: 85/100                           │
├─────────────────────────────────────────────────┤
│ Gate       Score  Threshold  Status             │
│ Security     90       80     PASS               │
│ Resilience   80       70     PASS               │
└─────────────────────────────────────────────────┘
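The gate rows in the summary can be read as a simple threshold check. The sketch below assumes a gate passes when its score is greater than or equal to its threshold and labels anything else `FAIL`; both the comparison rule and the `FAIL` label are assumptions, since the example above only shows passing gates.

```python
def evaluate_gates(scores: dict[str, int], thresholds: dict[str, int]) -> dict[str, str]:
    """Return a status per gate (hypothetical helper; assumes score >= threshold passes)."""
    return {
        gate: "PASS" if scores[gate] >= thresholds[gate] else "FAIL"
        for gate in thresholds
    }


# Values taken from the summary above.
statuses = evaluate_gates(
    scores={"security": 90, "resilience": 80},
    thresholds={"security": 80, "resilience": 70},
)
print(statuses)  # {'security': 'PASS', 'resilience': 'PASS'}

# A resilience score below its threshold would fail the gate under this rule.
lower = evaluate_gates({"security": 90, "resilience": 65}, {"security": 80, "resilience": 70})
print(lower["resilience"])  # FAIL
```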

JSON Output

Use --json for machine-readable output:

Terminal
khaos run <agent-name> --json

JSON
{
  "run_id": "khaos-scenario-20250101-abc12345",
  "seed": 12345,
  "config_hash": "a1b2c3d4e5f67890",
  "overall_score": 85,
  "security": {
    "score": 90,
    "attacks_tested": 10,
    "attacks_blocked": 9
  },
  "resilience": {
    "score": 80,
    "recovery_rate": 0.85,
    "latency_p95_ms": 250
  },
  "baseline": {
    "task_completion_rate": 0.95,
    "cost_usd": 0.0234
  }
}
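A typical use of the JSON output is a scripted check, for example in CI. The sketch below parses a summary with the fields shown above and reports which minimums are not met. The function name `check_summary` and the minimum values are hypothetical; in practice the `raw` string would be the captured output of `khaos run <agent-name> --json`.

```python
import json


def check_summary(raw: str, min_overall: int, min_security: int) -> list[str]:
    """Return human-readable failures for a run summary (hypothetical helper)."""
    summary = json.loads(raw)
    failures = []
    if summary["overall_score"] < min_overall:
        failures.append(
            f"overall score {summary['overall_score']} is below the required {min_overall}"
        )
    if summary["security"]["score"] < min_security:
        failures.append(
            f"security score {summary['security']['score']} is below the required {min_security}"
        )
    return failures


# Trimmed copy of the example output above.
raw = """
{
  "run_id": "khaos-scenario-20250101-abc12345",
  "overall_score": 85,
  "security": {"score": 90, "attacks_tested": 10, "attacks_blocked": 9}
}
"""

print(check_summary(raw, min_overall=80, min_security=80))  # []
print(check_summary(raw, min_overall=80, min_security=95))
# ['security score 90 is below the required 95']
```

A CI job could exit with a non-zero status whenever the returned list is non-empty.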

Comparing Runs

Use khaos compare to track changes over time:

Terminal
# Compare two runs by ID
khaos compare khaos-pack-20250101-abc12345 khaos-pack-20250102-def67890

# Compare against stored baseline
khaos compare khaos-pack-20250101-abc12345 --baseline

The comparison shows deltas for all metrics and flags regressions.
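The idea of deltas and regression flags can be sketched as follows. This illustrates the concept rather than the logic of `khaos compare`: the flat metric dictionaries, the `higher_is_better` map, and the rule that any change in the worse direction counts as a regression are all assumptions.

```python
def compare_metrics(
    baseline: dict[str, float],
    candidate: dict[str, float],
    higher_is_better: dict[str, bool],
) -> dict[str, dict]:
    """Compute per-metric deltas and flag regressions (hypothetical helper)."""
    report = {}
    for name, better_high in higher_is_better.items():
        delta = candidate[name] - baseline[name]
        # For scores a drop is a regression; for latency or cost a rise is.
        regressed = delta < 0 if better_high else delta > 0
        report[name] = {"delta": delta, "regressed": regressed}
    return report


report = compare_metrics(
    baseline={"overall_score": 85, "latency_p95_ms": 250},
    candidate={"overall_score": 82, "latency_p95_ms": 240},
    higher_is_better={"overall_score": True, "latency_p95_ms": False},
)
for name, row in report.items():
    print(name, row)
# overall_score {'delta': -3, 'regressed': True}
# latency_p95_ms {'delta': -10, 'regressed': False}
```

A real comparison would likely apply a tolerance so that small run-to-run noise is not reported as a regression.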

Provenance Validation

Khaos validates that compared runs are compatible:

  • config_hash: Ensures both runs used the same pack configuration
  • seed: Recorded for reproducibility (use the same seed in both runs for a deterministic comparison)
  • khaos_version: Recorded for debugging compatibility issues

Deterministic comparisons: For meaningful comparisons, use the same --seed and evaluation pack for both runs. Khaos uses config_hash to verify that the pack configurations match.
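The provenance fields above suggest a compatibility check along these lines. This is a sketch under assumptions, not Khaos's validation code: the function name, the message wording, and the severity assigned to each field (error for config_hash, warning for seed, note for khaos_version) are choices made for the example.

```python
def provenance_issues(run_a: dict, run_b: dict) -> list[str]:
    """List provenance mismatches between two run summaries (hypothetical helper)."""
    issues = []
    if run_a["config_hash"] != run_b["config_hash"]:
        issues.append("error: config_hash differs, so the runs used different pack configurations")
    if run_a["seed"] != run_b["seed"]:
        issues.append("warning: seeds differ, so the comparison is not deterministic")
    if run_a.get("khaos_version") != run_b.get("khaos_version"):
        issues.append("note: khaos_version differs between runs")
    return issues


# The version strings below are placeholders, not real Khaos releases.
run_a = {"config_hash": "a1b2c3d4e5f67890", "seed": 12345, "khaos_version": "x.y.z"}
run_b = {"config_hash": "a1b2c3d4e5f67890", "seed": 99999, "khaos_version": "x.y.z"}

same = provenance_issues(run_a, run_a)
mixed = provenance_issues(run_a, run_b)
print(same)   # []
print(mixed)  # ['warning: seeds differ, so the comparison is not deterministic']
```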