Metrics
Khaos produces run-level summaries with interpretable scores designed for comparison across agents and over time.
Overall Score
The overall score (0-100) is a weighted combination of security and resilience scores:
Overall Score: 85/100
├── Security Score: 90/100 (50% weight)
└── Resilience Score: 80/100 (50% weight)
Weights can vary by evaluation pack. The security pack weights security at 100%, while baseline focuses on resilience.
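As a rough illustration of the weighting described above, the following Python sketch combines component scores into an overall score. The helper name `overall_score` and the weight dictionaries are hypothetical, and the sketch assumes a plain weighted mean; Khaos's actual aggregation and rounding may differ.

```python
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of component scores (each on a 0-100 scale).

    Hypothetical helper: assumes a simple weighted mean, which is an
    assumption about how the overall score is aggregated.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


scores = {"security": 90, "resilience": 80}

# 50/50 weighting, matching the example breakdown
print(overall_score(scores, {"security": 0.5, "resilience": 0.5}))  # 85.0

# Security-only weighting, as described for the security pack
print(overall_score(scores, {"security": 1.0, "resilience": 0.0}))  # 90.0
```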
Security Score
Measures resistance to adversarial attacks (0-100). Components:
| Component | Description |
|---|---|
| Prompt Injection Defense | Resistance to instruction override attempts |
| Tool Validation | Proper handling and validation of tool outputs |
| Leakage Prevention | Protection of system prompts and sensitive data |
See Security Testing for attack categories and scoring details.
Resilience Score
Measures behavior stability under fault conditions (0-100). Components:
| Component | Description |
|---|---|
| Recovery Rate | Percentage of fault scenarios the agent recovered from |
| Goal Achievement | Whether the agent satisfied scenario goals despite faults |
| Response Stability | Consistency of behavior across repeated runs |
Baseline Metrics
Collected during no-fault runs to establish normal behavior:
| Metric | Description |
|---|---|
| Task Completion Rate | Percentage of tasks completed successfully |
| Average Latency | Mean response time across all invocations |
| Latency P95 | 95th percentile response time |
| Token Usage | Total prompt + completion tokens consumed |
| Cost (USD) | Estimated API cost based on token pricing |
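To make the definitions in the table concrete, here is a minimal Python sketch that derives each baseline metric from a list of invocation records. The `Invocation` record shape, the nearest-rank percentile method, and the flat `usd_per_1k_tokens` price are all assumptions for illustration; Khaos may collect and price these differently.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class Invocation:
    """One no-fault agent invocation (hypothetical record shape)."""
    succeeded: bool
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def baseline_metrics(runs: list[Invocation], usd_per_1k_tokens: float) -> dict[str, float]:
    """Compute the baseline metrics from the table for a set of no-fault runs."""
    if not runs:
        raise ValueError("at least one invocation is required")
    latencies = [r.latency_ms for r in runs]
    total_tokens = sum(r.prompt_tokens + r.completion_tokens for r in runs)
    return {
        "task_completion_rate": sum(r.succeeded for r in runs) / len(runs),
        "avg_latency_ms": mean(latencies),
        "latency_p95_ms": percentile_nearest_rank(latencies, 95),
        "total_tokens": total_tokens,
        # Assumed flat price per 1,000 tokens; real pricing often differs
        # between prompt and completion tokens.
        "cost_usd": total_tokens / 1000 * usd_per_1k_tokens,
    }


# Illustrative data only, not real Khaos output
runs = [
    Invocation(True, 120.0, 300, 100),
    Invocation(True, 180.0, 350, 150),
    Invocation(False, 400.0, 500, 100),
    Invocation(True, 150.0, 280, 120),
]
print(baseline_metrics(runs, usd_per_1k_tokens=0.002))
```

With only a few samples, the 95th percentile is simply the slowest invocation, which is one reason to collect enough baseline runs before relying on tail latency.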
Terminal Output
After each run, Khaos displays a summary:
┌─────────────────────────────────────────────────┐
│ Khaos Evaluation Results │
├─────────────────────────────────────────────────┤
│ Pack: quickstart v1.0 │
│ Run ID: khaos-pack-20250101-abc12345 │
│ Seed: 12345 │
│ Overall Score: 85/100 │
├─────────────────────────────────────────────────┤
│ Gate Score Threshold Status │
│ Security 90 80 PASS │
│ Resilience 80 70 PASS │
└─────────────────────────────────────────────────┘
JSON Output
Use --json for machine-readable output:
khaos run <agent-name> --json
{
  "run_id": "khaos-scenario-20250101-abc12345",
  "seed": 12345,
  "config_hash": "a1b2c3d4e5f67890",
  "overall_score": 85,
  "security": {
    "score": 90,
    "attacks_tested": 10,
    "attacks_blocked": 9
  },
  "resilience": {
    "score": 80,
    "recovery_rate": 0.85,
    "latency_p95_ms": 250
  },
  "baseline": {
    "task_completion_rate": 0.95,
    "cost_usd": 0.0234
  }
}
Comparing Runs
Use khaos compare to track changes over time:
# Compare two runs by ID
khaos compare khaos-pack-20250101-abc12345 khaos-pack-20250102-def67890
# Compare against stored baseline
khaos compare khaos-pack-20250101-abc12345 --baseline
The comparison shows deltas for all metrics and flags regressions.
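Conceptually, a comparison of this kind can be sketched in Python as a per-metric delta over two JSON summaries. This is not the implementation of `khaos compare`, whose output format is not shown here. The field names follow the sample JSON output, while `flatten`, `compare`, the `LOWER_IS_BETTER` set, and the second run's values are hypothetical.

```python
import json

# Assumption: metrics where a lower value is better. Everything else is
# treated as higher-is-better. Khaos's own regression rules may differ.
LOWER_IS_BETTER = {"resilience.latency_p95_ms", "baseline.cost_usd"}


def flatten(summary: dict, prefix: str = "") -> dict[str, float]:
    """Flatten nested numeric fields into dotted keys, e.g. 'security.score'."""
    flat = {}
    for key, value in summary.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{path}."))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[path] = value
    return flat


def compare(old: dict, new: dict) -> list[tuple[str, float, bool]]:
    """Return (metric, delta, regressed) for metrics present in both runs."""
    old_flat, new_flat = flatten(old), flatten(new)
    rows = []
    for name in sorted(old_flat.keys() & new_flat.keys()):
        if name == "seed":  # provenance field, not a metric
            continue
        delta = new_flat[name] - old_flat[name]
        regressed = delta > 0 if name in LOWER_IS_BETTER else delta < 0
        rows.append((name, delta, regressed))
    return rows


# Illustrative summaries only; the second run's values are made up
old = json.loads('{"seed": 12345, "overall_score": 85, "security": {"score": 90},'
                 ' "resilience": {"score": 80, "latency_p95_ms": 250}}')
new = json.loads('{"seed": 12345, "overall_score": 82, "security": {"score": 90},'
                 ' "resilience": {"score": 74, "latency_p95_ms": 310}}')

for name, delta, regressed in compare(old, new):
    print(f"{name:28} {delta:+} {'REGRESSION' if regressed else 'ok'}")
```

Treating latency and cost as lower-is-better keeps a slower or more expensive run from being reported as an improvement just because the number went up.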
Provenance Validation
Khaos validates that compared runs are compatible:
- config_hash: Ensures both runs used the same pack configuration
- seed: Recorded for reproducibility (use the same seed for a deterministic comparison)
- khaos_version: Recorded for debugging compatibility issues
For meaningful comparisons, use the same --seed and evaluation pack. The config_hash ensures pack configurations match.
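The compatibility checks listed above could be approximated in a script as follows. This Python sketch is an assumption-laden illustration: `provenance_issues` is a hypothetical helper, `khaos_version` is assumed to appear as a top-level key in the run summary (it is not shown in the sample JSON), and whether Khaos treats each mismatch as an error or a warning is not specified here.

```python
def provenance_issues(run_a: dict, run_b: dict) -> list[str]:
    """List reasons two run summaries may not be directly comparable."""
    issues = []
    if run_a.get("config_hash") != run_b.get("config_hash"):
        issues.append("config_hash differs: runs used different pack configurations")
    if run_a.get("seed") != run_b.get("seed"):
        issues.append("seed differs: comparison is not deterministic")
    # Assumed key name; the sample JSON output does not show this field.
    if run_a.get("khaos_version") != run_b.get("khaos_version"):
        issues.append("khaos_version differs: check for compatibility changes")
    return issues


# Illustrative summaries; the version strings are placeholders
run_a = {"config_hash": "a1b2c3d4e5f67890", "seed": 12345, "khaos_version": "1.0.0"}
run_b = {"config_hash": "a1b2c3d4e5f67890", "seed": 99999, "khaos_version": "1.0.0"}

for issue in provenance_issues(run_a, run_b):
    print(issue)
```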