Evaluations
Evaluations are curated test suites that combine security testing, resilience evaluation, and baseline metrics into a single reproducible run. Choose the evaluation that matches your workflow.
Available Evaluations
Run khaos discover, then use khaos run <agent-name> in all examples below.
| Evaluation | Time | Use Case |
|---|---|---|
| baseline | ~1 min | Pure observation, no faults or attacks |
| quickstart | ~2 min | Fast dev iteration with basic security |
| full-eval | ~10-15 min | Production readiness assessment |
| security | ~5-8 min | Deep security testing only |
# Run with a specific evaluation
khaos run <agent-name> --eval quickstart
khaos run <agent-name> --eval full-eval
khaos run <agent-name> --eval security
khaos run <agent-name> --eval baseline
Running a Single Test
You can run a single test from an evaluation using the --test flag. This is useful for debugging or re-running specific failing tests.
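When several tests fail, a small wrapper can re-run them one at a time with the --test flag. The sketch below is a hypothetical helper, not part of Khaos: rerun_tests and the KHAOS_BIN override are invented names, the test name tool-use is made up for illustration, and the loop assumes khaos run exits non-zero when a test fails.

```shell
# Re-run the given tests one at a time and count the failures.
# KHAOS_BIN is a hypothetical override, useful for dry runs (KHAOS_BIN=echo).
rerun_tests() {
  local agent="$1" eval_name="$2" failed=0 t
  shift 2
  for t in "$@"; do
    # Assumption: a failing test makes `khaos run` exit non-zero.
    "${KHAOS_BIN:-khaos}" run "$agent" --eval "$eval_name" --test "$t" \
      || failed=$((failed + 1))
  done
  if [ "$failed" -gt 0 ]; then
    echo "$failed test(s) failed" >&2
  fi
  [ "$failed" -eq 0 ]
}

# Usage (agent and second test name are placeholders):
# rerun_tests my-agent full-eval math-reasoning tool-use
```

Setting KHAOS_BIN=echo prints the commands instead of running them, which is a cheap way to check the loop before a long run.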
# Run only the math-reasoning test from full-eval
khaos run <agent-name> --eval full-eval --test math-reasoning
# List available tests in an eval
khaos evals list
baseline
Pure observation mode. Runs your agent without injecting any faults or security attacks. Use this to establish performance baselines and verify basic functionality.
- Phases: Baseline only
- Security: Disabled
- Faults: None
- Outputs: Latency, token usage, cost, completion rate
khaos run <agent-name> --eval baseline
quickstart
The default evaluation for rapid development iteration. Provides balanced coverage of security and resilience in about 2 minutes.
- Phases: Baseline, Resilience, Security
- Security: 10 attack probes
- Faults: Basic latency and error injection
- Outputs: Full scoring (security, resilience, overall)
# Default when no eval specified
khaos run <agent-name>
# Explicit quickstart
khaos run <agent-name> --eval quickstart
full-eval
Comprehensive evaluation for production readiness. Runs the complete test corpus with extensive fault injection and security probing.
- Phases: Baseline, Resilience, Tooling, LLM Turbulence, Security
- Security: 20+ attack probes across all categories
- Faults: Full spectrum (HTTP, LLM, tools, MCP)
- Outputs: Detailed breakdown with fault coverage analysis
khaos run <agent-name> --eval full-eval
Run full-eval before major releases to catch edge cases that quickstart might miss.
security
Focused security testing with the complete attack corpus. Use when you need deep security analysis without resilience testing overhead.
- Phases: Security only
- Security: Full attack corpus (all categories)
- Faults: None
- Outputs: Detailed vulnerability report with remediation hints
khaos run <agent-name> --eval security
Evaluation Phases
Each evaluation runs through one or more testing phases:
| Phase | Description |
|---|---|
| Baseline | Normal operation metrics without interference |
| Resilience | Response stability under network faults |
| Tooling | Tool call handling under failure conditions |
| LLM Turbulence | Behavior during LLM rate limits and timeouts |
| Security | Adversarial attack resistance |
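For scripts that need to know which phases an evaluation covers, the phase lists from the per-evaluation sections above can be captured in a small lookup. This is only a convenience sketch: phases_for_eval is a hypothetical shell function, not a Khaos command, and it simply restates the lists documented on this page.

```shell
# Print the phases each evaluation runs, as documented on this page.
# phases_for_eval is a hypothetical helper, not part of the Khaos CLI.
phases_for_eval() {
  case "$1" in
    baseline)   echo "Baseline" ;;
    quickstart) echo "Baseline, Resilience, Security" ;;
    full-eval)  echo "Baseline, Resilience, Tooling, LLM Turbulence, Security" ;;
    security)   echo "Security" ;;
    *) echo "unknown evaluation: $1" >&2; return 1 ;;
  esac
}

# Usage:
# phases_for_eval quickstart
```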
Automatic Applicability (N/A)
Evaluations are designed to work across many agent frameworks. Khaos automatically infers what your agent actually uses (LLM, HTTP/tooling, MCP) and skips tests that don't apply. Skipped tests are treated as N/A and don't affect scoring.
- No MCP usage → MCP fault tests are skipped
- No HTTP/tool usage → HTTP/tool/RAG fault tests are skipped
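To make the effect of N/A concrete, here is a minimal sketch of a pass percentage that leaves skipped tests out of both the numerator and the denominator. It is an illustration of the idea only, not Khaos's actual scoring formula: the pass_rate function and the pass/fail/na labels are assumptions made for this example.

```shell
# Illustration only: a pass percentage that ignores N/A results.
# This is NOT Khaos's scoring formula; the labels pass/fail/na are hypothetical.
pass_rate() {
  local passed=0 total=0 r
  for r in "$@"; do
    case "$r" in
      pass) passed=$((passed + 1)); total=$((total + 1)) ;;
      fail) total=$((total + 1)) ;;
      na)   ;;  # skipped tests count neither for nor against the agent
    esac
  done
  if [ "$total" -eq 0 ]; then
    echo "n/a"   # nothing applicable was run
  else
    echo $((100 * passed / total))   # integer percentage, rounded down
  fi
}

# Usage:
# pass_rate pass fail na pass
```

With this approach, an agent that uses no MCP is scored the same whether or not MCP fault tests exist in the suite.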
Canonical Inputs
Each evaluation includes a set of canonical input prompts designed for cross-agent comparison. These inputs cover common use cases:
- Greeting and help requests
- Factual questions
- Multi-step reasoning tasks
- Edge case handling
You can also provide custom inputs with --input/--inputs to override an evaluation's canonical prompts:
khaos run <agent-name> --eval quickstart --inputs my-tests.yaml
Choosing the Right Evaluation
| Use Case | Recommended Evaluation |
|---|---|
| Local development iteration | quickstart |
| PR/MR CI checks | quickstart |
| Pre-release validation | full-eval |
| Security audit | security |
| Performance benchmarking | baseline |
| Regression baseline | baseline |
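In a pipeline, the table above can be encoded as a lookup so that each stage picks its evaluation automatically. The following is a sketch under assumptions: eval_for_use_case, the short stage labels (dev, ci, release, audit, benchmark, regression), and the STAGE variable are hypothetical names chosen for this example, not Khaos features.

```shell
# Map a workflow stage to the recommended evaluation from the table above.
# Stage labels and the function name are hypothetical, chosen for this example.
eval_for_use_case() {
  case "$1" in
    dev|ci)               echo quickstart ;;  # local iteration, PR/MR checks
    release)              echo full-eval ;;   # pre-release validation
    audit)                echo security ;;    # security audit
    benchmark|regression) echo baseline ;;    # benchmarking, regression baseline
    *) echo "unknown use case: $1" >&2; return 1 ;;
  esac
}

# Usage, with a hypothetical STAGE variable set by the CI system:
# khaos run my-agent --eval "$(eval_for_use_case "${STAGE:-dev}")"
```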