Evaluations

Evaluations are curated test suites that combine security testing, resilience evaluation, and baseline metrics into a single reproducible run. Choose the evaluation that matches your workflow.

Available Evaluations

Run by agent name

First run khaos discover, then use khaos run <agent-name> in all examples below.

Evaluation	Time	Use Case
`baseline`	~1 min	Pure observation, no faults or attacks
`quickstart`	~2 min	Fast dev iteration with basic security
`full-eval`	~10-15 min	Production readiness assessment
`security`	~5-8 min	Deep security testing only
`security-agentic`	~8-12 min	Security for file/shell-based agents

Terminal

# Run with a specific evaluation
khaos run <agent-name> --eval quickstart
khaos run <agent-name> --eval full-eval
khaos run <agent-name> --eval security
khaos run <agent-name> --eval security-agentic
khaos run <agent-name> --eval baseline

Running a Single Test

You can run a single test from an evaluation using the --test flag. This is useful for debugging or re-running specific failing tests.

Terminal

# Run only the math-reasoning test from full-eval
khaos run <agent-name> --eval full-eval --test math-reasoning

# List available tests in an eval
khaos evals list

baseline

Pure observation mode. Runs your agent without injecting any faults or security attacks. Use this to establish performance baselines and verify basic functionality.

Phases: Baseline only
Security: Disabled
Faults: None
Outputs: Latency, token usage, cost, completion rate

Terminal

khaos run <agent-name> --eval baseline

When to use baseline

Run baseline first to establish expected behavior, then compare against resilience and security runs to measure degradation.

quickstart

The default evaluation for rapid development iteration. Provides balanced coverage of security and resilience in about 2 minutes.

Phases: Baseline, Resilience, Security
Security: 10 attack probes
Faults: Basic latency and error injection
Outputs: Full scoring (security, resilience, overall)

Terminal

# Default when no eval specified
khaos run <agent-name>

# Explicit quickstart
khaos run <agent-name> --eval quickstart

full-eval

Comprehensive evaluation for production readiness. Runs the complete test corpus with extensive fault injection and security probing.

Phases: Baseline, Resilience, Tooling, LLM Turbulence, Security
Security: 20+ attack probes across all categories
Faults: Full spectrum (HTTP, LLM, tools, MCP)
Outputs: Detailed breakdown with fault coverage analysis

Terminal

khaos run <agent-name> --eval full-eval

Pre-release recommendation

Run full-eval before major releases to catch edge cases that quickstart might miss.

security

Focused security testing. Khaos detects what your agent can actually do and runs relevant attack categories first (instead of blindly running every category).

Phases: Security only
Security: Capability-aware attack selection with skipped tests marked N/A
Faults: None
Outputs: Per-attack outcomes, vulnerability summary, and replay commands by attack ID

Terminal

# Run capability-aware security evaluation
khaos run <agent-name> --eval security

# Inspect a specific built-in attack definition
khaos attacks show <attack_id>

security-agentic

Specialized security testing for agents that read files and execute commands (like Claude Code, Aider, Cursor, and similar coding assistants). Tests indirect injection attacks via:

File Content Injection: Malicious payloads in README.md, .env, package.json, pyproject.toml, config files
Shell Output Injection: Attacks via git status, npm install, pip install, and other command output
Error Message Injection: Malicious instructions embedded in stack traces and error messages

Evaluation details:

Phases: Baseline, File Injection, Shell Injection, Error Injection, Combined
Security: 30 agentic-specific attacks + traditional attacks
Shims: Filesystem and subprocess interception
Outputs: Vulnerability report with attack vector breakdown

Terminal

# Test a coding agent like Claude Code
khaos run <agent-name> --eval security-agentic

For Agentic Environments

If your agent reads project files or executes shell commands, use this evaluation to ensure it doesn't follow malicious instructions embedded in file content or command output.

Evaluation Phases

Each evaluation runs through one or more testing phases:

Phase	Description
Baseline	Normal operation metrics without interference
Resilience	Response stability under network faults
Tooling	Tool call handling under failure conditions
LLM Turbulence	Behavior during LLM rate limits and timeouts
Security	Adversarial attack resistance

Automatic Applicability (N/A)

Evaluations are designed to work across many agent frameworks. Khaos automatically infers what your agent actually uses (LLM, HTTP/tooling, MCP) and skips tests that don't apply. Skipped tests are treated as N/A and don't affect scoring.

No MCP usage → MCP fault tests are skipped
No tool/file/code surface → tool and agent-environment attacks are skipped
No HTTP/tool usage → HTTP/tool/RAG fault tests are skipped

Canonical Inputs

Each evaluation includes a set of canonical input prompts designed for cross-agent comparison. These inputs cover common use cases:

Greeting and help requests
Factual questions
Multi-step reasoning tasks
Edge case handling

You can also provide custom inputs with --input/--inputs to override an evaluation's canonical prompts:

Terminal

khaos run <agent-name> --eval quickstart --inputs my-tests.yaml

Choosing the Right Evaluation

Use Case	Recommended Evaluation
Local development iteration	`quickstart`
PR/MR CI checks	`quickstart`
Pre-release validation	`full-eval`
Security audit	`security`
Coding agent security (Claude Code, Aider)	`security-agentic`
Performance benchmarking	`baseline`
Regression baseline	`baseline`

Evaluation Packs

Evaluations are powered by packs - YAML-defined test suites with phases, inputs, faults, and goals. Khaos ships with built-in eval packs, and you can create custom packs for your specific testing needs.

Terminal

# Run with a specific pack
khaos run <agent-name> --eval assess
khaos run <agent-name> --eval security-quick
khaos run <agent-name> --eval break

# List all available packs
khaos evals list

Packs support multi-turn conversations, goal-based assertions, per-input difficulty ratings, capability-aware input selection, and smart generation based on intent (BREAK, ASSESS, AUDIT).

See Evaluation Packs for the full pack catalog, YAML schema reference, and custom pack authoring guide.

CLI Reference

Evaluation Packs