Evaluation Packs

Evaluation packs are curated test suites that combine security testing, resilience evaluation, and baseline metrics into a single reproducible run. Choose the pack that matches your workflow.

Available Packs

Run by agent name
First run khaos discover, then use khaos run <agent-name> in all examples below.
Pack        Time         Use Case
baseline    ~1 min       Pure observation, no faults or attacks
quickstart  ~2 min       Fast dev iteration with basic security
full-eval   ~10-15 min   Production readiness assessment
security    ~5-8 min     Deep security testing only
Terminal
# Run with a specific pack
khaos run <agent-name> --eval quickstart
khaos run <agent-name> --eval full-eval
khaos run <agent-name> --eval security
khaos run <agent-name> --eval baseline

baseline

Pure observation mode. Runs your agent without injecting any faults or security attacks. Use this to establish performance baselines and verify basic functionality.

  • Phases: Baseline only
  • Security: Disabled
  • Faults: None
  • Outputs: Latency, token usage, cost, completion rate
Terminal
khaos run <agent-name> --eval baseline

# Or use the --quick shorthand
khaos run <agent-name> --quick
When to use baseline
Run baseline first to establish expected behavior, then compare against resilience and security runs to measure degradation.
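
One way to quantify that degradation is a percent-change calculation between a baseline metric and the same metric from a later run. The sketch below assumes you have already read the two numbers (for example, average latency) from each run's output. `pct_change` is a hypothetical helper, not part of khaos, and the numbers are made up.

```shell
# Hypothetical helper: percent change from a baseline metric to a later run.
# Usage: pct_change <baseline_value> <current_value>
pct_change() {
  awk -v base="$1" -v cur="$2" 'BEGIN {
    if (base == 0) { print "n/a"; exit 1 }   # avoid division by zero
    printf "%.1f\n", (cur - base) / base * 100
  }'
}

# Made-up example: 800 ms at baseline, 1000 ms under fault injection
pct_change 800 1000   # prints 25.0
```

A positive result means the metric grew relative to baseline. For latency or cost that indicates degradation; for completion rate it would indicate improvement.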

quickstart

The default pack for rapid development iteration. Provides balanced coverage of security and resilience in about 2 minutes.

  • Phases: Baseline, Resilience, Security
  • Security: 10 attack probes
  • Faults: Basic latency and error injection
  • Outputs: Full scoring (security, resilience, overall)
Terminal
# Default when no pack specified
khaos run <agent-name>

# Explicit quickstart
khaos run <agent-name> --eval quickstart

full-eval

Comprehensive evaluation for production readiness. Runs the complete test corpus with extensive fault injection and security probing.

  • Phases: Baseline, Resilience, Tooling, LLM Turbulence, Security
  • Security: 20+ attack probes across all categories
  • Faults: Full spectrum (HTTP, LLM, tools, MCP)
  • Outputs: Detailed breakdown with fault coverage analysis
Terminal
khaos run <agent-name> --eval full-eval
Pre-release recommendation
Run full-eval before major releases to catch edge cases that quickstart might miss.

security

Focused security testing with the complete attack corpus. Use when you need deep security analysis without resilience testing overhead.

  • Phases: Security only
  • Security: Full attack corpus (all categories)
  • Faults: None
  • Outputs: Detailed vulnerability report with remediation hints
Terminal
khaos run <agent-name> --eval security

Pack Phases

Each pack runs through one or more evaluation phases:

Phase            Description
Baseline         Normal operation metrics without interference
Resilience       Response stability under network faults
Tooling          Tool call handling under failure conditions
LLM Turbulence   Behavior during LLM rate limits and timeouts
Security         Adversarial attack resistance

Automatic Applicability (N/A)

Packs are designed to work across many agent frameworks. Khaos automatically infers what your agent actually uses (LLM, HTTP/tooling, MCP) and skips tests that don’t apply. Skipped tests are treated as N/A and don’t affect scoring.

  • No MCP usage → MCP fault tests are skipped
  • No HTTP/tool usage → HTTP/tool/RAG fault tests are skipped
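
To illustrate why N/A results leave a score unchanged, the sketch below averages a list of per-test scores while dropping N/A entries from both the sum and the count. It is not Khaos's actual scoring formula, which this page does not describe. `mean_excluding_na` and the sample scores are hypothetical.

```shell
# Illustrative only: average per-test scores, ignoring N/A entries.
# This is NOT Khaos's real scoring formula; it only shows the idea that
# skipped tests drop out of the denominator.
mean_excluding_na() {
  printf '%s\n' "$@" | awk '
    NF && $1 != "N/A" { sum += $1; n++ }   # count only applicable tests
    END { if (n == 0) print "N/A"; else printf "%.1f\n", sum / n }
  '
}

# Two applicable tests (80 and 90) and two skipped ones
mean_excluding_na 80 N/A 90 N/A   # prints 85.0
```

Counting the skipped tests as zeros would instead give 42.5, penalizing an agent for features it does not use.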

Canonical Inputs

Each pack includes a set of canonical input prompts designed for cross-agent comparison. These inputs cover common use cases:

  • Greeting and help requests
  • Factual questions
  • Multi-step reasoning tasks
  • Edge case handling

You can also provide custom inputs with --input/--inputs to override a pack’s canonical prompts:

Terminal
khaos run <agent-name> --eval quickstart --inputs my-tests.yaml

Choosing the Right Pack

Scenario                      Recommended Pack
Local development iteration   quickstart
PR/MR CI checks               quickstart
Pre-release validation        full-eval
Security audit                security
Performance benchmarking      baseline
Regression baseline           baseline
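
In a CI script, the mapping above could be encoded as a small lookup so that each pipeline stage picks its pack by scenario. This is a minimal sketch: `pick_pack` and the scenario keywords (`dev`, `pr`, `release`, and so on) are hypothetical names chosen for illustration. Only the pack names and the `khaos run <agent-name> --eval` form come from this page.

```shell
# Hypothetical helper: map a workflow scenario to a pack name.
pick_pack() {
  case "$1" in
    dev|pr)               echo quickstart ;;
    release)              echo full-eval ;;
    security-audit)       echo security ;;
    benchmark|regression) echo baseline ;;
    *) echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}

# Print the command for a pre-release check; replace <agent-name> before running it
echo "khaos run <agent-name> --eval $(pick_pack release)"
```

Returning a non-zero status for unknown scenarios lets a CI job fail fast instead of silently falling back to the default pack.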