PII Detection
Khaos includes a built-in PII detection engine that scans agent responses for personally identifiable information. The detector covers seven categories of sensitive data with configurable risk levels and pattern matching.
Quick Start
from khaos.pii import PIIDetector
detector = PIIDetector()
# Scan a string for PII
result = detector.scan("Contact me at john@example.com or 555-123-4567")
print(f"PII found: {result.has_pii}")
print(f"Matches: {len(result.matches)}")
for match in result.matches:
print(f" {match.pattern_name} ({match.category}): {match.matched_text}")
# Quick boolean check
has_pii = detector.quick_check("No sensitive data here")
print(has_pii) # False
# Mask PII in output
masked = detector.mask_text("SSN: 123-45-6789")
print(masked) # "SSN: ***-**-****"PIICategory Enum
PII patterns are organized into seven categories covering the most common types of sensitive data encountered in agent outputs.
| Category | Description | Example Patterns |
|---|---|---|
PERSONAL_ID | Government-issued identifiers | SSN, passport numbers, driver's license |
FINANCIAL | Financial account information | Credit card numbers, bank accounts, routing numbers |
CONTACT | Contact information | Email addresses, phone numbers, physical addresses |
AUTHENTICATION | Secrets and credentials | API keys, passwords, tokens, SSH keys |
NETWORK | Network identifiers | IP addresses, MAC addresses, URLs with credentials |
MEDICAL | Health-related information | Medical record numbers, health plan IDs |
CRYPTO | Cryptocurrency identifiers | Wallet addresses, private keys |
RiskLevel Enum
Each PII pattern has an assigned risk level that indicates the severity of exposure.
| Level | Description | Examples |
|---|---|---|
CRITICAL | Immediate risk of identity theft or financial loss | SSN, credit card numbers, API keys, private keys |
HIGH | Significant privacy risk | Passport numbers, bank accounts, passwords |
MEDIUM | Moderate privacy concern | Email addresses, phone numbers, IP addresses |
LOW | Minor privacy concern | Names in specific contexts, general URLs |
PIIDetector Class
The PIIDetector is the main interface for scanning text. Configure it to target specific categories or risk levels.
| Parameter | Type | Default | Description |
|---|---|---|---|
categories | list[PIICategory] | All | Categories to scan for |
min_risk_level | RiskLevel | LOW | Minimum risk level to report |
include_context | bool | True | Include surrounding text context in matches |
context_chars | int | 50 | Number of context characters around each match |
custom_patterns | list[PIIPattern] | None | Additional custom patterns to include |
Methods
| Method | Returns | Description |
|---|---|---|
scan(text) | PIIScanResult | Scan a single string for PII |
scan_multiple(texts) | list[PIIScanResult] | Scan multiple strings |
quick_check(text) | bool | Fast boolean check (stops at first match) |
mask_text(text) | str | Return text with PII replaced by mask characters |
PIIPattern Dataclass
Each detection pattern is defined by a PIIPattern instance. You can create custom patterns using the same structure.
| Field | Type | Description |
|---|---|---|
name | str | Pattern identifier |
category | PIICategory | Which category this pattern belongs to |
risk_level | RiskLevel | Severity of this pattern's matches |
pattern | str | Regular expression pattern |
mask_char | str | Character used for masking (default "*") |
description | str | Human-readable description |
from khaos.pii import PIIPattern, PIICategory, PIIDetector
from khaos.pii.patterns import RiskLevel
# Define a custom pattern
employee_id_pattern = PIIPattern(
name="employee_id",
category=PIICategory.PERSONAL_ID,
risk_level=RiskLevel.HIGH,
pattern=r"EMP-\d{6}",
mask_char="X",
description="Internal employee ID format",
)
# Use it in a detector
detector = PIIDetector(custom_patterns=[employee_id_pattern])
result = detector.scan("Employee EMP-123456 reported the issue")
print(result.has_pii) # TruePIIMatch and PIIScanResult
When PII is detected, the scanner returns structured results with full match context.
PIIMatch
| Field | Type | Description |
|---|---|---|
pattern_name | str | Name of the matched pattern |
category | PIICategory | Category of the match |
risk_level | RiskLevel | Risk level of the match |
matched_text | str | The actual matched text |
start | int | Start position in the source text |
end | int | End position in the source text |
line_number | int | Line number of the match |
context | str | Surrounding text context |
PIIScanResult
| Field | Type | Description |
|---|---|---|
matches | list[PIIMatch] | All PII matches found |
text_length | int | Length of the scanned text |
has_pii | bool | Whether any PII was detected |
risk_summary | dict | Count of matches by risk level |
category_summary | dict | Count of matches by category |
critical_count | int | Number of CRITICAL matches |
high_count | int | Number of HIGH matches |
Convenience Detectors
Khaos provides pre-configured detector instances for common use cases.
| Detector | Categories | Min Risk | Use Case |
|---|---|---|---|
DEFAULT_DETECTOR | All | LOW | General-purpose scanning |
AUTH_DETECTOR | AUTHENTICATION | MEDIUM | Credential and secret detection |
FINANCIAL_DETECTOR | FINANCIAL | HIGH | Financial data protection |
CRITICAL_DETECTOR | All | CRITICAL | Only highest-severity matches |
from khaos.pii.detector import (
DEFAULT_DETECTOR,
AUTH_DETECTOR,
FINANCIAL_DETECTOR,
CRITICAL_DETECTOR,
)
# Use a pre-configured detector
result = AUTH_DETECTOR.scan(agent_response)
if result.has_pii:
print(f"Credentials detected: {result.matches}")
# Financial-only scanning
result = FINANCIAL_DETECTOR.scan(agent_response)
if result.critical_count > 0:
print("CRITICAL: Financial data exposed")Integration with Testing
In the stable release, PII helpers are provided via khaos.pii. UsePIIDetector or mask_pii() directly in tests.
from khaos.pii import PIIDetector, mask_pii
detector = PIIDetector()
# Redact PII from a single response string
clean_response = detector.mask_text(agent_response)
# Redact PII across transcript messages
clean_transcript = [
{**msg, "content": mask_pii(msg["content"])}
for msg in transcript
]Related Documentation
- Security Testing - Security testing overview
- Attack Registry - PII leakage attack category
- Enriched Testing API - Programmatic access to PII scan results