PII Detection

Khaos includes a built-in PII detection engine that scans agent responses for personally identifiable information. The detector covers seven categories of sensitive data, assigns each pattern a risk level, and can be extended with custom patterns.

Quick Start

Python
from khaos.pii import PIIDetector

detector = PIIDetector()

# Scan a string for PII
result = detector.scan("Contact me at john@example.com or 555-123-4567")
print(f"PII found: {result.has_pii}")
print(f"Matches: {len(result.matches)}")
for match in result.matches:
    print(f"  {match.pattern_name} ({match.category}): {match.matched_text}")

# Quick boolean check
has_pii = detector.quick_check("No sensitive data here")
print(has_pii)  # False

# Mask PII in output
masked = detector.mask_text("SSN: 123-45-6789")
print(masked)  # "SSN: ***-**-****"

PIICategory Enum

PII patterns are organized into seven categories covering the most common types of sensitive data encountered in agent outputs.

| Category | Description | Example Patterns |
|---|---|---|
| PERSONAL_ID | Government-issued identifiers | SSN, passport numbers, driver's license |
| FINANCIAL | Financial account information | Credit card numbers, bank accounts, routing numbers |
| CONTACT | Contact information | Email addresses, phone numbers, physical addresses |
| AUTHENTICATION | Secrets and credentials | API keys, passwords, tokens, SSH keys |
| NETWORK | Network identifiers | IP addresses, MAC addresses, URLs with credentials |
| MEDICAL | Health-related information | Medical record numbers, health plan IDs |
| CRYPTO | Cryptocurrency identifiers | Wallet addresses, private keys |

RiskLevel Enum

Each PII pattern has an assigned risk level that indicates the severity of exposure.

| Level | Description | Examples |
|---|---|---|
| CRITICAL | Immediate risk of identity theft or financial loss | SSN, credit card numbers, API keys, private keys |
| HIGH | Significant privacy risk | Passport numbers, bank accounts, passwords |
| MEDIUM | Moderate privacy concern | Email addresses, phone numbers, IP addresses |
| LOW | Minor privacy concern | Names in specific contexts, general URLs |

PIIDetector Class

The PIIDetector is the main interface for scanning text. Configure it to target specific categories or risk levels.

| Parameter | Type | Default | Description |
|---|---|---|---|
| categories | list[PIICategory] | All | Categories to scan for |
| min_risk_level | RiskLevel | LOW | Minimum risk level to report |
| include_context | bool | True | Include surrounding text context in matches |
| context_chars | int | 50 | Number of context characters around each match |
| custom_patterns | list[PIIPattern] | None | Additional custom patterns to include |
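
The min_risk_level parameter implies an ordering of LOW < MEDIUM < HIGH < CRITICAL. The sketch below illustrates that threshold logic with a standalone enum. The `Risk` class, its integer values, and `meets_threshold` are hypothetical stand-ins for illustration, not Khaos's RiskLevel or its internal comparison.

```python
from enum import IntEnum


class Risk(IntEnum):
    """Hypothetical stand-in for RiskLevel; the integer values are an assumption."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def meets_threshold(level: Risk, min_level: Risk) -> bool:
    """Return True when a match at `level` should be reported."""
    return level >= min_level


# Keep only findings at or above HIGH.
findings = [("email", Risk.MEDIUM), ("ssn", Risk.CRITICAL), ("passport", Risk.HIGH)]
reported = [name for name, level in findings if meets_threshold(level, Risk.HIGH)]
print(reported)  # ['ssn', 'passport']
```

With the default threshold of LOW, every match passes the filter, which is consistent with the table's description of LOW as the minimum level to report.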

Methods

| Method | Returns | Description |
|---|---|---|
| scan(text) | PIIScanResult | Scan a single string for PII |
| scan_multiple(texts) | list[PIIScanResult] | Scan multiple strings |
| quick_check(text) | bool | Fast boolean check (stops at first match) |
| mask_text(text) | str | Return text with PII replaced by mask characters |
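
To make the difference between these methods concrete, here is a minimal standalone sketch using only the standard library. The function names mirror the documented methods, but the three regular expressions and the masking rule (replace letters and digits, keep separators) are assumptions for illustration; Khaos's built-in patterns and internals are not shown in this document.

```python
import re

# Illustrative patterns only (assumed, not Khaos's built-in pattern set).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan(text: str) -> list[tuple[str, str]]:
    """Collect every (pattern name, matched text) pair."""
    return [
        (name, m.group())
        for name, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]


def quick_check(text: str) -> bool:
    """Stop at the first hit; any() short-circuits across patterns."""
    return any(pattern.search(text) for pattern in PATTERNS.values())


def mask_text(text: str, mask_char: str = "*") -> str:
    """Replace letters and digits inside each match, keeping separators."""
    for pattern in PATTERNS.values():
        text = pattern.sub(
            lambda m: re.sub(r"[A-Za-z0-9]", mask_char, m.group()), text
        )
    return text


print(scan("Contact me at john@example.com or 555-123-4567"))
print(quick_check("No sensitive data here"))  # False
print(mask_text("SSN: 123-45-6789"))          # SSN: ***-**-****
```

The short-circuiting check avoids building a list of matches, which is the usual reason to prefer a boolean check when only a yes/no answer is needed.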

PIIPattern Dataclass

Each detection pattern is defined by a PIIPattern instance. You can create custom patterns using the same structure.

| Field | Type | Description |
|---|---|---|
| name | str | Pattern identifier |
| category | PIICategory | Which category this pattern belongs to |
| risk_level | RiskLevel | Severity of this pattern's matches |
| pattern | str | Regular expression pattern |
| mask_char | str | Character used for masking (default "*") |
| description | str | Human-readable description |

Python
from khaos.pii import PIIPattern, PIICategory, PIIDetector
from khaos.pii.patterns import RiskLevel

# Define a custom pattern
employee_id_pattern = PIIPattern(
    name="employee_id",
    category=PIICategory.PERSONAL_ID,
    risk_level=RiskLevel.HIGH,
    pattern=r"EMP-\d{6}",
    mask_char="X",
    description="Internal employee ID format",
)

# Use it in a detector
detector = PIIDetector(custom_patterns=[employee_id_pattern])
result = detector.scan("Employee EMP-123456 reported the issue")
print(result.has_pii)  # True
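
Before registering a custom pattern, it can help to exercise the regular expression on its own with Python's `re` module. The sketch below assumes the pattern string is applied with search semantics, which this document does not confirm; the `bounded` variant and the sample strings are hypothetical.

```python
import re

unbounded = re.compile(r"EMP-\d{6}")     # pattern string from the example above
bounded = re.compile(r"\bEMP-\d{6}\b")   # hypothetical stricter variant

samples = [
    "Employee EMP-123456 reported the issue",
    "Ticket EMP-1234567",   # seven digits
    "TEMP-123456",          # EMP preceded by another letter
]
for text in samples:
    # Prints the sample, then whether each variant finds a match.
    print(text, bool(unbounded.search(text)), bool(bounded.search(text)))
```

The unbounded pattern matches all three samples, while the word-boundary variant matches only the first. Which behavior is right depends on how strictly the ID format should be enforced.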

PIIMatch and PIIScanResult

When PII is detected, the scanner returns structured results with full match context.

PIIMatch

| Field | Type | Description |
|---|---|---|
| pattern_name | str | Name of the matched pattern |
| category | PIICategory | Category of the match |
| risk_level | RiskLevel | Risk level of the match |
| matched_text | str | The actual matched text |
| start | int | Start position in the source text |
| end | int | End position in the source text |
| line_number | int | Line number of the match |
| context | str | Surrounding text context |
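
The positional fields relate to the source text in a simple way. The sketch below shows one plausible derivation of line_number and context from start and end. The `locate` helper is hypothetical, and both the 1-based line numbering and the symmetric context window are assumptions rather than documented Khaos behavior.

```python
def locate(text: str, start: int, end: int, context_chars: int = 50) -> tuple[int, str]:
    """Return (line_number, context) for the span text[start:end]."""
    # Assumption: line numbers are 1-based.
    line_number = text.count("\n", 0, start) + 1
    # Clamp the left edge; slicing already clamps the right edge.
    context = text[max(0, start - context_chars):end + context_chars]
    return line_number, context


text = "Hello,\nreach me at john@example.com today."
start = text.index("john@example.com")
end = start + len("john@example.com")
print(locate(text, start, end, context_chars=12))
# (2, 'reach me at john@example.com today.')
```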

PIIScanResult

| Field | Type | Description |
|---|---|---|
| matches | list[PIIMatch] | All PII matches found |
| text_length | int | Length of the scanned text |
| has_pii | bool | Whether any PII was detected |
| risk_summary | dict | Count of matches by risk level |
| category_summary | dict | Count of matches by category |
| critical_count | int | Number of CRITICAL matches |
| high_count | int | Number of HIGH matches |
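
The result exposes summaries by risk level and by category, but not by pattern name. If a per-pattern tally is useful, it can be derived from matches. The `pattern_counts` helper below is hypothetical, and the `SimpleNamespace` objects are stand-ins shaped like the documented fields so the sketch runs without Khaos installed.

```python
from collections import Counter
from types import SimpleNamespace


def pattern_counts(result) -> dict[str, int]:
    """Count matches per pattern_name for any object exposing `.matches`."""
    return dict(Counter(match.pattern_name for match in result.matches))


# Stand-in object shaped like the documented fields, for illustration only.
fake_result = SimpleNamespace(
    matches=[
        SimpleNamespace(pattern_name="email"),
        SimpleNamespace(pattern_name="email"),
        SimpleNamespace(pattern_name="ssn"),
    ]
)
print(pattern_counts(fake_result))  # {'email': 2, 'ssn': 1}
```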

Convenience Detectors

Khaos provides pre-configured detector instances for common use cases.

| Detector | Categories | Min Risk | Use Case |
|---|---|---|---|
| DEFAULT_DETECTOR | All | LOW | General-purpose scanning |
| AUTH_DETECTOR | AUTHENTICATION | MEDIUM | Credential and secret detection |
| FINANCIAL_DETECTOR | FINANCIAL | HIGH | Financial data protection |
| CRITICAL_DETECTOR | All | CRITICAL | Only highest-severity matches |

Python
from khaos.pii.detector import (
    DEFAULT_DETECTOR,
    AUTH_DETECTOR,
    FINANCIAL_DETECTOR,
    CRITICAL_DETECTOR,
)

# Use a pre-configured detector
# (agent_response is assumed to hold the text returned by your agent)
result = AUTH_DETECTOR.scan(agent_response)
if result.has_pii:
    print(f"Credentials detected: {result.matches}")

# Financial-only scanning
result = FINANCIAL_DETECTOR.scan(agent_response)
if result.critical_count > 0:
    print("CRITICAL: Financial data exposed")
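
The critical_count and high_count fields lend themselves to a simple release gate. The sketch below shows one possible policy. `should_block` and its `max_high` parameter are hypothetical, and the `SimpleNamespace` objects stand in for a PIIScanResult so the example runs on its own.

```python
from types import SimpleNamespace


def should_block(result, max_high: int = 0) -> bool:
    """Hypothetical policy: block on any CRITICAL match or too many HIGH matches."""
    return result.critical_count > 0 or result.high_count > max_high


# Stand-ins with the documented counters; real code would pass a PIIScanResult.
clean = SimpleNamespace(critical_count=0, high_count=0)
leaky = SimpleNamespace(critical_count=1, high_count=0)
noisy = SimpleNamespace(critical_count=0, high_count=2)
print(should_block(clean), should_block(leaky), should_block(noisy, max_high=2))
# False True False
```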

Integration with Testing

In the stable release, PII helpers are provided via khaos.pii. Use PIIDetector or mask_pii() directly in tests.

Python
from khaos.pii import PIIDetector, mask_pii

detector = PIIDetector()

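# Redact PII from a single response string
# (agent_response is assumed to hold the text returned by your agent)
clean_response = detector.mask_text(agent_response)

# Redact PII across transcript messages
# (transcript is assumed to be a list of dicts with a "content" key)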
clean_transcript = [
    {**msg, "content": mask_pii(msg["content"])}
    for msg in transcript
]

Automatic PII scanning

When security testing is enabled, Khaos automatically scans all agent responses for PII leakage. Any detected PII is flagged in the security report without additional configuration.
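
In your own test suites, failure messages often end up in CI logs, so it may be preferable to report where a leak occurred without echoing the leaked value. The sketch below is one way to do that. `describe_leaks` and `assert_no_pii` are hypothetical helpers, not part of khaos.pii, and they rely only on the PIIMatch and PIIScanResult fields documented above.

```python
from types import SimpleNamespace


def describe_leaks(result) -> str:
    """Summarize matches without echoing the sensitive text itself."""
    lines = [
        f"{m.pattern_name} (line {m.line_number}, chars {m.start}-{m.end})"
        for m in result.matches
    ]
    return "; ".join(lines) if lines else "no PII detected"


def assert_no_pii(result) -> None:
    """Raise AssertionError when the result reports any PII."""
    if result.has_pii:
        raise AssertionError(f"PII leaked: {describe_leaks(result)}")


# Stand-in shaped like a scan result; real code would pass detector.scan(...).
leak = SimpleNamespace(
    has_pii=True,
    matches=[
        SimpleNamespace(
            pattern_name="ssn", line_number=1, start=5, end=16,
            matched_text="123-45-6789",
        )
    ],
)
print(describe_leaks(leak))  # ssn (line 1, chars 5-16)
```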

Related Documentation