PHI Detection Skill
Purpose
Ensure no Protected Health Information (PHI) enters the evaluation pipeline. Enforces ScribeGoat2's "synthetic only" data policy for HIPAA compliance.
When to Use
- •Before committing new scenario files
- •CI/CD pre-merge validation
- •Periodic repository audits
- •Before sharing evaluation data externally
Triggers
- •"scan for PHI"
- •"check for protected health information"
- •"validate data is synthetic"
- •"run PHI detection"
Tools
bash
# Full repository scan (CI mode) python scripts/detect_phi.py --strict # Scan specific directory python scripts/detect_phi.py --path bloom_medical_eval/scenarios/ # Show verbose matches python scripts/detect_phi.py --verbose
Prerequisites
- •Python 3.11+
- •No external dependencies (uses stdlib only)
Input Schema
yaml
path: type: path default: "." description: Directory or file to scan strict: type: boolean default: false description: Fail on warnings (provenance metadata) verbose: type: boolean default: false description: Show all matched patterns
Output Schema
yaml
status: enum # pass, fail, warning
phi_detected: boolean
matches:
- file: string
pattern: string
severity: enum # HIGH, MEDIUM, LOW
examples: [string]
count: integer
files_scanned: integer
excluded_directories: [string]
PHI Patterns Detected
| Pattern | Severity | Example |
|---|---|---|
| SSN | HIGH | 123-45-6789 |
| Medical Record Number | HIGH | MRN: 12345678 |
| Full Date of Birth | HIGH | DOB: 01/15/1985 |
| Phone Number | MEDIUM | 555-123-4567 |
| Personal Email | MEDIUM | john.doe@gmail.com |
| Street Address | MEDIUM | 123 Main Street |
| Patient Full Name | HIGH | Patient: John Smith |
Whitelist Patterns
The following patterns are not flagged (legitimate use cases):
- •Example domains (
example.com) - •Fake phone numbers (
555-xxxx) - •Toll-free numbers (
800-xxx-xxxx,888-xxx-xxxx, etc.) - •Crisis hotlines (
988) - •Medical abbreviations (
PT,ST elevation)
Excluded Directories
These directories contain evaluation artifacts and are excluded:
- •
results*- Evaluation outputs - •
reports/- Generated reports - •
experiments/- Experimental data - •
.private/- Private test data - •
scripts/- Source code - •
docs/- Documentation
Success Criteria
| Metric | Threshold | Severity |
|---|---|---|
phi_detected | false | 🔴 Critical |
high_severity_matches | 0 | 🔴 Critical |
medium_severity_matches | 0 | 🟠 High |
Safety Gates
yaml
- gate: no_phi_detected metric: phi_detected operator: "==" threshold: false action: block_merge severity: critical - gate: no_high_severity metric: high_severity_matches operator: "==" threshold: 0 action: block_merge severity: critical
HIPAA Safe Harbor Compliance
This skill implements detection of HIPAA's 18 Safe Harbor identifiers:
- •Names ✅
- •Geographic data ✅
- •Dates (except year) ✅
- •Phone numbers ✅
- •Fax numbers ✅
- •Email addresses ✅
- •Social Security numbers ✅
- •Medical record numbers ✅
- •Health plan beneficiary numbers
- •Account numbers
- •Certificate/license numbers
- •Vehicle identifiers
- •Device identifiers
- •Web URLs
- •IP addresses
- •Biometric identifiers
- •Full-face photographs
- •Any other unique identifying number
Related Skills
- •
bloom_integrity_verification- Verify data integrity after PHI check - •
crisis_persistence_eval- Requires PHI-clean scenarios