PII/PHI: A Practical Segmentation Playbook

Context

Teams want to move fast with AI, but sensitive data slows them down, often for good reason. The fix isn’t blanket blocking or heroic “privacy fabric” rebuilds. It’s pragmatic segmentation: create clear zones, use stable tokenization/masking patterns, route access by role and purpose, and keep an audit trail that survives scrutiny.

Core Framework: The 4-Zone Model

Red (Identified): Raw PII/PHI. Strict access (need + legal basis). No model training. Only minimal processing with signed data contracts.
Amber (Tokenized): Direct identifiers replaced by format-preserving tokens. Reversible by a secure vault service. For analytics and model inference where linkage is required.
Yellow (Masked/Pseudonymous): Quasi-identifiers generalized or masked; linkage limited. For experimentation, evals, and pre-production tests.
Green (Non-Sensitive/Synthetic): Fully de-identified or high-fidelity synthetic datasets with documented limits. For demos, sandboxes, and broad dev use.

Data Contracts & Purpose Binding

Every flow attaches a purpose tag and lawful basis to access. Data contracts define: fields, sensitivity, allowed zones, retention, cross-border rules, and breach contacts. Contracts live next to pipelines and are versioned like code.

Tokenization & Masking Patterns

Vaulted tokens: Deterministic, per-tenant keys; rotation policy; HSM or equivalent controls.
Format-preserving: Preserve length/pattern for downstream validation (e.g., phone, MRN).
Contextual masking: Coarse in Yellow (e.g., age → 5-year buckets), precise in Amber.

Access Control & Routing

ABAC > RBAC: Attribute-based policies (role, tenant, purpose, zone) enforced in gateways.
Policy-as-code: Versioned rules; CI checks for risky changes.
Safe adapters: Red/Amber data never goes straight to the model; a gateway applies redaction & policy filtering first.

Recommended Actions

Classify 20 fields by sensitivity: ID, quasi-ID, clinical, financial, free text. Map each to Red/Amber/Yellow/Green.
Stand up a token vault: Deterministic tokens per tenant; rotate keys; log reversals.
Ship a policy gateway: Pre-inference scrubbing (regex + ML NER), purpose checks, and zone enforcement.
Adopt de-ID recipes: HIPAA Safe Harbor–like generalization + suppression for Yellow/Green, with utility notes.
Attach contracts to pipelines: YAML beside code; include retention, allowed zones, owners, and DSR endpoints.
Log for audit: Who/what/why for every sensitive call; include prompt, retrieval provenance, and policy decisions.

Common Pitfalls

Free-text leaks: PHI hides in notes and uploads; scrub before indexing or prompt stuffing.
One-zone thinking: Forcing all work into Red creates shadow data copies; use Amber/Yellow to unblock safely.
Weak keys: Tokenization without rotation or per-tenant separation undermines reversibility controls.
Policy drift: Rules in wikis, not in code, no tests, no review, guaranteed gaps.

Quick Win Checklist

Publish a 1-pager: your 4 zones, example fields, and allowed uses.
Wrap your model endpoint with a policy gateway that redacts IDs and strips free-text PHI.
Tokenize 3 top identifiers with a vault; rotate keys quarterly.
Add purpose and zone attributes to access logs; review weekly.

Closing

Segmentation isn’t bureaucracy. It’s how you move quickly without breaking trust. With clear zones, good tokens, enforceable policies, and solid logs, product teams can build confidently while staying on the right side of regulators and customers.

Essay by OneMind Strata Team