PII Masking

Help providers transform confidential data into usable, secure datasets by removing identifiers and possibly replacing them with synthetic but realistic fictional information.
Task Description
Legal help organizations, courts, and justice system actors handle large volumes of confidential data—from case notes and filings to intake records and communications. This data contains valuable insights that could improve services, support research, inform policy, or train safer AI tools. But before it can be used for those purposes, it must be de-identified to protect the privacy and safety of clients and staff.

A PII Masking system works in a secure environment: it detects the relevant PII, ignores possible identifiers that are not relevant, and then either redacts or replaces the PII it found.
It scans confidential records—including text fields, stories, transcripts, uploaded/saved images and documents, and structured case management data—to identify and remove personally identifying information (PII), such as names, dates of birth, addresses, phone numbers, and unique case numbers.
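As a rough illustration of the detect-then-redact step, the sketch below finds a few identifiers in free text and replaces each with a bracketed label. The patterns, the `find_pii` and `redact` names, and the US-style formats are all assumptions made for this example. A real system would likely pair patterns like these with named-entity recognition and OCR, since names and addresses cannot be caught by simple patterns.

```python
import re

# Hypothetical patterns for a few HIGH-risk identifiers (US-style formats assumed).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"),
}


def find_pii(text):
    """Return (label, start, end, matched_text) tuples sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])


def redact(text):
    """Replace each detected span with a bracketed label such as [SSN]."""
    out, cursor = [], 0
    for label, start, end, _ in find_pii(text):
        if start < cursor:  # skip a match that overlaps one already redacted
            continue
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


sample = "Call (415) 555-0132 or email jdoe@example.org. SSN 123-45-6789. Hearing on 2024-03-05."
print(redact(sample))
```

Note that the hearing date is left in place, which matches the idea of disregarding fields that are not PII.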
Rather than just redacting content, the system can also replace the real PII with synthetic identifiers—fictional, non-reversible replacements that remain internally consistent. For example, a client named “Martha Pizzoli” might become “Client A07” throughout a record set. This preserves the structure and meaning of the data while ensuring no real person or organization can be re-identified.
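One way to keep replacements internally consistent is to hold a lookup table for the duration of a masking run, so that every mention of the same person maps to the same placeholder. The sketch below is a minimal version of that idea. The `Pseudonymizer` class, its method names, and the assumption that client names are already known (for example, from case management fields) are hypothetical choices for this example. Placeholders come from a counter rather than from the name itself, so they carry no information about the original value; discarding the table after the run is what would make the replacement non-reversible.

```python
import re


class Pseudonymizer:
    """Maps each real value to a stable placeholder such as 'Client A01'."""

    def __init__(self, prefix="Client A"):
        self.prefix = prefix
        # Normalized real value -> placeholder. Keep in memory only (assumption).
        self._mapping = {}

    def placeholder_for(self, real_value):
        key = " ".join(real_value.lower().split())  # ignore case and extra spaces
        if key not in self._mapping:
            self._mapping[key] = f"{self.prefix}{len(self._mapping) + 1:02d}"
        return self._mapping[key]

    def mask(self, text, known_names):
        # Longest names first, so a longer name is never partly replaced by a shorter one.
        for name in sorted(known_names, key=len, reverse=True):
            pattern = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
            text = pattern.sub(lambda m, n=name: self.placeholder_for(n), text)
        return text


p = Pseudonymizer()
names = ["Martha Pizzoli", "Sam Ortiz"]
records = [
    "Martha Pizzoli called on Monday.",
    "Notice mailed to MARTHA PIZZOLI and Sam Ortiz.",
]
for record in records:
    print(p.mask(record, names))
```

The same `Pseudonymizer` instance has to be reused across every record in the set; a fresh instance per record would restart the numbering and break consistency.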
This task is especially valuable for organizations looking to safely share data with researchers, build internal dashboards, or develop AI tools without violating confidentiality rules. It enables use of rich, detailed records while fully protecting sensitive information.
Success means the data has been de-identified with high confidence—no identifying details remain, and the resulting synthetic data preserves accuracy, consistency, and utility for further use.
How to Measure Quality?
Our working group has established a list of items that are Definite PII, Possible PII, and Not PII. We have also developed a rubric to judge the quality of a PII Masking solution's output.
What counts as PII?
The Stanford Legal Design Lab asked a group of subject matter experts working on legal help teams to give feedback (using this survey) on common legal document fields. Respondents categorized each field as Clear PII, Possible PII, or Not PII.
The respondents’ answers were then analyzed by the team. They used the categorizations to assign a numeric score and then clustered the responses accordingly. The team also gathered qualitative feedback and additional field suggestions.
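To make the scoring step concrete, here is a minimal sketch of how categorical answers could be turned into a numeric score and a risk tier. The source does not state the actual score values or clustering method, so the 2/1/0 scale, the cutoffs, the `tier_for` name, and the sample answers below are all assumptions for illustration only.

```python
from statistics import mean

# Assumed scale; the team's real numeric scores are not given in this document.
SCORES = {"Clear PII": 2, "Possible PII": 1, "Not PII": 0}


def tier_for(responses, high_cutoff=1.5, low_cutoff=0.5):
    """Average the respondents' answers for one field and bucket the result."""
    avg = mean(SCORES[r] for r in responses)
    if avg >= high_cutoff:
        return "High"
    if avg >= low_cutoff:
        return "Medium"
    return "Low"


# Made-up answers for a single field, for illustration only.
print(tier_for(["Clear PII", "Clear PII", "Possible PII"]))
```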
High risk / Definite PII (mask by default)
- SSN (full), bank/financial account numbers, driver’s license/state ID (full), passport number
- Phone number, email
- Date of birth (DOB)
- Signature
- Home/mailing address, property address
- Insurance policy number
- Criminal history details (often treated as highly sensitive/identifying in context)
Medium risk / Possible or indirect PII (masking depends on profile + context)
- Last 3 digits of SSN/ID
- Case/court number (can be identifying via public record linkage)
- Initials, age
- Income amount, financial assets
- Medical condition, pregnancy status
- Employer; school/daycare names
- VIN/license plate/professional license numbers
- Place of birth; immigration identifiers (e.g., A-number)
- Address components alone (ZIP code, census tract) depending on purpose
Low risk / Usually not PII (generally do not mask)
- Hearing date/time; document date
- Rent amount; generic financial totals
- Judge name; insurance company name
- Laws/citations/legal labels
Note: many teams distinguish PII (identifiers) from sensitive information (medical, pregnancy, income/assets). Your profile may treat “sensitive” information as maskable even when it is not a direct identifier.
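The three tiers and the note about sensitive information could be captured in a small masking profile. The sketch below is one possible shape, not a prescribed format. The field keys, the `should_mask` function, and the `mask_medium` and `mask_sensitive` options are hypothetical names, and only a handful of fields from the lists above are included.

```python
from enum import Enum


class Tier(Enum):
    HIGH = "high"      # mask by default
    MEDIUM = "medium"  # depends on profile and context
    LOW = "low"        # generally do not mask


# A small subset of the fields listed above, keyed by hypothetical field names.
FIELD_TIERS = {
    "ssn": Tier.HIGH,
    "date_of_birth": Tier.HIGH,
    "phone": Tier.HIGH,
    "case_number": Tier.MEDIUM,
    "employer": Tier.MEDIUM,
    "medical_condition": Tier.MEDIUM,
    "pregnancy_status": Tier.MEDIUM,
    "income_amount": Tier.MEDIUM,
    "hearing_date": Tier.LOW,
    "judge_name": Tier.LOW,
    "rent_amount": Tier.LOW,
}

# Sensitive but not necessarily identifying.
SENSITIVE_FIELDS = {"medical_condition", "pregnancy_status", "income_amount"}


def should_mask(field, mask_medium=frozenset(), mask_sensitive=False):
    """Decide whether a field is masked under a given profile."""
    tier = FIELD_TIERS[field]  # raises KeyError for fields the profile does not know
    if tier is Tier.HIGH:
        return True
    if mask_sensitive and field in SENSITIVE_FIELDS:
        return True
    if tier is Tier.MEDIUM:
        return field in mask_medium
    return False


# A research-sharing profile might mask case numbers and all sensitive fields.
print(should_mask("case_number", mask_medium={"case_number"}, mask_sensitive=True))
```

Raising an error on unknown fields is a deliberate choice in this sketch: it forces someone to classify a new field rather than letting it pass through unmasked.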
Rubric to Evaluate PII Masking Performance

PII MASKING RUBRIC
Rating: 1 = Not OK; 2 = OK with cleanup; 3 = Good
HARD FAIL (if any are true → Not OK)
☐ A HIGH/DEFINITE PII item is still visible (SSN/DOB/address/ID/phone/email/signature)
☐ You can still find PII by searching/copy-pasting (text layer not cleaned)
☐ Masking is reversible (annotations/layers can be removed)
1) CATCHES PII
Did it find the PII that’s there?
☐ Catches HIGH/DEFINITE PII throughout the doc (including repeats)
☐ Catches key MEDIUM PII as expected
Score (1–3): _____
Notes: ________________________________________________________________
2) MASKS THE RIGHT THINGS
Did it mask what it should—and leave what it shouldn’t?
☐ Masks HIGH PII
☐ Does NOT mask LOW/NOT PII (dates, rent totals, judge name, citations)
Score (1–3): _____
Notes: ________________________________________________________________
3) MASKING IS REAL
Is the PII truly gone, not just hidden?
☐ Can’t search or copy PII from the output
☐ Works across exports (PDF/PNG/TXT) without leakage
Score (1–3): _____
Notes / test used (search, copy/paste, OCR): _____________________________
4) AVOIDS OVERMASKING
Did it avoid hiding important non-PII?
☐ Doesn’t black out whole paragraphs because of one identifier
☐ Keeps legal meaning readable (dates, amounts, key events still visible)
Score (1–3): _____
Notes: ________________________________________________________________
5) CONSISTENT + REVIEWABLE
Is it easy to verify and fix?
☐ Masks all instances (headers/footers/tables/exhibits)
☐ Shows what it masked (simple log or highlights) OR makes manual correction easy
Score (1–3): _____
Notes: ________________________________________________________________
OVERALL RESULT
[ ] NOT OK (any Hard Fail OR 2+ items scored “1”)
[ ] OK WITH CLEANUP (no Hard Fail; mostly “2”)
[ ] GOOD (no Hard Fail; mostly “3”)
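For teams that record rubric scores in a spreadsheet or script, the overall result can be computed from the five section scores and the hard-fail count. The sketch below is one reading of the rules above. The rubric does not define "mostly", so this example assumes that GOOD requires a strict majority of 3s and no 1s, with everything else that is not NOT OK falling into OK WITH CLEANUP. The `overall_result` name and its parameters are hypothetical.

```python
def overall_result(scores, hard_fails=0):
    """Combine five section scores (1-3) and a hard-fail count into one result."""
    if len(scores) != 5 or any(s not in (1, 2, 3) for s in scores):
        raise ValueError("expected five scores, each 1, 2, or 3")
    # Any hard fail, or two or more sections scored 1, is NOT OK.
    if hard_fails > 0 or scores.count(1) >= 2:
        return "NOT OK"
    # Assumption: "mostly 3" means a strict majority of 3s and no section scored 1.
    if scores.count(3) > len(scores) / 2 and 1 not in scores:
        return "GOOD"
    return "OK WITH CLEANUP"


print(overall_result([3, 3, 3, 2, 3]))
```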
PII QUICK REFERENCE (for raters; examples)
HIGH/DEFINITE PII (must be masked): SSN; driver’s license/state ID; bank/financial account numbers; phone; email; DOB; signature; home/mailing address; property address; insurance policy #; passport #; criminal history details.
MEDIUM PII (should usually be masked if present): last-3 digits of SSN/ID; case/court number; initials; age; income/assets; medical condition; pregnancy status; employer; school/daycare names; VIN/license plate/professional license #; place of birth; immigration IDs (A-number); ZIP/census tract when it could identify someone in context.
LOW/NOT PII (usually should NOT be masked): hearing date/time; document date; rent amount; generic financial totals; judge name; insurance company name; laws/citations/legal labels.
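Raters checking the "Masking is real" section and the search/copy-paste hard fail can partly automate the test when they have a list of the PII values known to be in the original document. The sketch below assumes the output has already been converted to plain text by some other tool (text-layer extraction or OCR, not shown here). The `find_leaks` name and the seven-digit threshold are assumptions for this example.

```python
import re


def _digits(s):
    """Strip everything except digits, so 123-45-6789 and 123 45 6789 compare equal."""
    return re.sub(r"\D", "", s)


def find_leaks(output_text, known_pii):
    """Return the known PII values that can still be found in the output text."""
    haystack = output_text.lower()
    haystack_digits = _digits(output_text)
    leaks = []
    for value in known_pii:
        needle_digits = _digits(value)
        if value.lower() in haystack:
            leaks.append(value)
        # Numeric identifiers can survive with different punctuation or spacing.
        # This digits-only comparison may over-report, which is acceptable for a
        # screen that is followed by manual review.
        elif len(needle_digits) >= 7 and needle_digits in haystack_digits:
            leaks.append(value)
    return leaks


known = ["123-45-6789", "Martha Pizzoli", "jdoe@example.org"]
leaky = "Client A01 (SSN 123 45 6789) can be reached at [EMAIL]. Signed, MARTHA PIZZOLI"
print(find_leaks(leaky, known))
```

An empty result does not prove the output is clean: it only shows that the listed values were not found in the extracted text, so visual inspection and an OCR pass on image exports are still needed.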