JusticeBench

A benchmark dataset for evaluating AI-powered intake classifiers and user-profiling tools used in legal help systems. Each item presents an unstructured user input (chat message, hotline transcript, voicemail, SMS thread, email, referral note, or walk-in note) paired with gold-label outputs in four extraction dimensions: legal-issue classification (LIST taxonomy codes), jurisdiction extraction (state, county, FIPS), audience-served tags (Legal Help Commons 22-term vocabulary), and emergency-triage classification.

The dataset spans the full range of intake conditions a real legal help system encounters. Items vary across legal-issue area (Housing, Family, Public Benefits, Work, Money/Debt, Crime, Immigration, Estates, Education, Civil Rights, Small Business, Native American Law, Traffic), jurisdiction (37 states with county-level FIPS resolution), input channel (Chatbot, Chat, Hotline transcript, Voicemail, SMS, Email, Cross-agency Referral Note, Walk-in Intake Note), and difficulty (Easy, Medium, Hard). Approximately half the items carry an emergency flag in the gold labels.

What distinguishes this dataset from straightforward intake-classification benchmarks is its substantial coverage of negative and adversarial test cases:

Confounder items (24 cases) test how the profiler handles real intake conditions that break naive keyword-matching approaches: off-topic preambles where the actual legal issue is buried under personal context; multi-issue intakes where the user mentions four legal problems at once; multi-jurisdictional confusion from a user who lived in three states in nine months; family-member confusion where the reporter is asking on behalf of multiple relatives with separate legal needs; hostile or distrustful framing that may discourage substantive engagement; inconsistent facts where the user self-corrects mid-narrative; code-mixing between Spanish and English; intakes from users above LSC income eligibility; and detailed-but-irrelevant venting that drowns the substantive legal question.
Emergency-variation items (8 cases) test high-stakes triage across four critical scenarios: mental health crisis with imminent legal deadline, active ICE enforcement, weapon-or-violence threats, and minors in immediate danger.
False-positive emergency items (10 cases) test whether the profiler over-triggers emergency routing based on lexical patterns alone. Each item mentions crisis keywords (DV, gun, suicide, ICE, deportation, abuse) but in contexts that are historical, resolved, or otherwise non-urgent: post-divorce name change after past DV, family-based petition for a relative deported 17 years ago, will-making by someone 14 years stable post-attempt, SSDI disclosure questions about past brief ideation.
No-legal-issue items (10 cases) test whether the profiler appropriately declines to assert a legal-issue code when none fits. Includes coworker interpersonal conflicts, neighbor noise complaints, restaurant service complaints, breakup venting, career-change planning, pet behavior issues, family estrangement, generalized anxiety in treatment, personal financial planning, and roommate annoyances. LIST Code is intentionally blank in the gold labels for these items.

Together these categories make the dataset suitable for measuring not only standard precision and recall on extraction tasks, but also false-positive rates on emergency triage, true-negative rates on no-legal-issue classification, and per-confounder-category robustness.

Size: 180 items total. Approximately 130 items are positive intake cases with full legal-issue and audience-served labels; 24 items are confounders with deliberate extraction obstacles; 8 items are emergency-variation cases; 10 are false-positive emergency keyword cases; 10 are no-legal-issue rambling cases.

Format: CSV, 23 columns. Each item includes: scenario name and ID, real-or-synthetic flag, input channel, the user's verbatim query text, user language code, jurisdiction fields (state, county, FIPS), the expert's short and long explanations of the case context, primary legal issue and full LIST code(s), Top Parent issue area, the audience-served gold tags (kebab-case from the LHC 22-term vocabulary), task and process-step labels, emergency flag, full JSON-formatted case profiler output (the canonical machine-readable gold label), and difficulty grade.

Provenance: Constructed at Stanford Legal Design Lab in 2025 and 2026 across iterative working sessions with practitioners. Initial scenarios derived from Subject Matter Expert reviews of Content Safety Wrapper outputs at legal aid programs in MI, OH, OR, IL, TX, WV, and CA. The 22-term audience_served vocabulary tags align to the Legal Help Data Standard. Multi-channel intake samples (hotline, voicemail, SMS, email, referral notes) were constructed to test classifier robustness across input modalities. Confounder, emergency-variation, false-positive, and no-legal-issue test categories were added to support negative-case evaluation.

Status: Draft v0.2 (May 2026). All items synthetic. Active expansion targeted: adversarial intake cases (users gaming the classifier), multi-turn dialog cases, additional state coverage in underrepresented jurisdictions, and additional channel samples (court self-help kiosk inputs, embedded form-completion sessions, voice-AI agent transcripts).

License: CC BY 4.0.

Citation: Stanford Legal Design Lab. Legal Help User Profiles (v0.2). Legal Help Commons / JusticeBench, May 2026.

User Intake Classification labeled dataset

Description

Access the Dataset