JusticeBench

PRBench (Professional Reasoning Benchmark) is a public, rubric-based evaluation suite for high-stakes professional tasks in Law and Finance. It contains 1,100 expert-authored prompts (600 Finance, 500 Law) sourced from 182 qualified professionals, with coverage across 114 countries and 47 U.S. jurisdictions.

The Law prompts focus primarily on business and corporate scenarios, with a smaller number covering family law.

Each prompt is paired with an expert-curated rubric of 10–30 criteria (19,356 criteria total) spanning capability categories such as Legal/Financial Accuracy, Procedural Correctness, Handling Uncertainty, Practical Utility, Instruction Following, and others.

PRBench reports model scores using a rubric-weighted scheme (with a min-normalized variant for category comparisons) and includes a designated “Hard” subset (300 Finance, 250 Law). In reported evaluations of 20 models, the best scores on the Hard subsets are 0.39 (Finance) and 0.37 (Law). The dataset and rubrics are open-sourced to support interpretable, fine-grained assessment of reasoning on open-ended, economically consequential tasks.

Scope and scale: Compared with prior professional benchmarks, PRBench emphasizes open-ended QA, multi-turn context, and larger rubric coverage (1,100 tasks; 19,356 criteria).

Rubric design: Criteria are binary, self-contained, and weighted from −10 to +10 across severity bands; an independent expert validation reported 93.9% agreement on rubric clarity and validity.

Category coverage: Law categories include, for example, Legal Accuracy, Application of Law to Facts, and Procedural Correctness; Finance categories include, for example, Financial Accuracy and Process Transparency & Auditability.
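
To make the rubric-weighted scoring more concrete, here is a minimal Python sketch of how binary, weighted criteria could be turned into a score. Everything in it is an assumption for illustration: the `Criterion` structure is hypothetical and not the dataset schema, the weighted formula (earned weight divided by total positive weight, clipped to [0, 1]) is one plausible reading of "rubric-weighted", and the min-normalized variant (rescaling so the worst achievable outcome maps to 0) is likewise a guess at the idea. Consult the release for the exact definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One binary rubric criterion (hypothetical structure, not the dataset schema)."""
    description: str
    weight: int   # in [-10, 10]; negative weights penalize undesirable content
    met: bool     # binary verdict for a given model response


def weighted_score(criteria: list[Criterion]) -> float:
    """Assumed formula: earned weight over total positive weight, clipped to [0, 1]."""
    max_points = sum(c.weight for c in criteria if c.weight > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.weight for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


def min_normalized_score(criteria: list[Criterion]) -> float:
    """Assumed variant: rescale so the worst achievable outcome maps to 0."""
    max_points = sum(c.weight for c in criteria if c.weight > 0)
    min_points = sum(c.weight for c in criteria if c.weight < 0)
    if max_points == min_points:
        raise ValueError("rubric has no non-zero weights")
    earned = sum(c.weight for c in criteria if c.met)
    return (earned - min_points) / (max_points - min_points)


# Made-up rubric and verdicts, purely for illustration.
example = [
    Criterion("Identifies the governing jurisdiction", 8, True),
    Criterion("Applies the correct limitation period", 5, False),
    Criterion("Flags the need for local counsel", 3, True),
    Criterion("Asserts a non-existent statutory right", -6, True),
]

print(weighted_score(example))        # (8 + 3 - 6) / 16 = 0.3125
print(min_normalized_score(example))  # (5 + 6) / (16 + 6) = 0.5
```

In this sketch a met negative criterion subtracts from the earned total, so a response that satisfies several positive criteria can still score low if it also triggers a heavily penalized one.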

See the release for full details at: https://scale.com/research/prbench

See the dataset on Hugging Face: https://huggingface.co/datasets/ScaleAI/PRBench

You can also explore the data with the visualizer at: https://prbench-explorer.vercel.app/
