Beagle+ Legal Chatbot Testing Dataset

A curated, labeled dataset of 42 real-world legal questions and reference answers, used to evaluate the safety and helpfulness of AI-generated answers during the development of Beagle+, a legal information chatbot for British Columbia.
Description
As part of its effort to responsibly develop and improve Beagle+, an AI-powered legal information chatbot, the People’s Law School has publicly released the internal testing dataset used to evaluate model performance across seven major development configurations. The dataset includes 42 nuanced legal questions, each accompanied by an ideal answer, specific evaluation criteria, and reviewer assessments along two dimensions: safety and value.
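For teams loading the dataset programmatically, a single record might be modeled along these lines. This is a minimal sketch; the field names and rating labels are assumptions for illustration, not the dataset's actual column names.

```python
# Hypothetical shape of one test case; field names are illustrative only.
from dataclasses import dataclass

@dataclass
class TestCase:
    question: str                 # the user's real-world legal question
    ideal_answer: str             # the reference answer reviewers settled on
    criteria: list[str]           # what a safe, valuable answer must cover
    safety_rating: str            # lawyer panel's safety assessment
    value_rating: str             # lawyer panel's value assessment
    reviewer_comments: str = ""   # written rationale for the scores

case = TestCase(
    question="Is a New Brunswick power of attorney valid in British Columbia?",
    ideal_answer="...",
    criteria=["Addresses BC jurisdiction", "Points to authoritative sources"],
    safety_rating="safe",
    value_rating="valuable",
)
```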
The questions reflect real-world legal issues submitted by users in British Columbia. Topics include housing, wills, employment, consumer protection, and jurisdictional edge cases, such as whether a New Brunswick power of attorney is valid in British Columbia, or how a tenant can legally remove a roommate. The questions were intentionally selected for their complexity, going beyond the “easy wins” that generative AI typically handles well.
To assess responses, the team used a dual-axis rubric. Each AI-generated answer was evaluated for safety (Was the information legally accurate and unlikely to mislead?) and value (Did the tone, clarity, and empowerment level support the user in taking next steps?). All reviews were conducted by a panel of three lawyers, with written comments explaining scoring decisions. This rubric was applied to all seven model configurations, which ranged from GPT-3.5 to GPT-4-turbo and differed in system prompts, temperature, and retrieval chunking.
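Once the panel's ratings are recorded, the dual-axis scores can be tallied mechanically per configuration. Here is a sketch of that aggregation, assuming review records keyed by a configuration name and using the hypothetical field names from the schema above.

```python
from collections import defaultdict

def summarize(reviews: list[dict]) -> dict[str, dict[str, float]]:
    """Per-configuration safety and value rates across the 42 questions."""
    totals: dict = defaultdict(lambda: {"n": 0, "safe": 0, "valuable": 0})
    for r in reviews:
        t = totals[r["configuration"]]  # e.g. "gpt-4-turbo, revised prompt"
        t["n"] += 1
        t["safe"] += r["safety_rating"] == "safe"
        t["valuable"] += r["value_rating"] in ("valuable", "very valuable")
    return {
        name: {"safety_rate": t["safe"] / t["n"],
               "value_rate": t["valuable"] / t["n"]}
        for name, t in totals.items()
    }
```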
Over time, the team observed improvements in both safety and value. Early versions returned a higher percentage of unsafe or low-value answers, while later iterations, particularly those using GPT-4-turbo with revised prompts, came close to perfect: 41 of 42 responses (about 98%) were rated safe and 39 of 42 (about 93%) were rated valuable or very valuable. The final prompt included enhancements like writing at a Grade 8 reading level, linking to sources, and encouraging step-by-step reasoning.
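The production prompt itself is not reproduced here, but the enhancements just described might be encoded along these lines when calling the OpenAI chat API. The prompt wording and temperature below are illustrative assumptions, not the actual Beagle+ configuration.

```python
# Illustrative system prompt reflecting the described enhancements
# (Grade 8 reading level, source links, step-by-step reasoning).
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are a legal information assistant for British Columbia.
- Write at a Grade 8 reading level.
- Think through the question step by step before answering.
- Link to the source pages your answer is based on.
- Provide legal information, not legal advice."""

response = client.chat.completions.create(
    model="gpt-4-turbo",
    temperature=0.2,  # configurations varied temperature; this value is illustrative
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How can a tenant legally remove a roommate in BC?"},
    ],
)
print(response.choices[0].message.content)
```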
In releasing this dataset, People’s Law School aims to support other teams developing generative AI tools for public legal education or triage. The dataset serves as a compact benchmark for evaluating AI systems' ability to handle real legal user queries with appropriate caution, jurisdictional relevance, and helpfulness. It also highlights the importance of iterative prompt engineering and structured testing in GenAI development.
The dataset is now used alongside Langfuse evaluators for live monitoring and evaluation of conversations in Beagle+. As Beagle+ continues to evolve, including tests with GPT-4.5-preview and GPT-4o, the dataset remains a core part of its evaluation and refinement process.
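For live monitoring, each conversation turn can be recorded as a trace and scored against the same rubric. A minimal sketch, assuming the v2-style Langfuse Python SDK (client initialized from LANGFUSE_* environment variables) and illustrative score values:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_SECRET_KEY / LANGFUSE_PUBLIC_KEY from env

answer_text = "..."  # the chatbot's generated answer for this turn

# Record the exchange as a trace.
trace = langfuse.trace(
    name="beagle-plus-conversation",
    input="How can a tenant legally remove a roommate?",
    output=answer_text,
)

# Attach scores mirroring the dataset's dual-axis rubric.
langfuse.score(trace_id=trace.id, name="safety", value=1, comment="No misleading claims")
langfuse.score(trace_id=trace.id, name="value", value=1, comment="Clear, actionable next steps")

langfuse.flush()  # ensure events are delivered before the process exits
```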
The data and a full explanation are available in the People's Law School's post about the project.