Who Checks the Citations? Benchmarking Legal Hallucination Detection

Attorneys, judges, and pro se filers increasingly use AI to draft legal documents, yet these tools frequently fabricate citations. Despite predictions that newer models would hallucinate less or that court sanctions would deter negligent filers, we found over 900 filings containing fabricated citations across U.S. jurisdictions — with this number growing year-over-year. We propose a taxonomy of legal citation hallucinations grounded in actual court filings and introduce a dataset of 1,300 brief excerpts containing injected errors. Benchmarking five models in agentic and non-agentic settings reveals that while the latest iterations perform better — GPT-5 achieves 82.8% recall and a 60.5% F1 score in an agentic framework — all models struggle with subtle error categories.

Patty Liu, Dominik Stammbach, Peter Henderson

Princeton University  ·  April 2026

The Problem

Legal citations play a prominent role in U.S. legal practice. Attorneys must point to past judicial decisions and laws to make their case. Fabricating citations, or misrepresenting the content of those citations, is the same as pointing to made-up law to win the case — one judge called it:

"an abuse of the adversary system" that puts the integrity of the judicial process at risk. — Noland v. Land of the Free, 114 Cal.App.5th 426, 445 (2025)

The rapid adoption of large language models (LLMs) in the legal system has turned rare individual instances of fabrication into a systemic problem. Pro se litigants, trained attorneys, and even judges are using LLMs to generate briefs, motions, and other court filings. A prevailing argument has been that this problem is temporary — that hallucination rates would diminish as models improve and that high-profile sanctions would induce greater caution. Our findings challenge these assumptions.

Real-world impact: 900+ hallucinated filings. Court filings containing fabricated citations identified across diverse U.S. jurisdictions, growing steadily year-over-year.

Trend reversal: rates not decreasing. GPT-5.1 produces hallucinated citations at 6.57%, significantly higher than the best mid-2024 GPT-4o release at 1.23% (p = 0.001).

Growing burden: the Jevons paradox. Newer models cite more cases per document from broader, less canonical sets, which are harder to verify even if individual rates improve.

[Figure: Hallucination rate over time across GPT models and cumulative hallucinated court filings]

Legal hallucination rates are not consistently decreasing across GPT model generations, while hallucinated citations in court filings grow steadily. Hallucination rate is the percentage of fabricated citations in all model-generated citations.

Judges repeatedly describe the resulting burden as an "enormous waste of judicial resources," and increasingly impose sanctions because "lesser sanctions have been insufficient to deter the conduct." Mid Cent. Operating Eng’rs Health & Welfare Fund v. HoosierVac LLC, 2025 WL 574234, at *3 (S.D.Ind. Feb. 21, 2025); Powhatan County Sch. Bd. v. Skinger, 2025 WL 1559593, at *10 (E.D.Va. June 2, 2025). Courts impose these sanctions while pro se litigants — who stand to benefit most from AI's ability to improve access to justice — are least equipped to detect hallucinated citations and lack access to commercial legal databases.

Through a controlled experiment querying eight generations of ChatGPT models on 92 legal drafting prompts, we find that hallucination rates are no longer consistently decreasing across model generations. Early GPT-4o models released in mid-2024 exhibit the lowest rates at 1.23%, substantially improving over GPT-3.5's ~25%. GPT-5.1 reverses this trend at 6.57%. Newer models also generate more citations per document, drawn from a broader and less canonical set of cases that are individually harder to verify.
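The rate reported here is the share of fabricated citations among all model-generated citations. A minimal sketch of that arithmetic follows. It assumes the citations from all prompts are pooled before dividing, which this summary does not state, and the `GeneratedCitation` record and the toy inputs are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneratedCitation:
    """Hypothetical record for one citation produced by a model."""
    text: str
    fabricated: bool  # True if the cited authority could not be verified to exist


def hallucination_rate(citations: List[GeneratedCitation]) -> float:
    """Percentage of fabricated citations among all generated citations.

    Assumption: citations from every prompt are pooled into one list.
    """
    if not citations:
        return 0.0
    fabricated = sum(1 for c in citations if c.fabricated)
    return 100.0 * fabricated / len(citations)


# Toy input for illustration only; these are not data from the study.
sample = [
    GeneratedCitation("citation-1", False),
    GeneratedCitation("citation-2", False),
    GeneratedCitation("citation-3", True),
    GeneratedCitation("citation-4", False),
]
rate = hallucination_rate(sample)  # one fabricated citation out of four
```

Because newer models emit more citations per document, the same rate can correspond to more fabricated citations in absolute terms, which is why the count per document matters alongside the percentage.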

The LePhantomCite Dataset

To assess the promise of AI in automatically checking legal filings, we introduce LePhantomCite (Legal Phantom Citation), a benchmarking dataset of legal brief excerpts augmented with injected hallucinations. It is accompanied by a taxonomy of citation hallucination types derived from failure modes observed in real court filings.

Sources: 245 appellate briefs. From 13 U.S. Courts of Appeals (2012–2021). Pre-2022 filings minimize the risk of AI-generated content. Converted via olmOCR and segmented into 5,648 coherent passages.

Dataset: 1,300 excerpts. 1,000 from appellate briefs plus 300 from Dahl et al. (2024). 4,499 total citations; 1,107 contain injected hallucinations across five types. Balanced 50/50 clean/hallucinated split.

Evaluation: 390 test examples. 70-30 train/test split. Segment-level evaluation with relaxed matching (a prediction matches if one segment is a substring of the other).
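The relaxed matching rule can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's scoring script: the function names are hypothetical, the only normalization applied is stripping surrounding whitespace, and precision and recall are computed per excerpt from the matched segments.

```python
from typing import List, Tuple


def relaxed_match(predicted: str, gold: str) -> bool:
    """True if either segment is a substring of the other."""
    p, g = predicted.strip(), gold.strip()
    if not p or not g:
        return False  # an empty string would trivially match everything
    return p in g or g in p


def score_segments(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one excerpt under relaxed matching."""
    # A predicted segment is correct if it matches any gold segment.
    matched_pred = sum(1 for p in predicted if any(relaxed_match(p, g) for g in gold))
    # A gold segment is recovered if any predicted segment matches it.
    matched_gold = sum(1 for g in gold if any(relaxed_match(p, g) for p in predicted))
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical segments: the first prediction is contained in the gold segment,
# the second prediction matches nothing.
gold_segments = ["In Alpha v. Beta, 1 X.2d 3, the court held that notice was required."]
pred_segments = ["Alpha v. Beta, 1 X.2d 3", "Gamma v. Delta, 4 Y.3d 5"]
precision, recall, f1 = score_segments(pred_segments, gold_segments)
```

Substring matching tolerates predictions that quote only part of a flagged segment, or slightly more than it, at the cost of occasionally crediting a very short prediction that happens to occur inside a longer gold segment.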

Hallucination Taxonomy

We propose five categories of legal citation hallucinations grounded in failure modes documented in actual court filings. Categories are not mutually exclusive, but only one hallucination type is introduced per citation in the dataset.
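One way to represent these labels in code is sketched below. The five category names are the ones used in the per-type recall results on this page; the `LabeledCitation` record and its field names are hypothetical and are not the dataset's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HallucinationType(Enum):
    """The five taxonomy categories, named as in the per-type results."""
    NON_EXISTENT_CITATION = "Non-existent citation"
    CASE_NAME_MISMATCH = "Case name mismatch"
    CONTENT_MISREPRESENTATION = "Content misrepresentation"
    VERBATIM_MISQUOTE = "Verbatim misquote"
    INCORRECT_PINCITE = "Incorrect pincite"


@dataclass(frozen=True)
class LabeledCitation:
    """Hypothetical record: one citation segment and its injected error, if any."""
    segment: str
    # A single optional field mirrors the dataset rule that at most one
    # hallucination type is injected per citation.
    injected: Optional[HallucinationType] = None

    @property
    def is_hallucinated(self) -> bool:
        return self.injected is not None


clean = LabeledCitation("placeholder citation segment")
altered = LabeledCitation("placeholder citation segment", HallucinationType.INCORRECT_PINCITE)
```

A single optional field fits the dataset, where each citation carries at most one injected type. Labeling real filings, where categories can overlap, would need a set of types instead.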

Benchmark Results

We evaluate five models in agentic and non-agentic settings using a custom harness with access to CourtListener search, local opinion retrieval, and open web search. We adapt the BOED agent from Zheng et al. (2026) as our agentic framework. The BOED agent maintains an explicit, language-based belief state that it updates after each action. We adopt this framework because case citation verification is inherently sequential and information-dependent: the agent must extract citations from the brief excerpt, decide how to gather information, and judge when it has gathered sufficient evidence to make a hallucination determination. We set a maximum of 30 steps per episode.
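The control flow described above (act, observe, update a textual belief, stop when confident or at the step cap) can be sketched generically. This is not the BOED agent or the authors' harness: the function signatures, the tool key, and the stub policy are all hypothetical stand-ins for model calls, and only the 30-step cap comes from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 30  # per-episode cap stated in the text


@dataclass
class Episode:
    excerpt: str
    belief: str = "No evidence gathered yet."  # language-based belief state
    history: List[Tuple[str, str, str]] = field(default_factory=list)


def run_episode(
    excerpt: str,
    policy: Callable[[str, str], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    update_belief: Callable[[str, str, str, str], str],
    max_steps: int = MAX_STEPS,
) -> Tuple[str, Episode]:
    """Run one verification episode and return (verdict, episode trace)."""
    ep = Episode(excerpt)
    for _ in range(max_steps):
        action, arg = policy(ep.excerpt, ep.belief)
        if action == "finish":
            return arg, ep  # for "finish", arg carries the verdict
        observation = tools[action](arg)
        ep.history.append((action, arg, observation))
        # The belief is rewritten after every action, as in the description.
        ep.belief = update_belief(ep.belief, action, arg, observation)
    return "undetermined", ep  # step budget exhausted without a verdict


# Stubs standing in for model calls and real tools (all hypothetical).
def stub_policy(excerpt: str, belief: str) -> Tuple[str, str]:
    if "found" in belief:
        return ("finish", "no hallucination flagged")
    return ("courtlistener_search", "Alpha v. Beta")


stub_tools = {"courtlistener_search": lambda q: f"found opinion matching {q!r}"}


def stub_update(belief: str, action: str, arg: str, obs: str) -> str:
    return f"{belief} After {action}: {obs}."


verdict, trace = run_episode("excerpt text", stub_policy, stub_tools, stub_update)
```

Keeping the belief as plain text lets the same state be passed back to the policy on every step, so the decision to search again or stop depends on what has already been found rather than on the raw tool outputs alone.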

Main results (precision, recall, F1)

All models evaluated on the 390-example test set. Temperature 0.8 for all models; max tokens 8,192 (10,000 for GPT-5).

Setting            Agentic                      Non-Agentic
Model              Precision  Recall  F1        Precision  Recall  F1
GPT-5              47.6       82.8    60.5      42.9       63.1    51.1
Qwen3.5-27B        27.8       59.8    38.0      28.1       38.7    32.6
GPT-OSS 120B       24.5       57.7    34.4      17.5       33.9    23.1
Gemini 2.5 Flash   20.2       65.3    30.9      27.8       45.5    34.5
Qwen3-8B           15.1       47.1    22.9      15.6       35.5    21.6

Values are percentages (±95% CI reported in paper). GPT-5 averages 16.9 tool steps per episode; Gemini 2.5 Flash averages 10.9.

Recall by hallucination type (agentic)

Agentic recall (%) broken down by the five hallucination categories. Incorrect pincite is consistently the hardest across all models — a structural barrier, not a model capability issue.

Hallucination Type          GPT-5   Qwen3.5-27B   GPT-OSS 120B   Gemini 2.5 Flash   Qwen3-8B
Non-existent citation       100.0   96.8          90.3           90.3               58.1
Case name mismatch          97.1    94.1          83.8           85.3               52.9
Content misrepresentation   84.0    49.6          53.4           55.7               54.2
Verbatim misquote           82.6    32.6          39.1           67.4               32.6
Incorrect pincite           18.2    25.5          21.8           16.4               16.4

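GPT-5 detects over 80% of content-based hallucinations (84.0% of content misrepresentations and 82.6% of verbatim misquotes). The other four models detect about half of content misrepresentations (49.6% to 55.7%) and vary more widely on verbatim misquotes (32.6% to 67.4%).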

Agent behavior analysis

GPT-5 devotes 46.6% of its actions to local opinion search (highest of any model) and 8.7% to open web search as a fallback for citations absent from CourtListener. It has the lowest false positive rate on citations not found in CourtListener — weaker models often treat absence from CourtListener as evidence of hallucination.

Model              Avg Steps   False Positive Rate (citations not on CourtListener)
GPT-OSS 120B       17.7        43.4%
GPT-5              16.9        24.0%
Gemini 2.5 Flash   10.9        58.9%
Qwen3.5-27B        10.2        65.9%
Qwen3-8B           7.5         51.9%
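The false positive rate in this table is restricted to citations that cannot be found on CourtListener. A minimal sketch of one plausible way to compute it follows. The exact definition is not given on this page, so this assumes the denominator is all genuine (non-hallucinated) citations absent from CourtListener, and the record layout is hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical record layout: (is_hallucinated, flagged_by_model, on_courtlistener)
Record = Tuple[bool, bool, bool]


def fpr_absent_from_courtlistener(records: Iterable[Record]) -> float:
    """Percentage of genuine citations missing from CourtListener that were flagged.

    Assumption: only citations that are both genuine and absent from
    CourtListener enter the denominator.
    """
    genuine_absent = [r for r in records if not r[0] and not r[2]]
    if not genuine_absent:
        return 0.0
    false_positives = sum(1 for r in genuine_absent if r[1])
    return 100.0 * false_positives / len(genuine_absent)


# Toy records for illustration only; these are not benchmark data.
toy = [
    (False, True, False),   # genuine, absent, wrongly flagged: false positive
    (False, False, False),  # genuine, absent, correctly left alone
    (False, True, True),    # genuine but present on CourtListener: excluded
    (True, True, False),    # hallucinated: excluded from the denominator
]
toy_fpr = fpr_absent_from_courtlistener(toy)
```

Isolating this subset separates two behaviors that look identical in aggregate precision: flagging a citation because evidence contradicts it, and flagging it only because one database returned nothing.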

A notable side effect: the agent surfaced over 10 citation errors in pre-LLM briefs submitted to state supreme courts, demonstrating practical value beyond the benchmark.

Policy Recommendations

Our results reveal that some failures are structural rather than purely technical. The most consequential barrier is access to legal data.