Who Checks the Citations? Benchmarking Legal Hallucination Detection

Attorneys, judges, and pro se filers increasingly use AI to draft legal documents, yet these tools frequently fabricate citations. Despite predictions that newer models would hallucinate less or that court sanctions would deter negligent filers, we found over 900 filings containing fabricated citations across U.S. jurisdictions — with this number growing year-over-year. We propose a taxonomy of legal citation hallucinations grounded in actual court filings and introduce a dataset of 1,300 brief excerpts containing injected errors. Benchmarking five models in agentic and non-agentic settings reveals that while the latest iterations perform better — GPT-5 achieves 82.8% recall and a 60.5% F1 score in an agentic framework — all models struggle with subtle error categories.

Patty Liu, Dominik Stammbach, Peter Henderson

Princeton University · April 2026

The Problem

Legal citations play a prominent role in U.S. legal practice. Attorneys must point to past judicial decisions and laws to make their case. Fabricating citations, or misrepresenting their content, amounts to invoking made-up law to win a case. One judge called it:

"an abuse of the adversary system" that puts the integrity of the judicial process at risk. — Noland v. Land of the Free, 114 Cal.App.5th 426, 445 (2025)

The rapid adoption of large language models (LLMs) in the legal system has turned rare individual instances of fabrication into a systemic problem. Pro se litigants, trained attorneys, and even judges are using LLMs to generate briefs, motions, and other court filings. A prevailing argument has been that this problem is temporary — that hallucination rates would diminish as models improve and that high-profile sanctions would induce greater caution. Our findings challenge these assumptions.

Real-world impact

900+ Hallucinated Filings

Court filings containing fabricated citations identified across diverse U.S. jurisdictions, growing steadily year-over-year.

Trend reversal
Rates Not Decreasing
GPT-5.1 produces hallucinated citations at 6.57%, significantly higher than the best mid-2024 GPT-4o release at 1.23% (p = 0.001).

Growing burden

The Jevons Paradox

Newer models cite more cases per document from broader, less canonical sets — harder to verify even if individual rates improve.

Hallucination rate over time across GPT models and cumulative hallucinated court filings

Legal hallucination rates are not consistently decreasing across GPT model generations, while hallucinated citations in court filings grow steadily. Hallucination rate is the percentage of fabricated citations in all model-generated citations.

Judges repeatedly describe the resulting burden as an "enormous waste of judicial resources," and increasingly impose sanctions because "lesser sanctions have been insufficient to deter the conduct." Mid Cent. Operating Eng’rs Health & Welfare Fund v. HoosierVac LLC, 2025 WL 574234, at *3 (S.D.Ind. Feb. 21, 2025); Powhatan County Sch. Bd. v. Skinger, 2025 WL 1559593, at *10 (E.D.Va. June 2, 2025). Courts impose these sanctions while pro se litigants — who stand to benefit most from AI's ability to improve access to justice — are least equipped to detect hallucinated citations and lack access to commercial legal databases.

Through a controlled experiment querying eight generations of ChatGPT models on 92 legal drafting prompts, we find that hallucination rates are no longer consistently decreasing across model generations. Early GPT-4o models released in mid-2024 exhibit the lowest rates at 1.23%, substantially improving over GPT-3.5's ~25%. GPT-5.1 reverses this trend at 6.57%. Newer models also generate more citations per document, drawn from a broader and less canonical set of cases that are individually harder to verify.

The LePhantomCite Dataset

To assess the promise of AI in automatically checking legal filings, we introduce LePhantomCite (Legal Phantom Citation), a benchmarking dataset of legal brief excerpts augmented with injected hallucinations. It is accompanied by a taxonomy of citation hallucination types derived from failure modes observed in real court filings.

Sources

245 Appellate Briefs

From 13 U.S. Courts of Appeals (2012–2021). Pre-2022 filings minimize risk of AI-generated content. Converted via olmOCR and segmented into 5,648 coherent passages.

Dataset
1,300 Excerpts
1,000 from appellate briefs + 300 from Dahl et al. (2024). 4,499 total citations; 1,107 contain injected hallucinations across five types. Balanced 50/50 clean/hallucinated split.

Evaluation

390 Test Examples

70-30 train/test split. Segment-level evaluation with relaxed matching (match if one segment is a substring of another).
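The relaxed matching rule can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the names `relaxed_match` and `segment_prf` and the whitespace and case normalization are assumptions. Only the substring criterion comes from the description above.

```python
def _norm(segment: str) -> str:
    """Collapse whitespace and lowercase (an assumed normalization)."""
    return " ".join(segment.split()).lower()


def relaxed_match(a: str, b: str) -> bool:
    """True if either segment is a substring of the other."""
    a, b = _norm(a), _norm(b)
    if not a or not b:
        return False  # an empty segment should not match everything
    return a in b or b in a


def segment_prf(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    """Segment-level precision, recall, and F1 under relaxed matching."""
    # A predicted segment counts as correct if it matches any gold segment.
    hit_pred = sum(any(relaxed_match(p, g) for g in gold) for p in predicted)
    # A gold segment counts as found if any predicted segment matches it.
    hit_gold = sum(any(relaxed_match(p, g) for p in predicted) for g in gold)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, with gold `["Boone v. Vinson, 15 F.3d 1338"]` and predictions `["15 F.3d 1338", "133 S. Ct. 1017"]`, the first prediction matches because it is a substring of the gold segment, giving precision 0.5, recall 1.0, and F1 of about 0.667.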

Hallucination Taxonomy

We propose five categories of legal citation hallucinations grounded in failure modes documented in actual court filings. Categories are not mutually exclusive, but only one hallucination type is introduced per citation in the dataset.

Non-existent citation. The citation does not correspond to any real case; the reporter volume, abbreviation, or page number contains implausible values.

Injected: Original 133 S. Ct. 1017 → Hallucinated 446 Cal. Rptr. 4th 183

Real-world case: A pro se plaintiff cited "Graham v. Nyquist (1974)" but the citation points to no existing case (Sims v. Souily-Lefave, No. 2:24-cv-00831 (D. Nev. Apr. 15, 2025)).
Case name mismatch. The reporter citation and case name refer to two different real cases. Created by replacing either the case name or the reporter citation with that of another existing case.

Injected: Cinel v. Connick, 15 F.3d 1338 → Boone v. Vinson, 15 F.3d 1338

Real-world case: An attorney cited "Tinch v. Video Indus. Servs., Inc., 2019 WL 1396975" but that Westlaw citation corresponds to Dolberry v. Jakob, 2019 WL 1396975 instead.
Incorrect pincite. The citation refers to the correct case, but the cited page number does not support the quoted language or proposition. Detecting this error is hindered by a largely structural barrier: page number information for many cases is only accessible through Westlaw and LexisNexis.

Injected: 830 F.3d at 514 → 830 F.3d at 511

Public repositories do not include official reporter pagination, which originates in commercial publisher formatting. This makes pincite verification infeasible without a commercial database subscription, explaining why all models fail on this category.
Verbatim misquote. The exact quoted language does not appear in the cited case. Generated by replacing one or two words with semantically similar synonyms, preserving meaning so there is no content mismatch.

Injected: "[T]he parties' dispute over arbitrability specifically falls within those carve-outs." → "[T]he parties' dispute over arbitrability specifically resides within those exclusions."

Real-world case: In Harris v. Take-Two Interactive Software, a plaintiff cited Chambers v. NASCO, Inc. with a quotation that "courts have 'the inherent power to police themselves and to sanction bad-faith litigation conduct.'" This quote was not found in the cited decision.
Content misrepresentation. A cited case exists but does not support the proposition for which it is cited. The hardest hallucination type to inject synthetically and to detect automatically.

Injected: Original holding: "These rules cannot support a claim for retaliatory discharge." → Hallucinated: "Any private policies cannot support a claim for retaliatory discharge under Kansas law."

Real-world case: In Jakes v. Youngblood, a defendant's attorney wrote that a Pennsylvania Superior Court case "emphasized that even where statements are embarrassing or upsetting, the Plaintiff must demonstrate their precise defamatory content and origin." The cited case contains no such opinion.
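The five categories can be represented as a simple enumeration, and a verbatim-misquote check reduces to a normalized substring test against the opinion text. The sketch below is illustrative only: `HallucinationType`, `quote_appears_in`, and the normalization choices (straightening curly quotes, dropping the brackets of alterations such as "[T]he", collapsing whitespace, ignoring case) are assumptions, not the dataset's actual tooling.

```python
import re
from enum import Enum


class HallucinationType(Enum):
    """The five taxonomy categories (enum names are hypothetical)."""
    NONEXISTENT_CITATION = "non-existent citation"
    CASE_NAME_MISMATCH = "case name mismatch"
    INCORRECT_PINCITE = "incorrect pincite"
    VERBATIM_MISQUOTE = "verbatim misquote"
    CONTENT_MISREPRESENTATION = "content misrepresentation"


_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def _normalize(text: str) -> str:
    """Assumed normalization applied to both the quote and the opinion."""
    text = text.translate(_QUOTES)          # straighten curly quotes
    text = re.sub(r"[\[\]]", "", text)      # "[T]he" -> "The"
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_appears_in(quote: str, opinion_text: str) -> bool:
    """True if the quoted language occurs verbatim in the opinion text."""
    return _normalize(quote) in _normalize(opinion_text)
```

Applied to the injected example above, a toy opinion text containing "the parties' dispute over arbitrability specifically falls within those carve-outs." would match the original quote but not the variant ending in "resides within those exclusions." A check like this covers only verbatim misquotes; content misrepresentation needs a judgment about whether the case supports the proposition, which a substring test cannot provide.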

Benchmark Results

We evaluate five models in agentic and non-agentic settings using a custom harness with access to CourtListener search, local opinion retrieval, and open web search. We adapt the BOED agent from Zheng et al. (2026) as our agentic framework. The BOED agent maintains an explicit, language-based belief state that is updated after each action. We adopt this framework because case citation verification is inherently sequential and information-dependent: the agent must extract citations from the brief excerpts, decide how to gather information, and decide when it has gathered sufficient evidence to make a hallucination determination. We set a maximum of 30 steps per episode.
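The control flow described above can be sketched as a loop that alternates between choosing an action and rewriting a textual belief state. This is not the BOED agent's implementation: `Episode`, `run_episode`, `choose_action`, `update_belief`, the tool names, and the stub policy are all hypothetical stand-ins. Only the language-based belief state, the three kinds of tools, and the 30-step cap come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_STEPS = 30  # per-episode cap stated in the text


@dataclass
class Episode:
    excerpt: str
    belief: str = "No evidence gathered yet."  # language-based belief state
    trace: list[tuple[str, str]] = field(default_factory=list)


def run_episode(
    excerpt: str,
    tools: dict[str, Callable[[str], str]],
    choose_action: Callable[[Episode], tuple[str, str]],
    update_belief: Callable[[str, str, str, str], str],
    max_steps: int = MAX_STEPS,
) -> Episode:
    """Act, observe, and update the belief until 'finish' or the step cap."""
    ep = Episode(excerpt)
    for _ in range(max_steps):
        name, arg = choose_action(ep)  # in a real agent, an LLM call
        if name == "finish":
            break
        observation = tools[name](arg)
        ep.trace.append((name, arg))
        ep.belief = update_belief(ep.belief, name, arg, observation)
    return ep


# Stub tools and policy, for illustration only.
STUB_TOOLS = {
    "courtlistener_search": lambda q: f"no results for {q!r}",
    "local_opinion_lookup": lambda q: "opinion text (stub)",
    "web_search": lambda q: "web results (stub)",
}


def scripted_policy(ep: Episode) -> tuple[str, str]:
    """Search CourtListener, fall back to the web, then stop."""
    plan = ["courtlistener_search", "web_search"]
    if len(ep.trace) < len(plan):
        return plan[len(ep.trace)], "446 Cal. Rptr. 4th 183"
    return "finish", ""


def append_belief(belief: str, name: str, arg: str, obs: str) -> str:
    return f"{belief} | {name}({arg!r}) -> {obs}"
```

Calling `run_episode("excerpt text", STUB_TOOLS, scripted_policy, append_belief)` runs two tool calls and stops. A policy that never returns `"finish"` is cut off after 30 steps.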

Agentic retrieval significantly outperforms non-agentic baselines. GPT-5 improves recall by 19.7 percentage points with access to search tools (63.1% → 82.8% recall). All five models gain recall under the agentic framework, although Gemini 2.5 Flash loses precision and its F1 drops from 34.5% to 30.9%.
GPT-5 leads across nearly all categories. It achieves 100% recall on non-existent citations and 97.1% on case name mismatches, and over 80% recall on verbatim misquotes and content misrepresentation. Weaker models are considerably lower, especially on harder types.
Incorrect pincites remain intractable for all models. GPT-5 achieves only 18.2% recall on this category. This is a structural barrier, not a model capability issue: official reporter pagination is locked behind commercial legal databases.
Precision is low across the board. High recall comes at the cost of false positives. GPT-5 achieves 47.6% precision agentically — the best of any model. Weaker models often treat absence from CourtListener as evidence of hallucination, inflating false positive rates.

Main results (precision, recall, F1)

All models evaluated on the 390-example test set. Temperature 0.8 for all models; max tokens 8,192 (10,000 for GPT-5).

Model	Agentic Precision	Agentic Recall	Agentic F1	Non-Agentic Precision	Non-Agentic Recall	Non-Agentic F1
GPT-5	47.6	82.8	60.5	42.9	63.1	51.1
Qwen3.5-27B	27.8	59.8	38.0	28.1	38.7	32.6
GPT-OSS 120B	24.5	57.7	34.4	17.5	33.9	23.1
Gemini 2.5 Flash	20.2	65.3	30.9	27.8	45.5	34.5
Qwen3-8B	15.1	47.1	22.9	15.6	35.5	21.6

Values are percentages (±95% CI reported in paper). GPT-5 averages 16.9 tool steps per episode; Gemini 2.5 Flash averages 10.9.
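As a consistency check, F1 is the harmonic mean of precision and recall, so the F1 columns can be recomputed from the other two. The snippet below does this for every row of the table. The names `f1` and `REPORTED` are illustrative, and small differences from the reported values are expected because the inputs are rounded to one decimal place.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from the table above.
REPORTED = {
    ("GPT-5", "agentic"): (47.6, 82.8, 60.5),
    ("GPT-5", "non-agentic"): (42.9, 63.1, 51.1),
    ("Qwen3.5-27B", "agentic"): (27.8, 59.8, 38.0),
    ("Qwen3.5-27B", "non-agentic"): (28.1, 38.7, 32.6),
    ("GPT-OSS 120B", "agentic"): (24.5, 57.7, 34.4),
    ("GPT-OSS 120B", "non-agentic"): (17.5, 33.9, 23.1),
    ("Gemini 2.5 Flash", "agentic"): (20.2, 65.3, 30.9),
    ("Gemini 2.5 Flash", "non-agentic"): (27.8, 45.5, 34.5),
    ("Qwen3-8B", "agentic"): (15.1, 47.1, 22.9),
    ("Qwen3-8B", "non-agentic"): (15.6, 35.5, 21.6),
}

for (model, setting), (p, r, reported) in REPORTED.items():
    print(f"{model:17s} {setting:12s} recomputed={f1(p, r):5.2f} reported={reported}")
```

Every recomputed value lands within 0.1 points of the reported F1, which is consistent with rounding of the inputs.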

Recall by hallucination type (agentic)

Agentic recall (%) broken down by the five hallucination categories. Incorrect pincite is consistently the hardest across all models — a structural barrier, not a model capability issue.

Hallucination Type	GPT-5	Qwen3.5-27B	GPT-OSS 120B	Gemini 2.5 Flash	Qwen3-8B
Non-existent citation	100.0	96.8	90.3	90.3	58.1
Case name mismatch	97.1	94.1	83.8	85.3	52.9
Content misrepresentation	84.0	49.6	53.4	55.7	54.2
Verbatim misquote	82.6	32.6	39.1	67.4	32.6
Incorrect pincite	18.2	25.5	21.8	16.4	16.4

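GPT-5 detects over 80% of content-based hallucinations (content misrepresentation and verbatim misquotes). The other models reach roughly 50% to 56% on content misrepresentation and between 32.6% and 67.4% on verbatim misquotes.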

Agent behavior analysis

GPT-5 devotes 46.6% of its actions to local opinion search (highest of any model) and 8.7% to open web search as a fallback for citations absent from CourtListener. It has the lowest false positive rate on citations not found in CourtListener — weaker models often treat absence from CourtListener as evidence of hallucination.

Model	Avg Steps	False Positive Rate (citations not on CourtListener)
GPT-OSS 120B	17.7	43.4%
GPT-5	16.9	24.0%
Gemini 2.5 Flash	10.9	58.9%
Qwen3.5-27B	10.2	65.9%
Qwen3-8B	7.5	51.9%
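One way to compute the rate in the right-hand column is sketched below. The exact definition is an assumption here: the sketch takes it to be the share of genuine (non-hallucinated) citations that are absent from CourtListener and that the model nevertheless flagged. `CitationResult` and `fpr_off_courtlistener` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CitationResult:
    on_courtlistener: bool   # was the citation found on CourtListener?
    is_hallucinated: bool    # gold label
    flagged: bool            # model's prediction


def fpr_off_courtlistener(results: list[CitationResult]) -> Optional[float]:
    """False positive rate among genuine citations missing from CourtListener.

    Returns None when no such citations exist, so that an empty pool is
    not confused with a rate of zero.
    """
    pool = [
        r for r in results
        if not r.on_courtlistener and not r.is_hallucinated
    ]
    if not pool:
        return None
    return sum(r.flagged for r in pool) / len(pool)
```

Under this reading, a model that treats absence from CourtListener as proof of fabrication would flag every citation in the pool and score 1.0, while a model that falls back to other sources before deciding would score lower.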

A notable side effect: the agent surfaced over 10 citation errors in pre-LLM briefs submitted to state supreme courts, demonstrating practical value beyond the benchmark.

Policy Recommendations

Our results reveal that some failures are structural rather than purely technical. The most consequential barrier is access to legal data.

Improve public access to legal data. Although judicial opinions are in the public domain, they are not freely accessible through a centralized service. PACER charges per-page fees; commercial providers bill up to $100 per query; public repositories like CourtListener have incomplete coverage and contain no pincite information. By contrast, Canada's CanLII provides free, near-complete access to all court judgments. One Canadian court noted that verifying AI-cited cases requires only "a simple search on CanLII," an assumption U.S. courts cannot reasonably make. Improving automated verification will require either broader public access to legal databases or verification systems built atop commercial platforms.
Support pro se litigants with targeted AI literacy guidance. The share of AI hallucinations in courts attributable to pro se filers has risen every year. Courts already provide self-help resources for unrepresented litigants — these should include guidance on AI citation risks and concrete verification best practices. Raising sanctions without addressing structural limitations makes AI a false promise for the litigants who stand to benefit most.
Develop and deploy automated citation verification tools. Currently, verifying citations is largely a manual process performed by attorneys, law clerks, or judges — especially costly when a cited case does not exist. Our benchmark, dataset, and agent harness provide a foundation for building and auditing such tools. With sufficient data access, automated tools could ease court verification burdens and assist non-lawyers directly in verifying AI-generated citations.