Attorneys, judges, and pro se filers increasingly use AI to draft legal documents, yet these tools frequently fabricate citations. Despite predictions that newer models would hallucinate less or that court sanctions would deter negligent filers, we found over 900 filings containing fabricated citations across U.S. jurisdictions — with this number growing year-over-year. We propose a taxonomy of legal citation hallucinations grounded in actual court filings and introduce a dataset of 1,300 brief excerpts containing injected errors. Benchmarking five models in agentic and non-agentic settings reveals that while the latest iterations perform better — GPT-5 achieves 82.8% recall and a 60.5% F1 score in an agentic framework — all models struggle with subtle error categories.
Princeton University · April 2026
Legal citations play a prominent role in U.S. legal practice. Attorneys must point to past judicial decisions and laws to make their case. Fabricating citations, or misrepresenting the content of those citations, is the same as pointing to made-up law to win the case — one judge called it:
"an abuse of the adversary system" that puts the integrity of the judicial process at risk. — Noland v. Land of the Free, 114 Cal.App.5th 426, 445 (2025)
The rapid adoption of large language models (LLMs) in the legal system has turned rare individual instances of fabrication into a systemic problem. Pro se litigants, trained attorneys, and even judges are using LLMs to generate briefs, motions, and other court filings. A prevailing argument has been that this problem is temporary — that hallucination rates would diminish as models improve and that high-profile sanctions would induce greater caution. Our findings challenge these assumptions.
Legal hallucination rates are not consistently decreasing across GPT model generations, while hallucinated citations in court filings grow steadily. Hallucination rate is the percentage of fabricated citations among all model-generated citations.
Judges repeatedly describe the resulting burden as an "enormous waste of judicial resources," and increasingly impose sanctions because "lesser sanctions have been insufficient to deter the conduct." Mid Cent. Operating Eng’rs Health & Welfare Fund v. HoosierVac LLC, 2025 WL 574234, at *3 (S.D.Ind. Feb. 21, 2025); Powhatan County Sch. Bd. v. Skinger, 2025 WL 1559593, at *10 (E.D.Va. June 2, 2025). Courts impose these sanctions while pro se litigants — who stand to benefit most from AI's ability to improve access to justice — are least equipped to detect hallucinated citations and lack access to commercial legal databases.
Through a controlled experiment querying eight generations of ChatGPT models on 92 legal drafting prompts, we find that hallucination rates are no longer consistently decreasing across model generations. Early GPT-4o models released in mid-2024 exhibit the lowest rates at 1.23%, substantially improving over GPT-3.5's ~25%. GPT-5.1 reverses this trend at 6.57%. Newer models also generate more citations per document, drawn from a broader and less canonical set of cases that are individually harder to verify.
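The hallucination rate used here is a simple ratio, and a minimal sketch may make the definition concrete. The `GeneratedCitation` record and its `fabricated` flag are hypothetical names introduced for illustration; the source does not describe how the labels are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratedCitation:
    text: str          # citation string as produced by the model
    fabricated: bool   # hypothetical label assigned after verification


def hallucination_rate(citations):
    """Percentage of fabricated citations among all model-generated citations."""
    if not citations:
        raise ValueError("no citations generated; the rate is undefined")
    fabricated = sum(1 for c in citations if c.fabricated)
    return 100.0 * fabricated / len(citations)


# Placeholder data, not drawn from the experiment.
sample = [
    GeneratedCitation("citation-1", fabricated=False),
    GeneratedCitation("citation-2", fabricated=True),
    GeneratedCitation("citation-3", fabricated=False),
    GeneratedCitation("citation-4", fabricated=False),
]
print(hallucination_rate(sample))
```

Pooling citations across all prompts for a model, rather than averaging per-prompt rates, is an assumption of this sketch; the two choices can give different numbers when documents vary in citation count.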
To assess the promise of AI in automatically checking legal filings, we introduce LePhantomCite (Legal Phantom Citation), a benchmarking dataset of legal brief excerpts augmented with injected hallucinations. It is accompanied by a taxonomy of citation hallucination types derived from failure modes observed in real court filings.
We propose five categories of legal citation hallucinations grounded in failure modes documented in actual court filings. Categories are not mutually exclusive, but only one hallucination type is introduced per citation in the dataset.
Non-existent citation
Injected: Original 133 S. Ct. 1017 → Hallucinated 446 Cal. Rptr. 4th 183
Real-world case: A pro se plaintiff cited "Graham v. Nyquist (1974)" but the citation points to no existing case (Sims v. Souily-Lefave, No. 2:24-cv-00831 (D. Nev. Apr. 15, 2025)).
Case name mismatch
Injected: Cinel v. Connick, 15 F.3d 1338 → Boone v. Vinson, 15 F.3d 1338
Real-world case: An attorney cited "Tinch v. Video Indus. Servs., Inc., 2019 WL 1396975" but that Westlaw citation corresponds to Dolberry v. Jakob, 2019 WL 1396975 instead.
Incorrect pincite
Injected: 830 F.3d at 514 → 830 F.3d at 511
Public repositories do not include official reporter pagination, which originates in commercial publisher formatting. This makes pincite verification infeasible without a commercial database subscription, explaining why all models fail on this category.
Verbatim misquote
Injected: "[T]he parties' dispute over arbitrability specifically falls within those carve-outs." → "[T]he parties' dispute over arbitrability specifically resides within those exclusions."
Real-world case: In Harris v. Take-Two Interactive Software, a plaintiff cited Chambers v. NASCO, Inc. with a quotation that "courts have 'the inherent power to police themselves and to sanction bad-faith litigation conduct.'" This quote was not found in the cited decision.
Content misrepresentation
Injected: Original holding: "These rules cannot support a claim for retaliatory discharge." → Hallucinated: "Any private policies cannot support a claim for retaliatory discharge under Kansas law."
Real-world case: In Jakes v. Youngblood, a defendant's attorney wrote that a Pennsylvania Superior Court case "emphasized that even where statements are embarrassing or upsetting, the Plaintiff must demonstrate their precise defamatory content and origin." The cited case contains no such opinion.
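One way to encode the taxonomy and the one-type-per-citation rule is sketched below. The category names are taken from the per-category results table; the `CitationLabel` record and its field names are hypothetical and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HallucinationType(Enum):
    NONEXISTENT_CITATION = "Non-existent citation"
    CASE_NAME_MISMATCH = "Case name mismatch"
    INCORRECT_PINCITE = "Incorrect pincite"
    VERBATIM_MISQUOTE = "Verbatim misquote"
    CONTENT_MISREPRESENTATION = "Content misrepresentation"


@dataclass(frozen=True)
class CitationLabel:
    original: str                             # text as it appears in the source brief
    injected: Optional[str] = None            # altered text, if any
    kind: Optional[HallucinationType] = None  # at most one type per citation

    def __post_init__(self):
        # An injected edit and its type must be present or absent together.
        if (self.injected is None) != (self.kind is None):
            raise ValueError("injected text and hallucination type must be set together")

    @property
    def is_hallucinated(self):
        return self.kind is not None


# The pincite example from the taxonomy above.
example = CitationLabel(
    original="830 F.3d at 514",
    injected="830 F.3d at 511",
    kind=HallucinationType.INCORRECT_PINCITE,
)
print(example.kind.value, example.is_hallucinated)
```

Storing a single optional `kind` rather than a set of types mirrors the dataset's choice to inject one hallucination type per citation, even though the categories themselves are not mutually exclusive.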
We evaluate five models in agentic and non-agentic settings using a custom harness with access to CourtListener search, local opinion retrieval, and open web search. We adapt the BOED agent from Zheng et al. (2026) as our agentic framework. The BOED agent maintains an explicit, language-based belief state that is updated after each action. We adopt this framework because case citation verification is inherently sequential and information-dependent: the agent must extract citations from the brief excerpts, decide how to gather information, and judge when it has gathered sufficient evidence to make a hallucination determination. We set a maximum of 30 steps per episode.
All models are evaluated on the 390-example test set. Temperature is 0.8 for all models; the maximum is 8,192 tokens (10,000 for GPT-5).
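The control flow described above (a belief state updated after each action, with a 30-step cap) can be sketched as a plain loop. This is not the BOED agent itself: `run_episode`, `toy_policy`, and `toy_tools` are hypothetical stand-ins, the belief update simply appends each observation, and a real system would have a language model both choose actions and rewrite the belief.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 30  # per-episode cap stated in the text


@dataclass
class Episode:
    belief: str                                   # language-based belief state
    history: List[Tuple[str, str]] = field(default_factory=list)
    verdict: str = "undetermined"


def run_episode(excerpt: str,
                policy: Callable[[str], Tuple[str, str]],
                tools: Dict[str, Callable[[str], str]],
                max_steps: int = MAX_STEPS) -> Episode:
    ep = Episode(belief=f"Unverified excerpt: {excerpt}")
    for _ in range(max_steps):
        action, arg = policy(ep.belief)           # decide what to do next
        if action == "finish":                    # enough evidence gathered
            ep.verdict = arg
            break
        observation = tools[action](arg)          # call the chosen tool
        ep.history.append((action, observation))
        # Assumption: append the observation; an LLM would rewrite the belief.
        ep.belief = f"{ep.belief}\n[{action}] {observation}"
    return ep


def toy_policy(belief: str) -> Tuple[str, str]:
    """Stand-in policy: search once, then decide from the belief text."""
    if "[search]" not in belief:
        return ("search", belief.splitlines()[0])
    return ("finish", "hallucinated" if "no match" in belief else "supported")


# Stand-in tool that never finds anything.
toy_tools = {"search": lambda query: "no match"}

episode = run_episode("placeholder citation", toy_policy, toy_tools)
print(episode.verdict, len(episode.history))
```

If the policy never returns `finish`, the loop exits at the step cap and the verdict stays `undetermined`; how the actual harness scores such episodes is not stated in the source.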
| Model | Agentic Precision | Agentic Recall | Agentic F1 | Non-Agentic Precision | Non-Agentic Recall | Non-Agentic F1 |
|---|---|---|---|---|---|---|
| GPT-5 | 47.6 | 82.8 | 60.5 | 42.9 | 63.1 | 51.1 |
| Qwen3.5-27B | 27.8 | 59.8 | 38.0 | 28.1 | 38.7 | 32.6 |
| GPT-OSS 120B | 24.5 | 57.7 | 34.4 | 17.5 | 33.9 | 23.1 |
| Gemini 2.5 Flash | 20.2 | 65.3 | 30.9 | 27.8 | 45.5 | 34.5 |
| Qwen3-8B | 15.1 | 47.1 | 22.9 | 15.6 | 35.5 | 21.6 |
Values are percentages (±95% CI reported in paper). GPT-5 averages 16.9 tool steps per episode; Gemini 2.5 Flash averages 10.9.
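For readers checking the table, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from a few rows; because the published precision and recall are rounded to one decimal, the recomputed value can differ from the reported F1 in the last digit, so the comparison uses a 0.1 tolerance (an assumption of this sketch, not a statement about how the paper computed its scores).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (model and setting, precision, recall, reported F1), copied from the table above.
rows = [
    ("GPT-5, agentic", 47.6, 82.8, 60.5),
    ("GPT-5, non-agentic", 42.9, 63.1, 51.1),
    ("Gemini 2.5 Flash, agentic", 20.2, 65.3, 30.9),
    ("Qwen3-8B, non-agentic", 15.6, 35.5, 21.6),
]

for name, precision, recall, reported in rows:
    recomputed = f1_score(precision, recall)
    print(f"{name}: recomputed {recomputed:.2f}, reported {reported}")
```

The gap between high recall and low precision is what holds F1 down: in the agentic setting GPT-5 flags most injected errors but more than half of its flags are false alarms.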
Agentic recall (%) broken down by the five hallucination categories. Incorrect pincite is consistently the hardest across all models — a structural barrier, not a model capability issue.
| Hallucination Type | GPT-5 | Qwen3.5-27B | GPT-OSS 120B | Gemini 2.5 Flash | Qwen3-8B |
|---|---|---|---|---|---|
| Non-existent citation | 100.0 | 96.8 | 90.3 | 90.3 | 58.1 |
| Case name mismatch | 97.1 | 94.1 | 83.8 | 85.3 | 52.9 |
| Content misrepresentation | 84.0 | 49.6 | 53.4 | 55.7 | 54.2 |
| Verbatim misquote | 82.6 | 32.6 | 39.1 | 67.4 | 32.6 |
| Incorrect pincite | 18.2 | 25.5 | 21.8 | 16.4 | 16.4 |
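GPT-5 detects over 80% of content-based hallucinations (84.0% of content misrepresentations and 82.6% of verbatim misquotes). The other models reach roughly 50 to 56% on content misrepresentation and 33 to 67% on verbatim misquotes.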
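Per-category recall of the kind reported above can be computed by grouping injected hallucinations by type and counting how many the model flagged. The sketch below assumes results arrive as `(category, detected)` pairs, which is a hypothetical format, and uses placeholder data rather than benchmark outputs.

```python
from collections import defaultdict


def recall_by_category(results):
    """Return {category: recall in percent} from (category, detected) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, detected in results:
        totals[category] += 1
        if detected:
            hits[category] += 1
    return {category: 100.0 * hits[category] / totals[category] for category in totals}


# Placeholder outcomes for illustration only.
toy_results = [
    ("Incorrect pincite", False),
    ("Incorrect pincite", True),
    ("Verbatim misquote", True),
    ("Verbatim misquote", True),
]
print(recall_by_category(toy_results))
```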
GPT-5 devotes 46.6% of its actions to local opinion search (highest of any model) and 8.7% to open web search as a fallback for citations absent from CourtListener. It has the lowest false positive rate on citations not found in CourtListener — weaker models often treat absence from CourtListener as evidence of hallucination.
| Model | Avg Steps | False Positive Rate (citations not on CourtListener) |
|---|---|---|
| GPT-OSS 120B | 17.7 | 43.4% |
| GPT-5 | 16.9 | 24.0% |
| Gemini 2.5 Flash | 10.9 | 58.9% |
| Qwen3.5-27B | 10.2 | 65.9% |
| Qwen3-8B | 7.5 | 51.9% |
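The false positives in this table stem from treating absence from CourtListener as proof of fabrication. The sketch below contrasts that naive rule with a more cautious one that defers judgment until a fallback search has run. Both functions and the three-valued `Verdict` are hypothetical illustrations covering only the existence check; they are not the logic any evaluated model actually follows.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    REAL = "real"
    HALLUCINATED = "hallucinated"
    UNVERIFIED = "unverified"


def naive_verdict(found_on_courtlistener: bool) -> Verdict:
    # Failure mode: a database miss is treated as proof of fabrication.
    return Verdict.REAL if found_on_courtlistener else Verdict.HALLUCINATED


def cautious_verdict(found_on_courtlistener: bool,
                     found_on_web: Optional[bool] = None) -> Verdict:
    if found_on_courtlistener:
        return Verdict.REAL
    if found_on_web is None:
        # Fallback search not yet run: withhold judgment.
        return Verdict.UNVERIFIED
    # Assumption: a miss in both sources is taken as evidence of fabrication.
    return Verdict.REAL if found_on_web else Verdict.HALLUCINATED


# A real case that is missing from CourtListener but found on the open web.
print(naive_verdict(False).value)
print(cautious_verdict(False, found_on_web=True).value)
```

Even the cautious rule can be wrong when a genuine opinion appears in neither source, which is one reason incomplete coverage in public repositories matters for this task.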
A notable side effect: the agent surfaced over 10 citation errors in pre-LLM briefs submitted to state supreme courts, demonstrating practical value beyond the benchmark.
Our results reveal that some failures are structural rather than purely technical. The most consequential barrier is access to legal data.