AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments

The Pitch

In oral arguments, appellate justices probe attorneys with targeted questions about their legal argument: its merits, shortcomings, and downstream implications. Some attorneys have the resources to practice for oral arguments in moot courts, mock hearings where their colleagues (sometimes former judges) play justices and question them, while others, like Dale Ho in the clip below, rely on more primitive methods – flashcards and practicing solo.

In The Fight, ACLU attorney (now federal judge) Dale Ho prepares for oral arguments with flashcards against a mirror.

As AI models become better at legal tasks, we envision a world where, instead of offloading attorney work and cognitive decision-making to AI, we use it as a tool to upskill attorneys. In oral arguments, AI could assist those who currently train solo through pedagogically rigorous and accurate simulations of justice questioning. And for those not training solo, as attorney Neal Katyal explains in a recent TED Talk about his preparation for Learning Resources v. Trump, AI can augment a team by being a "relentless" legal coach who "predict[s] the contours" of the argument to be faced.

First, though, we need to know whether current models are good at simulating justice questioning, which brings us to our central question:

How good are current AI systems at simulating justice-specific questioning?

Thomas

Sotomayor

Roberts

Kagan

Gorsuch

Alito

Kavanaugh

Barrett

Jackson

We create an evaluation suite by asking “what are the characteristics of a good moot court question?” Broadly, moot court questions ought to be realistic to the hearing format, and pedagogically useful (i.e. challenging, logically sound, based in doctrine). So, we structure our two-tiered evaluation framework along the axes of realism and pedagogical usefulness.

However, evaluation is not straightforward. Oral argument simulation is complex, open-ended, and sensitive to (very long) contexts. So, under our two-tiered framework, we construct and evaluate 20 metrics assessing both realism and pedagogical usefulness. We test prompt-based and agentic simulators across 5 frontier models. For data, we use Supreme Court oral argument transcripts because they are widely available, high quality, and a canonical representation of the task. The results are promising, but also reveal critical blind spots.

Start with

SCOTUS Transcripts

62 cases, 168 argument sections from the 2024 term. Each includes case facts, legal question, and multi-turn dialogue context.

→

Simulate via
8 Simulator Variants
5 LLMs × 3 prompting strategies (Default, Profile, Moot Court) plus 3 agentic simulators with search and profile tools.

→

Evaluate with

20 Metrics

Adversarial tests, human preferences, issue coverage, question diversity, fallacy detection, and tone analysis.

Key Results

Top simulators achieve higher preference win rates than the questions actual justices asked. In blind pairwise comparisons, human raters preferred simulated questions, which were on the whole more competitive, to the justices’ real questions. This is partly because simulated questions always addressed some argument, while actual justice remarks sometimes targeted courtroom logistics or procedural details.

See human evaluation results

Pairwise preference judgments from law students and researchers. Models listed above the Ground Truth row outperform the justices' actual questions.

Model	Wins	Losses	Ties	Win Rate
Gemini-2.5-Pro (Agent)	72	31	25	55.6%
Llama-3.3-70B (Prompt)	66	37	34	54.6%
GPT-4o (Prompt)	62	46	31	51.0%
Gemini-2.5-Pro (Prompt)	62	41	26	49.3%
Ground Truth	46	55	33	41.1%
GPT-4o (Agent)	45	52	32	40.1%
Qwen3-32B (Prompt)	42	58	24	35.5%
gpt-oss-120b (Prompt)	36	60	28	32.9%
gpt-oss-120b (Agent)	24	75	17	21.4%

Win rates are weighted (ties = 0.5). 152 total matches from law students and graduate researchers in a blind arena evaluation.
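To make the weighting concrete, the sketch below recomputes the Win Rate column for three rows of the table. It assumes the denominator is the 152 total matches rather than each row's own wins + losses + ties; under that assumption the formula reproduces the reported percentages. The function and variable names are illustrative, not taken from the project's code.

```python
# Counts copied from the table above: (wins, losses, ties).
RESULTS = {
    "Gemini-2.5-Pro (Agent)": (72, 31, 25),
    "Llama-3.3-70B (Prompt)": (66, 37, 34),
    "Ground Truth": (46, 55, 33),
}

# Assumption: every win rate is taken over all 152 matches.
TOTAL_MATCHES = 152


def weighted_win_rate(wins: int, ties: int, total: int = TOTAL_MATCHES) -> float:
    """Return the tie-weighted win rate in percent (a tie counts as half a win)."""
    return 100.0 * (wins + 0.5 * ties) / total


for name, (wins, _losses, ties) in RESULTS.items():
    print(f"{name}: {weighted_win_rate(wins, ties):.1f}%")
```

Losses do not enter the formula directly; they only matter through the fixed denominator.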

Models cover most legal issues broadly but miss specific subcomponents. 5 of 8 simulators address over 60% of ground truth issues when we ask if a simulated question addresses any aspect of a legal issue. When we ask if a simulated question addresses all aspects of an issue, the best model covers only 41% of raised issues.

See issue coverage results

ISSUE-BROAD: any aspect addressed. ISSUE-NARROW: all subcomponents covered. Evaluated on 30 transcript sections.

Model	Broad	Narrow
gpt-oss-120b (Default)	64.1%	41.3%
Gemini-2.5-Pro (Default)	63.7%	25.0%
Qwen3-32B (Default)	62.2%	30.2%
Llama-3.3-70B (Default)	62.1%	22.6%
Gemini-2.5-Pro (Agent)	62.0%	26.3%
gpt-oss-120b (Agent)	59.4%	32.1%
GPT-4o (Agent)	56.4%	23.7%
GPT-4o (Default)	54.7%	20.8%

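Agentic simulators with docket search tools do not uniformly improve issue coverage over their prompt-only counterparts: the Gemini-2.5-Pro and GPT-4o agents score higher on narrow coverage, while the gpt-oss-120b agent scores lower on both measures.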
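The difference between the two coverage measures can be sketched as follows. The example assumes each ground-truth issue has already been split into subcomponents and that something (a human annotator or a model judge, not specified here) has marked whether any simulated question addressed each one. The issue names and flags are invented for illustration.

```python
from typing import Dict, List


def issue_coverage(issues: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compute broad and narrow coverage.

    `issues` maps an issue name to one flag per subcomponent; a flag is True
    when at least one simulated question addressed that subcomponent.
    """
    if not issues:
        raise ValueError("need at least one ground-truth issue")
    # ISSUE-BROAD: any subcomponent of the issue was addressed.
    broad = sum(1 for flags in issues.values() if any(flags))
    # ISSUE-NARROW: every subcomponent of the issue was addressed.
    narrow = sum(1 for flags in issues.values() if flags and all(flags))
    total = len(issues)
    return {"broad": broad / total, "narrow": narrow / total}


# Hypothetical annotations for one transcript section.
example = {
    "standing": [True, True],
    "statutory text": [True, False, False],
    "remedy": [False, False],
    "precedent": [True],
}
scores = issue_coverage(example)
print(scores)
```

In this toy case three of four issues are touched at all, but only two are covered in full, which mirrors the gap between the Broad and Narrow columns.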

Explore per-case issue coverage

For each case, we extract the substantive legal issues from the full transcript, then check whether each simulator's questions address them.


The diversity of generated question types is a major weakness. Simulated question distributions concentrate on 1–2 question types under each taxonomy, while real question distributions spread across many categories.

See question diversity results

Jensen-Shannon Divergence from ground truth distribution (lower = more diverse, closer to real justices):

Model	LegalBench	Stetson	MetaCog
Llama-3.3-70B (Default)	0.122	0.134	0.080
Qwen3-32B (Default)	0.131	0.153	0.149
Gemini-2.5-Pro (Agent)	0.072	0.095	0.059
Gemini-2.5-Pro (Default)	0.087	0.109	0.072
gpt-oss-120b (Agent)	0.036	0.189	0.129
gpt-oss-120b (Default)	0.061	0.171	0.180
GPT-4o (Agent)	0.124	0.179	0.084
GPT-4o (Default)	0.069	0.140	0.089

No single model dominates across all taxonomies. The three classification schemes assess question types from different legal perspectives.
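For readers unfamiliar with the metric, here is a minimal sketch of Jensen-Shannon divergence between a real and a simulated question-type distribution. The log base (2, which bounds the value to [0, 1]) and the use of the divergence rather than its square root are assumptions, since the text does not state them. The category names are hypothetical and do not come from LegalBench, Stetson, or MetaCog.

```python
import math
from collections import Counter
from typing import Iterable, List


def distribution(labels: Iterable[str], categories: List[str]) -> List[float]:
    """Turn a list of question-type labels into a probability vector."""
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no labels fall in the given categories")
    return [counts[c] / total for c in categories]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical question-type categories and labels.
CATEGORIES = ["clarification", "hypothetical", "precedent", "policy"]
real = ["clarification", "hypothetical", "precedent", "policy"] * 5
simulated = ["hypothetical"] * 17 + ["precedent"] * 3

p = distribution(real, CATEGORIES)
q = distribution(simulated, CATEGORIES)
print(f"JSD = {js_divergence(p, q):.3f}")
```

A simulator that concentrates on one or two categories, as in the toy `simulated` list, sits further from the spread-out real distribution and so scores a higher divergence.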

Here, we plot the distribution shapes and show the different classification categories:

Models are sycophantic. This is a common pitfall in educational contexts, but it is especially damaging in moot court simulations, where the task is to assess the argument adversarially, as a court would. In adversarial tests, simulators push back against fewer than 40% of decorum violations and fewer than 10% of rage-bait or switching-sides provocations. Real justices would immediately challenge these (and might even eject the attorney, depending on the severity of the transgression).

See adversarial test results

Three test types, 50 samples each. A real justice would push back on virtually all of these violations.

Model	Decorum	Rage Bait	Switching Sides
Gemini-2.5-Pro	36.0%	6.0%	0.7%
Llama-3.3-70B	30.0%	0.7%	0.7%
Gemini-2.5-Pro (Agent)	28.0%	8.0%	6.0%
GPT-4o	8.7%	0.7%	0.0%
Qwen3-32B	4.0%	0.0%	0.0%
gpt-oss-120b	2.7%	0.0%	0.0%
gpt-oss-120b (Agent)	0.0%	0.0%	0.0%
GPT-4o (Agent)	0.0%	0.0%	0.0%

Over-alignment causes models to accept provocative or absurd behavior. This is the most significant gap between simulated and real oral arguments.

Real justices would immediately challenge such behavior. Hear Justice Gorsuch admonishing an attorney.

In A.J.T. v. Osseo Area Schools (April 2025), advocate Lisa Blatt accused opposing counsel of “lying.” Justice Gorsuch immediately pushed back — and spent several minutes reading from Blatt's own briefs to force a retraction. No current AI simulator would do this.


A.J.T. v. Osseo Area Schools (2025). Audio via Mark Joseph Stern.

Fallacy detection is strong for most types. The best models catch over 80% of fallacies in 7 of 10 categories. However, all models struggle with numerical reasoning (Numbers, Sampling), consistent with known LLM limitations.

See fallacy detection results

10 fallacy types embedded in advocate responses. The MOOT_COURT prompt consistently improves detection.

Fallacy Type	Best Model	Caught
Sufficient vs. Necessary	Llama / Qwen / Gemini / gpt-oss-120b Agent	100.0%
Exclusivity Flaw	Qwen3-32B	100.0%
Factual — Legal	Qwen3-32B / gpt-oss-120b	100.0%
Ignoring Justice	Llama-3.3-70B	100.0%
Comparison Fallacy	Llama-3.3-70B / Gemini Agent	87.5%
Factual — General	Gemini-2.5-Pro	87.5%
Misstating Justice	Llama-3.3-70B / Gemini Agent / gpt-oss-120b	87.5%
Correlation vs. Causation	Gemini-2.5-Pro / Gemini Agent	75.0%
Sampling Flaw	Gemini-2.5-Pro	75.0%
Numbers Flaw	Gemini-2.5-Pro	50.0%

The MOOT_COURT prompt, which explicitly instructs models to nitpick logical errors, yields consistent improvement across all base models.

Simulator Design

Prompt-based Simulators

Five models (Llama-3.3-70B, Qwen3-32B, Gemini-2.5-Pro, GPT-4o, gpt-oss-120b) with three prompting strategies varying the context provided about the justice being simulated.

See prompting strategies

SCOTUS_DEFAULT – Minimal context. The model is told to act as a named Supreme Court justice.

Example

You are Supreme Court Justice Sonia Sotomayor. You are currently in a Supreme Court oral argument with the following case. Your remark should flow naturally within the context you've been given and should be consistent with your style of statutory interpretation and known politics. What matters most is that you fully flesh out an advocate's argument.

SCOTUS_PROFILE – Adds a hand-crafted profile of the justice's judicial philosophy and political leanings.

Example

You are Supreme Court Justice Amy Coney Barrett.

Justice Barrett is a constitutional originalist and a member of the conservative bloc of the Court. She believes (1) that "the meaning of the constitutional text is fixed at the time of its ratification"; and (2) that the "historical meaning of the text" is legally significant and generally "authoritative." Under this view, the "original public meaning" of a constitutional provision is "the law." Judge Barrett could be viewed as sometimes embracing a more pragmatic approach to textualism.

You are currently in a Supreme Court oral argument with the following case. Your remark should flow naturally within the context you've been given and should be consistent with your style of statutory interpretation and known politics. What matters most is that you fully flesh out an advocate's argument.

MOOT_COURT – Frames the simulation as judging the National Moot Court Competition. Explicitly instructs the model to challenge students and identify logical errors.

Example

You are Supreme Court Justice Clarence Thomas judging the finals of the National Moot Court Competition.

Justice Thomas is a textualist who makes up part of the Court's conservative bloc. He takes a "liberal originalist" approach to civil rights issues, particularly affirmative action, and a "conservative originalist" approach to civil liberties issues, such as abortion. Liberal originalism embraces the broad principles of the Declaration of Independence, such as the natural law ideal of equality; conservative originalism relies on the Framers' specific language and intent.

Top 3Ls from the best law schools are currently arguing before you over the following case. These are some of the best students and you want to challenge them to do better. What matters most is that you humble them by asking very difficult questions. You want to call out even the smallest logical errors now so that they can succeed in the future.
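The three strategies differ only in how the system prompt is assembled, which the sketch below illustrates. The template text is copied from the examples above; the function name, the `profile` argument, and the way the pieces are joined are assumptions, and the sketch does not show how the case facts and transcript context are appended.

```python
from typing import Optional

# Task framing shared by SCOTUS_DEFAULT and SCOTUS_PROFILE (text from the examples).
ORAL_ARGUMENT_TASK = (
    "You are currently in a Supreme Court oral argument with the following case. "
    "Your remark should flow naturally within the context you've been given and "
    "should be consistent with your style of statutory interpretation and known "
    "politics. What matters most is that you fully flesh out an advocate's argument."
)

# Task framing used by MOOT_COURT (text from the example).
MOOT_COURT_TASK = (
    "Top 3Ls from the best law schools are currently arguing before you over the "
    "following case. These are some of the best students and you want to challenge "
    "them to do better. What matters most is that you humble them by asking very "
    "difficult questions. You want to call out even the smallest logical errors now "
    "so that they can succeed in the future."
)

STRATEGIES = ("SCOTUS_DEFAULT", "SCOTUS_PROFILE", "MOOT_COURT")


def build_system_prompt(strategy: str, justice: str, profile: Optional[str] = None) -> str:
    """Assemble a system prompt for one justice under one prompting strategy."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    if strategy == "SCOTUS_DEFAULT":
        # Minimal context: the justice's name plus the task framing.
        return f"You are Supreme Court Justice {justice}. {ORAL_ARGUMENT_TASK}"
    if profile is None:
        raise ValueError(f"{strategy} requires a hand-crafted justice profile")
    if strategy == "SCOTUS_PROFILE":
        return f"You are Supreme Court Justice {justice}.\n\n{profile}\n\n{ORAL_ARGUMENT_TASK}"
    # MOOT_COURT: competition framing, profile, then the challenge instructions.
    opening = (
        f"You are Supreme Court Justice {justice} judging the finals of the "
        "National Moot Court Competition."
    )
    return f"{opening}\n\n{profile}\n\n{MOOT_COURT_TASK}"


print(build_system_prompt("SCOTUS_DEFAULT", "Sonia Sotomayor"))
```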

Agentic Simulators

Three reasoning models (GPT-4o, gpt-oss-120b, Gemini-2.5-Pro) enhanced with tool access. Before generating each question, the agent can search case materials, look up justice profiles, and reason step-by-step (up to 10 actions per turn).

See agent tools

THINK – Reason about the oral argument history and plan the next question.

CLOSED_WORLD_SEARCH – Search case docket files from supremecourt.gov (2017–2024).

JUSTICE_PROFILE – Look up voting patterns and political affiliations from the Supreme Court Database.

PROVIDE_FINAL_RESPONSE – Output the simulated justice remark.
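The control flow these tools imply can be sketched as a bounded action loop. This is a sketch under stated assumptions, not the project's implementation: `run_turn`, `choose_action`, and the stub tools are hypothetical, the model call is replaced by a scripted policy, the final response is assumed to count toward the 10-action budget, and raising an error when the budget runs out is a placeholder because the text does not say what happens in that case.

```python
from typing import Callable, Dict, List, Tuple

MAX_ACTIONS = 10  # from the text: up to 10 actions per turn

Action = Tuple[str, str]  # (tool name, argument)


def run_turn(
    choose_action: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[str], str]],
    history: List[str],
) -> str:
    """Run one simulated-justice turn and return the final remark."""
    scratchpad = list(history)  # copy so the caller's history is not mutated
    for _ in range(MAX_ACTIONS):
        name, arg = choose_action(scratchpad)
        if name == "PROVIDE_FINAL_RESPONSE":
            return arg
        if name == "THINK":
            # Reasoning is recorded but calls no external tool.
            scratchpad.append(f"[THINK] {arg}")
        elif name in tools:
            scratchpad.append(f"[{name}] {tools[name](arg)}")
        else:
            scratchpad.append(f"[ERROR] unknown tool {name}")
    # Assumption: the source does not specify the behavior on budget exhaustion.
    raise RuntimeError("action budget exhausted without a final response")


# Stub tools standing in for docket search and profile lookup.
stub_tools = {
    "CLOSED_WORLD_SEARCH": lambda query: f"(stub) docket results for {query!r}",
    "JUSTICE_PROFILE": lambda justice: f"(stub) profile for {justice}",
}

# A scripted policy standing in for the language model's tool choices.
script = iter([
    ("THINK", "The advocate has not tied the limit to the statutory text."),
    ("CLOSED_WORLD_SEARCH", "petitioner brief statutory text"),
    ("JUSTICE_PROFILE", "Gorsuch"),
    ("PROVIDE_FINAL_RESPONSE", "Counsel, where in the text do you find that limit?"),
])
remark = run_turn(lambda scratchpad: next(script), stub_tools, ["Advocate: ..."])
print(remark)
```

Keeping the scratchpad as a plain list of strings makes it easy to pass the accumulated reasoning and tool output back to the model on each step.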
