To prepare for oral arguments, attorneys practice in moot courts — simulated hearings where experienced partners play the role of judges. We examine whether AI can be a suitable moot court practice partner by evaluating its justice-specific questioning on a two-layer evaluation framework for realism and pedagogical usefulness. We find that models achieve surprising realism but exhibit low question diversity and sycophancy. Our evaluation framework accurately captures abilities and blind spots of current AI-assisted moot courts, which would not be found by naïve evaluation approaches.
Princeton University · Stanford University · *Equal contribution · CSLAW 2026
In oral arguments, appellate justices probe attorneys with targeted questions about their legal argument: its merits, shortcomings, and downstream implications. Some attorneys have the resources to practice for oral arguments in moot courts, mock hearings where their colleagues (sometimes former judges) play justices and question them, while others, like Dale Ho in the clip below, rely on more primitive methods – flashcards and practicing solo.
As AI models become better at legal tasks, we envision a world where, instead of offloading attorney work and cognitive decision-making to AI, we use it as a tool to upskill attorneys. In oral arguments, AI could assist those who currently train solo via pedagogically rigorous and accurate justice-questioning simulations. And for those not training solo, as attorney Neal Katyal explains in a recent TED Talk about his preparation for Learning Resources v. Trump, AI can augment a team by being a "relentless" legal coach who "predict[s] the contours" of the argument to be faced.
First, though, we need to know whether current models are good at simulating justice questioning, which brings us to our central question:
How good are current AI systems at simulating justice-specific questioning?
We create an evaluation suite by asking “what are the characteristics of a good moot court question?” Broadly, moot court questions ought to be realistic to the hearing format, and pedagogically useful (i.e. challenging, logically sound, based in doctrine). So, we structure our two-tiered evaluation framework along the axes of realism and pedagogical usefulness.
However, evaluation is not straightforward. Oral argument simulation is complex, open-ended, and sensitive to (very long) contexts. So, under our two-tiered framework, we construct and evaluate 20 metrics assessing both realism and pedagogical usefulness. We test prompt-based and agentic simulators across 5 frontier models. For data, we use Supreme Court oral argument transcripts because they are widely available, high quality, and a canonical representation of the task. The results are promising, but also reveal critical blind spots.
Pairwise preference judgments from law students and researchers. Models listed above the Ground Truth row outperform the real justices' questions.
| Model | Wins | Losses | Ties | Win Rate |
|---|---|---|---|---|
| Gemini-2.5-Pro (Agent) | 72 | 31 | 25 | 55.6% |
| Llama-3.3-70B (Prompt) | 66 | 37 | 34 | 54.6% |
| GPT-4o (Prompt) | 62 | 46 | 31 | 51.0% |
| Gemini-2.5-Pro (Prompt) | 62 | 41 | 26 | 49.3% |
| Ground Truth | 46 | 55 | 33 | 41.1% |
| GPT-4o (Agent) | 45 | 52 | 32 | 40.1% |
| Qwen3-32B (Prompt) | 42 | 58 | 24 | 35.5% |
| gpt-oss-120b (Prompt) | 36 | 60 | 28 | 32.9% |
| gpt-oss-120b (Agent) | 24 | 75 | 17 | 21.4% |
Win rates are weighted (ties = 0.5). 152 total matches from law students and graduate researchers in a blind arena evaluation.
ISSUE-BROAD: any aspect addressed. ISSUE-NARROW: all subcomponents covered. Evaluated on 30 transcript sections.
| Model | Broad | Narrow |
|---|---|---|
| gpt-oss-120b (Default) | 64.1% | 41.3% |
| Gemini-2.5-Pro (Default) | 63.7% | 25.0% |
| Qwen3-32B (Default) | 62.2% | 30.2% |
| Llama-3.3-70B (Default) | 62.1% | 22.6% |
| Gemini-2.5-Pro (Agent) | 62.0% | 26.3% |
| gpt-oss-120b (Agent) | 59.4% | 32.1% |
| GPT-4o (Agent) | 56.4% | 23.7% |
| GPT-4o (Default) | 54.7% | 20.8% |
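Agentic simulators with docket search tools do not uniformly improve issue coverage over their prompt-only counterparts: the agent variant scores higher on both measures for GPT-4o and on narrow coverage for Gemini-2.5-Pro, but lower on both measures for gpt-oss-120b.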
For each case, we extract the substantive legal issues from the full transcript, then check whether each simulator's questions address them.
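As a rough illustration of how the two coverage scores could be computed once the issues are extracted, here is a minimal sketch. It assumes each issue is broken into named subcomponents and that some judge (an annotator or an LLM, not shown) has already marked each subcomponent as addressed or not. The `Issue` schema, the example issues, and the `issue_coverage` helper are hypothetical and are not taken from the evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """One substantive legal issue and its subcomponents (hypothetical schema)."""
    name: str
    # Maps each subcomponent to whether any simulated question addressed it.
    # These booleans are assumed to come from an external judge.
    addressed: dict[str, bool]


def issue_coverage(issues: list[Issue]) -> tuple[float, float]:
    """Return (broad, narrow) coverage as fractions in [0, 1].

    broad:  share of issues with at least one subcomponent addressed.
    narrow: share of issues with every subcomponent addressed.
    """
    if not issues:
        raise ValueError("need at least one issue")
    broad = sum(any(i.addressed.values()) for i in issues)
    narrow = sum(bool(i.addressed) and all(i.addressed.values()) for i in issues)
    return broad / len(issues), narrow / len(issues)


# Toy data, invented for illustration only.
issues = [
    Issue("standing", {"injury": True, "redressability": True}),
    Issue("statutory text", {"plain meaning": True, "canons": False}),
    Issue("remedy", {"scope": False, "severability": False}),
    Issue("precedent", {"stare decisis": True}),
]
broad, narrow = issue_coverage(issues)
print(f"broad={broad:.2f} narrow={narrow:.2f}")  # broad=0.75 narrow=0.50
```

Under this reading, narrow coverage can never exceed broad coverage, which matches the ordering of the two columns in the table.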
Jensen-Shannon divergence from the ground-truth distribution of question types (lower = closer to real justices, i.e. more realistically diverse):
| Model | LegalBench | Stetson | MetaCog |
|---|---|---|---|
| Llama-3.3-70B (Default) | 0.122 | 0.134 | 0.080 |
| Qwen3-32B (Default) | 0.131 | 0.153 | 0.149 |
| Gemini-2.5-Pro (Agent) | 0.072 | 0.095 | 0.059 |
| Gemini-2.5-Pro (Default) | 0.087 | 0.109 | 0.072 |
| gpt-oss-120b (Agent) | 0.036 | 0.189 | 0.129 |
| gpt-oss-120b (Default) | 0.061 | 0.171 | 0.180 |
| GPT-4o (Agent) | 0.124 | 0.179 | 0.084 |
| GPT-4o (Default) | 0.069 | 0.140 | 0.089 |
No single model dominates across all taxonomies. The three classification schemes assess question types from different legal perspectives.
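For readers unfamiliar with the metric, the sketch below shows one way to compute a Jensen-Shannon divergence between the question-type distributions of real and simulated justices. It assumes every question has already been assigned a single category label under one taxonomy. The category names and labels here are invented, and the base-2 logarithm (which bounds the score between 0 and 1) is an assumption, since the log base used for the table is not stated.

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Turn a list of category labels into a probability vector."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical taxonomy and labels, for illustration only.
categories = ["clarification", "hypothetical", "challenge"]
real = ["challenge", "hypothetical", "clarification", "challenge"]
simulated = ["challenge", "challenge", "challenge", "clarification"]

score = js_divergence(distribution(real, categories),
                      distribution(simulated, categories))
print(round(score, 3))
```

A simulator that asks only one type of question is penalized here even if each individual question is good, which is why this metric surfaces low diversity that per-question ratings miss.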
Here, we plot the distribution shapes and show the different classification categories:
Three test types, 50 samples each. A real justice would push back on virtually all of these violations.
| Model | Decorum | Rage Bait | Switching Sides |
|---|---|---|---|
| Gemini-2.5-Pro | 36.0% | 6.0% | 0.7% |
| Llama-3.3-70B | 30.0% | 0.7% | 0.7% |
| Gemini-2.5-Pro (Agent) | 28.0% | 8.0% | 6.0% |
| GPT-4o | 8.7% | 0.7% | 0.0% |
| Qwen3-32B | 4.0% | 0.0% | 0.0% |
| gpt-oss-120b | 2.7% | 0.0% | 0.0% |
| gpt-oss-120b (Agent) | 0.0% | 0.0% | 0.0% |
| GPT-4o (Agent) | 0.0% | 0.0% | 0.0% |
Over-alignment causes models to accept provocative or absurd behavior. This is the most significant gap between simulated and real oral arguments.
In A.J.T. v. Osseo Area Schools (April 2025), advocate Lisa Blatt accused opposing counsel of “lying.” Justice Gorsuch immediately pushed back — and spent several minutes reading from Blatt's own briefs to force a retraction. No current AI simulator would do this.
10 fallacy types embedded in advocate responses. The MOOT_COURT prompt consistently improves detection.
| Fallacy Type | Best Model | Caught |
|---|---|---|
| Sufficient vs. Necessary | Llama / Qwen / Gemini / gpt-oss-120b Agent | 100.0% |
| Exclusivity Flaw | Qwen3-32B | 100.0% |
| Factual — Legal | Qwen3-32B / gpt-oss-120b | 100.0% |
| Ignoring Justice | Llama-3.3-70B | 100.0% |
| Comparison Fallacy | Llama-3.3-70B / Gemini Agent | 87.5% |
| Factual — General | Gemini-2.5-Pro | 87.5% |
| Misstating Justice | Llama-3.3-70B / Gemini Agent / gpt-oss-120b | 87.5% |
| Correlation vs. Causation | Gemini-2.5-Pro / Gemini Agent | 75.0% |
| Sampling Flaw | Gemini-2.5-Pro | 75.0% |
| Numbers Flaw | Gemini-2.5-Pro | 50.0% |
The MOOT_COURT prompt, which explicitly instructs models to nitpick logical errors, yields consistent improvement across all base models.
Five models (Llama-3.3-70B, Qwen3-32B, Gemini-2.5-Pro, GPT-4o, gpt-oss-120b) with three prompting strategies varying the context provided about the justice being simulated.
SCOTUS_DEFAULT – Minimal context. The model is told to act as a named Supreme Court justice.
Example
You are Supreme Court Justice Sonia Sotomayor. You are currently in a Supreme Court oral argument with the following case. Your remark should flow naturally within the context you've been given and should be consistent with your style of statutory interpretation and known politics. What matters most is that you fully flesh out an advocate's argument.
SCOTUS_PROFILE – Adds a hand-crafted profile of the justice's judicial philosophy and political leanings.
Example
You are Supreme Court Justice Amy Coney Barrett.
Justice Barrett is a constitutional originalist and a member of the conservative bloc of the Court. She believes (1) that "the meaning of the constitutional text is fixed at the time of its ratification"; and (2) that the "historical meaning of the text" is legally significant and generally "authoritative." Under this view, the "original public meaning" of a constitutional provision is "the law." Judge Barrett could be viewed as sometimes embracing a more pragmatic approach to textualism.
You are currently in a Supreme Court oral argument with the following case. Your remark should flow naturally within the context you've been given and should be consistent with your style of statutory interpretation and known politics. What matters most is that you fully flesh out an advocate's argument.
MOOT_COURT – Frames the simulation as judging the National Moot Court Competition. Explicitly instructs the model to challenge students and identify logical errors.
Example
You are Supreme Court Justice Clarence Thomas judging the finals of the National Moot Court Competition.
Justice Thomas is a textualist who makes up part of the Court's conservative bloc. He takes a "liberal originalist" approach to civil rights issues, particularly affirmative action, and a "conservative originalist" approach to civil liberties issues, such as abortion. Liberal originalism embraces the broad principles of the Declaration of Independence, such as the natural law ideal of equality; conservative originalism relies on the Framers' specific language and intent.
Top 3Ls from the best law schools are currently arguing before you over the following case. These are some of the best students and you want to challenge them to do better. What matters most is that you humble them by asking very difficult questions. You want to call out even the smallest logical errors now so that they can succeed in the future.
Three reasoning models (GPT-4o, gpt-oss-120b, Gemini-2.5-Pro) enhanced with tool access. Before generating each question, the agent can search case materials, look up justice profiles, and reason step-by-step (up to 10 actions per turn).
THINK – Reason about the oral argument history and plan the next question.
CLOSED_WORLD_SEARCH – Search case docket files from supremecourt.gov (2017–2024).
JUSTICE_PROFILE – Look up voting patterns and political affiliations from the Supreme Court Database.
PROVIDE_FINAL_RESPONSE – Output the simulated justice remark.
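The four actions and the 10-action budget suggest a simple control loop, sketched below. This is only an illustration of the idea, not the actual agent: the `policy` callable stands in for the language model, the tool functions are stubs, and the rule that the final permitted action must be PROVIDE_FINAL_RESPONSE is an assumption about how the budget is enforced.

```python
from enum import Enum


class Action(Enum):
    THINK = "THINK"
    CLOSED_WORLD_SEARCH = "CLOSED_WORLD_SEARCH"
    JUSTICE_PROFILE = "JUSTICE_PROFILE"
    PROVIDE_FINAL_RESPONSE = "PROVIDE_FINAL_RESPONSE"


MAX_ACTIONS = 10  # per-turn budget described above


def run_turn(policy, tools, max_actions=MAX_ACTIONS):
    """Run one agent turn and return the simulated justice remark.

    policy(scratchpad, must_finish) -> (Action, str) is a hypothetical
    stand-in for the model choosing its next action and argument.
    """
    scratchpad = []  # (action, argument, observation) triples seen so far
    for step in range(max_actions):
        must_finish = step == max_actions - 1
        action, arg = policy(scratchpad, must_finish)
        if action is Action.PROVIDE_FINAL_RESPONSE:
            return arg
        if must_finish:
            raise RuntimeError("budget exhausted without a final response")
        # THINK has no tool; the reasoning text is simply recorded.
        observation = "" if action is Action.THINK else tools[action](arg)
        scratchpad.append((action, arg, observation))
    raise RuntimeError("max_actions must be at least 1")


# Stub tools and a scripted policy, for illustration only.
tools = {
    Action.CLOSED_WORLD_SEARCH: lambda q: f"[docket excerpts matching {q!r}]",
    Action.JUSTICE_PROFILE: lambda name: f"[profile for {name}]",
}
script = [
    (Action.THINK, "The advocate has not stated a limiting principle."),
    (Action.CLOSED_WORLD_SEARCH, "limiting principle"),
    (Action.PROVIDE_FINAL_RESPONSE, "Counsel, what is your limiting principle?"),
]


def scripted_policy(scratchpad, must_finish):
    return script[-1] if must_finish else script[len(scratchpad)]


print(run_turn(scripted_policy, tools))
```

Keeping the scratchpad as explicit triples makes it easy to inspect afterwards which searches or profile lookups preceded a given question.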
Our two-layer framework uses 20 metrics across realism and pedagogical usefulness. Each metric targets a specific quality that effective oral argument simulation should exhibit.
Our test set draws from U.S. Supreme Court oral argument transcripts accessed via the Oyez API, focusing on cases from the 2024 term.
Human annotations were collected through a custom Gradio-based arena interface where law students and graduate researchers compared simulated and real justice responses in blind pairwise matchups. The full annotation dataset — including preference judgments and quality assessments across 9 metric dimensions — is available on Hugging Face.
@inproceedings{zhang2026ai,
  title={AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments},
  author={Zhang, Kylie and Nadeem, Nimra and Zheng, Lucia and Stammbach, Dominik and Henderson, Peter},
  booktitle={Symposium on Computer Science and Law (CSLAW)},
  year={2026}
}
The authors thank Dan Bateyko, Zirui Cheng, Lucy He, Michel Liao, Patty Liu, Max Gonzalez Saez-Diez, and Zeyu Shen for their contributions. This work was funded by a Princeton Language+Intelligence grant and the Schmidt Science Humanities and Virtual Institute Grant.