Modern MoE LLMs switch their active expert set at almost every generated token. That forecloses memory optimizations like offloading once the expert pool outgrows GPU capacity. We argue that this is exactly the structure that options with deliberation costs were designed for, and we show that even pretrained MoEs (like gpt-oss-20b) can be cheaply converted into temporally extended ones, dropping switch rates from over 50% to under 5% while retaining most of the base model's accuracy.
Princeton University
Standard MoEs change their active expert set at almost every token. Our option controller learns when to keep the current set and when to switch, governed by a deliberation cost η. The result: switch rates collapse from over 50% to below 5%, while accuracy stays close to the base model — opening the door to memory-efficient serving, temporal chunking for training, and continual expansion of the expert pool.
gpt-oss-20b, with LoRA and a self-distillation reward, learns the trade-off. No new pretraining needed.
Mixture-of-Experts (MoE) layers are everywhere in modern LLMs — Gemini-2.5-Pro, DeepSeek-V3, Qwen3-Next-80B, gpt-oss, and many more. Only a sparse subset of experts is active per token, so a 120B-parameter model like gpt-oss-120b only activates ~5B parameters at a time. In principle, you can keep growing the expert pool while inference compute stays flat.
But once the experts no longer fit on GPU, weights have to live on host memory or disk and be loaded on demand. Every load is latency. Current MoE routers don't account for this cost at all: the active expert set changes at almost every token. We measured this directly across three frontier open-source MoEs, on 1,000 prompts spanning chat, code, math, STEM, and 6 multilingual categories — the switch rate is essentially 1.
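The switch rate we report is simple to measure given a per-token log of the router's top-$\activek$ expert indices. A minimal sketch (function name and toy trace are ours, for illustration; this is not the paper's released code):

```python
# Sketch: fraction of consecutive token positions at which the active
# expert set of one MoE layer changes. `expert_sets` is a per-token log
# of the router's top-k expert indices.

def switch_rate(expert_sets):
    """Fraction of consecutive positions whose active expert set differs."""
    if len(expert_sets) < 2:
        return 0.0
    switches = sum(
        1 for prev, cur in zip(expert_sets, expert_sets[1:])
        if set(prev) != set(cur)
    )
    return switches / (len(expert_sets) - 1)

# Toy trace: top-4 indices for 5 tokens; the set changes at 3 of 4 transitions.
trace = [(0, 3, 7, 9), (0, 3, 7, 9), (1, 3, 7, 9), (1, 2, 7, 9), (1, 2, 5, 9)]
print(switch_rate(trace))  # → 0.75
```

A switch rate of "essentially 1" means nearly every transition in such a trace changes the set.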
Active experts (rows) over generated token positions (columns) on the same prompt. There is essentially no temporal structure: nearly every column is a fresh draw.
Average switch rate across 100 prompts per category (with $\maskk = \activek$, i.e., the model's native top-$\activek$). Mean ± std.
| Model | Chat | Code | Math | STEM | Multi (en) | Multi (de) | Multi (es) | Multi (fr) | Multi (it) | Multi (ja) |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt-oss-20b | 0.94 | 0.95 | 0.94 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
| gpt-oss-120b | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 |
| Qwen3-Next-80B | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
If the active expert set were temporally extended, several memory-related optimizations would become natural. For gpt-oss-20b, keeping 16 of 32 experts resident saves ~4.7 GiB (37%) of VRAM; keeping 8 saves ~7.1 GiB (55%).

Choosing when to commit to a set of resources and when to pay the cost of switching is exactly the structure formalized by temporally extended actions in the options framework (Sutton, Precup, Singh, 1999). An agent picks a high-level "option" that persists over many time steps; switching to a new option incurs a deliberation cost (Harb et al., 2018).
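The quoted savings follow from simple arithmetic. Backing out the total from the text's own numbers (dropping 16 of 32 experts saves ~4.7 GiB, so the full expert pool is ~9.4 GiB; that total is our inference, not a measured figure):

```python
# Back-of-envelope check of the quoted VRAM savings for gpt-oss-20b.
# TOTAL_EXPERT_GIB is inferred from the text, not measured here.
TOTAL_EXPERT_GIB = 9.4   # inferred: 24 layers x 32 experts of expert weights
NUM_EXPERTS = 32

def vram_saved_gib(experts_kept):
    """GiB freed by keeping only `experts_kept` of NUM_EXPERTS experts resident."""
    return TOTAL_EXPERT_GIB * (1 - experts_kept / NUM_EXPERTS)

print(vram_saved_gib(16))  # ~4.7 GiB (37% of the model)
print(vram_saved_gib(8))   # ~7.05 GiB, quoted as ~7.1 GiB (55%)
```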
We propose temporally extended Mixture-of-Experts: a small per-layer controller that learns when to switch the expert set and which one to switch to, optimized via the option-critic architecture (Bacon et al., 2017) augmented with a deliberation cost η. Because the switching cost is an explicit term in the objective, the controller automatically discovers temporal structure — it only switches when the expected quality gain justifies the cost.
The intra-option policy is the LLM itself; primitive actions are generated tokens. Critic and termination updates follow A2OC (Harb et al., 2018) with GAE($\lambda$); intra-option policy gradients use Monte Carlo returns.
For each MoE layer $\ell$, the option space is the set of binary expert masks with exactly $\maskk$ active experts: $\Omega^{(\ell)} = \{\, \omega \in \{0,1\}^{E} : \textstyle\sum_j \omega_j = \maskk \,\}$, where $E$ is the number of experts in the layer.
The router is constrained to pick its top-$\activek$ experts only from the active mask $\omega^{(\ell)}_t$. The mask persists across tokens until a termination decision $d^{(\ell)}_t = 1$ triggers selection of a new mask. We factorize across layers for tractability: each layer has its own controller conditioning on its hidden state and current mask.
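Mask-constrained routing is a one-line change to standard top-$\activek$ routing: experts outside the mask get $-\infty$ logits before the top-$k$ selection, and the gates are renormalized over the chosen $k$. A minimal NumPy sketch (our naming, not the released code):

```python
import numpy as np

def masked_topk_route(router_logits, omega, k):
    """Pick top-k experts only from the active mask omega; renormalize gates.
    router_logits: (E,) float; omega: (E,) in {0, 1}; returns (indices, gates)."""
    masked = np.where(omega.astype(bool), router_logits, -np.inf)
    idx = np.argsort(masked)[-k:][::-1]          # top-k experts inside the mask
    w = np.exp(masked[idx] - masked[idx].max())  # stable softmax over the k picks
    return idx, w / w.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=32)
omega = np.zeros(32)
omega[:16] = 1                                   # persistent mask: first 16 experts
idx, gates = masked_topk_route(logits, omega, k=4)
assert all(i < 16 for i in idx)                  # top-4 drawn only from the mask
assert abs(gates.sum() - 1.0) < 1e-9
```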
Each MoE layer gets its own controller, operating on the pre-MLP hidden state $h^{(\ell)}_t$ and the current mask $\omega^{(\ell)}_{t-1}$. It has four heads: a termination head (producing $d^{(\ell)}_t$), a selection head over masks, and two critic heads estimating $Q_\Omega$ and $V_\Omega$.
Following on-policy distillation (Gu et al., 2024), we use the per-token reverse KL between the frozen base ("teacher") model and the controller-augmented "student" as the per-token reward: $r_t = \log p_\text{teacher}(x_t \mid x_{<t}) - \log p_\text{student}(x_t \mid x_{<t})$, where $x_t$ is the generated token.
In expectation over the student, $-r_t$ is an unbiased estimator of $\mathrm{KL}(p_\text{student}\,\|\,p_\text{teacher})$. To avoid reward hacking via degenerate outputs, we sample tokens from a mixture $p_\mathrm{mix} = (1-\tau)p_\text{student} + \tau p_\text{teacher}$ with $\tau = 0.2$ and apply approximate importance weights, following MiniLLM (Gu et al., 2024).
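Put together, the per-token sampling-and-reward step looks roughly like the following (names are ours; real next-token distributions would come from the two model forward passes):

```python
import numpy as np

def sample_and_reward(p_student, p_teacher, tau=0.2, rng=None):
    """Draw a token from the tau-mixture of student and teacher, and return
    the reward log p_teacher(x) - log p_student(x) plus an approximate
    importance weight correcting for sampling from the mixture (cf. MiniLLM)."""
    rng = rng or np.random.default_rng(0)
    p_mix = (1 - tau) * p_student + tau * p_teacher
    x = rng.choice(len(p_mix), p=p_mix)
    reward = np.log(p_teacher[x]) - np.log(p_student[x])
    iw = p_student[x] / p_mix[x]
    return x, reward, iw

# Sanity check: if student and teacher agree, reward is 0 and the weight is 1.
p = np.array([0.7, 0.2, 0.1])
_, r, w = sample_and_reward(p, p)
assert abs(r) < 1e-12 and abs(w - 1.0) < 1e-12
```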
Critics are trained against GAE($\lambda$) targets. The termination gradient adds the deliberation cost $\eta$ as a margin, so the option terminates only when $Q_\Omega(s, \omega) - V_\Omega(s) + \eta < 0$. Writing $\beta_\vartheta(s, \omega) = p(d_t = 1 \mid s, \omega)$ for the termination head's probability, the update is

$$\Delta \vartheta \;\propto\; -\,\nabla_\vartheta\, \beta_\vartheta(s, \omega)\,\bigl(Q_\Omega(s, \omega) - V_\Omega(s) + \eta\bigr)$$
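The GAE($\lambda$) targets for the critics follow the standard textbook recursion; a self-contained sketch (generic formulation, not code from this work):

```python
def gae_targets(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE(lambda): advantages and critic regression targets.
    `values` has one extra bootstrap entry for the state after the last reward."""
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # discounted sum
        adv[t] = running
    targets = [a + v for a, v in zip(adv, values[:-1])]
    return adv, targets

# Sanity check: zero reward and a constant value function give zero advantage
# when gamma = 1, so targets equal the current value estimates.
adv, tgt = gae_targets([0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], gamma=1.0)
assert adv == [0.0, 0.0, 0.0] and tgt == [1.0, 1.0, 1.0]
```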
Selection gradients are accumulated only at switch positions, using $Q_\Omega - V_\Omega$ as the advantage. Intra-option policy gradients update LoRA adapters on experts and attention (rank 16) and the router weights, using Monte Carlo returns of $r_t$.
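In symbols, the selection update at a switch position $t$ can be sketched as follows (our notation; $\pi_\phi$ denotes the selection head's distribution over masks):

$$\Delta \phi \;\propto\; \nabla_\phi \log \pi_\phi\!\left(\omega^{(\ell)}_t \,\middle|\, h^{(\ell)}_t, \omega^{(\ell)}_{t-1}\right)\bigl(Q_\Omega(s_t, \omega^{(\ell)}_t) - V_\Omega(s_t)\bigr)$$

a standard policy-gradient step with $Q_\Omega - V_\Omega$ as the advantage, accumulated only where $d^{(\ell)}_t = 1$.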
We train one controller per setting on top of gpt-oss-20b (24 layers, 32 experts/layer, top-4 routing) on the Nemotron Post-Training Dataset v2, using 4 × H200 GPUs and a modified version of TRL. We sweep the deliberation cost $\eta \in \{0.02, 0.03, 0.04\}$ and mask size $\maskk \in \{8, 16\}$, and compare against four pruning baselines (Frequency, Reconstruction, Random, Wanda-structured) calibrated on 128 generated responses.
| Benchmark | Base Model | Frequency (pruning) | Reconstruction (pruning) | Random (pruning) | Wanda (pruning) | Ours, η = 0.02 | Ours, η = 0.03 | Ours, η = 0.04 |
|---|---|---|---|---|---|---|---|---|
| MATH | 71.5 | 53.5 | 51.5 | 15.0 | 3.5 | **64.0** | 58.5 | 55.0 |
| switch % | 58.6 | — | — | — | — | 4.1 | 1.3 | 1.2 |
| MMLU | 79.5 | 55.5 | 35.0 | 33.5 | 9.0 | **72.5** | 67.5 | 63.0 |
| switch % | 57.1 | — | — | — | — | 4.2 | 1.3 | 1.2 |
| MMMLU | 67.5 | 42.0 | 48.0 | 24.0 | 7.0 | **59.5** | 56.5 | 49.5 |
| switch % | 54.5 | — | — | — | — | 4.2 | 1.4 | 1.2 |
Accuracy and switch rate (%, mean ± 95% CI) on 200 randomly selected questions from each benchmark. Bold = best column other than the base model. The controller with η = 0.02 keeps switch rates around 4% while losing only a few points relative to the base model, dramatically outperforming all pruning baselines.
With the smaller mask size ($\maskk = 8$), the trade-off becomes steeper, but our controller still substantially outperforms all baselines and pushes switch rates as low as 5%.
| Benchmark | Base | Freq (pruning) | Recon (pruning) | Random (pruning) | Wanda (pruning) | Ours, η = 0.02 | Ours, η = 0.03 | Ours, η = 0.04 |
|---|---|---|---|---|---|---|---|---|
| MATH | 71.5 | 11.5 | 7.5 | 0.0 | 0.0 | 27.5 | 23.0 | 15.5 |
| switch % | 79.0 | — | — | — | — | 9.2 | 7.4 | 5.4 |
| MMLU | 79.5 | 12.5 | 2.5 | 4.0 | 0.0 | 48.5 | 41.0 | 38.0 |
| switch % | 77.4 | — | — | — | — | 8.5 | 7.6 | 5.0 |
| MMMLU | 67.5 | 8.5 | 1.0 | 3.0 | 0.0 | 39.0 | 31.5 | 22.5 |
| switch % | 75.5 | — | — | — | — | 9.0 | 8.0 | 5.4 |
Wanda's structured pruning collapses to 0% on all three benchmarks; the model effectively breaks. Random and reconstruction-based pruning also fall apart. Our controller still reaches 27–48% with switch rates of 5–9%.
Reward (negative reverse KL per token) climbs steadily, while switch rate first drops sharply (as the value networks learn) and then settles to a level governed by η. Larger η converges to a lower switch rate, as expected.
Under our controller, expert masks persist for many tokens at a time. Each row is an expert; each column is a generated token. The animation below alternates between the two regimes — base gpt-oss-20b (chaotic, switching every token) and our trained controller (long flat runs).
Switches at ~94% of token positions. Mask reuse: essentially none.
Same prompt, same model. Switches at ~9% of positions: the active expert set persists for ~100 tokens at a time.
$\maskk = 8$, $\eta = 0.02$



$\maskk = 8$, $\eta = 0.03$



$\maskk = 16$, $\eta = 0.03$



$\maskk = 16$, $\eta = 0.04$



Different layers can have different temporal continuity, but qualitatively all of them switch rarely under the trained controller.
We see this as evidence that even pretrained MoEs can be cheaply converted into temporally extended ones — no large-scale retraining required. As expert counts continue to grow (potentially scaling with available disk rather than GPU memory), the cost of switching will increasingly dominate serving latency. Treating expert loading as a temporally extended decision, with an explicit deliberation cost, may offer a principled handle on this trade-off.
Our framework is compatible with existing caching/prefetching systems like MoE-Infinity: those approaches optimize how to move experts; ours reduces how often they need to move at all.
We view this work as a first step. Promising directions: making temporal extension a first-class objective in MoE pre-training, scaling to much larger expert pools (where GPU residency is fundamentally infeasible), and using the controller's mask continuity as a primitive for continual learning with growing expert sets.
@article{shen2026temoe,
title = {Temporally Extended Mixture-of-Experts Models},
author = {Shen, Zeyu and Henderson, Peter},
year = {2026}
}