DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors

1UC Berkeley,  2UIUC,  3UT Austin,  4UCLA
Async execution with discrete diffusion policies solving dynamic manipulation.

Figure 1: Async Execution with discrete diffusion policies solving dynamic manipulation. Gray rectangles and blocks represent the action chunks and the actions. Yellow and green cubes represent the masked and unmasked action tokens. During each inference cycle, discrete diffusion policies copy the tail of the last action chunk as the committed prefix, and inpaint upon it by simply forwarding itself. Compared with flow-matching-based inpainting that relies on $\Pi$GDM, discrete-diffusion-based inpainting inference is simpler to implement, faster at inference, and better at execution.

TL;DR

Flow-matching-based real-time chunking (RTC) is ill-suited for asynchronous execution due to four critical limitations. By replacing the action head with a discrete diffusion policy, all the aforementioned limitations can be resolved at once. Or, to put it simply: Discrete diffusion policies are natural asynchronous executors. During asynchronous execution, discrete diffusion policy achieves:

0 lines of code
to implement the inpainting-specific inference for async execution
Simpler to Implement
~0.7× Inference Cost
compared with ~1.7× for inference with flow-matching-based RTC
Faster at Inference
+65% Succ. Rate
on real hockey defend vs. flow-matching-based RTC (65% vs. 0%)
Better at Execution

Abstract

Unlike chatbots, physical AI must act while the world keeps evolving. The inter-chunk pause of synchronous executors is fatal for dynamic tasks — regardless of how fast the inference is. Asynchronous execution — thinking while acting — is therefore a structural requirement, and real-time chunking (RTC) makes it viable by recasting chunk transitions as inpainting: freezing committed actions and consistently generating the rest.

However, RTC with a flow-matching policy is structurally suboptimal: its inpainting comes from inference-time corrections rather than the base policy, yielding little pre-training benefit, specific fine-tuning, heuristic guidance weights, and extra computation that inflates the latency.

In this work, we observe that discrete diffusion policies, which generate actions by iteratively unmasking, are natural asynchronous executors that resolve all limitations at once: they are fine-tuning free since inpainting is their native operation, while early stopping further provides adaptive guidance and reduces inference cost.

We propose DiscreteRTC, which replaces external corrections with native unmasking, and show on dynamic simulated benchmarks and real-world dynamic manipulation tasks that it achieves higher success rates than continuous RTC and other baselines — while being simpler to implement, faster at inference, and better at execution.

Flow-Matching is Not Suitable for RTC

RTC was specifically designed for flow-matching policies — the predominant action head in today's state-of-the-art VLAs. Yet, as we systematically show, RTC with a flow-matching head is far from ideal with four critical limitations, all stemming from the same root cause: the inpainting capability comes from inference-time correction (e.g., $\Pi$GDM), not from the base policy.

Continuous RTC (Flow-Matching)

Four Critical Limitations

Figure 2: RTC with a flow-matching head. Color encodes noise level (green = clean, yellow = pure noise).

Inpainting with Flow-matching

\[ \mathbf{A}_t^{\tau+\frac{1}{n}} = \mathbf{A}_t^\tau + \frac{1}{n}\!\left( v_\pi(\mathbf{A}_t^\tau, \mathbf{o}_t, \tau) + \underbrace{\min\!\left(\beta,\,\tfrac{1-\tau}{\tau\,r_\tau^2}\right)\!\left(\mathbf{Y} - f_{\mathbf{A}^1}(\mathbf{A}^\tau)\right)^{\!\top}\!\mathrm{diag}(\mathbf{W})\,\frac{\partial f_{\mathbf{A}^1}}{\partial \mathbf{A}'}\bigg|_{\mathbf{A}^\tau}}_{\Pi\text{GDM gradient correction}} \right) \]
aPre-training without Inpainting

Flow-matching pre-training corrupts every action in a chunk at a single, consistent noise level $\tau$. RTC inference instead starts from an inconsistent chunk — 0 for committed, 1 for new, interpolated in between — a pattern the model has never seen. Therefore, $\Pi$GDM corrections are inevitable and scaling pre-training does not directly improve asynchronous performance.

bFine-tuning Required

Adequate inpainting quality demands a dedicated fine-tuning stage with techniques such as action-suffix conditioning to explicitly introduce the inpainting-specific noise pattern. This adds training complexity, risks interfering with base generation quality, and is especially burdensome for large VLAs already expensive to fine-tune.

cHeuristic Guidance

$\Pi$GDM's soft-masking weights $\mathbf{W}$ follow a hand-crafted exponential-decay schedule, and the clipping threshold $\beta$ is fixed across delays, execution horizons, and tasks — validated only empirically. The schedule cannot adapt to inference-time conditions.

dExtra Inference Cost

Computing the guidance term at every denoising step requires a vector-Jacobian product (VJP), which roughly doubles per-step computation — ironically inflating the very latency RTC was designed to hide.

Discrete Diffusion Policies are Natural Asynchronous Executors

Rather than seeking innovation within the flow-matching paradigm, our key observation is structural: replacing the action head with a discrete diffusion policy resolves all four limitations at once. In discrete diffusion, inpainting is the native operation — given a partially masked token sequence, the policy reconstructs the target chunk by predicting the masked tokens, identical to standard masked generation.

Discrete RTC (Ours)

All Resolved at Once

Figure 3: RTC with a discrete diffusion head. Color encodes masking status (green = unmasked, yellow = masked).

Inpainting with Discrete Diffusion

\[ \mathbf{A}_t^{k+1} = \pi(\mathbf{A}_t^k, \mathbf{o}_t) \]
aInpainting as Pre-training

Discrete diffusion policies are pre-trained to inpaint upon randomly masked sequences — structurally identical to RTC's inference-time pattern. Scaling pre-training (model, data, compute) directly improves asynchronous performance, and the native forward pass suits inference-time inpainting out of the box.

bFine-tuning Free

Inpainting-specific patterns are implicitly introduced during pre-training, making discrete diffusion a fine-tuning-free approach for high-quality, out-of-the-box asynchronous execution — no extra loss term, training stage, or implementation work.

cNatural Guidance

Once a token is unmasked, it already carries a clear action semantic — unlike flow-matching, where the chunk is only valid at $\tau{=}1$. We can early-exit inference once the next $s$ actions after the $d$ committed ones are unmasked; the remaining $H-d-s$ partially-unmasked tokens carry over as a natural and adaptive guidance signal for the next inference, replacing the heuristic fixed schedule.

dLower Inference Cost

With committed tokens from previous chunks, the tokens to unmask per inference shrink to roughly $s/H$ of the original (or at least $1-d/H$). DiscreteRTC reduces inference cost rather than inflating it — or, alternatively, keeps the step budget fixed and produces finer-grained actions at the same compute as from-scratch generation.

Experiments

5.1  Simulated Benchmark: Kinetix

We evaluate DiscreteRTC on Kinetix — a dynamic-task benchmark where added Gaussian actuation noise makes closed-loop corrections critical. Built on the official RTC codebase, the flow-matching and discrete diffusion policies share the same architecture (only the final logits layer differs) and use a trivial $512$-bin action quantization; each datapoint averages 2,048 trials.

Baselines (same architecture, different async strategies)

Three additional variants are introduced under Extended Results below (Training-time Continuous RTC, DiscreteRTC + Fixed Steps, DiscreteVLASH).

Across delays, DiscreteRTC consistently beats ContinuousRTC and other variants on both solve rate and throughput. Three further takeaways from the extended results: (1) DiscreteRTC outperforms Training-time Continuous RTC despite being fine-tuning free; (2) at the same compute, DiscreteRTC + Fixed Steps lifts performance via finer-grained actions; (3) DiscreteVLASH stabilizes performance across delays at a small low-delay cost — DiscreteRTC composes cleanly with advanced inference-time methods.

5.2  Real-World Results

We validate DiscreteRTC on a UR5e + Robotiq gripper with a wrist-mounted RGB camera on a single RTX 4090. All policies share a Qwen2.5-VL-3B-Instruct VLM backbone with a layerwise cross-attention DiT action head (StarVLA). Three reactiveness-stress tasks — Dynamic Pick (moving object), Dynamic Place (moving platform), and Hockey Defend (defending a high-speed puck) — are each evaluated over 20 trials. Pick & Place run closed-loop at 20 Hz, while Hockey Defend runs at 50 Hz with 2× linear interpolation to a 100 Hz servo stream to meet its far tighter reactiveness budget.

Air Hockey Defend

Hockey Defend is the most demanding stress test: the robot must block a high-speed puck launched by an opponent policy, which requires both fast inference and the smooth, uninterrupted execution that only asynchronous control provides. Here the synchronous baselines and flow-matching ContinuousRTC completely fail (0%) — the inter-chunk pause and $\Pi$GDM overhead are fatal at this speed. Even against training-time ContinuousRTC, which pays an extra fine-tuning cost, the fine-tuning-free DiscreteRTC still wins by a large margin (65% vs. 35%), empirically reaffirming that discrete diffusion policies are natural asynchronous executors.

Real-world UR5e setup for Dynamic Pick and Place

Figure 6: Real-world setup for Dynamic Pick & Dynamic Place — UR5e + Robotiq gripper, wrist-mounted RGB, turntable target.

Real-world air hockey defend setup

Figure 7: Air Hockey Defend setup — the robot must defend a high-speed puck launched by an opponent policy, demanding both efficient inference and smooth asynchronous execution.

Figure 8: Real-world results across the three dynamic tasks (success rate over 20 trials per task and average inference time on a single RTX 4090). Both Sync baselines completely fail (0%), confirming reactive asynchronous execution is indispensable. DiscreteRTC tops ContinuousRTC with the largest gaps on the continuously moving Dynamic Pick and the high-speed Hockey Defend. Costs move in opposite directions: ContinuousRTC inflates flow-matching inference ($\sim$1.7× $\Pi$GDM overhead), while DiscreteRTC reduces discrete diffusion inference — RTC and discrete diffusion compound rather than conflict.

Dynamic Pick

Sync
Continuous RTC
Discrete RTC (Ours)

Dynamic Place

Sync
Continuous RTC
Discrete RTC (Ours)

Hockey Defend

Sync
Continuous RTC
Discrete RTC (Ours)

Discussion & Future Steps

Conclusions

We presented DiscreteRTC, which exploits the native inpainting capability of discrete diffusion for asynchronous real-time control — eliminating the need for inpainting-specific fine-tuning, heuristic guidance weights, and extra inference cost inherent to flow-matching-based RTC. Simulated and real-world experiments show that DiscreteRTC outperforms ContinuousRTC and other asynchronous baselines, while being simpler to implement, faster at inference, and better at execution — and composes seamlessly with methods such as VLASH for further gains.

Limitations

Each limitation coincides with an active research direction whose progress directly benefits DiscreteRTC:

  1. Naive $k$-bin quantization ignores temporal structure and yields overly long token sequences, motivating compact, temporally aware tokenizers.
  2. Our modularized AR-VLM + discrete diffusion head prevents the backbone from participating in iterative unmasking — a gap unified discrete-diffusion VLAs aim to close.
  3. Traditional max-confidence unmasking does not yet translate the natural schedule into consistent gains.

Future Steps

  1. A time-causally ordered action tokenizer producing temporally ordered, compact token representations.
  2. A unified discrete diffusion VLA in which observation reasoning and action generation share the same backbone.
  3. An appropriate yet principled unmasking strategy — emerging techniques such as AR-block decoding align better with the implicit autoregressive structure induced by RTC.

Citation

@article{wang2026discreterc,
  title         = {DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors},
  author        = {Wang, Pengcheng and Hong, Kaiwen and Peng, Chensheng and
                   Driggs-Campbell, Katherine and Tomizuka, Masayoshi and
                   Xu, Chenfeng and Tang, Chen},
  journal       = {arXiv preprint},
  year          = {2026}
}