Autoregressive LLM decoding is inherently sequential: each new token conditions on all previously generated tokens, so generating \(n\) tokens requires \(n\) forward passes through the full model. This makes decoding the dominant latency bottleneck for long-form tasks such as chain-of-thought reasoning and code generation. Speculative decoding alleviates this by having a lightweight draft model propose multiple candidate tokens, which the full target model then verifies in a single batched pass, accepting correct tokens and falling back to standard decoding only at the first rejection.
DFlash advances this paradigm by replacing the conventional autoregressive drafter with a diffusion-based block drafter. While methods like Eagle3 still generate draft tokens one at a time (\(\mathcal{O}(k)\) passes for \(k\) tokens), DFlash uses a compact 4-layer transformer with non-causal attention to predict an entire 16-token block in a single parallel forward pass, reducing draft cost to \(\mathcal{O}(1)\). It reuses the target model’s embedding layer and LM head, requiring no separate vocabulary and enabling training on just 289K samples.
We port DFlash from GPU to TPU within vLLM’s tpu-inference stack, addressing challenges including dual KV cache management, non-causal attention kernel routing, and a critical sequence-length alignment bug whose fix alone nearly doubled performance.
Evaluated on Qwen3-4B across 9 benchmarks spanning math, code, and chat tasks on TPU V5P, our port achieves \(\mathbf{3.01\times}\) standalone speedup and \(\mathbf{2.31\times}\) in the full serving pipeline, reaching \(\mathbf{94.9\%}\) of published GPU draft quality.
Porting DFlash from GPU/PyTorch to TPU/JAX required rethinking every stateful component.
The PyTorch implementation of DFlash leans on mutable state and eager execution: DynamicCache.append(), in-place tensor updates, and Python-level control flow.
JAX is functional: all arrays are immutable and model functions must return new state, driving the architecture of every subsystem below.
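To make the contrast concrete, here is a minimal sketch of the paradigm shift (the function name is ours, not the port's API): where PyTorch would write a cache row in place, the JAX version is a pure function that returns a new array, and the caller must thread that new state back through the model.

```python
import jax.numpy as jnp

# Illustrative sketch: a PyTorch in-place write (cache[pos] = row) becomes
# a functional update that leaves the original array untouched.
def write_row(cache: jnp.ndarray, row: jnp.ndarray, pos: int) -> jnp.ndarray:
    # .at[...].set(...) returns a fresh array; `cache` itself never changes.
    return cache.at[pos].set(row)

cache = jnp.zeros((4, 2))           # pre-allocated buffer
updated = write_row(cache, jnp.ones(2), 1)
```

Under `jax.jit`, XLA typically turns such functional updates into in-place buffer writes, so immutability at the Python level does not imply a copy at runtime.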
DFlash’s draft tokens attend to all \(K\) positions bidirectionally, but the TPU runtime’s default attention (ragged paged attention) is causal-only.
We implemented a separate kernel (dflash_concat_attention) that concatenates context and noise K/V, applies a non-causal mask within the block and a causal mask to the cache, and routes through TPU Pallas flash_attention with causal=False.
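As a reference point, the computation the Pallas kernel performs can be sketched with a plain dot-product version (function and argument names are illustrative, not the dflash_concat_attention API): draft queries attend to the cached context K/V concatenated with the block's own K/V.

```python
import jax
import jax.numpy as jnp

# Non-kernel reference sketch of concat attention. The cached context is
# entirely in the past, so it is trivially causal for every draft query;
# attention within the block is intentionally bidirectional (causal=False),
# so no mask term is needed in this simplified single-head form.
def concat_attention(q, k_ctx, v_ctx, k_blk, v_blk):
    k = jnp.concatenate([k_ctx, k_blk], axis=0)     # (K + B, d)
    v = jnp.concatenate([v_ctx, v_blk], axis=0)
    scores = q @ k.T / jnp.sqrt(q.shape[-1])        # (B, K + B)
    return jax.nn.softmax(scores, axis=-1) @ v

# Tiny shape check: 2 draft queries, 3 cached positions, block of 2.
out = concat_attention(jnp.ones((2, 4)), jnp.zeros((3, 4)), jnp.ones((3, 4)),
                       jnp.ones((2, 4)), jnp.ones((2, 4)))
```

A dot-product reference like this is also what the A/B correctness checks described below compare the Pallas kernel against.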
The target model uses vLLM’s paged KV cache; the draft model uses pre-allocated contiguous JAX arrays per layer, updated immutably via dynamic_update_slice.
This design evolved through three iterations: an initial context-buffer-only approach lost K/V history (\(\tau = 2.38\)); a naïve per-layer cache collapsed when a pytree structure mismatch triggered repeated JIT retracing; and the final version pads the context to the next power of two, capping the number of distinct traced shapes at roughly 12.
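The shape-bucketing trick can be sketched as follows (helper names are ours): because `jax.jit` recompiles for every new input shape, padding the context length to the next power of two means the compiler sees at most \(\log_2(\text{max\_len})\) distinct shapes rather than one trace per length.

```python
import jax.numpy as jnp

# Round a length up to the next power of two, e.g. 5, 6, 7, 8 -> 8, so all
# four lengths share a single compiled trace.
def next_pow2(n: int) -> int:
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

# Zero-pad the (seq_len, head_dim) context along the sequence axis.
def pad_context(ctx: jnp.ndarray) -> jnp.ndarray:
    target = next_pow2(ctx.shape[0])
    return jnp.pad(ctx, ((0, target - ctx.shape[0]), (0, 0)))
```

The padded tail must of course be masked out of attention; only the shape, not the semantics, changes.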
The most impactful discovery: vLLM’s speculative decoding manager passed seq_lens that included unverified draft tokens, inflating the length by 10 to 16 phantom tokens per step.
This silently corrupted the context buffer, KV cache positions, and RoPE embeddings simultaneously.
A four-line fix, extracting the ground-truth accepted count (num_tokens_no_spec), nearly doubled performance (\(\tau\!: 2.49 \to 4.48\), speedup: \(1.30\times \to 2.31\times\)).
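The failure mode can be illustrated with a toy position calculation (the numbers are made up; only `num_tokens_no_spec` is a real vLLM field): because the context buffer, KV cache slots, and RoPE indices are all derived from the sequence length, a length that still counts unverified draft tokens shifts every downstream index.

```python
# The next draft block should occupy positions [seq_len, seq_len + block).
def draft_positions(seq_len: int, block: int) -> list:
    return list(range(seq_len, seq_len + block))

accepted = 100               # ground-truth length, i.e. num_tokens_no_spec
inflated = accepted + 13     # seq_lens as reported, with 13 phantom tokens

buggy = draft_positions(inflated, 4)    # starts at 113: wrong RoPE/cache slots
fixed = draft_positions(accepted, 4)    # starts at 100, as intended
```

Every draft step thus read and wrote the wrong cache slots by up to a full block width, which explains why the symptom was degraded acceptance rather than a crash.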
Three A/B experiments confirmed correctness: toggling between Pallas flash_attention and manual dot-product, switching position schemes, and disabling the KV cache entirely all produced bit-identical outputs.
The remaining gap versus GPU paper numbers (\(\tau = 6.67\) vs. \(7.07\)) is attributable to checkpoint differences, not implementation bugs.
We evaluate DFlash on Qwen3-4B (target) with DFlash-b16 (draft, block size 16) on a single TPU V5P host (4 chips, 8 cores). Benchmarks span 9 datasets across three task categories: math (AIME 2024, AIME 2025, MATH-500, GSM8K), code (HumanEval, MBPP, SWE-Bench), and chat (MT-Bench, Alpaca). All experiments use greedy decoding (temperature = 0) to ensure deterministic, reproducible comparisons.
In standalone mode (raw model decoding without the vLLM serving layer), DFlash achieves an average \(\mathbf{3.01\times}\) speedup over autoregressive decoding, peaking at \(\mathbf{3.72\times}\) on math tasks where token predictability is highest. In the full vLLM serving pipeline, which adds scheduling, paged KV cache management, and rejection-sampling overhead, the speedup is \(\mathbf{2.31\times}\).
Critically, TPU draft quality matches the original GPU implementation: our port achieves \(\mathbf{94.9\%}\) of published GPU \(\tau\) (accepted tokens per draft) on average, and exceeds GPU on MATH-500 (\(8.80\) vs. \(7.84\)). The small remaining gap is attributable to bf16 vs. fp16 numerical precision rather than any architectural limitation.
We have demonstrated that diffusion-based speculative decoding transfers effectively from GPU to TPU, despite fundamental differences in programming models and hardware characteristics. By porting DFlash from PyTorch to JAX within the vLLM serving stack, we addressed three core engineering challenges: implementing non-causal attention through a dedicated Pallas kernel, designing an immutable KV cache architecture compatible with JAX's functional paradigm, and diagnosing a critical sequence-length inflation bug in vLLM's speculative decoding manager.
Our evaluation across 9 benchmarks spanning math, code, and chat tasks shows that the TPU port achieves \(\mathbf{94.9\%}\) of GPU draft quality (\(\tau\)) while delivering \(\mathbf{3.01\times}\) standalone speedup and \(\mathbf{2.31\times}\) end-to-end vLLM pipeline speedup over autoregressive decoding. Notably, DFlash's single-pass block drafting is a natural fit for TPU's architecture: verification cost remains flat as block size grows, unlike autoregressive drafters (e.g., Eagle3) where cost scales linearly with draft length.
The cost analysis further validates TPU as a compelling platform for speculative decoding. At GCP on-demand pricing, TPU V5P with DFlash achieves the lowest cost per million tokens among all hardware and method combinations tested, making it a practical choice for production LLM serving workloads.
The V5P baseline is \(1.69\times\) faster than V4, and absolute DFlash throughput (tokens/s) is higher on V5P across all benchmarks. Measured in dollars per million tokens at GCP on-demand pricing, V5P + DFlash is the most cost-efficient option.
Several directions remain open for further improvement:
@misc{2025capstone_dflash_tpu,
author = {Feng, Aaron and Luo, Zhongyan and Nguyen, Son and Huang, Andy},
title = {Porting DFlash to TPU: Accelerating LLM Inference with Speculative Decoding},
year = {2025},
institution = {UC San Diego},
note = {DSC 180 Capstone Project},
}