The bandwidth floor

273 GB/s. 819 GB/s. 30.7 TB/s. Only one crosses the bar.

Decoding at 120 tok/s for three concurrent users at 500k context is bounded by the memory bandwidth inside a single node, not by the cluster-wide aggregate across nodes. The cluster network cannot route around per-node latency. On that measure the three named stacks differ by one to two orders of magnitude.

Figure: single-node memory bandwidth on a log scale (axis 100 GB/s to 100 TB/s), annotated with sustained tok/s per stream. The 120 tok/s × 3-stream bar sits at ≈ 12 TB/s.

Stack (4–8 unit cluster) · bandwidth · sustained tok/s per stream · verdict · hardware capex · note

4× DGX Spark GB10 (128 GB LPDDR5X each, 200 GbE cluster) · 273 GB/s · 5–25 tok/s · Does not meet · $29k · developer prototype

4× Mac Studio M3 Ultra 512GB (819 GB/s unified, Thunderbolt 5 / 10 GbE cluster) · 819 GB/s · 17–25 tok/s · Does not meet · $72k · SKU withdrawn 5 Mar 2026

8× H200 SXM HGX node (141 GB HBM each, NVLink, 30.7 TB/s achievable aggregate) · 30.7 TB/s · 250–400 tok/s · Meets the bar · $620k · Supermicro / Dell / AU integrator

· Why bandwidth, not compute: decode is memory-bandwidth-bound, so per-token time ≈ active-parameter bytes ÷ the node's aggregate memory bandwidth
· Why the cluster network does not help: the 200 GbE Spark and Thunderbolt 5 Mac fabrics let the model fit across nodes but do not speed up single-stream decode; inter-node hops add latency to each token rather than dividing it
· The H200 headroom: the 120 tok/s bar is met with 2–3× headroom; the binding constraint becomes prefill TTFT (50–100 s cold at 500k), not steady-state decode
· The H100 alternative: 8× H100 SXM (640 GB) is workable but caps at ~430k context for Llama 4 / GQA models; H200 (1.13 TB) is the right node size at 500k
· The Blackwell upgrade path: 8× B200 HGX delivers ~2.5× H100 inference throughput at similar power; supply is constrained through mid-2026
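The headroom and order-of-magnitude claims above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes decode throughput scales linearly with memory bandwidth (the pure bandwidth-bound model), takes the ≈ 12 TB/s bar and the three bandwidth figures from this section as given, and uses helper names (`headroom`, `meets_bar`, `orders_of_magnitude`) that are hypothetical and not part of the briefing.

```python
import math

# Figures copied from this section; the linear-scaling model is an assumption.
BAR_TBPS = 12.0  # approximate bandwidth at which the 120 tok/s x 3-stream bar sits

STACK_BANDWIDTH_TBPS = {
    "4x DGX Spark GB10": 0.273,
    "4x Mac Studio M3 Ultra 512GB": 0.819,
    "8x H200 SXM HGX": 30.7,
}


def headroom(bandwidth_tbps: float, bar_tbps: float = BAR_TBPS) -> float:
    """Ratio of available bandwidth to the bandwidth the bar requires."""
    return bandwidth_tbps / bar_tbps


def meets_bar(bandwidth_tbps: float, bar_tbps: float = BAR_TBPS) -> bool:
    """True when the stack has at least the bandwidth the bar requires."""
    return headroom(bandwidth_tbps, bar_tbps) >= 1.0


def orders_of_magnitude(high_tbps: float, low_tbps: float) -> float:
    """Base-10 log of the bandwidth ratio between two stacks."""
    return math.log10(high_tbps / low_tbps)


if __name__ == "__main__":
    h200 = STACK_BANDWIDTH_TBPS["8x H200 SXM HGX"]
    for name, bw in STACK_BANDWIDTH_TBPS.items():
        print(
            f"{name}: {headroom(bw):.3f}x the bar, "
            f"meets bar: {meets_bar(bw)}, "
            f"gap to H200: {orders_of_magnitude(h200, bw):.2f} orders"
        )
```

Under these assumptions the H200 node lands between 2× and 3× the bar, while the other two stacks sit well below 1×, which matches the verdict column.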

Source: On-prem frontier LLM briefing, §2 Hardware sizing and §3 The three stacks costed. Aggregate H200 bandwidth is derived from 4.8 TB/s HBM × 8 GPUs × ~80% achievable. Mac Studio MLX-lm and DGX Spark llama.cpp throughput numbers are cited verbatim from the published benchmarks referenced in the briefing's source appendix. All figures are in AUD with a $ prefix. USD/AUD planning rate 1.50, RBA spot ≈ 1.38.
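As a quick check on the derived figures, the following sketch reproduces the 30.7 TB/s aggregate and the node HBM capacities quoted above. The per-GPU numbers come from this section; the function names are hypothetical, and the ~80% achievable fraction is the briefing's own approximation rather than a measured value.

```python
HBM_TBPS_PER_H200 = 4.8      # peak HBM bandwidth per GPU, as stated in the source note
GPUS_PER_NODE = 8
ACHIEVABLE_FRACTION = 0.80   # approximate fraction of peak, per the source note


def achievable_node_bandwidth_tbps(
    per_gpu_tbps: float = HBM_TBPS_PER_H200,
    gpus: int = GPUS_PER_NODE,
    fraction: float = ACHIEVABLE_FRACTION,
) -> float:
    """Peak per-GPU bandwidth x GPU count x achievable fraction."""
    return per_gpu_tbps * gpus * fraction


def node_hbm_gb(gb_per_gpu: int, gpus: int = GPUS_PER_NODE) -> int:
    """Total HBM capacity of one node in GB."""
    return gb_per_gpu * gpus


if __name__ == "__main__":
    print(f"H200 node bandwidth: {achievable_node_bandwidth_tbps():.2f} TB/s")
    print(f"H200 node HBM: {node_hbm_gb(141)} GB")
    print(f"H100 node HBM: {node_hbm_gb(80)} GB")
```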

Correction · briefing: The Mac Studio M3 Ultra 512GB unified-memory configuration was withdrawn from the Apple Store on or around 5 March 2026 amid the global DRAM squeeze. As of 11 May 2026 the SKU is available only via secondary-market channels at approximately 10–25% over launch ($16,500–$18,500 against $14,999). The configuration is retained in the costing because it was specified in the brief, but a 4-unit fresh procurement from Apple is no longer a viable path.