III / VI · The bandwidth floor

273 GB/s. 819 GB/s. 30.7 TB/s. Only one crosses the bar.

Decode at 120 tok/s with three concurrent users at 500k context is bounded by the memory bandwidth available inside a single node. Bandwidth summed across the cluster does not count, and the cluster network cannot route around per-node latency. On that measure the three named stacks differ by one to two orders of magnitude.

Figure: single-node memory bandwidth (log scale, 100 GB/s to 100 TB/s) · sustained tok/s per stream · the 120 tok/s × 3-stream bar sits at ≈ 12 TB/s

Stack (4–8 unit cluster) · bandwidth · throughput · verdict · capex (hardware) · note
4× DGX Spark (GB10, 128 GB LPDDR5X each, 200 GbE cluster) · 273 GB/s · 5–25 tok/s · Does not meet · AUD 29k · developer prototype
4× Mac Studio M3 Ultra 512GB (819 GB/s unified, Thunderbolt 5 / 10 GbE cluster) · 819 GB/s · 17–25 tok/s · Does not meet · AUD 72k · SKU withdrawn 5 Mar 2026
8× H200 SXM HGX node (141 GB HBM each, NVLink, 30.7 TB/s achievable aggregate) · 30.7 TB/s · 250–400 tok/s · Meets the bar · AUD 620k · Supermicro / Dell / AU integrator
· Why bandwidth, not compute (first principles)
Decode is memory-bandwidth-bound: per-token cost = active-parameter bytes ÷ aggregate HBM bandwidth.
· Why the cluster network does not help (topology bound)
The 200 GbE Spark and Thunderbolt 5 Mac fabrics let the model fit across units; they do not speed up single-stream decode. Inter-node hops add latency to each token rather than dividing it.
· The H200 headroom (meets)
The 120 tok/s bar is met with 2–3× headroom; the binding constraint becomes prefill time to first token (TTFT, 50–100 s cold at 500k), not steady-state decode.
· The H100 alternative (AUD 480–580k hardware)
8× H100 SXM (640 GB) is workable but caps at ~430k context for Llama 4 / GQA models; H200 (1.13 TB) is the right node size at 500k.
· The Blackwell upgrade path (AUD 420–525k GPUs)
8× B200 HGX delivers ~2.5× H100 inference performance at similar power; supply is constrained through mid-2026.
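
The per-token relation in the first bullet can be turned into a rough ceiling calculator. The sketch below is an illustration under stated assumptions, not the briefing's sizing method. It adds a KV-cache read term that the bullet's formula omits, assuming every decode step reads the full cache for each stream and reads the active weights once per batched step. The `ModelShape` class and the values in `toy` are hypothetical placeholders, so the results are not expected to reproduce the tok/s figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions; none of these come from the briefing."""
    active_params: float    # parameters read per decode step
    bytes_per_param: float  # 2.0 for 16-bit weights, 1.0 for 8-bit
    layers: int
    kv_heads: int           # with GQA this is smaller than the query-head count
    head_dim: int
    kv_bytes: float         # bytes per cached K or V element


def kv_bytes_per_token(m: ModelShape) -> float:
    """Bytes of KV cache stored per context token (K and V, every layer)."""
    return 2 * m.layers * m.kv_heads * m.head_dim * m.kv_bytes


def bytes_per_decode_step(m: ModelShape, context_tokens: int, streams: int) -> float:
    """Bytes read from memory to emit one token for each stream in a batch."""
    weights = m.active_params * m.bytes_per_param  # assumed read once per batched step
    kv = streams * context_tokens * kv_bytes_per_token(m)  # assumed read in full
    return weights + kv


def decode_ceiling_tok_s(bandwidth_bytes_s: float, m: ModelShape,
                         context_tokens: int, streams: int) -> float:
    """Upper bound on tok/s per stream if memory bandwidth is the only limit."""
    return bandwidth_bytes_s / bytes_per_decode_step(m, context_tokens, streams)


# Placeholder shape, chosen only to exercise the functions.
toy = ModelShape(active_params=1e9, bytes_per_param=1.0,
                 layers=2, kv_heads=4, head_dim=8, kv_bytes=2.0)

# Bandwidth figures from the chart, in bytes per second (decimal units).
bandwidths = {"DGX Spark": 273e9, "Mac Studio M3 Ultra": 819e9, "8x H200 node": 30.7e12}

ceilings = {name: decode_ceiling_tok_s(bw, toy, context_tokens=500_000, streams=3)
            for name, bw in bandwidths.items()}
```

Because the ceiling is linear in bandwidth, the ratio between stacks does not depend on the placeholder shape; only the absolute values change with the model.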

Source · On-prem frontier LLM briefing, §2 Hardware sizing and §3 The three stacks costed. Aggregate H200 bandwidth derived from 4.8 TB/s HBM × 8 GPUs × ~80% achievable. Mac Studio MLX-lm and DGX Spark llama.cpp throughput numbers cited verbatim from the published benchmarks referenced in the briefing's source appendix.
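
The aggregate-bandwidth derivation in the source note is simple enough to check directly. The short sketch below redoes that arithmetic and compares the result with the two per-unit figures from the chart. It assumes decimal units (1 TB/s = 1000 GB/s), and the variable names are illustrative only.

```python
import math

HBM_PER_GPU_TBS = 4.8   # H200 HBM bandwidth per GPU, from the source note
GPUS_PER_NODE = 8
ACHIEVABLE = 0.80       # the note's "~80% achievable" factor

# Achievable aggregate bandwidth of one 8-GPU node, in TB/s.
h200_node_tbs = HBM_PER_GPU_TBS * GPUS_PER_NODE * ACHIEVABLE

# Per-unit bandwidth of the other two stacks, in GB/s, from the chart.
other_stacks_gbs = {"DGX Spark": 273.0, "Mac Studio M3 Ultra": 819.0}

# How many times more bandwidth the H200 node has than each other stack.
ratios = {name: h200_node_tbs * 1000 / gbs for name, gbs in other_stacks_gbs.items()}

# The same gap expressed in orders of magnitude (log base 10).
orders = {name: math.log10(r) for name, r in ratios.items()}

for name in ratios:
    print(f"{name}: {ratios[name]:.1f}x, {orders[name]:.2f} orders of magnitude")
```

The product comes to about 30.7 TB/s, and the gaps to the other two stacks fall between one and a little over two orders of magnitude.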

Correction · briefing
The Mac Studio M3 Ultra 512GB unified-memory configuration was withdrawn from the Apple Store on or around 5 March 2026 amid the global DRAM squeeze. As of 11 May 2026 the SKU is available only via secondary-market channels at approximately 10–25% over launch (AUD 16,500–18,500 against AUD 14,999). The configuration is retained in the costing because it was specified in the brief, but a 4-unit fresh procurement from Apple is no longer a viable path.
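
The premium quoted in the correction can be checked from the prices it gives. The sketch below is only that arithmetic, with illustrative variable names. Pricing four units at the secondary-market rate is an assumption made here for comparison, since the briefing does not say how its four-unit hardware figure was built.

```python
LAUNCH_AUD = 14_999                 # launch price quoted in the correction
SECONDARY_AUD = (16_500, 18_500)    # secondary-market range quoted in the correction
UNITS = 4                           # cluster size used in the costing

# Premium over launch price, in percent, for the low and high ends of the range.
premium_pct = tuple(round((p - LAUNCH_AUD) / LAUNCH_AUD * 100, 1) for p in SECONDARY_AUD)

# Cost of four units at secondary-market prices (an assumption, see above).
four_unit_aud = tuple(p * UNITS for p in SECONDARY_AUD)

print(premium_pct)     # premium range in percent
print(four_unit_aud)   # four-unit cost range in AUD
```

The premium works out to roughly 10% to 23%, and four units at those prices cost AUD 66,000 to 74,000, a range that contains the AUD 72k hardware figure shown for the Mac Studio stack.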