IV / VII · Pro-sumer compute, compared

Pro-sumer compute · four off-the-shelf SKUs · AU market · May 2026

$13k. $28k. $72k. $620k. Frontier becomes its own line item.

Four off-the-shelf SKUs, read against an AU SME shopping context: AUD $15k – $80k, a competent geek, and under thirty days to running. The fourth panel, the smallest frontier-class box on the same row, is the ceiling against which the pro-sumer tier reads. The price of frontier capability on-prem is its own line item: roughly twenty-two-fold the pro workstation build and forty-eight-fold the cheapest.

Same model · same benchmark · $ bars indexed to a shared 0 → $700k scale
01 · Workstation · Does not meet
GB202 · 32 GB GDDR7 · PCIe 5.0 ×16
RTX 5090 · consumer GPU, single card

A single 32 GB GDDR7 card in a Threadripper workstation. The cheapest path to an on-prem LLM that runs at all.

Tokens / s · sustained ~12 tok/s. Llama 3.3 70B Q4 does not fit in 32 GB; figure is Q3_K_S with CPU spillover. Quality and context both degrade.
Context · practical 8 – 16 k tok
$ · deployable ≈ $13k* RTX 5090 ~$4.2k + Threadripper workstation ~$7.5k + 10 GbE + rack share.
02 · Workstation, pro · Does not meet
GB202 PRO · 96 GB GDDR7 ECC · PCIe 5.0 ×16
RTX PRO 6000 Blackwell · pro GPU, single card

A single 96 GB ECC card in a Threadripper PRO workstation. Holds the model with room for usable context; single-card bandwidth caps the rest.

Tokens / s · sustained ~38 – 45 tok/s. 1.79 TB/s memory bandwidth ÷ 40 GB active weights (arithmetic sketched after the panels). No headroom for the briefing's 120 tok/s × 3 stream bar.
Context · practical ~128 k tok
$ · deployable ≈ $28k* RTX PRO 6000 ~$15.5k + Threadripper PRO workstation ~$11k + 10 GbE + rack share.
03 · Apple cluster · Does not meet
4 × M3 ULTRA · 4 × 512 GB UNIFIED · TB5 / 10 GbE fabric
M3 Ultra 512 GB cluster · four networked Mac Studios

Four Mac Studios with the model sharded across them by tensor or pipeline parallelism. The largest practical context window of the four; per-node bandwidth still caps decode.

Tokens / s · sustained 17 – 25 tok/s. Per-node bandwidth is 819 GB/s; inter-node Thunderbolt 5 / 10 GbE hops add latency to every token rather than dividing the work.
Context · practical 500k+ tok. 2 TB aggregate unified memory is the standout figure on this row.
$ · deployable ≈ $72k* 4 × Mac Studio at ~$17k each (secondary market; SKU withdrawn from Apple 5 Mar 2026) + 10 GbE + TB5 fabric.
04 · HGX node · Meets the bar
8 × H200 SXM · NVLink · 1.13 TB HBM3e · Supermicro / Dell / AU integrator · 6–8U · ~10 kW
H200 server · single integrated HGX node

One integrated SXM node. 30.7 TB/s of aggregate memory bandwidth clears the briefing's 120 tok/s × 3 stream bar at 500k context. The only candidate that does.

Tokens / s · sustained 250 – 400 tok/s. Aggregate HBM3e bandwidth ÷ active weights; prefill TTFT becomes the binding constraint, not decode.
Context · practical 500k+ tok
$ · deployable ≈ $620k* HGX 8 × H200 node + chassis + 100 GbE + half-rack + cooling load. Indicative AU integrator pricing; NVIDIA does not publish an HGX list price.
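
Every tok/s row above comes from the same first-order estimate: sustained decode is memory-bound, so the ceiling is memory bandwidth divided by the bytes of weights streamed per token. A minimal sketch of that arithmetic follows, using the bandwidth figures quoted on the panels; the efficiency factors are assumptions for this sketch, not briefing numbers, and the 5090 is left out because CPU spillover breaks the simple model.

```python
# Back-of-envelope decode ceiling: tok/s ≈ memory bandwidth ÷ bytes of weights
# streamed per token. Bandwidths are the panel figures; the efficiency factors
# are assumptions for this sketch, not measurements.

ACTIVE_WEIGHTS_GB = 40  # Llama 3.3 70B at Q4_K_M, approx.

candidates = {
    # name: (memory bandwidth in GB/s, assumed sustained-efficiency factor)
    "02  RTX PRO 6000, single card": (1_790, 1.0),   # 1.79 TB/s GDDR7
    "03  M3 Ultra, per node":        (819, 0.9),     # unified memory, before fabric hops
    "04  HGX 8 x H200, aggregate":   (30_700, 0.4),  # sharded weights; kernels run well below peak
}

for name, (bw_gbps, eff) in candidates.items():
    ceiling = eff * bw_gbps / ACTIVE_WEIGHTS_GB  # assumes every token re-reads all active weights once
    print(f"{name:32s} ~{ceiling:4.0f} tok/s")
```

The panel figures sit at or below these ceilings. Panel 01 does not fit the model at all, so its ~12 tok/s is set by PCIe transfers to system RAM rather than by GDDR7 bandwidth.
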
· The benchmark held across all four
Llama 3.3 70B Q4_K_M, single concurrent user, sustained decode after warm KV cache — chosen because it is the open-weight reference for the briefing's frontier-class category
~40 GB at Q4
· Where the 5090 fails
32 GB GDDR7 does not hold the model at Q4; falling to Q3 sacrifices quality and squeezes context — the figure given is the most generous reading
fit · not met
· Where the PRO 6000 stalls
96 GB holds the model with usable context; single-card 1.79 TB/s bandwidth caps decode at ~40 tok/s — a third of the 120 tok/s × 3 stream bar
bar · not met
· Where the Mac cluster fails differently
2 TB unified memory is the largest context surface on the page; per-node bandwidth and inter-node hops still cap decode at 17–25 tok/s. The SKU was withdrawn from Apple 5 March 2026 — secondary market only
bar · not met
· Where the H200 clears
30.7 TB/s of aggregate bandwidth across 1.13 TB of HBM3e meets the bar with 2–3× headroom; prefill TTFT at 500k becomes the binding constraint, not steady-state decode
meets · $620k
· The cost geometry
the smallest frontier-class box costs 22× the pro workstation build and 48× the cheapest; frontier capability on-prem is its own line item, not the next step up
22× · 48×

Source · On-prem frontier LLM briefing §2 Hardware sizing and §3 The three stacks costed, extended to two workstation-tier candidates for the SME shopping context. Throughput from published Llama-3 70B Q4 vLLM and llama.cpp benchmarks for each silicon class; context windows from KV-cache budget at the respective memory size after model weights. AU prices from PLE, Scorptec, Mwave, Apple secondary-market trackers, and the briefing's three quoted HGX integrators (Hyperscalers AU, Broadberry AU, XENON). Workstation chassis costs estimated from current Threadripper / Threadripper PRO platform builds at AU pricing.
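
The context rows follow from the KV-cache budget the source note describes. A sketch of that budget, assuming Llama 3 70B architecture figures (80 layers, 8 KV heads under GQA, head dimension 128) and fp16 K/V entries; the result is a ceiling before the model's native context window, activations, and fragmentation are accounted for.

```python
# KV-cache footprint per token for Llama 3 70B, fp16 K and V entries assumed.
# Context budget = memory left after weights ÷ KV bytes per token.

LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2
WEIGHTS_GB = 40  # Q4_K_M

kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # K + V across all layers
print(f"~{kv_per_token / 1024:.0f} KiB of KV cache per token")  # ~320 KiB

for name, mem_gb in [("RTX PRO 6000", 96), ("4 x M3 Ultra", 2_048), ("8 x H200", 1_130)]:
    tokens = (mem_gb - WEIGHTS_GB) * 1e9 / kv_per_token
    print(f"{name:14s} ~{tokens / 1e3:,.0f}k-token KV budget after weights")
```

That is why the Mac cluster and the H200 node both read 500k+ on the context row, while the PRO 6000 panel stops near the model's 128k window rather than at its larger raw KV budget.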

Scope · the asterisked $ figure · The $ figure on each panel is the deployable-unit cost: silicon plus the workstation or server chassis the silicon must sit inside, plus 10 / 25 / 100 GbE networking sized to the tier, plus half- or full-rack share and the marginal cooling load. It is not the bare chip and not the three-year TCO. Three-year TCO with platform engineering, compliance, and surveillance is the subject of V · Three-year TCO. NVIDIA does not publish HGX list prices and several PRO-tier SKUs are quote-only at the AU integrator tier; estimates are marked accordingly. AUD throughout; $ prefix on all figures. USD/AUD planning rate 1.50, RBA spot ≈ 1.38.
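
For concreteness, one asterisked figure rolled up, using panel 02's quoted parts; the networking and rack-share line items below are assumed placeholders chosen to land on the quoted ≈ $28k, not sourced prices.

```python
# Deployable-unit roll-up for panel 02 (AUD). GPU and workstation prices are the
# panel's quoted figures; the last two line items are assumed placeholders.

panel_02 = {
    "RTX PRO 6000 Blackwell":                  15_500,
    "Threadripper PRO workstation":            11_000,
    "10 GbE NIC + switch port (assumed)":          900,
    "Rack share + marginal cooling (assumed)":     600,
}

print(f"Deployable unit ≈ AUD ${sum(panel_02.values()):,}")  # ≈ $28,000
```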