II / VI · The capability gap

The capability gap

Open-weight models trail the closed frontier by 5 to 15 points where it matters.

DeepSeek-V4-Pro, the best open-weight model in May 2026, trails GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro by three to six months and five to fifteen points on the benchmarks an SME would actually rely on. The three material gaps compound in agentic chains: a 90% per-step success rate over five steps is 59% end-to-end, and an 80% per-step rate is 33%.
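
The compounding arithmetic is just exponentiation of the per-step success rate. A minimal sketch in plain Python, using the figures quoted above and assuming independent steps with a uniform per-step rate:

    # End-to-end success over an n-step agentic chain, assuming each step
    # succeeds independently with the same per-step probability.
    def end_to_end_success(per_step: float, steps: int) -> float:
        return per_step ** steps

    for per_step in (0.90, 0.80):
        p = end_to_end_success(per_step, steps=5)
        print(f"{per_step:.0%} per step over 5 steps -> {p:.0%} end-to-end")
    # 90% per step over 5 steps -> 59% end-to-end
    # 80% per step over 5 steps -> 33% end-to-end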

Closed frontier vs best open-weight, by benchmark · May 2026

Benchmark                             Closed frontier      Best open-weight     Gap
Material gaps · compound in agentic chains
  Terminal-Bench 2.0 (agentic shell)  82.7 (GPT-5.5)       67.9 (V4-Pro)        14.8 pt
  BFCL v4 (function-calling)          ~95 (frontier)       ~85 (V3.1/V4)        ~10 pt
  SWE-bench Pro (real GitHub issues)  64.3 (Opus 4.7)      58.4 (GLM-4.7)       5.9 pt
Immaterial gaps · solved or near-solved
  Long-context needle (500k+ tokens)  ~95 (frontier)       ~85–92 (V4/Qwen3)    5–10 pt
  GPQA Diamond (graduate reasoning)   94.3 (Gemini 3.1)    90.1 (V4-Pro)        4.2 pt
·The agentic compounding penalty
a 90% per-step success rate over five steps lands at 59% end-to-end; 80% per-step lands at 33%
where the gap is felt
·The lead candidate, open-weight
DeepSeek-V4-Pro · 1.6T total / 49B active MoE · 1M native context · CSA+HCA attention · released 24 April 2026
3–6 months behind
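
On the spec line above, "1.6T total / 49B active" is the usual mixture-of-experts distinction: weight storage scales with the total parameter count, while per-token compute scales with the active count. A rough illustrative sketch (the byte-per-parameter figure assumes 8-bit weights and ignores KV cache and other overheads):

    # Mixture-of-experts sizing, illustrative arithmetic only.
    total_params  = 1.6e12   # 1.6T parameters stored
    active_params = 49e9     # 49B parameters used per token

    print(f"active share per token: {active_params / total_params:.1%}")   # 3.1%
    print(f"weights at 8 bits/param: ~{total_params / 1e12:.1f} TB")       # ~1.6 TB
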
·The closed frontier, May 2026
GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro — the comparators on the rows above
three vendors
·Llama 4, explicitly flagged
Maverick's LMArena variant ≠ released weights; production fidelity unproven; we do not recommend it
excluded
·Where the gap closes
document-grounded retrieval, summarisation, classification, modest code assistance — under 5 pt

Source: On-prem frontier LLM briefing, §1 Capability gap. Benchmark snapshot May 2026; numbers move monthly. Closed-frontier scores cite each row's leader; open-weight scores cite the best open release as of 11 May 2026. SWE-bench Pro and Terminal-Bench 2.0 are the agentic benchmarks an SME would actually rely on; BFCL v4 measures the function-calling fidelity MCP integrations depend on.