II / VI · The capability gap

The capability gap

Open-weight models trail the closed frontier by 5 to 15 points where it matters.

DeepSeek-V4-Pro, the best open-weight model in May 2026, trails GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro by three to six months and five to fifteen points on the benchmarks an SME would actually rely on. The three material gaps compound in agentic chains: a 90% per-step success rate over five steps is 59% end-to-end, and an 80% per-step rate is 33%.
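
The compounding arithmetic is just exponentiation of the per-step success rate. A minimal sketch in plain Python, using the figures quoted above and assuming independent steps with a uniform per-step rate:

    # End-to-end success over an n-step agentic chain, assuming each step
    # succeeds independently with the same per-step probability.
    def end_to_end_success(per_step: float, steps: int) -> float:
        return per_step ** steps

    for per_step in (0.90, 0.80):
        p = end_to_end_success(per_step, steps=5)
        print(f"{per_step:.0%} per step over 5 steps -> {p:.0%} end-to-end")
    # 90% per step over 5 steps -> 59% end-to-end
    # 80% per step over 5 steps -> 33% end-to-end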

Closed frontier vs best open-weight, by benchmark · May 2026

Benchmark                             Closed frontier      Best open-weight     Gap
Material gaps · compound in agentic chains
  Terminal-Bench 2.0 (agentic shell)  82.7 (GPT-5.5)       67.9 (V4-Pro)        14.8 pt
  BFCL v4 (function-calling)          ~95 (frontier)       ~85 (V3.1/V4)        ~10 pt
  SWE-bench Pro (real GitHub issues)  64.3 (Opus 4.7)      58.4 (GLM-4.7)       5.9 pt
Immaterial gaps · solved or near-solved
  Long-context needle (500k+ tokens)  ~95 (frontier)       ~85–92 (V4/Qwen3)    5–10 pt
  GPQA Diamond (graduate reasoning)   94.3 (Gemini 3.1)    90.1 (V4-Pro)        4.2 pt
·The agentic compounding penalty
a 90% per-step success rate over five steps lands at 59% end-to-end; 80% per-step lands at 33%
where the gap is felt
·The lead candidate, open-weight
DeepSeek-V4-Pro · 1.6T total / 49B active MoE · 1M native context · CSA+HCA attention · released 24 April 2026
3–6 months behind
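
On the spec line above, "1.6T total / 49B active" is the usual mixture-of-experts distinction: weight storage scales with the total parameter count, while per-token compute scales with the active count. A rough illustrative sketch (the byte-per-parameter figure assumes 8-bit weights and ignores KV cache and other overheads):

    # Mixture-of-experts sizing, illustrative arithmetic only.
    total_params  = 1.6e12   # 1.6T parameters stored
    active_params = 49e9     # 49B parameters used per token

    print(f"active share per token: {active_params / total_params:.1%}")   # 3.1%
    print(f"weights at 8 bits/param: ~{total_params / 1e12:.1f} TB")       # ~1.6 TB
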
·The closed frontier, May 2026
GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro — the comparators on the rows above
three vendors
·Llama 4, explicitly flagged
Maverick's LMArena variant ≠ released weights; production fidelity unproven; we do not recommend it
excluded
·Where the gap closes
document-grounded retrieval, summarisation, classification, modest code assistance — under 5 pt

Source: On-prem frontier LLM briefing, §1 Capability gap. Benchmark snapshot May 2026; numbers move monthly. Closed-frontier scores cite each row's leader; open-weight scores cite the best open release as of 11 May 2026. SWE-bench Pro and Terminal-Bench 2.0 are the agentic benchmarks an SME would actually rely on; BFCL v4 measures the function-calling fidelity MCP integrations depend on.