A Comprehensive Thesis on the Top 30 Legacy and Frontier AI Models, Their Unique Capabilities, and Their Latest Products
Technology & AI Research Division · June 2026 · 30 Models · 13 Makers · 4 Nations · Frontier to Edge
Executive Summary
This thesis surveys thirty of the most consequential AI models on the market as of June 2026 — the closed-source frontier, the open-weight challengers, edge and enterprise specialists, and the image and video generators that turned creative production into an automated pipeline.
The headline of 2026 is convergence. The four leading US frontier models sit within roughly eight points of one another on the Artificial Analysis Intelligence Index (61.4 for Claude Opus 4.8 down to ≈53 for Grok 4.3), and “which model is best” has become “which model for which job.” Claude Opus 4.8 leads overall and on sustained coding; GPT-5.5 owns creative writing and terminal workflows; Gemini 3.1 Pro leads reasoning, data analysis and multimodality; Grok 4.3 is the budget agentic option with real-time data.
Beneath them, open-weight models from China (DeepSeek, Qwen, Kimi, GLM), the US (Llama, Gemma) and Europe (Mistral) have closed most of the capability gap while collapsing cost. The serious AI stack is no longer a single model but a routed portfolio.
Bottom line
There is no single best model in 2026. Capability has converged at the top; differentiation lives in agentic tooling, ecosystem fit, latency, licensing and price. The winning strategy is a multi-model portfolio with a standing evaluation framework and pinned endpoints.
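A routed portfolio with pinned endpoints can be sketched as a small lookup table. This is a minimal illustration, not a reference design: the task categories, the `Route` type and the budget ceilings are hypothetical, and `opus-placeholder-pinned` is a placeholder to be replaced with a provider's real dated identifier. The two `claude-*` strings are the API strings quoted elsewhere in this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model_id: str   # pinned identifier, never a floating "latest" alias
    max_usd: float  # hypothetical per-task budget ceiling


# Hypothetical task categories mapped to pinned endpoints.
ROUTES = {
    "hard_coding": Route("opus-placeholder-pinned", 2.00),
    "general": Route("claude-sonnet-4-6", 0.25),
    "bulk_extraction": Route("claude-haiku-4-5", 0.02),
}
DEFAULT_TASK = "general"


def pick_route(task_type: str) -> Route:
    """Return the pinned route for a task type, falling back to the default."""
    return ROUTES.get(task_type, ROUTES[DEFAULT_TASK])


if __name__ == "__main__":
    print(pick_route("bulk_extraction").model_id)
    print(pick_route("unknown_task").model_id)
```

Keeping the table in version control means a model swap becomes a reviewed change that can be re-run against the evaluation set before it ships.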
A Brief History: How We Got to Thirty Models
Phase 1 — Transformer foundation (2017–2022). The transformer architecture made internet-scale training practical. GPT-3 showed scale produces general capability; ChatGPT (2022) turned it into a mass-market product and started the race.
Phase 2 — The GPT-4 era (2023–2024). GPT-4 set the bar. Claude established a safety-and-coding track, Google consolidated into Gemini, Meta open-sourced Llama. GPT-4o mainstreamed multimodal voice/vision; Claude 3.5 Sonnet defined the coding assistant.
Phase 3 — The reasoning turn (late 2024–2025). OpenAI’s o1 opened the inference-compute reasoning axis; DeepSeek-R1 proved open weights could match it cheaply, forcing a global re-pricing of frontier reasoning.
Phase 4 — Agentic & convergence (2025–2026). The late-2025 wave shifted the contest from “what can it answer” to “what can it do.” Native computer-use, persistent coding agents and multi-agent orchestration arrived; MCP crossed 97M installs and an Agentic AI Foundation formed under the Linux Foundation. By mid-2026 routing strategy mattered more than model choice.
The compression of time
GPT-3 to ChatGPT took two years; ChatGPT to the agentic frontier took three. Anthropic shipped four Opus versions (4.5–4.8) in six months. The frontier now moves faster than most organisations can re-architect.
Reading the Benchmarks Critically
Every figure here should be read with informed scepticism. Benchmarks are signals, not verdicts.
- Intelligence Index — a composite; good for rough ranking, misleading for a specific workload.
- SWE-bench Verified / Pro — real GitHub issues; the most realistic coding test, but harness-dependent.
- GPQA Diamond — graduate-level science reasoning.
- AIME / MATH-500 / FrontierMath — competition & research maths (FrontierMath v2 was re-released June 2026 after corrections).
- LMSYS Arena (Elo) — human preference; captures “feel” but reflects popularity too.
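One way to read an Elo gap is as a head-to-head win probability. The sketch below uses the classic Elo expected-score formula; whether a given leaderboard fits its ratings in exactly this way is an assumption, so treat the output as an intuition aid rather than the leaderboard's own method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Classic Elo expected score of A against B (0.5 means a coin flip)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    for gap in (0, 25, 100):
        print(gap, round(expected_score(1200 + gap, 1200), 3))
```

Under this formula a 100-point gap corresponds to roughly a 64% expected preference rate, so gaps of a few points between neighbouring models are close to a coin flip.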
The only benchmark that matters
Your own. A reproducible evaluation set from real tasks — measuring quality, latency, cost-per-task and error rate — beats any public leaderboard. Treat published numbers as a shortlist generator, not a decision.
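A reproducible evaluation set of this kind can start very small. The sketch below is a minimal harness under stated assumptions: the model is any callable from prompt to text, quality is scored by exact match, and cost is a flat hypothetical `usd_per_call`. `EvalCase`, `EvalReport`, `run_eval` and `fake_model` are illustrative names, not part of any vendor SDK.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str


@dataclass
class EvalReport:
    accuracy: float
    error_rate: float
    mean_latency_s: float
    cost_per_task_usd: float


def run_eval(model: Callable[[str], str], cases: list[EvalCase],
             usd_per_call: float) -> EvalReport:
    """Score a model callable on exact-match quality, errors, latency and cost."""
    if not cases:
        raise ValueError("evaluation set is empty")
    correct = errors = 0
    total_latency = 0.0
    for case in cases:
        start = time.perf_counter()
        try:
            answer = model(case.prompt)
        except Exception:
            errors += 1
            answer = None
        total_latency += time.perf_counter() - start
        if answer is not None and answer.strip() == case.expected:
            correct += 1
    n = len(cases)
    return EvalReport(correct / n, errors / n, total_latency / n, usd_per_call)


def fake_model(prompt: str) -> str:
    """Stand-in for a real API call; raises on one input to simulate an outage."""
    if "boom" in prompt:
        raise RuntimeError("simulated provider error")
    return prompt.upper()


cases = [EvalCase("ok", "OK"), EvalCase("boom", "BOOM"), EvalCase("hi", "nope")]
report = run_eval(fake_model, cases, usd_per_call=0.01)
print(report)
```

In practice the exact-match check would be replaced by a task-specific scorer, and the flat cost by token-based pricing, but the four reported fields stay the same.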
Part I — The Frontier: Closed-Source Leaders & Their Legacy
01 Claude Opus 4.8
Anthropic · USA · May 2026
New
The model to beat in mid-2026, and the spine of Anthropic’s agentic product line.
Type: Frontier closed-source · Reasoning + agentic
Context: ~1M tokens (input)
Pricing: $5 in / $25 out per 1M tokens (Fast mode $10 / $50)
Capabilities
Currently the #1 overall model on the Artificial Analysis Intelligence Index (61.4). The strongest sustained-coding and long-horizon agentic model available, leading SWE-bench Verified (≈88.6%) and SWE-bench Pro (≈69.2%). Its lead grows the longer and more complex the task, making it the default for multi-file engineering, strategic synthesis and document-heavy work.
What makes it unique
Best-in-class agentic coding paired with Claude Code; pioneered the Model Context Protocol (MCP) now standard across the industry. Prompt caching can cut input cost up to 90%.
Latest products & successors
Powers Claude Code, Claude Cowork, and agentic browser/Excel/PowerPoint tools. Sits beneath Anthropic’s new Mythos tier (Claude Mythos 5 / Fable 5, June 2026), with Opus 4.8 acting as a safety fallback layer.
Considerations & trade-offs
Premium pricing and an optional Fast mode that doubles cost; the very top of the market, so overkill (and over-budget) for routine tasks better served by Sonnet or Haiku.
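The pricing figures quoted for this entry can be turned into a per-request estimate. The sketch below assumes the best case of the "up to 90%" caching claim (cached input billed at 10% of the normal input rate) and ignores any cache-write surcharge; the function name and parameters are illustrative, not a vendor API.

```python
def opus_cost_usd(input_tokens: int, output_tokens: int,
                  cached_fraction: float = 0.0, fast_mode: bool = False,
                  cache_discount: float = 0.90) -> float:
    """Estimate request cost from the list prices quoted in this entry."""
    # USD per 1M tokens: standard $5 in / $25 out, Fast mode $10 / $50.
    in_rate, out_rate = (10.0, 50.0) if fast_mode else (5.0, 25.0)
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    # Assumption: cached tokens are billed at (1 - cache_discount) of the input rate.
    input_cost = (fresh + cached * (1.0 - cache_discount)) * in_rate / 1_000_000
    output_cost = output_tokens * out_rate / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    print(round(opus_cost_usd(200_000, 10_000), 4))                       # no caching
    print(round(opus_cost_usd(200_000, 10_000, cached_fraction=0.8), 4))  # 80% cached
    print(round(opus_cost_usd(200_000, 10_000, fast_mode=True), 4))       # Fast mode
```

For a 200K-token prompt with a 10K-token reply, this works out to about $1.25 uncached, $0.53 with 80% of the prompt cached, and $2.50 in Fast mode.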
02 GPT-5.5
OpenAI · USA · April 2026
New
OpenAI’s do-everything flagship and the broadest consumer-to-enterprise footprint in AI.
Type: Frontier closed-source · Agentic + creative
Context: ~1M tokens in / 128K out
Pricing: Standard tier ≈ $2.50 in / $15 out per 1M (Pro variant far higher)
Capabilities
OpenAI’s current flagship, second on the Intelligence Index (60.2) and effectively tied with Opus 4.8 for top coding performance. Built for agentic and professional work — research, tool use and long-horizon tasks — while staying token-efficient. The leading model for creative writing with a warm, natural tone, and unmatched 128K max output for long-form generation.
What makes it unique
Native computer-use (controls browsers, fills forms, executes workflows); broadest product ecosystem via ChatGPT. Strongest CLI/terminal workflow performance.
Latest products & successors
GPT-5.5 (April 2026) succeeds the Nov-2025 GPT-5.1 wave. GPT-5.6 widely rumoured for late June 2026, reportedly rebuilt around a redesigned reward-audit pipeline after the GPT-5.5 ‘Goblin’ persona-contamination post-mortem.
Considerations & trade-offs
The GPT-5.5 ‘Goblin’ persona-contamination episode showed reward-model fragility; a 5.6 successor is rumoured imminently, so pin versions for production.
03 Gemini 3.1 Pro
Google DeepMind · USA · February 2026
New
The most underrated frontier model — quietly leading reasoning and multimodality.
Type: Frontier closed-source · Multimodal + reasoning
Context: ~1M tokens (price roughly doubles above 200K)
Pricing: Competitive; roughly doubles for >200K-token contexts
Capabilities
Leads the field on reasoning and data analysis, and is consistently strong on multimodal tasks (text, image, audio, video). Frequently underestimated but competitive with the very top tier, and the cheapest of the closed frontier models for short prompts. Third on the Intelligence Index (≈57), behind Claude Opus 4.8 and GPT-5.5.
What makes it unique
Deepest native multimodality and tightest Google ecosystem integration (Search, Workspace, Android). Excellent grounding via live Google Search.
Latest products & successors
Gemini 3.5 Flash shipped at Google I/O (May 2026); Gemini 3.5 Pro promised ‘next month’ as of late May. The Gemini Omni line now powers native video+audio generation.
Considerations & trade-offs
Pricing roughly doubles above 200K-token contexts, which can surprise long-document workloads; ecosystem lock-in to Google for the best experience.
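The long-context surcharge is easy to model before committing a document-heavy workload. This sketch assumes the whole prompt is billed at the doubled rate once it exceeds the 200K threshold; the thesis gives no base rate, so `base_usd_per_million` is a hypothetical input and the billing rule itself should be checked against the provider's price sheet.

```python
def long_context_input_cost(prompt_tokens: int, base_usd_per_million: float,
                            threshold: int = 200_000,
                            multiplier: float = 2.0) -> float:
    """Input cost when prompts above `threshold` tokens are billed at a multiple."""
    over = prompt_tokens > threshold
    rate = base_usd_per_million * (multiplier if over else 1.0)
    return prompt_tokens * rate / 1_000_000


if __name__ == "__main__":
    base = 2.0  # hypothetical USD per 1M input tokens
    for tokens in (150_000, 200_000, 250_000):
        print(tokens, round(long_context_input_cost(tokens, base), 4))
```

With the hypothetical $2.00 base rate, a 200K-token prompt costs $0.40 while a 250K-token prompt costs $1.00, so trimming a document to stay under the threshold can more than halve the input bill.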
04 Grok 4.3
xAI · USA · April 2026
New
The budget frontier option with a real-time data advantage no rival matches.
Type: Frontier closed-source · Real-time + agentic
Context: ~1M tokens
Pricing: ≈ $1.25 in / $2.50 out per 1M tokens
Capabilities
The cheapest of the four US frontier models, with strong agentic and tool-use scores (≈94.1% on agentic-accuracy benchmarks via its ReAct-2 framework). Native video input and top-tier long-chain agent capability. Intelligence Index ≈53.
What makes it unique
Real-time X/web data access through an ‘X-Platform Latent Index’ that surfaces global trends faster than search-based agents; a 4-agent internal system (Grok, Harper, Benjamin, Lucas).
Latest products & successors
Grok 4.3 (April 2026) is the current public flagship; Grok 5 (a rumoured ~6T-parameter model) is targeted for the Q2 2026 window. SuperGrok Video added competitive portrait/character video generation.
Considerations & trade-offs
Trails Opus 4.8 and DeepSeek V4 Pro on pure coding; the X-data tie-in is a strength for trend tasks but irrelevant for many enterprise uses.
05 Claude Sonnet 4.6
Anthropic · USA · February 2026
New
The default ‘most-tasks’ model — near-Opus quality at a working budget.
Type: Workhorse closed-source · Balanced
Context: ~1M tokens (beta)
Pricing: ≈ $3 in per 1M tokens (output higher)
Capabilities
The best value for production coding — near-Opus quality at a fraction of the cost. The default ‘most tasks’ model for marketing, technical docs, integration logic and extended coding, scoring ≈89.3% on GPQA Diamond. Widely used as the balanced tier in multi-model routing stacks.
What makes it unique
Near-frontier quality at ~$3/M; 1M-token context in beta; shares Claude’s agentic tooling and MCP support.
Latest products & successors
Current API string claude-sonnet-4-6. Frequently paired with Opus 4.8 (hard tasks) and Haiku 4.5 (cheap tasks) in production routing.
Considerations & trade-offs
Not the absolute top for the hardest problems; reserve Opus 4.8 for genuinely frontier coding and long-horizon agents.
06 GPT-5.3 Codex
OpenAI · USA · Early 2026
New
A coding-and-agent scalpel rather than a general-purpose chat model.
Type: Specialised closed-source · Code/agent
Context: Large (codebase-scale)
Pricing: Usage-based via OpenAI API
Capabilities
A coding-and-agent-tuned GPT variant optimised for terminal workflows, file editing and debugging. Notable for extremely low first-token latency in benchmark harnesses, making it well-suited to fast interactive coding agents.
What makes it unique
Tuned specifically for the Codex agent backend and CLI/terminal task execution rather than open-ended chat.
Latest products & successors
Runs inside OpenAI’s Codex agent. A ‘gpt-5.6’ routing reference was spotted in the Codex backend in May 2026, hinting at an imminent successor.
Considerations & trade-offs
Narrowly tuned for the Codex/CLI context; less suited to open-ended conversation or creative work.
07 Claude Haiku 4.5
Anthropic · USA · Oct 2025
New
The cheap, fast leg of a routed Claude stack.
Type: Budget closed-source · Fast
Context: Large
Pricing: Budget tier (lowest of the Claude family)
Capabilities
Anthropic’s fast, low-cost tier for high-volume work — classification, extraction, routing and lightweight chat — while retaining Claude’s tool-use and safety behaviour. Competes with GPT-5 nano for budget-tier tasks.
What makes it unique
Low latency and cost with full MCP/tool support; ideal as the cheap leg of a routed multi-model stack.
Latest products & successors
API string claude-haiku-4-5. Remains the current Haiku generation as of mid-2026.
Considerations & trade-offs
Not for hard reasoning or complex coding; a router should escalate difficult tasks to Sonnet or Opus.
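Escalation of this kind is often implemented as a cheapest-first cascade. The sketch below is one possible shape, assuming the caller supplies both the model-calling function and an acceptance check; `answer_with_escalation`, `call_model` and `is_good_enough` are hypothetical names, and `opus-placeholder-pinned` stands in for a real pinned Opus identifier.

```python
from typing import Callable, Tuple

# Cheapest tier first. The first two strings are the API strings quoted in
# this thesis; the third is a placeholder, not a real identifier.
TIERS = ["claude-haiku-4-5", "claude-sonnet-4-6", "opus-placeholder-pinned"]


def answer_with_escalation(
    prompt: str,
    call_model: Callable[[str, str], str],
    is_good_enough: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try tiers cheapest-first; return (model_id, answer) from the first that passes."""
    result = ("", "")
    for model_id in TIERS:
        answer = call_model(model_id, prompt)
        result = (model_id, answer)
        if is_good_enough(answer):
            return result
    # Nothing passed the check: return the top tier's attempt for human review.
    return result
```

The acceptance check carries most of the design weight: a schema validation or unit test makes a cheap, reliable gate, whereas a vague quality heuristic tends to escalate either everything or nothing.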
08 Gemini 2.5 Flash
Google DeepMind · USA · 2025–26
Legacy (still widely used)
The price/throughput champion that still handles most routine work.
Type: Budget closed-source · High-throughput
Context: Large
Pricing: ≈ $0.30/M tokens
Capabilities
Consistently offers the best capability-to-price ratio among hosted models — often the cheapest strong model available (≈$0.30/M). The default for high-volume summarisation, Slack digests and standard-turnaround tasks at scale.
What makes it unique
Extremely cheap, fast, and good enough for the majority of routine production tasks; deep Google integration.
Latest products & successors
Superseded at the quality frontier by Gemini 3.5 Flash (May 2026) but retained for its price/throughput; still a routing favourite.
Considerations & trade-offs
Behind the 2026 frontier on hard tasks; superseded in quality by Gemini 3.5 Flash but kept for cost.
09 GPT-4o
OpenAI · USA · 2024
Legacy
The model that taught the public what an AI assistant feels like.
Type: Legacy flagship · Multimodal
Context: 128K tokens
Pricing: Mid-tier legacy pricing
Capabilities
The model that mainstreamed real-time multimodal interaction (text, vision, voice) and defined the assistant experience for over a year. Still a capable, widely-integrated general model, though now well behind the 2026 frontier on reasoning and coding.
What makes it unique
Fast omni-modal voice/vision; enormous installed base and tooling. A reference point for ‘GPT-4-class’ as a capability tier.
Latest products & successors
Superseded by GPT-5.x. Remains available via API (gpt-4o, gpt-4o-mini) for cost-sensitive or compatibility-bound workloads.
Considerations & trade-offs
Well behind 2026 frontier on reasoning and coding; retained mainly for cost, compatibility and its huge installed base.
10 Claude 3.5 Sonnet
Anthropic · USA · 2024
Legacy
The coding assistant that defined an era.
Type: Legacy flagship · Coding
Context: 200K tokens
Pricing: Legacy pricing
Capabilities
A landmark 2024 model that set the standard for coding assistants of its era (≈49% on SWE-bench Verified after its October 2024 refresh, a leading score at the time) and popularised Artifacts-style structured output. Historically important as the model that established Claude’s coding reputation.
What makes it unique
Introduced computer-use as a public beta; strong instruction-following and ‘feel’ that defined a generation of coding tools.
Latest products & successors
Long superseded by the Claude 4.x line, but still referenced as a capability baseline and available for legacy integrations.
Considerations & trade-offs
Long superseded; relevant now as a capability baseline and for legacy integrations, not new builds.
11 OpenAI o1
OpenAI · USA · Late 2024
Legacy
The model that created the reasoning-model category.
Type: Legacy reasoning · Chain-of-thought
Context: 128K tokens
Pricing: Legacy reasoning-tier pricing
Capabilities
The model that launched the dedicated ‘reasoning model’ category — trading speed for accuracy by spending inference compute on hidden chain-of-thought. Established the template (o1 → o3 → o4-mini) that every major lab has since adopted.
What makes it unique
First mainstream test-time-compute reasoning model; excelled at hard maths, logic and structured analysis relative to its generation.
Latest products & successors
Superseded by the o3/o4 series and folded into the GPT-5 reasoning stack, but historically pivotal.
Considerations & trade-offs
Folded into the GPT-5 reasoning stack; historically pivotal but not a current production choice.
12 Claude Opus 4.5
Anthropic · USA · Nov 2025
Recent legacy
The November-2025 flagship that proved autonomous long-running coding at scale.
Type: Recent-legacy flagship · Coding
Context: ~1M tokens
Pricing: Premium tier (now lower than 4.8)
Capabilities
The November-2025 flagship that topped SWE-bench Verified at 80.9% and anchored the late-2025 frontier wave alongside GPT-5.1, Grok 4.1 and Gemini 3 Pro. Directly preceded the 4.6/4.7/4.8 rapid-iteration cycle.
What makes it unique
Demonstrated the step-change in autonomous, long-running coding that Opus 4.8 later extended; strong agentic reliability.
Latest products & successors
Superseded within months by Opus 4.6 (Feb), 4.7 (April) and 4.8 (May 2026) — illustrating 2026’s compressed release cadence.
Considerations & trade-offs
Superseded within months by 4.6/4.7/4.8; choose 4.8 unless cost or pinning dictates otherwise.
Part II — The Open-Weight Insurgency & Chinese Frontier
13 DeepSeek V4 Pro
DeepSeek · China · April 2026
New
The open-weight model that re-priced frontier reasoning — again.
Type: Open-weight frontier · Reasoning + coding
Context: 1M tokens (native)
Pricing: List ≈ $1.74 in / $3.48 out per 1M; self-hosted: no per-token fee (hardware and operations costs still apply)
Capabilities
The most capable open-weight model for reasoning and mathematics. A 1.6T-parameter MoE (≈42–49B active per token) scoring ≈80.6% on SWE-bench Verified — about 7–8 points above Grok 4.3 on coding. Disrupts the ‘intelligence-per-dollar’ curve with aggressive pricing.
What makes it unique
MIT licence (fully self-hostable); hybrid Compressed/Heavily-Compressed Attention cuts FLOPs to ~27% and KV-cache to ~10% vs V3.2 at 1M context. Three reasoning modes: Non-think, Think High, Think Max.
Latest products & successors
DeepSeek V4 (Pro + Flash) public preview, April 2026. Legacy deepseek-chat / deepseek-reasoner endpoints retire 24 July 2026 — migrate to V4.
Considerations & trade-offs
As a Chinese-origin model, cloud use raises data-governance questions; self-host the open weights for sensitive work. Legacy endpoints retire 24 July 2026.
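A retirement date like this is worth encoding as a check rather than a calendar reminder. The sketch below flags endpoints in use that retire within a warning window; the two entries come from the date stated in this entry, while `retiring_soon` and the 60-day default are illustrative choices.

```python
from datetime import date

# Retirement dates as stated in this entry.
RETIREMENTS = {
    "deepseek-chat": date(2026, 7, 24),
    "deepseek-reasoner": date(2026, 7, 24),
}


def retiring_soon(endpoints_in_use: list[str], today: date,
                  warn_days: int = 60) -> dict[str, int]:
    """Return {endpoint: days_left} for endpoints retiring within `warn_days`."""
    flagged = {}
    for name in endpoints_in_use:
        retire_on = RETIREMENTS.get(name)
        if retire_on is None:
            continue  # no known retirement date for this endpoint
        days_left = (retire_on - today).days
        if days_left <= warn_days:
            flagged[name] = days_left
    return flagged


if __name__ == "__main__":
    print(retiring_soon(["deepseek-chat", "some-other-model"], date(2026, 6, 15)))
```

Run on 15 June 2026, the check reports 39 days left for `deepseek-chat`; wiring it into CI turns a missed migration into a failing build instead of a production outage.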
14 DeepSeek V4 Flash
DeepSeek · China · April 2026
New
The cheapest credible 1M-context API in its class.
Type: Open-weight · Cost-efficient long-context
Context: 1M tokens
Pricing: ≈ $0.14 in / $0.28 out per 1M tokens
Capabilities
A 284B-total / 13B-active MoE delivering frontier-adjacent quality at one of the lowest costs for 1M-context throughput. The pragmatic choice when the dominant constraint is cheap, long-context volume.
What makes it unique
Best cost-sensitive long-context API in its class; open weights for private deployment.
Latest products & successors
Released alongside V4 Pro (April 2026); listed at ≈$0.14/M input, $0.28/M output.
Considerations & trade-offs
Lower ceiling than V4 Pro; best for high-volume long-context throughput rather than the hardest reasoning.
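The total/active parameter split quoted for the two V4 models is what makes mixture-of-experts pricing possible: per-token compute tracks the active parameters, while memory still has to hold all of them. The sketch below simply computes the active fraction from the figures in these entries; the function and table names are illustrative.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token in an MoE model (both in billions)."""
    return active_b / total_b


# (label, total parameters in billions, active parameters in billions),
# taken from the figures quoted in the V4 Pro and V4 Flash entries.
MOE_FIGURES = [
    ("DeepSeek V4 Pro (low estimate)", 1600, 42),
    ("DeepSeek V4 Pro (high estimate)", 1600, 49),
    ("DeepSeek V4 Flash", 284, 13),
]

if __name__ == "__main__":
    for label, total_b, active_b in MOE_FIGURES:
        print(f"{label}: {active_fraction(total_b, active_b):.1%} active per token")
```

On these figures V4 Pro activates roughly 2.6 to 3.1% of its weights per token and V4 Flash about 4.6%, which is why serving cost can sit far below what the headline parameter count suggests.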
15 Qwen 3.5 (397B-A17B)
Alibaba · China · 2026
New
The download leader and the most versatile open family.
Type: Open-weight frontier · Multilingual
Context: 262K native (extendable to 1M+)
Pricing: Free to self-host; low hosted-API pricing
Capabilities
Alibaba’s flagship open model and the download leader — Qwen captured over 50% of global open-source model downloads by April 2026. Competitive with GPT-5.5 on reasoning, with multimodal reasoning across text/image/video/documents and coverage of 200+ languages.
What makes it unique
Apache-2.0 licence; the broadest fine-tune ecosystem among Chinese models; specialised Qwen-Coder and Qwen-VL variants.
Latest products & successors
Qwen 3.5 / 3.6 / 3.7 line plus Qwen3-Coder-Next (80B-total, 3B-active) for efficient self-hosted coding agents.
Considerations & trade-offs
Quality varies sharply across the many variants; ‘Coder’-named models didn’t always top independent coding harnesses — test the specific checkpoint.
16 Llama 4 Scout
Meta · USA · 2025–26
New
The context-window king, by a factor no rival approaches.
Type: Open-weight · Ultra-long-context
Context: 10M tokens (class-leading)
Pricing: Free (self-host); Meta custom licence with 700M-MAU cap + EU restrictions
Capabilities
The undisputed context-window champion at 10M tokens — nothing else comes close — enabling whole-codebase or whole-library reasoning in a single pass. Also one of the highest-throughput open models (≈2,600 t/s in benchmarks).
What makes it unique
10M-token context; open weights for self-hosting and fine-tuning; massive community tooling and the largest fine-tune ecosystem.
Latest products & successors
Part of the Llama 4 family (Scout / Maverick / Behemoth). Llama 5 is reported to be in development, and the family retains the largest community ecosystem.
Considerations & trade-offs
Meta’s custom licence caps use at 700M MAU and restricts the EU; raw context size doesn’t guarantee top reasoning quality.
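Whether a codebase actually fits in a 10M-token window can be estimated before any API call. The sketch below uses a rough four-characters-per-token heuristic, which is an assumption that varies by tokenizer and language; `estimate_tokens`, `fits_in_context` and the reserve size are illustrative.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real counts depend on the tokenizer."""
    return int(len(text) / chars_per_token)


def fits_in_context(texts: list[str], context_tokens: int = 10_000_000,
                    reserve_tokens: int = 100_000) -> tuple[int, bool]:
    """Return (estimated_tokens, fits) leaving `reserve_tokens` for the reply."""
    total = sum(estimate_tokens(t) for t in texts)
    return total, total <= context_tokens - reserve_tokens


if __name__ == "__main__":
    files = ["def main():\n    pass\n" * 1000, "# README\n" * 500]
    print(fits_in_context(files))
```

A fit in the window says nothing about answer quality at that length, so the estimate is a feasibility gate that should be followed by a retrieval-accuracy test on real files.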
17 Llama 4 Maverick
Meta · USA · 2025–26
New
Meta’s balanced open flagship and the enterprise default.
Type: Open-weight · General-purpose flagship
Context: Very large
Pricing: Free (self-host); Meta custom licence
Capabilities
Meta’s general-purpose open flagship and the highest-MMLU open model in several comparisons (≈85.5%). The go-to open choice for balanced chat, reasoning and agentic features when self-hosting or avoiding vendor lock-in.
What makes it unique
Strong all-round open performance with Meta’s backing; agentic capabilities competitive with proprietary models for many tasks.
Latest products & successors
Current Llama 4 flagship alongside Scout and the larger Behemoth; anchors the most-deployed open-weight family in enterprise.
Considerations & trade-offs
Custom-licence constraints apply; for the very hardest tasks, closed frontier models still lead.
18 Kimi K2.6
Moonshot AI · China · 2026
New
The most disciplined Chinese coding family and a multimodal agentic standout.
Type: Open-weight · Multimodal agentic
Context: 256K tokens
Pricing: Low; open weights available
Capabilities
A native multimodal (text/image/video) agentic model aimed at long-horizon coding, design and autonomous execution. The most disciplined Chinese coding family in independent tests — the only Chinese model to reach ‘Tier A’ with no caveats (≈87/100 vs Opus 4.7’s 97, at ~3.6× lower cost).
What makes it unique
Model-card claim of up to 300 sub-agents coordinating ~4,000 steps; strong cross-lingual code-switching; open weights.
Latest products & successors
Kimi K2.6 (multimodal) follows K2 / K2.5 / K2 Thinking; a top-5 LMSYS Arena presence in 2026.
Considerations & trade-offs
Bold autonomous-orchestration claims (300 sub-agents) need independent verification; still a measurable gap to Opus on secondary code-quality dimensions.
19 GLM-5.2
Z.ai (Zhipu / Tsinghua) · China · 2026
New
An open coding specialist that surprised the industry on SWE-bench.
Type: Open-weight · Long-horizon coding
Context: 1M tokens
Pricing: Low; open weights (MIT)
Capabilities
Positioned around long-horizon coding with strong vendor-reported agentic-coding benchmarks; GLM-5 scores ≈77.8% on SWE-bench Verified and has surprised the industry with coding performance approaching Claude Opus on some tests.
What makes it unique
MIT licence; 1M-token context tuned for sustained coding sessions.
Latest products & successors
GLM-5 / 5.1 / 5.2 line from Z.ai; a leading open coding family in 2026.
Considerations & trade-offs
Independent harnesses caught real bugs (invented DSLs, history loss); vendor benchmarks ran ahead of some real-world results — verify on your stack.
20 Mistral Large 3
Mistral AI · France/EU · 2026
New
The European sovereign-AI choice with a now-permissive licence.
Type: Open-weight · European frontier
Context: Large
Pricing: Free (self-host, Apache-2.0); hosted API available
Capabilities
The leading European open-weight generalist, with strong multilingual support (80+ languages) and enterprise deployment paths. The default ‘sovereign’ choice for organisations needing an EU-based, vendor-backed open model.
What makes it unique
Now Apache-2.0 (a major shift from Mistral’s earlier restrictive licensing); strong commercial-relationship and on-prem story.
Latest products & successors
Mistral’s 2026 open line spans Large 3, Medium 3.5 and Small 4 — covering generalist, multimodal and efficient agentic use.
Considerations & trade-offs
Not the outright benchmark leader; chosen for EU residency, multilingual strength and on-prem control as much as raw capability.
21 Mistral Small 4
Mistral AI · France/EU · March 2026
New
A pocket-sized agentic coder for local and edge deployment.
Type: Open-weight · Efficient agentic coding
Context: 256K tokens
Pricing: Free (self-host, Apache-2.0)
Capabilities
A compact ~6B-active model that folds in Devstral-style agentic coding capability, built for efficient local and hosted coding agents where footprint and latency matter more than maximum benchmark score.
What makes it unique
Apache-2.0; small enough to serve cheaply while retaining agentic tool-use; strong multilingual coverage.
Latest products & successors
Released March 2026 as part of Mistral’s refreshed open lineup.
Considerations & trade-offs
Small size caps maximum capability; a tool for efficiency-constrained agents, not frontier reasoning.
22 DeepSeek R1
DeepSeek · China · 2025 (updated 2026)
Legacy (landmark)
The landmark that made affordable, private, frontier-adjacent reasoning real.
Type: Open-weight legacy · Reasoning
Context: 128K tokens
Pricing: Free (self-host); ~10× cheaper than GPT-4o-class APIs historically
Capabilities
The model that proved open-weight reasoning could rival proprietary systems — a 671B MoE (≈37B active) scoring ≈97.3% on MATH-500. Hugely influential: it forced a global re-pricing of frontier-grade reasoning and remains a self-hosting favourite for maths and structured analysis.
What makes it unique
MIT licence; <think>-tag chain-of-thought; the reference point for affordable, private, frontier-adjacent reasoning.
Latest products & successors
Now succeeded by the V4 line and DeepSeek V3.2-Speciale (gold-medal-level results on IMO/IOI/ICPC 2025 problems), but R1 remains widely deployed via Ollama/vLLM.
Considerations & trade-offs
Now succeeded by V4 and V3.2-Speciale; still excellent for self-hosted maths/analysis but no longer the open ceiling.
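Self-hosted deployments usually need to separate the `<think>` reasoning from the final answer before showing output to users or logging it. The sketch below is a minimal parser that assumes well-formed, closed `<think>...</think>` blocks in the raw text; `split_reasoning` is an illustrative helper, and an unclosed tag is left in the answer untouched.

```python
import re

# Non-greedy match so several separate <think> blocks are handled individually.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) from text that may contain <think> blocks."""
    reasoning = "\n".join(block.strip() for block in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>2 + 2 is 4</think>\nThe answer is 4."
    print(split_reasoning(raw))
```

Keeping the reasoning in a separate field makes it straightforward to store it for debugging while excluding it from user-facing responses and from the context of later turns.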
23 Qwen 3 235B-A22B
Alibaba · China · 2025–26
Legacy
A widely-deployed Qwen baseline with strong maths/coding siblings.
Type: Open-weight legacy · Reasoning/coding
Context: 128K tokens
Pricing: Free (self-host)
Capabilities
A widely-deployed earlier Qwen flagship that outperformed DeepSeek-R1 on several benchmarks (Arena-Hard, LiveBench, LiveCodeBench) and scored ≈85.7% on AIME ’24. Its Qwen-Math and Qwen-Coder siblings remain popular specialised open models.
What makes it unique
Apache-2.0; strong long-context comprehension and specialised maths/coding variants; broad ecosystem.
Latest products & successors
Superseded by the Qwen 3.5/3.6/3.7 generation but still a common self-hosted baseline.
Considerations & trade-offs
Superseded by the 3.5+ generation; kept as a stable, well-understood self-hosted reference.
24 Llama 3.3 70B
Meta · USA · 2024–25
Legacy
The most-pulled open workhorse of the 2024–25 era.
Type: Open-weight legacy · General workhorse
Context: 128K tokens
Pricing: Free (self-host); Meta custom licence
Capabilities
The default recommendation for most local-deployment scenarios through 2025 — RAG systems, chatbots, code assistance and fine-tuning — backed by the largest open ecosystem (most integrations, tutorials and ready-made solutions). Still a high-throughput workhorse (≈2,500 t/s).
What makes it unique
Enormous community tooling; 128K context; reliable, well-understood behaviour for production RAG.
Latest products & successors
Superseded by Llama 4 at the frontier but remains one of the most-pulled models on Ollama in 2026.
Considerations & trade-offs
Behind Llama 4 at the frontier; ideal for proven RAG/chatbot workloads where stability beats peak capability.
Part III — Edge & Enterprise Specialists
25 Gemma 4 (31B / 26B-A4B)
Google · USA · April 2026
New
Google’s most genuinely open release — frontier-adjacent reasoning on one GPU.
Type: Open-weight · Edge & on-device
Context: 256K tokens
Pricing: Free (self-host, Apache-2.0)
Capabilities
Google’s most genuinely open release to date: a 31B dense model delivering top-tier reasoning on a single H100, plus a 26B-A4B MoE giving near-4B serving cost at much higher quality. Native function-calling and 100+-language coverage make it a strong edge and on-device choice.
What makes it unique
Apache-2.0; purpose-built for on-device/edge (the Gemma 3 4B variant runs in ~4.2 GB RAM); FunctionGemma 270M targets IoT function-calling.
Latest products & successors
Gemma 4 released April 2026 under Apache-2.0, extending Google’s edge-AI push.
Considerations & trade-offs
Not a frontier model in absolute terms; its value is serving cost and edge deployment, not topping leaderboards.
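The single-GPU and on-device claims in this entry come down to weight memory, which is roughly parameter count times bits per weight. The sketch below is back-of-envelope arithmetic under stated assumptions: it counts weights only, then applies a hypothetical 20% overhead for KV cache and activations; the function names and the overhead figure are illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8


def fits_on_device(params_billion: float, bits_per_weight: int,
                   device_gb: float, overhead_fraction: float = 0.2) -> bool:
    """Check fit with a hypothetical allowance for KV cache and activations."""
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= device_gb


if __name__ == "__main__":
    print(weight_memory_gb(31, 16))    # 31B dense at 16-bit
    print(weight_memory_gb(26, 4))     # 26B MoE at 4-bit (all experts resident)
    print(fits_on_device(31, 16, 80))  # against an 80 GB accelerator
```

At 16-bit precision the 31B dense model needs about 62 GB for weights, which is consistent with fitting on one 80 GB accelerator such as the H100. Note that the 26B-A4B MoE saves compute per token, not memory, since all experts must stay loaded.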
26 Command A
Cohere · Canada · 2025–26
New
The grounded-RAG specialist built to cite rather than confabulate.
Type: Open-weight · Enterprise RAG
Context: 256K tokens
Pricing: Enterprise API; some weights under CC-BY-NC
Capabilities
Cohere’s enterprise-focused model optimised for retrieval-augmented generation and grounding — built to cite sources and minimise hallucination in business-document workflows. The specialist pick for grounded, tool-using enterprise RAG.
What makes it unique
Grounding-optimised with strong citation behaviour; Cohere also ships multilingual Aya models (Tiny Aya covers 70+ languages at 3.35B params).
Latest products & successors
Part of Cohere’s 2026 enterprise stack; check the CC-BY-NC licence terms before commercial deployment.
Considerations & trade-offs
Some weights are CC-BY-NC — check licensing before commercial use; a focused enterprise tool, not a general chat champion.
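Citation behaviour is only useful if it is checked. The sketch below validates that every bracketed source marker in an answer points at a document that was actually retrieved; the `[n]` marker format and the `check_citations` helper are assumptions for illustration and do not reflect Cohere's actual response schema.

```python
import re

CITE_RE = re.compile(r"\[(\d+)\]")


def check_citations(answer: str, num_sources: int) -> dict:
    """Classify [n] markers as valid (1..num_sources) or invalid."""
    cited = {int(n) for n in CITE_RE.findall(answer)}
    valid = {n for n in cited if 1 <= n <= num_sources}
    return {
        "cited": sorted(valid),
        "invalid": sorted(cited - valid),
        "uncited_answer": not cited,  # True when the answer cites nothing at all
    }


if __name__ == "__main__":
    print(check_citations("Revenue rose 4% [1][3].", num_sources=2))
```

A marker that points outside the retrieved set is a cheap, automatic hallucination signal, and an answer with no markers at all can be routed to a stricter prompt or to human review.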
Part IV — The Image-Generation Frontier
27 Midjourney v7
Midjourney · USA · 2026
New (long lineage)
The enduring benchmark for artistic and editorial image quality.
Type: Image generation · Artistic
Context: n/a (text-to-image)
Pricing: Subscription tiers
Capabilities
Remains the gold standard for editorial and artistic image quality — particularly fashion, architecture and concept art. Prized for aesthetic coherence and ‘taste’ rather than raw photorealism, and now adds image-to-video animation (v8.1 animates a still into four short clips).
What makes it unique
Distinctive artistic house-style and community workflow; commercially-licensed outputs; the reference for concept-art quality.
Latest products & successors
Midjourney v7 leads image quality; v8.1 adds 5-second 480p/720p image-to-video.
Considerations & trade-offs
Less about photorealism than ‘taste’; subscription-gated workflow and a distinctive house style that suits some brands more than others.
28 Nano Banana Pro
Google · USA · 2026
New
The community favourite for stylised work inside the Google ecosystem.
Type: Image generation · Stylised + ecosystem
Context: n/a (text-to-image)
Pricing: Usage-based via Google
Capabilities
The creative community’s favourite for artistic and stylised outputs and the strongest pick for Google-ecosystem image work. Competes with ChatGPT Images 2.0 (photorealism), Claude Design (brand consistency) and Adobe Firefly across the 2026 image stack.
What makes it unique
Strong stylised control with deep Google integration; SynthID watermarking is embedded by default (survives cropping/compression).
Latest products & successors
Nano Banana Pro / Nano Banana 2 available across Google and third-party hubs (Flux AI, ElevenLabs platform).
Considerations & trade-offs
Best results lean on Google integration; for pure photorealism, ChatGPT Images 2.0 often leads.
Part V — The Video-Generation Frontier
29 Google Veo 3.1
Google DeepMind · USA · 2026
New
The cinematic video leader — and the first to make native audio standard.
Type: Video generation · Cinematic + audio
Context: n/a (text/image-to-video)
Pricing: Enterprise / Google Cloud; credit-based hubs
Capabilities
The most capable AI video generator in 2026 for photorealistic, cinematic scenes — best-in-class temporal consistency, prompt adherence, 4K landscape/portrait output and, crucially, integrated native audio that eliminates a whole post-production step.
What makes it unique
Native synchronised audio (analyses motion vectors to generate matching sound); ‘Ingredients-to-Video’ multi-reference control; SynthID watermarking standard.
Latest products & successors
Veo 3.1 (Quality / Fast / Lite) leads the field; the Gemini Omni line extends native video+audio generation across Google products.
Considerations & trade-offs
Enterprise/Google-Cloud-gated access; like all 2026 video, still struggles with hands, lip-sync and narrative coherence beyond ~60–90 seconds.
30 Seedance 2.0
ByteDance · China · 2026
New
The storytelling video specialist for multi-character, multi-shot scenes.
Type: Video generation · Storytelling + character
Context: n/a (text/image-to-video)
Pricing: Credit-based; competitive
Capabilities
The most capable model for cinematic storytelling with multiple characters and scene transitions, and a frequent ‘safe default’ for high-quality, affordable text/image-to-video. Especially strong at consistent character movement, drawing on ByteDance’s short-form video expertise. Competes with Kling 3.0, Sora 2 and Runway Gen-4.5.
What makes it unique
Multi-character scene transitions with native audio; integrated into unified creative platforms (e.g. ElevenLabs) alongside Veo, Sora and Kling.
Latest products & successors
Seedance 2.0 (and 2.0 Fast) released 2026. Note: OpenAI’s competing Sora web/app is being discontinued (app 26 Apr 2026; API 24 Sept 2026), reshaping the video field.
Considerations & trade-offs
Chinese-origin model with the usual governance considerations; quality ceiling sits just below Veo 3.1/Kling O3 for hero shots.
Comparative Landscape
All thirty models at a glance. New = current generation; Legacy = superseded but still deployed.
| # | Model | Maker · Origin | Status | Best at |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | Anthropic · USA | New | Reasoning + agentic |
| 2 | GPT-5.5 | OpenAI · USA | New | Agentic + creative |
| 3 | Gemini 3.1 Pro | Google DeepMind · USA | New | Multimodal + reasoning |
| 4 | Grok 4.3 | xAI · USA | New | Real-time + agentic |
| 5 | Claude Sonnet 4.6 | Anthropic · USA | New | Balanced |
| 6 | GPT-5.3 Codex | OpenAI · USA | New | Code/agent |
| 7 | Claude Haiku 4.5 | Anthropic · USA | New | Fast |
| 8 | Gemini 2.5 Flash | Google DeepMind · USA | Legacy (still widely used) | High-throughput |
| 9 | GPT-4o | OpenAI · USA | Legacy | Multimodal |
| 10 | Claude 3.5 Sonnet | Anthropic · USA | Legacy | Coding |
| 11 | OpenAI o1 | OpenAI · USA | Legacy | Chain-of-thought |
| 12 | Claude Opus 4.5 | Anthropic · USA | Recent legacy | Coding |
| 13 | DeepSeek V4 Pro | DeepSeek · China | New | Reasoning + coding |
| 14 | DeepSeek V4 Flash | DeepSeek · China | New | Cost-efficient long-context |
| 15 | Qwen 3.5 (397B-A17B) | Alibaba · China | New | Multilingual |
| 16 | Llama 4 Scout | Meta · USA | New | Ultra-long-context |
| 17 | Llama 4 Maverick | Meta · USA | New | General-purpose flagship |
| 18 | Kimi K2.6 | Moonshot AI · China | New | Multimodal agentic |
| 19 | GLM-5.2 | Z.ai (Zhipu / Tsinghua) · China | New | Long-horizon coding |
| 20 | Mistral Large 3 | Mistral AI · France/EU | New | European frontier |
| 21 | Mistral Small 4 | Mistral AI · France/EU | New | Efficient agentic coding |
| 22 | DeepSeek R1 | DeepSeek · China | Legacy (landmark) | Reasoning |
| 23 | Qwen 3 235B-A22B | Alibaba · China | Legacy | Reasoning/coding |
| 24 | Llama 3.3 70B | Meta · USA | Legacy | General workhorse |
| 25 | Gemma 4 (31B / 26B-A4B) | Google · USA | New | Edge & on-device |
| 26 | Command A | Cohere · Canada | New | Enterprise RAG |
| 27 | Midjourney v7 | Midjourney · USA | New (long lineage) | Artistic |
| 28 | Nano Banana Pro | Google · USA | New | Stylised + ecosystem |
| 29 | Google Veo 3.1 | Google DeepMind · USA | New | Cinematic + audio |
| 30 | Seedance 2.0 | ByteDance · China | New | Storytelling + character |
Six Forces Reshaping the Market
1. Convergence at the top
By June 2026 the four US frontier models (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro and Grok 4.3) are separated by fewer than nine points on the Artificial Analysis Intelligence Index (61.4, 60.2, 57 and 53 respectively). Raw capability is no longer the differentiator; workflow fit, ecosystem, latency and price are.
2. The agentic battleground
Agentic AI — models that plan and execute multi-step tasks autonomously, calling tools and controlling software — is the fastest-growing use case and the primary axis of competition. Native computer-use (GPT-5.5), persistent coding agents (Claude Code), and multi-agent orchestration (Kimi’s claimed 300 sub-agents) define the 2026 frontier. The Agentic AI Foundation (under the Linux Foundation, Dec 2025) and MCP’s 97M+ installs signal that agent infrastructure has become shared, standard plumbing.
3. Open-weight models went production-grade
Qwen 3.5, DeepSeek V4, GLM-5.2 and Llama 4 now match or beat proprietary models on key benchmarks while being self-hostable. The serious AI stack is now a portfolio: frontier closed models for the hardest work, cheap hosted open-weight APIs for volume, local models for privacy/fallback, and task-specific models for niches.
4. The cost floor collapsed
Self-hosting Llama pushes the marginal per-token cost toward zero at scale; DeepSeek V4 Flash lists at ≈$0.14/$0.28 per 1M input/output tokens; Gemini 2.5 Flash at ≈$0.30 per 1M input tokens. Teams still using 2025 pricing assumptions are often overpaying by 2–5×. Prompt caching (up to 90% off cached input on Claude) further reshapes economics.
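The arithmetic behind these price points is easy to check. The sketch below is illustrative only: the DeepSeek V4 Flash prices are the approximate list prices quoted above, while the workload size, the 80% cached share and the $3/$15 "frontier-priced" model are hypothetical placeholders, not figures from any provider. It also ignores any surcharge a provider may apply when a cache entry is first written.

```python
PER_MILLION = 1_000_000


def monthly_cost(input_tokens, output_tokens, input_price, output_price,
                 cached_fraction=0.0, cache_discount=0.0):
    """Return the USD cost of a workload.

    input_price and output_price are USD per 1M tokens.
    cached_fraction is the share of input tokens served from a prompt cache;
    those tokens are billed at (1 - cache_discount) of the input price.
    """
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    billable_input = uncached + cached * (1.0 - cache_discount)
    input_cost = billable_input * input_price / PER_MILLION
    output_cost = output_tokens * output_price / PER_MILLION
    return input_cost + output_cost


# Hypothetical workload: 2B input tokens and 500M output tokens per month.
IN_TOKENS, OUT_TOKENS = 2_000_000_000, 500_000_000

# DeepSeek V4 Flash at the approximate list price quoted in the text.
flash = monthly_cost(IN_TOKENS, OUT_TOKENS, 0.14, 0.28)

# Hypothetical frontier-priced model ($3 in / $15 out), without and with
# 80% of input tokens hitting a cache that is discounted by 90%.
no_cache = monthly_cost(IN_TOKENS, OUT_TOKENS, 3.0, 15.0)
with_cache = monthly_cost(IN_TOKENS, OUT_TOKENS, 3.0, 15.0,
                          cached_fraction=0.8, cache_discount=0.9)

print(f"V4 Flash:            ${flash:,.2f}")
print(f"Frontier, no cache:  ${no_cache:,.2f}")
print(f"Frontier, 80% cache: ${with_cache:,.2f}")
```

Note that caching discounts input tokens only, so output-heavy workloads benefit far less than the headline percentage suggests.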
5. Compressed release cadence
Anthropic shipped Opus 4.5 → 4.6 → 4.7 → 4.8 between November 2025 and May 2026; OpenAI moved GPT-5.1 → 5.5 with 5.6 rumoured weeks later. Endpoint pinning and a standing evaluation framework are now operational necessities, not nice-to-haves.
6. Multimodal & creative convergence
Image and video generation matured into dependable production tools. Native audio in video (Veo 3.1, Seedance 2.0, Kling Omni) collapses post-production steps, and unified platforms (ElevenLabs, Flux AI, Adobe Firefly) stitch many models into single pipelines — shifting value from individual models to orchestrated workflows.
Selection & Routing Guidance
No single model wins everything. Send each task to the cheapest model that clears your quality bar; reserve frontier models for the hardest work.
| Task | Recommended models |
|---|---|
| Sustained, complex coding & agents | Claude Opus 4.8 (+ Claude Code); GPT-5.3 Codex for fast CLI loops. |
| Best-value production coding | Claude Sonnet 4.6; open: DeepSeek V4 Pro / GLM-5.2. |
| Reasoning, maths & data analysis | Gemini 3.1 Pro; open: DeepSeek R1 / V4 Pro, Qwen 3.5. |
| Creative writing & tone | GPT-5.5. |
| Multimodal understanding | Gemini 3.1 Pro; open: Kimi K2.6. |
| Real-time / trend-aware | Grok 4.3. |
| Ultra-long context | Llama 4 Scout (10M tokens). |
| High-volume / cheap throughput | Gemini 2.5 Flash, Claude Haiku 4.5, self-hosted Llama / DeepSeek V4 Flash. |
| Edge / on-device | Gemma 4; Mistral Small 4. |
| Enterprise grounded RAG | Cohere Command A. |
| Image generation | ChatGPT Images 2.0 (photoreal), Midjourney v7 (art), Nano Banana Pro (stylised), Claude Design (brand). |
| Video generation | Veo 3.1 (cinematic + audio), Seedance 2.0 (multi-character). |
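In practice, a routing table like the one above usually ends up as a small lookup with fallbacks. The following is a minimal sketch, not a reference implementation: the task labels, the primary/fallback ordering and the default route are assumptions made for illustration, and the strings are the display names used in this thesis rather than real API model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    primary: str
    fallbacks: tuple = ()


# Task keys are hypothetical labels; model names come from the table above.
ROUTES = {
    "complex_coding": Route("Claude Opus 4.8", ("GPT-5.3 Codex",)),
    "value_coding": Route("Claude Sonnet 4.6", ("DeepSeek V4 Pro", "GLM-5.2")),
    "reasoning": Route("Gemini 3.1 Pro", ("DeepSeek V4 Pro", "Qwen 3.5")),
    "creative_writing": Route("GPT-5.5"),
    "multimodal": Route("Gemini 3.1 Pro", ("Kimi K2.6",)),
    "realtime": Route("Grok 4.3"),
    "long_context": Route("Llama 4 Scout"),
    "bulk": Route("Gemini 2.5 Flash", ("Claude Haiku 4.5", "DeepSeek V4 Flash")),
    "edge": Route("Gemma 4", ("Mistral Small 4",)),
    "rag": Route("Cohere Command A"),
}

# Assumption: unknown task types go to a balanced mid-tier model.
DEFAULT_ROUTE = Route("Claude Sonnet 4.6")


def pick_model(task, unavailable=frozenset()):
    """Return the first model on the task's route that is not marked unavailable."""
    route = ROUTES.get(task, DEFAULT_ROUTE)
    for name in (route.primary, *route.fallbacks):
        if name not in unavailable:
            return name
    raise LookupError(f"no available model for task {task!r}")


print(pick_model("complex_coding"))
print(pick_model("bulk", unavailable={"Gemini 2.5 Flash"}))
```

Keeping the table as data rather than branching logic means a model swap is a one-line change, which matters when the frontier moves every few weeks.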
Operating principles
Pin production endpoints to explicit versions · keep a standing evaluation set · treat single-provider lock-in as a cost · use prompt caching and batch APIs aggressively · self-host open-weight models for sensitive data.
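Two of these principles, pinned endpoints and a standing evaluation set, can be combined into a simple promotion gate. The sketch below assumes a model is exposed as a plain function from prompt to text; the pinned version strings, the two-item evaluation set and the stub models are all hypothetical placeholders, and a real evaluation set would be far larger and drawn from production traffic.

```python
# Hypothetical pinned configuration: explicit versions, never a floating "latest".
PINNED_ENDPOINTS = {
    "coding": "provider-a/model-x@2026-05-01",
    "bulk": "provider-b/model-y@2026-03-15",
}

# A standing evaluation set: (prompt, checker) pairs kept under version control.
EVAL_SET = [
    ("What is 2 + 2? Answer with a number only.", lambda out: out.strip() == "4"),
    ("Name the capital of France.", lambda out: "Paris" in out),
]


def pass_rate(model_fn, eval_set):
    """Fraction of evaluation items whose checker accepts the model's output."""
    if not eval_set:
        raise ValueError("evaluation set is empty")
    passed = sum(1 for prompt, check in eval_set if check(model_fn(prompt)))
    return passed / len(eval_set)


def should_promote(candidate_fn, incumbent_fn, eval_set, min_gain=0.01):
    """Promote only if the candidate beats the incumbent by at least min_gain."""
    return pass_rate(candidate_fn, eval_set) >= pass_rate(incumbent_fn, eval_set) + min_gain


# Stub models standing in for real API calls.
def incumbent(prompt):
    return "4"


def candidate(prompt):
    return "Paris" if "France" in prompt else "4"


if should_promote(candidate, incumbent, EVAL_SET):
    PINNED_ENDPOINTS["coding"] = "provider-a/model-x@2026-06-01"  # hypothetical new pin
print(PINNED_ENDPOINTS["coding"])
```

The gate makes a version bump a deliberate, measured change to configuration rather than something that happens silently when a provider updates an alias.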
The Makers Behind the Models
Anthropic (USA)
Safety-first lab behind Claude; originated MCP and the deepest agentic tooling (Claude Code, Cowork). 2026 line spans Haiku → Sonnet → Opus → the new Mythos tier.
OpenAI (USA)
Triggered the consumer-AI era with ChatGPT; broadest product surface; pioneered reasoning models with o1. GPT-5.x leads creative writing and terminal workflows.
Google DeepMind (USA)
Gemini bets on native multimodality + ecosystem integration; leads reasoning and video (Veo); ships the most genuinely open Gemma edge models.
xAI (USA)
Differentiates on real-time X data and aggressive pricing; multi-agent Grok architecture; Grok 5 (~6T params) on the horizon.
Meta (USA)
Open-weight Llama seeded the open ecosystem and made self-hosting mainstream; class-leading 10M-token context in Scout.
DeepSeek (China)
Repeatedly disrupted frontier pricing (R1, then efficient V4 MoE at 1M context); MIT-licensed, self-host favourite.
Alibaba / Qwen (China)
Download leader of 2026 (>50% of open-model downloads); Apache-2.0 family across general, coding, multimodal and 200+ languages.
Moonshot, Z.ai & others (China)
Kimi (multimodal agentic), GLM (long-horizon coding), plus MiniMax and StepFun — a deep Chinese open bench.
Mistral AI (France/EU)
Europe’s frontier open lab; now Apache-2.0 across Large/Medium/Small — the sovereign, on-prem choice.
Cohere (Canada) & creative labs
Cohere specialises in grounded enterprise RAG; Midjourney, Black Forest Labs (FLUX), ByteDance (Seedance) and Kuaishou (Kling) define creative generation.
On the Horizon
The frontier captured here will have moved by the time it is read. As of June 2026 the most-watched imminent releases:
GPT-5.6 (OpenAI)
Leaked in the Codex backend and canary-tested against live traffic; expected late June 2026. Reportedly rebuilt around a redesigned reward-audit pipeline after the GPT-5.5 persona-contamination issues.
Gemini 3.5 Pro (Google)
Trailed at Google I/O alongside the shipped 3.5 Flash; Pro promised “next month” as of late May 2026, aimed at the hard-reasoning ceiling.
Claude Mythos line (Anthropic)
New Mythos tier above Opus (Mythos 5 + safeguarded Fable 5, June 2026), with Opus 4.8 as a fallback safety layer. Access limited and partly export-restricted.
Grok 5 (xAI)
A rumoured ~6-trillion-parameter model targeted for the Q2 2026 window — roughly double Grok 4’s scale.
Conclusion
In 2024 the AI conversation was about a single best model. In 2026 it is about a portfolio. The frontier has converged so tightly that the marginal capability difference between the top closed models is smaller than the difference a good routing strategy makes — and open-weight models have made self-hosting frontier-grade intelligence a routine engineering decision.
The thirty models here are best understood not as competitors for one throne but as a toolkit. The organisations extracting the most value built the discipline to match each task to the right tool, measure relentlessly, and switch without friction as the frontier moves — which, in 2026, it does every few weeks.
The one durable skill
Model choice is now a configuration decision, not an engineering project. The durable advantage is the evaluation-and-routing discipline that lets an organisation absorb a new frontier model the week it ships — and drop it the week something better arrives.






