Comprehensive Thesis: Top 30 Old and New AI Models in the Global Market (2026 Edition)

A Comprehensive Thesis on the Top 30 Legacy and Frontier AI Models, Their Unique Capabilities, and Their Latest Products

Technology & AI Research Division · June 2026 · 30 Models · 13 Makers · 4 Nations · Frontier to Edge

Executive Summary

This thesis surveys thirty of the most consequential AI models on the market as of June 2026 — the closed-source frontier, the open-weight challengers, edge and enterprise specialists, and the image and video generators that turned creative production into an automated pipeline.

The headline of 2026 is convergence. The four leading US frontier models sit within about eight points of one another on the Artificial Analysis Intelligence Index (≈53 to 61.4), and “which model is best” has become “which model for which job.” Claude Opus 4.8 leads overall and on sustained coding; GPT-5.5 owns creative writing and terminal workflows; Gemini 3.1 Pro leads reasoning, data analysis and multimodality; Grok 4.3 is the budget agentic option with real-time data.

Beneath them, open-weight models from China (DeepSeek, Qwen, Kimi, GLM), the US (Llama, Gemma) and Europe (Mistral) have closed most of the capability gap while collapsing cost. The serious AI stack is no longer a single model but a routed portfolio.

Bottom line

There is no single best model in 2026. Capability has converged at the top; differentiation lives in agentic tooling, ecosystem fit, latency, licensing and price. The winning strategy is a multi-model portfolio with a standing evaluation framework and pinned endpoints.

A Brief History: How We Got to Thirty Models

Phase 1 — Transformer foundation (2017–2022). The transformer architecture made internet-scale training practical. GPT-3 showed scale produces general capability; ChatGPT (2022) turned it into a mass-market product and started the race.

Phase 2 — The GPT-4 era (2023–2024). GPT-4 set the bar. Claude established a safety-and-coding track, Google consolidated into Gemini, Meta open-sourced Llama. GPT-4o mainstreamed multimodal voice/vision; Claude 3.5 Sonnet defined the coding assistant.

Phase 3 — The reasoning turn (late 2024–2025). OpenAI’s o1 opened the inference-compute reasoning axis; DeepSeek-R1 proved open weights could match it cheaply, forcing a global re-pricing of frontier reasoning.

Phase 4 — Agentic & convergence (2025–2026). The late-2025 wave shifted the contest from “what can it answer” to “what can it do.” Native computer-use, persistent coding agents and multi-agent orchestration arrived; MCP crossed 97M installs and an Agentic AI Foundation formed under the Linux Foundation. By mid-2026 routing strategy mattered more than model choice.

The compression of time

GPT-3 to ChatGPT took two years; ChatGPT to the agentic frontier took three. Anthropic shipped four Opus versions (4.5–4.8) in six months. The frontier now moves faster than most organisations can re-architect.

Reading the Benchmarks Critically

Every figure here should be read with informed scepticism. Benchmarks are signals, not verdicts.

Intelligence Index — a composite; good for rough ranking, misleading for a specific workload.
SWE-bench Verified / Pro — real GitHub issues; the most realistic coding test, but harness-dependent.
GPQA Diamond — graduate-level science reasoning.
AIME / MATH-500 / FrontierMath — competition & research maths (FrontierMath v2 was re-released June 2026 after corrections).
LMSYS Arena (Elo) — human preference; captures “feel” but reflects popularity too.
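
To make the Elo entry concrete: a rating gap translates into an expected head-to-head preference rate. The sketch below assumes the classic Elo expected-score formula on a 400-point logistic scale; the ratings are hypothetical and not taken from any leaderboard.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Classic Elo expected score of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Hypothetical ratings, for illustration only.
gap_20 = expected_win_rate(1420, 1400)
gap_100 = expected_win_rate(1500, 1400)
print(f"20-point gap:  {gap_20:.3f}")
print(f"100-point gap: {gap_100:.3f}")
```

Under this formula a 20-point lead corresponds to roughly a 53% preference rate and a 100-point lead to roughly 64%, which is why small Arena gaps between frontier models rarely justify a switch on their own.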

The only benchmark that matters

Your own. A reproducible evaluation set from real tasks — measuring quality, latency, cost-per-task and error rate — beats any public leaderboard. Treat published numbers as a shortlist generator, not a decision.
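
A private evaluation set of this kind can be very small. The following is a minimal sketch, not a definitive harness: `run_eval`, `EvalResult` and `stub_model` are hypothetical names, the stub stands in for a real API client, and the per-million-token prices passed in are placeholders rather than quoted rates. It records the four measures named above: quality (exact-match accuracy), latency, cost-per-task and error rate.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class EvalResult:
    accuracy: float        # share of tasks answered correctly
    error_rate: float      # share of tasks where the call raised
    mean_latency_s: float  # mean wall-clock seconds per task
    cost_per_task: float   # mean USD per task

def run_eval(
    model: Callable[[str], Tuple[str, int, int]],  # returns (answer, input_tokens, output_tokens)
    tasks: List[Tuple[str, str]],                  # (prompt, expected answer)
    usd_per_m_in: float,
    usd_per_m_out: float,
) -> EvalResult:
    correct = errors = 0
    latency = cost = 0.0
    for prompt, expected in tasks:
        start = time.perf_counter()
        try:
            answer, tok_in, tok_out = model(prompt)
        except Exception:
            errors += 1  # failed calls count as errors and are not scored
            continue
        finally:
            latency += time.perf_counter() - start
        cost += (tok_in * usd_per_m_in + tok_out * usd_per_m_out) / 1_000_000
        correct += answer.strip() == expected
    n = len(tasks)
    return EvalResult(correct / n, errors / n, latency / n, cost / n)

def stub_model(prompt: str) -> Tuple[str, int, int]:
    # Hypothetical stand-in for a real API client.
    if "fail" in prompt:
        raise RuntimeError("simulated API error")
    return ("4" if "2+2" in prompt else "unknown"), 1000, 200

tasks = [("2+2?", "4"), ("capital of France?", "Paris"), ("fail please", "n/a")]
result = run_eval(stub_model, tasks, usd_per_m_in=3.0, usd_per_m_out=15.0)
print(result)
```

Exact-match scoring is the simplest possible quality measure; real task sets usually need a rubric or a grader, but the bookkeeping for latency, cost and errors stays the same.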

Part I — The Frontier: Closed-Source Leaders & Their Legacy

01 Claude Opus 4.8

Anthropic · USA · May 2026

New

The model to beat in mid-2026, and the spine of Anthropic’s agentic product line.

Type: Frontier closed-source · Reasoning + agentic

Context: ~1M tokens (input)

Pricing: $5 in / $25 out per 1M tokens (Fast mode $10 / $50)

Capabilities

Currently the #1 overall model on the Artificial Analysis Intelligence Index (61.4). The strongest sustained-coding and long-horizon agentic model available, leading SWE-bench Verified (≈88.6%) and SWE-bench Pro (≈69.2%). Its lead grows the longer and more complex the task, making it the default for multi-file engineering, strategic synthesis and document-heavy work.

What makes it unique

Best-in-class agentic coding paired with Claude Code; pioneered the Model Context Protocol (MCP) now standard across the industry. Prompt caching can cut input cost up to 90%.
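
The effect of prompt caching on a long-context request can be estimated with simple arithmetic. This sketch uses the list prices quoted above ($5 in / $25 out per 1M tokens) and assumes, as a simplification, that cached input tokens are billed at 10% of the list rate and that cache-write surcharges are ignored; `request_cost_usd` is a hypothetical helper, not part of any SDK.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    cached_fraction: float = 0.0,   # share of input served from cache
    usd_per_m_in: float = 5.0,      # list price from the text
    usd_per_m_out: float = 25.0,    # list price from the text
    cache_discount: float = 0.90,   # "up to 90%" taken at face value (assumption)
) -> float:
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (fresh + cached * (1 - cache_discount)) * usd_per_m_in / 1e6
    output_cost = output_tokens * usd_per_m_out / 1e6
    return input_cost + output_cost

# A 200K-token prompt with a 2K-token answer, with and without a 95% cache hit.
no_cache = request_cost_usd(200_000, 2_000)
with_cache = request_cost_usd(200_000, 2_000, cached_fraction=0.95)
print(f"no cache:   ${no_cache:.3f}")
print(f"with cache: ${with_cache:.3f}")
```

Under these assumptions the request falls from about $1.05 to about $0.20, so the saving depends almost entirely on how much of the prompt is a stable, reusable prefix.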

Latest products & successors

Powers Claude Code, Claude Cowork, and agentic browser/Excel/PowerPoint tools. Sits beneath Anthropic’s new Mythos tier (Claude Mythos 5 / Fable 5, June 2026), with Opus 4.8 acting as a safety fallback layer.

Considerations & trade-offs

Premium pricing and an optional Fast mode that doubles cost; the very top of the market, so overkill (and over-budget) for routine tasks better served by Sonnet or Haiku.

02 GPT-5.5

OpenAI · USA · April 2026

New

OpenAI’s do-everything flagship and the broadest consumer-to-enterprise footprint in AI.

Type: Frontier closed-source · Agentic + creative

Context: ~1M tokens in / 128K out

Pricing: Standard tier ≈ $2.50 in / $15 out per 1M (Pro variant far higher)

Capabilities

OpenAI’s current flagship, second on the Intelligence Index (60.2) and effectively tied with Opus 4.8 for top coding performance. Built for agentic and professional work — research, tool use and long-horizon tasks — while staying token-efficient. The leading model for creative writing with a warm, natural tone, and unmatched 128K max output for long-form generation.

What makes it unique

Native computer-use (controls browsers, fills forms, executes workflows); broadest product ecosystem via ChatGPT. Strongest CLI/terminal workflow performance.

Latest products & successors

GPT-5.5 (April 2026) succeeds the Nov-2025 GPT-5.1 wave. GPT-5.6 is widely rumoured for late June 2026, reportedly rebuilt around a redesigned reward-audit pipeline after the GPT-5.5 ‘Goblin’ persona-contamination post-mortem.

Considerations & trade-offs

The GPT-5.5 ‘Goblin’ persona-contamination episode showed reward-model fragility; a 5.6 successor is rumoured imminently, so pin versions for production.

03 Gemini 3.1 Pro

Google DeepMind · USA · February 2026

New

The most underrated frontier model — quietly leading reasoning and multimodality.

Type: Frontier closed-source · Multimodal + reasoning

Context: ~1M tokens (price roughly doubles above 200K)

Pricing: Competitive; roughly doubles for >200K-token contexts

Capabilities

Leads the field on reasoning and data analysis, and is consistently strong on multimodal tasks (text, image, audio, video). Frequently underestimated but competitive with the very top tier, and the cheapest of the closed frontier models for short prompts. Sits near the top of the Intelligence Index (≈57), a few points behind Opus 4.8 and GPT-5.5.

What makes it unique

Deepest native multimodality and tightest Google ecosystem integration (Search, Workspace, Android). Excellent grounding via live Google Search.

Latest products & successors

Gemini 3.5 Flash shipped at Google I/O (May 2026); Gemini 3.5 Pro promised ‘next month’ as of late May. The Gemini Omni line now powers native video+audio generation.

Considerations & trade-offs

Pricing roughly doubles above 200K-token contexts, which can surprise long-document workloads; ecosystem lock-in to Google for the best experience.
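
The long-context surcharge is easy to model before committing a document-heavy workload. The sketch below assumes the whole request is billed at the doubled rate once the prompt exceeds 200K tokens; the base price is a hypothetical placeholder, since no list price is quoted here, and `input_cost_usd` is an illustrative helper.

```python
LONG_CONTEXT_THRESHOLD = 200_000  # threshold from the text

def input_cost_usd(
    prompt_tokens: int,
    base_usd_per_m: float,
    long_multiplier: float = 2.0,  # "roughly doubles" (assumption: exactly 2x)
) -> float:
    # Assumption: the higher rate applies to the entire prompt, not just the excess.
    over = prompt_tokens > LONG_CONTEXT_THRESHOLD
    rate = base_usd_per_m * (long_multiplier if over else 1.0)
    return prompt_tokens * rate / 1_000_000

BASE = 2.0  # hypothetical USD per 1M input tokens
at_threshold = input_cost_usd(200_000, BASE)
just_over = input_cost_usd(200_001, BASE)
print(f"200,000 tokens: ${at_threshold:.2f}")
print(f"200,001 tokens: ${just_over:.2f}")
```

If billing works this way, one extra token doubles the input bill, so chunking or trimming prompts to stay under the threshold can matter more than the headline rate.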

04 Grok 4.3

xAI · USA · April 2026

New

The budget frontier option with a real-time data advantage no rival matches.

Type: Frontier closed-source · Real-time + agentic

Context: ~1M tokens

Pricing: ≈ $1.25 in / $2.50 out per 1M tokens

Capabilities

The cheapest of the four US frontier models, with strong agentic and tool-use scores (≈94.1% on agentic-accuracy benchmarks via its ReAct-2 framework). Native video input and top-tier long-chain agent capability. Intelligence Index ≈53.

What makes it unique

Real-time X/web data access through an ‘X-Platform Latent Index’ that surfaces global trends faster than search-based agents; a 4-agent internal system (Grok, Harper, Benjamin, Lucas).

Latest products & successors

Grok 4.3 (April 2026) is the current public flagship; Grok 5 (a rumoured ~6T-parameter model) is targeted for the Q2 2026 window. SuperGrok Video added competitive portrait/character video generation.

Considerations & trade-offs

Trails Opus 4.8 and DeepSeek V4 Pro on pure coding; the X-data tie-in is a strength for trend tasks but irrelevant for many enterprise uses.

05 Claude Sonnet 4.6

Anthropic · USA · February 2026

New

The default ‘most-tasks’ model — near-Opus quality at a working budget.

Type: Workhorse closed-source · Balanced

Context: ~1M tokens (beta)

Pricing: ≈ $3 in per 1M tokens (output higher)

Capabilities

The best value for production coding — near-Opus quality at a fraction of the cost. The default ‘most tasks’ model for marketing, technical docs, integration logic and extended coding, scoring ≈89.3% on GPQA Diamond. Widely used as the balanced tier in multi-model routing stacks.

What makes it unique

Near-frontier quality at ~$3/M; 1M-token context in beta; shares Claude’s agentic tooling and MCP support.

Latest products & successors

Current API string claude-sonnet-4-6. Frequently paired with Opus 4.8 (hard tasks) and Haiku 4.5 (cheap tasks) in production routing.
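
A three-tier routing rule of this kind can be expressed in a few lines. This is a sketch under stated assumptions: the Sonnet and Haiku strings are the ones given in this document, the Opus entry is a hypothetical placeholder because no API string is quoted for it, and the `Task` fields and the 20-step threshold are invented for illustration.

```python
from dataclasses import dataclass

MODEL_BY_TIER = {
    "cheap": "claude-haiku-4-5",       # string quoted in this document
    "balanced": "claude-sonnet-4-6",   # string quoted in this document
    "hard": "opus-4-8-placeholder",    # hypothetical; substitute the real pinned string
}

# High-volume task kinds suited to the budget tier (assumed categories).
CHEAP_KINDS = {"classification", "extraction", "routing"}

@dataclass
class Task:
    kind: str           # e.g. "classification", "docs", "multi_file_refactor"
    est_steps: int = 1  # rough count of tool calls or edits expected

def pick_model(task: Task, long_horizon_steps: int = 20) -> str:
    """Route cheap kinds to the budget tier and long-horizon work to the top tier."""
    if task.kind in CHEAP_KINDS:
        tier = "cheap"
    elif task.est_steps >= long_horizon_steps:
        tier = "hard"
    else:
        tier = "balanced"
    return MODEL_BY_TIER[tier]

print(pick_model(Task("extraction")))
print(pick_model(Task("docs", est_steps=3)))
print(pick_model(Task("multi_file_refactor", est_steps=40)))
```

Keeping the model strings in one table also makes version pinning explicit: a model upgrade becomes a one-line change that can be gated on a fresh evaluation run.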

Considerations & trade-offs

Not the absolute top for the hardest problems; reserve Opus 4.8 for genuinely frontier coding and long-horizon agents.

06 GPT-5.3 Codex

OpenAI · USA · Early 2026

New

A coding-and-agent scalpel rather than a general-purpose chat model.

Type: Specialised closed-source · Code/agent

Context: Large (codebase-scale)

Pricing: Usage-based via OpenAI API

Capabilities

A coding-and-agent-tuned GPT variant optimised for terminal workflows, file editing and debugging. Notable for extremely low first-token latency in benchmark harnesses, making it well-suited to fast interactive coding agents.

What makes it unique

Tuned specifically for the Codex agent backend and CLI/terminal task execution rather than open-ended chat.

Latest products & successors

Runs inside OpenAI’s Codex agent. A ‘gpt-5.6’ routing reference was spotted in the Codex backend in May 2026, hinting at an imminent successor.

Considerations & trade-offs

Narrowly tuned for the Codex/CLI context; less suited to open-ended conversation or creative work.

07 Claude Haiku 4.5

Anthropic · USA · Oct 2025

New

The cheap, fast leg of a routed Claude stack.

Type: Budget closed-source · Fast

Context: Large

Pricing: Budget tier (lowest of the Claude family)

Capabilities

Anthropic’s fast, low-cost tier for high-volume work — classification, extraction, routing and lightweight chat — while retaining Claude’s tool-use and safety behaviour. Competes with GPT-5 nano for budget-tier tasks.

What makes it unique

Low latency and cost with full MCP/tool support; ideal as the cheap leg of a routed multi-model stack.

Latest products & successors

API string claude-haiku-4-5. Remains the current Haiku generation as of mid-2026.

Considerations & trade-offs

Not for hard reasoning or complex coding; a router should escalate difficult tasks to Sonnet or Opus.

08 Gemini 2.5 Flash

Google DeepMind · USA · 2025–26

Legacy (still widely used)

The price/throughput champion that still handles most routine work.

Type: Budget closed-source · High-throughput

Context: Large

Pricing: ≈ $0.30/M tokens

Capabilities

Consistently offers the best capability-to-price ratio among hosted models — often the cheapest strong model available (≈$0.30/M). The default for high-volume summarisation, Slack digests and standard-turnaround tasks at scale.

What makes it unique

Extremely cheap, fast, and good enough for the majority of routine production tasks; deep Google integration.

Latest products & successors

Superseded at the quality frontier by Gemini 3.5 Flash (May 2026) but retained for its price/throughput; still a routing favourite.

Considerations & trade-offs

Behind the 2026 frontier on hard tasks; superseded in quality by Gemini 3.5 Flash but kept for cost.

09 GPT-4o

OpenAI · USA · 2024

Legacy

The model that taught the public what an AI assistant feels like.

Type: Legacy flagship · Multimodal

Context: 128K tokens

Pricing: Mid-tier legacy pricing

Capabilities

The model that mainstreamed real-time multimodal interaction (text, vision, voice) and defined the assistant experience for over a year. Still a capable, widely-integrated general model, though now well behind the 2026 frontier on reasoning and coding.

What makes it unique

Fast omni-modal voice/vision; enormous installed base and tooling. A reference point for ‘GPT-4-class’ as a capability tier.

Latest products & successors

Superseded by GPT-5.x. Remains available via API (gpt-4o, gpt-4o-mini) for cost-sensitive or compatibility-bound workloads.

Considerations & trade-offs

Well behind 2026 frontier on reasoning and coding; retained mainly for cost, compatibility and its huge installed base.

10 Claude 3.5 Sonnet

Anthropic · USA · 2024

Legacy

The coding assistant that defined an era.

Type: Legacy flagship · Coding

Context: 200K tokens

Pricing: Legacy pricing

Capabilities

A landmark 2024 model that set the standard for coding assistants of its era (≈49% on SWE-bench Verified for its October 2024 update) and popularised Artifacts-style structured output. Historically important as the model that established Claude’s coding reputation.

What makes it unique

Introduced computer-use as a public beta; strong instruction-following and ‘feel’ that defined a generation of coding tools.

Latest products & successors

Long superseded by the Claude 4.x line, but still referenced as a capability baseline and available for legacy integrations.

Considerations & trade-offs

Long superseded; relevant now as a capability baseline and for legacy integrations, not new builds.

11 OpenAI o1

OpenAI · USA · Late 2024

Legacy

The model that created the reasoning-model category.

Type: Legacy reasoning · Chain-of-thought

Context: 128K tokens

Pricing: Legacy reasoning-tier pricing

Capabilities

The model that launched the dedicated ‘reasoning model’ category — trading speed for accuracy by spending inference compute on hidden chain-of-thought. Established the template (o1 → o3 → o4-mini) that every major lab has since adopted.

What makes it unique

First mainstream test-time-compute reasoning model; excelled at hard maths, logic and structured analysis relative to its generation.

Latest products & successors

Superseded by the o3/o4 series and folded into the GPT-5 reasoning stack, but historically pivotal.

Considerations & trade-offs

Folded into the GPT-5 reasoning stack; historically pivotal but not a current production choice.

12 Claude Opus 4.5

Anthropic · USA · Nov 2025

Recent legacy

The November-2025 flagship that proved autonomous long-running coding at scale.

Type: Recent-legacy flagship · Coding

Context: ~1M tokens

Pricing: Premium tier (now lower than 4.8)

Capabilities

The November-2025 flagship that topped SWE-bench Verified at 80.9% and anchored the late-2025 frontier wave alongside GPT-5.1, Grok 4.1 and Gemini 3 Pro. Directly preceded the 4.6/4.7/4.8 rapid-iteration cycle.

What makes it unique

Demonstrated the step-change in autonomous, long-running coding that Opus 4.8 later extended; strong agentic reliability.

Latest products & successors

Superseded within months by Opus 4.6 (Feb), 4.7 (April) and 4.8 (May 2026) — illustrating 2026’s compressed release cadence.

Considerations & trade-offs

Superseded within months by 4.6/4.7/4.8; choose 4.8 unless cost or pinning dictates otherwise.

Part II — The Open-Weight Insurgency & Chinese Frontier

13 DeepSeek V4 Pro

DeepSeek · China · April 2026

New

The open-weight model that re-priced frontier reasoning — again.

Type: Open-weight frontier · Reasoning + coding

Context: 1M tokens (native)

Pricing: List ≈ $1.74 in / $3.48 out per 1M; self-host = near-zero

Capabilities

The most capable open-weight model for reasoning and mathematics. A 1.6T-parameter MoE (≈42–49B active per token) scoring ≈80.6% on SWE-bench Verified — about 7–8 points above Grok 4.3 on coding. Disrupts the ‘intelligence-per-dollar’ curve with aggressive pricing.
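
The mixture-of-experts figures explain much of the pricing: only a small share of the weights does work on any given token. A quick sketch of the arithmetic, using the parameter counts quoted above and treating 1.6T as 1,600B; `active_fraction` is an illustrative helper.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of a MoE model's parameters used per token."""
    return active_params_b / total_params_b

# Figures from the text: 1.6T total, roughly 42-49B active per token.
low = active_fraction(1600, 42)
high = active_fraction(1600, 49)
print(f"active share per token: {low:.4f} to {high:.4f}")
```

Roughly 3% of the parameters are active per token. As a rule of thumb, per-token compute tracks the active count, while the memory needed to host the model still tracks the full 1.6T, which is why self-hosting remains a serious hardware commitment.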

What makes it unique

MIT licence (fully self-hostable); hybrid Compressed/Heavily-Compressed Attention cuts FLOPs to ~27% and KV-cache to ~10% vs V3.2 at 1M context. Three reasoning modes: Non-think, Think High, Think Max.

Latest products & successors

DeepSeek V4 (Pro + Flash) public preview, April 2026. Legacy deepseek-chat / deepseek-reasoner endpoints retire 24 July 2026 — migrate to V4.
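
A retirement date like this is worth encoding as an automated check rather than a calendar reminder. The sketch below assumes a deployment keeps its configured model names in a simple list; `needs_migration` and the 60-day warning window are hypothetical, while the endpoint names and date come from the text.

```python
from datetime import date
from typing import Iterable, List

RETIREMENT = date(2026, 7, 24)  # retirement date from the text
LEGACY_ENDPOINTS = {"deepseek-chat", "deepseek-reasoner"}

def days_until_retirement(today: date) -> int:
    return (RETIREMENT - today).days

def needs_migration(configured: Iterable[str], today: date, warn_days: int = 60) -> List[str]:
    """Return legacy endpoint names still configured once inside the warning window."""
    if days_until_retirement(today) > warn_days:
        return []
    return sorted(set(configured) & LEGACY_ENDPOINTS)

today = date(2026, 6, 15)
print(days_until_retirement(today))
print(needs_migration(["deepseek-chat", "some-other-model"], today))
```

Run in CI or at service start-up, a check like this turns a deprecation notice into a failing build well before the endpoint disappears.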

Considerations & trade-offs

As a Chinese-origin model, cloud use raises data-governance questions; self-host the open weights for sensitive work. Legacy endpoints retire 24 July 2026.

14 DeepSeek V4 Flash

DeepSeek · China · April 2026

New

The cheapest credible 1M-context API in its class.

Type: Open-weight · Cost-efficient long-context

Context: 1M tokens

Pricing: ≈ $0.14 in / $0.28 out per 1M tokens

Capabilities

A 284B-total / 13B-active MoE delivering frontier-adjacent quality at one of the lowest costs for 1M-context throughput. The pragmatic choice when the dominant constraint is cheap, long-context volume.

What makes it unique

Best cost-sensitive long-context API in its class; open weights for private deployment.

Latest products & successors

Released alongside V4 Pro (April 2026); listed at ≈$0.14/M input, $0.28/M output.

Considerations & trade-offs

Lower ceiling than V4 Pro; best for high-volume long-context throughput rather than the hardest reasoning.

15 Qwen 3.5 (397B-A17B)

Alibaba · China · 2026

New

The download leader and the most versatile open family.

Type: Open-weight frontier · Multilingual

Context: 262K native (extendable to 1M+)

Pricing: Free to self-host; low hosted-API pricing

Capabilities

Alibaba’s flagship open model and the download leader — Qwen captured over 50% of global open-source model downloads by April 2026. Competitive with GPT-5.5 on reasoning, with multimodal reasoning across text/image/video/documents and coverage of 200+ languages.

What makes it unique

Apache-2.0 licence; the broadest fine-tune ecosystem among Chinese models; specialised Qwen-Coder and Qwen-VL variants.

Latest products & successors

Qwen 3.5 / 3.6 / 3.7 line plus Qwen3-Coder-Next (80B-total, 3B-active) for efficient self-hosted coding agents.

Considerations & trade-offs

Quality varies sharply across the many variants; ‘Coder’-named models didn’t always top independent coding harnesses — test the specific checkpoint.

16 Llama 4 Scout

Meta · USA · 2025–26

New

The context-window king, by a factor no rival approaches.

Type: Open-weight · Ultra-long-context

Context: 10M tokens (class-leading)

Pricing: Free (self-host); Meta custom licence with 700M-MAU cap + EU restrictions

Capabilities

The undisputed context-window champion at 10M tokens — nothing else comes close — enabling whole-codebase or whole-library reasoning in a single pass. Also one of the highest-throughput open models (≈2,600 t/s in benchmarks).

What makes it unique

10M-token context; open weights for self-hosting and fine-tuning; massive community tooling and the largest fine-tune ecosystem.

Latest products & successors

Part of the Llama 4 family (Scout / Maverick / Behemoth). Llama 5 is reportedly in development, building on the largest community ecosystem.

Considerations & trade-offs

Meta’s custom licence caps use at 700M MAU and restricts the EU; raw context size doesn’t guarantee top reasoning quality.

17 Llama 4 Maverick

Meta · USA · 2025–26

New

Meta’s balanced open flagship and the enterprise default.

Type: Open-weight · General-purpose flagship

Context: Very large

Pricing: Free (self-host); Meta custom licence

Capabilities

Meta’s general-purpose open flagship and the highest-MMLU open model in several comparisons (≈85.5%). The go-to open choice for balanced chat, reasoning and agentic features when self-hosting or avoiding vendor lock-in.

What makes it unique

Strong all-round open performance with Meta’s backing; agentic capabilities competitive with proprietary models for many tasks.

Latest products & successors

Current Llama 4 flagship alongside Scout and the larger Behemoth; anchors the most-deployed open-weight family in enterprise.

Considerations & trade-offs

Custom-licence constraints apply; for the very hardest tasks, closed frontier models still lead.

18 Kimi K2.6

Moonshot AI · China · 2026

New

The most disciplined Chinese coding family and a multimodal agentic standout.

Type: Open-weight · Multimodal agentic

Context: 256K tokens

Pricing: Low; open weights available

Capabilities

A native multimodal (text/image/video) agentic model aimed at long-horizon coding, design and autonomous execution. The most disciplined Chinese coding family in independent tests — the only Chinese model to reach ‘Tier A’ with no caveats (≈87/100 vs Opus 4.7’s 97, at ~3.6× lower cost).

What makes it unique

Model-card claim of up to 300 sub-agents coordinating ~4,000 steps; strong cross-lingual code-switching; open weights.

Latest products & successors

Kimi K2.6 (multimodal) follows K2 / K2.5 / K2 Thinking; a top-5 LMSYS Arena presence in 2026.

Considerations & trade-offs

Bold autonomous-orchestration claims (300 sub-agents) need independent verification; still a measurable gap to Opus on secondary code-quality dimensions.

19 GLM-5.2

Z.ai (Zhipu / Tsinghua) · China · 2026

New

An open coding specialist that surprised the industry on SWE-bench.

Type: Open-weight · Long-horizon coding

Context: 1M tokens

Pricing: Low; open weights (MIT)

Capabilities

Positioned around long-horizon coding with strong vendor-reported agentic-coding benchmarks; GLM-5 scores ≈77.8% on SWE-bench Verified and has surprised the industry with coding performance approaching Claude Opus on some tests.

What makes it unique

MIT licence; 1M-token context tuned for sustained coding sessions. (Independent harnesses caught some real-world bugs, so verify on your own stack.)

Latest products & successors

GLM-5 / 5.1 / 5.2 line from Z.ai; a leading open coding family in 2026.

Considerations & trade-offs

Independent harnesses caught real bugs (invented DSLs, history loss); vendor benchmarks ran ahead of some real-world results — verify on your stack.

20 Mistral Large 3

Mistral AI · France/EU · 2026

New

The European sovereign-AI choice with a now-permissive licence.

Type: Open-weight · European frontier

Context: Large

Pricing: Free (self-host, Apache-2.0); hosted API available

Capabilities

The leading European open-weight generalist, with strong multilingual support (80+ languages) and enterprise deployment paths. The default ‘sovereign’ choice for organisations needing an EU-based, vendor-backed open model.

What makes it unique

Now Apache-2.0 (a major shift from Mistral’s earlier restrictive licensing); strong commercial-relationship and on-prem story.

Latest products & successors

Mistral’s 2026 open line spans Large 3, Medium 3.5 and Small 4 — covering generalist, multimodal and efficient agentic use.

Considerations & trade-offs

Not the outright benchmark leader; chosen for EU residency, multilingual strength and on-prem control as much as raw capability.

21 Mistral Small 4

Mistral AI · France/EU · March 2026

New

A pocket-sized agentic coder for local and edge deployment.

Type: Open-weight · Efficient agentic coding

Context: 256K tokens

Pricing: Free (self-host, Apache-2.0)

Capabilities

A compact ~6B-active model that folds in Devstral-style agentic coding capability, built for efficient local and hosted coding agents where footprint and latency matter more than maximum benchmark score.

What makes it unique

Apache-2.0; small enough to serve cheaply while retaining agentic tool-use; strong multilingual coverage.

Latest products & successors

Released March 2026 as part of Mistral’s refreshed open lineup.

Considerations & trade-offs

Small size caps maximum capability; a tool for efficiency-constrained agents, not frontier reasoning.

22 DeepSeek R1

DeepSeek · China · 2025 (updated 2026)

Legacy (landmark)

The landmark that made affordable, private, frontier-adjacent reasoning real.

Type: Open-weight legacy · Reasoning

Context: 128K tokens

Pricing: Free (self-host); ~10× cheaper than GPT-4o-class APIs historically

Capabilities

The model that proved open-weight reasoning could rival proprietary systems — a 671B MoE (≈37B active) scoring ≈97.3% on MATH-500. Hugely influential: it forced a global re-pricing of frontier-grade reasoning and remains a self-hosting favourite for maths and structured analysis.

What makes it unique

MIT licence; <think>-tag chain-of-thought; the reference point for affordable, private, frontier-adjacent reasoning.
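
Because the reasoning arrives inline between tags, downstream code usually has to separate it from the final answer. A minimal sketch, assuming the output wraps its reasoning in well-formed, closed `<think>...</think>` tags; `split_reasoning` is a hypothetical helper and the sample string is made up.

```python
import re
from typing import Tuple

# Non-greedy match so multiple reasoning blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(raw: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from output that wraps reasoning in <think> tags."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return reasoning, answer

sample = "<think>17 * 3 = 51, then add 4.</think>\nThe result is 55."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

Stripping the reasoning before logging or display keeps user-facing output short, while retaining it separately is useful for debugging wrong answers.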

Latest products & successors

Now succeeded by the V4 line and DeepSeek V3.2-Speciale (gold at IMO/IOI/ICPC 2026), but R1 remains widely deployed via Ollama/vLLM.

Considerations & trade-offs

Now succeeded by V4 and V3.2-Speciale; still excellent for self-hosted maths/analysis but no longer the open ceiling.

23 Qwen 3 235B-A22B

Alibaba · China · 2025–26

Legacy

A widely-deployed Qwen baseline with strong maths/coding siblings.

Type: Open-weight legacy · Reasoning/coding

Context: 128K tokens

Pricing: Free (self-host)

Capabilities

A widely-deployed earlier Qwen flagship that outperformed DeepSeek-R1 on several benchmarks (Arena-Hard, LiveBench, LiveCodeBench) and scored ≈85.7% on AIME ’24. Its Qwen-Math and Qwen-Coder siblings remain popular specialised open models.

What makes it unique

Apache-2.0; strong long-context comprehension and specialised maths/coding variants; broad ecosystem.

Latest products & successors

Superseded by the Qwen 3.5/3.6/3.7 generation but still a common self-hosted baseline.

Considerations & trade-offs

Superseded by the 3.5+ generation; kept as a stable, well-understood self-hosted reference.

24 Llama 3.3 70B

Meta · USA · 2024–25

Legacy

The most-pulled open workhorse of the 2024–25 era.

Type: Open-weight legacy · General workhorse

Context: 128K tokens

Pricing: Free (self-host); Meta custom licence

Capabilities

The default recommendation for most local-deployment scenarios through 2025 — RAG systems, chatbots, code assistance and fine-tuning — backed by the largest open ecosystem (most integrations, tutorials and ready-made solutions). Still a high-throughput workhorse (≈2,500 t/s).

What makes it unique

Enormous community tooling; 128K context; reliable, well-understood behaviour for production RAG.

Latest products & successors

Superseded by Llama 4 at the frontier but remains one of the most-pulled models on Ollama in 2026.

Considerations & trade-offs

Behind Llama 4 at the frontier; ideal for proven RAG/chatbot workloads where stability beats peak capability.

Part III — Edge & Enterprise Specialists

25 Gemma 4 (31B / 26B-A4B)

Google · USA · April 2026

New

Google’s most genuinely open release — frontier-adjacent reasoning on one GPU.

Type: Open-weight · Edge & on-device

Context: 256K tokens

Pricing: Free (self-host, Apache-2.0)

Capabilities

Google’s most genuinely open release to date: a 31B dense model delivering top-tier reasoning on a single H100, plus a 26B-A4B MoE giving near-4B serving cost at much higher quality. Native function-calling and 100+-language coverage make it a strong edge and on-device choice.

What makes it unique

Apache-2.0; purpose-built for on-device/edge (the Gemma 3 4B variant runs in ~4.2 GB RAM); FunctionGemma 270M targets IoT function-calling.
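
Whether a model fits on a given device is mostly a matter of weight memory. The sketch below estimates it from parameter count and bits per weight; it covers weights only (no KV cache, activations or runtime overhead), the quantisation levels are assumptions rather than published configurations, and `weight_memory_gb` is an illustrative helper.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8

# Parameter counts from the text; bit widths are assumed quantisation levels.
for name, params_b in [("31B dense", 31), ("26B-A4B MoE, all experts resident", 26)]:
    for bits in (16, 8, 4):
        print(f"{name:36s} {bits:2d}-bit: {weight_memory_gb(params_b, bits):5.1f} GB")
```

At 16 bits the 31B dense model needs about 62 GB for weights, which fits within an 80 GB H100 with room left for the KV cache. For the MoE variant, compute per token follows the 4B active parameters, but all 26B weights generally still have to be held in memory.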

Latest products & successors

Gemma 4 released April 2026 under Apache-2.0, extending Google’s edge-AI push.

Considerations & trade-offs

Not a frontier model in absolute terms; its value is serving cost and edge deployment, not topping leaderboards.

26 Command A

Cohere · Canada · 2025–26

New

The grounded-RAG specialist built to cite rather than confabulate.

Type: Open-weight · Enterprise RAG

Context: 256K tokens

Pricing: Enterprise API; some weights under CC-BY-NC

Capabilities

Cohere’s enterprise-focused model optimised for retrieval-augmented generation and grounding — built to cite sources and minimise hallucination in business-document workflows. The specialist pick for grounded, tool-using enterprise RAG.

What makes it unique

Grounding-optimised with strong citation behaviour; Cohere also ships multilingual Aya models (Tiny Aya covers 70+ languages at 3.35B params).

Latest products & successors

Part of Cohere’s 2026 enterprise stack; check the CC-BY-NC licence terms before commercial deployment.

Considerations & trade-offs

Some weights are CC-BY-NC — check licensing before commercial use; a focused enterprise tool, not a general chat champion.

Part IV — The Image-Generation Frontier

27 Midjourney v7

Midjourney · USA · 2026

New (long lineage)

The enduring benchmark for artistic and editorial image quality.

Type: Image generation · Artistic

Context: n/a (text-to-image)

Pricing: Subscription tiers

Capabilities

Remains the gold standard for editorial and artistic image quality — particularly fashion, architecture and concept art. Prized for aesthetic coherence and ‘taste’ rather than raw photorealism, and now adds image-to-video animation (v8.1 animates a still into four short clips).

What makes it unique

Distinctive artistic house-style and community workflow; commercially-licensed outputs; the reference for concept-art quality.

Latest products & successors

Midjourney v7 leads image quality; v8.1 adds 5-second 480p/720p image-to-video.

Considerations & trade-offs

Less about photorealism than ‘taste’; subscription-gated workflow and a distinctive house style that suits some brands more than others.

28 Nano Banana Pro

Google · USA · 2026

New

The community favourite for stylised work inside the Google ecosystem.

Type: Image generation · Stylised + ecosystem

Context: n/a (text-to-image)

Pricing: Usage-based via Google

Capabilities

The creative community’s favourite for artistic and stylised outputs and the strongest pick for Google-ecosystem image work. Competes with ChatGPT Images 2.0 (photorealism), Claude Design (brand consistency) and Adobe Firefly across the 2026 image stack.

What makes it unique

Strong stylised control with deep Google integration; SynthID watermarking is embedded by default (survives cropping/compression).

Latest products & successors

Nano Banana Pro / Nano Banana 2 available across Google and third-party hubs (Flux AI, ElevenLabs platform).

Considerations & trade-offs

Best results lean on Google integration; for pure photorealism, ChatGPT Images 2.0 often leads.

Part V — The Video-Generation Frontier

29 Google Veo 3.1

Google DeepMind · USA · 2026

New

The cinematic video leader — and the first to make native audio standard.

Type: Video generation · Cinematic + audio

Context: n/a (text/image-to-video)

Pricing: Enterprise / Google Cloud; credit-based hubs

Capabilities

The most capable AI video generator in 2026 for photorealistic, cinematic scenes — best-in-class temporal consistency, prompt adherence, 4K landscape/portrait output and, crucially, integrated native audio that eliminates a whole post-production step.

What makes it unique

Native synchronised audio (analyses motion vectors to generate matching sound); ‘Ingredients-to-Video’ multi-reference control; SynthID watermarking standard.

Latest products & successors

Veo 3.1 (Quality / Fast / Lite) leads the field; the Gemini Omni line extends native video+audio generation across Google products.

Considerations & trade-offs

Enterprise/Google-Cloud-gated access; like all 2026 video, still struggles with hands, lip-sync and narrative coherence beyond ~60–90 seconds.

30 Seedance 2.0

ByteDance · China · 2026

New

The storytelling video specialist for multi-character, multi-shot scenes.

Type: Video generation · Storytelling + character

Context: n/a (text/image-to-video)

Pricing: Credit-based; competitive

Capabilities

The most capable model for cinematic storytelling with multiple characters and scene transitions, and a frequent ‘safe default’ for high-quality, affordable text/image-to-video. Especially strong at consistent character movement, drawing on ByteDance’s short-form video expertise. Competes with Kling 3.0, Sora 2 and Runway Gen-4.5.

What makes it unique

Multi-character scene transitions with native audio; integrated into unified creative platforms (e.g. ElevenLabs) alongside Veo, Sora and Kling.

Latest products & successors

Seedance 2.0 (and 2.0 Fast) was released in 2026. Note: OpenAI is discontinuing its competing Sora web/app (app on 26 Apr 2026; API on 24 Sept 2026), which reshapes the video field.

Considerations & trade-offs

Chinese-origin model with the usual governance considerations; quality ceiling sits just below Veo 3.1/Kling O3 for hero shots.

Comparative Landscape

All thirty models at a glance. New = current generation; Legacy = superseded but still deployed.

#	Model	Maker · Origin	Status	Best at
1	Claude Opus 4.8	Anthropic · USA	New	Reasoning + agentic
2	GPT-5.5	OpenAI · USA	New	Agentic + creative
3	Gemini 3.1 Pro	Google DeepMind · USA	New	Multimodal + reasoning
4	Grok 4.3	xAI · USA	New	Real-time + agentic
5	Claude Sonnet 4.6	Anthropic · USA	New	Balanced
6	GPT-5.3 Codex	OpenAI · USA	New	Code/agent
7	Claude Haiku 4.5	Anthropic · USA	New	Fast
8	Gemini 2.5 Flash	Google DeepMind · USA	Legacy (still widely used)	High-throughput
9	GPT-4o	OpenAI · USA	Legacy	Multimodal
10	Claude 3.5 Sonnet	Anthropic · USA	Legacy	Coding
11	OpenAI o1	OpenAI · USA	Legacy	Chain-of-thought
12	Claude Opus 4.5	Anthropic · USA	Recent legacy	Coding
13	DeepSeek V4 Pro	DeepSeek · China	New	Reasoning + coding
14	DeepSeek V4 Flash	DeepSeek · China	New	Cost-efficient long-context
15	Qwen 3.5 (397B-A17B)	Alibaba · China	New	Multilingual
16	Llama 4 Scout	Meta · USA	New	Ultra-long-context
17	Llama 4 Maverick	Meta · USA	New	General-purpose flagship
18	Kimi K2.6	Moonshot AI · China	New	Multimodal agentic
19	GLM-5.2	Z.ai (Zhipu / Tsinghua) · China	New	Long-horizon coding
20	Mistral Large 3	Mistral AI · France/EU	New	European frontier
21	Mistral Small 4	Mistral AI · France/EU	New	Efficient agentic coding
22	DeepSeek R1	DeepSeek · China	Legacy (landmark)	Reasoning
23	Qwen 3 235B-A22B	Alibaba · China	Legacy	Reasoning/coding
24	Llama 3.3 70B	Meta · USA	Legacy	General workhorse
25	Gemma 4 (31B / 26B-A4B)	Google · USA	New	Edge & on-device
26	Command A	Cohere · Canada	New	Enterprise RAG
27	Midjourney v7	Midjourney · USA	New (long lineage)	Artistic
28	Nano Banana Pro	Google · USA	New	Stylised + ecosystem
29	Google Veo 3.1	Google DeepMind · USA	New	Cinematic + audio
30	Seedance 2.0	ByteDance · China	New	Storytelling + character

Six Forces Reshaping the Market

1. Convergence at the top

By June 2026 the four US frontier models (Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro and Grok 4.3) are separated by fewer than nine points on the Artificial Analysis Intelligence Index (61.4, 60.2, 57 and 53 respectively). Raw capability is no longer the differentiator; workflow fit, ecosystem, latency and price are.

2. The agentic battleground

Agentic AI — models that plan and execute multi-step tasks autonomously, calling tools and controlling software — is the fastest-growing use case and the primary axis of competition. Native computer-use (GPT-5.5), persistent coding agents (Claude Code), and multi-agent orchestration (Kimi’s claimed 300 sub-agents) define the 2026 frontier. The Agentic AI Foundation (under the Linux Foundation, Dec 2025) and MCP’s 97M+ installs signal that agent infrastructure has become shared, standard plumbing.

3. Open-weight models went production-grade

Qwen 3.5, DeepSeek V4, GLM-5.2 and Llama 4 now match or beat proprietary models on key benchmarks while being self-hostable. The serious AI stack is now a portfolio: frontier closed models for the hardest work, cheap hosted open-weight APIs for volume, local models for privacy/fallback, and task-specific models for niches.

4. The cost floor collapsed

Self-hosted Llama pushes marginal per-token cost toward zero at scale; DeepSeek V4 Flash lists at ≈$0.14 input / $0.28 output per 1M tokens; Gemini 2.5 Flash at ≈$0.30 per 1M input tokens. Teams still using 2025 pricing assumptions are often overpaying by 2–5×. Prompt caching (up to 90% off cached input tokens on Claude) further reshapes economics.
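To make the arithmetic concrete, the sketch below prices one monthly workload under two price lists and then applies a caching discount. Only the DeepSeek V4 Flash figures come from the text above; the premium-tier price, the workload size and the 80% cache-hit share are hypothetical values chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


def monthly_cost(price, input_tokens, output_tokens,
                 cached_fraction=0.0, cache_discount=0.0):
    """Return USD cost; cached input is billed at (1 - cache_discount) of list price."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    billable_input = fresh + cached * (1.0 - cache_discount)
    return (billable_input * price.input_per_m
            + output_tokens * price.output_per_m) / 1_000_000


# List price quoted in the text above.
DEEPSEEK_V4_FLASH = Price(input_per_m=0.14, output_per_m=0.28)
# Hypothetical premium tier, NOT a quoted price; for illustration only.
HYPOTHETICAL_PREMIUM = Price(input_per_m=3.00, output_per_m=15.00)

# Hypothetical monthly workload.
INPUT_TOKENS = 500_000_000
OUTPUT_TOKENS = 100_000_000

if __name__ == "__main__":
    flash = monthly_cost(DEEPSEEK_V4_FLASH, INPUT_TOKENS, OUTPUT_TOKENS)
    premium = monthly_cost(HYPOTHETICAL_PREMIUM, INPUT_TOKENS, OUTPUT_TOKENS)
    cached = monthly_cost(HYPOTHETICAL_PREMIUM, INPUT_TOKENS, OUTPUT_TOKENS,
                          cached_fraction=0.8, cache_discount=0.9)
    print(f"flash:             ${flash:,.2f}")
    print(f"premium:           ${premium:,.2f}")
    print(f"premium + caching: ${cached:,.2f}")
```

Under these assumed numbers the workload costs about $98 at the Flash list price and $3,000 at the hypothetical premium price; caching trims the latter to about $1,920. Caching discounts only input tokens, so output-heavy workloads gain less from it than prompt-heavy ones.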

5. Compressed release cadence

Anthropic shipped Opus 4.5 → 4.6 → 4.7 → 4.8 between November 2025 and May 2026; OpenAI moved GPT-5.1 → 5.5 with 5.6 rumoured weeks later. Endpoint pinning and a standing evaluation framework are now operational necessities, not nice-to-haves.

6. Multimodal & creative convergence

Image and video generation matured into dependable production tools. Native audio in video (Veo 3.1, Seedance 2.0, Kling Omni) collapses post-production steps, and unified platforms (ElevenLabs, Flux AI, Adobe Firefly) stitch many models into single pipelines — shifting value from individual models to orchestrated workflows.

Selection & Routing Guidance

No single model wins everything. Send each task to the cheapest model that clears your quality bar; reserve frontier models for the hardest work.

Task	Recommended models
Sustained, complex coding & agents	Claude Opus 4.8 (+ Claude Code); GPT-5.3 Codex for fast CLI loops.
Best-value production coding	Claude Sonnet 4.6; open: DeepSeek V4 Pro / GLM-5.2.
Reasoning, maths & data analysis	Gemini 3.1 Pro; open: DeepSeek R1 / V4 Pro, Qwen 3.5.
Creative writing & tone	GPT-5.5.
Multimodal understanding	Gemini 3.1 Pro; open: Kimi K2.6.
Real-time / trend-aware	Grok 4.3.
Ultra-long context	Llama 4 Scout (10M tokens).
High-volume / cheap throughput	Gemini 2.5 Flash, Claude Haiku 4.5, self-hosted Llama / DeepSeek V4 Flash.
Edge / on-device	Gemma 4; Mistral Small 4.
Enterprise grounded RAG	Cohere Command A.
Image generation	ChatGPT Images 2.0 (photoreal), Midjourney v7 (art), Nano Banana Pro (stylised), Claude Design (brand).
Video generation	Veo 3.1 (cinematic + audio), Seedance 2.0 (multi-character).
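The rule of sending each task to the cheapest model that clears the quality bar can be sketched as a small router. This is a minimal illustration, not a production design: the candidate names, blended costs and scores are placeholders, and in practice the scores would come from an organisation's own evaluation set rather than public benchmarks.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    cost_per_m: float  # blended USD per 1M tokens (hypothetical)
    scores: dict = field(default_factory=dict)  # task -> score in [0, 1]


def route(task, candidates, quality_bar):
    """Pick the cheapest candidate whose eval score for `task` meets the bar.

    If nothing clears the bar, fall back to the highest-scoring candidate.
    """
    def score(c):
        return c.scores.get(task, 0.0)

    eligible = [c for c in candidates if score(c) >= quality_bar]
    if eligible:
        return min(eligible, key=lambda c: c.cost_per_m)
    return max(candidates, key=score)


# Placeholder portfolio; names, costs and scores are invented for illustration.
CANDIDATES = [
    Candidate("frontier-closed", 20.00, {"complex_coding": 0.92, "summarise": 0.95}),
    Candidate("mid-tier", 4.00, {"complex_coding": 0.81, "summarise": 0.93}),
    Candidate("open-weight-hosted", 0.30, {"complex_coding": 0.66, "summarise": 0.90}),
]

if __name__ == "__main__":
    for task, bar in [("summarise", 0.88), ("complex_coding", 0.90)]:
        print(task, "->", route(task, CANDIDATES, bar).name)
```

With these placeholder scores, summarisation at a 0.88 bar routes to the cheapest tier while complex coding at a 0.90 bar routes to the frontier tier. The fallback to the best scorer when nothing clears the bar is one possible policy; failing loudly is an equally reasonable choice.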

Operating principles

Pin production endpoints to explicit versions · keep a standing evaluation set · treat single-provider lock-in as a cost · use prompt caching and batch APIs aggressively · self-host open-weight models for sensitive data.
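Two of these principles, pinned endpoints and a standing evaluation set, lend themselves to a short sketch. The code below assumes a convention in which a pinned model ID ends in a date stamp; real providers use differing ID schemes, so the pattern, the example IDs and the 0.01 tolerance are all hypothetical.

```python
import re

# Assumed convention: a pinned ID ends in a date stamp such as
# 2026-05-01 or 20260501. Adjust to each provider's real scheme.
_DATE_SUFFIX = re.compile(r"\d{4}-?\d{2}-?\d{2}$")


def is_pinned(model_id):
    """True if the ID ends in a date stamp rather than a floating alias."""
    return _DATE_SUFFIX.search(model_id) is not None


def unpinned_roles(endpoints):
    """Return the sorted roles whose configured model ID is not pinned."""
    return sorted(role for role, model_id in endpoints.items()
                  if not is_pinned(model_id))


def should_promote(baseline, candidate, tolerance=0.01):
    """Gate a model switch on the standing evaluation set.

    Promote only if no task regresses by more than `tolerance`
    and at least one task improves. Missing tasks count as 0.0.
    """
    no_regression = all(candidate.get(task, 0.0) >= score - tolerance
                        for task, score in baseline.items())
    improves = any(candidate.get(task, 0.0) > score
                   for task, score in baseline.items())
    return no_regression and improves


# Hypothetical configuration; the IDs are placeholders, not real endpoints.
ENDPOINTS = {
    "coding": "example-frontier-2026-05-01",
    "bulk": "example-flash-latest",
}

if __name__ == "__main__":
    print("unpinned:", unpinned_roles(ENDPOINTS))
    baseline = {"coding": 0.80, "summarise": 0.90}
    print(should_promote(baseline, {"coding": 0.86, "summarise": 0.895}))
    print(should_promote(baseline, {"coding": 0.90, "summarise": 0.80}))
```

A check like `unpinned_roles` could run in CI to block deployments that reference floating aliases, and `should_promote` gives a repeatable answer to whether a newly shipped model should replace the current one.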

The Makers Behind the Models

Anthropic (USA)

Safety-first lab behind Claude; originated MCP and the deepest agentic tooling (Claude Code, Cowork). 2026 line spans Haiku → Sonnet → Opus → the new Mythos tier.

OpenAI (USA)

Triggered the consumer-AI era with ChatGPT; broadest product surface; pioneered reasoning models with o1. GPT-5.x leads creative writing and terminal workflows.

Google DeepMind (USA)

Gemini bets on native multimodality + ecosystem integration; leads reasoning and video (Veo); ships the most genuinely open Gemma edge models.

xAI (USA)

Differentiates on real-time X data and aggressive pricing; multi-agent Grok architecture; Grok 5 (~6T params) on the horizon.

Meta (USA)

Open-weight Llama seeded the open ecosystem and made self-hosting mainstream; class-leading 10M-token context in Scout.

DeepSeek (China)

Repeatedly disrupted frontier pricing (R1, then efficient V4 MoE at 1M context); MIT-licensed, self-host favourite.

Alibaba / Qwen (China)

Download leader of 2026 (>50% of open-model downloads); Apache-2.0 family across general, coding, multimodal and 200+ languages.

Moonshot, Z.ai & others (China)

Kimi (multimodal agentic), GLM (long-horizon coding), plus MiniMax and StepFun — a deep Chinese open bench.

Mistral AI (France/EU)

Europe’s frontier open lab; now Apache-2.0 across Large/Medium/Small — the sovereign, on-prem choice.

Cohere (Canada) & creative labs

Cohere specialises in grounded enterprise RAG; Midjourney, Black Forest Labs (FLUX), ByteDance (Seedance) and Kuaishou (Kling) define creative generation.

On the Horizon

The frontier captured here will have moved by the time it is read. As of June 2026, the most-watched imminent releases are:

GPT-5.6 (OpenAI)

Leaked in the Codex backend and canary-tested against live traffic; expected late June 2026. Reportedly rebuilt around a redesigned reward-audit pipeline after the GPT-5.5 persona-contamination issues.

Gemini 3.5 Pro (Google)

Trailed at Google I/O alongside the shipped 3.5 Flash; Pro promised “next month” as of late May 2026, aimed at the hard-reasoning ceiling.

Claude Mythos line (Anthropic)

New Mythos tier above Opus (Mythos 5 + safeguarded Fable 5, June 2026), with Opus 4.8 as a fallback safety layer. Access limited and partly export-restricted.

Grok 5 (xAI)

A rumoured ~6-trillion-parameter model targeted for the Q2 2026 window — roughly double Grok 4’s scale.

Conclusion

In 2024 the AI conversation was about a single best model. In 2026 it is about a portfolio. The frontier has converged so tightly that the marginal capability difference between the top closed models is smaller than the difference a good routing strategy makes — and open-weight models have made self-hosting frontier-grade intelligence a routine engineering decision.

The thirty models here are best understood not as competitors for one throne but as a toolkit. The organisations extracting the most value have built the discipline to match each task to the right tool, measure relentlessly, and switch without friction as the frontier moves, which in 2026 it does every few weeks.

The one durable skill

Model choice is now a configuration decision, not an engineering project. The durable advantage is the evaluation-and-routing discipline that lets an organisation absorb a new frontier model the week it ships — and drop it the week something better arrives.