Top AI Models March 2026
Comprehensive comparison of frontier AI models (March 2026): MMLU-Pro, MMLU, and GPQA benchmark scores for leading models including OpenAI, Claude, Gemini, Grok, and open-source LLMs. Updated performance rankings and capabilities assessment.
Text Model Leaderboards
HLE
Source: HLE (Humanity's Last Exam), Epoch AI. Updated 30/06/26.

| # | Model | Score | Developer (date) |
|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview (thinking high) | 46.4 | Google DeepMind (2026-02) |
| 2 | GPT-5.4 Pro | 44.3 | OpenAI (2026-03) |
| 3 | Muse Spark | 40.6 | Meta AI (2026-04) |
| 4 | gemini-3-pro-preview | 37.5 | Google DeepMind (2025-11) |
| 5 | gpt-5.4-2026-03-05 (xhigh thinking) | 36.2 | OpenAI (2026-03) |
| 6 | Claude Opus 4.7 | 36.2 | Anthropic (2026-04) |
| 7 | claude-opus-4-6-thinking-max | 34.4 | Anthropic (2026-02) |
| 8 | gpt-5-pro-2025-10-06 | 31.6 | OpenAI (2025-10) |
| 9 | gpt-5.2-2025-12-11 | 27.8 | OpenAI (2025-12) |
| 10 | gpt-5-2025-08-07 | 25.3 | OpenAI (2025-08) |
GPQA
Source: GPQA Diamond, Epoch AI. Updated 30/06/26.

| # | Model | Score | Developer (date) |
|---|---|---|---|
| 1 | gpt-5.4-pro | 94.6 | OpenAI (2026-03) |
| 2 | gemini-3.1-pro-preview | 94.1 | Google DeepMind (2026-02) |
| 3 | gpt-5.5-pre-release | 94 | OpenAI (2026-04) |
| 4 | gpt-5.5-pro-pre-release | 93.9 | OpenAI (2026-04) |
| 5 | gpt-5.4 | 93.3 | OpenAI (2026-03) |
| 6 | gemini-3.5-flash | 92.8 | Google (2026-05) |
| 7 | gemini-3-pro-preview | 92.6 | Google DeepMind (2025-11) |
| 8 | glm-5.2 | 91.9 | Z.ai (Zhipu AI) (2026-06) |
| 9 | qwen3.7- | 91.6 | Alibaba (2026-05) |
| 10 | gpt-5.2 | 91.4 | OpenAI (2025-12) |
Reasoning (LiveBench)
Avg reasoning score (Zebra Puzzle, Spatial, Logic Navigation, Connections, Consecutive Events), LiveBench 2026-01-08 (30/06/26).

| # | Model | Score |
|---|---|---|
| 1 | claude-opus-4-8-xhigh-effort | 93.4 |
| 2 | claude-opus-4-8-high-effort | 93.4 |
| 3 | claude-fable-5-xhigh-effort | 92.8 |
| 4 | claude-fable-5-high-effort | 92.4 |
| 5 | gpt-5.5-xhigh | 92.1 |
| 6 | claude-opus-4-8-medium-effort | 91.6 |
| 7 | claude-sonnet-4-6-thinking-auto-medium-effort | 91.6 |
| 8 | gpt-5.5-high | 90.4 |
| 9 | claude-opus-4-7-xhigh-effort | 90.4 |
| 10 | claude-sonnet-4-6-thinking-auto-high-effort | 90.1 |
Programming (LiveBench)
Avg coding score (Code Generation, Code Completion, Python, JS, TS), LiveBench 2026-01-08 (30/06/26).

| # | Model | Score |
|---|---|---|
| 1 | glm-5.2 | 75.9 |
| 2 | gpt-5.4-xhigh | 73 |
| 3 | kimi-k2.7-code | 71.6 |
| 4 | gpt-5.3-codex-xhigh | 71 |
| 5 | claude-sonnet-5-xhigh-effort | 70.4 |
| 6 | claude-opus-4-7-high-effort | 70.3 |
| 7 | claude-opus-4-5-20251101-thinking-64k-high-effort | 69.9 |
| 8 | gemini-3.1-pro-preview-high | 69.6 |
| 9 | claude-opus-4-5-20251101-medium-effort | 69.4 |
| 10 | claude-opus-4-7-xhigh-effort | 68.8 |
Sources: livebench.ai/table_2026_01_08.csv (Reasoning, Programming), epoch.ai/data/benchmark_data.zip → gpqa_diamond.csv (GPQA), epoch.ai/data/benchmark_data.zip → hle_external.csv (HLE) | Fetched: 6/30/2026
AI Image Generation Models
The Fréchet Inception Distance (FID) score is a key metric for evaluating AI image generation quality, where lower scores indicate better performance. Below are comprehensive benchmarks across multiple metrics including CLIP Score, FID, F1, Precision, and Recall.
CLIP Score
Measures how closely a generated image matches its text prompt
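| # | Model | Developer | CLIP Score |
|---|---|---|---|
| 1 | Photon | Luma Labs | 0.265 |
| 2 | Flux Pro | Black Forest Labs | 0.263 |
| 3 | Dall-E 3 | OpenAI | 0.259 |
| 4 | Nano Banana | Google Gemini | 0.258 |
| 5 | Runway Gen 4 | Runway AI | 0.251 |
| 6 | Ideogram V3 | Ideogram | 0.250 |
| 7 | Stability SD Turbo | Stability AI | 0.249 |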
FID Score
Assesses how close AI-generated images are to real images (lower is better)
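| # | Model | Developer | FID Score |
|---|---|---|---|
| 1 | Ideogram V3 | Ideogram | 305.600 |
| 2 | Dall-E 3 | OpenAI | 306.080 |
| 3 | Runway Gen 4 | Runway AI | 317.520 |
| 4 | Photon | Luma Labs | 318.550 |
| 5 | Flux Pro | Black Forest Labs | 318.630 |
| 6 | Nano Banana | Google Gemini | 318.800 |
| 7 | Stability SD Turbo | Stability AI | 321.750 |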
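
For reference, FID in its standard form fits a Gaussian to the Inception features of the real images and another to those of the generated images, then measures the Fréchet distance between the two. The symbols below follow the usual convention in the literature and are not taken from this page: $\mu_r, \Sigma_r$ are the mean and covariance of the real-image features, and $\mu_g, \Sigma_g$ those of the generated-image features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Both terms are non-negative and vanish when the two distributions have identical means and covariances, which is why lower scores are better.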
F1 Score
Combines precision and recall to show overall image accuracy
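| # | Model | Developer | F1 Score |
|---|---|---|---|
| 1 | Photon | Luma Labs | 0.463 |
| 2 | Stability SD Turbo | Stability AI | 0.447 |
| 3 | Runway Gen 4 | Runway AI | 0.445 |
| 4 | Flux Pro | Black Forest Labs | 0.421 |
| 5 | Ideogram V3 | Ideogram | 0.415 |
| 6 | Dall-E 3 | OpenAI | 0.380 |
| 7 | Nano Banana | Google Gemini | 0.351 |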
Precision
Measures how many AI-generated images came out correct out of the total generated
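| # | Model | Developer | Precision |
|---|---|---|---|
| 1 | Photon | Luma Labs | 0.448 |
| 2 | Stability SD Turbo | Stability AI | 0.432 |
| 3 | Runway Gen 4 | Runway AI | 0.423 |
| 4 | Flux Pro | Black Forest Labs | 0.406 |
| 5 | Ideogram V3 | Ideogram | 0.397 |
| 6 | Dall-E 3 | OpenAI | 0.358 |
| 7 | Nano Banana | Google Gemini | 0.339 |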
Recall
Measures how many correct images the AI produced vs all possible correct images
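| # | Model | Developer | Recall |
|---|---|---|---|
| 1 | Stability SD Turbo | Stability AI | 0.533 |
| 2 | Photon | Luma Labs | 0.532 |
| 3 | Runway Gen 4 | Runway AI | 0.522 |
| 4 | Ideogram V3 | Ideogram | 0.497 |
| 5 | Flux Pro | Black Forest Labs | 0.495 |
| 6 | Dall-E 3 | OpenAI | 0.477 |
| 7 | Nano Banana | Google Gemini | 0.415 |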
Source: dreamlayer.io/research | Fetched: 6/20/2026
AI Agent Terminal Benchmarks
Terminal-Bench 2.1 evaluates AI coding agents on real terminal tasks: file system operations, shell scripting, build systems, and debugging in live environments. Unlike static question-answer benchmarks, agents must autonomously complete multi-step tasks with actual tool execution.
AI Agent Terminal Benchmark
Terminal-Bench 2.1: measures AI agent performance on real terminal tasks
| # | Agent | Model | Accuracy |
|---|---|---|---|
| 1 | Codex CLI | GPT-5.5 | 83.4% ± 2.2 |
| 2 | Claude Code | Claude 5 Fable | 83.1% ± 2 |
| 3 | Terminus 2 | Claude 5 Fable | 80.4% ± 2.3 |
| 4 | Claude Code | Claude Opus 4.8 | 78.9% ± 2.5 |
| 5 | Terminus 2 | GPT-5.5 | 78.2% ± 2.4 |
| 6 | Terminus 2 | Claude Opus 4.8 | 74.6% ± 2.4 |
| 7 | Terminus 2 | Gemini 3 Pro | 74.4% ± 2.6 |
| 8 | Gemini CLI | Gemini 3.1 Pro | 70.7% ± 2.9 |
| 9 | Terminus 2 | Gemini 3.1 Pro | 70.3% ± 2.9 |
| 10 | Claude Code | Claude Opus 4.7 | 69.7% ± 2.7 |
| 11 | Gemini CLI | Gemini 3 Pro | 66.3% ± 2.7 |
| 12 | Terminus 2 | Claude Opus 4.7 | 66.1% ± 2.7 |
| 13 | Claude Code | GLM 5.1 | 58.7% ± 2.4 |
Source: tbench.ai | Fetched: Jun 27, 2026
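
Because each accuracy is reported with a ± margin, adjacent ranks can be closer than the ordering suggests. The sketch below checks whether two results' ranges intersect. It assumes the ± value is a symmetric margin around the reported accuracy (the source does not say whether it is a standard error or a confidence interval), and the helper name `intervals_overlap` is hypothetical.

```python
def intervals_overlap(acc_a, margin_a, acc_b, margin_b):
    """Return True if the ranges [acc - margin, acc + margin] intersect.

    Assumption: the reported +/- value is a symmetric margin around
    the accuracy; the leaderboard itself does not define it.
    """
    low_a, high_a = acc_a - margin_a, acc_a + margin_a
    low_b, high_b = acc_b - margin_b, acc_b + margin_b
    return low_a <= high_b and low_b <= high_a


# Rows 1 and 2 of the table: 83.4 +/- 2.2 vs 83.1 +/- 2
print(intervals_overlap(83.4, 2.2, 83.1, 2.0))  # True

# Rows 1 and 13 of the table: 83.4 +/- 2.2 vs 58.7 +/- 2.4
print(intervals_overlap(83.4, 2.2, 58.7, 2.4))  # False
```

Overlapping ranges only indicate that a ranking difference may not be meaningful; they are not a formal significance test.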
AI Model Specifications
| Model | Size | Training Data | AGI Level | Access |
|---|---|---|---|---|
| o1 | 200B | 20T | Level 3 | Access |
| o1-preview | 200B | 20T | Level 3 | Access |
| DeepSeek-R1 | 685B | 14.8T | Level 3 | Access |
| Claude 3.5 Sonnet (new) | 175B | 20T | Level 2 | Access |
| Gemini 2.0 Flash exp | 30B | 30T | Level 2 | Access |
| Claude 3.5 Sonnet | 70B | 15T | Level 2 | Access |
| Gemini-1.5-Pro-002 | 1500B | 30T | Level 2 | Access |
| MiniMax-Text-01 | 456B | 7.2T | Level 2 | Access |
| Grok-2 | 400B | 15T | Level 2 | Access |
| Llama 3.1 405B | 405B | 15.6T | Level 2 | Access |
| Sonus-1 Reasoning | 405B | 15T | Level 2 | Access |
| GPT-4o | 200B | 20T | Level 2 | Access |
| InternVL 2.5 | 78B | 18.12T | Level 2 | Access |
| Qwen2.5 | 72B | 18T | Level 2 | Access |
When you see "13B (Size) on 5.6T tokens (Training Data)", it means:
- 13B: 13 billion parameters (think of these as the AI's "brain cells")
- 5.6T: 5.6 trillion tokens of training data (each token ≈ 4 characters)
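
The arithmetic behind these two columns can be sketched as follows. The function name `describe_model` is hypothetical, and the default of 4 characters per token is the rough rule of thumb quoted above, not an exact figure.

```python
def describe_model(params_billions, tokens_trillions, chars_per_token=4):
    """Expand the 'Size' and 'Training Data' shorthand into raw counts.

    chars_per_token=4 is the approximation from the text; real
    tokenizers vary by language and content.
    """
    params = params_billions * 1e9
    tokens = tokens_trillions * 1e12
    return {
        "parameters": params,
        "training_tokens": tokens,
        "approx_characters": tokens * chars_per_token,
        "tokens_per_parameter": tokens / params,
    }


info = describe_model(13, 5.6)
print(f"{info['approx_characters']:.3g} characters")  # 2.24e+13 characters
print(f"{info['tokens_per_parameter']:.0f} tokens per parameter")  # 431 tokens per parameter
```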
These models represent the current state-of-the-art in AI language technology (General Purpose Frontier Models).
Performance Milestones
As of Q1 2025, the theoretical performance ceilings were:
- GPQA: 74%
- MMLU: 90%
- HLE: 20%
These ceilings were notably surpassed:
- OpenAI's o3 achieved 87.7% on GPQA
- OpenAI's o1 model surpassed both benchmarks in Q3 2025[¹]
In Q1 2025, these ceilings were surpassed with OpenAI's o1 model[¹].
Access & Details
For detailed information on each model, including:
- Technical specifications
- Use cases
- Access procedures
- Deployment guidelines
Please refer to our Models Access page.
Note: Performance metrics and rankings are based on publicly available data and may evolve as new models and evaluations emerge.
Understanding the Benchmarks
Text Model Benchmarks
- HLE (Humanity's Last Exam): Designed as the most difficult closed-ended academic exam for AI. It aims to rigorously test models at the frontier of human knowledge, since benchmarks like MMLU are becoming too easy (~90%+ scores for top models). It consists of 2,500 questions across >100 subjects from ~1000 experts. Top models scored ~20% as of Q1 2025; the HLE leaderboard above now lists a top score of 46.4, which still leaves a gap to human expert level.
- MMLU-Pro: Advanced version of MMLU focusing on expert-level knowledge. Currently considered the most reliable indicator of model capabilities.
- MMLU: Tests knowledge across 57 subjects with a 90% theoretical ceiling and 9% error rate.
- GPQA: PhD-level science benchmark across biology, chemistry, physics, and astronomy. Has a 74% ceiling, with 20% error rate. Notable: even scientists only agree on ~78% of answers.
Image Generation Benchmarks
- CLIP Score: Measures how closely a generated image matches its text prompt. Higher scores indicate better text-to-image alignment.
- FID Score: Assesses how close AI-generated images are to real images by comparing feature distributions. Lower scores are better.
- F1 Score: Combines precision and recall to show overall image generation accuracy. Balances false positives and false negatives.
- Precision: Measures how many AI-generated images came out correct compared to the total number of images generated.
- Recall: Measures how many of the correct images the AI was able to produce out of all the possible correct images it could have generated.
For more details on specific image generation models like Nano Banana, see our dedicated model pages.
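
In its standard form, the F1 score described above is the harmonic mean of precision and recall. The sketch below shows that standard formula; how dreamlayer.io aggregates its scores is not stated on this page, so this is an assumption rather than their method. An F1 averaged per image can be lower than the harmonic mean of the averaged precision and recall, which may explain why the published F1 values sit slightly below what this function returns for the published precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Photon's published precision (0.448) and recall (0.532)
print(round(f1_score(0.448, 0.532), 3))  # 0.486
```

The harmonic mean penalizes imbalance: a model with high recall but low precision (or the reverse) scores closer to the weaker of the two.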
Mathematics Competition Benchmarks
AIME25, USAMO25, and HMMT25 refer to the 2025 editions of three prestigious American mathematics competitions for high school students.
AIME25 (American Invitational Mathematics Examination): An intermediate competition for students who excel on the AMC 10/12 exams. It features 15 complex problems with integer answers, and top scorers may advance to the USAMO.
USAMO25 (United States of America Mathematical Olympiad): The premier national math olympiad in the US. It is a highly selective, proof-based exam for the top performers from the AIME. The USAMO is a key step in selecting the U.S. team for the International Mathematical Olympiad (IMO).
HMMT25 (Harvard-MIT Mathematics Tournament): A challenging and popular competition run by students from Harvard and MIT. It occurs twice a year (February at MIT, November at Harvard) and includes a mix of individual and team-based rounds, attracting top students from around the world.
These competitions, along with others, have recently been used as benchmarks to test the capabilities of advanced AI models.
[1]: AI Research Community. "Language Model Leaderboard." Google Sheets, 2025. https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/
