Top AI Models June 2025
Comprehensive comparison of frontier AI models (June 2025): HLE, LiveBench (reasoning and programming), MMLU, and GPQA benchmark scores for leading models from OpenAI, Anthropic (Claude), Google (Gemini), xAI (Grok), and the open-source community. Updated performance rankings and capabilities assessment.
🏆 Model Leaderboards
HLE
| Rank | Model | Score |
|---|---|---|
| 1 | o3 | 24.9 |
| 2 | Gemini 2.5 Pro 06-05 | 21.6 |
| 3 | Gemini 2.5 Pro Preview | 18.8 |
| 4 | DeepSeek-R1-0528 | 17.7 |
| 5 | Agentic-Tx | 14.5 |
| 6 | o4-mini | 14.3 |
| 7 | o3-mini | 14.0 |
| 8 | Gemini 2.5 Flash Preview | 12.1 |
| 9 | Claude 3.7 Sonnet | 8.9 |
| 10 | o1-2024-12-17 | 8.8 |

Performance on the HLE (Humanity's Last Exam) benchmark (source: scale.com, data via lifearchitect.ai/models-table). Data as of 11 June 2025.
Reasoning (LiveBench)
| Rank | Model | Score |
|---|---|---|
| 1 | claude-4-sonnet-20250514-thinking-64k | 95.3 |
| 2 | o3-2025-04-16-high | 93.3 |
| 3 | deepseek-r1-0528 | 91.1 |
| 4 | o3-2025-04-16-medium | 91.0 |
| 5 | claude-4-opus-20250514-thinking-32k | 90.5 |
| 6 | gemini-2.5-pro-preview-05-06 | 88.3 |
| 7 | o4-mini-2025-04-16-high | 88.1 |
| 8 | grok-3-mini-beta-high | 87.6 |
| 9 | gemini-2.5-pro-preview-03-25 | 87.5 |
| 10 | qwen3-32b-thinking | 83.1 |

Average performance on LiveBench reasoning tasks (Web of Lies v2, Zebra Puzzle, Spatial). Data as of 11 June 2025.
Programming (LiveBench)
| Rank | Model | Score |
|---|---|---|
| 1 | o3-2025-04-16-high | 40.8 |
| 2 | o4-mini-2025-04-16-high | 40.8 |
| 3 | chatgpt-4o-latest-2025-03-27 | 39.4 |
| 4 | o4-mini-2025-04-16-medium | 39.4 |
| 5 | o3-2025-04-16-medium | 38.7 |
| 6 | claude-3-5-sonnet-20241022 | 38.0 |
| 7 | deepseek-r1 | 38.0 |
| 8 | gpt-4.5-preview-2025-02-27 | 38.0 |
| 9 | claude-3-7-sonnet-20250219-base | 37.3 |
| 10 | claude-3-7-sonnet-20250219-thinking-64k | 37.3 |

Average performance on LiveBench programming tasks (Code Generation, Coding Completion). Data as of 11 June 2025.
MMLU
| Rank | Model | Score |
|---|---|---|
| 1 | DeepSeek-R1-0528 | 93.4 |
| 2 | o1 | 92.3 |
| 3 | o1-preview | 92.3 |
| 4 | o1-2024-12-17 | 91.8 |
| 5 | Pangu Ultra MoE | 91.5 |
| 6 | o3 | 91.2 |
| 7 | DeepSeek-R1 | 90.8 |
| 8 | R1 1776 | 90.5 |
| 9 | Claude 3.5 Sonnet (new) | 90.5 |
| 10 | GPT-4.1 | 90.2 |

Performance on the MMLU benchmark (via lifearchitect.ai/models-table). Data as of 11 June 2025.
GPQA
| Rank | Model | Score |
|---|---|---|
| 1 | o3-preview | 87.7 |
| 2 | Gemini 2.5 Pro 06-05 | 86.4 |
| 3 | Claude 3.7 Sonnet | 84.8 |
| 4 | Grok-3 | 84.6 |
| 5 | Gemini 2.5 Pro Preview | 84.0 |
| 6 | Claude Opus 4 | 83.3 |
| 7 | o3 | 83.3 |
| 8 | o4-mini | 81.4 |
| 9 | DeepSeek-R1-0528 | 81.0 |
| 10 | o1 | 79.0 |

Performance on the GPQA benchmark (via lifearchitect.ai/models-table). Data as of 11 June 2025.
Sources: livebench.ai (Reasoning, Programming), lifearchitect.ai/models-table (MMLU, GPQA), scale.com (HLE) | Fetched: 11 June 2025
Note: state-of-the-art models are now reaching 72.5% on SWE-bench and 43.2% on Terminal-bench. Full benchmark details coming soon.
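The LiveBench "Reasoning" and "Programming" scores above are averages over the listed subtasks. Below is a minimal sketch of that aggregation in Python, using hypothetical per-task scores (the real per-task breakdown is published on livebench.ai); the equal-weight averaging is an assumption based on the table notes above, not an official LiveBench formula.

```python
from statistics import mean

# Hypothetical per-task LiveBench scores for a single model -- illustrative
# values only; the real per-task numbers live on livebench.ai.
reasoning_tasks = {
    "web_of_lies_v2": 96.0,
    "zebra_puzzle": 95.0,
    "spatial": 94.9,
}
programming_tasks = {
    "code_generation": 42.0,
    "coding_completion": 39.6,
}

# Category score = unweighted mean of the subtask scores (assumed equal weights).
reasoning_score = mean(reasoning_tasks.values())
programming_score = mean(programming_tasks.values())

print(f"Reasoning (LiveBench):   {reasoning_score:.1f}")
print(f"Programming (LiveBench): {programming_score:.1f}")
```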
AI Model Specifications
| Model | Size (parameters) | Training Data (tokens) | AGI Level |
|---|---|---|---|
| o1 | 200B | 20T | Level 3 |
| o1-preview | 200B | 20T | Level 3 |
| DeepSeek-R1 | 685B | 14.8T | Level 3 |
| Claude 3.5 Sonnet (new) | 175B | 20T | Level 2 |
| Gemini 2.0 Flash exp | 30B | 30T | Level 2 |
| Claude 3.5 Sonnet | 70B | 15T | Level 2 |
| Gemini-1.5-Pro-002 | 1500B | 30T | Level 2 |
| MiniMax-Text-01 | 456B | 7.2T | Level 2 |
| Grok-2 | 400B | 15T | Level 2 |
| Llama 3.1 405B | 405B | 15.6T | Level 2 |
| Sonus-1 Reasoning | 405B | 15T | Level 2 |
| GPT-4o | 200B | 20T | Level 2 |
| InternVL 2.5 | 78B | 18.12T | Level 2 |
| Qwen2.5 | 72B | 18T | Level 2 |
When you see "13B (Size) on 5.6T tokens (Training Data)", it means:
- 13B: 13 billion parameters (think of these as the AI's "brain cells")
- 5.6T: 5.6 trillion tokens of training data (each token ≈ 4 characters)
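As a rough aid to reading those two numbers, the sketch below converts a token count into an approximate character count using the ≈4 characters/token rule of thumb above, and reports the tokens-per-parameter ratio. It is a back-of-the-envelope reading of the specifications table, not an official calculation; the function name is purely illustrative.

```python
CHARS_PER_TOKEN = 4  # rule of thumb quoted above: each token is roughly 4 characters

def describe_spec(params_billion: float, tokens_trillion: float) -> None:
    """Print a rough interpretation of a 'size on training data' spec line."""
    params = params_billion * 1e9
    tokens = tokens_trillion * 1e12
    approx_chars = tokens * CHARS_PER_TOKEN
    tokens_per_param = tokens / params
    print(f"{params_billion:g}B params on {tokens_trillion:g}T tokens -> "
          f"~{approx_chars / 1e12:.1f}T characters of text, "
          f"~{tokens_per_param:.0f} tokens per parameter")

describe_spec(13, 5.6)     # the example from the text above
describe_spec(405, 15.6)   # the Llama 3.1 405B row from the specifications table
```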
📊 Understanding the Benchmarks
- HLE (Humanity's Last Exam): Designed as the most difficult closed-ended academic exam for AI. It aims to rigorously test models at the frontier of human knowledge, since benchmarks like MMLU have become too easy (top models score ~90%+). It consists of 2,500 questions across more than 100 subjects, contributed by roughly 1,000 experts. Current top models score only around 20-25%, highlighting the gap to human expert level.
- MMLU-Pro: An advanced version of MMLU focused on expert-level, reasoning-heavy questions; often regarded as one of the more reliable indicators of current model capability.
- MMLU: Tests knowledge across 57 subjects. An estimated ~9% of questions contain errors, which puts its practical ceiling at roughly 90%.
- GPQA: PhD-level science benchmark across biology, chemistry, physics, and astronomy, with a practical ceiling of about 74% and an estimated ~20% question error rate. Notably, even scientists only agree on ~78% of the answers.
These models represent the current state-of-the-art in AI language technology (General Purpose Frontier Models).
Performance Milestones
As of Q1 2025, the theoretical performance ceilings were:
- GPQA: 74%
- MMLU: 90%
- HLE: 20%
These ceilings have since been notably surpassed:
- OpenAI's o3-preview achieved 87.7% on GPQA
- OpenAI's o1 model surpassed both the MMLU and GPQA ceilings in Q1 2025[¹]
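One quick way to read these milestones is to express each score as a fraction of its benchmark's stated ceiling. The sketch below does that using only numbers already quoted on this page (the Q1 2025 ceilings plus leaderboard scores); it is illustrative arithmetic, not an official normalization.

```python
# Ceilings as listed above (Q1 2025) and a few scores from the leaderboards on this page.
CEILINGS = {"MMLU": 90.0, "GPQA": 74.0, "HLE": 20.0}

RESULTS = [
    ("o1", "MMLU", 92.3),
    ("o1", "GPQA", 79.0),
    ("o3-preview", "GPQA", 87.7),
    ("o3", "HLE", 24.9),
]

for model, benchmark, score in RESULTS:
    ceiling = CEILINGS[benchmark]
    ratio = score / ceiling
    print(f"{model:>10} on {benchmark}: {score:.1f} -> "
          f"{ratio:.0%} of the {ceiling:.0f}% ceiling")
```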
Access & Details
For detailed information on each model, including:
- Technical specifications
- Use cases
- Access procedures
- Deployment guidelines
please refer to our Models Access page.
Note: Performance metrics and rankings are based on publicly available data and may evolve as new models and evaluations emerge.
[1]: AI Research Community. "Language Model Leaderboard." Google Sheets, 2025. https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/