Top AI Models Q1 2025
Comprehensive comparison of frontier AI models (Q1 2025): MMLU-Pro, MMLU, and GPQA benchmark scores for leading models from OpenAI, Anthropic (Claude), Google (Gemini), xAI (Grok), and open-source providers. Updated performance rankings and capabilities assessment.
Model Leaderboards
Reasoning
Average performance on reasoning tasks (Web of Lies v2, Zebra Puzzle, Spatial) from LiveBench.
Rank | Model | Score |
---|---|---|
1 | o3-2025-04-16-medium | 62.7 |
2 | o3-2025-04-16-high | 62.3 |
3 | o4-mini-2025-04-16-high | 61.0 |
4 | o1-2024-12-17-high | 58.3 |
5 | gemini-2.5-pro-exp-03-25 | 57.1 |
6 | grok-3-mini-beta-high | 56.5 |
7 | o3-mini-2025-01-31-high | 56.3 |
8 | o4-mini-2025-04-16-medium | 55.9 |
9 | claude-3-7-sonnet-20250219-thinking-64k | 54.5 |
10 | o3-mini-2025-01-31-medium | 53.7 |
Programming
Average performance on programming tasks (LCB Generation, Coding Completion) from LiveBench.
Rank | Model | Score |
---|---|---|
1 | o4-mini-2025-04-16-high | 74.3 |
2 | o3-2025-04-16-high | 73.3 |
3 | o3-2025-04-16-medium | 72.6 |
4 | o3-mini-2025-01-31-high | 65.5 |
5 | o4-mini-2025-04-16-medium | 61.8 |
6 | o3-mini-2025-01-31-medium | 58.4 |
7 | gemini-2.5-flash-preview-04-17 | 58.4 |
8 | gemini-2.5-pro-exp-03-25 | 58.1 |
9 | o1-2024-12-17-high | 57.1 |
10 | deepseek-r1-distill-qwen-32b | 52.3 |
MMLU
Performance on the MMLU benchmark (via lifearchitect.ai aggregation).
Rank | Model | Score |
---|---|---|
1 | o1 | 92.3 |
2 | o3 | 91.2 |
3 | DeepSeek-R1 | 90.8 |
4 | Claude 3.5 Sonnet (new) | 90.5 |
5 | R1 1776 | 90.5 |
6 | GPT-4.1 | 90.2 |
7 | Sonus-1 Reasoning | 90.2 |
8 | Hunyuan-Large | 89.9 |
9 | GPT-4.5 | 89.6 |
10 | Hunyuan Turbo S | 89.5 |
GPQA
Performance on the GPQA benchmark (via lifearchitect.ai aggregation).
Rank | Model | Score |
---|---|---|
1 | o3-preview | 87.7 |
2 | Claude 3.7 Sonnet | 84.8 |
3 | Grok-3 | 84.6 |
4 | Gemini 2.5 Pro | 84.0 |
5 | o3 | 83.3 |
6 | o4-mini | 81.4 |
7 | o1 | 79.0 |
8 | Gemini 2.5 Flash Preview | 78.3 |
9 | Seed-Thinking-v1.5 | 77.3 |
10 | o3-mini | 77.0 |
HLE
Performance on the HLE benchmark (via lifearchitect.ai aggregation).
Rank | Model | Score |
---|---|---|
1 | o3 | 24.9 |
2 | Gemini 2.5 Pro | 18.8 |
3 | Agentic-Tx | 14.5 |
4 | o4-mini | 14.3 |
5 | o3-mini | 14.0 |
6 | Gemini 2.5 Flash Preview | 12.1 |
7 | Claude 3.7 Sonnet | 8.9 |
8 | o1 | 8.8 |
9 | DeepSeek-R1 | 8.6 |
10 | R1 1776 | 8.6 |
Sources: livebench.ai (Reasoning, Programming), lifearchitect.ai/models-table (MMLU, GPQA), scale.com (HLE) | Fetched: 4/27/2025
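The Reasoning and Programming figures above are plain averages of the listed LiveBench sub-task scores. The sketch below illustrates that aggregation; the per-task numbers are invented for the example, not actual LiveBench results.

```python
# Minimal sketch of LiveBench-style category scores: each category score is
# the unweighted mean of its per-task scores. Task names match the categories
# above; the numbers are made up purely for illustration.
from statistics import mean

task_scores = {
    "Reasoning": {"Web of Lies v2": 65.0, "Zebra Puzzle": 58.5, "Spatial": 64.3},
    "Programming": {"LCB Generation": 78.0, "Coding Completion": 70.0},
}

category_scores = {
    category: round(mean(scores.values()), 1)
    for category, scores in task_scores.items()
}

print(category_scores)
# e.g. {'Reasoning': 62.6, 'Programming': 74.0}
```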
AI Model Specifications
Model | Size (parameters) | Training Data (tokens) | AGI Level |
---|---|---|---|
o1 | 200B | 20T | Level 3 |
o1-preview | 200B | 20T | Level 3 |
DeepSeek-R1 | 685B | 14.8T | Level 3 |
Claude 3.5 Sonnet (new) | 175B | 20T | Level 2 |
Gemini 2.0 Flash exp | 30B | 30T | Level 2 |
Claude 3.5 Sonnet | 70B | 15T | Level 2 |
Gemini-1.5-Pro-002 | 1500B | 30T | Level 2 |
MiniMax-Text-01 | 456B | 7.2T | Level 2 |
Grok-2 | 400B | 15T | Level 2 |
Llama 3.1 405B | 405B | 15.6T | Level 2 |
Sonus-1 Reasoning | 405B | 15T | Level 2 |
GPT-4o | 200B | 20T | Level 2 |
InternVL 2.5 | 78B | 18.12T | Level 2 |
Qwen2.5 | 72B | 18T | Level 2 |
When you see "13B (Size) on 5.6T tokens (Training Data)", it means:
- 13B: 13 billion parameters (think of these as the AI's "brain cells")
- 5.6T: 5.6 trillion tokens of training data (each token ≈ 4 characters); the quick arithmetic after this list makes the scale concrete.
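As a rough sanity check on what such figures imply, the snippet below takes the hypothetical 13B / 5.6T example and the ~4 characters-per-token rule of thumb from above and converts them into total characters and tokens seen per parameter. These are back-of-the-envelope numbers, not properties of any specific model.

```python
# Rough figures implied by a "13B on 5.6T tokens" entry. The chars-per-token
# estimate comes from the note above; everything else is simple arithmetic.
params = 13e9            # 13B parameters
tokens = 5.6e12          # 5.6T training tokens
chars_per_token = 4      # rough average for English text

total_chars = tokens * chars_per_token   # ~22.4 trillion characters
tokens_per_param = tokens / params       # ~430 tokens seen per parameter

print(f"~{total_chars / 1e12:.1f}T characters of training text")
print(f"~{tokens_per_param:.0f} training tokens per parameter")
```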
Understanding the Benchmarks
- HLE (Humanity's Last Exam): Designed as the most difficult closed-ended academic exam for AI. Aims to rigorously test models at the frontier of human knowledge, as benchmarks like MMLU are becoming too easy (~90%+ scores for top models). Consists of 2,500 questions across >100 subjects from ~1000 experts. Current top models score ~20%, highlighting the gap to human expert level.
- MMLU-Pro: Advanced version of MMLU focusing on expert-level knowledge. Currently considered the most reliable indicator of model capabilities.
- MMLU: Tests knowledge across 57 subjects; with an estimated 9% question error rate, its practical ceiling is around 90%.
- GPQA: PhD-level science benchmark across biology, chemistry, physics, and astronomy. It has a practical ceiling of about 74% and a ~20% question error rate; notably, even domain experts agree on only ~78% of answers. A rough illustration of how error rates relate to ceilings follows this list.
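One crude way to see where such ceilings come from: if a fraction of questions is flawed or mislabeled, even a model that answers every sound question correctly still loses those points. The sketch below applies that simplification only; it roughly matches the quoted ~90% MMLU ceiling, while the quoted 74% GPQA ceiling also reflects expert disagreement and therefore sits below this naive estimate.

```python
# Naive ceiling estimate: treat every flawed question as automatically wrong.
# This is a simplification, not the methodology the benchmark authors use.
def naive_ceiling(error_rate: float) -> float:
    """Best achievable score if all flawed questions are scored as incorrect."""
    return 1.0 - error_rate

print(f"MMLU: ~{naive_ceiling(0.09):.0%}")   # ~91%, close to the quoted 90% ceiling
print(f"GPQA: ~{naive_ceiling(0.20):.0%}")   # ~80% naive vs. the quoted 74% ceiling
```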
These models represent the current state-of-the-art in AI language technology (General Purpose Frontier Models).
Performance Milestones
As of Q1 2025, the practical performance ceilings were considered to be:
- GPQA: 74%
- MMLU: 90%
- HLE: ~20% (the approximate score of the best models at the time, rather than a hard ceiling)
These ceilings were notably surpassed:
- OpenAI's o3-preview achieved 87.7% on GPQA
- OpenAI's o1 surpassed both the MMLU and GPQA ceilings in Q1 2025[1]
Access & Details
For detailed information on each model, including:
- Technical specifications
- Use cases
- Access procedures
- Deployment guidelines
Please refer to our Models Access page.
Note: Performance metrics and rankings are based on publicly available data and may evolve as new models and evaluations emerge.
[1]: AI Research Community. "Language Model Leaderboard." Google Sheets, 2025. https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/
Related Links
New: Claude 3.7 Released!
Claude 3.7 Sonnet, the first hybrid reasoning model, combines quick responses and deep reflection capabilities. With extended thinking mode and improved coding abilities, it represents a significant advancement in AI technology. Learn how to access Claude 3.7 and Claude Code →