Top AI Models July 2025
Comprehensive comparison of frontier AI models (July 2025): HLE, LiveBench reasoning and programming, MMLU, and GPQA benchmark scores for leading models from OpenAI, Anthropic (Claude), Google (Gemini), xAI (Grok), and open-source LLMs, with updated performance rankings and capabilities assessment.
🏆 Model Leaderboards
HLE
Rank | Model | Score |
---|---|---|
1 | Grok 4 | 44.4 |
2 | GPT-5 | 42 |
3 | o3 | 24.9 |
4 | Gemini 2.5 Pro 06-05 | 21.6 |
5 | gpt-oss-120b | 19 |
6 | Gemini 2.5 Pro Preview | 18.8 |
7 | DeepSeek-R1-0528 | 17.7 |
8 | gpt-oss-20b | 17.3 |
9 | Agentic-Tx | 14.5 |
10 | GLM-4.5 | 14.4 |
Performance on the HLE (Humanity's Last Exam) benchmark (source: scale.com, data via lifearchitect.ai/models-table). 07/08/25
Reasoning (LiveBench)
Rank | Model | Score |
---|---|---|
1 | grok-4-0709 | 97.8 |
2 | claude-4-sonnet-20250514-thinking-64k | 95.3 |
3 | o3-2025-04-16-high | 94.7 |
4 | o3-pro-2025-06-10-high | 94.7 |
5 | gemini-2.5-pro-preview-06-05-highthinking | 94.3 |
6 | gemini-2.5-pro-preview-06-05-default | 93.7 |
7 | claude-4-1-opus-20250805-thinking-32k | 93.2 |
8 | qwen3-235b-a22b-thinking-2507 | 91.6 |
9 | deepseek-r1-0528 | 91.1 |
10 | o3-2025-04-16-medium | 91 |
Average performance on reasoning tasks (Web of Lies v2, Zebra Puzzle, Spatial) from LiveBench. 07/08/25
Programming (LiveBench)
Rank | Model | Score |
---|---|---|
1 | o3-2025-04-16-high | 40.8 |
2 | o4-mini-2025-04-16-high | 40.8 |
3 | chatgpt-4o-latest-2025-03-27 | 39.4 |
4 | o4-mini-2025-04-16-medium | 39.4 |
5 | o3-2025-04-16-medium | 38.7 |
6 | o3-pro-2025-06-10-high | 38.7 |
7 | grok-4-0709 | 38.7 |
8 | claude-3-5-sonnet-20241022 | 38 |
9 | claude-4-sonnet-20250514-base | 38 |
10 | deepseek-r1 | 38 |
Average performance on programming tasks (Code Generation, Coding Completion) from LiveBench. 07/08/25
MMLU
Rank | Model | Score |
---|---|---|
1 | Qwen3-235B-A22B-Thinking-2507 | 93.8 |
2 | DeepSeek-R1-0528 | 93.4 |
3 | Qwen3-235B-A22B-Instruct-2507 | 93.1 |
4 | EXAONE 4.0 | 92.3 |
5 | o1 | 92.3 |
6 | o1-preview | 92.3 |
7 | o1-2024-12-17 | 91.8 |
8 | Pangu Ultra MoE | 91.5 |
9 | o3 | 91.2 |
10 | Cogito 70B | 91 |
Performance on the MMLU benchmark (via lifearchitect.ai/models-table). 07/08/25
GPQA
Rank | Model | Score |
---|---|---|
1 | GPT-5 | 89.4 |
2 | Grok 4 | 88.9 |
3 | o3-preview | 87.7 |
4 | Gemini 2.5 Pro 06-05 | 86.4 |
5 | Claude 3.7 Sonnet | 84.8 |
6 | Grok-3 | 84.6 |
7 | Gemini 2.5 Pro Preview | 84 |
8 | Claude Opus 4 | 83.3 |
9 | o3 | 83.3 |
10 | o4-mini | 81.4 |
Performance on the GPQA benchmark (via lifearchitect.ai/models-table). 07/08/25
Sources: livebench.ai (Reasoning, Programming), lifearchitect.ai/models-table (MMLU, GPQA), scale.com (HLE) | Fetched: 8/7/2025
Note: SOTA models now achieve 72.5% on SWE-bench and 43.2% on Terminal-bench. Full benchmark details coming soon.
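For readers who want to work with these rankings directly, here is a minimal Python sketch of one way the rows above could be represented and cross-referenced. The `Entry` dataclass, the helper name, and the small subset of rows copied in are illustrative assumptions; this is not code from livebench.ai or the other sources.

```python
# Minimal sketch: hold a few leaderboard rows in a simple structure and
# cross-reference them. Only a handful of rows from the tables above are
# copied in; the Entry dataclass is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class Entry:
    model: str
    score: float

HLE_TOP = [
    Entry("Grok 4", 44.4),
    Entry("GPT-5", 42.0),
    Entry("o3", 24.9),
    Entry("Gemini 2.5 Pro 06-05", 21.6),
]

GPQA_TOP = [
    Entry("GPT-5", 89.4),
    Entry("Grok 4", 88.9),
    Entry("o3-preview", 87.7),
    Entry("Gemini 2.5 Pro 06-05", 86.4),
]

def rank(entries: list[Entry]) -> list[Entry]:
    """Sort entries by score, highest first."""
    return sorted(entries, key=lambda e: e.score, reverse=True)

# Models appearing in both of the (truncated) top lists above
shared = {e.model for e in HLE_TOP} & {e.model for e in GPQA_TOP}
print(sorted(shared))  # ['GPT-5', 'Gemini 2.5 Pro 06-05', 'Grok 4']
```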
AI Model Specifications
Model | Size (parameters) | Training Data (tokens) | AGI Level | Access |
---|---|---|---|---|
o1 | 200B | 20T | Level 3 | Access |
o1-preview | 200B | 20T | Level 3 | Access |
DeepSeek-R1 | 685B | 14.8T | Level 3 | Access |
Claude 3.5 Sonnet (new) | 175B | 20T | Level 2 | Access |
Gemini 2.0 Flash exp | 30B | 30T | Level 2 | Access |
Claude 3.5 Sonnet | 70B | 15T | Level 2 | Access |
Gemini-1.5-Pro-002 | 1500B | 30T | Level 2 | Access |
MiniMax-Text-01 | 456B | 7.2T | Level 2 | Access |
Grok-2 | 400B | 15T | Level 2 | Access |
Llama 3.1 405B | 405B | 15.6T | Level 2 | Access |
Sonus-1 Reasoning | 405B | 15T | Level 2 | Access |
GPT-4o | 200B | 20T | Level 2 | Access |
InternVL 2.5 | 78B | 18.12T | Level 2 | Access |
Qwen2.5 | 72B | 18T | Level 2 | Access |
When you see "13B (Size) on 5.6T tokens (Training Data)", it means the following (see the sketch after this list):
- 13B: 13 billion parameters (think of these as the AI's "brain cells")
- 5.6T: 5.6 trillion tokens of training data (each token ≈ 4 characters)
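As a rough illustration, the sketch below expands a spec like "13B on 5.6T tokens" using the ≈4-characters-per-token rule of thumb above. The function name and output format are illustrative, and the character count is only an order-of-magnitude estimate.

```python
# Back-of-the-envelope reading of "13B (Size) on 5.6T tokens (Training Data)".
# Assumes roughly 4 characters per token, as noted above; purely illustrative.
CHARS_PER_TOKEN = 4

def describe(params_billions: float, tokens_trillions: float) -> str:
    approx_chars = tokens_trillions * 1e12 * CHARS_PER_TOKEN
    return (f"{params_billions:g}B parameters trained on "
            f"{tokens_trillions:g}T tokens (~{approx_chars:.1e} characters of text)")

print(describe(13, 5.6))
# -> 13B parameters trained on 5.6T tokens (~2.2e+13 characters of text)
```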
📊 Understanding the Benchmarks
- HLE (Humanity's Last Exam): Designed as the most difficult closed-ended academic exam for AI. Aims to rigorously test models at the frontier of human knowledge, as benchmarks like MMLU are becoming too easy (~90%+ scores for top models). Consists of 2,500 questions across >100 subjects from ~1000 experts. Current top models score ~20%, highlighting the gap to human expert level.
- MMLU-Pro: Advanced version of MMLU focusing on expert-level knowledge. Currently considered the most reliable indicator of model capabilities.
- MMLU: Tests knowledge across 57 subjects with a 90% theoretical ceiling and 9% error rate.
- GPQA: PhD-level science benchmark across biology, chemistry, physics, and astronomy. Has a 74% ceiling, with 20% error rate. Notable: even scientists only agree on ~78% of answers.
These models represent the current state-of-the-art in AI language technology (General Purpose Frontier Models).
Performance Milestones
As of Q1 2025, the theoretical performance ceilings were:
- GPQA: 74%
- MMLU: 90%
- HLE: 20%
These ceilings have since been surpassed: o3-preview achieved 87.7% on GPQA, and OpenAI's o1 model exceeded both the GPQA and MMLU ceilings in Q1 2025[¹].
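Because the ceilings and leaderboard scores are all simple percentages, the margin by which a model clears a ceiling is just a subtraction. The short sketch below shows that arithmetic with two scores taken from the tables above; the helper name is an illustrative assumption.

```python
# Margin of a reported score over the stated Q1 2025 ceiling, in percentage points.
# Ceilings and scores come from the text above; the helper name is illustrative.
CEILINGS = {"GPQA": 74.0, "MMLU": 90.0, "HLE": 20.0}

def margin_over_ceiling(benchmark: str, score: float) -> float:
    return round(score - CEILINGS[benchmark], 1)

print(margin_over_ceiling("GPQA", 87.7))  # o3-preview: 13.7 points above the ceiling
print(margin_over_ceiling("MMLU", 92.3))  # o1: 2.3 points above the ceiling
```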
Access & Details
For detailed information on each model, including:
- Technical specifications
- Use cases
- Access procedures
- Deployment guidelines
Please refer to our Models Access page.
Note: Performance metrics and rankings are based on publicly available data and may evolve as new models and evaluations emerge.
Mathematics Competition Benchmarks
AIME25, USAMO25, and HMMT25 are prestigious American high school mathematics competitions held in 2025.
AIME25 (American Invitational Mathematics Examination): An intermediate competition for students who excel on the AMC 10/12 exams. It features 15 complex problems with integer answers, and top scorers may advance to the USAMO.
USAMO25 (United States of America Mathematical Olympiad): The premier national math olympiad in the US. It is a highly selective, proof-based exam for the top performers from the AIME. The USAMO is a key step in selecting the U.S. team for the International Mathematical Olympiad (IMO).
HMMT25 (Harvard-MIT Mathematics Tournament): A challenging and popular competition run by students from Harvard and MIT. It occurs twice a year (February at MIT, November at Harvard) and includes a mix of individual and team-based rounds, attracting top students from around the world.
These competitions, along with others, have recently been used as benchmarks to test the capabilities of advanced AI models.
[1]: AI Research Community. "Language Model Leaderboard." Google Sheets, 2025. https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/