Top AI Models July 2025

Comprehensive comparison of frontier AI models (July 2025): MMLU-Pro, MMLU, and GPQA benchmark scores for leading models including OpenAI, Claude, Gemini, Grok, and open-source LLMs. Updated performance rankings and capabilities assessment.

⚠️ Models that score well on standard benchmarks such as HLE can still perform poorly on other tests.

🏆 Model Leaderboards

HLE

Performance on the HLE (Humanity's Last Exam) benchmark (source: scale.com, data via lifearchitect.ai/models-table). 06/11/25

Rank	Model	Score
1	Grok 4	44.4
2	GPT-5	42
3	MiniMax-M2	31.8
4	GLM-4.6	30.4
5	DeepSeek-V3.1-Base	29.8
6	o3	24.9
7	DeepSeek-V3.1-Terminus	21.7
8	Gemini 2.5 Pro 06-05	21.6
9	Grok 4 Fast	20
10	gpt-oss-120b	19

Reasoning (LiveBench)

Average performance on reasoning tasks (Web of Lies v2, Zebra Puzzle, Spatial) from LiveBench. 06/11/25

Rank	Model	Score
1	gpt-5-codex	98.7
2	gpt-5-2025-08-07-high	98.2
3	grok-4-0709	97.8
4	gpt-5-pro-2025-10-06	96.7
5	gpt-5-2025-08-07	96.6
6	grok-4-fast-reasoning	95.4
7	claude-sonnet-4-5-20250929-thinking-64k	95.3
8	claude-4-sonnet-20250514-thinking-64k	95.3
9	gemini-2.5-pro-06-05-highthinking	94.3
10	claude-4-1-opus-20250805-thinking-32k	93.2

Programming (LiveBench)

Average performance on programming tasks (Code Generation, Coding Completion) from LiveBench. 06/11/25

Rank	Model	Score
1	claude-sonnet-4-5-20250929-thinking-64k	40.1
2	gpt-5-2025-08-07-high	40.1
3	claude-4-sonnet-20250514-base	39.4
4	claude-4-sonnet-20250514-thinking-64k	39.4
5	gpt-5-chat-latest	39.4
6	grok-4-0709	39.4
7	grok-code-fast-1-0825	39.4
8	gemini-2.5-pro-06-05-highthinking	38.7
9	claude-3-7-sonnet-20250219-base	38
10	claude-3-7-sonnet-20250219-thinking-64k	38

MMLU

Performance on MMLU benchmark (via lifearchitect.ai/models-table). 06/11/25

Rank	Model	Score
1	Qwen3-235B-A22B-Thinking-2507	93.8
2	DeepSeek-V3.1-Base	93.7
3	DeepSeek-R1-0528	93.4
4	Qwen3-235B-A22B-Instruct-2507	93.1
5	EXAONE 4.0	92.3
6	o1	92.3
7	o1-preview	92.3
8	o1-2024-12-17	91.8
9	Pangu Ultra MoE	91.5
10	o3	91.2

GPQA

Performance on GPQA benchmark (via lifearchitect.ai/models-table). 06/11/25

Rank	Model	Score
1	GPT-5	89.4
2	Grok 4	88.9
3	o3-preview	87.7
4	Gemini 2.5 Pro 06-05	86.4
5	Grok 4 Fast	85.7
6	Qwen3-Max	85.4
7	Claude 3.7 Sonnet	84.8
8	Grok-3	84.6
9	Gemini 2.5 Pro Preview	84
10	Claude Sonnet 4.5	83.4

Sources: livebench.ai (Reasoning, Programming), lifearchitect.ai/models-table (MMLU, GPQA), scale.com (HLE) | Fetched: 11/6/2025
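
The warning at the top of this page (a strong score on one benchmark does not guarantee the same standing on another) can be checked against the leaderboards themselves. The sketch below is illustrative only: it copies the HLE and GPQA scores of the four models that appear under the same name in both lists, and the helper names `rank` and `rank_shifts` are made up for this example.

```python
# Scores copied from the HLE and GPQA leaderboards on this page.
# Only models listed under the same name in both leaderboards are included.
hle = {"Grok 4": 44.4, "GPT-5": 42.0, "Gemini 2.5 Pro 06-05": 21.6, "Grok 4 Fast": 20.0}
gpqa = {"GPT-5": 89.4, "Grok 4": 88.9, "Gemini 2.5 Pro 06-05": 86.4, "Grok 4 Fast": 85.7}


def rank(scores):
    """Map each model to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i for i, model in enumerate(ordered, start=1)}


def rank_shifts(a, b):
    """Return {model: (rank_in_a, rank_in_b)} for models present in both."""
    shared = set(a) & set(b)
    ra = rank({m: a[m] for m in shared})
    rb = rank({m: b[m] for m in shared})
    return {m: (ra[m], rb[m]) for m in sorted(shared, key=ra.get)}


shifts = rank_shifts(hle, gpqa)
for model, (r_hle, r_gpqa) in shifts.items():
    print(f"{model}: HLE #{r_hle}, GPQA #{r_gpqa}")
```

Ranks here are relative to the shared subset, not to the full top-10 lists. Even in this small sample, the first and second places swap between the two benchmarks.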

Note: SOTA models now score 72.5% on SWE-bench and 43.2% on Terminal-bench. Full benchmark details coming soon.

AI Model Specifications

Model	Size	Training Data	AGI Level	Access
o1	200B	20T	Level 3	Access
o1-preview	200B	20T	Level 3	Access
DeepSeek-R1	685B	14.8T	Level 3	Access
Claude 3.5 Sonnet (new)	175B	20T	Level 2	Access
Gemini 2.0 Flash exp	30B	30T	Level 2	Access
Claude 3.5 Sonnet	70B	15T	Level 2	Access
Gemini-1.5-Pro-002	1500B	30T	Level 2	Access
MiniMax-Text-01	456B	7.2T	Level 2	Access
Grok-2	400B	15T	Level 2	Access
Llama 3.1 405B	405B	15.6T	Level 2	Access
Sonus-1 Reasoning	405B	15T	Level 2	Access
GPT-4o	200B	20T	Level 2	Access
InternVL 2.5	78B	18.12T	Level 2	Access
Qwen2.5	72B	18T	Level 2	Access

When you see "13B (Size) on 5.6T tokens (Training Data)", it means:

13B: 13 billion parameters (think of these as the AI's "brain cells")
5.6T: 5.6 trillion tokens of training data (each token ≈ 4 characters)
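
To make the notation concrete, here is a minimal sketch that turns the "13B on 5.6T" example into plain numbers. The helper names (`parse_count`, `describe`) are hypothetical, and the character count relies on the rough 4-characters-per-token estimate quoted above, so treat the output as an approximation.

```python
CHARS_PER_TOKEN = 4  # rough estimate quoted above, not an exact figure

UNITS = {"B": 1e9, "T": 1e12}  # billions and trillions


def parse_count(text):
    """Convert a string such as '13B' or '5.6T' into a number."""
    return float(text[:-1]) * UNITS[text[-1]]


def describe(size, training_data):
    """Summarize a model from its Size and Training Data columns."""
    params = parse_count(size)
    tokens = parse_count(training_data)
    return {
        "parameters": params,
        "tokens": tokens,
        "approx_characters": tokens * CHARS_PER_TOKEN,
        "tokens_per_parameter": tokens / params,
    }


example = describe("13B", "5.6T")
print(f"{example['tokens_per_parameter']:.0f} training tokens per parameter")
print(f"~{example['approx_characters']:.2e} characters of training text")
```

The same function can be applied to any row of the specifications table, for example `describe("405B", "15.6T")` for Llama 3.1 405B.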

📊 Understanding the Benchmarks

HLE (Humanity's Last Exam): Designed as the most difficult closed-ended academic exam for AI. It aims to rigorously test models at the frontier of human knowledge, because benchmarks like MMLU have become too easy (top models score ~90% or more). It consists of 2,500 questions across more than 100 subjects, contributed by ~1,000 experts. Early top models scored around 20%; the HLE leaderboard above now peaks at 44.4, which still highlights the gap to human expert level.

MMLU-Pro: A harder version of MMLU that focuses on expert-level knowledge. It is often regarded as one of the more reliable indicators of model capability.

MMLU: Tests knowledge across 57 subjects. An estimated 9% of its questions contain errors, which gives a theoretical ceiling of about 90%.

GPQA: PhD-level science benchmark covering biology, chemistry, physics, and astronomy. It has an estimated 74% ceiling and a 20% error rate. Notably, even scientists agree on only ~78% of answers.

These models represent the current state-of-the-art in AI language technology (General Purpose Frontier Models).

Performance Milestones

As of Q1 2025, the theoretical performance ceilings were:

GPQA: 74%
MMLU: 90%
HLE: 20%

These ceilings were notably surpassed:

o3-preview achieved 87.7% on GPQA
OpenAI's o1 model surpassed both the GPQA and MMLU ceilings[¹]
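
As a rough illustration of how far current scores sit above those ceilings, the sketch below subtracts each ceiling listed above from the top entry of the matching leaderboard on this page. The dictionary layout and the function name `margin_over_ceiling` are assumptions made for this example.

```python
# Ceilings from the list above; top scores from this page's leaderboards.
ceilings = {"GPQA": 74.0, "MMLU": 90.0, "HLE": 20.0}
top_scores = {
    "GPQA": ("GPT-5", 89.4),
    "MMLU": ("Qwen3-235B-A22B-Thinking-2507", 93.8),
    "HLE": ("Grok 4", 44.4),
}


def margin_over_ceiling(benchmark):
    """Return (model, points above the ceiling) for the top-ranked model."""
    model, score = top_scores[benchmark]
    return model, round(score - ceilings[benchmark], 1)


for name in ceilings:
    model, margin = margin_over_ceiling(name)
    print(f"{name}: {model} is {margin} points above the {ceilings[name]:.0f}% ceiling")
```

A positive margin on all three benchmarks is consistent with the statement that the ceilings have been surpassed.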

Access & Details

For detailed information on each model, including:

Technical specifications
Use cases
Access procedures
Deployment guidelines

Please refer to our Models Access page.

Note: Performance metrics and rankings are based on publicly available data and may evolve as new models and evaluations emerge.

Mathematics Competition Benchmarks

AIME25, USAMO25, and HMMT25 refer to the 2025 editions of three prestigious American high school mathematics competitions.

AIME25 (American Invitational Mathematics Examination): An intermediate competition for students who excel on the AMC 10/12 exams. It features 15 complex problems with integer answers, and top scorers may advance to the USAMO.

USAMO25 (United States of America Mathematical Olympiad): The premier national math olympiad in the US. It is a highly selective, proof-based exam for the top performers from the AIME. The USAMO is a key step in selecting the U.S. team for the International Mathematical Olympiad (IMO).

HMMT25 (Harvard-MIT Mathematics Tournament): A challenging and popular competition run by students from Harvard and MIT. It occurs twice a year (February at MIT, November at Harvard) and includes a mix of individual and team-based rounds, attracting top students from around the world.

These competitions, along with others, have recently been used as benchmarks to test the capabilities of advanced AI models.
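
Because every AIME answer is an integer (from 0 to 999), grading a model on AIME problems can be reduced to exact-match comparison. The following is a minimal sketch of that idea, not the method used by any particular leaderboard: the functions `extract_answer` and `aime_score` are hypothetical, and the sample outputs and answer key are invented placeholders rather than real AIME data. Proof-based exams such as the USAMO cannot be graded this way.

```python
import re


def extract_answer(text):
    """Return the last standalone 1-3 digit integer in the text, or None."""
    matches = re.findall(r"(?<!\d)\d{1,3}(?!\d)", text)
    return int(matches[-1]) if matches else None


def aime_score(model_outputs, answer_key):
    """Count exact matches between extracted answers and the key."""
    return sum(
        extract_answer(output) == answer
        for output, answer in zip(model_outputs, answer_key)
    )


# Invented placeholder data, for illustration only.
outputs = ["The answer is 042.", "I get 7 first, then 113", "no idea"]
key = [42, 113, 5]
score = aime_score(outputs, key)
print(f"{score} of {len(key)} correct")
```

On a full exam the answer key would have 15 entries, so the score ranges from 0 to 15.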

[¹]: AI Research Community. "Language Model Leaderboard." Google Sheets, 2025. https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/