Top AI Models June 2025

Comprehensive comparison of frontier AI models (June 2025): HLE, LiveBench (Reasoning and Programming), MMLU, and GPQA benchmark scores for leading models from OpenAI, Anthropic (Claude), Google (Gemini), xAI (Grok), and open-source LLMs. Updated performance rankings and capabilities assessment.

⚠️
Models that perform well on standard tests such as HLE can perform terribly on other tests.

🏆 Model Leaderboards

HLE

  1. o3 – 24.9
  2. Gemini 2.5 Pro 06-05 – 21.6
  3. Gemini 2.5 Pro Preview – 18.8
  4. DeepSeek-R1-0528 – 17.7
  5. Agentic-Tx – 14.5
  6. o4-mini – 14.3
  7. o3-mini – 14.0
  8. Gemini 2.5 Flash Preview – 12.1
  9. Claude 3.7 Sonnet – 8.9
  10. o1-2024-12-17 – 8.8

Reasoning (LiveBench)

  1. claude-4-sonnet-20250514-thinking-64k – 95.3
  2. o3-2025-04-16-high – 93.3
  3. deepseek-r1-0528 – 91.1
  4. o3-2025-04-16-medium – 91.0
  5. claude-4-opus-20250514-thinking-32k – 90.5
  6. gemini-2.5-pro-preview-05-06 – 88.3
  7. o4-mini-2025-04-16-high – 88.1
  8. grok-3-mini-beta-high – 87.6
  9. gemini-2.5-pro-preview-03-25 – 87.5
  10. qwen3-32b-thinking – 83.1

Programming (LiveBench)

  1. o3-2025-04-16-high – 40.8
  2. o4-mini-2025-04-16-high – 40.8
  3. chatgpt-4o-latest-2025-03-27 – 39.4
  4. o4-mini-2025-04-16-medium – 39.4
  5. o3-2025-04-16-medium – 38.7
  6. claude-3-5-sonnet-20241022 – 38.0
  7. deepseek-r1 – 38.0
  8. gpt-4.5-preview-2025-02-27 – 38.0
  9. claude-3-7-sonnet-20250219-base – 37.3
  10. claude-3-7-sonnet-20250219-thinking-64k – 37.3

MMLU

  1. DeepSeek-R1-0528 – 93.4
  2. o1 – 92.3
  3. o1-preview – 92.3
  4. o1-2024-12-17 – 91.8
  5. Pangu Ultra MoE – 91.5
  6. o3 – 91.2
  7. DeepSeek-R1 – 90.8
  8. R1 1776 – 90.5
  9. Claude 3.5 Sonnet (new) – 90.5
  10. GPT-4.1 – 90.2

GPQA

  1. o3-preview – 87.7
  2. Gemini 2.5 Pro 06-05 – 86.4
  3. Claude 3.7 Sonnet – 84.8
  4. Grok-3 – 84.6
  5. Gemini 2.5 Pro Preview – 84.0
  6. Claude Opus 4 – 83.3
  7. o3 – 83.3
  8. o4-mini – 81.4
  9. DeepSeek-R1-0528 – 81.0
  10. o1 – 79.0

Sources: livebench.ai (Reasoning, Programming), lifearchitect.ai/models-table (MMLU, GPQA), scale.com (HLE) | Fetched: 6/11/2025

Note: State-of-the-art models now reach 72.5% on SWE-bench and 43.2% on Terminal-bench. Full benchmark details coming soon.

AI Model Specifications

Model                   | Size  | Training Data | AGI Level | Access
o1                      | 200B  | 20T           | Level 3   | Access
o1-preview              | 200B  | 20T           | Level 3   | Access
DeepSeek-R1             | 685B  | 14.8T         | Level 3   | Access
Claude 3.5 Sonnet (new) | 175B  | 20T           | Level 2   | Access
Gemini 2.0 Flash exp    | 30B   | 30T           | Level 2   | Access
Claude 3.5 Sonnet       | 70B   | 15T           | Level 2   | Access
Gemini-1.5-Pro-002      | 1500B | 30T           | Level 2   | Access
MiniMax-Text-01         | 456B  | 7.2T          | Level 2   | Access
Grok-2                  | 400B  | 15T           | Level 2   | Access
Llama 3.1 405B          | 405B  | 15.6T         | Level 2   | Access
Sonus-1 Reasoning       | 405B  | 15T           | Level 2   | Access
GPT-4o                  | 200B  | 20T           | Level 2   | Access
InternVL 2.5            | 78B   | 18.12T        | Level 2   | Access
Qwen2.5                 | 72B   | 18T           | Level 2   | Access

When you see "13B (Size) on 5.6T tokens (Training Data)", it means the following (a short worked example appears after this list):

  • 13B: 13 billion parameters (think of these as the AI's "brain cells")
  • 5.6T: 5.6 trillion tokens of training data (each token ≈ 4 characters)
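
As a rough worked example of that notation, here is a minimal Python sketch. It assumes only the ~4-characters-per-token rule of thumb above and two rows from the specifications table; the tokens-per-parameter ratio is a derived convenience figure for comparison, not an official specification (and parameter counts for closed models are public estimates).

```python
# Rough arithmetic for reading the specifications table.
# Assumes ~4 characters per token, as noted above; sizes for closed
# models are estimates rather than confirmed figures.

def describe(model: str, params_b: float, tokens_t: float) -> str:
    chars = tokens_t * 1e12 * 4                          # approx. characters of training text
    tokens_per_param = (tokens_t * 1e12) / (params_b * 1e9)
    return (f"{model}: {params_b:g}B params, {tokens_t:g}T tokens "
            f"(~{chars:.1e} characters), ~{tokens_per_param:.0f} tokens/parameter")

# Example rows taken from the table above
print(describe("DeepSeek-R1", 685, 14.8))     # ~22 tokens per parameter
print(describe("Llama 3.1 405B", 405, 15.6))  # ~39 tokens per parameter
```

Running it gives roughly 22 tokens per parameter for DeepSeek-R1 and 39 for Llama 3.1 405B, a quick way to compare how data-heavy different training runs were.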

📊 Understanding the Benchmarks

  • HLE (Humanity's Last Exam): Designed as the most difficult closed-ended academic exam for AI. Aims to rigorously test models at the frontier of human knowledge, as benchmarks like MMLU are becoming too easy (~90%+ scores for top models). Consists of 2,500 questions across >100 subjects from ~1000 experts. Current top models score ~20%, highlighting the gap to human expert level.
  • MMLU-Pro: Advanced version of MMLU focusing on expert-level knowledge. Currently considered the most reliable indicator of model capabilities.
  • MMLU: Tests knowledge across 57 subjects with a 90% theoretical ceiling and 9% error rate.
  • GPQA: PhD-level science benchmark across biology, chemistry, physics, and astronomy. Has a 74% ceiling, with 20% error rate. Notable: even scientists only agree on ~78% of answers.

These models represent the current state-of-the-art in AI language technology (General Purpose Frontier Models).

Performance Milestones

As of Q1 2025, the theoretical performance ceilings were:

  • GPQA: 74%
  • MMLU: 90%
  • HLE: 20%

These ceilings were notably surpassed:

  • o3-preview achieved 87.7% on GPQA
  • OpenAI's o1 surpassed both the MMLU and GPQA ceilings in Q1 2025[¹] (a short sketch below works through the arithmetic)
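
As a minimal sketch of that arithmetic, using only the ceiling values and scores quoted in this article (expressing a score as a percentage of the quoted ceiling is an illustrative convention here, not a standard metric):

```python
# Express a raw benchmark score as a fraction of the ceiling quoted above.
# Anything over 100% means the quoted ceiling has been exceeded.
CEILINGS = {"MMLU": 90.0, "GPQA": 74.0, "HLE": 20.0}  # Q1 2025 figures quoted above

def percent_of_ceiling(benchmark: str, score: float) -> float:
    return 100.0 * score / CEILINGS[benchmark]

print(f"o1 on MMLU:         {percent_of_ceiling('MMLU', 92.3):.0f}% of ceiling")  # ~103%
print(f"o3-preview on GPQA: {percent_of_ceiling('GPQA', 87.7):.0f}% of ceiling")  # ~119%
print(f"o3 on HLE:          {percent_of_ceiling('HLE', 24.9):.0f}% of ceiling")   # ~124%
```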

Access & Details

For detailed information on each model, including:

  • Technical specifications
  • Use cases
  • Access procedures
  • Deployment guidelines

Please refer to our Models Access page.

Note: Performance metrics and rankings are based on publicly available data and may evolve as new models and evaluations emerge.


[1]: AI Research Community. "Language Model Leaderboard." Google Sheets, 2025. https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/
