Top AI Models Q1 2025

Comprehensive comparison of frontier AI models (Q1 2025): reasoning, programming, MMLU, GPQA, and HLE benchmark scores for leading models from OpenAI, Anthropic (Claude), Google (Gemini), xAI (Grok), and open-source LLMs. Updated performance rankings and capabilities assessment.

๐Ÿ† Model Leaderboards

Reasoning

  1. o3-2025-04-16-medium: 62.7
  2. o3-2025-04-16-high: 62.3
  3. o4-mini-2025-04-16-high: 61
  4. o1-2024-12-17-high: 58.3
  5. gemini-2.5-pro-exp-03-25: 57.1
  6. grok-3-mini-beta-high: 56.5
  7. o3-mini-2025-01-31-high: 56.3
  8. o4-mini-2025-04-16-medium: 55.9
  9. claude-3-7-sonnet-20250219-thinking-64k: 54.5
  10. o3-mini-2025-01-31-medium: 53.7

Programming

  1. o4-mini-2025-04-16-high: 74.3
  2. o3-2025-04-16-high: 73.3
  3. o3-2025-04-16-medium: 72.6
  4. o3-mini-2025-01-31-high: 65.5
  5. o4-mini-2025-04-16-medium: 61.8
  6. o3-mini-2025-01-31-medium: 58.4
  7. gemini-2.5-flash-preview-04-17: 58.4
  8. gemini-2.5-pro-exp-03-25: 58.1
  9. o1-2024-12-17-high: 57.1
  10. deepseek-r1-distill-qwen-32b: 52.3

MMLU

  1. o1: 92.3
  2. o3: 91.2
  3. DeepSeek-R1: 90.8
  4. Claude 3.5 Sonnet (new): 90.5
  5. R1 1776: 90.5
  6. GPT-4.1: 90.2
  7. Sonus-1 Reasoning: 90.2
  8. Hunyuan-Large: 89.9
  9. GPT-4.5: 89.6
  10. Hunyuan Turbo S: 89.5

GPQA

  1. o3-preview: 87.7
  2. Claude 3.7 Sonnet: 84.8
  3. Grok-3: 84.6
  4. Gemini 2.5 Pro: 84
  5. o3: 83.3
  6. o4-mini: 81.4
  7. o1: 79
  8. Gemini 2.5 Flash Preview: 78.3
  9. Seed-Thinking-v1.5: 77.3
  10. o3-mini: 77

HLE

  1. o3: 24.9
  2. Gemini 2.5 Pro: 18.8
  3. Agentic-Tx: 14.5
  4. o4-mini: 14.3
  5. o3-mini: 14
  6. Gemini 2.5 Flash Preview: 12.1
  7. Claude 3.7 Sonnet: 8.9
  8. o1: 8.8
  9. DeepSeek-R1: 8.6
  10. R1 1776: 8.6

Sources: livebench.ai (Reasoning, Programming), lifearchitect.ai/models-table (MMLU, GPQA), scale.com (HLE) | Fetched: 4/27/2025

AI Model Specifications

Model | Size | Training Data | AGI Level
o1 | 200B | 20T | Level 3
o1-preview | 200B | 20T | Level 3
DeepSeek-R1 | 685B | 14.8T | Level 3
Claude 3.5 Sonnet (new) | 175B | 20T | Level 2
Gemini 2.0 Flash exp | 30B | 30T | Level 2
Claude 3.5 Sonnet | 70B | 15T | Level 2
Gemini-1.5-Pro-002 | 1500B | 30T | Level 2
MiniMax-Text-01 | 456B | 7.2T | Level 2
Grok-2 | 400B | 15T | Level 2
Llama 3.1 405B | 405B | 15.6T | Level 2
Sonus-1 Reasoning | 405B | 15T | Level 2
GPT-4o | 200B | 20T | Level 2
InternVL 2.5 | 78B | 18.12T | Level 2
Qwen2.5 | 72B | 18T | Level 2

When you see "13B (Size) on 5.6T tokens (Training Data)", it means:

  • 13B: 13 billion parameters (think of these as the AI's "brain cells")
  • 5.6T: 5.6 trillion tokens of training data (each token ≈ 4 characters); see the sketch below for a quick conversion
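
As a purely illustrative sketch (the helper name is arbitrary, the ~4-characters-per-token figure is the approximation from the note above, and 13B / 5.6T is the generic example rather than a real model), this expands the notation into rough character counts and a tokens-per-parameter ratio:

```python
# Back-of-envelope conversion for the "Size on Training Data" notation.
# Assumption (from the note above): one token is roughly 4 characters.

def describe(params_billions: float, tokens_trillions: float) -> str:
    chars_trillions = tokens_trillions * 4  # ~4 characters per token
    tokens_per_param = (tokens_trillions * 1e12) / (params_billions * 1e9)
    return (f"{params_billions:g}B params on {tokens_trillions:g}T tokens "
            f"(~{chars_trillions:g}T characters, "
            f"~{tokens_per_param:.0f} tokens per parameter)")

print(describe(13, 5.6))     # the generic example from the text
print(describe(405, 15.6))   # Llama 3.1 405B row from the table above
```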

📊 Understanding the Benchmarks

  • HLE (Humanity's Last Exam): Designed as the most difficult closed-ended academic exam for AI. It aims to rigorously test models at the frontier of human knowledge, since benchmarks like MMLU are becoming saturated (~90%+ scores for top models). Consists of 2,500 questions across more than 100 subjects, written by roughly 1,000 experts. Current top models score only ~20-25%, highlighting the gap to human expert level.
  • MMLU-Pro: Advanced version of MMLU focusing on expert-level knowledge. Currently considered the most reliable indicator of model capabilities.
  • MMLU: Tests knowledge across 57 subjects with a 90% theoretical ceiling and 9% error rate.
  • GPQA: PhD-level science benchmark covering biology, chemistry, physics, and astronomy. It has a ~74% practical ceiling and an estimated ~20% error rate; notably, even scientists agree on only ~78% of answers. A short sketch comparing reported scores against these ceilings follows this list.
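
A minimal sketch of that comparison (illustrative only; the ceiling values and model scores are the ones quoted elsewhere on this page, and the variable names are arbitrary):

```python
# Express a reported benchmark score as a fraction of the approximate ceiling
# quoted above. Ceilings and scores are taken from this page, not from the
# benchmark maintainers.

CEILINGS = {"GPQA": 74.0, "MMLU": 90.0, "HLE": 20.0}

SCORES = [
    ("Claude 3.7 Sonnet", "GPQA", 84.8),
    ("DeepSeek-R1", "MMLU", 90.8),
    ("Gemini 2.5 Pro", "HLE", 18.8),
]

for model, bench, score in SCORES:
    ceiling = CEILINGS[bench]
    status = "above" if score > ceiling else "below"
    print(f"{model}: {score} on {bench} "
          f"({score / ceiling:.0%} of the ~{ceiling:.0f}% ceiling, {status} it)")
```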

These models represent the current state-of-the-art in AI language technology (General Purpose Frontier Models).

Performance Milestones

As of Q1 2025, the theoretical performance ceilings were:

  • GPQA: 74%
  • MMLU: 90%
  • HLE: 20%

These ceilings have since been notably surpassed:

  • o3-preview achieved 87.7% on GPQA, well above the 74% ceiling
  • OpenAI's o1 surpassed both the GPQA and MMLU ceilings in Q1 2025 (79% on GPQA, 92.3% on MMLU)[¹]; the quick check below walks through the margins
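
A quick, purely illustrative check of those margins (all numbers are the ones reported in the tables above):

```python
# Margin by which the quoted milestone scores exceed the Q1 2025 ceilings.
# Scores and ceilings are taken from the tables on this page.

ceilings = {"GPQA": 74.0, "MMLU": 90.0}

milestones = [
    ("o3-preview", "GPQA", 87.7),
    ("o1", "GPQA", 79.0),
    ("o1", "MMLU", 92.3),
]

for model, bench, score in milestones:
    margin = score - ceilings[bench]
    print(f"{model} on {bench}: {score} vs. a {ceilings[bench]:.0f}% ceiling "
          f"(+{margin:.1f} points)")
```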

Access & Details

For detailed information on each model, including:

  • Technical specifications
  • Use cases
  • Access procedures
  • Deployment guidelines

Please refer to our Models Access page.

Note: Performance metrics and rankings are based on publicly available data and may evolve as new models and evaluations emerge.


[1]: AI Research Community. "Language Model Leaderboard." Google Sheets, 2025. https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/