Top AI Models December 2025

Comprehensive comparison of frontier AI models (December 2025): HLE, LiveBench, MMLU, and GPQA benchmark scores for leading models including GPT, Claude, Gemini, Grok, and open-source LLMs. Updated performance rankings and capabilities assessment.

⚠️ Models that score well on headline benchmarks like HLE can perform poorly on other evaluations.

🏆 Text Model Leaderboards

HLE

  1. GPT-5.2: 50
  2. Gemini 3: 45.8
  3. Grok 4: 44.4
  4. Kimi K2 Thinking: 44
  5. Gemini 3 Flash: 43.5
  6. Claude Opus 4.5: 43.2
  7. GPT-5: 42
  8. Orchestrator-8B: 37.1
  9. MiniMax-M2: 31.8
  10. DeepSeek-V3.2-Speciale: 30.6

Reasoning (LiveBench)

  1. gemini-3-pro-preview-11-2025-high: 98.8
  2. gpt-5-codex: 98.7
  3. claude-opus-4-5-20251101-thinking-medium-effort: 98.7
  4. gpt-5-high: 98.2
  5. gpt-5.1-codex: 98
  6. claude-opus-4-5-20251101-thinking-high-effort: 98
  7. grok-4-0709: 97.8
  8. gpt-5-pro-2025-10-06: 96.7
  9. gpt-5: 96.6
  10. gemini-3-pro-preview-11-2025-low: 96.5

Programming (LiveBench)

  1. claude-opus-4-5-20251101-medium-effort: 41.5
  2. claude-opus-4-5-20251101-high-effort: 40.8
  3. claude-sonnet-4-5-20250929-thinking-64k: 40.1
  4. gpt-5-high: 40.1
  5. claude-opus-4-5-20251101-low-effort: 40.1
  6. claude-4-sonnet-20250514-base: 39.4
  7. claude-4-sonnet-20250514-thinking-64k: 39.4
  8. gpt-5-chat: 39.4
  9. grok-4-0709: 39.4
  10. grok-code-fast-1-0825: 39.4

MMLU

  1. Kimi K2 Thinking: 94.4
  2. Qwen3-235B-A22B-Thinking-2507: 93.8
  3. DeepSeek-V3.1-Base: 93.7
  4. DeepSeek-R1-0528: 93.4
  5. Qwen3-235B-A22B-Instruct-2507: 93.1
  6. EXAONE 4.0: 92.3
  7. o1: 92.3
  8. o1-preview: 92.3
  9. o1-2024-12-17: 91.8
  10. Pangu Ultra MoE: 91.5

GPQA

  1. Gemini 3: 93.8
  2. GPT-5.2: 93.2
  3. Gemini 3 Flash: 90.4
  4. GPT-5: 89.4
  5. Grok 4: 88.9
  6. GPT-5.1: 88.1
  7. o3-preview: 87.7
  8. Claude Opus 4.5: 87
  9. Gemini 2.5 Pro 06-05: 86.4
  10. DeepSeek-V3.2-Speciale: 85.7

Sources: livebench.ai (Reasoning, Programming), lifearchitect.ai/models-table (MMLU, GPQA), scale.com (HLE) | Fetched: 12/22/2025

AI Image Generation Models

The Fréchet Inception Distance (FID) score is a key metric for evaluating AI image generation quality, where lower scores indicate better performance. Below are comprehensive benchmarks across multiple metrics including CLIP Score, FID, F1, Precision, and Recall.
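For reference, the standard formulation of FID is sketched below. It compares the mean and covariance of image features for real versus generated images; the conventional Inception-v3 feature extractor is an assumption here, as dreamlayer.io does not document its exact setup.

```latex
% Frechet Inception Distance between real (r) and generated (g) images,
% computed from feature means (mu) and covariances (Sigma) produced by a
% feature extractor (conventionally Inception-v3 -- an assumption here).
\[
  \mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2}
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
\]
```

A lower FID means the two feature distributions are closer, which is why lower scores are better in the FID table below.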

CLIP Score

Measures how closely a generated image matches its text prompt

  1. Photon (Luma Labs): 0.265
  2. Flux Pro (Black Forest Labs): 0.263
  3. Dall-E 3 (OpenAI): 0.259
  4. Nano Banana (Google Gemini): 0.258
  5. Runway Gen 4 (Runway AI): 0.251
  6. Ideogram V3 (Ideogram): 0.250
  7. Stability SD Turbo (Stability AI): 0.249

FID Score

Assesses how close AI-generated images are to real images (lower is better)

  1. Ideogram V3 (Ideogram): 305.600
  2. Dall-E 3 (OpenAI): 306.080
  3. Runway Gen 4 (Runway AI): 317.520
  4. Photon (Luma Labs): 318.550
  5. Flux Pro (Black Forest Labs): 318.630
  6. Nano Banana (Google Gemini): 318.800
  7. Stability SD Turbo (Stability AI): 321.750

F1 Score

Combines precision and recall to show overall image accuracy

  1. Photon (Luma Labs): 0.463
  2. Stability SD Turbo (Stability AI): 0.447
  3. Runway Gen 4 (Runway AI): 0.445
  4. Flux Pro (Black Forest Labs): 0.421
  5. Ideogram V3 (Ideogram): 0.415
  6. Dall-E 3 (OpenAI): 0.380
  7. Nano Banana (Google Gemini): 0.351

Precision

Measures how many AI-generated images came out correct versus the total number generated

  1. Photon (Luma Labs): 0.448
  2. Stability SD Turbo (Stability AI): 0.432
  3. Runway Gen 4 (Runway AI): 0.423
  4. Flux Pro (Black Forest Labs): 0.406
  5. Ideogram V3 (Ideogram): 0.397
  6. Dall-E 3 (OpenAI): 0.358
  7. Nano Banana (Google Gemini): 0.339

Recall

Measures how many correct images the AI produced vs all possible correct images

  1. Stability SD Turbo (Stability AI): 0.533
  2. Photon (Luma Labs): 0.532
  3. Runway Gen 4 (Runway AI): 0.522
  4. Ideogram V3 (Ideogram): 0.497
  5. Flux Pro (Black Forest Labs): 0.495
  6. Dall-E 3 (OpenAI): 0.477
  7. Nano Banana (Google Gemini): 0.415

Source: dreamlayer.io/research | Fetched: 12/9/2025

AI Model Specifications

| Model | Size | Training Data | AGI Level | Access |
|-------|------|---------------|-----------|--------|
| o1 | 200B | 20T | Level 3 | Access |
| o1-preview | 200B | 20T | Level 3 | Access |
| DeepSeek-R1 | 685B | 14.8T | Level 3 | Access |
| Claude 3.5 Sonnet (new) | 175B | 20T | Level 2 | Access |
| Gemini 2.0 Flash exp | 30B | 30T | Level 2 | Access |
| Claude 3.5 Sonnet | 70B | 15T | Level 2 | Access |
| Gemini-1.5-Pro-002 | 1500B | 30T | Level 2 | Access |
| MiniMax-Text-01 | 456B | 7.2T | Level 2 | Access |
| Grok-2 | 400B | 15T | Level 2 | Access |
| Llama 3.1 405B | 405B | 15.6T | Level 2 | Access |
| Sonus-1 Reasoning | 405B | 15T | Level 2 | Access |
| GPT-4o | 200B | 20T | Level 2 | Access |
| InternVL 2.5 | 78B | 18.12T | Level 2 | Access |
| Qwen2.5 | 72B | 18T | Level 2 | Access |

When you see "13B (Size) on 5.6T tokens (Training Data)", it means:

  • 13B: 13 billion parameters (think of these as the AI's "brain cells")
  • 5.6T: 5.6 trillion tokens of training data (each token ≈ 4 characters)
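
A quick back-of-the-envelope sketch, using the 13B / 5.6T example above and the ~4 characters per token rule of thumb. The 16-bit weight size is an assumption (real deployments also use fp32, fp8, or quantized formats), and the character ratio varies by tokenizer and language.

```python
# Rough sizing for the "13B parameters on 5.6T tokens" example above.
params = 13e9           # 13 billion parameters
tokens = 5.6e12         # 5.6 trillion training tokens
chars_per_token = 4     # rule of thumb; varies by tokenizer and language
bytes_per_param = 2     # assumes 16-bit weights

print(f"Approx. training text: {tokens * chars_per_token / 1e12:.1f} trillion characters")
print(f"Approx. weight size:   {params * bytes_per_param / 1e9:.0f} GB at 16-bit precision")
print(f"Tokens per parameter:  {tokens / params:.0f}")
```

This prints roughly 22.4 trillion characters of training text, about 26 GB of weights, and around 431 training tokens per parameter.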

These models represent the current state-of-the-art in AI language technology (General Purpose Frontier Models).

Performance Milestones

As of Q1 2025, the theoretical performance ceilings were:

  • GPQA: 74%
  • MMLU: 90%
  • HLE: 20%

These ceilings have since been surpassed:

  • o3-preview reached 87.7% on GPQA, well above the 74% ceiling
  • OpenAI's o1 model surpassed both the GPQA and MMLU ceilings in Q1 2025[¹]

Access & Details

For detailed information on each model, including:

  • Technical specifications
  • Use cases
  • Access procedures
  • Deployment guidelines

Please refer to our Models Access page.

Note: Performance metrics and rankings are based on publicly available data and may evolve as new models and evaluations emerge.

Understanding the Benchmarks

Text Model Benchmarks

  • HLE (Humanity's Last Exam): Designed as the most difficult closed-ended academic exam for AI, aiming to rigorously test models at the frontier of human knowledge now that benchmarks like MMLU have become too easy (90%+ scores for top models). It consists of 2,500 questions across more than 100 subjects, contributed by roughly 1,000 experts. Top models scored around 20% at launch; as the leaderboard above shows, frontier models now reach the 40-50% range, though a substantial gap to human expert level remains.
  • MMLU-Pro: A harder variant of MMLU focused on expert-level, reasoning-heavy questions; it is often considered a more reliable indicator of frontier-model capability than the original MMLU.
  • MMLU: Tests knowledge across 57 subjects. Because roughly 9% of its questions contain errors, even a perfect model would top out near the ~90% practical ceiling.
  • GPQA: A PhD-level science benchmark spanning biology, chemistry, physics, and astronomy, with a ~74% practical ceiling and a ~20% error rate. Notably, even domain scientists agree on only ~78% of the answers.

Image Generation Benchmarks

  • CLIP Score: Measures how closely a generated image matches its text prompt. Higher scores indicate better text-to-image alignment (see the formulas after this list).
  • FID Score: Assesses how close AI-generated images are to real images by comparing feature distributions. Lower scores are better.
  • F1 Score: Combines precision and recall into a single measure of overall image generation accuracy, balancing false positives and false negatives (formula below).
  • Precision: Measures how many AI-generated images came out correct compared to the total number of images generated.
  • Recall: Measures how many of the correct images the AI was able to produce out of all the possible correct images it could have generated.
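
For reference, standard definitions of CLIP Score and F1 are sketched below; the exact embedding model and aggregation used by dreamlayer.io are assumptions on our part.

```latex
% CLIP Score: cosine similarity between CLIP embeddings of the image (E_I) and
% its text prompt (E_T). The ~0.25 values in the tables above suggest raw cosine
% similarity; a 100-scaled variant of the metric is also common.
\[
  \mathrm{CLIPScore}(I, T) = \frac{E_I \cdot E_T}{\lVert E_I \rVert \, \lVert E_T \rVert}
\]

% F1: harmonic mean of precision and recall. Reported scores may be averaged
% per prompt, so they need not exactly equal the harmonic mean of the aggregate
% precision and recall columns above.
\[
  F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]
```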

For more details on specific image generation models like Nano Banana, see our dedicated model pages.

Mathematics Competition Benchmarks

AIME25, USAMO25, and HMMT25 are prestigious American high school mathematics competitions held in 2025.

AIME25 (American Invitational Mathematics Examination): An intermediate competition for students who excel on the AMC 10/12 exams. It features 15 complex problems with integer answers, and top scorers may advance to the USAMO.

USAMO25 (United States of America Mathematical Olympiad): The premier national math olympiad in the US. It is a highly selective, proof-based exam for the top performers from the AIME. The USAMO is a key step in selecting the U.S. team for the International Mathematical Olympiad (IMO).

HMMT25 (Harvard-MIT Mathematics Tournament): A challenging and popular competition run by students from Harvard and MIT. It occurs twice a year (February at MIT, November at Harvard) and includes a mix of individual and team-based rounds, attracting top students from around the world.

These competitions, along with others, have recently been used as benchmarks to test the capabilities of advanced AI models.


[1]: AI Research Community. "Language Model Leaderboard." Google Sheets, 2025. https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/