Top AI Models Q4 2024

Comprehensive comparison of frontier AI models (Q4 2024): MMLU-Pro, MMLU, and GPQA benchmark scores for leading models from OpenAI, Anthropic (Claude), Google (Gemini), xAI (Grok), and the open-source community. Updated performance rankings and capabilities assessment.

🏆 Top General-Purpose Models (Ranked by MMLU-Pro)

| Model | MMLU-Pro | MMLU | GPQA | Size | Training Data | AGI Level | Access |
|-------|----------|------|------|------|---------------|-----------|--------|
| o1 | 91.0 | 92.3 | 79.0 | 200B | 20T | Level 3 | Access |
| o1-preview | 91.0 | 92.3 | 78.3 | 200B | 20T | Level 3 | Access |
| Claude 3.5 Sonnet (new) | 78.0 | 90.5 | 65.0 | N/A | N/A | Level 2 | Access |
| Gemini 2.0 Flash exp | 76.4 | 87.0 | 62.1 | N/A | N/A | Level 2 | Access |
| Claude 3.5 Sonnet | 76.1 | 88.7 | 67.2 | N/A | N/A | Level 2 | Access |
| Gemini-1.5-Pro-002 | 75.8 | N/A | 59.1 | N/A | N/A | Level 2 | Access |
| Grok-2 | 75.5 | 87.5 | 56.0 | 600B | 15T | Level 2 | Access |
| Llama 3.1 405B | 73.3 | 88.6 | 51.1 | 405B | 15.6T | Level 2 | Access |
| GPT-4o | 72.6 | 88.7 | 53.6 | 200B | 20T | Level 2 | Access |
| InternVL 2.5 | 71.1 | 86.1 | 49.0 | N/A | N/A | Level 2 | Access |
| Qwen2.5 | 71.1 | 86.1 | 49.0 | N/A | N/A | Level 2 | Access |
| phi-4 | 70.4 | 84.8 | 56.1 | N/A | N/A | Level 2 | Access |
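
The ranking above is simply the table sorted by MMLU-Pro in descending order. Here is a minimal Python sketch that reproduces that ordering from a handful of scores transcribed from the table (the dictionary is an illustrative structure, not an official data format):

```python
# A few MMLU-Pro scores transcribed from the table above.
scores = {
    "o1": 91.0,
    "Claude 3.5 Sonnet (new)": 78.0,
    "Gemini 2.0 Flash exp": 76.4,
    "Grok-2": 75.5,
    "Llama 3.1 405B": 73.3,
    "GPT-4o": 72.6,
}

# Sort by score, highest first, to reproduce the table's ranking.
for rank, (model, score) in enumerate(
    sorted(scores.items(), key=lambda kv: kv[1], reverse=True), start=1
):
    print(f"{rank}. {model}: {score:.1f}")
```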

When you see "13B (Size) on 5.6T tokens (Training Data)", it means:

  • 13B: 13 billion parameters (think of these as the AI's "brain cells")
  • 5.6T: 5.6 trillion tokens of training data (each token ≈ 4 characters; see the sketch below)
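
To make those numbers concrete, here is a back-of-the-envelope sketch converting a token count into approximate raw text volume, assuming the common rules of thumb of ~4 characters and ~0.75 words per token (both are approximations, not exact values):

```python
# Rule-of-thumb assumptions (approximate, not exact):
CHARS_PER_TOKEN = 4.0    # ~4 characters per token for English text
WORDS_PER_TOKEN = 0.75   # ~0.75 words per token

def text_volume(tokens: float) -> tuple[float, float]:
    """Return (approx. characters, approx. words) for a given token count."""
    return tokens * CHARS_PER_TOKEN, tokens * WORDS_PER_TOKEN

# Example: the 5.6T-token figure from the explanation above.
chars, words = text_volume(5.6e12)
print(f"~{chars:.2e} characters, ~{words:.2e} words")
# -> ~2.24e+13 characters, ~4.20e+12 words
```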

📊 Understanding the Benchmarks

  • MMLU-Pro: An advanced, harder version of MMLU focused on expert-level reasoning. Currently considered the most reliable single indicator of model capability.
  • MMLU: Tests knowledge across 57 subjects. Roughly 9% of its questions are estimated to contain errors, which puts its practical ceiling near 90%.
  • GPQA: PhD-level science benchmark covering biology, chemistry, physics, and astronomy. Its ~20% error rate puts its practical ceiling near 74%. Notable: even scientists agree on only ~78% of answers. All three benchmarks report plain multiple-choice accuracy, as sketched below.
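
For context on how these scores are produced: each benchmark is multiple-choice, and a model's score is simple accuracy, i.e. the fraction of questions it answers correctly. A minimal Python sketch of that scoring (the predictions and gold answers below are invented placeholders, not real benchmark items):

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of exact matches between predicted and reference choices."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must be the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical run over five 4-option (A-D) questions:
preds = ["A", "C", "B", "D", "A"]
gold  = ["A", "C", "D", "D", "B"]
print(f"Accuracy: {accuracy(preds, gold):.1%}")  # -> 60.0%
```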

These models represent the current state-of-the-art in AI language technology (General Purpose Frontier Models).

Performance Milestones

As of Q1 2024, the theoretical performance ceilings were:

  • GPQA: 74%
  • MMLU: 90%

These ceilings were notably surpassed:

  • OpenAI's o3 achieved 87.7% on GPQA
  • OpenAI's o1 model surpassed both benchmarks in Q3 2024[¹]

Access & Details

For detailed information on each model, including:

  • Technical specifications
  • Use cases
  • Access procedures
  • Deployment guidelines

Please refer to our Models Access page.

Note: Performance metrics and rankings are based on publicly available data and may evolve as new models and evaluations emerge.


[1]: AI Research Community. "Language Model Leaderboard." Google Sheets, 2024. https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/
