Top AI Models Q4 2024
A comprehensive comparison of frontier AI models (Q4 2024): MMLU-Pro, MMLU, and GPQA benchmark scores for leading models from OpenAI, Anthropic (Claude), Google (Gemini), xAI (Grok), and the open-source ecosystem, with updated performance rankings and a capabilities assessment.
🏆 Top General-Purpose Models (Ranked by MMLU-Pro)
| Model | MMLU-Pro | MMLU | GPQA | Size | Training Data | AGI Level |
|---|---|---|---|---|---|---|
| o1 | 91.0 | 92.3 | 79.0 | 200B | 20T | Level 3 |
| o1-preview | 91.0 | 92.3 | 78.3 | 200B | 20T | Level 3 |
| Claude 3.5 Sonnet (new) | 78.0 | 90.5 | 65.0 | N/A | N/A | Level 2 |
| Gemini 2.0 Flash exp | 76.4 | 87.0 | 62.1 | N/A | N/A | Level 2 |
| Claude 3.5 Sonnet | 76.1 | 88.7 | 67.2 | N/A | N/A | Level 2 |
| Gemini-1.5-Pro-002 | 75.8 | N/A | 59.1 | N/A | N/A | Level 2 |
| Grok-2 | 75.5 | 87.5 | 56.0 | 600B | 15T | Level 2 |
| Llama 3.1 405B | 73.3 | 88.6 | 51.1 | 405B | 15.6T | Level 2 |
| GPT-4o | 72.6 | 88.7 | 53.6 | 200B | 20T | Level 2 |
| InternVL 2.5 | 71.1 | 86.1 | 49.0 | N/A | N/A | Level 2 |
| Qwen2.5 | 71.1 | 86.1 | 49.0 | N/A | N/A | Level 2 |
| phi-4 | 70.4 | 84.8 | 56.1 | N/A | N/A | Level 2 |
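Since the ordering shifts depending on which benchmark you sort by, a quick script can re-rank the table. The sketch below is plain Python with the scores transcribed from the table above; the `models` dictionary and `rank` helper are illustrative names, not part of any published tooling:

```python
# Minimal sketch: re-rank the models above by any benchmark column.
# Scores are transcribed from the table; None marks an unreported value.
models = {
    "o1":                      {"mmlu_pro": 91.0, "mmlu": 92.3, "gpqa": 79.0},
    "o1-preview":              {"mmlu_pro": 91.0, "mmlu": 92.3, "gpqa": 78.3},
    "Claude 3.5 Sonnet (new)": {"mmlu_pro": 78.0, "mmlu": 90.5, "gpqa": 65.0},
    "Gemini 2.0 Flash exp":    {"mmlu_pro": 76.4, "mmlu": 87.0, "gpqa": 62.1},
    "Claude 3.5 Sonnet":       {"mmlu_pro": 76.1, "mmlu": 88.7, "gpqa": 67.2},
    "Gemini-1.5-Pro-002":      {"mmlu_pro": 75.8, "mmlu": None, "gpqa": 59.1},
    "Grok-2":                  {"mmlu_pro": 75.5, "mmlu": 87.5, "gpqa": 56.0},
    "Llama 3.1 405B":          {"mmlu_pro": 73.3, "mmlu": 88.6, "gpqa": 51.1},
    "GPT-4o":                  {"mmlu_pro": 72.6, "mmlu": 88.7, "gpqa": 53.6},
    "InternVL 2.5":            {"mmlu_pro": 71.1, "mmlu": 86.1, "gpqa": 49.0},
    "Qwen2.5":                 {"mmlu_pro": 71.1, "mmlu": 86.1, "gpqa": 49.0},
    "phi-4":                   {"mmlu_pro": 70.4, "mmlu": 84.8, "gpqa": 56.1},
}

def rank(by: str) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted high to low, skipping missing scores."""
    scored = [(name, s[by]) for name, s in models.items() if s[by] is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Sorting by GPQA instead of MMLU-Pro reshuffles the middle of the table:
for name, score in rank("gpqa"):
    print(f"{score:5.1f}  {name}")
```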
When a row lists, for example, "13B" under Size and "5.6T" under Training Data, it means:
- 13B: 13 billion parameters (think of these as the AI's "brain cells")
- 5.6T: 5.6 trillion tokens of training data (each token ≈ 4 characters)
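To put those training-data figures in perspective, here is a back-of-the-envelope conversion. It assumes the ~4 characters per token rule of thumb above, 1 byte per character, and the common ~0.75 words per token heuristic; all three are rough approximations, not measured values:

```python
# Back-of-the-envelope sketch: how much raw text is 5.6T training tokens?
tokens = 5.6e12            # 5.6 trillion tokens
chars = tokens * 4         # ~4 characters per token (rule of thumb)
terabytes = chars / 1e12   # 1 byte per character -> ~22.4 TB of plain text
words = tokens * 0.75      # ~0.75 words per token -> ~4.2 trillion words

print(f"~{terabytes:.1f} TB of text, roughly {words:.1e} words")
```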
📊 Understanding the Benchmarks
- MMLU-Pro: Advanced version of MMLU focusing on expert-level knowledge. Currently considered the most reliable indicator of model capabilities.
- MMLU: Tests knowledge across 57 subjects. An estimated 9% of its questions contain errors, giving it a theoretical ceiling of about 90%.
- GPQA: PhD-level science benchmark spanning biology, chemistry, physics, and astronomy, with a ~20% question error rate and a theoretical ceiling of about 74%. Notably, even domain scientists agree on only ~78% of answers.
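Given those differing ceilings, raw scores across benchmarks are not directly comparable. One illustrative (and unofficial) adjustment is to express each score as a fraction of its benchmark's theoretical ceiling; the sketch below assumes the ceiling values quoted above and is not a standard metric:

```python
# Illustrative sketch (not an official metric): express a raw score as a
# percentage of the benchmark's theoretical ceiling quoted in this section.
CEILINGS = {"mmlu": 90.0, "gpqa": 74.0}  # percent

def ceiling_normalized(score: float, benchmark: str) -> float:
    """Return the score as a percentage of the benchmark's ceiling."""
    return 100.0 * score / CEILINGS[benchmark]

print(f"{ceiling_normalized(79.0, 'gpqa'):.1f}")  # o1's GPQA: ~106.8 (above ceiling)
print(f"{ceiling_normalized(88.7, 'mmlu'):.1f}")  # GPT-4o's MMLU: ~98.6
```

A value above 100 simply means the model exceeded the estimated ceiling, as o1 did on GPQA (see Performance Milestones below).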
These models represent the current state-of-the-art in AI language technology (General Purpose Frontier Models).
Performance Milestones
As of Q1 2024, the theoretical performance ceilings were:
- GPQA: 74%
- MMLU: 90%
These ceilings were notably surpassed:
- OpenAI's o1 model exceeded both the MMLU and GPQA ceilings in Q3 2024[¹]
- OpenAI's o3 achieved 87.7% on GPQA
Access & Details
For detailed information on each model, including:
- Technical specifications
- Use cases
- Access procedures
- Deployment guidelines
Please refer to our Models Access page.
Note: Performance metrics and rankings are based on publicly available data and may evolve as new models and evaluations emerge.
[1]: AI Research Community. "Language Model Leaderboard." Google Sheets, 2024. https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/