LLM Arena - AI Model Rankings

Compare LLM performance with transparent ELO rankings. GPT-4, Claude, Llama, Mistral and more.

How It Works

  1. Filter by category: select the task type (general, code, image, etc.).
  2. Compare scores: visualize the ELO ranking and each model's evolution over time.
  3. Choose your model: identify the best model for your criteria (performance, price, open source).
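The three steps above amount to filter, sort, pick. A minimal sketch, using a hypothetical in-memory model list (the `Model` class, sample entries, and scores are illustrative, not the arena's actual data model):

```python
# Sketch of the filter / compare / choose workflow.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    category: str      # "general", "code", "image", ...
    elo: int
    open_source: bool

# Illustrative sample, not the live leaderboard.
MODELS = [
    Model("GPT-5.1", "general", 1434, False),
    Model("Llama 4 Maverick", "general", 1401, True),
    Model("DeepSeek Coder V3", "code", 1401, True),
]

def best_models(category: str, open_only: bool = False) -> list[Model]:
    """Step 1: filter by category; step 2: rank by ELO; step 3: pick from the top."""
    candidates = [
        m for m in MODELS
        if m.category == category and (m.open_source or not open_only)
    ]
    return sorted(candidates, key=lambda m: m.elo, reverse=True)

print(best_models("general")[0].name)                  # top general model
print(best_models("general", open_only=True)[0].name)  # top open-source general model
```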

Frequently Asked Questions

How are ELO scores calculated?
We aggregate scores from recognized benchmarks (LMSYS Chatbot Arena, MMLU, HumanEval, MATH, etc.) and convert them to a unified ELO scale. Data is updated daily.
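The exact aggregation method is not published here, but one simple way to map heterogeneous benchmark scores onto a unified Elo-like scale is linear min-max rescaling. A hedged sketch (the score ranges and ELO bounds below are assumptions for illustration):

```python
# Assumption: rescale a benchmark score from its own range [lo, hi]
# onto an Elo-like range [elo_min, elo_max]. Illustrative only.
def to_unified_elo(score: float, lo: float, hi: float,
                   elo_min: int = 1000, elo_max: int = 1600) -> int:
    """Linearly map `score` in [lo, hi] to the [elo_min, elo_max] scale."""
    frac = (score - lo) / (hi - lo)
    return round(elo_min + frac * (elo_max - elo_min))

# e.g. an 88% MMLU score, against an assumed 25-100% usable range
# (25% being random-guess accuracy on 4-way multiple choice):
print(to_unified_elo(88.0, 25.0, 100.0))  # -> 1504
```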
Which LLM should I choose for my RAG?
For RAG, prioritize models strong in instruction following: GPT-4o, Claude 3, or Llama 3.1 70B. The ability to follow instructions and cite sources is more important than raw score.
Are open source models as good?
In 2024, Llama 3.1 405B and Mixtral rival GPT-4 on many tasks. For RAG, Llama 3.1 70B offers excellent value in self-hosted setups.
What's the difference between GPT-4 and GPT-4 Turbo?
GPT-4 Turbo is faster, roughly 3x cheaper, and has a 128K context window versus 8K for GPT-4. Performance is similar. Prefer GPT-4 Turbo or GPT-4o for RAG.
Claude vs GPT-4: which is better?
Claude 3 Opus surpasses GPT-4 on some benchmarks and has a 200K context window. GPT-4o is faster and cheaper. For RAG, both are excellent; test on your own use case.
Which model for code generation?
For code, the strongest options are GPT-4o (strong generalist), Claude 3.5 Sonnet (excellent at code), and DeepSeek Coder V2 (specialized, open source). Filter by the "Code" category in the arena.

Arena

ELO rankings updated daily.

| # | Model | Provider | ELO | Δ | Type |
|---|-------|----------|-----|---|------|
| 1 | Gemini 3 Pro (new) | Google | 1512 | +8 | Prop |
| 2 | Gemini 3 Deep Think (new) | Google | 1498 | +6 | Prop |
| 3 | GPT-5.1 Thinking (new) | OpenAI | 1467 | +4 | Prop |
| 4 | Sora Turbo (new) | OpenAI | 1467 | +4 | Prop |
| 5 | Claude Opus 4.5 (new) | Anthropic | 1456 | +5 | Prop |
| 6 | Claude Sonnet 4.5 (new) | Anthropic | 1456 | +5 | Prop |
| 7 | Veo 3 (new) | Google | 1456 | +5 | Prop |
| 8 | GPT-5.1 (new) | OpenAI | 1434 | +3 | Prop |
| 9 | Midjourney v7 (new) | Midjourney | 1434 | +4 | Prop |
| 10 | Claude Sonnet 4.5 (new) | Anthropic | 1423 | +4 | Prop |
| 11 | o3-mini (new) | OpenAI | 1423 | +4 | Prop |
| 12 | DALL-E 4 (new) | OpenAI | 1412 | +3 | Prop |
| 13 | Runway Gen-4 (new) | Runway | 1412 | +4 | Prop |
| 14 | Llama 4 Maverick (new) | Meta | 1401 | +5 | Open |
| 15 | DeepSeek Coder V3 (new) | DeepSeek | 1401 | +4 | Open |
| 16 | GPT-5 | OpenAI | 1398 | -2 | Prop |
| 17 | Grok 3 | xAI | 1389 | +3 | Prop |
| 18 | DeepSeek V3.2 (new) | DeepSeek | 1389 | +6 | Open |
| 19 | DeepSeek R1 (new) | DeepSeek | 1389 | +5 | Open |
| 20 | Imagen 3 (new) | Google | 1389 | +3 | Prop |
| 21 | Kling 1.6 (new) | Kuaishou | 1378 | +3 | Prop |
| 22 | Gemini 2.5 Pro | Google | 1367 | +1 | Prop |
| 23 | Flux 1.1 Pro | Black Forest | 1367 | +1 | Prop |
| 24 | Llama 4 Scout (new) | Meta | 1356 | +3 | Open |
| 25 | Codestral 25.01 (new) | Mistral | 1356 | +2 | Open |
| 26 | Pika 2.0 (new) | Pika Labs | 1356 | +3 | Prop |
| 27 | Gemini 2.5 Flash | Google | 1345 | -1 | Prop |
| 28 | Qwen 3 235B (new) | Alibaba | 1345 | +3 | Open |
| 29 | QwQ 32B (new) | Alibaba | 1345 | +3 | Open |
| 30 | Ideogram v2 | Ideogram | 1345 | +2 | Prop |
| 31 | Claude Haiku 4.5 (new) | Anthropic | 1323 | +2 | Prop |
| 32 | Mistral Large 3 | Mistral | 1323 | +2 | Open |
| 33 | Mistral Medium 3 (new) | Mistral | 1289 | +1 | Open |
| 34 | GPT-4o | OpenAI | 1278 | -4 | Prop |
| 35 | Llama 3.3 70B | Meta | 1245 | -3 | Open |
| 36 | GPT-4o Mini | OpenAI | 1234 | - | Prop |
ELO scores based on MMLU, HumanEval, GPQA and LMSYS Chatbot Arena.
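Where pairwise human votes are involved (as in LMSYS Chatbot Arena), ratings follow the standard Elo update: the winner takes points from the loser in proportion to how surprising the result was. A sketch of that update (the K-factor of 32 is a common default, used here as an assumption):

```python
# Standard Elo update for one pairwise comparison.
def elo_update(r_winner: float, r_loser: float, k: float = 32) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after one head-to-head result."""
    # Expected win probability of the winner, given the rating gap.
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

# Evenly matched models: the winner gains exactly k/2 points.
new_w, new_l = elo_update(1500, 1500)
print(round(new_w), round(new_l))  # -> 1516 1484
```

An upset (a lower-rated model beating a higher-rated one) moves more points than an expected win, which is why new models can climb the table quickly.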

