LLM Arena - AI Model Rankings
Compare LLM performance with transparent ELO rankings. GPT-4, Claude, Llama, Mistral and more.
How It Works
- Filter by category: select a task type such as general, code, or image.
- Compare scores: visualize each model's ELO ranking and how it evolves over time.
- Choose your model: identify the best fit for your criteria, such as performance, price, or open-source availability (a quick code sketch follows this list).
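If you prefer to script that filter-compare-choose flow, here is a minimal sketch against a hypothetical rankings export. The data format, field names, and ELO values are illustrative assumptions, not an actual export from this site.

```python
# Minimal sketch of the filter -> compare -> choose flow.
# All entries and scores below are illustrative placeholders, not live data.
from dataclasses import dataclass

@dataclass
class ModelEntry:
    name: str
    category: str      # e.g. "general", "code", "image"
    elo: int           # unified ELO score
    open_source: bool

RANKINGS = [
    ModelEntry("gpt-4o", "general", 1285, False),
    ModelEntry("llama-3.1-70b", "general", 1215, True),
    ModelEntry("claude-3.5-sonnet", "code", 1270, False),
    ModelEntry("deepseek-coder-v2", "code", 1200, True),
]

def best_models(category: str, open_source_only: bool = False) -> list[ModelEntry]:
    """Filter by category (and optionally licensing), sorted by ELO descending."""
    candidates = [
        m for m in RANKINGS
        if m.category == category and (m.open_source or not open_source_only)
    ]
    return sorted(candidates, key=lambda m: m.elo, reverse=True)

for m in best_models("code"):
    print(f"{m.name}: {m.elo}")
```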
Frequently Asked Questions
- How are ELO scores calculated?
- We aggregate scores from recognized benchmarks (LMSYS Chatbot Arena, MMLU, HumanEval, MATH, etc.) and convert them to a unified ELO scale. Data is updated daily.
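The exact aggregation method is not spelled out above, so here is only a minimal sketch of one plausible approach: min-max normalize each benchmark's scores across models so the scales are comparable, average the normalized scores, and map the result onto an ELO-style range. The benchmark values and the 800-1400 target range are illustrative assumptions.

```python
# Hypothetical sketch: convert raw benchmark scores to a unified ELO-style scale.
# Benchmark values below are illustrative placeholders, not real results.

RAW_SCORES = {  # model -> {benchmark: score}
    "model-a": {"mmlu": 86.0, "humaneval": 88.0},
    "model-b": {"mmlu": 79.0, "humaneval": 72.0},
    "model-c": {"mmlu": 70.0, "humaneval": 65.0},
}
ELO_MIN, ELO_MAX = 800, 1400  # assumed target range for the unified scale

def unified_elo(raw: dict[str, dict[str, float]]) -> dict[str, float]:
    benchmarks = {b for scores in raw.values() for b in scores}
    normalized: dict[str, list[float]] = {m: [] for m in raw}
    for b in benchmarks:
        values = [scores[b] for scores in raw.values() if b in scores]
        lo, hi = min(values), max(values)
        for model, scores in raw.items():
            if b in scores:
                # Min-max normalize each benchmark to [0, 1] before averaging.
                normalized[model].append((scores[b] - lo) / (hi - lo) if hi > lo else 0.5)
    return {
        m: ELO_MIN + (ELO_MAX - ELO_MIN) * sum(vals) / len(vals)
        for m, vals in normalized.items()
    }

print(unified_elo(RAW_SCORES))
```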
- Which LLM should I choose for my RAG pipeline?
- For RAG, prioritize models that are strong at instruction following: GPT-4o, Claude 3, or Llama 3.1 70B. The ability to follow instructions and cite retrieved sources matters more than raw benchmark score.
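To make "following instructions and citing sources" concrete, here is a minimal sketch of a RAG prompt that asks the model for grounded, cited answers. The prompt wording is an assumption, not a prescribed template; plug the resulting string into whichever model client you use.

```python
# Minimal sketch of a RAG prompt that tests instruction following and citation.

def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Number the retrieved passages and require bracketed citations."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the passages below.\n"
        "Cite each claim with its passage number, e.g. [2].\n"
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

passages = [
    "ELO scores are updated daily.",
    "Rankings cover general, code, and image tasks.",
]
print(build_rag_prompt("How often are scores updated?", passages))
```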
- Are open-source models as good as proprietary ones?
- In 2024, Llama 3.1 405B and Mixtral rival GPT-4 on many tasks. For RAG, Llama 3.1 70B offers excellent value in self-hosted setups.
- What's the difference between GPT-4 and GPT-4 Turbo?
- GPT-4 Turbo is faster, about 3x cheaper, and has a 128K context window versus 8K for the original GPT-4. Performance is similar; prefer GPT-4 Turbo or GPT-4o for RAG.
- Claude vs GPT-4: which is better?
- Claude 3 Opus surpasses GPT-4 on some benchmarks and offers a 200K context window; GPT-4o is faster and cheaper. For RAG, both are excellent; test them on your own use case.
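Since testing on your own use case is the real deciding factor, here is a minimal side-by-side harness sketch. `ask_gpt4o` and `ask_claude` are hypothetical stand-ins, not real client code; replace them with your actual OpenAI and Anthropic calls, then compare answers manually or with your own scoring.

```python
# Minimal sketch of a side-by-side comparison on your own prompts.
# Both functions are hypothetical stand-ins for real API client calls.

def ask_gpt4o(prompt: str) -> str:
    raise NotImplementedError("Replace with an OpenAI client call.")

def ask_claude(prompt: str) -> str:
    raise NotImplementedError("Replace with an Anthropic client call.")

TEST_PROMPTS = [
    "Summarize this contract clause in one sentence: ...",
    "Extract the invoice total from this text: ...",
]

for prompt in TEST_PROMPTS:
    print(f"PROMPT: {prompt}")
    for name, ask in [("gpt-4o", ask_gpt4o), ("claude", ask_claude)]:
        try:
            print(f"  {name}: {ask(prompt)}")
        except NotImplementedError as exc:
            print(f"  {name}: <not wired up: {exc}>")
```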
- Which model for code generation?
- For code, the strongest options are GPT-4o (strong generalist), Claude 3.5 Sonnet (excellent at code), and DeepSeek Coder V2 (specialized, open source). Filter by the "Code" category in the arena.
