AI Model Leaderboard

Benchmark Leaderboards

Sort AI models, filter by pricing brackets, review confidence levels, and inspect benchmark trajectories.

What benchmark categories are ranked?

AI-Ladder organizes model rankings across text, code, vision, document, image, and video benchmarks so developers can compare capability slices instead of relying on a single blended score.

How should this leaderboard be used?

Use filters to narrow by provider, pricing, and context window, then open provenance traces or move selected models into the comparison sandbox. Treat rankings as decision evidence rather than a final answer: confidence intervals, benchmark category coverage, and source timestamps matter when model scores are close.
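
One rough way to act on the confidence intervals is to treat two models as tied when their reported ± ranges overlap. The sketch below assumes the ± value is a symmetric interval around the score. The `Rating` class and `intervals_overlap` helper are illustrative names, not part of AI-Ladder, and interval overlap is a coarse heuristic rather than a formal significance test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    """A leaderboard entry: point estimate plus the reported ± margin."""
    model: str
    score: float
    margin: float  # assumed to be the half-width of a symmetric interval

    @property
    def low(self) -> float:
        return self.score - self.margin

    @property
    def high(self) -> float:
        return self.score + self.margin


def intervals_overlap(a: Rating, b: Rating) -> bool:
    """Return True when the two ± ranges share at least one point."""
    return a.low <= b.high and b.low <= a.high


# Values copied from the ranking table on this page.
first = Rating("claude-fable-5", 1508, 9)
second = Rating("Claude Opus 4.6 Thinking", 1503, 4)
tenth = Rating("gpt-5.5-high", 1481, 5)

print(intervals_overlap(first, second))  # ranges 1499-1517 and 1499-1507
print(intervals_overlap(first, tenth))   # ranges 1499-1517 and 1476-1486
```

Under this heuristic the first two entries are not clearly separable, while the gap between rank 1 and rank 10 is larger than either interval.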

Coding benchmark caveats

SWE-bench Verified is still useful as a historical and scaffold-specific signal, but OpenAI now treats it as increasingly contaminated for frontier-model reporting and recommends SWE-bench Pro for cleaner coding capability claims.
SWE-bench Pro and Bash Only results should not be mixed without labels: the same base model can move materially when the scaffold, tool budget, context strategy, or agent harness changes.
Since 2025-11-18, SWE-bench Verified and Multilingual submissions are limited to academic teams and research institutions with open methods and a publication or technical report, so new product-agent results may appear elsewhere first.
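
The labeling caveat above can be made concrete by storing the benchmark track and scaffold with every result and only comparing results that share both. This is a minimal sketch: the `CodingResult` fields, the `comparable` and `group_by_setup` helpers, the model names, and the scores are all hypothetical placeholders, not real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodingResult:
    model: str
    track: str     # e.g. "SWE-bench Pro" or "Bash Only"
    scaffold: str  # free-text description of the agent harness
    score: float   # placeholder value, not a real benchmark result


def comparable(a: CodingResult, b: CodingResult) -> bool:
    """Two results are comparable only if track and scaffold both match."""
    return (a.track, a.scaffold) == (b.track, b.scaffold)


def group_by_setup(results):
    """Bucket results by (track, scaffold) so rankings stay within one setup."""
    groups = {}
    for result in results:
        groups.setdefault((result.track, result.scaffold), []).append(result)
    return groups


results = [
    CodingResult("model-a", "SWE-bench Pro", "harness-x", 0.50),
    CodingResult("model-b", "SWE-bench Pro", "harness-x", 0.45),
    CodingResult("model-a", "Bash Only", "bash-only", 0.40),
]

for setup, rows in group_by_setup(results).items():
    ranked = sorted(rows, key=lambda r: r.score, reverse=True)
    print(setup, [r.model for r in ranked])
```

Grouping before ranking keeps the same base model from appearing to "move" merely because two different setups were merged into one list.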

Rank	Model	Provider	License	Elo Score	Price ($/1M tokens)	Context	Votes
#1	claude-fable-5	Anthropic	Proprietary	1508±9	Unknown	—	4,366
#2	Claude Opus 4.6 Thinking	Anthropic	Proprietary	1503±4	Unknown	—	51,769
#3	Claude Opus 4.7 Thinking	Anthropic	Proprietary	1502±4	Unknown	—	38,326
#4	Claude Opus 4.6	Anthropic	Proprietary	1499±4	Unknown	—	55,027
#5	Claude Opus 4.7	Anthropic	Proprietary	1494±4	Unknown	—	39,550
#6	Muse Spark	Meta	Proprietary	1487±6	Unknown	—	13,598
#7	Gemini 3.1 Pro Preview	Google	Proprietary	1486±4	Unknown	—	68,291
#8	gemini-3-pro	Google	Proprietary	1486±4	Unknown	—	41,298
#9	claude-opus-4-8-thinking	Anthropic	Proprietary	1484±6	Unknown	—	18,680
#10	gpt-5.5-high	OpenAI	Proprietary	1481±5	Unknown	—	33,718
#11	Claude Opus 4.8	Anthropic	Proprietary	1479±6	Unknown	—	19,038
#12	gpt-5.4-high	OpenAI	Proprietary	1478±4	Unknown	—	46,702
#13	Gemini 3.5 Flash	Google	Proprietary	1476±7	Unknown	—	10,159
#14	gpt-5.2-chat-latest-20260210	OpenAI	Proprietary	1476±4	Unknown	—	34,532
#15	grok-4.20-beta-0309-reasoning	xAI	Proprietary	1476±4	Unknown	—	48,117
#16	qwen3.7-max-preview	Alibaba	Proprietary	1475±10	Unknown	—	3,731
#17	GPT-5.5	OpenAI	Proprietary	1475±5	Unknown	—	34,794
#18	grok-4.20-beta1	xAI	Proprietary	1474±5	Unknown	—	26,945
#19	GLM-5.1	Z.AI	Proprietary	1473±5	Unknown	—	19,620
#20	gemini-3-flash	Google	Proprietary	1473±4	Unknown	—	30,704
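
To get a feel for what the score gaps in this table mean, the sketch below converts a rating difference into an expected head-to-head win rate. It assumes the conventional Elo logistic curve with a 400-point scale; the page does not state which rating formula or scale AI-Ladder uses, so treat the function and its `scale` parameter as an assumption.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expectation that A beats B (assumed formula, see lead-in)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Point estimates copied from ranks 1 and 20 of the table above.
top = 1508
twentieth = 1473

print(round(expected_win_rate(top, twentieth), 3))
```

Under that assumption, the 35-point spread across the top 20 corresponds to an expected win rate of roughly 0.55 for the top entry, which is why the ± margins and vote counts deserve attention alongside rank.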
