Math Reasoning Benchmark Leaderboard

Leaderboard for the Math Reasoning Benchmark, which evaluates LLMs on chained, multi-step mathematical reasoning.

Models are scored on exact-match accuracy across hard- and medium-difficulty problems.
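A minimal sketch of the scoring, assuming one canonical answer string per problem; the normalization step and the overall-score formula are inferred from the leaderboard figures, not taken from an official evaluation harness:

```python
def exact_match(prediction: str, reference: str) -> bool:
    # Compare after trimming surrounding whitespace; any further normalization
    # (stripping "$", trailing periods, etc.) is an assumption, not documented.
    return prediction.strip() == reference.strip()

def accuracy(predictions: list[str], references: list[str]) -> float:
    # Percentage of problems whose predicted final answer exactly matches.
    matches = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)

# The overall score appears to be the unweighted mean of the two difficulty
# tiers, e.g. (72.1 + 84.7) / 2 = 78.4 in the table below.
def overall(hard_acc: float, medium_acc: float) -> float:
    return (hard_acc + medium_acc) / 2
```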

| Rank | Model | Organization | Overall (%) | Hard (%) | Medium (%) |      |     |
|-----:|-------|--------------|------------:|---------:|-----------:|-----:|:---:|
| 1    | Llama 4 Maverick | Meta | 78.4 | 72.1 | 84.7 | 85.2 | Yes |

Dataset: sumeetrm/math-reasoning-benchmark
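A minimal sketch of loading the benchmark with the `datasets` library; the split name and the column names (`problem`, `answer`, `difficulty`) are assumptions about the schema, not documented fields:

```python
from datasets import load_dataset

# Load the benchmark from the Hugging Face Hub.
ds = load_dataset("sumeetrm/math-reasoning-benchmark")

# Hypothetical schema: select one difficulty tier for evaluation.
hard = ds["train"].filter(lambda ex: ex["difficulty"] == "hard")
print(hard[0]["problem"], hard[0]["answer"])
```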

To submit results, open a discussion on the dataset page with your model name, scores, and evaluation details.
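For example, a submission discussion might include something like the following; the exact fields and wording are illustrative, not a required template:

```
Model: Llama 4 Maverick
Scores: Overall 78.4 (Hard 72.1, Medium 84.7)
Evaluation details: zero-shot prompt, greedy decoding, exact match on the final answer
```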