Math Reasoning Benchmark Leaderboard
Leaderboard for the Math Reasoning Benchmark, which evaluates LLMs on chained, multi-step mathematical reasoning.
Models are scored on exact-match accuracy across hard and medium difficulty problems. See the submission instructions below.
| Rank | Model | Organization | Overall (%) | Hard (%) | Medium (%) | Easy (%) | Verified |
|------|-------|--------------|-------------|----------|------------|----------|----------|
| 1 | Llama 4 Maverick | Meta | 78.4 | 72.1 | 84.7 | 85.2 | Yes |
Dataset: sumeetrm/math-reasoning-benchmark
To submit results, open a discussion on the dataset page with your model name, scores, and evaluation details.
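For reference, below is a minimal evaluation sketch in Python showing how exact-match accuracy per difficulty could be computed before submitting. It is a sketch under stated assumptions, not the official harness: the field names (`question`, `answer`, `difficulty`), the `test` split name, and the normalization rules are assumptions to be checked against the dataset card, and `query_model` is a hypothetical stand-in for your model's inference call.

```python
# Minimal exact-match evaluation sketch.
# Assumptions: the dataset has "question", "answer", and "difficulty"
# fields and a "test" split; verify against the dataset card.
from collections import defaultdict

from datasets import load_dataset


def normalize(ans: str) -> str:
    """Light normalization before exact-match comparison (assumed rules)."""
    return ans.strip().lower().rstrip(".")


def query_model(question: str) -> str:
    """Hypothetical stand-in for your model's inference call."""
    raise NotImplementedError


def evaluate() -> None:
    ds = load_dataset("sumeetrm/math-reasoning-benchmark", split="test")
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for row in ds:
        pred = query_model(row["question"])
        hit = normalize(pred) == normalize(row["answer"])
        d = row["difficulty"]
        correct[d] += int(hit)
        total[d] += 1
    # Report per-difficulty exact-match accuracy as a percentage.
    for d in sorted(total):
        print(f"{d}: {100 * correct[d] / total[d]:.1f}")


if __name__ == "__main__":
    evaluate()
```

Per-difficulty results printed this way can be included directly in the submission discussion alongside the model name and evaluation details.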