Math Reasoning Benchmark Leaderboard

Leaderboard for the Math Reasoning Benchmark, which evaluates LLMs on chained, multi-step mathematical reasoning.

Models are scored on exact-match accuracy across hard- and medium-difficulty problems.
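A minimal sketch of the scoring, assuming one canonical answer string per problem; the normalization step and the overall-score formula are inferred from the leaderboard figures, not taken from an official evaluation harness:

```python
def exact_match(prediction: str, reference: str) -> bool:
    # Compare after trimming surrounding whitespace; any further normalization
    # (stripping "$", trailing periods, etc.) is an assumption, not documented.
    return prediction.strip() == reference.strip()

def accuracy(predictions: list[str], references: list[str]) -> float:
    # Percentage of problems whose predicted final answer exactly matches.
    matches = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * matches / len(references)

# The overall score appears to be the unweighted mean of the two difficulty
# tiers, e.g. (72.1 + 84.7) / 2 = 78.4 in the table below.
def overall(hard_acc: float, medium_acc: float) -> float:
    return (hard_acc + medium_acc) / 2
```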

| Rank | Model | Organization | Overall (%) | Hard (%) | Medium (%) |      |     |
|-----:|-------|--------------|------------:|---------:|-----------:|-----:|:---:|
| 1    | Llama 4 Maverick | Meta | 78.4 | 72.1 | 84.7 | 85.2 | Yes |

Dataset: sumeetrm/math-reasoning-benchmark
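A minimal sketch of loading the benchmark with the `datasets` library; the split name and the column names (`problem`, `answer`, `difficulty`) are assumptions about the schema, not documented fields:

```python
from datasets import load_dataset

# Load the benchmark from the Hugging Face Hub.
ds = load_dataset("sumeetrm/math-reasoning-benchmark")

# Hypothetical schema: select one difficulty tier for evaluation.
hard = ds["train"].filter(lambda ex: ex["difficulty"] == "hard")
print(hard[0]["problem"], hard[0]["answer"])
```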

To submit results, open a discussion on the dataset page with your model name, scores, and evaluation details.
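For example, a submission discussion might include something like the following; the exact fields and wording are illustrative, not a required template:

```
Model: Llama 4 Maverick
Scores: Overall 78.4 (Hard 72.1, Medium 84.7)
Evaluation details: zero-shot prompt, greedy decoding, exact match on the final answer
```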