gemma4-research/CORPUS_benchmarks.md
Mortdecai 5011059f5d docs: initial Gemma 4 research corpus and synthesis
Architecture specs, benchmarks, gotchas, Ollama settings, tool calling
format, and implementation patterns from Simon and AI_Visualizer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 18:14:19 -04:00


Gemma 4 Benchmarks

Source: Google DeepMind model card, HuggingFace blog, LMArena

Released: April 2, 2026

Gemma 4 vs Gemma 3 (biggest single-version jump in Gemma family)

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B A4B | Delta (31B vs G3) |
|---|---|---|---|---|
| MMLU Pro | 67.6% | 85.2% | 82.6% | +17.6 |
| AIME 2026 (no tools) | 20.8% | 89.2% | 88.3% | +68.4 |
| GPQA Diamond | 42.4% | 84.3% | 82.3% | +41.9 |
| BigBench Extra Hard | 19.3% | 74.4% | 64.8% | +55.1 |
| LiveCodeBench v6 | 29.1% | 80.0% | 77.1% | +50.9 |
| Codeforces ELO | 110 | 2150 | 1718 | +2040 |
| MMMU Pro (vision) | 49.7% | 76.9% | 73.8% | +27.2 |
| MATH-Vision | 46.0% | 85.6% | 82.4% | +39.6 |
| OmniDocBench (lower=better) | 0.365 | 0.131 | 0.149 | -0.234 |
| MRCR v2 128K | 13.5% | 66.4% | 44.1% | +52.9 |
| MMMLU (multilingual) | 70.7% | 88.4% | 86.3% | +17.7 |
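
The Delta column is the plain difference between the Gemma 4 31B score and the Gemma 3 27B score, in percentage points for the percentage rows. A minimal sketch that recomputes it from the figures above; the names `SCORES`, `delta`, and `DELTAS` are illustrative and not from the source:

```python
# Scores copied from the Gemma 4 vs Gemma 3 table: (Gemma 3 27B, Gemma 4 31B).
# Most rows are percentages; Codeforces is a rating and OmniDocBench is an
# error score where lower is better, so its delta is negative.
SCORES = {
    "MMLU Pro": (67.6, 85.2),
    "AIME 2026 (no tools)": (20.8, 89.2),
    "GPQA Diamond": (42.4, 84.3),
    "BigBench Extra Hard": (19.3, 74.4),
    "LiveCodeBench v6": (29.1, 80.0),
    "Codeforces ELO": (110, 2150),
    "MMMU Pro (vision)": (49.7, 76.9),
    "MATH-Vision": (46.0, 85.6),
    "OmniDocBench": (0.365, 0.131),
    "MRCR v2 128K": (13.5, 66.4),
    "MMMLU (multilingual)": (70.7, 88.4),
}


def delta(old: float, new: float) -> float:
    """Absolute difference new - old, rounded to hide float noise."""
    return round(new - old, 3)


DELTAS = {name: delta(g3, g4) for name, (g3, g4) in SCORES.items()}

if __name__ == "__main__":
    for name, d in DELTAS.items():
        print(f"{name:<24} {d:+g}")
```

Because the rows use different units, the deltas are only comparable within a row, not across rows.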

Arena Scores

| Model | LMArena Score | Rank |
|---|---|---|
| Gemma 4 31B | 1452 | #3 |
| Gemma 4 26B A4B | 1441 | #6 |

Agentic Benchmark (tau2-bench)

| Model | Score |
|---|---|
| 31B | 86.4% |
| 26B A4B | 85.5% |
| E4B | 57.5% |
| E2B | 29.4% |

Takeaway

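The jump from Gemma 3 to Gemma 4 is large on every benchmark listed: AIME 2026 rose from 20.8% to 89.2%, and Codeforces from 110 to 2150 ELO. This is not an incremental update. The 26B A4B MoE model stays within about 3 points of the 31B dense model on most benchmarks while using ~4B active parameters; the gap is wider on BigBench Extra Hard (64.8% vs 74.4%), MRCR v2 128K (44.1% vs 66.4%), and Codeforces (1718 vs 2150).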
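
The comparison between the 31B dense and 26B A4B models can be checked row by row from the table figures. The sketch below covers only the percentage benchmarks, so the gaps share a unit. The 5-point cutoff for a "large" gap is an assumption chosen for illustration, and the names `PAIRS`, `GAPS`, and `LARGE` are hypothetical:

```python
# Percentage benchmarks only, copied from the table:
# (Gemma 4 31B, Gemma 4 26B A4B).
PAIRS = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026 (no tools)": (89.2, 88.3),
    "GPQA Diamond": (84.3, 82.3),
    "BigBench Extra Hard": (74.4, 64.8),
    "LiveCodeBench v6": (80.0, 77.1),
    "MMMU Pro (vision)": (76.9, 73.8),
    "MATH-Vision": (85.6, 82.4),
    "MRCR v2 128K": (66.4, 44.1),
    "MMMLU (multilingual)": (88.4, 86.3),
}

# Assumed cutoff for calling a gap "large"; not a value from the source.
LARGE_GAP_POINTS = 5.0

# Gap in percentage points: how far the 26B A4B model trails the 31B model.
GAPS = {name: round(dense - moe, 1) for name, (dense, moe) in PAIRS.items()}

# Benchmarks where the MoE model trails by more than the assumed cutoff.
LARGE = [name for name, gap in GAPS.items() if gap > LARGE_GAP_POINTS]

if __name__ == "__main__":
    for name, gap in sorted(GAPS.items(), key=lambda item: item[1]):
        print(f"{name:<24} {gap:.1f}")
    print("large gaps:", LARGE)
```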