Mortdecai 5011059f5d docs: initial Gemma 4 research corpus and synthesis

Architecture specs, benchmarks, gotchas, Ollama settings, tool calling
format, and implementation patterns from Simon and AI_Visualizer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-12 18:14:19 -04:00

Gemma 4 Benchmarks

Source: Google DeepMind model card, HuggingFace blog, LMArena

Released: April 2, 2026

Gemma 4 vs Gemma 3 (biggest single-version jump in Gemma family)

Benchmark	Gemma 3 27B	Gemma 4 31B	Gemma 4 26B A4B	Delta (Gemma 4 31B vs Gemma 3 27B)
MMLU Pro	67.6%	85.2%	82.6%	+17.6
AIME 2026 (no tools)	20.8%	89.2%	88.3%	+68.4
GPQA Diamond	42.4%	84.3%	82.3%	+41.9
BigBench Extra Hard	19.3%	74.4%	64.8%	+55.1
LiveCodeBench v6	29.1%	80.0%	77.1%	+50.9
Codeforces ELO	110	2150	1718	+2040
MMMU Pro (vision)	49.7%	76.9%	73.8%	+27.2
MATH-Vision	46.0%	85.6%	82.4%	+39.6
OmniDocBench (lower=better)	0.365	0.131	0.149	-0.234
MRCR v2 128K	13.5%	66.4%	44.1%	+52.9
MMMLU (multilingual)	70.7%	88.4%	86.3%	+17.7
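The Delta column is the Gemma 4 31B score minus the Gemma 3 27B score, in the benchmark's own units (percentage points, ELO, or OmniDocBench error). A minimal sketch for recomputing it from the rows above; the `scores` dictionary and the `delta` helper are illustrative names, not part of any published tooling:

```python
# Scores copied from the table above: (Gemma 3 27B, Gemma 4 31B).
scores = {
    "MMLU Pro": (67.6, 85.2),
    "AIME 2026 (no tools)": (20.8, 89.2),
    "GPQA Diamond": (42.4, 84.3),
    "BigBench Extra Hard": (19.3, 74.4),
    "LiveCodeBench v6": (29.1, 80.0),
    "Codeforces ELO": (110, 2150),
    "MMMU Pro (vision)": (49.7, 76.9),
    "MATH-Vision": (46.0, 85.6),
    "OmniDocBench (lower=better)": (0.365, 0.131),
    "MRCR v2 128K": (13.5, 66.4),
    "MMMLU (multilingual)": (70.7, 88.4),
}


def delta(old, new, digits=3):
    """Return new - old, rounded to hide floating-point noise."""
    return round(new - old, digits)


# Hypothetical helper output: benchmark name -> delta in that benchmark's units.
deltas = {name: delta(old, new) for name, (old, new) in scores.items()}

for name, value in deltas.items():
    print(f"{name}: {value:+}")
```

For OmniDocBench a negative delta is the improvement, because lower scores are better there.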

Arena Scores

Model	LMArena Score	Rank
Gemma 4 31B	1452	#3
Gemma 4 26B A4B	1441	#6

Agentic Benchmark (tau2-bench)

Model	Score
31B	86.4%
26B A4B	85.5%
E4B	57.5%
E2B	29.4%

Takeaway

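The jump from Gemma 3 to Gemma 4 is large: AIME rose from 20.8% to 89.2% and Codeforces from 110 to 2150 ELO. This is not an incremental update. The 26B MoE stays within roughly 3 points of the 31B Dense on most benchmarks while using ~4B active params; the larger gaps are on BigBench Extra Hard (64.8% vs 74.4%), MRCR v2 128K (44.1% vs 66.4%), and Codeforces (1718 vs 2150 ELO).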
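One way to check how closely the 26B MoE tracks the 31B Dense is to compute the per-benchmark gap on the percentage-scored rows of the comparison table. The sketch below does that. The 3.5-point cutoff is an arbitrary threshold chosen for illustration, and Codeforces and OmniDocBench are left out because they use different units:

```python
# Percentage scores copied from the comparison table: (Gemma 4 31B, Gemma 4 26B A4B).
pairs = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026 (no tools)": (89.2, 88.3),
    "GPQA Diamond": (84.3, 82.3),
    "BigBench Extra Hard": (74.4, 64.8),
    "LiveCodeBench v6": (80.0, 77.1),
    "MMMU Pro (vision)": (76.9, 73.8),
    "MATH-Vision": (85.6, 82.4),
    "MRCR v2 128K": (66.4, 44.1),
    "MMMLU (multilingual)": (88.4, 86.3),
}

THRESHOLD = 3.5  # assumed cutoff in percentage points, not from the source

# Gap = dense score minus MoE score, in percentage points.
gaps = {name: round(dense - moe, 1) for name, (dense, moe) in pairs.items()}

close = [name for name, gap in gaps.items() if gap <= THRESHOLD]
outliers = [name for name, gap in gaps.items() if gap > THRESHOLD]

print(f"{len(close)} of {len(gaps)} benchmarks within {THRESHOLD} points")
print("Larger gaps:", ", ".join(outliers))
```

With a different cutoff the split changes, so the threshold should be read as a convenience for summarizing the table, not as a published criterion.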