Mortdecai 5011059f5d docs: initial Gemma 4 research corpus and synthesis
Architecture specs, benchmarks, gotchas, Ollama settings, tool calling
format, and implementation patterns from Simon and AI_Visualizer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 18:14:19 -04:00

# Gemma 4 Benchmarks

> Source: Google DeepMind model card, HuggingFace blog, LMArena
>
> Released: April 2, 2026

## Gemma 4 vs Gemma 3 (biggest single-version jump in the Gemma family)

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B A4B | Delta (Gemma 4 31B vs Gemma 3 27B) |
|-----------|------------|------------|----------------|-------------------|
| MMLU Pro | 67.6% | 85.2% | 82.6% | +17.6 |
| AIME 2026 (no tools) | 20.8% | 89.2% | 88.3% | +68.4 |
| GPQA Diamond | 42.4% | 84.3% | 82.3% | +41.9 |
| BigBench Extra Hard | 19.3% | 74.4% | 64.8% | +55.1 |
| LiveCodeBench v6 | 29.1% | 80.0% | 77.1% | +50.9 |
| Codeforces Elo | 110 | 2150 | 1718 | +2040 |
| MMMU Pro (vision) | 49.7% | 76.9% | 73.8% | +27.2 |
| MATH-Vision | 46.0% | 85.6% | 82.4% | +39.6 |
| OmniDocBench (lower=better) | 0.365 | 0.131 | 0.149 | -0.234 |
| MRCR v2 128K | 13.5% | 66.4% | 44.1% | +52.9 |
| MMMLU (multilingual) | 70.7% | 88.4% | 86.3% | +17.7 |
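
The Delta column is the Gemma 4 31B score minus the Gemma 3 27B score, in the same unit as the row (percentage points for most rows, Elo for Codeforces, raw score for OmniDocBench). A minimal sketch that recomputes it from the values in the table; the names `SCORES` and `delta` are illustrative only and do not come from any source tooling:

```python
# Scores copied from the table above as (Gemma 3 27B, Gemma 4 31B).
# SCORES and delta() are hypothetical names used only for this check.
SCORES = {
    "MMLU Pro": (67.6, 85.2),
    "AIME 2026 (no tools)": (20.8, 89.2),
    "GPQA Diamond": (42.4, 84.3),
    "BigBench Extra Hard": (19.3, 74.4),
    "LiveCodeBench v6": (29.1, 80.0),
    "Codeforces Elo": (110, 2150),
    "MMMU Pro (vision)": (49.7, 76.9),
    "MATH-Vision": (46.0, 85.6),
    "OmniDocBench (lower=better)": (0.365, 0.131),
    "MRCR v2 128K": (13.5, 66.4),
    "MMMLU (multilingual)": (70.7, 88.4),
}


def delta(old, new, ndigits=3):
    """Return new - old, rounded to hide floating-point noise."""
    return round(new - old, ndigits)


deltas = {name: delta(old, new) for name, (old, new) in SCORES.items()}

for name, d in deltas.items():
    # A negative delta is an improvement only for lower-is-better rows.
    print(f"{name}: {d:+g}")
```

For OmniDocBench the negative delta is the improvement, since lower scores are better there.
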

## Arena Scores

| Model | LMArena Score | Rank |
|-------|--------------|------|
| Gemma 4 31B | 1452 | #3 |
| Gemma 4 26B A4B | 1441 | #6 |

## Agentic Benchmark (tau2-bench)

| Model | Score |
|-------|-------|
| 31B | 86.4% |
| 26B A4B | 85.5% |
| E4B | 57.5% |
| E2B | 29.4% |
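
## Takeaway

The jump from Gemma 3 to Gemma 4 is large: AIME 2026 went from 20.8% to 89.2%, and Codeforces from 110 to 2150 Elo. This is not an incremental update. The 26B MoE (A4B), with ~4B active params, stays within about 1 to 3 points of 31B Dense on most of the benchmarks above, but trails further on BigBench Extra Hard (64.8% vs 74.4%), MRCR v2 128K (44.1% vs 66.4%), and Codeforces (1718 vs 2150 Elo).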
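
The 26B A4B versus 31B comparison can be checked against the first table. The sketch below covers only the percentage-scored rows (Codeforces and OmniDocBench use different scales) and splits them by a cutoff. The 3.5-point cutoff and the names `PERCENT_SCORES` and `CLOSE_THRESHOLD` are assumptions made for illustration, not values from the source:

```python
# Percentage-scored rows from the first table as (Gemma 4 31B, Gemma 4 26B A4B).
PERCENT_SCORES = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026 (no tools)": (89.2, 88.3),
    "GPQA Diamond": (84.3, 82.3),
    "BigBench Extra Hard": (74.4, 64.8),
    "LiveCodeBench v6": (80.0, 77.1),
    "MMMU Pro (vision)": (76.9, 73.8),
    "MATH-Vision": (85.6, 82.4),
    "MRCR v2 128K": (66.4, 44.1),
    "MMMLU (multilingual)": (88.4, 86.3),
}

# Hypothetical cutoff (in percentage points) for calling two scores "close".
CLOSE_THRESHOLD = 3.5

# Gap = dense 31B score minus MoE 26B A4B score, rounded to one decimal.
gaps = {
    name: round(dense - moe, 1)
    for name, (dense, moe) in PERCENT_SCORES.items()
}

close = [name for name, gap in gaps.items() if gap <= CLOSE_THRESHOLD]
wide = {name: gap for name, gap in gaps.items() if gap > CLOSE_THRESHOLD}

print(f"{len(close)} of {len(gaps)} benchmarks within {CLOSE_THRESHOLD} points")
for name, gap in wide.items():
    print(f"larger gap on {name}: {gap} points")
```

With this cutoff, seven of the nine percentage-scored benchmarks count as close; the two exceptions are BigBench Extra Hard and MRCR v2 128K.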