
# Gemma 4 Benchmarks

> Source: Google DeepMind model card, HuggingFace blog, LMArena
> Released: April 2, 2026

## Gemma 4 vs Gemma 3 (biggest single-version jump in Gemma family)

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B A4B | Delta (Gemma 4 31B vs Gemma 3 27B) |
|-----------|------------|------------|----------------|-------------------|
| MMLU Pro | 67.6% | 85.2% | 82.6% | +17.6 |
| AIME 2026 (no tools) | 20.8% | 89.2% | 88.3% | +68.4 |
| GPQA Diamond | 42.4% | 84.3% | 82.3% | +41.9 |
| BigBench Extra Hard | 19.3% | 74.4% | 64.8% | +55.1 |
| LiveCodeBench v6 | 29.1% | 80.0% | 77.1% | +50.9 |
| Codeforces ELO | 110 | 2150 | 1718 | +2040 |
| MMMU Pro (vision) | 49.7% | 76.9% | 73.8% | +27.2 |
| MATH-Vision | 46.0% | 85.6% | 82.4% | +39.6 |
| OmniDocBench (lower=better) | 0.365 | 0.131 | 0.149 | -0.234 |
| MRCR v2 128K | 13.5% | 66.4% | 44.1% | +52.9 |
| MMMLU (multilingual) | 70.7% | 88.4% | 86.3% | +17.7 |
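
The Delta column is plain subtraction (Gemma 4 31B minus Gemma 3 27B), so it can be rechecked mechanically. The snippet below is a minimal sketch of that check on a few rows copied from the table; the names `SCORES`, `LOWER_IS_BETTER`, `delta`, and `improved` are illustrative and not part of any published tooling.

```python
# Scores copied from the table above as (Gemma 3 27B, Gemma 4 31B).
# Only a subset of rows is included; the remaining rows work the same way.
SCORES = {
    "MMLU Pro": (67.6, 85.2),
    "AIME 2026 (no tools)": (20.8, 89.2),
    "GPQA Diamond": (42.4, 84.3),
    "Codeforces ELO": (110, 2150),
    "OmniDocBench": (0.365, 0.131),
}

# OmniDocBench is an error-style metric: a negative delta is an improvement.
LOWER_IS_BETTER = {"OmniDocBench"}


def delta(name):
    """Return new minus old, rounded to hide floating-point noise."""
    old, new = SCORES[name]
    return round(new - old, 3)


def improved(name):
    """True if Gemma 4 31B beats Gemma 3 27B on this benchmark."""
    d = delta(name)
    return d < 0 if name in LOWER_IS_BETTER else d > 0


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name}: {delta(name):+} (improved: {improved(name)})")
```

Tracking the direction of each metric separately avoids misreading the negative OmniDocBench delta as a regression.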

## Arena Scores

| Model | LMArena Score | Rank |
|-------|--------------|------|
| Gemma 4 31B | 1452 | #3 |
| Gemma 4 26B A4B | 1441 | #6 |

## Agentic Benchmark (tau2-bench)

| Model | Score |
|-------|-------|
| Gemma 4 31B | 86.4% |
| Gemma 4 26B A4B | 85.5% |
| Gemma 4 E4B | 57.5% |
| Gemma 4 E2B | 29.4% |

## Takeaway

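The jump from Gemma 3 to Gemma 4 is large across the board: comparing Gemma 4 31B with Gemma 3 27B, AIME 2026 rose from 20.8% to 89.2% and Codeforces from 110 to 2150 ELO. This is not an incremental update. The 26B A4B MoE model stays within roughly 3 percentage points of the 31B dense model on most benchmarks while using ~4B active params. The exceptions are MRCR v2 128K (44.1% vs 66.4%), BigBench Extra Hard (64.8% vs 74.4%), and Codeforces (1718 vs 2150 ELO).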
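
To see how close the 26B A4B model is to the 31B model, the per-benchmark gap can be computed from the percentage rows of the first table. This is a rough sketch using only numbers that appear above; `PAIRS` and `gaps` are hypothetical names chosen for illustration, and Codeforces and OmniDocBench are left out because they are not on a percentage scale.

```python
# Scores copied from the first table as (Gemma 4 31B, Gemma 4 26B A4B), in percent.
PAIRS = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026 (no tools)": (89.2, 88.3),
    "GPQA Diamond": (84.3, 82.3),
    "BigBench Extra Hard": (74.4, 64.8),
    "LiveCodeBench v6": (80.0, 77.1),
    "MMMU Pro (vision)": (76.9, 73.8),
    "MATH-Vision": (85.6, 82.4),
    "MRCR v2 128K": (66.4, 44.1),
    "MMMLU (multilingual)": (88.4, 86.3),
}


def gaps():
    """Return (benchmark, 31B minus 26B A4B) pairs, largest gap first."""
    rows = [(name, round(dense - moe, 1)) for name, (dense, moe) in PAIRS.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, gap in gaps():
        print(f"{name}: {gap} points behind 31B")
```

Sorting by gap puts the long-context result (MRCR v2 128K) at the top and AIME 2026 at the bottom, which is where the "nearly matches" summary is weakest and strongest.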