# Gemma 4 Benchmarks

> Source: Google DeepMind model card, HuggingFace blog, LMArena
> Released: April 2, 2026

## Gemma 4 vs Gemma 3 (biggest single-version jump in Gemma family)

| Benchmark | Gemma 3 27B | Gemma 4 31B | Gemma 4 26B A4B | Delta (31B vs Gemma 3 27B) |
|-----------|------------|------------|----------------|-------------------|
| MMLU Pro | 67.6% | 85.2% | 82.6% | +17.6 |
| AIME 2026 (no tools) | 20.8% | 89.2% | 88.3% | +68.4 |
| GPQA Diamond | 42.4% | 84.3% | 82.3% | +41.9 |
| BigBench Extra Hard | 19.3% | 74.4% | 64.8% | +55.1 |
| LiveCodeBench v6 | 29.1% | 80.0% | 77.1% | +50.9 |
| Codeforces Elo | 110 | 2150 | 1718 | +2040 |
| MMMU Pro (vision) | 49.7% | 76.9% | 73.8% | +27.2 |
| MATH-Vision | 46.0% | 85.6% | 82.4% | +39.6 |
| OmniDocBench (lower=better) | 0.365 | 0.131 | 0.149 | -0.234 |
| MRCR v2 128K | 13.5% | 66.4% | 44.1% | +52.9 |
| MMMLU (multilingual) | 70.7% | 88.4% | 86.3% | +17.7 |
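
The Delta column is the Gemma 4 31B score minus the Gemma 3 27B score, in percentage points for the percentage benchmarks and in raw units for Codeforces and OmniDocBench. A minimal sketch for recomputing it from a few rows of the table; the `ROWS` dictionary and the `delta` helper are illustrative names, not part of any source material:

```python
# (Gemma 3 27B, Gemma 4 31B, decimal places), copied from the table above.
ROWS = {
    "MMLU Pro": (67.6, 85.2, 1),
    "AIME 2026 (no tools)": (20.8, 89.2, 1),
    "GPQA Diamond": (42.4, 84.3, 1),
    "Codeforces": (110, 2150, 0),
    "OmniDocBench (lower=better)": (0.365, 0.131, 3),
}


def delta(old, new, places):
    """Signed difference new - old, rounded to hide float noise."""
    return round(new - old, places)


deltas = {
    name: delta(old, new, places)
    for name, (old, new, places) in ROWS.items()
}

for name, d in deltas.items():
    # The "+" format flag prints an explicit sign, matching the table style.
    print(f"{name}: {d:+}")
```

For OmniDocBench the negative delta is the improvement, since lower scores are better there.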

## Arena Scores

| Model | LMArena Score | Rank |
|-------|--------------|------|
| Gemma 4 31B | 1452 | #3 |
| Gemma 4 26B A4B | 1441 | #6 |

## Agentic Benchmark (tau2-bench)

| Model | Score |
|-------|-------|
| 31B | 86.4% |
| 26B A4B | 85.5% |
| E4B | 57.5% |
| E2B | 29.4% |
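
## Takeaway

The jump from Gemma 3 to Gemma 4 is enormous: AIME 2026 went from 20.8% to 89.2%, and Codeforces from 110 to 2150 Elo. This is not an incremental update. The 26B A4B MoE model stays within about 3 points of the 31B dense model on most benchmarks while using only ~4B active parameters; the gap is wider on BigBench Extra Hard (64.8% vs 74.4%), MRCR v2 128K (44.1% vs 66.4%), and Codeforces (1718 vs 2150 Elo).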
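
To check how closely the 26B A4B model tracks the 31B model, the percentage-scored rows of the comparison table can be differenced directly. This is only a sketch: the 5-point `THRESHOLD` is an arbitrary, hypothetical cutoff for "nearly matches", and Codeforces and OmniDocBench are left out because they are not percentage scores.

```python
# (Gemma 4 31B, Gemma 4 26B A4B) percentage scores from the comparison table.
PAIRS = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026 (no tools)": (89.2, 88.3),
    "GPQA Diamond": (84.3, 82.3),
    "BigBench Extra Hard": (74.4, 64.8),
    "LiveCodeBench v6": (80.0, 77.1),
    "MMMU Pro (vision)": (76.9, 73.8),
    "MATH-Vision": (85.6, 82.4),
    "MRCR v2 128K": (66.4, 44.1),
    "MMMLU (multilingual)": (88.4, 86.3),
}

# Hypothetical cutoff (percentage points) for calling two scores "close".
THRESHOLD = 5.0

# Gap = dense score minus MoE score, rounded to one decimal place.
gaps = {name: round(dense - moe, 1) for name, (dense, moe) in PAIRS.items()}

# Benchmarks where the MoE model trails by more than the cutoff.
large_gaps = sorted(name for name, gap in gaps.items() if gap > THRESHOLD)

for name, gap in gaps.items():
    flag = "  <-- large gap" if name in large_gaps else ""
    print(f"{name}: {gap:.1f} pts{flag}")
```

With this cutoff, only BigBench Extra Hard and MRCR v2 128K are flagged; a different threshold would draw the line elsewhere.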