Gemma 4: Google DeepMind's Most Intelligent Open Models
On April 2, 2026, Google DeepMind released Gemma 4 — the fourth generation of its open model family, licensed under Apache 2.0. Built on research and technology from Gemini 3, Gemma 4 is not just an incremental upgrade. It introduces a fundamentally different architecture per deployment target, ranging from on-device edge models that understand audio and video to a 31B-parameter dense model that achieves frontier-level benchmarks.
What makes Gemma 4 remarkable is its intelligence-per-parameter efficiency. The 31B dense model scores 1452 on LMArena — competitive with models many times its size — while the 26B MoE variant achieves 1441 with only 4 billion active parameters per token. This is a 7.5x compute reduction for near-identical performance.
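The compute claim can be sanity-checked with back-of-envelope arithmetic. This sketch assumes per-token FLOPs scale roughly with active parameter count; the exact ratio depends on the precise active-parameter figure, which the "~4B" is rounding:

```python
# Rough check of the compute-reduction claim, assuming per-token FLOPs
# scale with active parameters (~2 FLOPs per parameter per token).
dense_active = 31e9   # Gemma 4 31B: every parameter is active per token
moe_active = 4e9      # Gemma 4 26B-A4B: ~4B active parameters per token

reduction = dense_active / moe_active
print(f"~{reduction:.1f}x fewer active parameters per token")
```

With these rounded figures the ratio lands near 7.8x; the article's 7.5x presumably reflects the unrounded active-parameter count.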
Model Variants
Gemma 4 ships in four sizes, each designed for a specific deployment regime:
| Model | Total Params | Active Params | Context | Architecture | Modalities |
|---|---|---|---|---|---|
| Gemma 4 31B | ~31B | 31B (dense) | 256K | GQA + GeGLU | Text, Image |
| Gemma 4 26B-A4B | ~26B | ~4B (MoE) | 256K | MoE + dense FFN | Text, Image |
| Gemma 4 E4B | ~8B | ~4.5B | 128K | GQA + PLE + KV sharing | Text, Image, Audio, Video |
| Gemma 4 E2B | ~5.1B | ~2.3B | 128K | GQA + PLE + KV sharing | Text, Image, Audio, Video |
The E-series models (E2B, E4B) are any-to-any — they natively process text, images, video with audio, and standalone audio. The larger 31B and 26B models focus on text and image input.
Key Benchmarks
The benchmark numbers speak for themselves:
| Benchmark | Gemma 4 31B | Gemma 4 26B-A4B | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B |
|---|---|---|---|---|---|
| LMArena (text) | 1452 | 1441 | — | — | 1365 |
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| AIME 2026 (math) | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| MMMU Pro (vision) | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
| τ²-bench (agentic) | 86.4% | 85.5% | 57.5% | 29.4% | 6.6% |
The 26B-A4B MoE model deserves special attention: it comes within 11 Elo points of the 31B on LMArena while activating only ~4B parameters per token. On coding (LiveCodeBench v6) it scores 77.1% — far ahead of the previous-generation Gemma 3 27B at 29.1%.
Architecture: What's New
Gemma 4 moves beyond the "same architecture, different scale" approach. Each variant has a distinct architecture optimized for its deployment target, but they share key innovations:
Dual-Config Attention
The single biggest structural change from Gemma 3. Layers alternate between sliding window (local) and full context (global) attention, but now these layer types have completely different geometries:
- Sliding layers: head_dim=256, more KV heads, standard RoPE (theta=10K). Optimized for fine-grained local patterns.
- Full layers: head_dim=512, fewer KV heads, proportional RoPE (theta=1M, 25% partial rotation). Optimized for long-range semantic attention.
In Gemma 3, all layers had the same attention configuration. In Gemma 4, every 6th layer operates with a fundamentally different attention geometry.
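The alternating geometry can be sketched as a per-layer config function. This is an illustrative reconstruction from the description above; the class and function names, and the exact 5:1 local-to-global interleaving, are assumptions:

```python
from dataclasses import dataclass

@dataclass
class AttentionConfig:
    kind: str             # "sliding" (local) or "full" (global)
    head_dim: int
    rope_theta: float
    rope_fraction: float  # fraction of head dims that receive RoPE

def layer_config(layer_idx: int) -> AttentionConfig:
    # Every 6th layer is a full-context layer with a different geometry
    # (assumed interleaving, matching the "every 6th layer" description).
    if (layer_idx + 1) % 6 == 0:
        return AttentionConfig("full", head_dim=512,
                               rope_theta=1e6, rope_fraction=0.25)
    return AttentionConfig("sliding", head_dim=256,
                           rope_theta=1e4, rope_fraction=1.0)

configs = [layer_config(i) for i in range(12)]
print([c.kind for c in configs])
```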
Proportional RoPE (p-RoPE)
Full attention layers apply RoPE to only 25% of dimensions (the high-frequency positional channels), leaving 75% as pure semantic channels that never carry positional information. This is grounded in research showing that low-frequency RoPE dimensions carry semantic content and degrade at long context lengths.
The result: robust 256K context windows for the 31B and 26B models, double the 128K of Gemma 3.
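A minimal sketch of partial rotation, assuming the rotated dimensions sit at the front of the head and use consecutive-pair rotation; the frequency layout and pairing scheme are illustrative assumptions, not Gemma 4's exact implementation:

```python
import math

def p_rope(x: list[float], position: int,
           theta: float = 1e6, fraction: float = 0.25) -> list[float]:
    """Rotate only the first `fraction` of head dims; pass the rest through."""
    head_dim = len(x)
    n_rot = int(head_dim * fraction)   # dims that carry positional signal
    out = list(x)
    for i in range(0, n_rot, 2):       # rotate consecutive pairs
        freq = theta ** (-i / n_rot)
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out                         # dims [n_rot:] stay purely semantic

vec = [1.0] * 8
rotated = p_rope(vec, position=3)
```

The last 75% of dimensions are identical before and after, at any position — those channels never carry positional information.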
K=V Weight Sharing
In full attention layers, the V (value) projection is eliminated entirely. The key tensor is cloned as the value, then K and V diverge through separate normalization paths. This reduces parameters without quality loss — combined with fewer KV heads (4 vs 16), it dramatically cuts per-layer parameters and KV cache memory.
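A toy 1-D sketch of the shared projection, assuming the divergence happens through separate RMS-norm gains (the function and parameter names here are illustrative):

```python
def rms_norm(x: list[float], gain: list[float], eps: float = 1e-6) -> list[float]:
    scale = (sum(v * v for v in x) / len(x) + eps) ** -0.5
    return [v * scale * g for v, g in zip(x, gain)]

def shared_kv(hidden: list[float], w_k: list[list[float]],
              k_gain: list[float], v_gain: list[float]):
    # One matmul instead of two: the V projection is eliminated entirely,
    # and the key tensor doubles as the value before normalization.
    kv = [sum(h * w for h, w in zip(hidden, row)) for row in w_k]
    return rms_norm(kv, k_gain), rms_norm(kv, v_gain)

h = [0.5, -1.0, 2.0]
w = [[0.1, 0.2, 0.3], [0.0, 1.0, 0.0]]
k, v = shared_kv(h, w, k_gain=[1.0, 1.0], v_gain=[2.0, 0.5])
```

K and V start from the same tensor but end up different, purely through their separate normalization gains — which is what lets the shared projection avoid a quality loss.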
Parallel Dense FFN + MoE (26B-A4B)
The MoE variant is architecturally unusual. Instead of replacing the dense FFN with experts (the standard approach), Gemma 4 runs both in parallel: a dense GeGLU FFN alongside 128 experts (top-8 routing). Their outputs are summed and normalized together. This gives each layer always-on dense capacity plus sparse expert specialization.
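The parallel structure can be sketched with scalar stand-ins. Everything below is a toy illustration of the dataflow (dense path and routed experts summed before a shared normalization), not Gemma 4's shapes or router:

```python
import math

def gelu(x: float) -> float:
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def parallel_ffn_moe(x: float, expert_weights: list[float], top_k: int = 2) -> float:
    # Dense path: always-on capacity (a 1-d stand-in for the GeGLU FFN).
    dense_out = gelu(x) * x
    # Sparse path: route to the top_k experts by score and sum their outputs.
    top = sorted(range(len(expert_weights)),
                 key=lambda i: abs(expert_weights[i]), reverse=True)[:top_k]
    sparse_out = sum(expert_weights[i] * x for i in top)
    # Summed together, then (not shown here) normalized jointly.
    return dense_out + sparse_out

out = parallel_ffn_moe(1.0, expert_weights=[0.3, -0.9, 0.1, 0.5], top_k=2)
```

The design choice this illustrates: the dense path guarantees a floor of always-on capacity, while the experts add token-dependent specialization on top, rather than replacing the dense FFN as in a standard MoE.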
Per-Layer Embeddings (E-series)
The E-series models introduce Per-Layer Embeddings (PLE): a second embedding table that maps each token to a unique 256-dim vector for every decoder layer, giving each layer its own channel for token-specific information. For the E2B, this accounts for the gap between total parameters (~5.1B) and effective parameters (~2.3B): the table is loaded, but only a small slice of it is "active" for any given token.
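A toy lookup illustrating the PLE shape — one 256-dim vector per (token, layer) pair. Sizes here are illustrative (a 3-token vocabulary), and the table layout is an assumption based on the description:

```python
import random

n_layers, ple_dim = 35, 256
random.seed(0)

# Second embedding table, conceptually [vocab, layers, ple_dim]: large in
# total parameters, but only one (layers, ple_dim) slice per input token.
ple_table = {t: [[random.random() for _ in range(ple_dim)]
                 for _ in range(n_layers)]
             for t in range(3)}  # tiny 3-token vocab for the sketch

def ple_lookup(token_id: int, layer_idx: int) -> list[float]:
    # Each decoder layer reads its own vector for the token.
    return ple_table[token_id][layer_idx]

vec = ple_lookup(token_id=1, layer_idx=7)
```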
KV Cache Sharing (E-series)
In the E2B model, 20 of 35 layers reuse KV caches from earlier layers of the same attention type. This eliminates redundant KV projections and dramatically reduces memory. The shared layers compensate with 2x wider MLPs. The result: a model that can run full 128K context on a laptop GPU.
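The cache-reuse pattern can be sketched as an owner map: 15 layers compute and store their own KV, and the other 20 read an owner's cache. The round-robin owner assignment below is an illustrative assumption; the article does not specify the exact mapping:

```python
n_layers = 35
n_owners = 15  # layers that compute and store their own KV (35 - 20 sharers)

# Owners point at themselves; sharers point back at an owner (round-robin,
# illustratively -- the real mapping also respects attention type).
cache_owner = {i: (i if i < n_owners else i % n_owners) for i in range(n_layers)}

kv_caches: dict[int, list] = {i: [] for i in range(n_owners)}

def attend(layer_idx: int, new_kv) -> list:
    cache = kv_caches[cache_owner[layer_idx]]
    if cache_owner[layer_idx] == layer_idx:  # only owners append new entries
        cache.append(new_kv)
    return cache                             # sharers read an existing cache

attend(0, "kv@L0")
shared_view = attend(15, "kv@L15")  # layer 15 reuses layer 0's cache
```

Only 15 of 35 caches are ever materialized, which is where the memory reduction comes from; the sharing layers skip their KV projections entirely.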
Multimodal Capabilities
Gemma 4 is natively multimodal across the board:
- Vision: All models use a ViT vision encoder with 2D RoPE and learned positional embeddings. Images can be encoded to different token budgets (70, 140, 280, 560, 1120) for speed/quality tradeoffs.
- Audio: The E-series models include a USM-style Conformer audio encoder (12 layers) for speech understanding, transcription, and audio Q&A.
- Video: All models support video input; E-series models process video with audio tracks.
- Function Calling: Native support for agentic workflows with tool use. The 31B model scores 86.4% on the τ²-bench agentic benchmark.
- Languages: Support for 140 languages with cultural-context understanding.
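The vision-token budgets above (70/140/280/560/1120) trade speed against detail. A simple selection heuristic might look like the sketch below; the pixels-per-token threshold and the helper itself are assumptions for illustration, not part of Gemma 4's API:

```python
BUDGETS = [70, 140, 280, 560, 1120]  # supported vision-token budgets

def pick_budget(image_pixels: int, pixels_per_token: int = 4000) -> int:
    """Pick the smallest budget that covers the image at the target detail."""
    needed = image_pixels / pixels_per_token
    for b in BUDGETS:
        if b >= needed:
            return b
    return BUDGETS[-1]  # cap at the largest supported budget

budget = pick_budget(512 * 512)
```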
Hardware Requirements
One of Gemma 4's design goals is accessibility. Here's what you need to run each variant:
| Configuration | 31B | 26B-A4B | E2B |
|---|---|---|---|
| FP16, 4K ctx | 1× H100 80GB | 1× H100 80GB | RTX 4070 12GB |
| INT4, 4K ctx | RTX 4090 24GB | RTX 4070 Ti 16GB | Any GPU |
| INT4, 128K ctx | 2× H100 160GB | A6000 48GB | RTX 4060 8GB |
The E2B model at INT4 with 128K context fits in under 5GB of VRAM — it runs on a laptop. The 26B-A4B MoE at INT4 + 128K context fits in a single A6000, while the 31B requires two H100s for the same configuration.
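Most of the long-context memory in the table above is KV cache, which can be estimated from first principles. The layer and head counts below are illustrative assumptions; only the head_dim, context length, and FP16 width come from this article:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_val: int = 2) -> int:
    # 2 tensors (K and V) per layer, each of shape [ctx, kv_heads, head_dim].
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_val

# Hypothetical budget for 8 full-attention layers at 128K context, FP16
# (head_dim=512 and 4 KV heads per the architecture section above):
gib = kv_cache_bytes(n_layers=8, n_kv_heads=4, head_dim=512,
                     ctx_len=128_000) / 2**30
print(f"~{gib:.1f} GiB of KV cache")
```

Even this small slice of the model lands in the multi-GiB range at 128K context, which is why the long-context rows of the table need workstation or datacenter GPUs.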
Getting Started
Ollama (Fastest)
```bash
ollama run gemma4:31b
ollama run gemma4:26b-a4b
```
Hugging Face Transformers
```python
from transformers import AutoModelForMultimodalLM, AutoProcessor

model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-31b-it", device_map="auto"
)
processor = AutoProcessor.from_pretrained("google/gemma-4-31b-it")

# Multimodal chat message: an image plus a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Tokenize the chat template and move the tensors to the model's device.
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
Google AI Studio
Try Gemma 4 31B IT directly in Google AI Studio without any setup.
Other Platforms
Gemma 4 weights are also distributed on other model platforms beyond Ollama, Hugging Face, and Google AI Studio.
Fine-Tuning
Gemma 4 is supported by major fine-tuning frameworks:
- TRL (Hugging Face): Full fine-tuning, LoRA, QLoRA
- Unsloth Studio: Fast fine-tuning with quantized models
- Keras: Native Google framework integration
- JAX: For TPU and large-scale training
The models are licensed under Apache 2.0, meaning you can use them commercially, modify them, and distribute derivatives, subject only to the license's attribution and notice requirements.
Practical Use Cases
For developers and organizations considering Gemma 4:
- Local coding assistants: The 26B-A4B runs on consumer GPUs with 77.1% on LiveCodeBench — competitive with much larger proprietary models.
- Edge AI: The E2B and E4B models run on phones, Raspberry Pi, and Jetson Nano with low latency and full offline capability.
- Agentic workflows: Native function calling and tool use across all model sizes.
- Multimodal applications: OCR, GUI element detection, document understanding, and audio transcription with a single model.
- Self-hosted AI: Apache 2.0 licensing means full sovereignty over your AI infrastructure.
What This Means for the Open-Source Ecosystem
Gemma 4 continues Google DeepMind's strategy of releasing increasingly capable open models. The combination of Apache 2.0 licensing, frontier-level benchmarks, and deployment flexibility from edge to server makes this a significant release. The MoE variant in particular — achieving near-31B quality with 4B active parameters — demonstrates that efficient architectures can compete with brute-force scaling.
For teams building with LLMs, Gemma 4 offers a practical path to self-hosted, production-quality AI without vendor lock-in or per-token pricing.