Gemma 4: Google DeepMind's Most Intelligent Open Models
On April 2, 2026, Google DeepMind released Gemma 4 — the fourth generation of its open model family, licensed under Apache 2.0. Built on research and technology from Gemini 3, Gemma 4 is not just an incremental upgrade. It introduces a fundamentally different architecture per deployment target, ranging from on-device edge models that understand audio and video to a 31B-parameter dense model that achieves frontier-level benchmarks.
What makes Gemma 4 remarkable is its intelligence-per-parameter efficiency. The 31B dense model scores 1452 on LMArena — competitive with models many times its size — while the 26B MoE variant achieves 1441 with only 4 billion active parameters per token. This is a 7.5x compute reduction for near-identical performance.
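The compute claim can be sanity-checked with back-of-envelope arithmetic. This sketch assumes per-token FLOPs scale roughly with active parameter count; the exact ratio depends on the precise active-parameter figure, which the "~4B" is rounding:

```python
# Rough check of the compute-reduction claim, assuming per-token FLOPs
# scale with active parameters (~2 FLOPs per parameter per token).
dense_active = 31e9   # Gemma 4 31B: every parameter is active per token
moe_active = 4e9      # Gemma 4 26B-A4B: ~4B active parameters per token

reduction = dense_active / moe_active
print(f"~{reduction:.1f}x fewer active parameters per token")
```

With these rounded figures the ratio lands near 7.8x; the article's 7.5x presumably reflects the unrounded active-parameter count.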
Model Variants
Gemma 4 ships in four sizes, each designed for a specific deployment regime:
| Model | Total Params | Active Params | Context | Architecture | Modalities |
|---|---|---|---|---|---|
| Gemma 4 31B | ~31B | 31B (dense) | 256K | GQA + GeGLU | Text, Image |
| Gemma 4 26B-A4B | ~26B | ~4B (MoE) | 256K | MoE + dense FFN | Text, Image |
| Gemma 4 E4B | ~8B | ~4.5B | 128K | GQA + PLE + KV sharing | Text, Image, Audio, Video |
| Gemma 4 E2B | ~5.1B | ~2.3B | 128K | GQA + PLE + KV sharing | Text, Image, Audio, Video |
The E-series models (E2B, E4B) are any-to-any — they natively process text, images, video with audio, and standalone audio. The larger 31B and 26B models focus on text and image input.
Key Benchmarks
The benchmark numbers speak for themselves:
| Benchmark | Gemma 4 31B | Gemma 4 26B-A4B | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B |
|---|---|---|---|---|---|
| LMArena (text) | 1452 | 1441 | — | — | 1365 |
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| AIME 2026 (math) | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| MMMU Pro (vision) | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
| τ²-bench (agentic) | 86.4% | 85.5% | 57.5% | 29.4% | 6.6% |
The 26B-A4B MoE model deserves special attention: it comes within 11 Elo points of the 31B on LMArena while activating only ~4B parameters per token. On coding (LiveCodeBench v6) it scores 77.1% — far ahead of the previous-generation Gemma 3 27B at 29.1%.
Architecture: What's New
Gemma 4 moves beyond the "same architecture, different scale" approach. Each variant has a distinct architecture optimized for its deployment target, but they share key innovations:
Dual-Config Attention
The single biggest structural change from Gemma 3. Layers alternate between sliding window (local) and full context (global) attention, but now these layer types have completely different geometries:
- Sliding layers: head_dim=256, more KV heads, standard RoPE (theta=10K). Optimized for fine-grained local patterns.
- Full layers: head_dim=512, fewer KV heads, proportional RoPE (theta=1M, 25% partial rotation). Optimized for long-range semantic attention.
In Gemma 3, all layers had the same attention configuration. In Gemma 4, every 6th layer operates with a fundamentally different attention geometry.
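The alternating geometry can be sketched as a per-layer config function. This is an illustrative reconstruction from the description above; the class and function names, and the exact 5:1 local-to-global interleaving, are assumptions:

```python
from dataclasses import dataclass

@dataclass
class AttentionConfig:
    kind: str             # "sliding" (local) or "full" (global)
    head_dim: int
    rope_theta: float
    rope_fraction: float  # fraction of head dims that receive RoPE

def layer_config(layer_idx: int) -> AttentionConfig:
    # Every 6th layer is a full-context layer with a different geometry
    # (assumed interleaving, matching the "every 6th layer" description).
    if (layer_idx + 1) % 6 == 0:
        return AttentionConfig("full", head_dim=512,
                               rope_theta=1e6, rope_fraction=0.25)
    return AttentionConfig("sliding", head_dim=256,
                           rope_theta=1e4, rope_fraction=1.0)

configs = [layer_config(i) for i in range(12)]
print([c.kind for c in configs])
```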
Proportional RoPE (p-RoPE)
Full attention layers apply RoPE to only 25% of dimensions (the high-frequency positional channels), leaving 75% as pure semantic channels that never carry positional information. This is grounded in research showing that low-frequency RoPE dimensions carry semantic content and degrade at long context lengths.
The result: robust 256K context windows for the 31B and 26B models, double the 128K of Gemma 3.
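A minimal sketch of partial rotation, assuming the rotated dimensions sit at the front of the head and use consecutive-pair rotation; the frequency layout and pairing scheme are illustrative assumptions, not Gemma 4's exact implementation:

```python
import math

def p_rope(x: list[float], position: int,
           theta: float = 1e6, fraction: float = 0.25) -> list[float]:
    """Rotate only the first `fraction` of head dims; pass the rest through."""
    head_dim = len(x)
    n_rot = int(head_dim * fraction)   # dims that carry positional signal
    out = list(x)
    for i in range(0, n_rot, 2):       # rotate consecutive pairs
        freq = theta ** (-i / n_rot)
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out                         # dims [n_rot:] stay purely semantic

vec = [1.0] * 8
rotated = p_rope(vec, position=3)
```

The last 75% of dimensions are identical before and after, at any position — those channels never carry positional information.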
K=V Weight Sharing
In full attention layers, the V (value) projection is eliminated entirely. The key tensor is cloned as the value, then K and V diverge through separate normalization paths. This reduces parameters without quality loss — combined with fewer KV heads (4 vs 16), it dramatically cuts per-layer parameters and KV cache memory.
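A toy 1-D sketch of the shared projection, assuming the divergence happens through separate RMS-norm gains (the function and parameter names here are illustrative):

```python
def rms_norm(x: list[float], gain: list[float], eps: float = 1e-6) -> list[float]:
    scale = (sum(v * v for v in x) / len(x) + eps) ** -0.5
    return [v * scale * g for v, g in zip(x, gain)]

def shared_kv(hidden: list[float], w_k: list[list[float]],
              k_gain: list[float], v_gain: list[float]):
    # One matmul instead of two: the V projection is eliminated entirely,
    # and the key tensor doubles as the value before normalization.
    kv = [sum(h * w for h, w in zip(hidden, row)) for row in w_k]
    return rms_norm(kv, k_gain), rms_norm(kv, v_gain)

h = [0.5, -1.0, 2.0]
w = [[0.1, 0.2, 0.3], [0.0, 1.0, 0.0]]
k, v = shared_kv(h, w, k_gain=[1.0, 1.0], v_gain=[2.0, 0.5])
```

K and V start from the same tensor but end up different, purely through their separate normalization gains — which is what lets the shared projection avoid a quality loss.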
Parallel Dense FFN + MoE (26B-A4B)
The MoE variant is architecturally unusual. Instead of replacing the dense FFN with experts (the standard approach), Gemma 4 runs both in parallel: a dense GeGLU FFN alongside 128 experts (top-8 routing). Their outputs are summed and normalized together. This gives each layer always-on dense capacity plus sparse expert specialization.
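The parallel structure can be sketched with scalar stand-ins. Everything below is a toy illustration of the dataflow (dense path and routed experts summed before a shared normalization), not Gemma 4's shapes or router:

```python
import math

def gelu(x: float) -> float:
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def parallel_ffn_moe(x: float, expert_weights: list[float], top_k: int = 2) -> float:
    # Dense path: always-on capacity (a 1-d stand-in for the GeGLU FFN).
    dense_out = gelu(x) * x
    # Sparse path: route to the top_k experts by score and sum their outputs.
    top = sorted(range(len(expert_weights)),
                 key=lambda i: abs(expert_weights[i]), reverse=True)[:top_k]
    sparse_out = sum(expert_weights[i] * x for i in top)
    # Summed together, then (not shown here) normalized jointly.
    return dense_out + sparse_out

out = parallel_ffn_moe(1.0, expert_weights=[0.3, -0.9, 0.1, 0.5], top_k=2)
```

The design choice this illustrates: the dense path guarantees a floor of always-on capacity, while the experts add token-dependent specialization on top, rather than replacing the dense FFN as in a standard MoE.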
Per-Layer Embeddings (E-series)
The E-series models introduce Per-Layer Embeddings (PLE): a second embedding table that maps each token to a unique 256-dim vector for every decoder layer, giving each layer its own channel for token-specific information. For the E2B, this accounts for the gap between total parameters (~5.1B) and effective parameters (~2.3B): the table is loaded, but only a small slice of it is "active" for any given token.
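A toy lookup illustrating the PLE shape — one 256-dim vector per (token, layer) pair. Sizes here are illustrative (a 3-token vocabulary), and the table layout is an assumption based on the description:

```python
import random

n_layers, ple_dim = 35, 256
random.seed(0)

# Second embedding table, conceptually [vocab, layers, ple_dim]: large in
# total parameters, but only one (layers, ple_dim) slice per input token.
ple_table = {t: [[random.random() for _ in range(ple_dim)]
                 for _ in range(n_layers)]
             for t in range(3)}  # tiny 3-token vocab for the sketch

def ple_lookup(token_id: int, layer_idx: int) -> list[float]:
    # Each decoder layer reads its own vector for the token.
    return ple_table[token_id][layer_idx]

vec = ple_lookup(token_id=1, layer_idx=7)
```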
KV Cache Sharing (E-series)
In the E2B model, 20 of 35 layers reuse KV caches from earlier layers of the same attention type. This eliminates redundant KV projections and dramatically reduces memory. The shared layers compensate with 2x wider MLPs. The result: a model that can run full 128K context on a laptop GPU.
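The cache-reuse pattern can be sketched as an owner map: 15 layers compute and store their own KV, and the other 20 read an owner's cache. The round-robin owner assignment below is an illustrative assumption; the article does not specify the exact mapping:

```python
n_layers = 35
n_owners = 15  # layers that compute and store their own KV (35 - 20 sharers)

# Owners point at themselves; sharers point back at an owner (round-robin,
# illustratively -- the real mapping also respects attention type).
cache_owner = {i: (i if i < n_owners else i % n_owners) for i in range(n_layers)}

kv_caches: dict[int, list] = {i: [] for i in range(n_owners)}

def attend(layer_idx: int, new_kv) -> list:
    cache = kv_caches[cache_owner[layer_idx]]
    if cache_owner[layer_idx] == layer_idx:  # only owners append new entries
        cache.append(new_kv)
    return cache                             # sharers read an existing cache

attend(0, "kv@L0")
shared_view = attend(15, "kv@L15")  # layer 15 reuses layer 0's cache
```

Only 15 of 35 caches are ever materialized, which is where the memory reduction comes from; the sharing layers skip their KV projections entirely.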
Multimodal Capabilities
Gemma 4 is natively multimodal across the board:
- Vision: All models use a ViT vision encoder with 2D RoPE and learned positional embeddings. Images can be encoded to different token budgets (70, 140, 280, 560, 1120) for speed/quality tradeoffs.
- Audio: The E-series models include a USM-style Conformer audio encoder (12 layers) for speech understanding, transcription, and audio Q&A.
- Video: All models support video input; E-series models process video with audio tracks.
- Function Calling: Native support for agentic workflows with tool use. The 31B model scores 86.4% on the τ²-bench agentic benchmark.
- Languages: Support for 140 languages with cultural-context understanding.
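The vision-token budgets above (70/140/280/560/1120) trade speed against detail. A simple selection heuristic might look like the sketch below; the pixels-per-token threshold and the helper itself are assumptions for illustration, not part of Gemma 4's API:

```python
BUDGETS = [70, 140, 280, 560, 1120]  # supported vision-token budgets

def pick_budget(image_pixels: int, pixels_per_token: int = 4000) -> int:
    """Pick the smallest budget that covers the image at the target detail."""
    needed = image_pixels / pixels_per_token
    for b in BUDGETS:
        if b >= needed:
            return b
    return BUDGETS[-1]  # cap at the largest supported budget

budget = pick_budget(512 * 512)
```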
Hardware Requirements
One of Gemma 4's design goals is accessibility. Here's what you need to run each variant:
| Configuration | 31B | 26B-A4B | E2B |
|---|---|---|---|
| FP16, 4K ctx | 1× H100 80GB | 1× H100 80GB | RTX 4070 12GB |
| INT4, 4K ctx | RTX 4090 24GB | RTX 4070 Ti 16GB | Any GPU |
| INT4, 128K ctx | 2× H100 160GB | A6000 48GB | RTX 4060 8GB |
The E2B model at INT4 with 128K context fits in under 5GB of VRAM — it runs on a laptop. The 26B-A4B MoE at INT4 + 128K context fits in a single A6000, while the 31B requires two H100s for the same configuration.
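Most of the long-context memory in the table above is KV cache, which can be estimated from first principles. The layer and head counts below are illustrative assumptions; only the head_dim, context length, and FP16 width come from this article:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_val: int = 2) -> int:
    # 2 tensors (K and V) per layer, each of shape [ctx, kv_heads, head_dim].
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_val

# Hypothetical budget for 8 full-attention layers at 128K context, FP16
# (head_dim=512 and 4 KV heads per the architecture section above):
gib = kv_cache_bytes(n_layers=8, n_kv_heads=4, head_dim=512,
                     ctx_len=128_000) / 2**30
print(f"~{gib:.1f} GiB of KV cache")
```

Even this small slice of the model lands in the multi-GiB range at 128K context, which is why the long-context rows of the table need workstation or datacenter GPUs.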
Getting Started
Ollama (Fastest)
```bash
ollama run gemma4:31b
ollama run gemma4:26b-a4b
```
Hugging Face Transformers
```python
from transformers import AutoModelForMultimodalLM, AutoProcessor

model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-31b-it", device_map="auto"
)
processor = AutoProcessor.from_pretrained("google/gemma-4-31b-it")

# Multimodal chat message: an image plus a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Tokenize the chat template and move the tensors to the model's device.
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
Google AI Studio
Try Gemma 4 31B IT directly in Google AI Studio without any setup.
Other Platforms
Gemma 4 weights are also distributed on other model platforms beyond Ollama, Hugging Face, and Google AI Studio.
Fine-Tuning
Gemma 4 is supported by major fine-tuning frameworks:
- TRL (Hugging Face): Full fine-tuning, LoRA, QLoRA
- Unsloth Studio: Fast fine-tuning with quantized models
- Keras: Native Google framework integration
- JAX: For TPU and large-scale training
The models are licensed under Apache 2.0, meaning you can use them commercially, modify them, and distribute derivatives, subject only to the license's attribution and notice requirements.
Practical Use Cases
For developers and organizations considering Gemma 4:
- Local coding assistants: The 26B-A4B runs on consumer GPUs with 77.1% on LiveCodeBench — competitive with much larger proprietary models.
- Edge AI: The E2B and E4B models run on phones, Raspberry Pi, and Jetson Nano with low latency and full offline capability.
- Agentic workflows: Native function calling and tool use across all model sizes.
- Multimodal applications: OCR, GUI element detection, document understanding, and audio transcription with a single model.
- Self-hosted AI: Apache 2.0 licensing means full sovereignty over your AI infrastructure.
What This Means for the Open-Source Ecosystem
Gemma 4 continues Google DeepMind's strategy of releasing increasingly capable open models. The combination of Apache 2.0 licensing, frontier-level benchmarks, and deployment flexibility from edge to server makes this a significant release. The MoE variant in particular — achieving near-31B quality with 4B active parameters — demonstrates that efficient architectures can compete with brute-force scaling.
For teams building with LLMs, Gemma 4 offers a practical path to self-hosted, production-quality AI without vendor lock-in or per-token pricing.