Google DeepMind did not just release another model. With Gemma 4, they released four — spanning edge devices to workstation-grade inference — and quietly made the strongest case yet for open-weight models in production agentic systems.
If you are building AI agents that need to run locally, process documents and images, reason through multi-step plans, and do all of this without sending a single byte to an external API, Gemma 4 is the model family to evaluate.
This guide covers the full Gemma 4 lineup: what each variant does, how much hardware you actually need, how to deploy locally, and why it matters for the agentic AI systems we build at dcode.
The Gemma 4 Model Family
Gemma 4 ships four variants, each targeting a different deployment profile. Two things stand out immediately: every variant supports multimodal input natively, and every variant includes a built-in thinking mode for chain-of-thought reasoning.
| Model | Architecture | Total Parameters | Active Parameters | Context Window | Modalities |
|---|---|---|---|---|---|
| E2B | Dense + PLE | 2B | 2B | 128K | Text, Image, Audio |
| E4B | Dense + PLE | 4B | 4B | 128K | Text, Image, Audio |
| 26B-A4B | MoE | 26B | 4B | 256K | Text, Image |
| 31B | Dense | 31B | 31B | 256K | Text, Image |
The naming convention tells you the deployment story. E stands for edge — E2B and E4B are designed for phones, embedded systems, and lightweight local deployments. The 26B-A4B is the efficiency play: 26 billion total parameters, but only 4 billion active on any given token thanks to Mixture-of-Experts routing. The 31B is the dense powerhouse — every parameter fires on every token, maximum quality, maximum compute.
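To make the "only 4 billion active" idea concrete, here is a toy sketch of top-k expert routing in plain Python. It illustrates the general Mixture-of-Experts technique only: the expert functions, the gate scores, and k=2 are invented for illustration and say nothing about Gemma 4's actual router or expert count.

```python
import math

def route(gate_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)          # subtract the max for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by routing weight."""
    return sum(weight * experts[i](x) for i, weight in route(gate_logits, k))

# Four hypothetical "experts"; only two of them run for this input.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
print(moe_forward(2.0, experts, gate_logits=[0.0, 5.0, 5.0, -1.0]))
```

All four experts exist in memory, but only the two selected ones do any work for a given input. That is the sense in which total and active parameter counts differ.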
Which Variant to Choose
E2B — Pick it if you need an agent on a mobile device, a Raspberry Pi, or any environment with less than 8 GB of memory. Surprisingly capable for its size, with audio processing that the larger models lack.
E4B — Pick it if you want a step up from E2B without leaving the edge category. Strong enough for local assistants, document summarisation, and simple tool-calling agents. Also handles audio input.
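26B-A4B — Pick it if you want near-frontier performance on a single workstation or Mac. The MoE architecture gives you 26B-class quality at close to 4B-class speed, because only 4B parameters are active per token. Memory is the exception: all 26B parameters must still be loaded, so budget 16–18 GB at 4-bit. This is the sweet spot for most local agent deployments.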
31B — Pick it if accuracy is the priority and you have the hardware to match. The strongest open model under 35B parameters. Choose this for agents that handle high-stakes decisions — legal review, financial analysis, compliance assessments — where every percentage point of accuracy matters.
Benchmarks
Numbers matter more than marketing. Here is how the Gemma 4 family performs on standard benchmarks:
| Model | MMLU Pro | AIME 2026 | LiveCodeBench | MMMU Pro |
|---|---|---|---|---|
| 31B | 85.2% | 89.2% | 80.0% | 76.9% |
| 26B-A4B | 82.6% | 88.3% | 77.1% | 73.8% |
| E4B | 69.4% | 42.5% | 52.0% | 52.6% |
| E2B | 60.0% | 37.5% | 44.0% | 44.2% |
The standout number: 26B-A4B scores 82.6% on MMLU Pro while activating only 4 billion parameters per token. To put that in context, models that score in this range typically require 70B+ dense parameters and a multi-GPU setup. The MoE architecture makes this level of quality accessible on a single machine.
For agentic workloads, the AIME and LiveCodeBench scores are particularly relevant — they measure the kind of multi-step reasoning and code generation that agents need for tool use, planning, and autonomous task execution.
Hardware Requirements
This is the table that actually determines whether you can run Gemma 4. Memory requirements vary significantly by quantisation level:
| Variant | 4-bit | 8-bit | BF16 (full precision) |
|---|---|---|---|
| E2B | 4 GB | 5–8 GB | 10 GB |
| E4B | 5.5–6 GB | 9–12 GB | 16 GB |
| 26B-A4B | 16–18 GB | 28–30 GB | 52 GB |
| 31B | 17–20 GB | 34–38 GB | 62 GB |
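The BF16 column for the two larger variants can be sanity-checked with weights-only arithmetic: parameter count times bits per weight, divided by eight. The sketch below assumes 1 GB = 10^9 bytes and ignores KV cache, activations, and runtime overhead. It is not meant to reproduce the E2B and E4B rows, whose table figures are larger than their headline parameter counts alone would suggest.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes); excludes KV cache and overhead."""
    return params_billion * bits_per_weight / 8

for name, params in [("26B-A4B", 26), ("31B", 31)]:
    for label, bits in [("4-bit", 4), ("8-bit", 8), ("BF16", 16)]:
        print(f"{name} {label}: {weight_memory_gb(params, bits):.1f} GB")
```

The BF16 results (52 GB and 62 GB) match the table. The quantised rows in the table sit a few gigabytes above the raw numbers, which is consistent with the overhead this sketch leaves out.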
For Mac users: unified memory is your advantage. An M2 Pro with 32 GB handles the 26B-A4B at 4-bit comfortably. An M4 Max with 64 GB runs the 31B at 8-bit. Because the GPU reads weights directly from Apple Silicon’s high-bandwidth unified memory, token generation is typically much faster than CPU inference from the same amount of RAM on an x86 system.
For GPU servers: the 26B-A4B fits on a single RTX 4090 (24 GB) at 4-bit. The 31B at 8-bit needs an A100 40 GB or two consumer GPUs. For production multi-agent systems serving concurrent requests, budget for at least 2x the single-inference requirement.
Our recommendation for agent deployments: start with the 26B-A4B at 4-bit quantisation. The quality-to-resource ratio is exceptional, and 4-bit quantisation typically costs little quality on agentic tasks like tool selection, planning, and text generation.
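The hardware table can also be turned into a quick fit check. This is a minimal sketch: the numbers are the upper bounds of each range in the table, and the headroom_gb parameter is a hypothetical knob for reserving memory for the operating system and context, not something the table specifies.

```python
# Upper bounds in GB, copied from the hardware requirements table.
MEMORY_GB = {
    "E2B":     {"4-bit": 4,  "8-bit": 8,  "bf16": 10},
    "E4B":     {"4-bit": 6,  "8-bit": 12, "bf16": 16},
    "26B-A4B": {"4-bit": 18, "8-bit": 30, "bf16": 52},
    "31B":     {"4-bit": 20, "8-bit": 38, "bf16": 62},
}

def variants_that_fit(available_gb, precision="4-bit", headroom_gb=0.0):
    """Return the variants whose table requirement fits in the remaining budget."""
    budget = available_gb - headroom_gb
    return [name for name, need in MEMORY_GB.items() if need[precision] <= budget]

print(variants_that_fit(24))                 # e.g. a 24 GB GPU at 4-bit
print(variants_that_fit(24, "8-bit"))        # the same GPU at 8-bit
print(variants_that_fit(32, headroom_gb=8))  # a 32 GB machine, 8 GB reserved
```

The function only answers "does it fit"; which of the fitting variants to run is still the quality-versus-speed decision described above.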
Deploy with Ollama
Ollama is the fastest path to running Gemma 4 locally. One command, no configuration:
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run Gemma 4 26B-A4B (recommended for agents)
ollama run gemma4:27b
# Or the smaller variants
ollama run gemma4:4b
ollama run gemma4:2b
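# 31B dense (default tags are typically 4-bit builds; full precision needs 62+ GB RAM and a precision-specific tag)
ollama run gemma4:31b
Ollama does not pick a quantisation to suit your hardware: each tag maps to one fixed build, and default tags are typically 4-bit quantised. On a 32 GB Mac, the default 26B-A4B pull is therefore the 4-bit build, which fits comfortably. For 8-bit or BF16, pull a tag that names that precision explicitly.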
Ollama as an Agent Backend
Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. Any agent framework that supports the OpenAI API format (LangChain, CrewAI, AutoGen, or your own custom code) can therefore use Gemma 4 as its local model, usually by changing only the base URL and the model name:
# Test the API
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:27b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are the key provisions of the EU AI Act?"}
],
"temperature": 1.0,
"top_p": 0.95
}'
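The same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl call: the endpoint, model tag, and sampling values are taken from it, while the function names build_payload and chat are hypothetical helpers, and the response parsing assumes the usual OpenAI-style choices[0].message.content shape.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # endpoint from the curl example

def build_payload(model, system, user, temperature=1.0, top_p=0.95):
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
        "top_p": top_p,
    }

def chat(payload, url=OLLAMA_URL, timeout=120):
    """POST the payload to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI-compatible response shape.
    return body["choices"][0]["message"]["content"]
```

With ollama serve running, chat(build_payload("gemma4:27b", "You are a helpful assistant.", "What are the key provisions of the EU AI Act?")) would return the model's answer as a string. Keeping payload construction separate from the network call makes the request easy to inspect and unit-test without a server.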
For agent deployments, set OLLAMA_KEEP_ALIVE=-1 to prevent the model from being unloaded between requests:
export OLLAMA_KEEP_ALIVE=-1
ollama serve
Deploy with llama.cpp
For maximum control — custom quantisation, batch processing, specific hardware tuning — build llama.cpp from source:
# Clone and build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# macOS (Metal acceleration)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
# Linux with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Run the model
./build/bin/llama-cli \
-hf google/gemma-4-27b-it-GGUF \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
-c 32768 \
--conversation
Recommended Inference Parameters
Google’s recommended defaults for Gemma 4:
| Parameter | Value | Notes |
|---|---|---|
| temperature | 1.0 | Higher than typical — Gemma 4 is calibrated for it |
| top_p | 0.95 | Nucleus sampling |
| top_k | 64 | Token candidates |
| context | 32768 | Default; extend to 256K if needed and RAM allows |
Important: Gemma 4 is trained with temperature: 1.0 as default — not the 0.7 you might be used to from other models. Using lower temperatures can actually reduce output quality. Trust the calibration.
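For readers who want to see what these three sampling parameters do, here is a toy sketch in plain Python. It applies temperature scaling, then top-k, then top-p, which is one common ordering and an assumption here; real runtimes such as llama.cpp may order or implement the steps differently. The token names and logits are invented.

```python
import math
import random

def filter_candidates(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token: probability} after temperature scaling, top-k, then top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}

def sample(logits, rng=random, **params):
    """Draw one token from the filtered distribution."""
    candidates = filter_candidates(logits, **params)
    tokens = list(candidates)
    return rng.choices(tokens, weights=[candidates[t] for t in tokens], k=1)[0]

toy_logits = {"search": 5.0, "calculate": 1.0, "reply": 0.0, "abort": -3.0}
print(filter_candidates(toy_logits))                        # default settings
print(filter_candidates(toy_logits, top_k=2, top_p=1.0))    # top-k only
```

Lowering the temperature sharpens the distribution before filtering, so fewer tokens survive top-p; raising it flattens the distribution and lets more candidates through.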
Thinking Mode
Every Gemma 4 variant includes a built-in thinking mode — the model produces explicit chain-of-thought reasoning before generating its answer. This is not a prompt hack; it is trained into the model weights.
For agentic systems, thinking mode is transformative. An agent that can reason through its tool selection, evaluate multiple approaches, and explain its plan before executing produces dramatically better results — and dramatically better audit trails.
Enabling Thinking Mode
Add the <|think|> token at the start of your system prompt to activate thinking:
<|system|>
<|think|>
You are a task-planning agent. Break down complex requests into actionable steps,
select the appropriate tools for each step, and explain your reasoning.
<|end|>
The model will output its reasoning in <|channel>thought blocks before delivering the final answer. In production, you can parse these blocks separately — log them for audit, display them in a debug view, or use them for agent self-correction.
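Separating the reasoning from the answer can be sketched as follows. The opening marker is the one named above; the closing marker is an assumption, since the text does not state it, so both are parameters and should be checked against real model output. The function name split_thoughts is hypothetical.

```python
import re

THOUGHT_OPEN = "<|channel>thought"  # opening marker as named in the text
THOUGHT_CLOSE = "<channel|>"        # ASSUMPTION: closing marker, verify against real output

def split_thoughts(raw, open_marker=THOUGHT_OPEN, close_marker=THOUGHT_CLOSE):
    """Return (list of thought blocks, final answer with the blocks removed)."""
    pattern = re.compile(
        re.escape(open_marker) + r"(.*?)" + re.escape(close_marker),
        re.DOTALL,  # thought blocks usually span several lines
    )
    thoughts = [block.strip() for block in pattern.findall(raw)]
    answer = pattern.sub("", raw).strip()
    return thoughts, answer

raw_output = (
    "<|channel>thought\nThe user wants a booking, so I need the calendar tool."
    "<channel|>Booked for 3pm."
)
thoughts, answer = split_thoughts(raw_output)
print(thoughts)  # goes to the audit log
print(answer)    # goes to the user
```

Returning the thoughts as a list rather than one string keeps each reasoning block addressable, which helps when logging them per step or feeding one back for self-correction.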
When to Use Thinking Mode
Enable for: multi-step planning, tool selection, complex reasoning, compliance-sensitive decisions, anything where you need an audit trail of the agent’s logic.
Disable for: simple Q&A, high-throughput chat, latency-sensitive interactions where the thinking overhead is not justified.
Thinking mode roughly doubles the token output per request. Budget accordingly for both latency and cost (if using metered infrastructure).
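One way to apply this enable/disable split is to decide per request whether to prepend the thinking token. The sketch below is a minimal illustration: the task labels are hypothetical names for the categories listed above, and how a given serving stack passes the token through its chat template is left as an assumption to verify.

```python
THINK_TOKEN = "<|think|>"

# Hypothetical task labels, mirroring the "enable for" list above.
THINKING_TASKS = {"planning", "tool_selection", "complex_reasoning", "compliance_review"}

def system_prompt(base_prompt, task_type):
    """Prepend the thinking token only for task types that justify the overhead."""
    if task_type in THINKING_TASKS:
        return f"{THINK_TOKEN}\n{base_prompt}"
    return base_prompt

base = "You are a task-planning agent. Explain your reasoning."
print(system_prompt(base, "planning"))    # thinking enabled
print(system_prompt(base, "simple_qa"))   # thinking disabled
```

Because thinking roughly doubles output tokens, routing by task type keeps the latency cost confined to the requests that benefit from an audit trail.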
Multimodal Capabilities
All Gemma 4 variants process images natively. The E2B and E4B variants also handle audio. This is not a bolted-on adapter — multimodal understanding is trained into the base model.
For agent deployments, this unlocks:
- Document-processing agents — feed invoices, contracts, or reports as images; the agent extracts structured data without OCR pipelines
- Visual inspection agents — quality control, site documentation, inventory management from photos
- Audio-processing agents (E2B/E4B) — meeting transcription, voice command parsing, call centre analysis on edge devices
- Multimodal RAG — agents that reason over both text and visual content from knowledge bases
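As an illustration of the document-processing case, an image can be attached to a chat message as a base64 data URI in the OpenAI content-part format. This is a sketch under assumptions: that the local server's OpenAI-compatible endpoint accepts image_url parts in this form, and that the model tag you run has vision enabled. The helper name image_message is hypothetical.

```python
import base64

def image_message(prompt, image_bytes, mime="image/png"):
    """Build a user message that carries a text prompt plus one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }

# In practice image_bytes would come from open("invoice.png", "rb").read().
message = image_message("Extract the invoice number and total as JSON.", b"\x89PNG")
print(message["content"][1]["image_url"]["url"][:30])
```

The resulting dictionary goes into the messages list of a chat completion request in place of a plain-text user message.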
Why Gemma 4 Matters for Agentic AI in Europe
We build and operate multi-agent systems for European businesses. Three aspects of Gemma 4 are directly relevant to this work:
1. Data Sovereignty
With most of the EU AI Act’s obligations applying from August 2026 and GDPR already in full effect, the ability to run inference locally, with zero data leaving your network, is not a nice-to-have. For many regulated use cases it is the most direct way to meet data-residency and confidentiality requirements.
Gemma 4 running on Ollama or llama.cpp on EU-hosted infrastructure (Hetzner, OVH, or on-premises) gives you a fully sovereign AI layer. No API calls to US cloud providers. No data residency questions. No third-party model provider acting as a processor in your inference pipeline; if you rent servers, the hosting provider is the only processor left to contract with.
2. Cost Economics for Always-On Agents
Agents that run 24/7 — monitoring systems, processing emails, managing pipelines — accumulate significant API costs with cloud models. A single agent making 1,000 calls per day at $0.003 per 1K input tokens adds up quickly across a multi-agent fleet.
Local Gemma 4 deployment converts variable API costs into fixed infrastructure costs. Once your hardware is provisioned, marginal inference cost is effectively zero. For our 8-agent system at Inscape, this kind of economics is the difference between sustainable operations and runaway cloud bills.
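The API-cost side of this comparison is simple arithmetic once a prompt size is fixed. The sketch below uses the 1,000 calls per day and $0.003 per 1K input tokens from the text; the 2,000 input tokens per call and the 30-day month are hypothetical assumptions, and output tokens are ignored.

```python
def monthly_api_cost(calls_per_day, input_tokens_per_call, usd_per_1k_tokens,
                     agents=1, days=30):
    """Input-token API cost in USD for a fleet of identical agents."""
    tokens_per_day = calls_per_day * input_tokens_per_call
    return tokens_per_day / 1000 * usd_per_1k_tokens * days * agents

per_agent = monthly_api_cost(1000, 2000, 0.003)           # one agent
fleet = monthly_api_cost(1000, 2000, 0.003, agents=8)     # an 8-agent system
print(f"per agent: ${per_agent:,.0f}/month, fleet of 8: ${fleet:,.0f}/month")
```

Under those assumptions one agent costs about $180 per month in input tokens and an 8-agent fleet about $1,440, before output tokens. Comparing that figure with the monthly cost of the hardware gives a rough break-even point.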
3. Latency and Availability
Local inference eliminates network latency and API availability as failure modes. Your agents do not go down because a cloud provider has an outage. They do not slow down because you hit a rate limit. They do not queue because of peak-hour congestion.
For agents that need to respond in real-time — customer-facing assistants, monitoring watchdogs, financial processors — this reliability is essential.
Getting Started
- Evaluate your hardware — check the requirements table above against your available memory
- Install Ollama — one command, works on macOS, Linux, and Windows
- Pull the 26B-A4B — the best quality-to-resource ratio for most agent use cases
- Test with thinking mode — enable <|think|> and observe the reasoning quality
- Integrate with your agent framework — Ollama’s OpenAI-compatible API works with any framework
- Benchmark on your workload — run your actual agent tasks, not just generic benchmarks
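For the last step, a small timing harness is often enough to start with. This is a minimal sketch: run_task stands for whatever function executes one of your real agent tasks (hypothetical here), and the harness measures wall-clock latency only, not answer quality.

```python
import statistics
import time

def benchmark(run_task, tasks, repeats=3):
    """Time run_task on each task several times and summarise the latencies."""
    timings = []
    for task in tasks:
        for _ in range(repeats):
            start = time.perf_counter()
            run_task(task)
            timings.append(time.perf_counter() - start)
    return {
        "runs": len(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }

# Stand-in task function; replace it with a call into your agent.
print(benchmark(lambda task: task.upper(), ["summarise report", "pick a tool"]))
```

Reporting the maximum alongside the median matters for agents, because a single slow step can stall a whole multi-step plan.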
If you are evaluating open models for production agent deployment — particularly in regulated European environments — Gemma 4 should be at the top of your shortlist. The combination of MoE efficiency, 256K context, native multimodal support, and built-in reasoning makes it the most complete open model family available today.
At dcode, we design, build, and operate multi-agent systems for European businesses. If you are evaluating local model deployment for your agentic AI infrastructure, get in touch — we have done this before and we can help you do it right.