
Gemma 4: Deploy Google's Best Open Models for Agentic AI

Gemma 4 brings MoE efficiency, 256K context, and multimodal reasoning to local deployment. Complete guide with benchmarks, hardware specs, and agent use cases.

Google DeepMind did not just release another model. With Gemma 4, they released four — spanning edge devices to workstation-grade inference — and quietly made the strongest case yet for open-weight models in production agentic systems.

If you are building AI agents that need to run locally, process documents and images, reason through multi-step plans, and do all of this without sending a single byte to an external API, Gemma 4 is the model family to evaluate.

This guide covers the full Gemma 4 lineup: what each variant does, how much hardware you actually need, how to deploy locally, and why it matters for the agentic AI systems we build at dcode.

The Gemma 4 Model Family

Gemma 4 ships four variants, each targeting a different deployment profile. Two things stand out immediately: every variant supports multimodal input natively, and every variant includes a built-in thinking mode for chain-of-thought reasoning.

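| Model | Architecture | Total Parameters | Active Parameters | Context Window | Modalities |
|---|---|---|---|---|---|
| E2B | Dense + PLE | 2B | 2B | 128K | Text, Image, Audio |
| E4B | Dense + PLE | 4B | 4B | 128K | Text, Image, Audio |
| 26B-A4B | MoE | 26B | 4B | 256K | Text, Image |
| 31B | Dense | 31B | 31B | 256K | Text, Image |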

The naming convention tells you the deployment story. The E prefix marks the edge-oriented models and refers to their effective parameter count: E2B and E4B are designed for phones, embedded systems, and lightweight local deployments. The 26B-A4B is the efficiency play: 26 billion total parameters, but only 4 billion active on any given token thanks to Mixture-of-Experts routing. The 31B is the dense powerhouse: every parameter fires on every token, maximum quality, maximum compute.
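To make "only 4 billion active on any given token" concrete, here is a minimal sketch of generic top-k expert routing. It is not Gemma 4's actual router: the expert count, the value of k, and the toy expert functions are all made up for illustration.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their softmax weights."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable experts; the rest are skipped for this token.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_forward(token, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](token) for i, w in weights.items())

# Four hypothetical "experts", each just scaling its input.
toy_experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
toy_logits = [0.1, 2.0, -1.0, 2.0]  # the router favours experts 1 and 3
```

With these toy values, `moe_forward(1.0, toy_experts, toy_logits)` runs only two of the four experts. That is the sense in which an MoE model stores all of its parameters but computes with a fraction of them per token.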

Which Variant to Choose

E2B: Pick it if you need an agent on a mobile device, a Raspberry Pi, or any environment with less than 8 GB of memory. Surprisingly capable for its size, with audio processing that the larger models lack.

E4B: Pick it if you want a step up from E2B without leaving the edge category. Strong enough for local assistants, document summarisation, and simple tool-calling agents. Also handles audio input.

26B-A4B: Pick it if you want near-frontier performance on a single workstation or Mac. The MoE architecture means you get 26B-class quality at roughly 4B-class compute per token, although all 26 billion parameters must still fit in memory (16–18 GB at 4-bit). This is the sweet spot for most local agent deployments.

31B: Pick it if accuracy is the priority and you have the hardware to match. The strongest open model under 35B parameters. Choose this for agents that handle high-stakes decisions (legal review, financial analysis, compliance assessments) where every percentage point of accuracy matters.

Benchmarks

Numbers matter more than marketing. Here is how the Gemma 4 family performs on standard benchmarks:

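| Model | MMLU Pro | AIME 2026 | LiveCodeBench | MMMU Pro |
|---|---|---|---|---|
| 31B | 85.2% | 89.2% | 80.0% | 76.9% |
| 26B-A4B | 82.6% | 88.3% | 77.1% | 73.8% |
| E4B | 69.4% | 42.5% | 52.0% | 52.6% |
| E2B | 60.0% | 37.5% | 44.0% | 44.2% |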

The standout number: 26B-A4B scores 82.6% on MMLU Pro while activating only 4 billion parameters per token. To put that in context, models that score in this range typically require 70B+ dense parameters and a multi-GPU setup. The MoE architecture makes this level of quality accessible on a single machine.

For agentic workloads, the AIME and LiveCodeBench scores are particularly relevant — they measure the kind of multi-step reasoning and code generation that agents need for tool use, planning, and autonomous task execution.

Hardware Requirements

This is the table that actually determines whether you can run Gemma 4. Memory requirements vary significantly by quantisation level:

| Variant | 4-bit | 8-bit | BF16 (full precision) |
|---|---|---|---|
| E2B | 4 GB | 5–8 GB | 10 GB |
| E4B | 5.5–6 GB | 9–12 GB | 16 GB |
| 26B-A4B | 16–18 GB | 28–30 GB | 52 GB |
| 31B | 17–20 GB | 34–38 GB | 62 GB |

For Mac users: unified memory is your advantage. An M2 Pro with 32 GB handles the 26B-A4B at 4-bit comfortably. An M4 Max with 64 GB runs the 31B at 8-bit. Apple Silicon’s memory bandwidth makes inference surprisingly fast compared to equivalent RAM on x86 systems.

For GPU servers: the 26B-A4B fits on a single RTX 4090 (24 GB) at 4-bit. The 31B at 8-bit needs an A100 40 GB or two consumer GPUs. For production multi-agent systems serving concurrent requests, budget for at least 2x the single-inference requirement.

Our recommendation for agent deployments: start with the 26B-A4B at 4-bit quantisation. The quality-to-resource ratio is exceptional, and 4-bit quantisation on modern architectures introduces negligible quality loss for agentic tasks like tool selection, planning, and text generation.
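As a rough cross-check of the table, weight storage can be estimated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope helper, not a sizing tool: it ignores the KV cache and runtime overhead, and the 4 GB headroom default is a hypothetical allowance for the OS and other applications. It is only checked here against the two larger variants.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB: parameters times bytes per parameter."""
    return params_billion * bits_per_weight / 8

def fits(required_gb, available_gb, headroom_gb=4.0):
    """True if the requirement plus headroom fits in the available memory.

    headroom_gb is a HYPOTHETICAL allowance for the OS and other applications.
    """
    return required_gb + headroom_gb <= available_gb
```

For the 26B-A4B and 31B at 16 bits per weight this gives 52 GB and 62 GB, matching the BF16 column. At 4 bits, the 26B-A4B weights alone come to 13 GB; the 16–18 GB in the table presumably covers context and runtime overhead on top of that. Feeding the table's upper bound into `fits(18, 32)` agrees with the 32 GB Mac recommendation.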

Deploy with Ollama

Ollama is the fastest path to running Gemma 4 locally. One command, no configuration:

# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Gemma 4 26B-A4B (recommended for agents)
ollama run gemma4:27b

# Or the smaller variants
ollama run gemma4:4b
ollama run gemma4:2b

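# 31B dense (the default tag is typically a quantised build; BF16 weights need 62+ GB)
ollama run gemma4:31b

Ollama does not choose a quantisation to match your hardware: each tag maps to a fixed build, and default tags are typically 4-bit quantised. On a 32 GB Mac, the default 27B tag therefore pulls a 4-bit build that fits in memory. Check the model's page in the Ollama library for the exact tag names and any higher-precision builds.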

Ollama as an Agent Backend

Ollama exposes an OpenAI-compatible API on localhost:11434. Any agent framework that supports the OpenAI API format (LangChain, CrewAI, AutoGen, or your own custom code) can therefore use Gemma 4 as its local model by changing only the base URL and the model name:

# Test the API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:27b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What are the key provisions of the EU AI Act?"}
    ],
    "temperature": 1.0,
    "top_p": 0.95
  }'
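The same request can be made from custom agent code without any SDK. The sketch below uses only the Python standard library and mirrors the curl call above; the function names are hypothetical, and the model tag is simply the one used in this guide.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model, messages, temperature=1.0, top_p=0.95, url=OLLAMA_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(model, messages, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, messages)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]

# Usage (needs a running `ollama serve` with the model already pulled):
#   print(chat("gemma4:27b", [{"role": "user", "content": "Say hello."}]))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the model, which is handy when debugging agent prompts.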

For agent deployments, set OLLAMA_KEEP_ALIVE=-1 to prevent the model from being unloaded between requests:

export OLLAMA_KEEP_ALIVE=-1
ollama serve

Deploy with llama.cpp

For maximum control — custom quantisation, batch processing, specific hardware tuning — build llama.cpp from source:

# Clone and build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# macOS (Metal acceleration)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j

# Linux with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Run the model
./build/bin/llama-cli \
  -hf google/gemma-4-27b-it-GGUF \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  -c 32768 \
  --conversation

Google’s recommended defaults for Gemma 4:

| Parameter | Value | Notes |
|---|---|---|
| temperature | 1.0 | Higher than typical; Gemma 4 is calibrated for it |
| top_p | 0.95 | Nucleus sampling |
| top_k | 64 | Token candidates |
| context | 32768 | Default; extend to 256K if needed and RAM allows |

Important: Gemma 4 is trained with temperature: 1.0 as default — not the 0.7 you might be used to from other models. Using lower temperatures can actually reduce output quality. Trust the calibration.
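For readers unfamiliar with these knobs, the following sketch shows one common way temperature, top_k, and top_p combine to narrow the candidate tokens before sampling. It is a simplified illustration; the order and details differ between inference engines, and this is not llama.cpp's or Ollama's actual sampler code.

```python
import math

def filter_candidates(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} for the tokens that survive filtering."""
    # Temperature rescales logits before softmax; 1.0 leaves them unchanged.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = {i: e / total for i, e in enumerate(exps)}

    # top_k: keep only the k most probable tokens.
    ranked = sorted(probs, key=probs.get, reverse=True)[:top_k]
    kept_mass = sum(probs[i] for i in ranked)

    # top_p: keep the smallest prefix whose renormalised mass reaches top_p.
    survivors, cumulative = [], 0.0
    for i in ranked:
        survivors.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(probs[i] for i in survivors)
    return {i: probs[i] / norm for i in survivors}
```

Lowering the temperature sharpens the distribution so that fewer tokens pass the top_p cut.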

Thinking Mode

Every Gemma 4 variant includes a built-in thinking mode — the model produces explicit chain-of-thought reasoning before generating its answer. This is not a prompt hack; it is trained into the model weights.

For agentic systems, thinking mode is transformative. An agent that can reason through its tool selection, evaluate multiple approaches, and explain its plan before executing produces dramatically better results — and dramatically better audit trails.

Enabling Thinking Mode

Add the <|think|> token at the start of your system prompt to activate thinking:

<|system|>
<|think|>
You are a task-planning agent. Break down complex requests into actionable steps,
select the appropriate tools for each step, and explain your reasoning.
<|end|>

The model will output its reasoning in <|channel>thought blocks before delivering the final answer. In production, you can parse these blocks separately — log them for audit, display them in a debug view, or use them for agent self-correction.
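Separating the reasoning from the final answer can be sketched with a small parser. The opening marker is the one named above; the closing marker `<channel|>` is an assumption, so check it against the raw output of your runtime before relying on it. The function name is hypothetical.

```python
import re

# Opening marker as described in the text; the closing marker is an ASSUMPTION.
THOUGHT_OPEN = "<|channel>thought"
THOUGHT_CLOSE = "<channel|>"

def split_thoughts(raw, open_marker=THOUGHT_OPEN, close_marker=THOUGHT_CLOSE):
    """Return (thoughts, answer): the reasoning blocks and the remaining text."""
    pattern = re.compile(
        re.escape(open_marker) + r"(.*?)" + re.escape(close_marker),
        re.DOTALL,  # reasoning usually spans several lines
    )
    thoughts = [block.strip() for block in pattern.findall(raw)]
    answer = pattern.sub("", raw).strip()
    return thoughts, answer
```

The `thoughts` list can go to an audit log or debug view while only `answer` is shown to the user or passed to the next agent step. If the output contains no thought block, the function returns an empty list and the unchanged text.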

When to Use Thinking Mode

Enable for: multi-step planning, tool selection, complex reasoning, compliance-sensitive decisions, anything where you need an audit trail of the agent’s logic.

Disable for: simple Q&A, high-throughput chat, latency-sensitive interactions where the thinking overhead is not justified.

Thinking mode roughly doubles the token output per request. Budget accordingly for both latency and cost (if using metered infrastructure).
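One way to apply the enable/disable guidance per request is to decide from the task type whether to prepend the token. This is a minimal sketch: the task labels and the function name are hypothetical, and the token is the one given in this guide.

```python
THINK_TOKEN = "<|think|>"  # activation token as given in this guide

# HYPOTHETICAL task labels, loosely mirroring the guidance above.
THINKING_TASKS = {"planning", "tool_selection", "complex_reasoning", "compliance"}

def build_system_prompt(instructions, task_type):
    """Prefix the system prompt with the think token only for reasoning-heavy tasks."""
    if task_type in THINKING_TASKS:
        return f"{THINK_TOKEN}\n{instructions}"
    return instructions
```

Routing this way keeps simple chat turns fast while still producing a reasoning trail for the requests that need one.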

Multimodal Capabilities

All Gemma 4 variants process images natively. The E2B and E4B variants also handle audio. This is not a bolted-on adapter — multimodal understanding is trained into the base model.

For agent deployments, this unlocks:

  • Document-processing agents — feed invoices, contracts, or reports as images; the agent extracts structured data without OCR pipelines
  • Visual inspection agents — quality control, site documentation, inventory management from photos
  • Audio-processing agents (E2B/E4B) — meeting transcription, voice command parsing, call centre analysis on edge devices
  • Multimodal RAG — agents that reason over both text and visual content from knowledge bases

Why Gemma 4 Matters for Agentic AI in Europe

We build and operate multi-agent systems for European businesses. Three aspects of Gemma 4 are directly relevant to this work:

1. Data Sovereignty

With the EU AI Act's main obligations applying from August 2026 and GDPR already in full effect, the ability to run inference locally, with zero data leaving your network, is not a nice-to-have. For many use cases it is the most direct way to meet data-residency and confidentiality requirements.

Gemma 4 running on Ollama or llama.cpp on EU-hosted infrastructure (Hetzner, OVH, or on-premises) gives you a fully sovereign AI layer. No API calls to US cloud providers. No data residency questions. No third-party processor agreements for your inference pipeline.

2. Cost Economics for Always-On Agents

Agents that run 24/7 — monitoring systems, processing emails, managing pipelines — accumulate significant API costs with cloud models. A single agent making 1,000 calls per day at $0.003 per 1K input tokens adds up quickly across a multi-agent fleet.

Local Gemma 4 deployment converts variable API costs into fixed infrastructure costs. Once your hardware is provisioned, marginal inference cost is effectively zero. For our 8-agent system at Inscape, this kind of economics is the difference between sustainable operations and runaway cloud bills.
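To put a number on "adds up quickly", the arithmetic can be sketched as follows. The call volume and price come from the text above. The 4,000 input tokens per call is a hypothetical figure chosen for illustration, and output tokens, which are usually priced higher, are ignored.

```python
def monthly_input_cost(calls_per_day, input_tokens_per_call, price_per_1k_tokens, days=30):
    """Input-token API cost in dollars over `days` days for one agent."""
    daily_tokens = calls_per_day * input_tokens_per_call
    return round(daily_tokens / 1000 * price_per_1k_tokens * days, 2)

# 1,000 calls/day and $0.003 per 1K input tokens come from the text;
# 4,000 input tokens per call is a HYPOTHETICAL figure for illustration.
per_agent = monthly_input_cost(1_000, 4_000, 0.003)
fleet = round(per_agent * 8, 2)  # an 8-agent fleet, as mentioned in the text
```

Under these assumptions a single agent costs $360 per month in input tokens alone, and an 8-agent fleet $2,880. Substitute your own token counts and prices to compare against the fixed cost of local hardware.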

3. Latency and Availability

Local inference eliminates network latency and API availability as failure modes. Your agents do not go down because a cloud provider has an outage. They do not slow down because you hit a rate limit. They do not queue because of peak-hour congestion.

For agents that need to respond in real-time — customer-facing assistants, monitoring watchdogs, financial processors — this reliability is essential.

Getting Started

  1. Evaluate your hardware — check the requirements table above against your available memory
  2. Install Ollama — one command, works on macOS, Linux, and Windows
  3. Pull the 26B-A4B — the best quality-to-resource ratio for most agent use cases
  4. Test with thinking mode — enable <|think|> and observe the reasoning quality
  5. Integrate with your agent framework — Ollama’s OpenAI-compatible API works with any framework
  6. Benchmark on your workload — run your actual agent tasks, not just generic benchmarks
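For step 6, a workload benchmark can start as a simple timing loop around whatever function calls your model. The sketch below is a minimal harness: `call_model` is a placeholder for your own client function, and it measures latency only, not answer quality.

```python
import statistics
import time

def benchmark(call_model, prompts):
    """Time call_model(prompt) for each prompt; return latencies and their median."""
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        call_model(prompt)  # the result is discarded; only timing is recorded
        latencies.append(time.perf_counter() - start)
    return {"latencies_s": latencies, "median_s": statistics.median(latencies)}
```

Run it with prompts taken from real agent traces, once with thinking mode on and once with it off, to see the latency trade-off on your own hardware.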

If you are evaluating open models for production agent deployment — particularly in regulated European environments — Gemma 4 should be at the top of your shortlist. The combination of MoE efficiency, 256K context, native multimodal support, and built-in reasoning makes it the most complete open model family available today.


At dcode, we design, build, and operate multi-agent systems for European businesses. If you are evaluating local model deployment for your agentic AI infrastructure, get in touch — we have done this before and we can help you do it right.

Frequently Asked Questions

What is Gemma 4?
Gemma 4 is Google DeepMind's latest family of open-weight language models, released in 2026. It includes four variants ranging from 2B to 31B parameters, all supporting multimodal input (text and images) and featuring built-in thinking/reasoning capabilities. The models are open-weight, meaning you can download and run them locally without any API dependency or data leaving your infrastructure.
What is the difference between Gemma 4 26B-A4B and 31B?
The 26B-A4B uses a Mixture-of-Experts (MoE) architecture: it has 26B total parameters but activates only 4B per token, which makes it significantly faster per token. The 31B is a dense model that activates all parameters on every token, delivering higher accuracy (85.2% vs 82.6% on MMLU Pro) at the cost of roughly 8x more active parameters, and therefore compute, per token. Memory needs are similar, because the MoE model must still hold all 26B parameters: 16–18 GB versus 17–20 GB at 4-bit. Choose 26B-A4B when speed and efficiency matter; choose 31B when you need maximum quality and have the hardware.
Can I run Gemma 4 on a Mac?
Yes. The E2B and E4B variants run comfortably on any modern Mac. The 26B-A4B at 4-bit quantisation requires 16–18 GB of unified memory, so it needs a machine configured with more than 16 GB; a 32 GB M1 Pro/Max or later handles it comfortably. The 31B at 4-bit needs 17–20 GB. For M1/M2 base models with 8 GB, stick to E2B or E4B. All variants work with Ollama or llama.cpp on macOS.
Is Gemma 4 good for AI agents?
Yes — Gemma 4 is one of the strongest open model families for agentic workloads. The built-in thinking mode enables structured reasoning for tool selection and multi-step planning. The 256K context window handles long agent conversations and large document processing. Function calling works reliably on the 26B and 31B variants. And local deployment means your agent data never leaves your infrastructure — critical for EU compliance and sensitive business operations.
How does Gemma 4 compare to Llama and Qwen?
Gemma 4's 26B-A4B MoE model is competitive with or exceeds Llama 3.3 70B and Qwen 2.5 72B on key benchmarks — while using a fraction of the compute. The 31B dense model sets a new bar for open models under 35B parameters. The key differentiators are native multimodal support, 256K context, and the built-in thinking mode — features that require additional tooling or prompting with Llama and Qwen.
What hardware do I need for Gemma 4 in production?
For production agent deployments, we recommend: E4B at 8-bit for edge/embedded agents (9-12 GB RAM); 26B-A4B at 4-bit for general-purpose agents on workstations or small servers (16-18 GB RAM); 31B at 8-bit for high-accuracy agents on GPU servers (34-38 GB VRAM). For multi-agent systems serving concurrent requests, add a GPU with sufficient VRAM or use multiple Mac Studios with unified memory.
