Running Gemma 4 on a Raspberry Pi 5 with the Hailo-8: What Actually Works


You cannot run Gemma 4 on a Hailo-8. That is not a headline designed to disappoint you; it is simply a hardware fact. The Hailo-8 has 18 megabytes of on-chip SRAM and no external memory interface. Gemma 4 E2B, the smallest variant, needs roughly 5 gigabytes of RAM at 4-bit quantisation. There is no software trick that closes a 270x memory gap.

What you can do is build a genuinely useful hybrid system: the Hailo-8 handles vision tasks and embeddings at 26 TOPS while the Raspberry Pi 5's ARM Cortex-A76 cores run Gemma 4 through llama.cpp at a usable 3 to 8 tokens per second. It is not a marketing demo. It is a real, deployable edge AI pipeline.

This article walks through the hardware, the honest limitations, the setup process, and the upgrade path to Hailo's newer chips that actually do support LLM inference.

The Hardware

Two components make up this build.

Raspberry Pi 5 (8GB model recommended). The quad-core Cortex-A76 at 2.4GHz is no slouch for integer-heavy LLM inference, and 8GB of LPDDR4X gives you just enough headroom for Gemma 4 E2B at Q4 quantisation alongside the operating system. The Pi 5 also brings a proper PCIe x1 lane (Gen2 by default, switchable to Gen3 with a one-line config change), which is how you connect the accelerator.

Waveshare Hailo-8 M.2 AI Accelerator Module (SKU 27812, kit SKU 27841, around $188). This is an M.2 Key M module, 22x42mm, that slots into a HAT or a PCIe-to-M.2 adapter on the Pi 5. It delivers 26 TOPS at INT8 precision while drawing just 2.5 watts under typical load (8.65W max). For computer vision workloads, those numbers are impressive. YOLOv8n runs at 270 FPS and ResNet-50 at 1332 FPS.

The card connects via PCIe Gen3 x4 on paper, but the Pi 5 only exposes a single lane. That does not matter much for the Hailo-8's target workloads, but it is worth knowing.

The Limitation

The Hailo-8 was designed for CNN-based vision models, not transformer-based language models. Its architecture centres on 18MB of on-chip SRAM: 16MB for weights and 2MB for activations. There is no path to external DRAM. Every layer of the network must fit within that budget.

Gemma 4 E2B has 2.3 billion effective parameters. Even at aggressive 4-bit quantisation, that is roughly 1.15 gigabytes of weights alone. The activation memory for a transformer with 128K context window dwarfs the 2MB available on-chip. The architecture simply does not match the workload.

For reference, even the smallest models the Hailo-8 can handle are tightly optimised vision networks with weight budgets measured in single-digit megabytes. A text transformer of any useful size is orders of magnitude beyond what the chip can address.
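To make the mismatch concrete, here is the back-of-the-envelope arithmetic, using the usual approximation of 0.5 bytes per weight for 4-bit quantisation (real Q4 formats carry some block overhead on top of this):

```python
# Rough check: can Gemma 4 E2B's Q4 weights fit in the Hailo-8's weight SRAM?
HAILO8_WEIGHT_SRAM_MB = 16         # on-chip weight budget quoted above
effective_params = 2.3e9           # Gemma 4 E2B effective parameters
q4_bytes = effective_params * 0.5  # 4-bit quantisation ~= 0.5 bytes per weight

weights_mb = q4_bytes / 2**20
print(f"Q4 weights: {weights_mb:.0f} MB vs {HAILO8_WEIGHT_SRAM_MB} MB SRAM "
      f"-> {weights_mb / HAILO8_WEIGHT_SRAM_MB:.0f}x over budget")
```

Weights alone overshoot the 16MB budget by nearly 70x, before counting activations or the KV cache.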

If you are buying a Hailo-8 expecting to offload Gemma 4 inference onto it, you are buying the wrong tool. Save your money, or buy it knowing you will use it for what it was built for.

Setting Up Hailo-8 on Raspberry Pi 5

Assuming you are running Raspberry Pi OS 64-bit Bookworm, the installation is straightforward.

First, enable PCIe Gen3. The link comes up at Gen2 by default, which throttles the Hailo-8 on high-throughput workloads. Edit the boot configuration:

sudo nano /boot/firmware/config.txt

Add this line if it is not already present:

dtparam=pciex1_gen=3

Reboot, then install the Hailo software stack:

sudo apt update
sudo apt install hailo-all

Verify the hardware is detected:

lspci | grep Hailo

You should see the Hailo device listed on the PCIe bus. Confirm the firmware is loaded and the chip is responsive:

hailortcli fw-control identify

This returns the device ID, firmware version, and thermal status. If you get output here, the hardware is working and ready for inference.

One caveat: the Hailo Dataflow Compiler (DFC), which compiles models into the Hailo runtime format (HEF files), is an x86-64 tool and does not run on the Pi. Compile custom networks on a desktop machine, or start from the precompiled HEFs in the Hailo Model Zoo; the Pi only needs the HailoRT runtime installed above to execute them.

Running Gemma 4 on Raspberry Pi 5 CPU

With the Hailo-8 handling vision tasks, the Pi 5's CPU handles language model inference. For Gemma 4, the smallest practical variant on this hardware is E2B (5.1 billion total parameters, 2.3 billion effective). At Q4_K_M quantisation through llama.cpp, it produces 3 to 8 tokens per second and uses roughly 5GB of RAM.

You can find GGUF quantised files on Hugging Face from either unsloth or bartowski. Download the Q4_K_M file:

# From unsloth
wget https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf

# Or bartowski for E4B if you have 8GB RAM and patience
wget https://huggingface.co/bartowski/google_gemma-4-E4B-it-GGUF/resolve/main/google_gemma-4-E4B-it-Q4_K_M.gguf

Build llama.cpp from source. Recent versions build with CMake (the old Makefile path has been removed), and the NEON optimisations the Pi 5's ARM cores benefit from are enabled automatically:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j4

Run inference:

./build/bin/llama-cli -m gemma-4-E2B-it-Q4_K_M.gguf \
  -p "Explain quantum computing in simple terms" \
  -n 256 -t 4

The -t 4 flag uses all four cores. Expect the first prompt evaluation to be slow; subsequent tokens stream at 3 to 8 per second depending on context length and system load.

For an API endpoint that other processes or network clients can call, start the server:

./build/bin/llama-server -m gemma-4-E2B-it-Q4_K_M.gguf \
  --port 8080 --host 0.0.0.0 -t 4

This exposes an OpenAI-compatible API on port 8080. Your Hailo-8 vision pipeline can feed extracted text or image embeddings directly into Gemma 4 through this endpoint.
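The quoted 3 to 8 tokens per second is worth translating into wall-clock time before you design an interaction around it. Ignoring the (slower) first prompt evaluation, a full reply takes:

```python
# Wall-clock time for a reply at the throughput range quoted above.
for tps in (3, 8):              # tokens per second: slow vs fast case
    for n_tokens in (64, 256):  # short answer vs the -n 256 example
        print(f"{n_tokens:>4} tokens at {tps} tok/s ~ {n_tokens / tps:5.1f} s")
```

A short answer arrives in seconds; a 256-token reply at the slow end takes well over a minute. Plan prompts and max-token limits accordingly.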

The Hybrid Architecture

This is where the build actually becomes interesting. The two components serve distinct roles in a pipeline.

Hailo-8 processes visual input. It runs object detection with YOLOv8n at 270 FPS, image classification with ResNet-50 at 1332 FPS, OCR with PaddleOCR, or generates CLIP and SigLIP embeddings at 27.7 FPS for image search and retrieval tasks. The Whisper encoder can also run on the Hailo-8 for audio feature extraction.

RPi5 CPU runs Gemma 4 for reasoning and generation. The language model takes the structured output from the Hailo pipeline and produces natural language responses. Detected objects become descriptions. Extracted text becomes summaries. Audio features become transcriptions with context.

The glue between them is the llama-server endpoint. The Hailo runtime produces structured data (bounding boxes, labels, text strings, embedding vectors). A lightweight Python script posts that data to the Gemma 4 API and streams back the response. Total system power draw stays under 15 watts.
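As a sketch of that glue layer (the detection dict format and both helper names are illustrative assumptions, not part of the Hailo or llama.cpp APIs; only the /v1/chat/completions route is llama-server's actual OpenAI-compatible endpoint):

```python
# Sketch: turn Hailo-8 detections into a prompt and post it to llama-server.
# The detection format here is a hypothetical example of the structured data
# the Hailo runtime produces; adapt it to your pipeline's actual output.
import json
import urllib.request

LLAMA_URL = "http://localhost:8080/v1/chat/completions"  # llama-server default

def detections_to_prompt(detections):
    """Flatten bounding-box output into a natural-language request."""
    labels = ", ".join(f"{d['label']} ({d['confidence']:.0%})" for d in detections)
    return f"The camera currently sees: {labels}. Describe the scene in one sentence."

def ask_gemma(prompt, max_tokens=64):
    """POST the prompt to the OpenAI-compatible endpoint, return the reply text."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(LLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    dets = [{"label": "person", "confidence": 0.91},
            {"label": "bicycle", "confidence": 0.78}]
    prompt = detections_to_prompt(dets)
    print(prompt)  # ask_gemma(prompt) would send this to Gemma 4 on the Pi
```

The same pattern works for OCR output or Whisper features: serialise the Hailo result into text, post it, stream back the generation.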

This is a realistic edge AI deployment. It will not set speed records, but it runs entirely on a $120 computer and a $188 accelerator, consuming less power than a single desktop monitor.

When to Upgrade

If your use case genuinely needs hardware-accelerated LLM inference on the Pi, wait for the right chip.

The Hailo-10H is available now via the Raspberry Pi AI HAT+ 2 (2026). It adds 4 to 8GB of LPDDR4 to the 40 TOPS INT4 compute engine, which means it can actually hold model weights. It runs Qwen2.5-1.5B at 9.45 tokens per second and Llama3.2-1B at 8.48 tokens per second. The catch: there is no Gemma 4 support yet. The Hailo compiler currently targets Qwen2, Llama, and DeepSeek architectures.

The Hailo-15H, expected in late 2026, targets 7B+ models and could support Gemma 4 variants with INT4 quantisation. That would mean running the E4B model entirely on the accelerator, or even the 26B MoE variant at aggressive quantisation. No firm benchmarks exist yet.

Here is a comparison of what each option offers for LLM workloads:

| Configuration | Gemma 4 support | Inference | Memory | Power | Price |
|---|---|---|---|---|---|
| RPi5 CPU only (llama.cpp) | E2B at Q4 | 3-8 tok/s | 5GB RAM | ~7W (SoC) | ~$120 |
| RPi5 + Hailo-8 | None (vision only) | N/A | N/A | ~10W total | ~$308 |
| RPi5 + Hailo-10H | None yet (Qwen/Llama only) | 9.45 tok/s (Qwen2.5-1.5B) | 4-8GB | ~10W total | ~$350 |
| RPi5 + Hailo-15H (upcoming) | Possible (7B+ INT4) | TBD | TBD | TBD | TBD |

Conclusion

The Hailo-8 is an excellent vision accelerator and a poor fit for LLMs. That is not a criticism of the hardware; it is a description of what it was engineered to do. Running Gemma 4 on a Raspberry Pi 5 works, but it runs on the CPU, not the accelerator.

What you get from combining both is a capable, low-power edge system that handles perception and reasoning in a single box. Vision models run at hundreds of frames per second on the Hailo-8. Gemma 4 E2B provides language understanding and generation at 3 to 8 tokens per second on the Pi's CPU. Together, they form a pipeline that would have required a desktop GPU two years ago.

If your priority is running Gemma 4 specifically, the Hailo-8 is optional. Save the $188, buy an 8GB Pi 5, and run llama.cpp directly. If you need both vision and language on an edge device with minimal power draw, the hybrid setup is a genuinely practical architecture. Just go in with realistic expectations.

For a full breakdown of Gemma 4 model variants, quantisation options, and deployment targets, see the Gemma 4 model guide.