# Benchmarking gemma4:26b on a Mac mini M4 Pro

Date: 2026-06-23
Model: `gemma4:26b` (Ollama, 17 GB on disk)
Hardware: Apple Mac mini, M4 Pro (14-core: 10P + 4E), 64 GB unified memory
Runtime: Ollama, 100% GPU (Metal), context window 262144

## Key finding: it's a thinking model

`gemma4:26b` emits its chain-of-thought into a separate `message.thinking`
field before producing the final answer in `message.content`. With a small
`num_predict` budget it can spend all its tokens reasoning and never emit a
visible answer. For real tasks, set `num_predict >= 768` and parse both fields.

## Steady-state performance

| Metric                       | Value                        |
|------------------------------|------------------------------|
| Generation rate              | ~60 tok/s (59.5–61.5)        |
| Prompt eval rate             | ~135–270 tok/s               |
| Cold model load              | ~8.7 s (first call)          |
| Warm model load              | ~0.23 s                      |
| Memory footprint             | 17 GB, 100% on GPU           |
| Latency per output token     | ~16.7 ms/tok (linear)        |

Generation throughput is remarkably consistent across task types (factual,
code, math reasoning, creative) — it does not degrade with longer outputs or
more complex prompts in these runs.

## Per-benchmark results

| Benchmark        | Prompt tok | Gen tok | Gen tok/s | Total wall | done_reason |
|------------------|------------|---------|-----------|------------|-------------|
| factual (Paris)  | 28         | 126     | 61.5      | 2.5 s      | stop        |
| fibonacci code   | 38         | 1024    | 59.9      | 17.5 s     | length      |
| train reasoning  | 65         | 1349    | 59.5      | 23.2 s     | stop        |
| 4-stanza poem    | 30         | 791     | 60.2      | 13.6 s     | stop        |

## Quality spot-checks

- **Fibonacci**: produced a correct, Pythonic solution using
  `functools.lru_cache` with a docstring.
- **Train problem**: correctly combined speeds (60 + 45 = 105 mph), computed
  meet time ~2.05 h after 3 pm, and derived the meeting location.
- **Poem**: delivered a genuine 4-stanza poem with coherent ocean/moon
  imagery.

## Practical guidance

- **Interactive Q&A**: `num_predict: 512`, expect ~2–8 s latency.
- **Coding / reasoning**: `num_predict: 1024–1536`, budget ~15–25 s.
- **Latency** scales linearly with output tokens at ~16.7 ms/tok; no
  measurable KV-cache penalty in these runs.
- **Memory**: the 64 GB machine fits the 17 GB model with plenty of headroom
  for a large context window — you are not memory-bound.

## How these numbers were measured

Requests were sent to Ollama's `/api/chat` endpoint with `stream: false`. The
response JSON exposes `prompt_eval_count`, `prompt_eval_duration`,
`eval_count`, and `eval_duration`, from which the token rates above are
derived as `eval_count / (eval_duration / 1e9)`.

```bash
curl -s http://localhost:11434/api/chat -d "$(jq -nc '{
  model: "gemma4:26b",
  messages: [{role:"user", content:"..."}],
  stream: false,
  options: {temperature: 0.3, num_predict: 1024}
}')"
```

All runs executed fully on-device via Apple Metal with no CPU fallback.
