# OCR Benchmark: qwen2.5vl vs gemma4:26b on clarapapa.jpg

Date: 2026-06-23
Image: `~/Downloads/clarapapa.jpg` (4032×3024, 11 MB JPEG)
Hardware: Apple Mac mini, M4 Pro (14-core: 10P + 4E), 64 GB unified memory
Runtime: Ollama, 100% GPU (Metal)

## Setup

Prompt: *"Read all the text visible in this image. Transcribe it exactly as written. If there is no text, say so."*
Options: `temperature: 0.1`, `num_predict: 512`, `stream: false`
Endpoint: `/api/chat` with the image passed base64-encoded in `messages[].images`.

## Result

| Model              | Prompt tok | Prompt tok/s | Gen tok | Gen tok/s | Gen time | Wall    | Output          | Correct |
|--------------------|-----------:|-------------:|--------:|----------:|---------:|--------:|-----------------|:-------:|
| `qwen2.5vl:latest` |       4061 |       34,624 |       7 |      53.4 |   0.13 s |  0.91 s | `SPOOKY HOLLOW` | ❌      |
| `gemma4:26b`       |        309 |          212 |     131 |      62.2 |   2.11 s | 10.36 s | `CLARA PAPA`    | ✅      |

The ground-truth text is **CLARA PAPA**, which matches the filename `clarapapa.jpg`.
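The rate and time columns in the table above can be derived from the timing fields that Ollama includes in a non-streaming `/api/chat` response (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, `load_duration`, `total_duration`, with durations in nanoseconds). The sketch below is one way to do that, not necessarily how the numbers here were produced. The `summarize` helper and the `example` dictionary are hypothetical; the example values are back-computed from the `gemma4:26b` row and are not the raw response.

```python
NS_PER_S = 1e9  # Ollama reports durations in nanoseconds


def summarize(resp: dict) -> dict:
    """Derive table-style metrics from an Ollama /api/chat response dict."""
    prompt_s = resp["prompt_eval_duration"] / NS_PER_S
    gen_s = resp["eval_duration"] / NS_PER_S
    return {
        "prompt_tok": resp["prompt_eval_count"],
        "prompt_tok_per_s": resp["prompt_eval_count"] / prompt_s,
        "gen_tok": resp["eval_count"],
        "gen_tok_per_s": resp["eval_count"] / gen_s,
        "gen_time_s": gen_s,
        # load_duration may be absent or near zero when the model is already resident.
        "load_s": resp.get("load_duration", 0) / NS_PER_S,
        "wall_s": resp["total_duration"] / NS_PER_S,
    }


# Illustrative values only, chosen to approximate the gemma4:26b row.
example = {
    "prompt_eval_count": 309,
    "prompt_eval_duration": 1_457_000_000,
    "eval_count": 131,
    "eval_duration": 2_106_000_000,
    "load_duration": 6_400_000_000,
    "total_duration": 10_360_000_000,
}

stats = summarize(example)
print(f"{stats['prompt_tok_per_s']:.0f} prompt tok/s, "
      f"{stats['gen_tok_per_s']:.1f} gen tok/s, "
      f"{stats['wall_s']:.2f} s wall")
```

Note that `total_duration` is measured by the server. A wall-clock figure timed on the client side would also include HTTP and base64 upload overhead, so the two can differ slightly.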

## Takeaways

- **Accuracy**: gemma4:26b correctly read "CLARA PAPA". qwen2.5vl hallucinated "SPOOKY HOLLOW". It also produced "SPOOKY" when captioning this same image in an earlier run, so the misread is consistent on this photo, possibly because the sand texture or image noise misleads the smaller model.
- **Speed**: qwen2.5vl is dramatically faster end-to-end (0.91 s vs 10.36 s, about 11×). It is a 6 GB / 8.3B model against gemma4's 17 GB / 26B, and it emitted only 7 output tokens. But its answer was wrong.
- **Prompt tokens differ hugely**: qwen2.5vl expanded the 11 MB, 4032×3024 image to **4061 prompt tokens** (the cost of high-resolution vision tokens), but evaluated them at ~35k tok/s (likely cached/warm). gemma4:26b needed only **309 prompt tokens**, a far more compact visual encoding, but paid a 6.4 s cold model load on this run.
- **Generation rate**: gemma4:26b (62.2 tok/s) edges out qwen2.5vl (53.4 tok/s) on raw decode speed despite being roughly 3× larger by parameter count; the M4 Pro handles the bigger model efficiently once loaded.
- **Thinking overhead**: gemma4:26b spent 124 of its 131 generated tokens on `thinking` (reasoning through the transcription) and just 7 on the final answer. This is why a "simple" OCR took 131 tokens: the model works through the problem before committing to an answer.
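The timing claims in the list above can be sanity-checked with a little arithmetic on the reported figures. The snippet below is a rough sketch: it splits gemma4:26b's wall time into load, prompt evaluation, and decode, then compares wall times with and without the cold load. All inputs are copied from this note; the variable names are illustrative.

```python
# Figures copied from the results table and the takeaways above.
GEMMA_WALL_S = 10.36
GEMMA_LOAD_S = 6.4
GEMMA_GEN_S = 2.11
GEMMA_PROMPT_TOK = 309
GEMMA_PROMPT_TOK_PER_S = 212
QWEN_WALL_S = 0.91

prompt_s = GEMMA_PROMPT_TOK / GEMMA_PROMPT_TOK_PER_S  # prompt evaluation time
accounted_s = GEMMA_LOAD_S + prompt_s + GEMMA_GEN_S   # load + prompt + decode
other_s = GEMMA_WALL_S - accounted_s                  # time the three phases do not explain
warm_wall_s = GEMMA_WALL_S - GEMMA_LOAD_S             # wall time with the cold load removed

cold_ratio = GEMMA_WALL_S / QWEN_WALL_S               # slowdown as measured
warm_ratio = warm_wall_s / QWEN_WALL_S                # slowdown if gemma4 were already loaded
thinking_share = 124 / 131                            # fraction of generated tokens spent thinking

print(f"prompt eval: {prompt_s:.2f} s, unexplained: {other_s:.2f} s")
print(f"cold: {cold_ratio:.1f}x slower, warm estimate: {warm_ratio:.1f}x slower")
print(f"thinking share: {thinking_share:.1%}")
```

The warm estimate assumes qwen2.5vl's 0.91 s run involved no model load, which the high prompt throughput suggests but this note does not state outright. A fair comparison would rerun both models warm.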

## Practical guidance

For OCR tasks where correctness matters, the bigger thinking model wins here despite being roughly 11× slower wall-clock on this run (10.36 s vs 0.91 s, with a 6.4 s cold load included in gemma4's figure). qwen2.5vl's speed advantage only pays off if its accuracy is sufficient, and on this image it wasn't.

- **Prefer gemma4:26b** for OCR / transcription where accuracy is critical and you can afford ~10 s per image from a cold start (the 6.4 s model load should not recur while the model stays resident).
- **Prefer qwen2.5vl** for high-throughput captioning / tagging where occasional hallucinations are acceptable and sub-second latency matters (e.g. batch-processing a PhotoPrism library).
- Always inspect a sample of VL-model outputs before trusting them at scale — a confident misread like "SPOOKY HOLLOW" is worse than no answer.
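
One way to make that inspection cheaper, sketched here under assumptions, is to run both models over the same images and review the disagreements first, plus a random sample of the rest. The helper names (`normalize`, `flag_disagreements`, `review_sample`) are hypothetical, and the only data shown is the single result from this note.

```python
import random
import re


def normalize(text: str) -> str:
    """Uppercase and collapse punctuation/whitespace so trivial differences are ignored.

    Assumes ASCII text; non-ASCII letters would be stripped by this pattern.
    """
    return re.sub(r"[^A-Z0-9]+", " ", text.upper()).strip()


def flag_disagreements(outputs_a: dict, outputs_b: dict) -> list:
    """Return the image names (sorted) where the two models' transcriptions differ."""
    shared = outputs_a.keys() & outputs_b.keys()
    return sorted(name for name in shared
                  if normalize(outputs_a[name]) != normalize(outputs_b[name]))


def review_sample(names, k: int = 20, seed: int = 0) -> list:
    """Pick up to k names at random (reproducibly) for manual review."""
    names = sorted(names)
    return random.Random(seed).sample(names, min(k, len(names)))


# The one data point from this note: the two models disagree on clarapapa.jpg.
qwen_out = {"clarapapa.jpg": "SPOOKY HOLLOW"}
gemma_out = {"clarapapa.jpg": "CLARA PAPA"}
print(flag_disagreements(qwen_out, gemma_out))
```

Agreement between two models does not prove correctness, so the random sample still matters; disagreement is simply a cheap signal for where to look first.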

## Reproducing

```python
import base64
import json
import os
import urllib.request

URL = "http://localhost:11434/api/chat"
IMG = os.path.expanduser("~/Downloads/clarapapa.jpg")

# Ollama expects each image as a base64-encoded string.
with open(IMG, "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "gemma4:26b",  # or "qwen2.5vl:latest"
    "stream": False,        # return one JSON object instead of a chunk stream
    "options": {"num_predict": 512, "temperature": 0.1},
    "messages": [{
        "role": "user",
        "content": "Read all the text visible in this image. Transcribe it exactly as written. If there is no text, say so.",
        "images": [b64],
    }],
}
# Supplying `data` makes urllib send a POST request.
req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    d = json.loads(resp.read())
print(d["message"].get("content", ""))
```

All runs executed fully on-device via Apple Metal with no CPU fallback.