# Benchmarking qwen2.5vl:latest on a Mac mini M4 Pro

Date: 2026-06-23
Model: `qwen2.5vl:latest` (Ollama, 6.0 GB on disk)
Architecture: `qwen25vl`, 8.3B parameters, Q4_K_M quantization, 128k context
Hardware: Apple Mac mini, M4 Pro (14-core: 10P + 4E), 64 GB unified memory
Runtime: Ollama, 100% GPU (Metal)

## Key finding: it's a fast, compact vision-language model

Unlike `gemma4:26b` (a thinking model), `qwen2.5vl:latest` emits answers
directly into `message.content` with no `thinking` field. It's about a third
the size of gemma4:26b (6 GB vs 17 GB) and decodes text within ~10 tok/s of it
(~50 vs ~60 tok/s), while adding dedicated vision support. The tradeoff is
accuracy on harder perceptual tasks (see the OCR comparison in
`ocr-clarapapa-qwen-vs-gemma.md`).

## Steady-state performance

| Metric                       | Value                        |
|------------------------------|------------------------------|
| Text generation rate         | ~50 tok/s (50.1–56.5)        |
| Vision generation rate       | ~45–50 tok/s (44.7–50.0)     |
| Prompt eval rate (text)      | ~365–745 tok/s               |
| Prompt eval rate (vision)    | ~200 cold, ~35k warm (tok/s) |
| Cold model load              | ~2.2 s (first call)          |
| Warm model load              | ~0.1–0.2 s                   |
| Memory footprint             | 6.0 GB, 100% on GPU          |
| Latency per output token     | ~20 ms/tok (text)            |

Generation throughput holds at ~50 tok/s across the longer text tasks
(50.1–50.2); only the 8-token factual reply measured higher (56.5). Vision runs
decode slightly slower (~45–50 tok/s) and incur a large prompt-token cost for
high-resolution images (see below).

## Text-only results (same prompts as gemma4:26b)

| Benchmark        | Prompt tok | Gen tok | Gen tok/s | Gen time | Wall    | done |
|------------------|-----------:|--------:|----------:|---------:|--------:|------|
| factual (Paris)  | 31         | 8       | 56.5      | 0.14 s   | 0.34 s  | stop |
| fibonacci code   | 40         | 336     | 50.2      | 6.70 s   | 6.90 s  | stop |
| train reasoning  | 68         | 730     | 50.1      | 14.58 s  | 14.83 s | stop |
| 4-stanza poem    | 33         | 127     | 50.1      | 2.53 s   | 2.73 s  | stop |

All four text tasks ended with a `stop` finish reason (the `done` column), so
none hit the `num_predict` length cap.
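
As a quick consistency check on the table, generation time should equal generated tokens divided by the decode rate, and whatever remains of wall time is per-request overhead (model load, prompt evaluation, HTTP). The sketch below uses only the rows above; the function and variable names are illustrative.

```python
# Rows copied from the text-only results table:
# (name, gen_tokens, gen_tok_per_s, gen_time_s, wall_s)
rows = [
    ("factual (Paris)", 8, 56.5, 0.14, 0.34),
    ("fibonacci code", 336, 50.2, 6.70, 6.90),
    ("train reasoning", 730, 50.1, 14.58, 14.83),
    ("4-stanza poem", 127, 50.1, 2.53, 2.73),
]


def summarize(rows):
    """Return (name, expected_gen_time_s, overhead_s) for each row."""
    out = []
    for name, gen_tok, rate, gen_time, wall in rows:
        expected = gen_tok / rate   # time implied by token count and rate
        overhead = wall - gen_time  # everything that is not decoding
        out.append((name, round(expected, 2), round(overhead, 2)))
    return out


for name, expected, overhead in summarize(rows):
    print(f"{name:16s} expected gen {expected:5.2f} s, overhead {overhead:.2f} s")
```

The implied generation times agree with the reported ones to within 0.01 s, and the overhead comes out at 0.20–0.25 s per request, in line with the warm load time in the steady-state table.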

## Vision results

Images were sent base64-encoded in `messages[].images` via `/api/chat`.

| Benchmark                          | Image                         | Prompt tok | Gen tok | Gen tok/s | Wall    |
|------------------------------------|-------------------------------|-----------:|--------:|----------:|--------:|
| caption (hires photo)              | clarapapa.jpg 4032×3024 11MB  | 4054       | 82      | 44.7      | 22.56 s |
| counting (hires photo)             | clarapapa.jpg 4032×3024 11MB  | 4051       | 17      | 49.2      |  1.10 s |
| caption (lores photo)              | nice-pic.jpg 1280×960 104KB   | 1592       | 26      | 50.0      |  7.23 s |
| OCR (hires chip die)               | MIPS-R3000A-die.jpg 3798×3270 | 4046       | 67      | 46.8      | 22.06 s |
| reasoning (hires chip die)         | MIPS-R3000A-die.jpg 3798×3270 | 4047       | 175     | 46.6      |  4.73 s |

### Image token cost

Both high-resolution images (4032×3024 and 3798×3270) expand to **~4050 prompt
tokens**, about 3% of the nominal 128k context, and evaluating those tokens is
the dominant cost of a vision request. The near-identical counts suggest that
large images are downscaled to a fixed token ceiling, though this was not
verified. The smaller 1280×960 image used only 1592 tokens. There's also a
large wall-time penalty on the first request for a given image (~22 s) that
vanishes on the second request for the same image (~1–5 s) thanks to prompt
caching: the warm prompt-eval rate jumps to ~35,000 tok/s.
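
The cold/warm gap can be approximated as prompt-eval time plus decode time. The sketch below is a back-of-the-envelope model, not a measurement: `estimate_wall_s` is a hypothetical helper, and the rates plugged in are the rounded figures from the tables above.

```python
def estimate_wall_s(prompt_tok, gen_tok, prompt_rate, gen_rate, load_s=0.0):
    """Rough wall time: model load + prompt evaluation + token generation."""
    return load_s + prompt_tok / prompt_rate + gen_tok / gen_rate


# Hires caption, cold cache: 4054 prompt tokens at ~200 tok/s, 82 generated at 44.7 tok/s.
cold = estimate_wall_s(4054, 82, prompt_rate=200, gen_rate=44.7)

# Hires counting, warm cache: 4051 prompt tokens at ~35,000 tok/s, 17 generated at 49.2 tok/s.
warm = estimate_wall_s(4051, 17, prompt_rate=35_000, gen_rate=49.2)

print(f"cold ~{cold:.1f} s, warm ~{warm:.1f} s")
```

The cold estimate (about 22 s) lands close to the measured 22.56 s. The warm estimate (about 0.5 s) undershoots the measured 1.10 s, which hints at fixed per-request overhead that this simple model does not capture.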

## Quality spot-checks

- **Fibonacci**: produced a correct solution using a `memo={}` default-arg
  dictionary with a full docstring. Clean and idiomatic.
- **Train problem**: set up variables correctly, used relative speed
  (60 + 45 = 105 mph), and solved step by step. Sound reasoning.
- **Poem**: delivered a coherent ocean/moon poem but ended after 3 stanzas
  (127 tokens) instead of the requested 4. Because the run finished with
  `stop` rather than hitting the length cap, raising `num_predict` is unlikely
  to help; an explicit "exactly 4 stanzas" instruction is the likelier fix.
- **clarapapa caption**: **misread** the sand writing as "SPOOKY" (the actual
  text is "CLARA PAPA"). This is a consistent hallucination on this image —
  see `ocr-clarapapa-qwen-vs-gemma.md` for the head-to-head where gemma4:26b
  got it right.
- **MIPS die**: correctly identified the image as an integrated circuit die
  / microprocessor and gave sound reasoning (complex layout, rectangular
  active area). It correctly reported no legible text/labels on the die
  surface.
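
For readers unfamiliar with the `memo={}` default-argument pattern mentioned in the Fibonacci item, it looks roughly like the sketch below. This is an illustrative reconstruction of the pattern, not the model's verbatim output.

```python
def fib(n, memo={}):
    """Return the n-th Fibonacci number (fib(0) = 0, fib(1) = 1)."""
    # The default dict is created once and shared across calls,
    # so it doubles as a cache of previously computed values.
    if n in memo:
        return memo[n]
    if n < 2:
        return n
    memo[n] = fib(n - 1, memo) + fib(n - 2, memo)
    return memo[n]


print(fib(30))  # 832040
```

The shared default dict persists for the life of the process, which is what makes the memoization work without any extra setup.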

## Practical guidance

- **Interactive text Q&A**: `num_predict: 256`; short answers return in under
  a second, and a reply that fills the cap takes ~5 s at ~50 tok/s.
- **Coding / reasoning**: `num_predict: 512–768`, budget ~3–15 s.
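- **Image captioning (batch)**: `num_predict: 128`. Expect ~1–2 s per image
  only when that image's prompt is already cached; the first request for each
  new image pays the full prompt-eval cost (~7 s at 1280×960, ~22 s at full
  resolution in the runs above). For a PhotoPrism-style library, downscaling
  images first may be worthwhile, and accuracy should be spot-checked.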
- **OCR**: verify against a known-good model on a sample. On clarapapa.jpg
  this model hallucinated; gemma4:26b did not.
- **High-res images** cost ~4000 prompt tokens each — factor this into
  context budgets when captioning many images in one conversation.
- **Memory**: at 6 GB it coexists comfortably with other models or a large
  context window on the 64 GB machine.
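
The context-budget point above can be made concrete with a little arithmetic. The sketch below assumes the nominal 128k window from the model header and a 128-token caption per image; `max_images` is a hypothetical helper. The effective window is whatever `num_ctx` the server is actually running with, which may be much smaller.

```python
CONTEXT_TOKENS = 128_000    # nominal "128k context"; assumption, the exact value may differ
HIRES_IMAGE_TOKENS = 4050   # measured for the 4032x3024 JPEG
LORES_IMAGE_TOKENS = 1592   # measured for the 1280x960 JPEG


def max_images(image_tokens, reply_tokens=128, reserve=0, context=CONTEXT_TOKENS):
    """Number of image + caption turns that fit in one conversation."""
    per_turn = image_tokens + reply_tokens
    return (context - reserve) // per_turn


print("hires images per conversation:", max_images(HIRES_IMAGE_TOKENS))
print("lores images per conversation:", max_images(LORES_IMAGE_TOKENS))
```

Under these assumptions, about 30 high-resolution images or 74 of the 1280×960 images fit in a single conversation. Starting a fresh conversation per image sidesteps the limit entirely.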

## How these numbers were measured

Requests were sent to Ollama's `/api/chat` endpoint with `stream: false`. The
response JSON exposes `prompt_eval_count`, `prompt_eval_duration`,
`eval_count`, and `eval_duration`, from which the token rates above are
derived as `eval_count / (eval_duration / 1e9)`.

```python
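import base64
import json
import urllib.request

URL = "http://localhost:11434/api/chat"

# Ollama expects images as base64 strings in messages[].images.
with open("image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "qwen2.5vl:latest",
    "stream": False,  # return one JSON object instead of a chunk stream
    "options": {"num_predict": 256, "temperature": 0.1},
    "messages": [{"role": "user", "content": "Describe this image.", "images": [b64]}],
}
req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    d = json.loads(resp.read())

print(d["message"]["content"])
# Decode rate in tok/s; eval_duration is reported in nanoseconds.
print(d["eval_count"] / (d["eval_duration"] / 1e9))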
```
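
The rate formula above can be wrapped in a small helper that works on any response dict. This is a minimal sketch: `rates` is a hypothetical name, the sample dict is synthetic (shaped like the fibonacci row, not captured output), and the optional fields are read with `.get` in case a given Ollama version omits them.

```python
def rates(resp):
    """Derive tok/s figures from an Ollama /api/chat response dict.

    All *_duration fields are in nanoseconds.
    """
    def per_s(count, dur_ns):
        return count / (dur_ns / 1e9) if dur_ns else float("nan")

    return {
        "prompt_tok_s": per_s(resp.get("prompt_eval_count", 0),
                              resp.get("prompt_eval_duration", 0)),
        "gen_tok_s": per_s(resp["eval_count"], resp["eval_duration"]),
        "load_s": resp.get("load_duration", 0) / 1e9,
        "total_s": resp.get("total_duration", 0) / 1e9,  # server-side, not client wall time
        "done_reason": resp.get("done_reason"),          # "length" would signal truncation
    }


# Synthetic example: 336 tokens generated in 6.70 s, 40 prompt tokens in 0.10 s.
sample = {
    "eval_count": 336, "eval_duration": 6_700_000_000,
    "prompt_eval_count": 40, "prompt_eval_duration": 100_000_000,
    "done_reason": "stop",
}
print(rates(sample))
```

Checking `done_reason` alongside the rates makes it easy to flag runs that were cut off by `num_predict` rather than finishing naturally.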

All runs executed fully on-device via Apple Metal with no CPU fallback.

## Comparison with gemma4:26b (text tasks, same prompts)

| Benchmark        | qwen2.5vl gen tok/s | gemma4:26b gen tok/s | qwen wall | gemma wall |
|------------------|--------------------:|---------------------:|----------:|-----------:|
| factual (Paris)  | 56.5                | 61.5                 | 0.34 s    | 2.49 s     |
| fibonacci code   | 50.2                | 59.9                 | 6.90 s    | 17.54 s    |
| train reasoning  | 50.1                | 59.5                 | 14.83 s   | 23.18 s    |
| 4-stanza poem    | 50.1                | 60.2                 | 2.73 s    | 13.59 s    |

gemma4:26b decodes 5–10 tok/s faster and produces higher-quality, more
complete output, but qwen2.5vl is **significantly faster wall-clock** on every
text task (1.6× to 7.3×, with the largest gain on the shortest answer) thanks
to its smaller size, near-zero warm load time, and lower token counts (no
thinking overhead). When time-to-answer matters more than gemma4:26b-level
quality, qwen2.5vl is the better pick.
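
The per-task ratios behind this comparison can be recomputed directly from the table. The sketch below only restates the table's numbers; the dictionary names are illustrative.

```python
# (qwen2.5vl, gemma4:26b) pairs copied from the comparison table.
wall_s = {
    "factual (Paris)": (0.34, 2.49),
    "fibonacci code": (6.90, 17.54),
    "train reasoning": (14.83, 23.18),
    "4-stanza poem": (2.73, 13.59),
}
gen_tok_s = {
    "factual (Paris)": (56.5, 61.5),
    "fibonacci code": (50.2, 59.9),
    "train reasoning": (50.1, 59.5),
    "4-stanza poem": (50.1, 60.2),
}

# How many times faster qwen2.5vl finishes, and how much faster gemma4:26b decodes.
speedup = {k: gemma / qwen for k, (qwen, gemma) in wall_s.items()}
decode_gap = {k: gemma - qwen for k, (qwen, gemma) in gen_tok_s.items()}

for name in wall_s:
    print(f"{name:16s} wall speedup {speedup[name]:.1f}x, "
          f"decode gap {decode_gap[name]:.1f} tok/s")
```

The wall-clock advantage ranges from about 1.6× (train reasoning) to about 7.3× (factual), while gemma4:26b's decode lead stays between roughly 5 and 10 tok/s.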
