p.simianer.de/blog Ollama generated captions with PhotoPrism Vision

Ollama generated captions with PhotoPrism Vision

In my previous post I described how I set up photoprism-vision with Ollama, so that Photoprism can use a local vision LLM for captioning. Back then I was running Gemma3-4B and qwen2.5vl, and both were, well, not great. They hallucinated on uncommon subjects and were pretty much unable to read text in images. I've since switched to gemma4:26b (HF), and that actually works.

gemma4:26b

So gemma4:26b runs locally on my Mac Mini M4 Pro (64 GB), no GPU server required this time (which is nice, the tower server under my desk was loud). Where the smaller models happily invented captions for scenes they didn't understand, the 26B model mostly describes what's actually there. The biggest surprise to me was that it reads text reliably, which none of the smaller ones could.

This photo for example has "Clara Papa" written in the sand. The smaller models produced gibberish or ignored the writing entirely. qwen2.5vl often couldn't read text at all (see the previous post), and here it actually hallucinated "SPOOKY HOLLOW" (consistently, the sand texture seems to fool it). gemma4:26b transcribed it correctly and even worked it into the caption:

"Clara Papa" written in the sand. `gemma4:26b` read it correctly, `qwen2.5vl` saw "SPOOKY HOLLOW".

I put the two models head-to-head on this image; full numbers in the OCR benchmark notes. Short version: qwen2.5vl is fast (~1 s) but wrong, gemma4:26b is slow (~10 s, most of it thinking) but correct. Interestingly gemma4:26b used only 309 tokens for the image where qwen2.5vl expanded it to 4061.

Thinking Tokens And PhotoPrism

I knew gemma4:26b is a thinking model; it emits its chain-of-thought into a separate message.thinking field before the actual answer lands in message.content. What I didn't expect is that PhotoPrism would choke on those reasoning tokens. With a small num_predict budget the model can spend all of its tokens reasoning and never emit a visible answer at all, and when it does produce output, PhotoPrism stores the reasoning tokens verbatim, polluting the captions with the model's internal monologue. I couldn't find a clean way to configure that away, so I keep a fork on the fix/vision-strip-thinking-tokens branch that strips the thinking tokens from vision responses before they're persisted. Without it the captions are pretty much unusable; with it only the final caption lands in the database. For real tasks, set num_predict >= 768 and parse both fields.

I also took the opportunity to tune the vision.yaml properly this time. Captions run on gemma4:26b (with a system prompt asking for English + German, journalistic and unemotional), labels on gemma4:26b with structured JSON output. NSFW and face detection still use the built-in models. The full config is here. Note the Think: "true" flag on the caption service, which is what makes the reasoning-token stripping necessary.

Building The Fork

To run the patched PhotoPrism in Docker, building the image from the fork just worked for me:

git clone -b fix/vision-strip-thinking-tokens https://github.com/pks/photoprism
cd photoprism
make docker-local

This pulls the upstream photoprism/develop base image, compiles the binary inside the container, and tags the result photoprism/photoprism:local. Point your docker-compose.yaml at that tag and the reasoning-token fix is live, no need to wait for the patch to land upstream.

Benchmark

I benchmarked both models on this machine (Mac mini M4 Pro, 64 GB). gemma4:26b holds ~60 tok/s via Apple Metal with a 17 GB footprint, so a captioning request (a few hundred tokens) lands in well under 10 s. qwen2.5vl is smaller (6 GB) and much faster end-to-end (~1 s vs ~10 s for gemma4:26b, as the OCR test showed), but not accurate enough for text. In the previous post I worried that ~3 s per request was too slow for ~80,000 pictures. At ~10 s per image and a library that's since grown to ~100,000 photos, a full caption run is a multi-day job, so I just let it churn in the background — per year, mostly overnight. Full numbers and how I measured them in the gemma4:26b and qwen2.5vl benchmark notes.

Bulk Captioning

Captions are still generated explicitly via vision run, e.g. per year:

for year in {2017..2025}; do
  echo "generating captions for $year"
  photoprism vision run --models caption year:$year --force
  echo "done $year"
done

Instead of year:2025, one can use any of the other filters.

Conclusion

I'm pretty happy with gemma4:26b for captioning. It runs entirely on the Mac Mini, reads text (which opens up search over photos the smaller models couldn't parse at all), and the captions are good enough now that I trust them as a secondary signal alongside Photoprism's NASNet labels. The one loose end from the previous post — the ugly model:tag hack in parse_model_info_from_request — is now gone too, replaced by the Model: field in the vision.yaml shown above.