Best stack for Gemma 4 multimodal document analysis on a headless GPU server?

Posted by makingnoise@reddit | LocalLLaMA | 19 comments

I’m trying to figure out the best stack for Gemma 4 multimodal document analysis and could use advice from people actually running it successfully.

Goal:
Use Gemma 4’s vision capabilities to read multi-page PDFs without building fragile preprocessing pipelines (PNG conversion scripts, OCR chains, etc.). The model itself should interpret the document — I want to avoid toolchains that force me to “spoon-feed” pages as images. Ideally I just hand the model a PDF and it goes to work, with no hacky workarounds.
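For context, here is a minimal sketch of the kind of “spoon-feed” pipeline I mean: render each PDF page to a PNG (page rendering, e.g. via PyMuPDF or pdf2image, is assumed and stubbed out here), then pack the pages as base64 `image_url` parts into an OpenAI-style chat request, which most local runners with a server mode accept. Function and model names below are illustrative, not from any specific stack.

```python
import base64

def build_vision_request(page_pngs: list[bytes], prompt: str,
                         model: str = "gemma") -> dict:
    """Build an OpenAI-style chat request with one image part per PDF page.

    page_pngs: raw PNG bytes for each rendered page (rendering step assumed).
    """
    parts = [{"type": "text", "text": prompt}]
    for png in page_pngs:
        b64 = base64.b64encode(png).decode("ascii")
        parts.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return {"model": model, "messages": [{"role": "user", "content": parts}]}
```

You would then POST this body to the runner’s chat-completions endpoint — and the page count is capped by context length, which is exactly the fragility I’m trying to get away from.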

My environment

What I’ve tried

What I’m trying to find out

For people running Gemma 4 with vision:

  1. What model runner / inference stack are you using?
  2. Does anything currently allow clean multi-page PDF ingestion with no hacky workarounds?
  3. If not, what’s the least painful stack for document analysis with Gemma 4 right now?

I’m mainly trying to avoid large, fragile pipelines just to get documents into the model.

If anyone has this working smoothly with Gemma 4, I’d love to hear what your setup looks like.