Best stack for Gemma 4 multimodal document analysis on a headless GPU server?
Posted by makingnoise@reddit | LocalLLaMA | 19 comments
I’m trying to figure out the best stack for Gemma 4 multimodal document analysis and could use advice from people actually running it successfully.
Goal:
Use Gemma 4’s vision capabilities to read multi-page PDFs without building a bunch of fragile preprocessing pipelines (PNG conversion scripts, OCR chains, etc.). The model itself should be able to interpret the document — I’m trying to avoid toolchains that force me to “spoon-feed” pages as images. I want to just give the damn model a PDF and have it go to work, no hacky bullshit workarounds.
My environment
- Headless Linux VM used as an inference server
- GPU: RTX 3090 (24 GB VRAM)
- Docker-based setup
- Accessed remotely through a web UI or API (not running the model directly on my desktop)
What I’ve tried
- Ollama + OpenWebUI
- Gemma 4 runs, but multimodal/document handling feels half-implemented
- Uploading PDFs doesn’t actually pass them through to the model in a useful way
- Most advice I see online involves converting PDFs to PNGs first, which I’d like to avoid
What I’m trying to find out
For people running Gemma 4 with vision:
- What model runner / inference stack are you using?
- Does anything currently allow clean multi-page PDF ingestion with no hacky workarounds?
- If not, what’s the least painful stack for document analysis with Gemma 4 right now?
I’m mainly trying to avoid large fragile pipelines just to get documents into the model.
If anyone has this working smoothly with Gemma 4, I’d love to hear what your setup looks like.
CATLLM@reddit
What kind of docs are you working with? Different doc complexities call for different solutions.
makingnoise@reddit (OP)
Multi-page PDFs, 20 pages or less, scanned but not OCR'd text. My understanding is that Gemma 4 can handle them directly. But how to GET the damn PDF to the model?
CATLLM@reddit
Right but what kind of pdfs tho? Forms? Just pages with English text?
makingnoise@reddit (OP)
I catch your drift, and I'm aware there are different PDF extraction options; I'm curious what you'd suggest. My point is that I was under the mistaken impression that the model itself could process raw scanned image-of-text multi-page PDFs, when in fact it requires something external to itself to feed it PNGs. Another commenter indicated that they use llama.cpp and it handles the PDFs directly. I looked this up, and apparently the webui handles the injection.
CATLLM@reddit
Why not just say what kind of docs you are working with? Receipts? Forms? Legal docs? What is it?
makingnoise@reddit (OP)
Because your chat history is hidden and I am not interested in volunteering information when you've offered me nothing, and for all I know, you're an instance.
OsmanthusBloom@reddit
I don't think any LLM (even multimodal) can ingest PDFs directly. There's always some preprocessing, either text extraction or conversion to images.
The model itself sees only tokens as input. Text can be converted to tokens directly, while images go through mmproj to become tokens.
makingnoise@reddit (OP)
Then why does it say this on the model card: "Image Understanding – Object detection, Document/PDF parsing, screen and UI understanding, chart comprehension, OCR (including multilingual), handwriting recognition, and pointing. Images can be processed at variable aspect ratios and resolutions."
OsmanthusBloom@reddit
See the headline in bold? I think these are just examples of different types of images that the model can "understand".
I'm happy to be proved wrong but I know quite a lot about how LLMs work and I've not yet seen one that can process PDFs natively, without first converting to text/images.
makingnoise@reddit (OP)
I wish these model cards were more honest then. It should say "capable of being spoon-fed PNGs."
OsmanthusBloom@reddit
It's indeed a bit misleading.
DinoAmino@reddit
No. It isn't. Multimodal has been around for a long time. It's a noob's misunderstanding. And that's ok - we've all been there. So now they know.
makingnoise@reddit (OP)
so you're saying I have to install the service that feeds the model PNG pages? that the model actually has no PDF capabilities?
DinoAmino@reddit
That's right. LLMs are text-in/text-out. They don't even handle the images: the model contains a multimodal projector with a vision encoder that transforms pixels into visual features that can be tokenized for the LLM to "see". Some PDFs are just images, some are just text, some are both. Tables are another thing too.
When you use UIs that let you drag and drop files like PDFs into your prompt's context, the UI is converting them to markdown/text-embeddings for you. Bottom line: something has to preprocess non-text data before the LLM can use it.
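For what it's worth, that preprocessing step is small once the pages are rendered to images. A minimal sketch of the OpenAI-style multimodal payload these UIs build under the hood (the function name and model id are made up; it assumes you already have each page as PNG bytes, e.g. from pdftoppm):

```python
import base64
import json

def build_vision_payload(page_pngs, question, model="gemma-vision"):
    """Build an OpenAI-style chat request: one text part plus one
    base64 data-URI image part per rendered PDF page."""
    content = [{"type": "text", "text": question}]
    for png in page_pngs:
        b64 = base64.b64encode(png).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return {"model": model,
            "messages": [{"role": "user", "content": content}]}

# Two fake "pages"; in real use these are the PNG bytes of each page.
payload = build_vision_payload([b"page1", b"page2"], "Summarize this document.")
print(json.dumps(payload)[:60])
```

That dict is the whole trick: the model never sees "a PDF", just a list of image parts in one user message.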
makingnoise@reddit (OP)
Another commenter pointed out that llama.cpp just works. You helped correct my understanding of why: llama.cpp's webui handles injecting the PDF image data into the model (probably as PNGs, based on how everything else seems to work). All of this boiled down to "I need to use llama.cpp and its webui". Thanks again.
makingnoise@reddit (OP)
Where does the multimodal projector live? Wait, is that what the mmproj files are? I'm a multimodal noob. I've done olm-ocr2, so I'm familiar with the PNG concept, but this is my first time trying to expressly use a vision LLM in a chat context. Could a model similarly have PDF-handling services built in?
makingnoise@reddit (OP)
It's funny, but Gemini is insisting that Gemma 4 is absolutely capable of natively parsing multi-page PDFs; it says the server software is the shortcoming. I'm riding and chatting at the moment, and I just really want to believe that Gemma 4 is capable of what I suspect it's capable of.
Present-Access-2260@reddit
I just use the llama.cpp server with the vision model and it handles PDFs directly through the API.
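Roughly like this. Host, port, and filenames are placeholders, not my exact setup: you start something like `llama-server -m model.gguf --mmproj mmproj.gguf`, and it exposes an OpenAI-compatible `/v1/chat/completions` endpoint that accepts base64 image parts, one per page:

```python
import base64
import json
import urllib.request

# Placeholder endpoint for a llama-server instance started with a
# vision model and its --mmproj file.
SERVER = "http://localhost:8080/v1/chat/completions"

def build_request(page_pngs, question):
    # One text part plus one base64 data-URI image part per page.
    parts = [{"type": "text", "text": question}]
    for png in page_pngs:
        b64 = base64.b64encode(png).decode("ascii")
        parts.append({"type": "image_url",
                      "image_url": {"url": "data:image/png;base64," + b64}})
    return json.dumps({"messages": [{"role": "user", "content": parts}]}).encode()

def ask(page_pngs, question):
    req = urllib.request.Request(
        SERVER, data=build_request(page_pngs, question),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:  # needs a live llama-server
        return json.load(resp)["choices"][0]["message"]["content"]
```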
makingnoise@reddit (OP)
Do you use a UI front end? What's your stack, etc.?