Ollama API image payload format for Python

Posted by Ok-Internal9317@reddit | LocalLLaMA | 4 comments

Hi guys, is this the correct Python payload format for Ollama? `{ "role": "user", "content": "what is in this image?", "images": ["iVBORw0KQuS..."] }`, where the `images` entry is the base64-encoded image. I am asking because I passed the same input and image encoding to the same gemma12b on both OpenRouter and Ollama: OpenRouter returned a sensible answer, while Ollama seemed to have no clue about the image it was describing. The Ollama documentation says this format is right, but I tested for a while and couldn't get the same result from OpenRouter and Ollama. My goal is to build a Python image-to-LLM-to-text parser. Thanks for helping!
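For reference, here is a minimal sketch of how a message like the one above could be wrapped into a full `/api/chat` request using only the standard library. It assumes Ollama is listening on its default local address, `http://localhost:11434`. `build_chat_body` and `describe_image` are hypothetical helper names, and the model tag is whatever vision-capable model you have pulled.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_body(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON body for /api/chat with one base64-encoded image."""
    # b64encode emits a single line, so there are no newlines to strip.
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "stream": False,  # ask for one JSON object instead of a stream
        "messages": [
            {"role": "user", "content": prompt, "images": [image_b64]}
        ],
    }


def describe_image(model: str, prompt: str, image_path: str) -> str:
    """Send the image to Ollama and return the reply text (hypothetical helper)."""
    with open(image_path, "rb") as f:
        body = build_chat_body(model, prompt, f.read())
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]
```

Note that the message dict from the post sits inside the `messages` list, and the image is sent as raw base64 with no `data:image/...;base64,` prefix.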

4 Comments

godndiogoat@reddit

Gemma 12b in Ollama is text-only, so no matter how you pack the base64, the model just throws the bytes into the prompt and guesses. The same name on OpenRouter is silently mapped to a llava-augmented fork, which is why it looks smarter. Keep the payload you already have, but spin up a vision model that Ollama actually supports, e.g. llava:13b, cogvlm:17b, bakllava:8b, or even phi3-vision if you side-load it. In Python, just set model='llava:13b' in the /api/chat call and keep images=[b64] as you're doing. Strip newlines from the string and make sure the image is a JPEG or PNG under 2-3 MB; larger images choke. I route the image through Pillow to resize it to 512 on the long edge before dumping it to base64. For post-processing captions, LangChain's OutputParser saves a lot of typing, while FastAPI lets you expose it as a microservice; APIWrapper.ai handles the retry logic when you batch multiple shots. Switch to a vision-ready model and the same payload will start giving sane answers.
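The resize step described above could look roughly like this. It is a sketch, not the commenter's actual code: `fit_long_edge` and `resize_and_encode` are hypothetical names, the JPEG quality of 90 is an arbitrary choice, and Pillow is assumed to be installed (`pip install pillow`).

```python
import base64
import io


def fit_long_edge(width: int, height: int, max_edge: int = 512) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # never upscale small images
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize_and_encode(image_path: str, max_edge: int = 512) -> str:
    """Resize an image with Pillow and return it as a base64 JPEG string."""
    # Pillow is third-party; imported here so fit_long_edge works without it.
    from PIL import Image

    with Image.open(image_path) as img:
        rgb = img.convert("RGB")  # JPEG cannot store an alpha channel
        resized = rgb.resize(fit_long_edge(*rgb.size, max_edge), Image.LANCZOS)
    buffer = io.BytesIO()
    resized.save(buffer, format="JPEG", quality=90)
    # b64encode returns one line, so no newline stripping is needed.
    return base64.b64encode(buffer.getvalue()).decode("ascii")
```

The string returned by `resize_and_encode` can go straight into the `images` list of the chat message. Passing `max_edge=768` covers the case where small text needs more resolution.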

Ok-Internal9317@reddit (OP)

Thanks, yes, it was indeed very frustrating, and what a surprise that gemma12b in Ollama is text-only. Isn't 512x512 too small, though? (I haven't experimented with such a small resolution, but I'll try it out.)

godndiogoat@reddit

512 on the long side is plenty; llava and cogvlm downsample to 224–336 anyway, so bigger inputs just waste VRAM. For tiny text, push to 768 or crop tight. I run Pillow + FastAPI for resizing and retries; I tried that with LangChain, but DreamFactory plugs in as the gateway with far less glue. Stay near 512 for clean captions.

SM8085@reddit

[This completion](https://github.com/Jay4242/llm-scripts/blob/512d6deab04059eeb5b205a389529e616bacfd29/llm-python-vision-ollama.py#L26) is what I've been using.