Lemonade OmniRouter: unifying the best local AI engines for omni-modality
Posted by jfowers_amd@reddit | LocalLLaMA | 28 comments
I’ve always liked how if I ask ChatGPT to make or edit an image, it just does it. Local AI should be this convenient! One install, one endpoint. Ask for an image of a cat and it appears. Ask for a hat on the cat, with a narrated story. Now we can easily build immersive experiences.
Lemonade's OmniRouter brings that same pattern to local AI through built-in tools:
- Image generation/editing through sd.cpp
- Text-to-speech through kokoros
- Transcription through whisper.cpp
- Vision through llama.cpp
Your workflow talks to Lemonade running on your own NPU/GPU through OpenAI-compatible tool calling.
How it works:
- Lemonade sets up all these local AI engines for your system.
- Add Lemonade’s tool definitions to your workflows.
- When your LLM triggers a tool call it gets routed to the corresponding engine (sd.cpp, whisper.cpp, kokoros).
- Feed the result back into your loop.
That’s it. No custom orchestration layer, no new abstractions to learn. Check it out in this 181-line e2e Python example.
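For a feel of the loop, here's a minimal sketch using the OpenAI Python client pointed at a local Lemonade server. The base URL, API key, model name, and tool schema below are illustrative assumptions, not the real definitions; those live in the reference example.

```python
# Minimal sketch only: base URL, model id, and tool schema are assumptions
# for illustration; see the e2e reference example for the real definitions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/api/v1",  # assumed local Lemonade address
    api_key="lemonade",                       # placeholder; local servers often ignore it
)

tools = [{
    "type": "function",
    "function": {
        "name": "generate_image",  # hypothetical tool name; use Lemonade's definitions
        "description": "Generate an image from a text prompt (routed to sd.cpp).",
        "parameters": {
            "type": "object",
            "properties": {"prompt": {"type": "string"}},
            "required": ["prompt"],
        },
    },
}]

def run_tool(call):
    # Stand-in: in the real flow, Lemonade routes the call to sd.cpp /
    # whisper.cpp / kokoros and the result comes back (e.g., an image path).
    return f"[result of {call.function.name}]"

messages = [{"role": "user", "content": "Draw a cat wearing a hat."}]
resp = client.chat.completions.create(
    model="any-tool-calling-llm",  # placeholder model id
    messages=messages,
    tools=tools,
)

# Feed each tool result back into the conversation and continue the loop.
messages.append(resp.choices[0].message)  # the assistant turn with its tool calls
for call in resp.choices[0].message.tool_calls or []:
    messages.append({"role": "tool", "tool_call_id": call.id,
                     "content": run_tool(call)})
```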
We've added support for OmniRouter in our reference web UI (also available as a Tauri app), which is what you're seeing in the video. But I'm much more excited to see what people build on top.
I know my next project is going to be some kind of TTRPG-style adventure game. It’s already surprisingly fun to ask OmniRouter to be a dungeon master who illustrates and narrates the story, and I think it can be enhanced quite a bit if I build an app/harness around it.
If you find this interesting, please drop us a star and say hi!
* GitHub: https://github.com/lemonade-sdk/lemonade
* Discord: https://discord.gg/5xXzkMu8Zk
Dazzling_Equipment_9@reddit
I recently updated my Strix Halo system to Fedora 44 and upgraded Lemonade to version 10.3. After downloading and testing the Ultra Collection, I was impressed to find it utilizes less than 50GB of memory while delivering exceptional performance.
It effortlessly handles tasks that previously required complex, multi-step workflows—such as seamless image recognition, style-consistent image generation, and intuitive image editing. The fluidity of the experience significantly boosts the practical utility of local models. I truly appreciate the outstanding work that went into this release. 💯
As I explore more extensive use cases for this setup, I have two specific questions:
1. Model Customization: Is it possible to modify the default models within the collection? For instance, I'd like to swap Qwen 3.5 (35B A3B) for Qwen 3.6 or Gemma 4 to better explore the unique capabilities and nuances of those specific models.
2. API/Agent Integration: Can the Ultra Collection be called as a unified model entity from other clients or agents? I am interested in leveraging its capabilities to automate complex tasks, such as organizing and restoring large image libraries on my local storage.
jfowers_amd@reddit (OP)
Glad you’re enjoying it!
We haven't put customization options into the Lemonade app yet, but we will. If you're working under the hood in our code or writing your own app based on the Python reference, you can pick any tool-calling LLM you like.
OmniRouter is built on OpenAI API tool calling, so it should be seamless to integrate into existing agents or build on our reference code. We've also considered adding a new endpoint that works without any tool calling at all.
Dazzling_Equipment_9@reddit
Thank you for the reply to both points; I'm very much looking forward to it. For now, I've developed a skill that integrates OmniRouter's capabilities into another AI agent, and it can already complete some simple tasks, but I'm hoping for a better integration method. Could a CLI form be considered?
jfowers_amd@reddit (OP)
We spent a bunch of time yesterday talking about a skill for OmniRouter! If you want, please come by the Lemonade Discord and show what you have in the show-and-tell channel. I would love to see it.
savagely-average007@reddit
Awesome. Interested to see how GAIA works with this. Will give it a try tonight.
Zhelgadis@reddit
Can you point it to, say, a custom build of llama.cpp or the like? (Vulkan vs. ROCm vs. some bleeding-edge, not-yet-integrated patch)
Also, is there any constraint on the models you can run? Do they all have to fit into memory (Strix owner here), or can they be dynamically loaded?
BlackMetalB8hoven@reddit
Yes. Last week on Ubuntu 26.04 (Strix Halo here too), I built Vulkan llama.cpp and pointed Lemonade at it, because the bundled build was still a few commits behind llama.cpp. The Lemonade UI tells you which commit its build came from, so you can see what version is packaged with it. That said, I've since switched over to plain llama.cpp and deactivated the Lemonade systemd service. I found the "extra." prefix for my GGUF models super annoying, and I have my own systemd services running flm and llama.cpp anyway. Lemonade is great for those who don't want the hassle of setting everything up themselves.
jfowers_amd@reddit (OP)
As long as you’re enjoying your Strix Halo I am happy! Glad lemonade helped you get started.
mikkoph@reddit
Yes, you can. It can be configured to use either a Vulkan or a ROCm build, including a custom one.
Currently, I think when loading the "OmniRouter" bundle everything is loaded at once. But in general, the Lemonade API allows loading/unloading models (any model) on demand, so any model combination can be loaded/unloaded dynamically.
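A hypothetical sketch of that dynamic load/unload flow; the endpoint paths and payload fields here are assumptions, so check Lemonade's server docs for the actual API:

```python
# Hypothetical sketch: port, endpoint paths, and payload fields are
# assumptions, not confirmed Lemonade API; consult the server docs.
import requests

BASE = "http://localhost:8000/api/v1"  # assumed default server address

def swap_model(new_model: str) -> None:
    """Unload whatever is resident, then load `new_model` on demand."""
    requests.post(f"{BASE}/unload")                                # assumed endpoint
    requests.post(f"{BASE}/load", json={"model_name": new_model})  # assumed payload

swap_model("some-registered-model")  # placeholder model id
```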
As a side note, you can download *any* model hosted on Hugging Face (as long as it's supported by llama.cpp, etc.), not just what's listed there. Just type the repo/model name and it'll find the quants, mmproj, etc.
Dazzling_Equipment_9@reddit
It looks great and I can't wait to try it.
jfowers_amd@reddit (OP)
Let me know how you like it!
MLDataScientist@reddit
!remindme this Saturday "try lemonade"
RemindMeBot@reddit
I will be messaging you in 2 days on 2026-05-02 00:00:00 UTC to remind you of this link
dataexception@reddit
What kind of support is there for older GPUs like the MI100? (Slowly steps back, looking down sideways awkwardly)
Octopotree@reddit
Does it hold all the models in VRAM at once? Could it offload idle models to CPU RAM and only move them to VRAM while they're working? Handling this swapping myself, with scripts to close and open each model, is tedious.
jfowers_amd@reddit (OP)
It's currently designed with unified memory systems like Strix Halo in mind. A lot could be done to optimize for CPU + dGPU systems in the future.
MammalFever@reddit
It'd be great to have a front end that handles a variety of STT & TTS models (thinking Parakeet & VibeVoice or Chatterbox) and supports streaming, for as close to real-time dialogue as possible. Can you change the speech models?
RickyRickC137@reddit
I second this. The key point is streaming support for both STT (whisper) and TTS immediately while the LLM is typing.
overand@reddit
TTS "when the LLM is typing" is a bit tricky: you generally need to split the stream by sentence or paragraph if you don't want extremely unnatural-sounding speech. There's not enough context in the first few words of a sentence (when it's text only, not ideas in a human brain) to figure out the right intonation.
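For example, a minimal sketch of that sentence buffering, assuming a generic `speak()` TTS callback: accumulate streamed tokens and only flush complete sentences to the TTS engine.

```python
# Minimal sketch of sentence buffering for streaming TTS; speak() is a
# stand-in for whatever TTS call you use. The sentence split is naive
# (it will trip on abbreviations like "e.g."), but shows the idea.
import re

def stream_to_tts(token_stream, speak):
    buffer = ""
    for token in token_stream:
        buffer += token
        # Split on sentence-ending punctuation followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        for sentence in parts[:-1]:   # complete sentences
            speak(sentence)
        buffer = parts[-1]            # keep the unfinished tail
    if buffer.strip():
        speak(buffer)                 # flush whatever remains
```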
jfowers_amd@reddit (OP)
From u/krishna2910-amd: Yes, you can change any model in the collection. Currently Lemonade supports whisper.cpp for STT and kokoros for TTS. whisper.cpp is in the process of adding Parakeet support, though, and I am looking forward to it as well.
Sanity_N0t_Included@reddit
Just what crap-ton of VRAM is this gonna require?
layer4down@reddit
Yeah, that looks like 32GB with offloading, or 48GB to 64GB to fit comfortably.
jfowers_amd@reddit (OP)
So much! The Ultra collection in the video is 39.6 GB. The Lite collection works well at 8.5 GB, but it can't do image editing yet.
no_no_no_oh_yes@reddit
How hard would it be to plug vLLM into this, so it can benefit from higher concurrency on text while keeping the remaining capacity available ad hoc?
jfowers_amd@reddit (OP)
From u/krishna2910-amd: We have a branch with vLLM support and have been testing it out. We plan to roll it out as an experimental backend in the coming weeks :)
jfowers_amd@reddit (OP)
u/krishna2910-amd led this work with a dozen community maintainers/contributors and is here to answer questions!
jfowers_amd@reddit (OP)
Mods: his answers seem to be getting blocked, any chance you can let them through?
I'll transcribe for now.
Ok-Ad-8976@reddit
Yeah, I like where you're going with this.