Where local is lagging behind... Wish lists for the rest of 2025
Posted by nomorebuttsplz@reddit | LocalLLaMA | View on Reddit | 26 comments
It's been a great 6 months to be using local AI, as the performance delta has, on average, been very small for classic LLMs, with R1 typically at or near SOTA and smaller models consistently posting better and better benchmarks.
However, the items below are all areas where there has been a surprising lag between closed systems' release dates and the availability of high-quality local alternatives:
- A voice mode on par with ChatGPT's. Almost all the pieces are in place for something akin to 4o with voice: Sesame, Kyutai, or Chatterbox for TTS, any local model for the LLM, and decent STT already exists. We just need the parts put together in a fairly user-friendly, fast streaming package (a rough sketch of the glue I mean is below this list).
- Local deep research on the level of o3's web search. o3 is quite impressive now in its ability to rapidly search several web pages to answer questions. There are some solutions for local LLMs, but none that I've tried seem to fulfill the potential of web search agents with clever and easily customizable workflows. I would be fine with a much slower process if the answers were as good. Something like Qwen 235B could, I believe, serve as the foundation for such an agent.
- A local vision LLM that can reliably read any human-legible document. Maverick is quite good, but not nearly as good as Gemini Pro or ChatGPT at this.
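For the voice mode point, the kind of glue I have in mind looks roughly like this. It's only a sketch, assuming a local OpenAI-compatible server (llama.cpp, Ollama, vLLM, etc.) on port 8080 and faster-whisper for STT; the `speak()` function is a placeholder for whatever TTS engine you prefer, and none of the names or ports are anything but illustrative.

```python
# Minimal push-to-talk voice loop: record -> STT -> local LLM -> TTS.
# Assumes a local OpenAI-compatible server on port 8080 and faster-whisper;
# speak() is a placeholder for whatever TTS engine you prefer.
import sounddevice as sd
from faster_whisper import WhisperModel
from openai import OpenAI

stt = WhisperModel("small.en", compute_type="int8")
llm = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def record(seconds=5, sr=16000):
    """Grab a short utterance from the default microphone."""
    audio = sd.rec(int(seconds * sr), samplerate=sr, channels=1, dtype="float32")
    sd.wait()
    return audio.flatten()

def speak(text):
    """Placeholder: swap in Kyutai/Chatterbox/Sesame or any other TTS here."""
    print(f"[TTS] {text}")

history = [{"role": "system", "content": "You are a concise voice assistant."}]
while True:
    segments, _ = stt.transcribe(record())
    user_text = " ".join(s.text for s in segments).strip()
    if not user_text:
        continue
    history.append({"role": "user", "content": user_text})
    reply = ""
    # Stream tokens; a real pipeline would hand each finished sentence to TTS
    # immediately instead of waiting for the full reply like we do here.
    for chunk in llm.chat.completions.create(model="local", messages=history, stream=True):
        reply += chunk.choices[0].delta.content or ""
    history.append({"role": "assistant", "content": reply})
    speak(reply)
```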
What else am I forgetting about?
ArsNeph@reddit
Are you sure about the vision part? According to most benchmarks, Qwen 2.5 VL 72B is better than GPT-4o at OCR tasks, and even in real-world usage it seems to be roughly on par. Have you tried it?
SlowFail2433@reddit
It's Gemini that leads in vision.
ArsNeph@reddit
No, I mean specifically OCR tasks, because that's what OP is referring to. If you look at VLM leaderboards and sort by things like OCRBench, you'll see it's one of the top models.
nomorebuttsplz@reddit (OP)
I find it cannot tell which boxes are checked on a scan of a document, which makes it useless for my use case.
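For context, the kind of check I'm running looks roughly like this; a sketch only, assuming the vision model is served behind a local OpenAI-compatible endpoint (e.g. via vLLM or llama.cpp), with the model name, port, and file all placeholders:

```python
# Ask a locally served vision model which checkboxes on a scanned form are
# ticked. A sketch only: the endpoint, model name, and file are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("scanned_form.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "List every checkbox on this form and say whether it is checked."},
        ],
    }],
    temperature=0,
)
print(response.choices[0].message.content)
```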
SlowFail2433@reddit
Sorry I misunderstood
OCR isn't an area I follow; it's interesting that Qwen is doing so well there.
Square-Onion-1825@reddit
I'm curious, what is your hardware setup? I'm buying a system that I will dedicate to AI dev and RAG work, but I'm kinda new at this...
nomorebuttsplz@reddit (OP)
I have a Mac Studio with 512 GB.
Mkengine@reddit
I don't want to say I have a solution, but here are some links that may help with your problems:
https://github.com/QwenLM/Qwen2.5-Omni
https://github.com/DavidZWZ/Awesome-Deep-Research
https://github.com/GiftMungmeeprued/document-parsers-list
What are your thoughts?
nomorebuttsplz@reddit (OP)
Thanks, I will check out 2 and 3.
Qwen Omni may be decent if you can get it running, which seems like a pain.
But what is needed is something modular, so you can use a more intelligent model, as that one is at about the level of an average 4B text-only LLM.
Mkengine@reddit
Maybe Unmute could fill this modularity gap? https://kyutai.org/2025/05/22/unmute.html
Otherwise I think we have to wait a bit for new releases.
offlinesir@reddit
We are missing text-to-video at the level of Veo 2, 3, or even Sora. We are also missing personalized LLMs at the level of OpenAI's memory feature.
SlowFail2433@reddit
I found OpenAI's usage of the memory feature crude, but maybe I was unlucky. It would shoehorn memories in too much.
offlinesir@reddit
Yeah, it kinda sucks. I turned it off. But looking towards the future, personalization is likely going to be more important. Hopefully not for advertising.
SlowFail2433@reddit
Some future better version could be good yeah
05032-MendicantBias@reddit
Local is lagging behind in high RAM size and bandwidth hardware.
It's cool that researchers are trying to push performance up, but it's more important for real world applications to push hardware requirements down.
E.g., where are the sparse models that take advantage of CPUs' low-latency random-access execution and run in regular RAM?
SlowFail2433@reddit
People care enormously more about throughput than latency, which only dictates time to first token
LevianMcBirdo@reddit
The deep research part is probably mostly a setup problem. Perplexity uses R1 for its own deep research and it's at least on par with o3's web search, probably better. Their Labs feature isn't bad either, and it probably mostly uses R1.
nomorebuttsplz@reddit (OP)
I know there's no purely technical reason why we can't have a good local deep research workflow, but I don't know what setup is the best.
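The shape of what I'm imagining is something like the loop below. Just a sketch, assuming a local OpenAI-compatible server plus duckduckgo-search and trafilatura for retrieval; the packages, endpoint, and round count are all placeholder choices, not a recommendation of the best setup.

```python
# Bare-bones "deep research" loop: the model proposes queries, we fetch and
# strip pages, and it synthesizes an answer with sources at the end.
# A sketch only; the endpoint, packages, and iteration count are illustrative.
from openai import OpenAI
from duckduckgo_search import DDGS
import trafilatura

llm = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def ask(prompt):
    resp = llm.chat.completions.create(
        model="local", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def research(question, rounds=3):
    notes = []
    for _ in range(rounds):
        # Let the model decide what to look up next, given its notes so far.
        query = ask(f"Question: {question}\nNotes so far: {notes}\n"
                    "Reply with ONE web search query that would help most.")
        for hit in DDGS().text(query, max_results=3):
            page = trafilatura.fetch_url(hit["href"]) or ""
            text = trafilatura.extract(page) or ""
            notes.append({"url": hit["href"],
                          "note": ask(f"Summarize anything relevant to "
                                      f"'{question}' in:\n{text[:6000]}")})
    return ask(f"Answer '{question}' using these notes, citing URLs:\n{notes}")

print(research("What changed between Qwen 2.5 and Qwen 3?"))
```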
SlowFail2433@reddit
IDK if the way the closed-source deep research tools do it is even the best approach.
Uncle___Marty@reddit
Honestly, agree. I've been REALLY looking forward to finding a smallish model that I can just use voice with. I get bored typing and reading when I know I don't *really* need to. There are amazing options out there, but cobbling them together is a PITA. Let's hope someone makes a mini ChatGPT with proper TTS/STT built in natively.
Lorian0x7@reddit
I wish for a local model that can run with 8 GB of VRAM, or locally on a smartphone, and is as good as 4o.
pumukidelfuturo@reddit
This is all that matters.
My_Unbiased_Opinion@reddit
What I want is web search as a tool call inside the reasoning chain, so the model can get information directly from the web while it reasons. That should massively increase quality without needing any more VRAM.
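Something roughly like this, as a sketch: the model is prompted to emit a `<search>` tag whenever it needs facts, generation stops at the tag, the search runs, the results get spliced back in, and the model keeps reasoning. The tag convention, endpoint, and search helper here are invented for illustration, not any model's native tool-calling format.

```python
# Search-in-the-loop reasoning sketch: stop generation at a <search> tag, run
# the query, append results, and let the model keep reasoning. The tag
# convention, endpoint, and helper are invented for illustration only.
import re
from openai import OpenAI
from duckduckgo_search import DDGS

llm = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

SYSTEM = ("Think step by step. When you need facts from the web, emit "
          "<search>your query</search> and stop; results will be provided.")

def web_snippets(query, k=3):
    return "\n".join(r["body"] for r in DDGS().text(query, max_results=k))

def reason_with_search(question, max_hops=4):
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    out = ""
    for _ in range(max_hops):
        out = llm.chat.completions.create(
            model="local", messages=messages,
            stop=["</search>"]).choices[0].message.content
        messages.append({"role": "assistant", "content": out})
        m = re.search(r"<search>(.*)$", out, re.S)
        if not m:  # no search requested, so this is the final answer
            return out
        messages.append({"role": "user", "content":
                         f"<results>\n{web_snippets(m.group(1).strip())}\n</results>"})
    return out

print(reason_with_search("Which open-weight models were released this week?"))
```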
Secure_Reflection409@reddit
Hardware is our biggest issue.
dinerburgeryum@reddit
Seconding real-time voice mode. I kind of agree with your point that all the pieces are there; we have Qwen Omni, which absolutely supports streaming ingest and generation, but none of the code released demonstrates its usage. This is unfortunately a software problem, and not one that's easy to overcome.
SlowFail2433@reddit
For LLMs: vision abilities and long-context abilities. Also image and audio generation by LLMs.
For diffusion, we are just not close to parity at all.