Where local is lagging behind... Wish lists for the rest of 2025
Posted by nomorebuttsplz@reddit | LocalLLaMA | View on Reddit | 26 comments
It's been a great 6 months to be using local AI, as the performance delta has, on average, been very small for classic LLMs, with R1 typically at or near SOTA and smaller models consistently posting better and better benchmarks.
However, the items below are all areas where there has been a surprising lag between closed systems' release dates and the availability of high-quality local alternatives:
- A voice mode on par with ChatGPT's. Almost all the pieces are in place for something akin to 4o with voice: Sesame, Kyutai, or Chatterbox for TTS, any local model for the LLM, and decent STT already exists. We just need the parts put together in a fairly user-friendly, fast streaming package (a rough sketch of the glue I mean is below this list).
- Local deep research on the level of o3's web search. o3 is quite impressive now in its ability to rapidly search several web pages to answer questions. There are some solutions for local LLMs, but none that I've tried seem to fulfill the potential of web search agents with clever and easily customizable workflows. I would be fine with a much slower process if the answers were as good. Something like Qwen 235B could, I believe, serve as the foundation for such an agent.
- A local vision LLM that can reliably read any human-legible document. Maverick is quite good, but not nearly as good as Gemini Pro or ChatGPT at this.
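For the voice mode point, the kind of glue I have in mind looks roughly like this. It's only a sketch, assuming a local OpenAI-compatible server (llama.cpp, Ollama, vLLM, etc.) on port 8080 and faster-whisper for STT; the `speak()` function is a placeholder for whatever TTS engine you prefer, and none of the names or ports are anything but illustrative.

```python
# Minimal push-to-talk voice loop: record -> STT -> local LLM -> TTS.
# Assumes a local OpenAI-compatible server on port 8080 and faster-whisper;
# speak() is a placeholder for whatever TTS engine you prefer.
import sounddevice as sd
from faster_whisper import WhisperModel
from openai import OpenAI

stt = WhisperModel("small.en", compute_type="int8")
llm = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def record(seconds=5, sr=16000):
    """Grab a short utterance from the default microphone."""
    audio = sd.rec(int(seconds * sr), samplerate=sr, channels=1, dtype="float32")
    sd.wait()
    return audio.flatten()

def speak(text):
    """Placeholder: swap in Kyutai/Chatterbox/Sesame or any other TTS here."""
    print(f"[TTS] {text}")

history = [{"role": "system", "content": "You are a concise voice assistant."}]
while True:
    segments, _ = stt.transcribe(record())
    user_text = " ".join(s.text for s in segments).strip()
    if not user_text:
        continue
    history.append({"role": "user", "content": user_text})
    reply = ""
    # Stream tokens; a real pipeline would hand each finished sentence to TTS
    # immediately instead of waiting for the full reply like we do here.
    for chunk in llm.chat.completions.create(model="local", messages=history, stream=True):
        reply += chunk.choices[0].delta.content or ""
    history.append({"role": "assistant", "content": reply})
    speak(reply)
```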
What else am I forgetting about?
ArsNeph@reddit
Are you sure about the vision part? According to most benchmarks, Qwen 2.5 VL 72B is better than GPT-4o at OCR tasks, and even in real-world usage it seems to be roughly on par. Have you tried it?
SlowFail2433@reddit
It's Gemini that leads in vision.
ArsNeph@reddit
No, I mean specifically OCR tasks, because that's what OP is referring to. If you look at VLM leaderboards and sort by things like OCRBench, you'll see it's one of the top models.
nomorebuttsplz@reddit (OP)
I find it cannot tell which boxes are checked on a scan of a document, which makes it useless for my use case.
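For context, the kind of check I'm running looks roughly like this; a sketch only, assuming the vision model is served behind a local OpenAI-compatible endpoint (e.g. via vLLM or llama.cpp), with the model name, port, and file all placeholders:

```python
# Ask a locally served vision model which checkboxes on a scanned form are
# ticked. A sketch only: the endpoint, model name, and file are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("scanned_form.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "List every checkbox on this form and say whether it is checked."},
        ],
    }],
    temperature=0,
)
print(response.choices[0].message.content)
```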
SlowFail2433@reddit
Sorry I misunderstood
OCR isn't an area I follow; it's interesting that Qwen is doing so well there.
Square-Onion-1825@reddit
I'm curious, what is your hardware setup? I'm buying a system that I will dedicate to AI dev and RAG work, but I'm kinda new at this...
nomorebuttsplz@reddit (OP)
I have a Mac Studio with 512 GB.
Mkengine@reddit
I don't want to say I have a solution, but here are some links that may help with your problems:
https://github.com/QwenLM/Qwen2.5-Omni
https://github.com/DavidZWZ/Awesome-Deep-Research
https://github.com/GiftMungmeeprued/document-parsers-list
What are your thoughts?
nomorebuttsplz@reddit (OP)
Thanks, I will check out 2 and 3.
Qwen Omni may be decent if you can get it running, which seems like a pain.
But what is needed is something modular, so you can use a more intelligent model, as that one is at about the level of an average 4B text-only LLM.
Mkengine@reddit
Maybe Unmute could fill this modularity gap? https://kyutai.org/2025/05/22/unmute.html
Otherwise I think we have to wait a bit for new releases.
offlinesir@reddit
We are missing text-to-video at the level of Veo 2, 3, or even Sora. We are also missing personalized LLMs at the level of OpenAI's memory feature.
SlowFail2433@reddit
I found OpenAI's usage of the memory feature crude, but maybe I was unlucky. It would shoehorn memories in too much.
offlinesir@reddit
Yeah, it kinda sucks. I turned it off. But looking towards the future, personalization is likely going to be more important. Hopefully not for advertising.
SlowFail2433@reddit
Some future better version could be good yeah
05032-MendicantBias@reddit
Local is lagging behind in high RAM size and bandwidth hardware.
It's cool that researchers are trying to push performance up, but it's more important for real world applications to push hardware requirements down.
E.g., where are the sparse models that take advantage of CPUs' low-latency random-access execution and run in regular RAM?
SlowFail2433@reddit
People care enormously more about throughput than latency, which only dictates time to first token
LevianMcBirdo@reddit
The deep research part is probably mostly a setup problem. Perplexity uses R1 for its own deep research and it's at least on par with o3's web search, probably better. Their Labs feature isn't bad either, and it probably mostly uses R1.
nomorebuttsplz@reddit (OP)
I know there's no purely technical reason why we can't have a good local deep research workflow, but I don't know what setup is the best.
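The shape of what I'm imagining is something like the loop below. Just a sketch, assuming a local OpenAI-compatible server plus duckduckgo-search and trafilatura for retrieval; the packages, endpoint, and round count are all placeholder choices, not a recommendation of the best setup.

```python
# Bare-bones "deep research" loop: the model proposes queries, we fetch and
# strip pages, and it synthesizes an answer with sources at the end.
# A sketch only; the endpoint, packages, and iteration count are illustrative.
from openai import OpenAI
from duckduckgo_search import DDGS
import trafilatura

llm = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def ask(prompt):
    resp = llm.chat.completions.create(
        model="local", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def research(question, rounds=3):
    notes = []
    for _ in range(rounds):
        # Let the model decide what to look up next, given its notes so far.
        query = ask(f"Question: {question}\nNotes so far: {notes}\n"
                    "Reply with ONE web search query that would help most.")
        for hit in DDGS().text(query, max_results=3):
            page = trafilatura.fetch_url(hit["href"]) or ""
            text = trafilatura.extract(page) or ""
            notes.append({"url": hit["href"],
                          "note": ask(f"Summarize anything relevant to "
                                      f"'{question}' in:\n{text[:6000]}")})
    return ask(f"Answer '{question}' using these notes, citing URLs:\n{notes}")

print(research("What changed between Qwen 2.5 and Qwen 3?"))
```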
SlowFail2433@reddit
IDK if the way the closed-source deep research tools do it is even the best approach.
Uncle___Marty@reddit
Honestly, agree. I've been REALLY looking forward to finding a smallish model that I can just use voice with. I get bored typing and reading when I know I don't *really* need to. There are amazing options out there, but cobbling them together is a PITA. Let's hope someone makes a mini ChatGPT with proper TTS/STT built in natively.
Lorian0x7@reddit
I wish for a local model that can run with 8 GB of VRAM, or locally on a smartphone, and is as good as 4o.
pumukidelfuturo@reddit
This is all that matters.
My_Unbiased_Opinion@reddit
What I want is web search as a tool call inside the reasoning chain, so the model can get information directly from the web while it reasons. That should massively increase quality without needing any more VRAM.
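Something roughly like this, as a sketch: the model is prompted to emit a `<search>` tag whenever it needs facts, generation stops at the tag, the search runs, the results get spliced back in, and the model keeps reasoning. The tag convention, endpoint, and search helper here are invented for illustration, not any model's native tool-calling format.

```python
# Search-in-the-loop reasoning sketch: stop generation at a <search> tag, run
# the query, append results, and let the model keep reasoning. The tag
# convention, endpoint, and helper are invented for illustration only.
import re
from openai import OpenAI
from duckduckgo_search import DDGS

llm = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

SYSTEM = ("Think step by step. When you need facts from the web, emit "
          "<search>your query</search> and stop; results will be provided.")

def web_snippets(query, k=3):
    return "\n".join(r["body"] for r in DDGS().text(query, max_results=k))

def reason_with_search(question, max_hops=4):
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    out = ""
    for _ in range(max_hops):
        out = llm.chat.completions.create(
            model="local", messages=messages,
            stop=["</search>"]).choices[0].message.content
        messages.append({"role": "assistant", "content": out})
        m = re.search(r"<search>(.*)$", out, re.S)
        if not m:  # no search requested, so this is the final answer
            return out
        messages.append({"role": "user", "content":
                         f"<results>\n{web_snippets(m.group(1).strip())}\n</results>"})
    return out

print(reason_with_search("Which open-weight models were released this week?"))
```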
Secure_Reflection409@reddit
Hardware is our biggest issue.
dinerburgeryum@reddit
Seconding real-time voice mode. I kind of agree with your point that all the pieces are there; we have Qwen Omni, which absolutely supports streaming ingest and generation, but none of the code released demonstrates its usage. This is unfortunately a software problem, and not one that's easy to overcome.
SlowFail2433@reddit
For LLMs: vision abilities and long-context abilities. Also image and audio generation by LLMs.
For diffusion, we are just not close to parity at all.