What it feels like to have Qwen 3.6 or Gemma 4 running locally
Posted by GodComplecs@reddit | LocalLLaMA | 40 comments
Well, or pretty close to it, they are excellent workhorses. I run them in real work scenarios, doing some of the work I used to do myself as a skilled expert in my field, billing $200 an hour. Ofc the key is building a system around their weaknesses, and I've had LLM systems doing expert work for years, ever since the first ones came out (shout out nous hermes 2 mistral!).
But yeah, pretty neat, especially with noonghunna's club-3090: you can have 3.6 27B fly on a single 3090.
SkyFeistyLlama8@reddit
I think you just removed a reason to bill $200 an hour. Someone else can come along and do the same work with an LLM at $100 per hour, then $50, then $25, then burger-flipping money.
Actually it'll be worse. Some cloud giant will give away the capability for free as part of a larger subscription package.
thrownawaymane@reddit
They really are going to eat all of labor.
Any "knowing what button to push" job will get massively devalued if it hasn't already.
RetroPeel2025@reddit
Gemma4 is great for translation and creative writing.
Qwen3.6 outputs great games. I don't know what black magic they did to make the smaller models that capable in making cool games for the browser.
I remember when all we had was an unquantized Pygmalion. Have 5 years passed yet? I don't think so, right? Kinda reminds me of how fast games used to improve in the 90s. Each year there were so many improvements.
slvrsmth@reddit
Careful with Gemma4 translating to languages you don't understand, especially smaller ones.
I ran a small test with my native Latvian. 31B understands the inputs well, but the outputs are 2010-Google-Translate level. I'm talking multiple spelling mistakes per word, and brutal direct translation of idioms. When specifically prompted to review its output for correctness, it could identify and fix ~half of the mistakes.
I'm sure the picture is better with, say, Spanish, Portuguese or French, due to the abundance of training material.
I had high hopes for Gemma4, given Gemini 2.5 and later are actually halfway decent with Latvian, and Gemma3 was previously the best open model. Now I'm looking forward to what Mistral is cooking up next.
RetroPeel2025@reddit
Interesting. I was using gemma4 to translate old PC-98 manuals and convert them to English.
For Japanese it's great. Of course it still makes mistakes sometimes, but in general it's pretty solid.
I have gemma4 26b q4 not just translate the manuals but also draw the boxes. On some pages it did a poor job with the boxes, but in general I would say the translation is great. It appears to be different for each language. I had the impression that Google would be king in all sorts of languages. huh
2 examples of the translations:
https://litter.catbox.moe/jn5rng.html
https://litter.catbox.moe/242wye.html
The 31b one is probably even better, but it's too slow for that kind of stuff.
technobird22@reddit
Woah, that's amazing! May I ask how you got it to produce the bounding boxes for the text? Are you able to just prompt it and it'll find and return coordinates for you? That really is so impressive that we can have this kind of ability locally, especially for such a general model.
RetroPeel2025@reddit
Yeah, it's pretty crazy, right? Especially since it's the MoE one with 4B experts at q4.
I just prompt it! Everything is automated.
a.) Extract a PNG image for each PDF page
b.) Send that image to gemma4 with the prompt below
c.) Save that XML, and once I have a translation I recreate the PDF as HTML
d.) Make the overlay with translation and positioning from gemma4
My GUI sucks and doesn't display the boxes properly. Still, pretty cool. Manuals with like 20 pages take 5 minutes on my 5060ti. (gotta offload a bit because of the mmproj file)
Hope it's useful to you!
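In case it helps, here is a rough Python sketch of the pipeline (a-d). This is only an illustration, not the exact script from this thread: PyMuPDF for page rendering is an assumption, the gemma4 call is stubbed out, and the real prompt, XML schema and overlay code are not shown.

import fitz  # PyMuPDF, one way to render PDF pages to PNG

def render_pages(pdf_path, dpi=200):
    # a) extract a PNG image for each PDF page
    paths = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            out = f"page_{i:03d}.png"
            page.get_pixmap(dpi=dpi).save(out)
            paths.append(out)
    return paths

def gemma4_boxes(png_path):
    # b/c) stand-in for the model call: send the page image to the local
    # gemma4 server and parse the returned XML into translated text + boxes
    raise NotImplementedError("call the local llama.cpp server here")

def page_to_html(png_path, boxes):
    # d) overlay the translations as absolutely positioned divs on the page image
    divs = "".join(
        f'<div style="position:absolute;left:{b["x"]}px;top:{b["y"]}px;'
        f'width:{b["w"]}px;height:{b["h"]}px;">{b["text"]}</div>'
        for b in boxes
    )
    return f'<div style="position:relative;"><img src="{png_path}">{divs}</div>'

if __name__ == "__main__":
    for png in render_pages("manual.pdf"):
        print(page_to_html(png, gemma4_boxes(png)))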
technobird22@reddit
That's so awesome, I had no idea that such a small (general) local model could produce this level of quality of results in something so specific, thank you for sharing!
Out of curiosity, how did you set up the front end? Was it also created agentically?
RetroPeel2025@reddit
Thanks!
About how I vibe coded all this:
This might be very controversial because agents are the craze these days...
I don't like those huge sysprompts since they degrade the output. I did the good ol' copy/paste of the source, and only provide the parts that are needed. Always API. That's just me though.
Gemma4 was strong enough for a Python test script. As in: link a PNG, get the XML from my llama.cpp server. Then I check manually if the positions are correct.
For the code part, where I dynamically create the HTML that converts the XML to the overlays, I used GPT 5.5.
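The per-page call itself can be very small. A minimal sketch, assuming llama-server's OpenAI-compatible /v1/chat/completions endpoint with a vision model loaded via --mmproj; the port, prompt and output format here are placeholders, not the ones actually used in this thread.

import base64
import requests

SERVER = "http://localhost:11338/v1/chat/completions"  # assumed local llama-server
PROMPT = "Extract every text block on this page as XML with bounding boxes."  # placeholder

def page_to_xml(png_path):
    # encode the rendered page and send it as an OpenAI-style image_url part
    with open(png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "temperature": 0.2,
    }
    resp = requests.post(SERVER, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(page_to_xml("page_01.png"))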
soldture@reddit
Finally, my computer has become a powerful machine that can not only help me with calculations, but also with knowledge, refining ideas and even code! I use these models locally on a daily basis now, and they are really good.
VEHICOULE@reddit
Well, you should try task-specific fine-tuned super small models like the Granites and Nemotrons. They beat even frontier models at literally no cost, and you can load them on demand or manage them through an agent orchestrator like the new multimodal Nemotron model.
Ell2509@reddit
Care to share some model names and uses?
GodComplecs@reddit (OP)
What kind of work is it used for? My uses are answering questions concerning software logic, general business stuff like bookkeeping, and other expert-knowledge-based problems.
phenotype001@reddit
I left an agent with Qwen 3.6 working overnight. I wake up, it still works. No looping on bullshit, no dumb decisions. It's a dream come true.
xeeff@reddit
what quant and context do you use? and 35b a3b or 27b?
phenotype001@reddit
27b q5_k_m. I gave it 120K context but it's slow as shit; 64K feels somewhat better.
ortegaalfredo@reddit
It will stabilize! It's under control control control control control control
L0ren_B@reddit
I was amazed yesterday after running some tests with 27B Q8 and 35B Q8!
I gave it my modem password and asked it to create a script to extract all the info (saw it done by someone on YouTube).
After about 1 hour and 128k tokens used, 27B was in!
35B failed even with help!
I ran the test twice, as LLMs are nondeterministic!
Gemini Flash aced it, but cheated by searching online for the endpoints and scripts. In a new session where I specifically forbade online research, it refused to continue after failing!
I can't wait for the new versions of Qwen! Hope they will copy DeepSeek's model of low VRAM usage at high context!
nakedspirax@reddit
What's your workflow like?
L0ren_B@reddit
CUDA_VISIBLE_DEVICES=0,1 ./llama-server \
-m /home/lolren/Local_LLMs/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q8_0.gguf \
--mmproj /home/lolren/Local_LLMs/Qwen3.6-27B-GGUF/mmproj-Qwen3.6-27B-BF16.gguf \
--mmproj-offload \
--alias Qwen \
--host 0.0.0.0 \
--port 11338 \
--ctx-size 262144 \
--parallel 1 \
--threads 4 \
--threads-batch 4 \
--batch-size 4096 \
--ubatch-size 1024 \
--gpu-layers all \
--device CUDA0,CUDA1 \
--split-mode layer \
--tensor-split 1,1 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-fa on \
--kv-offload \
--no-warmup \
--jinja \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--temp 0.6 \
--top-p 0.92 \
--top-k 20 \
--repeat-penalty 1.00 \
--n-predict 32768 \
--reasoning-budget 8192 \
--perf \
--metrics
The LLM is from the LM Studio download!
Also, running on 2x3090. But this morning I just discovered "https://github.com/noonghunna/club-3090/tree/master" so a lot of it can be improved, I am sure.
The 35B workflow is exactly the same, just changed the model name!
Tried it in vLLM but it fails on tool calling for me.
nakedspirax@reddit
Thank you. What coding agent did you use to crawl into your router and extract the data?
Cheers!!
L0ren_B@reddit
For 27B, PI worked amazingly. For some reason, on 35B, it failed multiple times!
So, for 35B, I used opencode. It failed to retrieve the data from the router. Then I ran the 27B with opencode, which succeeded again!
For the record, the LLM is not ready for complex work. I tried it on 2 projects that both Opus and GPT 5.5 handled and aced, and both 27B and 35B failed by deleting blocks of code from very long files (tens of thousands of lines). But so do Gemini Pro and Flash 3.0. They both failed for me (when the Pro preview was free to use in the CLI).
So, my honest take: we are about 1 year or so behind having the "power of the sun in the palm of our hands". But if nothing changes, and labs still give us OSS models (I doubt they will), we will be there.
nakedspirax@reddit
That's awesome, thanks for letting me know.
I am having the opposite problem to yours. 35b works like a dream and 27b times out.
Interesting that you were using PI to do the router task. I want to explore more of that route.
L0ren_B@reddit
Yes. For 27B, PI worked perfectly! For 35B it worked, but on the router task it started to output garbled data on reading the HTML file and stopped. For other tasks it works ok.
27B times out sometimes with vLLM in my case.
nakedspirax@reddit
I might try PI with 27b then. I was using 27b with opencode and the timeouts really do suck.
35b and opencode work great by the way. It has finished a dozen or so tasks for me, start to finish. No timeouts.
L0ren_B@reddit
Out of curiosity, what are your hardware/configs?
Devatator_@reddit
I'm still waiting for SLMs that are actually good and fast. By small I mean sub-1B. Actually I'd go up to 2B if they manage to make them run really fast (at least fast enough for my CPU; I want to run that thing permanently, even when gaming. I have RAM to spare, not VRAM).
Klutzy_Pin9611@reddit
The "building a system around their weaknesses" part is where most of the real work is. The model is maybe 20% of it — context management, fallback handling, and knowing which tasks to route where account for the rest.
I've found the gap between "this works in a demo" and "this is stable enough to touch real work" keeps shrinking with each generation. But it's still there.
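A minimal sketch of what that routing/fallback scaffolding can look like, with hypothetical model callables and a cheap validity check standing in for the real logic:

from typing import Callable

def run_with_fallback(task: str,
                      small: Callable[[str], str],
                      big: Callable[[str], str],
                      is_valid: Callable[[str], bool]) -> str:
    # try the cheap local model first, escalate only when the output fails a check
    out = small(task)
    if is_valid(out):
        return out
    return big(task)

# e.g. run_with_fallback(prompt, local_27b, cloud_frontier, looks_like_valid_json)
# (all three callables here are hypothetical stand-ins)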
Medium_Chemist_4032@reddit
> noonghunnas club 3090 and you can have 3.6 27B fly on a single 3090
Pardon? I'm a 3090 enthusiast, but haven't been able to break 60tps yet (even dflash goes 35 max, even if I turn off the SWA).
LeonidasTMT@reddit
I can but at Q3_XXS lol
HyperWinX@reddit
Some guy went to 80tps with vLLM.
Medium_Chemist_4032@reddit
mtp?
Nepherpitu@reddit
You need MTP and an int4 model. DFlash is a scam on long contexts. FP16 on 4x3090 breaks 100tps with MTP=3, up to 150tps for coding. I didn't try int4 for 27B, but for 122B the best one is autoround. It's fast as fuck, but I don't remember if you need to patch/compile vllm yourself or if the fix was merged.
Medium_Chemist_4032@reddit
122B is a speed demon; I pick 27B for the extended long-context work. Confirming that for my use cases (agentic knowledge-base crunching) the 1M RoPE extension works reliably enough.
dqUu3QlS@reddit
Of course. Without MTP, the theoretical maximum speed for the Q4_K_M quant on an RTX 3090 is 50-60 tokens/second.
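That ceiling is just memory bandwidth: a dense model reads every weight once per generated token. Rough back-of-the-envelope numbers, assuming ~4.8 bits/weight for Q4_K_M and ~936 GB/s peak bandwidth on a 3090:

params = 27e9                  # assumed 27B dense parameters
bits_per_weight = 4.8          # Q4_K_M averages a bit under 5 bits per weight
weights_gb = params * bits_per_weight / 8 / 1e9   # ~16 GB read per generated token
bandwidth_gbs = 936            # RTX 3090 peak memory bandwidth
print(bandwidth_gbs / weights_gb)                 # ~58 tokens/second upper bound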
miniocz@reddit
And here I am with 13tps on my p40...
GodComplecs@reddit (OP)
For a local dense model, 60 tps is flying! But yes, you can reach 140+ tps with MoE.
shovepiggyshove_@reddit
Idk, I don't feel like it's worth throwing 2k euro at a dual-3090 rig with a decent mobo for running these models. If they were at 2025 Sonnet level, then perhaps. I'm still on the fence about buying, but closer than ever.
314kabinet@reddit
Qwen 3.6 27B is just under Sonnet 4.5, which was the model that made me pull the trigger on agentic coding. I imagine in a few months a single 24GB card will be able to run a model of at least that level.
GodComplecs@reddit (OP)
No, just go single 3090 imo. I used to have a 4090, then dual 3090s, and now a single one; that is how good the models and systems have become.