Qwen3.5 35B is surely still one of the best local models (punching above its weight) - More Details
Posted by dreamai87@reddit | LocalLLaMA | View on Reddit | 47 comments
Last time I posted about how this model performed at creating a webapp from a provided research paper. I got so much love - it was great to see that people appreciated the post and, of course, the potential of this MoE model.
I am sharing details on how I used this model to create the webapp just by prompting and guiding it step by step. Later I converted my guidance steps into skills using the same qwen-code CLI with this model, which helped me add more examples.
Here is the GitHub repo where I have added the research-webapp-skill, which you can all use to validate the potential of this model on different papers.
I have added examples in the repo research-webapp-skill/examples at main · statisticalplumber/research-webapp-skill
Below is the command that I use to run this model on a 16GB VRAM RTX 5080 Laptop:
:: Set the model path
set MODEL_PATH=C:\Users\test\.lmstudio\models\unsloth\Qwen3.5-35B-A3B-GGUF\Qwen3.5-35B-A3B-UD-Q4_K_L.gguf
echo Starting Llama Server...
echo Model: %MODEL_PATH%
llama-server.exe -m "%MODEL_PATH%" --chat-template-kwargs "{\"enable_thinking\": false}" --jinja -fit on -c 90000 -b 4096 -b 1024 --reasoning off --presence-penalty 1.5 --repeat-penalty 1.0 --temp 0.6 --top-p 0.95 --min-p 0.0 --top-k 20 --context-shift --keep 1024
if %ERRORLEVEL% NEQ 0 (
echo.
echo [ERROR] Llama server exited with error code %ERRORLEVEL%
pause
)
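For anyone unfamiliar with the sampler flags in that command (--temp, --top-k, --top-p, --min-p), here is a rough, illustrative sketch of the filtering chain they configure. This is a simplification for intuition only - llama.cpp's actual sampler ordering and implementation differ:

```python
import math

def filter_logits(logits, top_k=20, top_p=0.95, min_p=0.0, temp=0.6):
    """Toy sketch of a top-k / top-p / min-p sampler chain.
    Returns renormalized probabilities over the surviving token ids."""
    # temperature scaling, then softmax
    scaled = [l / temp for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # top-k: keep only the k most likely tokens
    order = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    # top-p: keep the smallest prefix whose cumulative mass reaches top_p
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # min-p: drop tokens below min_p * (max prob); a no-op when min_p == 0
    cutoff = min_p * probs[kept[0]]
    kept = [i for i in kept if probs[i] >= cutoff]
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}
```

With min-p at 0.0 (as in the command above), only top-k and top-p actually prune the distribution.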
I have tried Gemma 4 26B MoE; it's not able to build the app, whereas Qwen keeps hold of the context even at 70-80K. I tried the latest Jinja template for Gemma 4 and the latest models from Unsloth, but it still can't pull off this task.
Again, I might be doing something wrong somewhere, as I like that model too - I use it with the llama-server native UI for other tasks.
Thanks
No_Split_5652@reddit
Please, can you guys help me with this project: https://github.com/ChrisX101010/training-arena 🙏❤️ (see https://github.com/abubakarsiddik31/axiom-wiki for reference). I would appreciate it.
pauloeavf@reddit
!RemindMe 2 weeks
xeeff@reddit
mind testing a specific quant (https://huggingface.co/byteshape/Qwen3.5-35B-A3B-GGUF/blob/main/Qwen3.5-35B-A3B-Q3_K_S-2.89bpw.gguf) for me and seeing how it performs in your benchmark? fits nicely within my 16gb vram and 128k context (turbo3 KV cache), and i'm wondering if it's as capable as higher quants
would appreciate you getting back to me :)
dreamai87@reddit (OP)
Sure man, will do and update you as well.
xeeff@reddit
remindme! 3d
enrique-byteshape@reddit
We would love to see this too!
xeeff@reddit
didn't expect to see you here, hi :p
27b would be crazy 🙏
enrique-byteshape@reddit
Maybe we are, maybe we aren't already working on it :)
xeeff@reddit
qwen3.6 35b a3b released an hour ago. what a shame it would be if your to-do list increased by one ;)
enrique-byteshape@reddit
We sometimes think the Qwen team are working against us :(
xeeff@reddit
i doubt there'll be as good of a model as qwen3.6 for quite a while, so you have time to catch up. we believe in you 🙏
Defilan@reddit
Been running this on dual 5060 Ti's and yeah it punches way above its weight for a 3B active model. How are you fitting 90K context on 16GB VRAM though? That seems super tight with Q4_K_L.
dreamai87@reddit (OP)
-fit on manages it - the experts stay on the GPU along with the cache, and the rest of the layers go to the CPU.
Defilan@reddit
Ahhh got it. That's a good way to go about it with the hybrid offloading for MoE. Cool stuff!
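The hybrid offload described above can be pictured with a toy placement planner: attention weights and KV cache stay on the GPU, and expert FFN weights (the bulk of an MoE) spill to CPU RAM once the VRAM budget runs out. This is only a sketch of the idea with made-up sizes, not llama.cpp's actual fitting logic:

```python
def plan_offload(layers, vram_budget_gb):
    """Toy MoE offload plan. `layers` is a list of dicts with
    'attn_gb' (attention weights, always on GPU here) and
    'experts_gb' (expert FFN weights, spilled to CPU when VRAM runs out)."""
    attn_total = sum(l["attn_gb"] for l in layers)
    budget = vram_budget_gb - attn_total  # VRAM left for experts
    gpu_experts, cpu_experts = [], []
    for i, layer in enumerate(layers):
        if layer["experts_gb"] <= budget:
            budget -= layer["experts_gb"]
            gpu_experts.append(i)   # this layer's experts fit on the GPU
        else:
            cpu_experts.append(i)   # this layer's experts run from CPU RAM
    return gpu_experts, cpu_experts
```

Since only a few experts are active per token (3B active out of 35B here), the CPU-resident experts hurt throughput far less than offloading dense layers would.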
henk717@reddit
For me the 27B is my favorite model currently, way better than the 35B, and unlike Gemma it writes long when I ask it to. It's just a model that gets me. If only the 3.6 weren't a hybrid model and fixed the looping issue. The hybridness of it is the only quirk that makes it trickier to use.
dreamai87@reddit (OP)
Sure, it should be, since the 27B is a dense model. I tried the IQ3_XXS model today - it's good.
DanielusGamer26@reddit
Why not use the 27B UD IQ3_XXS? I run it on an RTX 5060 Ti and it seems more intelligent even at 3-bit.
I run it with this command:
`--threads 9 --ctx-size 64385 -fa 1 --jinja -ctk q8_0 -ctv q8_0 -np 1` + all the others parameters like temp, min p etc.
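The -ctk q8_0 -ctv q8_0 flags in that command quantize the KV cache to roughly 1 byte per element instead of 2 for f16, which is what makes a 64K context fit on a 16GB card. A back-of-the-envelope sizing sketch (the architecture numbers in the test are made up for illustration, not Qwen's real config):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_el):
    """Approximate KV cache size: one K and one V vector per layer,
    per KV head, per context position. Ignores quantization block
    overhead (q8_0 is slightly more than 1 byte/element in practice)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_el
```

So switching the cache from f16 (2 bytes/element) to q8_0 (~1 byte/element) roughly halves KV memory, at a small quality cost.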
dreamai87@reddit (OP)
https://i.redd.it/eehqw4upsivg1.gif
Here is the output from the Qwen 27B quant you suggested - it's good, holding up well.
dreamai87@reddit (OP)
I don’t know, it’s just a habit of using models at or above 4-bit. Will give it a try. Let me know how it works with the skill I shared.
reddoca@reddit
!RemindMe 2 weeks
Life-Screen-9923@reddit
IMHO, the "context_shift" option does not work for Qwen3.5 models.
dreamai87@reddit (OP)
To be honest, I have that setting as a common one I use for other models. I just had it there.
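For context, --context-shift (with --keep 1024 as in the command in the post) handles context overflow roughly like this simplified sketch - keep the first n_keep tokens (typically the system prompt) and drop the oldest tokens after them. The real llama.cpp implementation shifts the KV cache in place rather than re-tokenizing:

```python
def context_shift(tokens, n_ctx, n_keep):
    """Simplified context shift: when the window overflows, preserve the
    first n_keep tokens and evict the oldest tokens after that prefix."""
    if len(tokens) <= n_ctx:
        return tokens  # still fits, nothing to evict
    overflow = len(tokens) - n_ctx
    return tokens[:n_keep] + tokens[n_keep + overflow:]
```

Whether this interacts badly with a specific model's attention scheme (as suggested above for Qwen3.5) is a separate question from the mechanism itself.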
Most-Trainer-8876@reddit
How does it compare with Gemma 4 26B A4B?
dreamai87@reddit (OP)
I mentioned it at the bottom of the post. It's a good model but not good at calling multiple tools.
Havage@reddit
As someone applying AI to research specifically - Thank you! Going to play with this in the morning!
dreamai87@reddit (OP)
Please check the repo where I have posted all the webapp examples that I created using the skill.
saito_zt81@reddit
Same here. It works really fast on my 3090 Ti, ~100 tps. I tried Gemma 4 26B; it's a little slower, but tool calling is unusable and fills the context window with failures.
dreamai87@reddit (OP)
I agree - Gemma is good but really fails at calling the tools.
admajic@reddit
Why do you have -b twice? Also, 4096 uses a lot of VRAM, which could be why you can't get other models to load.
dreamai87@reddit (OP)
Sorry, the 1024 should be -ub: -b 4096 -ub 1024
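For readers wondering how -b and -ub relate: as I understand it, -b (batch) caps how many tokens are scheduled per decode call, while -ub (micro-batch) caps how many actually go through the compute graph at once, which is the VRAM-heavy number. A rough sketch of that splitting, under that assumption:

```python
def microbatches(prompt_tokens, n_batch=4096, n_ubatch=1024):
    """Split a prompt into logical batches of n_batch tokens, each
    further split into micro-batches of n_ubatch tokens that would be
    processed in one forward pass."""
    for b in range(0, len(prompt_tokens), n_batch):
        batch = prompt_tokens[b:b + n_batch]
        for u in range(0, len(batch), n_ubatch):
            yield batch[u:u + n_ubatch]
```

So a 5000-token prompt with -b 4096 -ub 1024 would be processed as four 1024-token passes plus the remainders, and lowering -ub is the cheaper way to save VRAM.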
ResponsibleTruck4717@reddit
How good is it compare to the 9b?
dreamai87@reddit (OP)
The 9B makes more mistakes compared to this model. But the 9B is good too.
ResponsibleTruck4717@reddit
Really? I thought the 9B outperformed the 35B A3B.
Going to test it now :)
External_Dentist1928@reddit
So you use that skill within the Qwen Coder CLI?
dreamai87@reddit (OP)
Yes, the qwen-code CLI with this model and the shared skill.
qubridInc@reddit
Qwen3.5 35B is still insanely good for local use - it handles long context and real tasks way better than most models its size.
dreamai87@reddit (OP)
Yes, agreed - it's always on task and follows the prompt a lot better.
Mir4can@reddit
Your server settings are a bit mixed. Normally Qwen suggests one of these two sets:
temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
Is it intentional?
dreamai87@reddit (OP)
I have reasoning disabled, and I have noticed this model is good even with those parameters mixed when thinking is disabled.
BannedGoNext@reddit
Makes sense. Qwen 3 Coder next has no reasoning, and it's a beast at programming.
Imaginary-Unit-3267@reddit
35B is my daily driver for most tasks because of your previous post! Thank you!
dreamai87@reddit (OP)
Thank you dear 🙏
silenceimpaired@reddit
Isn’t the 35B dense? That would be far better than the MoE. Did you compare against the dense Gemma 31B?
floconildo@reddit
35B is MoE
iphoneverge@reddit
That looks impressive. Thanks for sharing all this info. How quick is it on your laptop with 16GB VRAM? Also if you had to compare, what commercial LLM model would you say this is closest to in terms of capability and speed? Thanks.
dreamai87@reddit (OP)
It’s 700 to 800 t/s prompt processing and 58 t/s token generation.