QWEN3.6 + ik_llama is fast af
Posted by _BigBackClock@reddit | LocalLLaMA | View on Reddit | 33 comments
running qwen3.6 UD-Q4_K_M on 16GB VRAM + 32GB RAM with 200k context window @ 50+ tok/s
Pretend_Engineer5951@reddit
Fresh llama.cpp build on AMD Strix Halo, huge context (agentic code work), Qwen 3.6 UD-Q8_K_XL:
pp (143.6k): 8.22 tokens per second
tg (2343): 27.26 tokens per second
philnm@reddit
this is inspirational, thanks for sharing. what IDE is that? looks very clean.
libregrape@reddit
Looks like zed
https://flathub.org/apps/dev.zed.Zed
ikmalsaid@reddit
vs regular llama.cpp? do a speed comparison please
libregrape@reddit
I get around 52tps qwen3.6 Q4_K_S on rtx 5060 ti 16gb with 262144 context window and latest regular cuda llama.cpp build
Opteron67@reddit
170 tok/s on dual 5090 with vllm, 2K tok/s on batch
Opteron67@reddit
but it degrades fast as context grows
oxygen_addiction@reddit
Try TheTom's llama.cpp turbo quant branch
Opteron67@reddit
that's on vllm. i will try llama.cpp
R_Duncan@reddit
On non-Core-Ultra hardware it seems slower. Every time you post these reports I recompile ik_llama to compare with plain llama.cpp, and every time the claims turn out to be delusional.
AcrobaticChain1846@reddit
Running Q2_K_XL because Q5_K_M models feel dumb to me. IDK how this is possible: same settings, same context size and all, but quality is better in Q2. How?
LateGameMachines@reddit
Fast indeed. I'm on 4070 Ti 12GB VRAM, 64 GB RAM. llama.cpp 140K ctx at 57 tok/s.
MuscleStriking9756@reddit
can anyone pls tell if it can run on 4gb vram, i am poor
usuallyalurker11@reddit
I'm running on a 32GB RAM LPDDR5X 8533MT/s laptop (no dedicated GPU) and Qwen 3.6 yields ~25 t/s which is quite amazing
Acceptable_Home_@reddit
I get the same on a 4060 8GB with 24GB 5600MT/s DDR5 RAM (Q4_K_M). Am I missing something, or is this genuinely all my system can give (at a 42k ctx window with KV cache at Q8)?
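Context size is where much of that memory goes. A back-of-envelope KV-cache estimate for the 42k-context, Q8-cache setup above (layer count, KV-head count, and head dim are assumptions borrowed from Qwen3-30B-A3B; check your own GGUF's metadata):

```python
# Rough KV-cache size estimate for a GQA transformer.
# Dimensions are ASSUMPTIONS taken from Qwen3-30B-A3B
# (48 layers, 4 KV heads, head dim 128); verify against your model.

def kv_cache_bytes(ctx_tokens, n_layers=48, n_kv_heads=4, head_dim=128,
                   bytes_per_elem=2.0):
    """K and V each store n_layers * n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_tokens

GiB = 1024 ** 3
# q8_0 stores ~8.5 bits per value (8-bit quants + a scale per block of 32)
for label, bpe in [("f16", 2.0), ("q8_0", 1.0625)]:
    size = kv_cache_bytes(42_000, bytes_per_elem=bpe)
    print(f"{label}: {size / GiB:.2f} GiB for a 42k context")
```

On these assumed dimensions, a 42k f16 cache is close to 4 GiB, which is why quantizing it matters on an 8GB card.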
usuallyalurker11@reddit
We have the same model. Here's my setup:
I use LM Studio and toggle off "Try mmap()" and "Keep Model in Memory". GPU Offload 40 (max). 30k context window. Number of experts = 4. Everything else keeps default.
Memory usage: ~27-28 / 31.6GB system RAM.
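For anyone not on LM Studio, those toggles map roughly onto llama.cpp server flags. A sketch (the model filename and the GGUF metadata key for the expert count are assumptions; check `llama-server --help` and your model's metadata before copying):

```shell
# Rough llama.cpp equivalent of the LM Studio settings above:
#   "Try mmap()" off        -> --no-mmap
#   GPU Offload 40          -> -ngl 40
#   30k context window      -> -c 30000
#   Number of experts = 4   -> --override-kv (key name is an assumption)
llama-server \
  -m ./Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --no-mmap \
  -ngl 40 \
  -c 30000 \
  --override-kv qwen3moe.expert_used_count=int:4
```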
Hytht@reddit
Linux or Windows?
sanjxz54@reddit
Did you notice any degradation of quality because of 4 experts? That's like A1.5B instead of A3B for this model
Divergence1900@reddit
prompt processing speed?
usuallyalurker11@reddit
Prompt: Can you make a snake game for me?
Thought for 2 minutes 48 seconds
20.73 tok/s. 5547 tokens. 0.58s
Divergence1900@reddit
by prompt processing i mean how long does it take for it to process a huge prompt (say 1000+ tokens in the prompt)
HopePupal@reddit
which CPU?
usuallyalurker11@reddit
Intel Core Ultra 7 258V 32 GB RAM LPDDR5X 8533MT/s running Qwen 3.6 35B A3B Q4_K_M
HopePupal@reddit
ah, i've heard good things about Lunar Lake
usuallyalurker11@reddit
It's power efficient too. The iGPU drew about 6-10W while answering prompts
Dry_Investment_4287@reddit
does it use TurboQuant?
usuallyalurker11@reddit
I will be honest, I am not familiar with the term. Are you referring to "K cache quantization" and "V cache quantization" in LM Studio?
I had them off, but turning them on at F16, Q4_0, or Q4_1 didn't change much in terms of t/s
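For reference, those LM Studio toggles correspond to llama.cpp's cache-type flags. A sketch (flash attention is generally needed for a quantized V cache; flag spellings may differ between llama.cpp versions, so check `--help`):

```shell
# Quantize the KV cache to q8_0 to roughly halve its memory footprint:
llama-server \
  -m ./model.gguf \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```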
TommarrA@reddit
Does ik_llama have precompiled binaries like koboldcpp or docker?
HopePupal@reddit
it takes less than 5 minutes to build either mainline llama.cpp or ik_llama from source. maybe an extra 5 if you have to download the entire CUDA SDK first.
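The build recipe is the same CMake flow as mainline llama.cpp. A minimal sketch for a CUDA build (swap the backend flag for your hardware, e.g. Vulkan or ROCm):

```shell
# Build ik_llama.cpp from source with CUDA support:
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```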
srigi@reddit
It does, but nothing official - https://github.com/Thireus/ik_llama.cpp/releases
Always be careful when running alien code on your machine
Ill_Evidence_5833@reddit
Sharing your command would be amazing
No-Anchovies@reddit
Similar spec - finally managed to install qwen 3.6, it's significantly faster than the 30B 3.5!
JsThiago5@reddit
Can you share the parameters and the card model?