QWEN3.6 + ik_llama is fast af
Posted by _BigBackClock@reddit | LocalLLaMA | View on Reddit | 33 comments
running qwen3.6 UD-Q4_K_M on 16GB VRAM + 32GB RAM with 200k context window @ 50+ tok/s
Pretend_Engineer5951@reddit
Fresh llama.cpp build on AMD Strix Halo, huge context (agentic code work), Qwen 3.6 UD-Q8_K_XL:
pp (143.6k): 8.22 tokens per second
tg (2343): 27.26 tokens per second
philnm@reddit
this is inspirational, thanks for sharing. what IDE is that? looks very clean.
libregrape@reddit
Looks like zed
https://flathub.org/apps/dev.zed.Zed
ikmalsaid@reddit
vs regular llama.cpp? do a speed comparison please
libregrape@reddit
I get around 52tps qwen3.6 Q4_K_S on rtx 5060 ti 16gb with 262144 context window and latest regular cuda llama.cpp build
Opteron67@reddit
170 tok/s on dual 5090 with vllm, 2K tok/s on batch
Opteron67@reddit
but it degrades fast as context grows
oxygen_addiction@reddit
Try TheTom's llama.cpp turbo quant branch
Opteron67@reddit
that's on vllm. i will try llama.cpp
R_Duncan@reddit
On non-Core-Ultra hardware it seems slower. Every time you post these reports I recompile ik_llama to compare with plain llama.cpp, and every time the claims turn out to be delusional.
AcrobaticChain1846@reddit
Running Q2_K_XL because Q5_K_M models feel dumb to me. IDK how this is possible: same settings, same context size and all, but quality is better in Q2. How?
LateGameMachines@reddit
Fast indeed. I'm on 4070 Ti 12GB VRAM, 64 GB RAM. llama.cpp 140K ctx at 57 tok/s.
MuscleStriking9756@reddit
can anyone pls tell if it can run on 4gb vram, i am poor
usuallyalurker11@reddit
I'm running on a 32GB RAM LPDDR5X 8533MT/s laptop (no dedicated GPU) and Qwen 3.6 yields ~25 t/s which is quite amazing
Acceptable_Home_@reddit
I get the same on a 4060 8GB with 24GB 5600MT/s DDR5 RAM (Q4_K_M). Am I missing something, or is this genuinely all my system can give (at a 42k ctx window with KV cache at Q8)?
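Context size is where much of that memory goes. A back-of-envelope KV-cache estimate for the 42k-context, Q8-cache setup above (layer count, KV-head count, and head dim are assumptions borrowed from Qwen3-30B-A3B; check your own GGUF's metadata):

```python
# Rough KV-cache size estimate for a GQA transformer.
# Dimensions are ASSUMPTIONS taken from Qwen3-30B-A3B
# (48 layers, 4 KV heads, head dim 128); verify against your model.

def kv_cache_bytes(ctx_tokens, n_layers=48, n_kv_heads=4, head_dim=128,
                   bytes_per_elem=2.0):
    """K and V each store n_layers * n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_tokens

GiB = 1024 ** 3
# q8_0 stores ~8.5 bits per value (8-bit quants + a scale per block of 32)
for label, bpe in [("f16", 2.0), ("q8_0", 1.0625)]:
    size = kv_cache_bytes(42_000, bytes_per_elem=bpe)
    print(f"{label}: {size / GiB:.2f} GiB for a 42k context")
```

On these assumed dimensions, a 42k f16 cache is close to 4 GiB, which is why quantizing it matters on an 8GB card.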
usuallyalurker11@reddit
We have the same model. Here's my setup:
I use LM Studio and toggle off "Try mmap()" and "Keep Model in Memory". GPU Offload 40 (max). 30k context window. Number of experts = 4. Everything else keeps default.
Memory usage: ~27-28 / 31.6GB system RAM.
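For anyone not on LM Studio, those toggles map roughly onto llama.cpp server flags. A sketch (the model filename and the GGUF metadata key for the expert count are assumptions; check `llama-server --help` and your model's metadata before copying):

```shell
# Rough llama.cpp equivalent of the LM Studio settings above:
#   "Try mmap()" off        -> --no-mmap
#   GPU Offload 40          -> -ngl 40
#   30k context window      -> -c 30000
#   Number of experts = 4   -> --override-kv (key name is an assumption)
llama-server \
  -m ./Qwen3.6-35B-A3B-Q4_K_M.gguf \
  --no-mmap \
  -ngl 40 \
  -c 30000 \
  --override-kv qwen3moe.expert_used_count=int:4
```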
Hytht@reddit
Linux or Windows?
sanjxz54@reddit
Did you notice any degradation of quality because of 4 experts? That's like A1.5B instead of A3B for this model
Divergence1900@reddit
prompt processing speed?
usuallyalurker11@reddit
Prompt: Can you make a snake game for me?
Thought for 2 minutes 48 seconds
20.73 tok/s. 5547 tokens. 0.58s
Divergence1900@reddit
by prompt processing i mean how long does it take for it to process a huge prompt (say 1000+ tokens in the prompt)
HopePupal@reddit
which CPU?
usuallyalurker11@reddit
Intel Core Ultra 7 258V 32 GB RAM LPDDR5X 8533MT/s running Qwen 3.6 35B A3B Q4_K_M
HopePupal@reddit
ah, i've heard good things about Lunar Lake
usuallyalurker11@reddit
It's power efficient too. The iGPU drew about 6-10W while answering prompts
Dry_Investment_4287@reddit
does it use TurboQuant?
usuallyalurker11@reddit
I will be honest, I am not familiar with the term. Are you referring to "K cache quantization" and "V cache quantization" in LM Studio?
I had them off, but turning them on at F16, Q4_0, or Q4_1 didn't change much in terms of t/s
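For reference, those LM Studio toggles correspond to llama.cpp's cache-type flags. A sketch (flash attention is generally needed for a quantized V cache; flag spellings may differ between llama.cpp versions, so check `--help`):

```shell
# Quantize the KV cache to q8_0 to roughly halve its memory footprint:
llama-server \
  -m ./model.gguf \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```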
TommarrA@reddit
Does ik_llama have precompiled binaries like koboldcpp or docker?
HopePupal@reddit
it takes less than 5 minutes to build either mainline llama.cpp or ik_llama from source. maybe an extra 5 if you have to download the entire CUDA SDK first.
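The build recipe is the same CMake flow as mainline llama.cpp. A minimal sketch for a CUDA build (swap the backend flag for your hardware, e.g. Vulkan or ROCm):

```shell
# Build ik_llama.cpp from source with CUDA support:
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```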
srigi@reddit
It does, but nothing official - https://github.com/Thireus/ik_llama.cpp/releases
Always be careful when running alien code on your machine
Ill_Evidence_5833@reddit
Sharing your command would be amazing
No-Anchovies@reddit
Similar spec - finally managed to install qwen 3.6, it's significantly faster than the 30B 3.5!
JsThiago5@reddit
Can you share the parameters and the card model?