LLM speed t/s
Posted by Lost-Health-8675@reddit | LocalLLaMA | 44 comments
All I see is "it gives me ** t/s bla bla bla" together with q4, q3... Even when chatting with Qwen3.6 the other day (q8) about the best llama.cpp command for my use case, it suggested going with q4 for better speed (it runs at over 40 t/s most of the time).
What I would like to know is: are you really trading knowledge and reliability for speed?
I would always rather have it work 2x longer for better output than retry and debug - which with lower quants adds up to more time than q8 getting it right on the first or second try.
ea_man@reddit
Then you should run 27B my friend.
Lost-Health-8675@reddit (OP)
Watched it yesterday with 90k loading
ea_man@reddit
I see, it took you the whole day? /s :P
That's why we use lower quants + MoE + small models = it saves time.
Imagine waiting that whole day each time you hit /new.
> are you really trading knowledge and reliability for speed?
Lost-Health-8675@reddit (OP)
Time I have, nerve for debugging I have not :D
ea_man@reddit
That's the spirit mate, you let that prompt run the whole night.
Yet some of us have to see things through by the light of day, now you know what's the deal there :)
Lost-Health-8675@reddit (OP)
hahaha, I look at it as soon as I wake up - trust me, before coffee :D
but jokes aside I get what you mean
ea_man@reddit
Ye ye mate, we're fine, I also run Qwen 27B all the time because if I can have just one model on my GPU it's gonna be the baddest ass ;)
I mean that's also why I'm not using Qwencode / Opencode with starting 12k prompt anymore :P
Lost-Health-8675@reddit (OP)
It took me about an hour
FullstackSensei@reddit
Running 3.6 35B Q8_K_XL with up to 180k context without issues (configured for 256k), time and time again over the past week.
I have it running a long agentic task (document an entire project, one component at a time) for over 12 hours. About half a million output tokens so far. Hit 150k context on quite a few sub-tasks. Still going strong, zero interventions. Checked about half a dozen output docs (of almost 40 written so far) and they match what's in the code.
suicidaleggroll@reddit
Yes it's a tradeoff between intelligence and speed, but you're making the wrong comparison. You shouldn't be comparing a model at Q8 vs the same model at Q4; that's only useful for determining whether the Q4 is functioning properly or something is wrong with it. You should be comparing a model at Q8 versus a model twice its size at Q4. A 60B Q4 will wipe the floor with a 30B Q8 every time, all else being equal. When you drop below Q4 things start to get a little hairy, but Q4 is a good compromise.
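The footprint arithmetic behind this comparison can be sketched like so. This is a rough back-of-the-envelope estimate, not an exact GGUF size: the bits-per-weight figures (4.5 and 8.5) are assumed averages, since real K-quants carry some overhead per weight.

```python
# Rough model footprint: params * average_bits_per_weight / 8.
# The bit-widths below are assumptions; real Q4/Q8 GGUF quants
# average slightly different effective bits per weight.
def approx_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate model size in GB (decimal) for params_b billions of weights."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

print(approx_size_gb(60, 4.5))  # 60B at ~Q4 -> 33.75 GB
print(approx_size_gb(30, 8.5))  # 30B at ~Q8 -> 31.875 GB
```

Which is the point of the comparison: the 60B Q4 and the 30B Q8 occupy roughly the same memory, so they compete for the same hardware budget.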
Lost-Health-8675@reddit (OP)
I got that, somebody already explained it to me earlier in the comments :)
but I'm talking strictly about the same model because (until 45 minutes ago) there wasn't a better version out. There was Qwen3.5 122b, but it was said 3.6-35b-a3b was better.
waiting for the 27b GGUF later today :)
LionStrange493@reddit
yeah the “just a tradeoff” answer is a bit misleading tbh
in practice it’s not just speed vs quality, lower quants tend to get way less stable, so you end up rerunning prompts or getting weird outputs
so even if q4 is faster per run, total time can actually be worse if you care about consistency
i’ve found this shows up more when prompts get longer or slightly complex — are you seeing that too or mostly with simple stuff?
Visual_Internal_6312@reddit
You're usually not building a house. So reruns, exploration and iteration beat trying to zero-shot. Besides, a lot of accuracy comes from prompting and agent/skill files, where the lower quants benefit even more than the higher ones.
LionStrange493@reddit
that’s fair,
i’ve seen it get shaky once you add tools or longer chains, lower quants just start acting weird
are you mostly running simple prompts or more agent-type flows?
Lost-Health-8675@reddit (OP)
More agent type flows, I have a workflow that runs on and off for the last 2 days :)
LionStrange493@reddit
ah yeah that’s where it usually starts getting weird
long running + agents seems to amplify all the small inconsistencies over time
are you doing any retries / checks or just letting it run as is?
Lost-Health-8675@reddit (OP)
It is going so long because I like to check everything before moving forward. I always try to leave the big chunks working during my work or sleep hours and then I go through it. I must say, there aren't many retries with this model, it stays right most of the time.
LionStrange493@reddit
ah got it, makes sense if you’re reviewing it in chunks like that
feels fine until you try to rely on it without checking everything
have you had any cases where it looked right at first but broke something subtle later?
Lost-Health-8675@reddit (OP)
Yes, not on such a big project, but it was a pain in the ass even so :)
LionStrange493@reddit
yeah those are the worst, do you just debug it manually or have any way to trace it?
Lost-Health-8675@reddit (OP)
manually
FullstackSensei@reddit
Having tried both side by side, Q8 understands intent a lot more often than Q4 of the same model, even with >100B models. I rarely need to refine my prompt when using a model in Q8, but find myself having to edit/tweak the prompt over and over when I run Q4.
The difference is so noticeable on more complex tasks that I just don't load Q4 anymore, even on something like minimax and even when I can run Q4 fully in VRAM at more than 3x the speed vs CPU+GPU. With Q8, I can fire the task and leave the room. No need to babysit.
Lost-Health-8675@reddit (OP)
exactly, that is my worry, that it will lose consistency as we go past a certain point.
nickm_27@reddit
Not everyone uses LLMs only for coding, it depends on your use case. I use an LLM for a voice assistant, so speed is of equal importance to intelligence. If I am waiting 10 seconds for the weather forecast, it becomes useless.
mtmttuan@reddit
It's just a trade off. Q4 is generally considered good enough. Q8 is twice as large and twice as slow so in many cases it's not an option.
Also I believe Q4 of a model twice as big will be better than Q8 of the smaller model?
FullstackSensei@reddit
For smaller models (<100B), Q4 is only good enough if you never compare with the output of the same model in Q8 on the same task.
Lost-Health-8675@reddit (OP)
Ok, that was my thought too, but I don't know. If a model has 120b and it is in q4, is it in fact better? When you look at it logically, a model that knows less but knows how to use its knowledge is smarter than a model that knows more but doesn't know how to use it properly... just a thought.
mtmttuan@reddit
Iirc data proves that your thought is not correct, at least for smaller models.
FullstackSensei@reddit
Do you have any sources for said data? On what type of tasks?
Lost-Health-8675@reddit (OP)
Thanks
Herr_Drosselmeyer@reddit
What you lose through quants isn't so much knowledge as it is depth, and it's not negligible either. A weight in 4 bits can have 16 different states, one in 8 bits can have 256, and one in 16 bits 65,536, though that's currently considered overkill. The weaker relationships between concepts can get lost in low quants and the output is more 'coarse', for lack of a better word.
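The state counts above follow directly from the bit-width: an n-bit weight can take 2^n distinct values.

```python
# Distinct representable states per weight at each bit-width.
for bits in (4, 8, 16):
    print(f"{bits}-bit: {2**bits} states")
# 4-bit: 16 states
# 8-bit: 256 states
# 16-bit: 65536 states
```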
Lost-Health-8675@reddit (OP)
Thank you
sleepy_quant@reddit
Did the swap a few days ago, Q4 to Q8 on Qwen 3.6 35B, M1 Max 64GB. Went from 50 to 35 t/s but retry rate on my eval flow dropped a lot. Your 2x longer math holds when the quality gap actually blocks workflow. Quick chats Q4 fine. Stuff where I'd have to dig 300 lines to find a bug, Q8 pays for itself. What's your main use case, chat or longer structured stuff?
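The "Q8 pays for itself" claim can be made concrete with a small expected-time sketch. Only the 50 and 35 t/s figures come from the comment above; the retry rates and token count are made-up numbers for illustration, and the model assumes each run independently fails with a fixed probability.

```python
# Hypothetical break-even sketch: retry rates and token count are
# assumptions, only the 50 vs 35 t/s speeds come from the comment.
def expected_time_s(tokens: int, tps: float, retry_rate: float) -> float:
    """Expected wall time if each run independently fails with retry_rate
    (expected attempts follow a geometric distribution: 1 / (1 - p))."""
    attempts = 1 / (1 - retry_rate)
    return attempts * tokens / tps

q4 = expected_time_s(10_000, 50, 0.4)  # faster per run, retries often
q8 = expected_time_s(10_000, 35, 0.1)  # slower per run, rarely retries
print(round(q4), round(q8))  # 333 317
```

Under these assumed retry rates the slower Q8 comes out slightly ahead in total wall time, which is the OP's "2x longer math" in miniature: the break-even point depends entirely on how often the lower quant forces a rerun.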
Lost-Health-8675@reddit (OP)
exactly the same swap I made
dero_name@reddit
> What would I like to know, are you really trading knowledge and reliability for speed?
Of course. I'm maximizing the total utility of the model, not its capability at all costs.
- Qwen 3.6 A3B Q4 -> 150 tps (fits 24 GB VRAM of my 7900 XTX)
- Qwen 3.6 A3B Q8 -> 45 tps (spills over to RAM)
The lower quant is ~90% as capable as Q8. I mostly use it as a quick coding assistant to set up projects, write smaller scripts and personal apps. Q4 is totally adequate for these use cases. Why would I choose to run the model three times slower?
Puzzleheaded-Drama-8@reddit
How do you get that high tps with a 7900 XTX? I only get 85 t/s on ROCm and 115 t/s on Vulkan with q4_k_s
dero_name@reddit
Yeah, seems a bit low. Interesting.
I'm on Windows, Vulkan backend. Here are my exact llama-server params:
The only thing I did was overclock the VRAM; it gave me +5 tps, nothing major.
screenslaver5963@reddit
Yes, as the quants get smaller, the models get stupider. If you can fit the higher quant models and don't mind the slight speed decrease, go ahead, though 4-bit is usually good enough. But if increasing the quant means the model no longer fully fits, it's a massive drop in speed rather than a smallish difference.
audioen@reddit
Yes, you are trading reliability for speed. q8 is maybe too high quality, and q4 is probably too low quality, so I personally strike the balance at q6_k_xl. Look up unsloth's various graphs for the Qwen3.5 models to see why I chose that point: it sits right at the knee where a bigger quant isn't much better anymore.
AdamDhahabi@reddit
I was surprised to see Q6_K runs at the exact same speed as Q8, tested with Qwen 3.6 35b
hurdurdur7@reddit
Q5_K_M and Q6_K are often overlooked.
glad-k@reddit
Q4 and Q5 are generally the way; get more B (the max that fits your VRAM/unified RAM), outside of MoEs, if you care mostly about result quality and less about speed.
DeltaSqueezer@reddit
We have limited VRAM and FLOPS so we need to make a compromise somewhere considering speed, intelligence and context.
I surprised myself when I went with unquantized Qwen3.5-9B trading off intelligence for processing speed and longer context.
Hot-Employ-3399@reddit
Depends on task. If it's for gooning, then whatever.
If it's coding, then tasks/second means more than tokens/second, and MoEs are worse here, as the solutions they provide often don't work.