Looking for Suggestions — Single 5090 & 64gb DDR5

uti24@reddit

>I am planning on running Qwen 3.6 27b NVFP4 via vLLM on my 5090 but was wondering if something like 35b a3b at Q8 on Llama would produce better results for agentic coding and utilize the system memory. My research says no but if that’s the case what would yall do to utilize the system memory?

Potentially, yes. Qwen3.6 35B Q8 behaves much better than Qwen3.6 27B Q4 for my cases. It loops less and also fails tool calls less. So Q8 might be better than NVFP4 as well. I mean, in one-shots 27B Q4 feels better: smarter, able to handle complicated things. But under multi-turn agentic load 35B Q8 clearly wins for me.

Last_Mastod0n@reddit

Unsloth q6 UD is the sweet spot imo

icedgz@reddit (OP)

how much context are you fitting w/ q6 UD? With q8 KV cache seems like you wouldn't get much?

Last_Mastod0n@reddit

Just 8k context, because I use it for a personal project, not coding. I don't use any cache quantization, but I'm not opposed to 8-bit if I have to.

Anbeeld@reddit

>fails tool calls less

Do you run them with the same cache quantization?

LA_rent_Aficionado@reddit

General rule of thumb, don’t quant kv cache unless absolutely necessary - I don’t consider it, I’ll downgrade weights before touching it

Anbeeld@reddit

I prefer benchmarks and usage experience over rules of thumb.

LA_rent_Aficionado@reddit

Well, fortunately for you, there are plenty of benchmarks and user experiences out there that laid the foundation for this rule of thumb. LLMs are mathematical text prediction engines: more rounding over longer sequences of math will result in more error/deviation at next-token prediction.

There are plenty of users who say they have not had any problems with Q8 KV cache, and benchmarks that show the difference is not noticeable on SWE bench (very low context problems). But the general consensus out there is that as context grows, quantized KV cache has more holes. It's no different from quantizing weights: mathematically it will impact accuracy. How much, and what its impact is, largely depends on use case. I am speaking in broad terms because there's no way of knowing which model/task/quant combination any user is contemplating.
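
The rounding argument can be made concrete with a small sketch. Assumptions: this mimics a q8_0-style scheme (blocks of 32 values sharing one scale, signed 8-bit integers) in plain Python for illustration only. It is not llama.cpp's actual kernel, and the random values are a stand-in for real K/V activations.

```python
import random

BLOCK = 32  # assumed block size for a q8_0-style layout


def quantize_block(values):
    """Quantize one block to signed 8-bit integers plus a shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale, quants):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quants]


def roundtrip_rms_error(values):
    """RMS error introduced by quantizing and dequantizing block by block."""
    total = 0.0
    for i in range(0, len(values), BLOCK):
        block = values[i:i + BLOCK]
        scale, quants = quantize_block(block)
        restored = dequantize_block(scale, quants)
        total += sum((a - b) ** 2 for a, b in zip(block, restored))
    return (total / len(values)) ** 0.5


random.seed(0)
fake_kv = [random.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in data, not real activations
print(f"RMS round-trip error: {roundtrip_rms_error(fake_kv):.5f}")
```

The per-element error is small but never zero. Whether it adds up to something noticeable over long contexts is the part that depends on model and task.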

Anbeeld@reddit

Bro, you really should stop talking down like it's the first time I've heard of LLMs. I maintain a llama.cpp fork, so I do plenty of benchmarking and testing.

>It's no different from quantizing weights, mathematically it will impact accuracy.

Which is exactly why these "rules of thumb" defined by idk who bother me so much. You have one VRAM budget to split between 2 entities, and you do Q4 + bf16? What sense does that make?
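
The "one VRAM budget, two consumers" point can be sketched as arithmetic. Everything numeric here is an assumption: the layer count, KV head count, and head dimension are hypothetical placeholders rather than the real Qwen configuration, the bits-per-weight figures are rough, and the formula assumes a plain transformer in which every layer keeps a full K and V cache.

```python
GIB = 1024 ** 3


def weights_gib(n_params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bits_per_element):
    """Approximate K+V cache size in GiB for a standard attention stack."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens  # 2 = K and V
    return elements * bits_per_element / 8 / GIB


# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=64, n_kv_heads=8, head_dim=128)

# Compare "low-bit weights + 16-bit cache" against "8-bit-ish weights + 8-bit-ish cache".
for weight_bits, kv_bits in [(4.5, 16), (8.5, 8.5)]:
    w = weights_gib(27, weight_bits)
    kv = kv_cache_gib(context_tokens=131072, bits_per_element=kv_bits, **shape)
    print(f"weights {weight_bits} bpw + KV {kv_bits} bit: {w:.1f} + {kv:.1f} = {w + kv:.1f} GiB")
```

Plugging in the real shape of whichever model you run shows how much each lever actually frees up on a 32GB card.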

LA_rent_Aficionado@reddit

I say rule of thumb because agentic flows and coding seem to make up something like 90% of the conversations here, and these flows are often long-context, multi-turn, etc. That is where KV cache quantization really shows its weaknesses. I've read enough anecdotal evidence on here, and seen enough benches and papers showing fine quality degradation at longer sequences due to KV cache quantization, to know it's the last lever I touch when tuning.

You're right though, and apologies if my comment seemed condescending. It's not absolute, and I appreciate the nuance that no config is one-size-fits-all and everything has its tradeoff. As you state, you need to find the ideal balance between both levers in your desired workflow, within your VRAM and context budget.

I would say Q4 + bf16 makes perfect sense in many flows, certainly more than the inverse. I wouldn't run Q4 for coding though (except perhaps on a 200B+ model), but it works well in OCR/text-based flows with lower-parameter models. I certainly wouldn't advocate running anything below a higher-parameter Q3 model, but that's just me. I personally don't bother, but I recognize that my VRAM budget changes my perspective.

uti24@reddit

Even without cache quantization.

BitGreen1270@reddit

I'm getting around 100 t/s with Qwen 27B with MTP on my 5090. Minimal ram usage for low context conversations.

RMK137@reddit

I take a different approach: I am okay with many iterations with the 35B MoE even if it may be "dumber" than the 27b dense version, as it is significantly faster. I basically never expect a solution on the first pass; even if the first solution looks good, I always make the model do another 1-2 polish passes at minimum.

5090 + unsloth-Q4_K_XL, KV at q8 and 131072 context. I can get up to 192k context or even more, but the GPU also drives the display so I leave some buffer. Most of the time I do one session for exploration and planning, clear context, then a fresh session for implementation.
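
A setup like the one described above could be launched roughly as follows. This sketch only builds the argument list for llama.cpp's `llama-server`. The model path is a hypothetical placeholder, and the flag spellings (`--ctx-size`, `--cache-type-k`, `--cache-type-v`, `--n-gpu-layers`) are as I understand them from llama.cpp's help output, so check them against your build. A quantized V cache may also require flash attention to be enabled, depending on version.

```python
import shlex


def llama_server_args(model_path, ctx_size=131072, kv_type="q8_0", gpu_layers=99):
    """Build a llama-server command line with a quantized KV cache."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--cache-type-k", kv_type,
        "--cache-type-v", kv_type,
        "--n-gpu-layers", str(gpu_layers),  # a large value keeps all layers on the GPU
    ]


# Hypothetical file name; substitute whatever GGUF you actually downloaded.
args = llama_server_args("models/qwen-35b-a3b-Q4_K_XL.gguf")
print(shlex.join(args))
```

Lowering `ctx_size` is the simplest way to leave headroom when the same GPU also drives a display.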

amberdrake@reddit

Nah. Stick with 27b. The 35b has worse coding performance. I say just run the jackrong qwopus

PermanentLiminality@reddit

For me it is about speed, and I don't own a 5090. My ten-year-old pair of P40 GPUs only gives me high single-digit tokens/s on the 27B, but I get 45 tk/s with the 35B model. This is without dflash or mtp. I need to give those a try. I only run the 27b when the work is happening offline, without me sitting there waiting for output. With a 5090 you should get good speed with the 27b model, and the 35b will truly fly into the three digits. Hopefully we can get some 40b to 120b models that are better at coding than the qwen3.6 family we have now.

Current_Ferret_4981@reddit

Nope 27B is going to be better. Plenty fast and better performance.

ecl_55@reddit

I frequently read in this sub that people go for lower quants on 5090, is there a reason for that? 5090 has 32GB VRAM so Q6 works really well there, so why go lower?

Current_Ferret_4981@reddit

Q5 leaves more room for context. Q6 is tight tight. Q4 is just because everyone standardizes on Q4

sword-in-stone@reddit

5090, 64GB RAM, same setup: 27b q4 with MTP on llama.cpp is superior to the q6 MoE. Didn't try q8.

Worldly-Plastic-2516@reddit

Have you tried q5 or q6 on the 27b?

ecl_55@reddit

I've had good experiences with qwen-27b-q8 via llama.cpp with KV-Cache q8/q8 on my 5090. Leaves enough room for 160k context and still going quite fast. Tried MTP, but the overhead means either lower model quantization or 50k less context, so passed on that.

wizoneway@reddit

I've been driving a 5090 with q6 @ q8/q8 with great results, and also dropped MTP due to overhead. With 3k pp and 50ish tg it feels good, and quality on tool calling and code gen has been great.

ProfessionalSpend589@reddit

> would yall do to utilize the system memory

You could have more concurrent users. Or run a smaller MoE model in parallel for simple tasks (or just for fun).

When I played with ComfyUI it ran on the GPU, but during the exporting phase it used a lot of RAM (almost 90GB). I can’t say whether I had misconfigured something, but RAM will be utilised most of the time.

When I tried opencode for the first time, I launched a VM with 8GB of RAM and learned it inside that before I installed it on my Raspberry Pi.

fasti-au@reddit

Just buy 4 arc b70

icedgz@reddit (OP)

Thanks for this terrible suggestion

pand5461@reddit

qwen 3.5 122b @ iq4_nl with -ncmoe 39

romrick4@reddit

From what I read Qwen 3.6 27B benched pretty damn close to Opus. That would run pretty nice on a single 5090. And like others said don’t fall back to system memory, VRAM only

Qwen_os_has_died@reddit

Stick with the 27b, try different quants.

FullstackSensei@reddit

Your research is wrong. Q8 performs way better than Q4 on both models, and while 35B isn't bad, 27B is quite a bit better.

looselyhuman@reddit

Consider this 27b: https://huggingface.co/rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm NVFP4 is smaller and faster, but you don't need that at 32GB unless you plan to have 200k+ (not recommended for model sanity) or multiple big context windows. Q5-Q6 was my 5090 sweet spot, and this one is doing really well. KV cache supports a 128k context window with plenty of room to spare, but not so much it feels wasted.

grabber4321@reddit

You don't want to spill into system memory; it slows the whole process down. Look into DFLASH and MTP and get them running on vLLM. 35B is dumber, just use 27B.

Top_Training5738@reddit

For agentic coding I’d honestly stay with the 27B running fully in VRAM over trying to squeeze larger Q8 models partly into system RAM. Once you spill heavily into DDR5 the latency hit starts hurting the whole “agent feels responsive” experience. If you want to actually use that 64GB RAM well, I’d probably use it for huge context windows, RAG/vector DBs, caching, parallel agents, or running supporting models instead of offloading the main model itself. A fast smaller model fully on GPU usually feels smarter in practice than a giant sluggish setup.
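
The latency argument can be approximated with a bandwidth-bound estimate: each generated token has to read the active weights once, so tokens per second is at most memory bandwidth divided by bytes read per token. The bandwidth figures and the 4.5 bits per weight below are rough assumptions for illustration. The estimate ignores KV cache reads, compute, and overhead, so real numbers will be lower.

```python
def max_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed when generation is memory-bandwidth bound."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumed bandwidths in GB/s: roughly a 5090's VRAM vs. dual-channel DDR5.
VRAM_GB_S = 1792
DDR5_GB_S = 90

for name, active_b in [("27B dense", 27), ("35B MoE, ~3B active", 3)]:
    for where, bw in [("VRAM", VRAM_GB_S), ("DDR5", DDR5_GB_S)]:
        tps = max_tokens_per_second(active_b, 4.5, bw)
        print(f"{name:>20} in {where}: at most ~{tps:.0f} tok/s")
```

The gap between the VRAM and DDR5 rows is the reason to keep the main model fully on the GPU, and the gap between dense and MoE rows is why the 35B a3b feels so much faster.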
