Been using Qwen-3.6-27B-q8_k_xl + VSCode + RTX 6000 Pro As Daily Driver
Posted by Demonicated@reddit | LocalLLaMA | View on Reddit | 178 comments
So in response to the Great Token Reckoning of 2026, I decided to try out Qwen 3.6 as a daily driver, and although it's only been about a day, I have to say I'm thoroughly impressed.
I had to download the VSCode Insiders edition and set up local model support (super easy). Then I messed around with Gemma 4 and Qwen 3.6 (served with LM Studio) while performing typical tasks as I build out an app that does a lot of data mining and web scraping.
After trying out all the versions of the two models with the different quants, there is a clear winner: Qwen-3.6-27B-q8_k_xl by Unsloth.
I AM SO IMPRESSED! The token generation can be a tad slow, but the truth is I was seeing long delays even when I was using GitHub Copilot hosted models. It felt about the same speed-wise overall, maybe a touch slower than hosted. But what's impressive is that with appropriate tool calling, this little dense model can hold its own just fine.
To be clear, I don't think it can work at the feature level like Opus 4.6 could. You can't just say "Hey, implement this feature"; vibe coders and non-coders most likely won't survive with this. There were a few times where I had to steer it to improve its code quality and approach, but functionally it was nailing it.
If you always do a Plan round first and really work out all the details, it will get there and then implement without issue. If you have a decent grasp of systems architecture, this perfectly hits that "good enough" status for a local model. I have been plugging away all day and haven't used a single API token.
Now I need another RTX 6000 so I'm not fighting with my agents for compute.
BitXorBit@reddit
Could you share some numbers? Prompt processing speed and tokens generation speed
Demonicated@reddit (OP)
In LM Studio it's about 37-ish tokens a second. Fast enough to work with.
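For anyone who wants to check their own numbers, here's a rough way to measure generation speed against any local OpenAI-compatible server (LM Studio defaults to port 1234; the model name is whatever your server lists, so treat both as placeholders):

```python
# Rough generation-speed check against a local OpenAI-compatible server.
# LM Studio's default endpoint is http://localhost:1234/v1; the model name
# below is a placeholder for whatever your server lists.
import time
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

start, chunks = time.time(), 0
stream = client.chat.completions.create(
    model="qwen-3.6-27b",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # one chunk is roughly one token
# Timing includes prompt processing, so this slightly understates pure TG speed.
print(f"~{chunks / (time.time() - start):.1f} tok/s")
```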
mxmumtuna@reddit
You need to be using sglang or vLLM with that 6000. It's significantly faster due to MTP support.
foodobaggins@reddit
is this also true on a 5090?
mxmumtuna@reddit
Except for the 122B bit, yes. 5090 and 6000 are almost identical except a few extra cores and the extra RAM on the 6k.
Gazorpazorp1@reddit
I found the overhead to be too large to make vLLM practical on an RTX 5090. With a Q4 20GB model I get max 32k context unless I quantize the cache aggressively; meanwhile llama.cpp gives me 115k of KV cache at Q_6_XL.
Seems 32GB is just not enough to use vLLM effectively with an RTX 5090. Does SGLang behave better in that respect, or is it best to stick with llama.cpp (single user, non-concurrent)?
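For a rough sense of why 32GB runs out: the KV cache grows linearly with context. A back-of-envelope calculation with hypothetical model dimensions (read the real values from your model's config):

```python
# Back-of-envelope KV-cache sizing with HYPOTHETICAL dimensions -- read the
# real values (layers, KV heads, head dim) from your model's config.json.
n_layers, n_kv_heads, head_dim = 60, 8, 128  # assumed, not Qwen's actual config
kv_bytes = 2                                 # fp16/bf16 cache; ~1 for q8, ~0.5 for q4

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes  # K and V
for ctx in (32_768, 115_000):
    print(f"{ctx:>7} tokens -> {ctx * bytes_per_token / 1024**3:5.1f} GB of KV cache")
# ~0.23 MB/token here: after ~20 GB of weights on a 32 GB card, an fp16 cache
# caps out around 32k context, while a quantized cache stretches much further.
```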
aaronr_90@reddit
Can you get the 122B NVFP4 model to fit on a single RTX 6000 Pro?
Nepherpitu@reddit
Yes, it fits on 4x 3090s with 96GB VRAM and bigger overhead. Full context, very fast.
drbanan@reddit
vLLM? tensor parallel? how fast is it?
Nepherpitu@reddit
vLLM, TP=4: 5K t/s prompt processing, 140 t/s generation @ 128K context.
drbanan@reddit
Can you check how fast it is on 1x 3090? And what are your specs; is it all linked at PCIe x16? I'm building a new rig with 2x 3090 and wonder if TP helps a lot; there's a lot of misinformation about the speed bump.
Nepherpitu@reddit
There's no way to fit a 122B model on a single 3090. You need at least four or eight cards. llama.cpp doesn't have tensor parallelism; only vLLM and sglang have it.
drbanan@reddit
I mean, what about other models? How big is the speedup between 1 and 2 or 4 cards? Does it scale linearly? And you have them in PCIe x16 slots on a Threadripper/EPYC? I'm going for vLLM.
Nepherpitu@reddit
Don't know, really; I didn't bother to test smaller models than what I can fit. The only metric I have is Qwen 3.6 27B going from 4x (PCIe 4.0 x16) to 4x (4.0 x4) + 2x (3.0 x8) + 1x (4.0 x16) + 1x (3.0 x16).
On tp=4 it was 70-150 tps with MTP=6, and on tp=8 it was also 70-150 tps with MTP=6 for a single user. But on TP=8 it IS limited by PCIe bandwidth, because a few cards are on 4.0 x4.
I'm still waiting for better risers and x8 splitters.
I'm using an EPYC 7702 with four PCIe 4.0 x16 slots with bifurcation.
drbanan@reddit
tp=4 and tp=8 give the same tps?
Nepherpitu@reddit
Yes. But it's VERY different TP. tp=4 is full PCIe bandwidth with P2P and patched vLLM; tp=8 is stock vLLM with custom all-reduce disabled and the PCIe bandwidth saturated. For example, at tp=4 each GPU consumes 230W; at tp=8 it barely hits 180W.
Perfect-Flounder7856@reddit
Shit it says two on the github linked. Hm
mxmumtuna@reddit
It's 1. Just not as popular. Ask in the Discord; we have setups. It's basically the same though, just run it single.
mxmumtuna@reddit
Definitely, and at full context.
l0nedigit@reddit
Definitely full. I'm running full FP8 with 30GB of space for KV. Still plenty of room for another model (roughly 30GB).
mxmumtuna@reddit
Not 122b at FP8 🤣
l0nedigit@reddit
Haha oh definitely not. 27b.
Demonicated@reddit (OP)
I keep getting told this, and the only reason I haven't yet (besides not enough minutes in the day) is that I'm running on Windows. I downloaded the Docker image but then was having issues mounting my drive with all my models.
Maybe it's time to bite the bullet and just get the container running....
l0nedigit@reddit
I can send ya the vllm docker compose I'm using if you want. I'm running it on a max-q.
Demonicated@reddit (OP)
Yes please! I'll try it out now
Norwood_Reaper_@reddit
As an FYI, I have this card, and performance between vLLM, sglang, etc. is the same as LM Studio until you move to 4 or more concurrent requests; then vLLM and sglang are better. So if it's a single-user use case, especially on Windows, the ease of using something like LM Studio outweighs any negative IMO.
mxmumtuna@reddit
That's not true at all. MTP alone makes a huge difference. Also, if you're using agents, it's almost a non-starter with the llama.cpp ecosystem.
Norwood_Reaper_@reddit
I ran a huge battery of tests and benchmarks. Performance is similar in my use case until you hit 4+ concurrent requests, then you take a 20-30% haircut.
mxmumtuna@reddit
With MTP enabled?
Demonicated@reddit (OP)
So that's what I had read, which is why I back-burnered the idea.
But then I see people claiming that with the correct configs you can get almost 2x token throughput, so I don't know what to believe. The internet is wild with claims lol.
Danmoreng@reddit
Under Windows you could at least use llama.cpp directly and build it from source. I've got a repository with PowerShell scripts for easy installation, including all dependencies needed to build it: https://github.com/Danmoreng/local-qwen3-coder-env
Since I only have a 5080 with 16GB VRAM, my Qwen3.6 27B script defaults to IQ4_XS (12GB) so it fits in my VRAM.
fasti-au@reddit
Turbo quant it
fasti-au@reddit
Bind a separate model location and use a direct mount. WSL is the Windows way. You can move Ubuntu to another place and symlink, as long as the drivers see the drive on boot. I use a Proxmox Windows laptop for small dev work that way, but it's best to have an SSD; booting and running inside Hyper-V is still faster on my spec.
ArtfulGenie69@reddit
If you can afford a 6000 Pro, you may want to think about a 250-500GB extra drive for Linux. It's worth it to start learning. You just have to get the tool running and it will help you the rest of the way. I like Linux Mint Cinnamon.
Also, on my 4x 3090s linked through 2.5GbE using RPC in llama.cpp, I'm running Qwen3.5 122B without MTP or anything else at 55 t/s generation and 800 t/s prompt processing. Just imagine what that 6000 Pro could pull with vLLM and that NVFP4. You could make a special quant, because you'll have a lot of VRAM left over: leave the attention and other non-expert layers, like MTP, at full bf16 or fp8 or something. It would be ripping fast with MTP and vLLM.
I have to figure out Ray so I can try something better on my system; probably get a bit of a boost over llama.cpp and RPC. Make sure to crank your tensor parallel up to like 8 or 16 as well. llama.cpp also has --parallel and --cont-batching, but when you use it, it chops your context into 1/2 or 1/4 depending on the parallelism, unlike vLLM. So you could get a big speed boost depending on the context the model is already handling, whereas with vLLM it is just a speed boost.
Demonicated@reddit (OP)
It's definitely on the to-do list. Just to give a glimpse: I have a day job, 2 side businesses, 2 rental units and 2 boys, so I'm maxing out my 1440 minutes a day.
I already have a Linux box that's set up as a NAS; it's just a matter of giving it priority. Also, that box is DDR4. My inference machine is my daily dev machine. I need to build another box, it's just painful to build anything these days. Spending so much on RAM and PCIe 5 storage just makes my soul hurt lol
Perfect-Flounder7856@reddit
I did the same, almost to a T. I tried setting up vLLM and quantizing my own model; it was painfully slow. I am going to bench the bf16 when I get home. For a non-tech CEO I feel like I'm doing pretty well. And shit, if a 3.6 122B dropped tomorrow... I'd buy a second 6000 💯
But yes, absolutely get on Linux. And I'm on the Discord. It's... a lot, but I'm sure it's great. At one point I was new here to Reddit and LocalLLaMA; now I post and comment quite a bit. I've had the 6000 for a week now and I'm working on setting it up with agents. Did a lot of benching (not on agents yet), and Qwen3.6-35B-A3B edged out 27B for my use case on policy reasoning and email/file note drafting.
Perfect-Flounder7856@reddit
Oh, and I'm gimping the card running Q8 GGUF on LM Studio. But LM Link and LM Mini are pretty fucking great. Getting 45 tok/sec dense and 200 MoE, so I can't complain as a single user, but I will eventually need to set up for concurrent usage.
Oh, and I benched 3.5 9B as a Retell agent model, so I'm working on setting that up as well!
mxmumtuna@reddit
Those would be about double in vLLM/sglang: about 80/85 for the 27B and 300+ for the 35B. Just FYI. Even faster with multiple users/agentic use.
Perfect-Flounder7856@reddit
Jesus! Yeah, I tried to do a Q8 with Claude's help, but it thought I needed MXFP8, and that's got to be completely wrong. I just couldn't find any data or anything on quantizing NVFP8.
mxmumtuna@reddit
Claude was doing you dirty.
Perfect-Flounder7856@reddit
I know, I'm just too new for it. Is there anything on GitHub about quantizing for NVFP8 that you know of?
sautdepage@reddit
NVFP8 doesn't exist, and llama.cpp supports Q8 but not FP8; vLLM/sglang support FP8. And you can download all variants of these from Hugging Face already; no need to quantize anything yourself.
Perfect-Flounder7856@reddit
Perfect, thank you. I guess I'd seen two people smarter than me refer to NVFP8, and it got me confused. FP8 only!
Perfect-Flounder7856@reddit
Running some BF16 benchmarks on my use case prompts and it's crushing GGUF Q8
mxmumtuna@reddit
Forget all of that. NVFP4 or AWQ/Int4 is the quantization of choice for vLLM or sglang. FP8 is FP8, and most labs drop both BF16 and FP8 weights, Qwen included.
Perfect-Flounder7856@reddit
🙏
mxmumtuna@reddit
You're just seriously gimping that card by running llama.cpp and friends. Join the Discord mentioned below; we can help you work it out, though most of us run Linux.
aeroumbria@reddit
I've always heard that 1) vLLM does very little for single-user use cases and is mainly for parallel tasks, and 2) vLLM tensor parallel requires full-speed PCIe to work as expected, and a 16/4 or even 8/8 setup will be gimped. How much of this is still true / no longer true / never true?
mxmumtuna@reddit
Mostly just not true. The primary reason to use the llama.cpp ecosystem is the heterogeneous inference support: different GPUs, GPU/CPU hybrid, etc. It runs on anything and everything. You trade performance and scalability for that flexibility.
mythikal03@reddit
KV cache juggling on vLLM ends up driving me away. I have a mountain of unused DDR5, and llama.cpp's 'cache-ram' gives me an insurance policy against random scheduled jobs blowing up my main KV cache. With that backstop I can usually fit an extra model, too.
On vLLM I hit 106 tok/sec with qwen36-27b-fp8, about double llama.cpp. I just wish vLLM leveraged my CPU RAM or sessions better.
fasti-au@reddit
llama.cpp is the go-to right now on Ampere. The turbo quant for vLLM isn't quite done unless it's been overnighted. I think the TomTom repo is the middle ground. Mine's in both.
Blues520@reddit
Would the original FP8 run on vLLM with dual 3090's?
mxmumtuna@reddit
Should be fine for 27B yeah.
shifty21@reddit
Is it safe to assume the Qwen3.5 vLLM settings are the same for Qwen3.6? I have a Pro 6000 and vLLM nightly builds.
mxmumtuna@reddit
Here are some of the docker builds in use.
https://github.com/voipmonitor/rtx6kpro/blob/master/optimization/docker-images.md
May want to join the discord as well. I think the link is in /r/BlackwellPerformance
mxmumtuna@reddit
Yep.
ToInfinityAndAbove@reddit
Jeez, rtx 6000 pro costs 11.5k in Portugal
drbanan@reddit
"pennies", jacket guy would say
FlippyHipp@reddit
By the Great Token Reckoning, are you talking about Anthropic's CLI-use ban on automated agents?
jonnywhatshisface@reddit
I am in the same boat. I have a GHCP subscription and the latest stunt they pulled saw me out of credits entirely for the month from 3 prompts on April 3, so I spent all of April playing with local models and trying to get a decent setup going. I found Qwen3.6 and, well, I am cancelling my GHCP subscription and taking them up on their refund offer.
I've thrown some pretty ridiculous tasks at Qwen3.6 35b A3B. I'm only using the quant 4 version. I've had to nudge it to fix a few things it's implemented here and there, but it always reliably gets it done. I've also paired it with Serena for RAG - which has made it an absolute unstoppable beast thanks to the memories capabilities in Serena. Seriously, this model is unbelievably impressive and punches so far above its weight that it's ridiculous.
It also outperformed Claude Sonnet 4.6 on a task yesterday, which was the final nail in the coffin for my GHCP subscription.
I went through absolute hell getting it stable and working properly, so here are a few tips for anyone who has issues with it.
1) The tool-calling issues are a widely known and often-complained-about topic. I've gotten it 100% reliable with tool calling, and it was much easier than one would think. The model REALLY requires preserve_thinking to be enabled, which does cost a little more RAM up front, but it's disabled by default (no idea why). Make sure it's enabled. If using LM Studio, toggle Preserve Thinking on under the inference options; otherwise, set preserve_thinking = true in your Jinja template.
2) The second issue I ran into with tool calling (looping on tool use even after enabling preserve_thinking) was the most commonly complained-about use case: OpenCode. I saw that 90% of the posts about tool-calling issues revolved around usage with OpenCode, so after monitoring the hell out of the logs, I noticed that every single time the tool call failed, it was at the same exact token-generation count: the model would finish and hand the call off, which would fail with invalid arguments to the tool call and loop. This is because OpenCode enforces a max output token count by default, and it's configurable via your JSON config. I raised the output token count drastically, and no more tool-call failures at all.
3) Do NOT quantize the KV cache with Qwen models. Firstly, the model is quite resistant to it; it isn't needed. You won't save much space at all. I tested this by running a Q4 KV cache, and it only saved about 200MB of memory and hurt performance. The model kept crashing, because the memory overhead of trying to quantize it at higher contexts put enough strain on my GPU that macOS's interactivity-timer watchdog kept killing the model. There's zero need to quantize the KV cache with a Qwen model, and it will only hurt performance.
4) If running on a Mac, be wary of the thermal status. When the GPU clusters reach about 82C, they throttle back. This is enough to cause lag that results in timeouts with the Interactivity Watchdog, and it will kill the model. Grab a Mac fan-control app and set custom points for the fans. Use the GPU sensors as the sources to monitor, and set the low cool temp to 50C and the highest to 80C. The fans will kick in full force at 80C and keep it below 82, and the thermal throttling will stop.
5) Use the GGUF model if running on a Mac. I know it's tempting to go with MLX because, hey, it's supposed to be optimized for Apple. The truth is you ONLY gain performance in token-generation speed, and not by that much: I get 65 tok/s with GGUF, and I believe I clocked about 72 tok/s with the MLX version. The issue, however, is that prompt processing with MLX is WAY slower. The memory is also allocated on demand and in bursts, so after every task finishes you'll see memory drop all the way back down to no usage, and the minute you make a prompt it skyrockets back up. This means the KV cache / token reuse is effectively disabled, and you're re-processing every full prompt with no token reuse. This not only makes prompt processing take longer but, more importantly, spins the GPUs up to the max the entire time, because it's making a metric shit-ton of allocations during prefill. The higher the context gets, the higher the heat gets, and the longer it holds the GPU (far more aggressively than GGUF, at that), so the Interactivity Watchdog kicks in and kills the model. GGUF pre-allocates all of the memory up front, so what you see in use when the model is loaded is what it's going to use. If you see memory creep while using GGUF, it's a different issue: you may have too high a context for the memory bandwidth you have, and while the KV cache is shifting things around it may be slowed down, resulting in memory creep during that process, in which case the model is likely going to be killed by the interactivity timer.
6) Batch size helps with prompt-processing speeds, but too high a batch size holds the GPU for longer stretches during prefill. This in turn increases the risk of the Interactivity Watchdog killing the model. If you have proper thermal control you can get away with a batch size of 2048, for example, but based on my experience so far I'd really recommend not exceeding 512. With 2048 I got much, much faster prompt-processing times (single-digit seconds), but not only did the thermal throttling kick in much faster, the logic seemed to get dumber: it looped more with a higher batch size for some reason. My current sweet spot is 512.
7) Do NOT use Ollama. I had horrific performance with Ollama; seriously, I was about to give up on local models entirely because of it. Also, don't try to use vLLM: the Metal backend is extremely experimental and doesn't work well at all (works amazingly if you're running on Nvidia, though!). Use LM Studio or llama.cpp directly if you're running on a Mac. Also, the beauty of LM Studio is that you can use its gorgeous, easy-to-navigate UI to quickly download models, and they're stored in a format you can point anything else at. Ollama does this chunk storage that feels like container layers, and you can't just point llama.cpp or vLLM at the models; you'll have to re-download them.
chimph@reddit
Great advice, but it's important to state what exact hardware you have, as the thermal constraints don't apply to all Macs. I've not run into the Interactivity Watchdog killing the model, and I run long AI sessions in clamshell mode hooked up to a monitor. If you're on a smaller, older MacBook then that may be an issue.
I'm also having a wonderful time running Qwen3.6 35b A3B Q6 (Unsloth GGUF) in both OpenCode and Hermes. I'm running both at the same time and it's beautiful to watch.
1) I've for some reason not needed this; thinking has come through in OpenCode and Hermes no problem. I think Unsloth may have it baked in. No bad thing to enforce it, though.
2) Absolutely. I think it defaults to 4096. I have mine at 16384, which is plenty.
3) Hmm... you don't state what context size you set, but it sounds like you must be running a very small context to see only that amount of savings. "The model kept crashing because of the memory overhead to deal with trying to quantize"... again, this is very hardware-dependent and may not be an issue for most. That said, I also don't use quantisation, but I have plenty of memory. The savings at 128k context are estimated to be ~2.5GB at Q8 over FP16.
4) Yep, good advice, but again, how much it actually affects you depends on what you're running. There's even a difference between the latest 14" and 16" M5 Pros, where the 14" may start to throttle under load while the 16" has better thermal management.
5) I agree, but mostly because I've gotten used to llama.cpp and how to work its settings to get the best out of the model. I doubt switching to MLX would benefit me much, if at all.
6) Yeah, again this is very hardware-dependent and sounds like a thermal issue on your side again. I'm running mine at 4096 and not having any issues. I might monitor temps/fans at 2048 and see if there's any difference, but so far it's not something I've personally needed to consider.
7) Yep... if you want the best out of your models, avoid Ollama. I'm having a great time with llama.cpp. It's really not difficult to run.
Most of your advice is solid, but a lot of it is shaped by thermal constraints that won't apply to everyone. For context, I'm on a 16" MacBook M5 with 128GB. And yes, it's the latest and greatest, so I can push much further.
jonnywhatshisface@reddit
Thank you for the good point on mentioning HW. I'm on a 14" MBP. My context size is 131k as well, but I didn't see much savings at all on the KV cache (~200MB) when quantizing. However, I should also note I'm running the Q4 model to begin with. I imagine with the M5 128GB you're definitely running a larger quant, if not full precision? How's the performance? I'm seriously considering buying a 16" M5 Max right now.
I'll go back and double-check quantizing the KV cache; it's quite possible I read the numbers wrong and just wrote it off, since I've not needed to quantize it. I have plenty of RAM. But running with the 131k context I've only seen it peak at roughly 50GB, give or take.
If you don't mind, while running a heavy task on the model on the M5 Max, could you grab a few cycles of output from sudo powermetrics --samplers cpu_power,gpu_power? Would love to see what's going on once the temp gets up.
chimph@reddit
Sure. Here's about 10 minutes into a session with context at about 38k:
https://paste.rs/V3WZp
jonnywhatshisface@reddit
Also, you definitely CAN get it to the "vibe code" level of Opus (I do have it there!), but it's all about the agents. That's the special part about GHCP and OpenCode: the agents. Yes, the Claude models are very good, although they've been deteriorating lately... But with proper agent configuration and setup, you can literally automate all of that planning portion and steer it to function just like Claude if you put the time in. I can share some of my agent setup. But RAG is also a must. I'm using Serena, and I have an agent setup that basically ensures everything is done via Serena memories, including full project plans for changes, writing the patches out to memories, and working off the memories as things progress. The beautiful part is I can also have it plan out major chunks of multiple pieces of work, then have it autonomously go back and execute all of the projects and automate the full testing of everything.
Setup really makes the model more so than the model in some cases. :)
jonfoulkes@reddit
Thanks for sharing all this great info. As a fellow Mac user, this is gold.
I would greatly appreciate it if you shared more details about your agent setup and how you integrated Serena.
Also, what are your Mac specs?
I have an M4-Pro w/48GB in an MBP.
jonnywhatshisface@reddit
I'm running an M2 Max with 64GB RAM. I'm also working on some further tweaks, adjusting socket buffer sizes etc. to increase the speed of transferring content between the API and the client, as well as testing some memory-reclaim hackery. Will post more as I progress with some of that.
My agents are just written to enforce usage of Serena for all file editing, reading and RAG lookup, as well as outlining rules to verify logic, create "project plans" for every change using Serena memories, create memories as sub-tasks for every code change it's going to make (with the patches and exact lines in the memories), and use the main project memory as a project tracker, updating it as each step/sub-task is completed. I've also given instructions that if it doesn't know the answer, it should search the web and verify. The last bit is the test procedures: I've outlined that all results must be tested and verified for functionality, and it needs to make plans for those tests to confirm the end result for non-UI work.
Using this type of structure in the agent, I literally gave it a prompt of "Write a test to verify the results of the prediction model against real-world data. I don't know what data sources exist to verify it, but search and let me know." The end result: it went out and searched the web, found three different real-world APIs that log the data (commercial fish-catch reporting data), wrote the code to interact with their APIs, then called my APIs to gather data on specific dates before calling the other APIs to compare my output against the logged catch data... Yeah, it did all that off that one prompt. lol
dead_dads@reddit
Yo! New to local LLMs/AI stuff in general. I have an old 3090 and 128GB of DDR4 RAM. I was going to sell my old machine for parts, but it occurred to me this week that I could turn it into an AI machine to dip my toes into locally run stuff.
My interest right now is to work on some vibe-coding projects. I'd like to assess and test models that fit fully into the VRAM of the 3090, but I'm also curious about utilizing my DDR4 RAM to see what larger models bring into the equation.
What models would be worth my time for testing? I've been working with Claude to ID some stuff of interest, but as this field moves so fast, I thought asking people who are actively engaged in this stuff would be better.
bighead96@reddit
Use the 35B A3B variant; it's much, much faster and works well.
1asutriv@reddit
A lot of people say the local models are not as effective, but in my opinion it comes down to your flow and how well you've built the harness around the model. For example, do you have:
- a wiki of docs on the codebase
- a set of skills to address FE/BE/DevOps and other needs
- prompts to comprehensively address additions, updates, and flows
All of those I've built up, and IMO switching between frontier and local models is mostly negligible at this point.
Bootes-sphere@reddit
From running production inference, Qwen 3.6 at that quantization level hits a sweet spot most people sleep on. The token efficiency is genuinely competitive with Claude for code work, and the latency on local is unbeatable when you need sub-100ms response times. One thing: make sure your context window settings aren't cutting off early. Qwen handles longer contexts well, but VSCode extensions sometimes have their own ceilings that conflict with the model's actual limits.
How's the memory footprint looking in practice?
LegacyRemaster@reddit
me too. But try Abiray-Qwen3.6-27B-NVFP4-GGUF\Abiray-Qwen3.6-27B-NVFP4.gguf <-- Faster and zero issues on coding
Demonicated@reddit (OP)
OK, so it's a lot more chatty than the q8 version, so while I see better throughput, I'm not convinced it's the better experience. Not saying it's bad; I was seeing much longer responses coming from that particular model.
LegacyRemaster@reddit
I'm using it only with Kilo Code + VSCode
Demonicated@reddit (OP)
Sleeper comment right here! I just loaded it up and will be giving it a test drive today. Doubled token throughput and seems comparable so far.
LegacyRemaster@reddit
It was a great discovery. I use this version and if it doesn't work or doesn't do something I switch to minimax 2.7 q4_k_s
odytrice@reddit
This lines up with my personal observation. If you fire it off without verifying its plans, you are in for a world of hurt. That said, it's a dangerous practice even with SOTA models, and of course free is literally infinitely cheaper.
LienniTa@reddit
what harness do you use?
Demonicated@reddit (OP)
Github Copilot
MathematicianLoud947@reddit
Perhaps a silly question, but why not use OpenCode?
rulerofthehell@reddit
Not OP, but shady privacy; check the closed GitHub issues on its repo.
nopeac@reddit
Could you be more specific?
rulerofthehell@reddit
They send data to unknown AWS servers, on by default.
Pleasant-Shallot-707@reddit
How is using AWS as a compute service "shady"? Should they be required to run their own data center to manage their backend support services?
rulerofthehell@reddit
When running locally, genius
starshade16@reddit
Yo, can you tell me more about this? This worries me and I can't turn up anything with search.
rulerofthehell@reddit
Still-unsolved issues, for example (but there are more; dig through their open and closed issues): https://github.com/anomalyco/opencode/issues/459
CapsAdmin@reddit
I use VSCode as my editor with Copilot Chat, and I'm trying to switch to pi or OpenCode (I think I'll stick with pi).
What I miss the most right now from Copilot Chat is the ability to quickly reference files and chunks of lines from within VSCode.
I don't mind using a TUI too much, but I would much rather use a UI.
Terminal input fields are especially annoying to use: caret navigation, newlines and text selection tend to be inconsistent. It doesn't help that VSCode also messes with some key sequences like Ctrl+Enter and Ctrl+J (to insert a newline) when you try to use them in the VSCode terminal.
I've learned to sort of stop caring about formatting or even spell-correcting my prompts, as I feel LLMs understand regardless.
LienniTa@reddit
I hate pi cuz it parses multi-line pastes line by line. Vibecoded GUI for harnesses lol, it's an 8 KB vsix.
Demonicated@reddit (OP)
Fair question. Never have. I think I've become accustomed to copilot. We have synergy.
MathematicianLoud947@reddit
Whatever works best
rm-rf-rm@reddit
Is Insiders stable/no issues? The local model option has been available there forever, and they refuse to release it to main for some reason (likely profit-related).
bgravato@reddit
Are you using continue add-on?
Which sampling parameters (Temperature, Top-K/Top-P or Min-P) are you using?
Did you compare the Q8 version vs Q6 or Q4? Does it really make that huge of a difference?
Perfect-Flounder7856@reddit
I compared q4 on my use case and it was not nearly good enough for production
bgravato@reddit
Interesting... When asking AI about it, and even in the descriptions of some models on Hugging Face, it often says Q4 is the sweet spot and that the difference to Q6/Q8 isn't much, but then I keep seeing comments from users saying otherwise...
Perfect-Flounder7856@reddit
I've seen the opposite here.
Beamsters@reddit
Consider q5 if you are memory-constrained. It was night and day.
Perfect-Flounder7856@reddit
GGUF? I think I'm gonna try out NVFP4 on vLLM now that I have it running, to see if it has more accuracy than GGUF.
Demonicated@reddit (OP)
I'm using GitHub Copilot; I honestly love their harness. VSCode Insiders lets you set up OpenAI-compatible model sources.
I used whatever was set in LM Studio for inference config:
Temp 0.1
Repeat Penalty 1.1
Top K 40
Top P 0.95
Min P 0.05
I tried Q4, Q8 and bf16 on:
Qwen 3.6 Dense and MoE
Gemma 4
I usually run full size for models, but the quality of q8 was honestly just as good in day 1 of testing. Q4 got lost a couple of times and seemed to have more trouble. But I wasn't doing a 1-for-1 test; each model and quant got different tasks that I needed at the time. I was definitely going off "the feel" of using the model rather than going with data.
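If you'd rather pin those samplers per request than rely on the server UI, something like this rough sketch works against an OpenAI-compatible endpoint; note that top_k/min_p/repeat_penalty are llama.cpp-style extras, so whether your server honors them is an assumption to verify:

```python
# Hedged sketch: pinning the sampler settings per request instead of relying
# on LM Studio's UI defaults. temperature/top_p are standard OpenAI fields;
# top_k/min_p/repeat_penalty are llama.cpp-style extensions passed via
# extra_body -- whether your server honors them is an assumption to verify.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen-3.6-27b",  # placeholder for whatever your server lists
    messages=[{"role": "user", "content": "Refactor this function: ..."}],
    temperature=0.1,
    top_p=0.95,
    extra_body={"top_k": 40, "min_p": 0.05, "repeat_penalty": 1.1},
)
print(resp.choices[0].message.content)
```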
bgravato@reddit
From what I've been reading, you shouldn't use both Min P and Top K/P simultaneously... The usual recommendation is to either go with Min P and disable Top K/P, or disable Min P and go with Top K and/or Top P.
FullstackSensei@reddit
For parameters, just follow the Unsloth guide. For quant: yes, it really makes a difference for anything non-trivial with anything above 50k context, in my experience. Haven't tested with shorter, though.
I hand models 30k+ tokens of documentation about the project, and Q4 would forget parts of either the documentation or the prompt.
A few days ago, I let it loose on a sizeable codebase. I had spent a couple of days exploring this codebase to understand what was where and how things were done. Using Roo, I gave Q8_K_XL a 2k-token prompt about my findings and detailed instructions about how to proceed and how to plan its actions to document all the parts I cared about, and went to sleep. The next morning it had gone through 2M input tokens and generated almost 700k tokens, created over 40 sub-tasks (part of my prompt was one sub-task per original source file, but free to dig into any dependencies of that file) and a corresponding number of documentation files. I checked a few against the code, and they're impeccable. IMO, not bad for a 27B model.
iMakeTea@reddit
That's a lot of work done by the Q8 overnight.
Did you notice if Q5 and Q6 sit in between Q4 and Q8 for performance and remembering context?
FullstackSensei@reddit
I generally don't bother with Q5 or Q6. I want to trust the model, and I'll take the hit in performance if it means I can do that.
getstackfax@reddit
This is the local workflow that makes the most sense to me.
Not "local replaces every frontier model," but local becomes the default daily driver for routine work, tool calls, planned implementation, refactors, etc. Then premium hosted models are reserved for the parts that actually need the extra reasoning.
The Plan-round point feels important. A smaller/local model can punch way above its weight when the task is decomposed first, but it is probably not the best fit for vague "go build this whole feature" prompts.
That seems like the real token-saving stack (rough routing sketch below):
- local by default
- plan before implementation
- cloud escalation only when the task earns it
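A minimal sketch of that routing idea; the endpoints, model names, and the escalation flag are all placeholders:

```python
# Sketch of "local by default, cloud only on escalation". Endpoints and
# model names are placeholders for whatever you actually run.
from openai import OpenAI

LOCAL = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
CLOUD = OpenAI()  # reads OPENAI_API_KEY; stands in for any hosted provider

def run_task(prompt: str, needs_frontier: bool = False) -> str:
    """Planned, well-scoped work goes local; escalate only when flagged."""
    client, model = (CLOUD, "gpt-4.1") if needs_frontier else (LOCAL, "qwen-3.6-27b")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# The Plan round stays local and free; only the genuinely hard step escalates.
plan = run_task("Break this feature into small, file-scoped steps: ...")
```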
idkwhattochoo@reddit
how to make banana pineapple pizza locally? local by default, yes
Icy_Butterscotch6661@reddit
Lowercasing text and removing em-dashes isn't enough; it's the bare minimum
getstackfax@reddit
Fair point. Style cleanup is easy; making the workflow actually reliable is the hard part.
przemekcoditive@reddit
I totally agree; such an approach makes a lot of sense.
Recently I spent a lot of time writing workflows that decompose a "build this feature" request into a whole step-by-step plan, and the results were amazing, even for "old" models.
getstackfax@reddit
Exactly. The planning layer changes everything.
A weaker model with a clear checklist, file boundaries, and step-by-step execution can outperform a stronger model that is just given a vague "build the whole thing" prompt.
I think thatâs where a lot of local setups get underrated: not as magic replacement brains, but as cheap reliable workers once the job is broken down properly.
The-Rubber-Bandit@reddit
+1 to some of the other posts here. Why are you running GGUF? At least run AWQ, but more than likely you can run full-fat FP8. And yes, definitely vLLM! Check out the DFlash speculative decoder as well for even more of a speed bump.
Dany0@reddit
How are you using it? Copilot + OAI API provider? Kilo Code? Hermes? Roo/Cline?
Dany0@reddit
Also, like others said, vLLM is the way to go. Just point an idiot clanker at it and tell it to set it up via WSL2; there are ready-to-go recipes online. Just tell it to include vision and MTP.
If you don't need vision, you might be better off with a DFlash vLLM fork or that vibe-coded Luca DFlash thingie, but I would worry about that later.
It's a small model and, ironically, better at planning than big models; it makes really good, simple scaffolding for reducing big MoEs' entropy effect.
They have a tendency to hit a poison token around 128k context IME. You might be better off running vLLM with the max-seq param at 2-3 and lower context, and just run lots of subagents. You can reach crazy numbers if you batch and do small prompts; people here got 2000 tps on a 4090 IIRC.
But like I said, I'd worry about optimizing that later. If you get vLLM + MTP running, you've got most of the experience anyway.
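The batching point is just a matter of keeping many small requests in flight so the server can batch them; a hedged sketch (placeholder endpoint and model name, fan-out tuned to whatever you serve with --max-num-seqs):

```python
# Throughput via concurrency: keep many small prompts in flight so the vLLM
# server can batch them. Endpoint and model are placeholders; cap the fan-out
# to roughly what you serve with --max-num-seqs.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="qwen-3.6-27b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,  # small completions batch best
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Summarize chunk {i} of the scraped page..." for i in range(16)]
    results = await asyncio.gather(*(ask(p) for p in prompts))
    print(f"{len(results)} subagent results")

asyncio.run(main())
```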
mxmumtuna@reddit
Good advice. Once you have the basics down, it's just tweaking and realizing that vLLM and sglang don't give any shits about the consumer market, which is most evident on Blackwell. Lots of monkey patches.
Dany0@reddit
It's not that they don't give a shit. They made an engineering decision, and it makes sense that GPUs for gamers vs. ML researchers will be different. The fact that they have a bias that cuts one way because of money makes sense, but unless you have insider info you cannot know which one won out. And if I were an engineer at Nvidia, it wouldn't be an easy decision for me given all the constraints. The difference between consumer Blackwell and "full" Blackwell GPUs right now sometimes makes sense silicon-design-wise: consumer GPUs need a rural road compared to the highways the big GPUs need. Even the DGX Spark is "stifled" compared to the datacenter GPUs, so it's not like Nvidia is cutting features out of the consumer GPUs just for the hell of it.
The gamer demographic includes Tommy the warehouse and odd-job worker who saved up for a GPU for the last 5 years because it was his dream to once in his life own the biggest, baddest GPU among his CS2-playing friends. Of course the price will be pushed down.
The ML researcher demographic includes people on their 12th government grant this year for a rack of H200s because they want to find oil or enemy soldiers or protesters. Of course the price will be pushed up.
mxmumtuna@reddit
I think we agree; you just said it more elegantly than I did. I'm not saying their not giving a shit isn't justified or is the wrong engineering decision; it just takes more effort to maximize Blackwell-architecture cards, because they pretty deliberately slow-roll (or don't roll at all) sm120 improvements.
poopertay@reddit
OK, cool, so all I need is 10k USD to run something not as good as Codex or Claude... got it.
Demonicated@reddit (OP)
The amount of tokens I use would add up to 10k in under a year in the new token-pricing world. Your negative sentiment is misplaced.
I was already getting close to maxing this thing out before I discovered I could also use it for copiloting.
poopertay@reddit
What I mean is GPUs are overpriced
tmvr@reddit
Maybe, but you also don't need a 6000 Pro to run the 27B model. A 5090 is 3000-3500, and you can also use 2x 5060 Ti 16GB, which is under 1000 total. Then you have the 24GB cards from previous gens, like the 4090 or the 3090.
Demonicated@reddit (OP)
If you want to run quants. I typically don't; I did for this experiment to find out where the trade-offs occur, but q8 did impress in the GitHub harness.
Demonicated@reddit (OP)
Yes and no. They are way overpriced for the amount of VRAM, but the amount of business value I can get out of a 10k card is much greater. They pay for themselves. The value is there.
poopertay@reddit
Can't wait for China to rain the pain on all this GPU price gouging
Demonicated@reddit (OP)
Amen
Pleasant-Shallot-707@reddit
Not if they're selling above expectations
guitarjob@reddit
You're running a model that costs pennies on OpenRouter
Pleasant-Shallot-707@reddit
Skill issues
SharpRule4025@reddit
If you are building a data mining and scraping app, local models like Qwen work very well for the extraction phase. Sending raw HTML to hosted models gets expensive fast. You can run the initial scrape, strip the DOM down to just the text nodes, and pass that to your local 27B model to pull out structured JSON.
Keeping the context window clean is the main challenge. If you use a headless browser to get the page source, drop all the scripts, styles, and SVG tags before feeding it to Qwen. You get much more reliable JSON outputs and it cuts token generation time.
For sites that obfuscate their CSS class names, having the local model analyze the surrounding text rather than relying on precise DOM selectors makes your scrapers less brittle. Just make sure your system prompt enforces strict JSON formatting.
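A rough sketch of that extraction flow; the endpoint, model name, and JSON keys here are placeholders, so adapt it to your stack:

```python
# Hedged sketch of the flow above: strip the DOM to text, then ask the local
# model for strict JSON. The endpoint, model name, and JSON keys are all
# placeholders; json.loads assumes the model obeys the system prompt.
import json
from bs4 import BeautifulSoup  # pip install beautifulsoup4
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def extract(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "svg", "noscript"]):
        tag.decompose()  # drop everything that isn't content
    text = soup.get_text(separator="\n", strip=True)

    resp = client.chat.completions.create(
        model="qwen-3.6-27b",
        messages=[
            {"role": "system", "content": "Return ONLY valid JSON with keys "
                                          "title, price, description. No prose."},
            {"role": "user", "content": text[:20000]},  # keep the context clean
        ],
        temperature=0.0,
    )
    return json.loads(resp.choices[0].message.content)
```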
Demonicated@reddit (OP)
Yeah, we spent a lot of time fine-tuning our preprocessing and prompts, because when we first designed everything we were using OpenAI's OSS 120. So once Gemma 4 and the new Qwens dropped, it was like a gift from the gods. Our whole system became airtight immediately.
Fit-Statistician8636@reddit
So it's still possible to connect Copilot to your own endpoint(s) with the Insiders build? I thought they removed the feature. Did you need to hack it somehow?
Demonicated@reddit (OP)
Nope. It's a normal user option on that build
uti24@reddit
Qwen-3.6-27B is really impressive and I can't recommend it more for free local use on sane hardware, but:
I am trying to use Qwen-3.6-27B as my hobby driver and it's not that good. One-shots and things like conversation are really good, but agentic work, not so much. The model gets lost when my hobby stuff gets bigger than 5 files. Using OpenCode.
Yeah, let's be even more clear: it's not even Haiku 4.5 level, far from it. It also gets into loops sometimes.
But again, I am using like Q6 and an AMD thingie, so maybe Q8 is much better.
I've never seen hiccups from GitHub models; unless it's a really complicated feature with many steps, I get the result right away. With Qwen-3.6-27B I enter my request and wait; I can leave it and go do my other stuff and it will finish like 25 minutes later.
I often see it wasting tokens, thinking not about how to implement something but just spitting out the exact code it's going to implement, and running ideas around again and again.
boutell@reddit
I can read this two ways: "yes, local AI will cut it for challenging work," or "no, local AI is not a realistic option for less than nine grand."
But for my personal use cases I'm finding 27B tantalizingly good at 4-bit, even if 4-bit didn't cut it for your tasks.
So I'm tempted to just build a box around a card with just barely enough VRAM and excellent memory bandwidth, which is definitely the limiting factor here. Everyone's numbers show it, including yours.
mr_Owner@reddit
Vibe engineering is the way with SLMs when you know what you're doing
autonomousdev_@reddit
tried a similar setup for three months and yeah the context window thing is brutal on react projects. python autocomplete was actually decent though. ended up switching back to copilot plus a rented gpu for batch stuff. saved like 200 bucks a month on electricity cause running that card daily was insane on my bill
fasti-au@reddit
27B is dense, so more of the token hopping is layer matrix work, and the 35B is faster; think 100 vs. 170 token speed. Treat it as a flash one-shot on specs, small and focused, and use the 27B as a light reasoner in the oversight/relationship-manager role. They are trained on UL lists, so one task per line. Numbers are work orders, and panic goes straight to bash ps, send and ls, so one-shot is better if tooled.
It's a delta in Qwen, and the llama.cpp turbo quant is in play also. Tip: do not expect recall to hold up if prompting like a human. It's synthetic training, so it's all spec-kit-style recall.
misha1350@reddit
Now you only need to generate like a couple billion tokens or something just for it to pay off... I hope you have an actual use case for a local LLM, such as protecting your private code.
Otherwise you would've been much, much better off buying an Intel Arc Pro B60 or B70 to run the same Qwen3.6 27B at Q4_K_XL or Q5_K_XL with a decently sized context window instead.
running101@reddit
Every provider is raising prices
misha1350@reddit
Except DeepSeek. Which is the only API provider that really matters.
walmis@reddit
People tend to forget that you can always resell these parts down the line. When you factor in the residual resale value of the GPUs, the break-even point against API tokens drops dramatically.
Running a local setup is essentially like driving an EV and charging it with your own solar panels. You pay a heavier upfront premium to build the infrastructure, but you escape the continuous cycle of buying "gas" (or constantly paying for API credits). At the end of the day, you own a tangible asset you can liquidate whenever you want to upgrade or exit.
Demonicated@reddit (OP)
So right now we use about 200 million tokens a month, but this is only running 8 states of data. When that goes up to 50, I think I'll be more than maxing out this card. Right now prices are $5-$30 per 1M, so let's assume the low end and I'm still using $1k in tokens a month.
But the icing on the cake is I have a solar and battery system on my house, so I don't even pay the cost of electricity (or very little) on this thing. It will take less than a year to recoup my costs, and you'd be a fool to think prices won't be perpetually going up.
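Spelled out, the break-even math from those figures (electricity assumed roughly free thanks to the solar setup):

```python
# The break-even math from the figures above, spelled out. Electricity is
# treated as ~free thanks to the solar/battery setup.
tokens_per_month = 200e6   # current usage, per the comment
price_per_m      = 5.00    # USD per 1M tokens, low end of the quoted $5-$30
card_cost        = 10_000  # rough RTX 6000 Pro price

monthly_api_cost = tokens_per_month / 1e6 * price_per_m
print(f"${monthly_api_cost:,.0f}/month in API tokens")              # $1,000/month
print(f"break-even in ~{card_cost / monthly_api_cost:.0f} months")  # ~10 months
```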
mxmumtuna@reddit
Pretty soon you'll have 4 or 8. Start making space now.
Demonicated@reddit (OP)
I don't see the issue with this lol
pack170@reddit
Power. If you're running off of 120V, two of those GPUs plus the rest of the system will put you pretty close to the 80% limit of a 15-amp circuit on their own. You might need an electrician to run a higher-amp/volt circuit, and/or more HVAC to deal with the extra heat.
Demonicated@reddit (OP)
I wired a 30 amp line into my office. And I have about 1200 watts of solar power coming in with a battery.
With one gpu I'm essentially getting things "for free".
mxmumtuna@reddit
It's just the natural progression
vick2djax@reddit
This post made me check my Claude code usage. 8.26B tokens last month on my Claude max subscription lol
Demonicated@reddit (OP)
B..... like billion?! Just you???
bronxct1@reddit
That's silly. I know someone who spent almost 9k on Claude last month. There are absolute power users out there that will make that up in no time.
vick2djax@reddit
Why not Claude max?
bronxct1@reddit
This is someone who was hitting the max limits. He had to go to api pricing
vick2djax@reddit
Dang, I thought I was hitting it hard. I have 8.26B tokens used on my Claude Max sub last month and was able to stay under the limits outside of maybe 1-2 days.
mxmumtuna@reddit
A B70? Come on. Be real.
lunerift@reddit
This matches my experience: "good enough" local models work if you already know what you're doing. The gap is less about raw capability and more about how much steering and structure they need. Tooling + planning matters more than model choice at that point. How stable is it for longer multi-step tasks in your setup?
Demonicated@reddit (OP)
It seems to be all right, but I'm keeping work scope small intentionally. Back before plan mode, I used to have models generate markdown files with the plan so I could have them break things into phases. I'm tempted to try that again instead of planning mode, to see if I have luck with bigger tasks.
przemekcoditive@reddit
I'm curious about this too. Keep us posted!
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
WonderRico@reddit
Highly recommend testing QuantTrio/Qwen3.5-122B-A10B-AWQ in vLLM for the speed (hopefully the 3.6 version will be released...).
matjam@reddit
I've been spinning up inference in AWS and testing it with vLLM, and it's ridiculously good running the FP16 safetensors. So fast.
Got some issues with random disconnections, though. Might be the proxy layer I wrote.
Pleasant-Shallot-707@reddit
Frankly, just telling Claude to implement a feature without a proper plan documented, and governance on how you want the project designed architecturally, results in really fragile and difficult-to-manage code anyway.
txoixoegosi@reddit
I really want a 6000 for my daily driver, but I can't justify the ROI just yet... $10k is many months of any decent AI service. I need some argument support haha
e979d9@reddit
Shared ownership with 2-4 like-minded folks would totally make sense
WetSound@reddit
What's your tps?
Demonicated@reddit (OP)
RTX 6000 Pro, LM Studio, Windows, qwen3.6-27b@q8_k_xl (Unsloth)
38 tok/sec
redditrasberry@reddit
I think you touch on one of the reasons there is so much disagreement about how useful local models are. If you really need your hand held, that's where full-scale hosted models are very different. But experienced devs actively don't want our hands held; we want to boss this thing around. Once you're doing that anyway, the difference between full-scale models and local ones is much more marginal.
wbulot@reddit
Totally agree with this. I'm also so impressed by Qwen 3.6 27B that I use it for 90% of what I do daily. I want to keep full control of everything, read every line of code it generates to keep it all in my head, and decide the next step myself. My slow 15 t/s isn't even an issue; it's almost exactly the speed I read at. I just switch to a bigger model when I need to investigate something very complex; otherwise, the local one is perfectly fine.
Demonicated@reddit (OP)
Exactly this. I definitely got pulled into the bong vibe-coding mode and just got really proficient at reviewing PRs.
But really, I never needed anything that powerful. 3.6 feels like good enough that I can truly steer the ship and get great results. I feel like the next open-source model will hit in a few months and I'll be exactly where I want from a local model. If I can get Opus 4.6 locally, I will never want for more.
User_Deprecated@reddit
The plan-first thing is real. I tried feeding it a feature request cold and it went in circles, but once I broke it down into "here's the interface change, here's the handler, here's the test" it knocked each piece out fine.
Thinking mode is worth toggling off for straightforward implementation though. It burns a bunch of tokens just restating what's already in the plan before it starts writing code, and the output isn't really better for it.
dontbeeadick@reddit
Helpful. I've been experimenting with local Qwen configs and experiencing many issues. Also on an RTX 6000.
Brilliant_Anxiety_36@reddit
Same here. That model is impressive; 35B A3B is also usable. I didn't trust it much, being a MoE, but both don't overthink, follow instructions correctly, and like to test everything first. I hope there will be a full-tensor version of this model, because the slow performance comes from it being a hybrid SSM model.
Bohdanowicz@reddit
Running the official FP8 on an A6000 Ada, and I'm doing 400-500 tok/s across 8-12 parallel workloads. I've seen input reach 12000; depends on batching.
vLLM serving with the recommended settings.
j4ys0nj@reddit
I know this has been said, but vLLM is the way to go! You can get way more concurrency, like 6-10 simultaneous requests all running at near the same speed as one.
Eyelbee@reddit
What's new in the VSCode Insiders edition? Is there a better local harness or something? Copilot already supports local models, but it sucked pretty hard last time.
Demonicated@reddit (OP)
Regular VSCode only lets you point at Ollama; Insiders lets you point at any IP.
Everyone has their preference on harness; I find Copilot pretty decent. Cursor is fine too. I haven't used Continue in a minute.
StardockEngineer@reddit
You can use the LiteLLM provider for any custom endpoint, FYI. I point it at llama.cpp.
grabber4321@reddit
Get Zed Dev; it's a much better harness IDE. Works out of the box.