Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM
Posted by grumd@reddit | LocalLLaMA | 42 comments
Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable.
Autocomplete: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L
Agentic: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL
Why these models:
Qwen2.5 is still the best model for infill imo. I tried Gemma4 E4B and Qwen3.5 9B/4B and both produce weird suggestions.
This autocomplete model takes ~8GB VRAM using the command below. The speed of suggestions is basically instant.
Qwen3.6 35B-A3B is actually good at agentic coding at Q8 if you give it a good prompt. At Q4 it's not usable tbh and gets lost a lot, but at Q8 it can figure stuff out and actually finish its work correctly. If you don't have a lot of RAM for MoE experts, try Q6_K, but lower quants have noticeable quality issues. You probably need 64GB total RAM minimum.
Because it has 3B active params, it's still fast and fits into the remaining 8GB VRAM.
Commands:
llama-server -hf bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L \
-ngl 99 --no-mmap --ctx-size 32000 -ctk q8_0 -ctv q8_0 \
-np 1 --temp 0.5 --top-p 0.95 --top-k 20 --min-p 0.0 --port 8081
Note: I actually have no idea which hyperparameters to use for Qwen2.5, maybe someone will enlighten me and I'll edit the post.
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL \
--no-mmap --no-mmproj -fitt 0 -ngl 99 --cpu-moe \
-b 2048 -ub 2048 --jinja \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01
llama.cpp autofits the model and I get ~145k context with this command. You can use -ctv q8_0 -ctk q8_0 if you want more context.
35B-A3B speed with this setup:
test | t/s
pp4096 | 2093.93 ± 22.64
tg128 | 35.29 ± 0.48
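Those are llama-bench style numbers (tokens/sec for a 4096-token prompt and 128 generated tokens). A rough sketch of the benchmark invocation, with a placeholder model path and without the MoE-offload flag (I haven't checked whether llama-bench takes it in the same form as llama-server):
llama-bench -m /path/to/Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf -ngl 99 -p 4096 -n 128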
grumd@reddit (OP)
I made this post mostly for GPU-poors with 16GB VRAM, if you have 24-32GB then please use Qwen 3.6 27B.
I've tried connecting my wife's gaming PC with a 16GB 6900 XT via llama.cpp RPC server and ran Qwen 3.6 27B Q6_K with a good context length. It's much better than 35B-A3B Q8. However, 27B Q4_K_M didn't feel as good, felt worse or on par with 35B-A3B Q8. YMMV.
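Rough sketch of that RPC setup in case anyone wants to try it (binary and flag names are from llama.cpp's RPC backend as I remember them, built with GGML_RPC=ON; the model path, IP and port are just placeholders):
On the second PC:
rpc-server --host 0.0.0.0 --port 50052
On the main PC, point the usual command at the remote backend:
llama-server -m /path/to/Qwen3.6-27B-Q6_K.gguf --rpc 192.168.1.50:50052 -ngl 99 --ctx-size 32000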
luncheroo@reddit
Have you given any thought to the nvfp4 models on that card? I wonder if the quality would be as good as the speed improvement on something like vllm.
grumd@reddit (OP)
NVFP4 is mostly for SM100 arch, and you need the full model to stay on the GPU anyway for NVFP4 to be accelerated by native FP4 math. It's not something I'm looking at for a 5080
luncheroo@reddit
I respect your decision; I thought that the Qwen3.6-27b model in NVFP4 was 14GB.
grumd@reddit (OP)
14GB is around Q3 for that model, idk where you saw NVFP4 for 14GB
ea_man@reddit
Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf
https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF
https://www.reddit.com/r/LocalLLaMA/comments/1tau4bk/comment/olf48kb/?context=3&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
grumd@reddit (OP)
IQ4_XS is not NVFP4
ea_man@reddit
ofc not but as you can see you can have IQ4 at 14GB, so maybe...
grumd@reddit (OP)
I've tried 27B at IQ4. It's dumber than 35B at Q8.
ea_man@reddit
I hope you didn't use the same values for Q8 and Q4, like temp, max_p ...
grumd@reddit (OP)
Yes I used the same numbers, as Qwen recommends.
ea_man@reddit
Yeah that would not work, you'll get some gibberish.
If you care I can paste you my tested values, pretty much reduce temp to 0.3 - 0.6 and reduce repetitions.
ea_man@reddit
Sorry I don't use NVFP4, dunno.
Yet I'd rather use an IQ4 without that than an IQ3, maybe you can find a stripped down NVFP4
luncheroo@reddit
I've definitely tried that one and it is good. I was trying to see if I could squeeze an NVFP4 type model into vllm on a 5060ti for the speed. From what I learned by trying, it *may* be possible with a text only model on a headless machine where one is using Linux and there's not a display driver taking up extra VRAM, but I didn't want to go that far with my experiment. I'll wait and see what other members of the community come up with. This is all kind of bleeding edge and this arena is not my discipline.
ea_man@reddit
Well, I guess you or anybody interested in NVFP4 could come up with an IQ4_XS that doesn't use the extra space the other version uses on some layers due to the old llama.cpp bug.
Or maybe there's an old qwen3.5 around in NVFP4 that is still small, I dunno, I don't run NVIDIA, maybe ask on this sub for that.
luncheroo@reddit
I was looking at the hardware test info here, but I admit I haven't tried it myself. I may try it with a small context to see how it does. A smaller context isn't useful for coding like your setup, but it might help me in other ways.
https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
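If I do get around to it, the attempt would look roughly like this (untested; whether NVFP4 kernels actually help on a 5060 Ti is exactly the open question, and --max-model-len is kept small on purpose to see if the weights plus a tiny KV cache fit in 16GB):
vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP --max-model-len 4096 --gpu-memory-utilization 0.95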
Clean_Initial_9618@reddit
I have an RTX 3090 in my gaming PC and a spare laptop with an RTX 3070. Can I link them together and run better quants, or would speed be an issue?
grumd@reddit (OP)
You should definitely try setting up llama.cpp RPC and see how it goes
DiscipleofDeceit666@reddit
The 27B is so much slower tho 😴 I think I'm convinced to give the MoE another shot. I've been using Q4 but I probably have room for the Q8.
So q8 small or q6 XL?
grumd@reddit (OP)
UD-Q8_K_XL with --cpu-moe takes only 8GB VRAM, I think you should try the full XL Q8
ikkiho@reddit
tbh I just use llama.vscode for the autocomplete part, it hits llama.cpp's FIM endpoint directly so there's no proxy nonsense. For agentic at this size, aider is the obvious choice; opencode is the newer thing people are trying but still rough. One gotcha if you go down this road: running infill and agentic off the same llama-server will fight over KV cache, so just bind two ports.
grumd@reddit (OP)
that's exactly what my setup uses. llama.vscode extension pointing at a llama-server that runs on a separate port
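Concretely: the autocomplete server in the post is already on --port 8081, and the agentic one just gets its own port, e.g. (8082 is arbitrary):
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL \
--no-mmap --no-mmproj -fitt 0 -ngl 99 --cpu-moe \
-b 2048 -ub 2048 --jinja \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01 --port 8082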
PhlarnogularMaqulezi@reddit
This is my exact laptop configuration for the VRAM and RAM, it's always nice seeing others with it too.
I'll have to give this a go. I'm still a pleb using LM Studio and I've been trying to find time to get set up with llama.cpp proper (hopefully it can use my existing models downloaded via LM Studio).
Qwen3.5-9b has been one of the best models I've used so far, but agentic coding wise it's still on the rough side for sure.
arbv@reddit
Don't miss llama-swap with its matrix DSL for model loading to easily switch between models. When using llama.cpp, experiment with batch size (-b), micro batch size (-ub), --fit-target, --fit-ctx, mmproj offloading, etc. Batch and micro batch size are really important - I have seen people claiming that Vulkan is faster than ROCm on AMD, while in my case ROCm is always superior with proper batch sizes. You need to experiment a lot to figure out the right parameters for each model.
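For anyone who hasn't seen llama-swap: the config is roughly a YAML map of model names to llama-server commands with a ${PORT} macro, something like this (a sketch from memory, model names and flags taken from OP's post):
models:
  "qwen-coder-7b":
    cmd: >
      llama-server --port ${PORT}
      -hf bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L
      -ngl 99 -ctk q8_0 -ctv q8_0
  "qwen3.6-35b-a3b":
    cmd: >
      llama-server --port ${PORT}
      -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL
      --cpu-moe --jinja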
Danmoreng@reddit
Just build it from source and run it in router mode pointing at your model directory. I have install scripts for windows which install all the prerequisites to build from source for cuda systems: https://github.com/Danmoreng/llama.cpp-installer
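For reference, the standard CUDA build from source is roughly this (assumes cmake and the CUDA toolkit are already installed; on Windows the install scripts above handle the prerequisites):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j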
Hypersonic_Popcorn@reddit
I did just that. I use llama-swap to allow me to serve more than one model but really, you just need the path to the model and llama-server/llama-swap can load the GGUF from LM Studio just fine. The reverse is true as well.
SuperWallabies@reddit
Your work is beautiful.
But are you really satisfied with the result you got? Has it made your coding life better?
🤔 I still think it may be better to use a service like Cursor.
TurpentineEnjoyer@reddit
What did you use for the agentic coding on the client side? VS Code + OpenCode plugin, for example?
grumd@reddit (OP)
VSCode with OpenCode is solid.
I'm using Zed as my IDE, and lately I've been using pi.dev more than OpenCode, without any plugins. Pi is honestly great as-is for straightforward tasks.
TurpentineEnjoyer@reddit
Thanks! I'll take a look at pi.dev later
Shot-Ad8790@reddit
For Qwen2.5 hyperparameters, try lowering temperature to 0.3 and top-p to 0.9 for stricter infill. You might also experiment with top-k at 40. Often, small tweaks make autocomplete less erratic.
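Dropped into OP's command, that would look something like this (untested, values as suggested above):
llama-server -hf bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L \
-ngl 99 --no-mmap --ctx-size 32000 -ctk q8_0 -ctv q8_0 \
-np 1 --temp 0.3 --top-p 0.9 --top-k 40 --min-p 0.0 --port 8081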
alphatrad@reddit
I'd been telling people for some time around these parts that Qwen Coder was a great autocomplete.
Solid configuration there.
Organic_Scarcity_495@reddit
nice setup. the Q8 vs Q4 difference on the A3B is real — i found the same thing where below Q6 the MoE routing starts making noticeably worse expert choices. one tip: if you're running the autocomplete model on a separate port, you can also run a tiny embedding model (like 0.5B) on CPU for retrieval-augmented infill. not needed for basic autocomplete but helps a lot when the agent needs to pull in specific function signatures from your codebase on the fly.
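A minimal sketch of what that extra server could look like, assuming you already have a small embedding GGUF on disk (path and port are placeholders; -ngl 0 keeps it on the CPU):
llama-server -m /path/to/small-embedding-model.gguf --embedding -ngl 0 --port 8083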
grumd@reddit (OP)
I might spin up a small embedding model for lat.md usage tbh
Similar-Ad5933@reddit
Qwen3.6-35B-A3B-UD-Q4_K_M is good for coding. But context should stay at Q8 or it will get lost.
grumd@reddit (OP)
If you can fit Q6 or Q8 I recommend trying that instead of Q4, it will be much more stable and smart
rob417@reddit
Which front end are you using for auto complete? I tried changing the auto complete model in VS Code a few months ago, but I couldn't find a way because Microsoft locked down that option. It could just be I didn't figure out the correct way to change it.
grumd@reddit (OP)
Use llama-vscode extension
Organic_Scarcity_495@reddit
the 7b for autocomplete + 35b for agentic is a smart split. running both on the same 16GB card is impressive — ram offloading is underrated for getting usable setups without buying a second card
bitslizer@reddit
What front end/agent manager are you using? Hermes? Claude code? This is pretty much my next project
grumd@reddit (OP)
pi.dev or OpenCode are my go-to
Objective-Can108@reddit
Good work man!