What's the best local model I can run with 16 GB VRAM (RTX 5070 Ti)?
Posted by callmedevilthebad@reddit | LocalLLaMA | View on Reddit | 67 comments
I want to use this for testing, but with image support. Think Playwright test cases, so it should have some coding capability to fix things if something goes off.
_-_David@reddit
Qwen3.5-35B-A3B.
callmedevilthebad@reddit (OP)
Seems like `Qwen3.5-35B-A3B` won't fit in my VRAM. Trying to find quantized ones on Ollama (I'll install using Open WebUI). Can you share any links to those? Thanks :)
c64z86@reddit
It should offload into your system RAM; it will run slower, but it will still run! Try llama.cpp and check my comments for how I got it set up if you are stuck. I'm getting 57 tokens a second on there with my 12 GB GPU.
callmedevilthebad@reddit (OP)
Curious if you tried the 3.6 A3B, and if you can share your config.
c64z86@reddit
I did, and I get 36 tokens a second on that, dropping to 29-30 when the context gets full (over 40k or so). I just loaded it up with the defaults and let llama.cpp handle it all, the only option being `-c` to set the context to 128k, so `-c 128000`.
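For anyone following along, the setup described above boils down to a single command: defaults everywhere except context size. This is a sketch, not the commenter's exact invocation; the model filename is a placeholder, so substitute whatever GGUF you downloaded.

```shell
# llama-server from the llama.cpp releases; all defaults except context.
# Replace the .gguf filename with your own download.
./llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf -c 128000
```

The server then listens on port 8080 by default, with a built-in web chat UI.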
callmedevilthebad@reddit (OP)
What's your take on 3.6 vs 3.5?
c64z86@reddit
Much better! It doesn't think as much and gets straight to the point more often. It's still not as good at one-shotting things as Gemma is, though.
callmedevilthebad@reddit (OP)
How do you recommend using it for agents vs. for coding?
c64z86@reddit
I didn't use it for agents, just pure llama.cpp chat and HTML coding.
Billah-07@reddit
How are you using the local model? Are you using it via chat or doing more with it? I have a 5070 Ti 16 GB. Chatting works, but when I run Claude Code with a local model it doesn't work. I can see logs inside LM Studio saying the context is too large. I am working in an existing project, so the context will be high.
c64z86@reddit
I'm using it via chat only, I don't even use a harness or anything else with it because I haven't figured out how to set all that up yet.
Billah-07@reddit
Yeah, that's what I am trying to figure out myself. CC requires a huge context prompt, which fails when faced with my puny 16 GB GPU.
c64z86@reddit
Ah, I don't have anything other than a good luck, sorry. When I looked that up it seemed so confusing to set up that I just noped back out of there lol.
Billah-07@reddit
I am discussing this on a different thread with a user who has a 9070 XT, and he says he can run agents smoothly. Maybe he can help us figure something out.
He's using Pi Dev / llama.cpp instead of CC.
callmedevilthebad@reddit (OP)
I was trying to avoid any extra setup
c64z86@reddit
If it helps, it's all portable and no setup required other than downloading a few things and putting them all in one folder.
callmedevilthebad@reddit (OP)
That sounds easy. Actually, let me try.
c64z86@reddit
Sure! You can grab the binaries from the llama.cpp releases page (Releases · ggml-org/llama.cpp); make sure to also download the CUDA 13 DLLs alongside them.
callmedevilthebad@reddit (OP)
Thanks man! It means a lot
c64z86@reddit
NP, have fun... and a word of caution: the 27B might still be as slow as a glacier lol. Even with llama.cpp it was 5 tokens a second on mine. You might have much more fun with the 35B!
callmedevilthebad@reddit (OP)
That's true. It took an eternity to respond. I am deleting it for now and will wait until a quantized one is released.
nikhilprasanth@reddit
How much system ram do you have?
callmedevilthebad@reddit (OP)
64 gigs
nikhilprasanth@reddit
https://pastebin.com/UPhHj1Y9
Try this too. It's a simple HTML page that tells you how many layers to offload.
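The arithmetic behind a layer-offload calculator like that page is simple enough to sketch. This is a rough illustration, not the pastebin's actual logic; the function name, the 1.5 GB overhead reserve, and the example numbers are all assumptions.

```python
# Rough estimate of how many transformer layers fit in VRAM.
# overhead_gb reserves room for KV cache, CUDA context, and activations.
def layers_on_gpu(vram_gb, model_size_gb, n_layers, overhead_gb=1.5):
    per_layer_gb = model_size_gb / n_layers   # average size of one layer
    usable_gb = max(vram_gb - overhead_gb, 0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# Illustrative example: a ~17 GB GGUF with 48 layers on a 16 GB card
print(layers_on_gpu(16, 17, 48))  # -> 40, i.e. -ngl 40
```

With the remaining layers on CPU, generation still works; it is just slower per token.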
callmedevilthebad@reddit (OP)
You could host this on GitHub.
_-_David@reddit
That's the beauty of it. I knew you didn't have enough VRAM for it. It's still the best for you.
callmedevilthebad@reddit (OP)
Are you saying Qwen3.5 27B will work? I think it is 17 GB in size.
_-_David@reddit
Downvoted? You are welcome for the correct and simple answer to your question.
callmedevilthebad@reddit (OP)
It's not me who downvoted. I am getting downvotes on my own comments lol. Reddit is crazy.
_-_David@reddit
Sorry for my short tone. I was having a stressful day. And yeah, reddit does seem weird. I've been around for a little while, but I just started posting comments very recently. The responses are WILD. Did you ever get the 35b working? I used to use qwen3-30b-a3b with CPU offloading when I was running a less expensive card, and that worked nicely for me because of the low active parameter count of that MoE model. qwen3.5 should offer the same but better. I know a lot of people on low vram are enjoying its capabilities quite a bit.
_-_David@reddit
No. 35b-a3b. I could explain why, but it really isn't necessary. Trust. Download. Use it. It is the best model for you. 100% confident.
callmedevilthebad@reddit (OP)
I ran it, very slow. Again, I think I will wait for a quantized one.
defensivedig0@reddit
Assuming you're using llama.cpp, you need to offload the experts to system RAM; offloading entire layers will be very slow. On my 5060 Ti + 32 GB RAM it runs fine, so I can only imagine it will run faster for you.
callmedevilthebad@reddit (OP)
No, I tried via Open WebUI for now. I will try to set up llama.cpp on the weekend.
InvertedVantage@reddit
The 27B will work, and the 35B-A3B will work too; just offload all the experts to the CPU.
NonStopArseGas@reddit
Any advice on which of the new Qwen models is best suited to my 8 GB 3060 Ti with 32 GB DDR4?
_-_David@reddit
The new Qwen3.5 "small" models should be out imminently; I've heard Tuesday is the top guess. I'd anticipate that a Qwen3.5 9B model would work nicely. Until they come out we can't be certain what the "small" sizes will be exactly, but I'd bet there's a solid one you'll be able to use :)
nikhilprasanth@reddit
Running Qwen3.5 35B A3B Q8 on a 5060 Ti 16 GB and DDR5 with 65K context:
```bat
set CUDA_VISIBLE_DEVICES=0 && "C:\Users\user\Desktop\llama\llama-server.exe" ^
-m "D:\Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf" ^
-a Qwen3.5-35B-A3B ^
--ctx_size 65536 ^
-ot ".ffn_.*_exps.=CPU" ^
--jinja ^
-fa on ^
-ngl 999 ^
-t 12 ^
-b 2048 ^
-ub 256 ^
--no-mmap ^
-ctk q8_0 ^
-ctv q8_0 ^
--temp 0.6 ^
--top-p 0.95 ^
--top-k 20 ^
--min-p 0.00 ^
--chat-template-kwargs "{\"enable_thinking\": false}"
```
ForsookComparison@reddit
Run with llama.cpp and use the `--n-cpu-moe` option. Try to set it so that your GPU is close to full (maybe 14 GB used?) and the rest is on CPU/system memory.
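A sketch of that suggestion as a llama.cpp command. The model filename and the value 20 are placeholders; tune the number while watching GPU memory in `nvidia-smi` or Task Manager.

```shell
# -ngl 999 puts all layers on the GPU, then --n-cpu-moe pushes the MoE
# expert tensors of the first N layers back to the CPU. Raise or lower
# N (here 20) until GPU memory sits near, but under, 16 GB.
./llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -ngl 999 \
  --n-cpu-moe 20 \
  -c 32768
```

Lower `--n-cpu-moe` values mean more of the model on the GPU and faster generation, until you run out of VRAM.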
sine120@reddit
Use the latest LM Studio or compile llama.cpp
sine120@reddit
3.5-35B IQ3_XXS for speed, or the Q3 of the 27B for intelligence. Both are performing great for me.
Billah-07@reddit
I am using LM Studio, and while the chat tok/s speed is decent, I need to use a local model with Claude Code, and I cannot make that work. Is there any other way to use local LLMs to help me with coding? I don't want to use chat and go back and forth between code and chat.
I have a 5070 Ti.
sine120@reddit
CC has a huge system prompt. Try something smaller like Pi, and switch to llama.cpp and use llama-server. I have a 9070 XT with 16GB VRAM and using llama-server and Pi I get good results. Agents in Pi feel smarter than CC.
https://pi.dev/
Billah-07@reddit
Can you share some materials on this? I have only used CC for the past few months and bought a GPU last week to try local models so I can stop paying monthly. I feel like I wasted my money and that programming with the help of a local LLM is not real 😂.
I don't have crazy big tasks; the project I am working on is in its completion stage. There are just minor tasks left, for which I don't think I need to pay for Anthropic's Opus.
sine120@reddit
It's very possible to do light coding with local models now with Qwen3.6. I can't teach you how to use agents in a Reddit comment; you'll need to do some research and play with it yourself. Download and build llama.cpp, create or borrow a script to run it with parameters tuned to your system, connect your local LLM to Pi, tell it to do something, and get a feel for how it works. LM Studio is not suitable for hosting; it's more for testing whether a model works. It uses llama.cpp as the backend, so just use the backend yourself.
I don't know your system, but if you want it 100% in VRAM, try a small (Q3, 12-14 GB) quant of Qwen3.6-27B. If you're okay with an MoE split running out of normal RAM, try Qwen3.6-35B-A3B, probably Q4 to Q6_K_XL. A larger AI model like Claude or Gemini can walk you through it, but make sure to prompt it to use the latest information and search recent developments, as its training data is very out of date.
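As a concrete sketch of the "connect your local LLM" step: llama-server exposes an OpenAI-compatible API (port 8080 by default), so any agent or client that speaks that protocol can be pointed at it. A quick smoke test with curl, assuming the server is already running locally:

```shell
# Hit llama-server's OpenAI-compatible chat endpoint. The "model" field
# is largely cosmetic here since the server hosts one loaded model.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hello"}]}'
```

Agent tools are then configured with the base URL `http://localhost:8080/v1` in place of a cloud provider's endpoint.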
TurnUpThe4D3D3D3@reddit
Q3 seems precarious. I would be worried about inaccuracies at that quant. How’s it working for you?
sine120@reddit
Very well, so far. On a 9070 XT I get roughly 30 tok/s on the dense model and 100 tok/s on the MoE, and it's very coherent. My basic coding tasks were no issue. I wish I could fit the Q4_K_XL quants, but for the speed, the IQ3 is very usable.
TurnUpThe4D3D3D3@reddit
Won’t fit in VRAM
Eternal_Ohm@reddit
Doesn't matter; it's a MoE model, so it doesn't have to fit entirely within VRAM: the expert weights can be offloaded to system RAM.
TurnUpThe4D3D3D3@reddit
It still has to read the active experts' weights on every token, so your main bottleneck will be memory bandwidth, not compute.
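The bandwidth argument can be checked with back-of-envelope arithmetic. The numbers below are illustrative assumptions (3B active parameters for an A3B model, roughly 4.5 bits/param at a Q4_K quant, ~80 GB/s for dual-channel DDR5), not measurements:

```python
# Upper bound on tokens/s when every token must stream the active
# expert weights from system RAM.
active_params = 3e9      # active parameters per token (A3B)
bits_per_param = 4.5     # rough Q4_K_M average
bandwidth_bytes_s = 80e9 # assumed system RAM bandwidth

bytes_per_token = active_params * bits_per_param / 8
upper_bound_tps = bandwidth_bytes_s / bytes_per_token
print(round(upper_bound_tps))  # -> 47 tokens/s ceiling
```

That ceiling is in the same ballpark as the 30-57 tok/s figures reported elsewhere in this thread, which is consistent with the bandwidth-bound claim.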
callmedevilthebad@reddit (OP)
Does it have vision support? Let me check.
TurnUpThe4D3D3D3@reddit
Try Qwen 3.5 9B when it comes out
gpt-oss-20b could be good as well
WarBroWar@reddit
Any idea how it compares to Mistral 3 14B?
Guilty_Rooster_6708@reddit
I have the same GPU and 32gb system ram. I use Qwen 3.5 35B A3B Q4_K_M. It’s better than gpt oss 20b from what I’ve seen so far
callmedevilthebad@reddit (OP)
Using Ollama and Open WebUI?
Guilty_Rooster_6708@reddit
Qwen3.5 9B and 4B also just came out. They are probably going to be very good for their small sizes. Qwen3 4B Thinking 2507 was really good for tool calls too, and runs really fast on the 5070 Ti.
WarBroWar@reddit
Any idea how Mistral 3 14B compares to Qwen3.5 9B?
callmedevilthebad@reddit (OP)
Just installed the 9B. Getting `500: Ollama: 500, message='Internal Server Error', url='http://localhost:11434/api/chat'`. Installed directly via Open WebUI using `ollama run hf.co/unsloth/Qwen3.5-9B-GGUF:Q8_0`.
Guilty_Rooster_6708@reddit
LM Studio back end and Open WebUI front end
Chess_pensioner@reddit
Look here: https://whatmodelscanirun.com/
sine120@reddit
Qwen3.5 35B or the 27B fit in your VRAM with the smaller Q3 quants, and both are performing really well for me. The 35B-A3B at Q4 is good with offloading. You can get a lot of context with your system. Qwen3-Coder-Next also performs really well on 16 GB VRAM / 64 GB RAM systems like mine.
TurnUpThe4D3D3D3@reddit
I would be really curious to see how 27B Q3 compares to 9B Q8
sine120@reddit
I'll be testing the 9B tonight, but the 27B has been very impressive. I just wish I could fit more context.
Tema_Art_7777@reddit
Yes, tried those combos and they work well with unified memory: 5060 Ti 16 GB + 128 GB at 64K context, though I run the Q4 variants.
Soft-Barracuda8655@reddit
There should be a pretty potent 9B coming from Qwen in a day or two. You'd be able to run that with a nice big context window.
one-wandering-mind@reddit
gpt-oss-20b
Impossible-Glass-487@reddit
Try one of the new qwen models