What's the best local model I can run with 16 GB VRAM (RTX 5070 Ti)?
Posted by callmedevilthebad@reddit | LocalLLaMA | View on Reddit | 67 comments
I want to use this for testing, but with image support. Think Playwright test cases, so it should have some coding capability to fix things if something goes off.
_-_David@reddit
Qwen3.5-35B-A3B.
callmedevilthebad@reddit (OP)
Seems like `Qwen3.5-35B-A3B` won't fit in my VRAM. Trying to find quantized ones on Ollama (I'll install using Open WebUI). Can you share any links to those? Thanks :)
c64z86@reddit
It should offload into your system RAM; it will run slower, but it will still run! Try llama.cpp and check my comments for how I got it set up if you are stuck. I'm getting 57 tokens a second on there with my 12 GB GPU.
callmedevilthebad@reddit (OP)
Curious if you tried the 3.6 A3B, and if you can share your config.
c64z86@reddit
I did, and I get 36 tokens a second on that, dropping to 29-30 when the context gets full (over 40k or so). I just loaded it up with the defaults and let llama.cpp handle it all, the only option being `-c` to set the context to 128k, so `-c 128000`.
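For anyone following along, the setup described above boils down to a single command: defaults everywhere except context size. This is a sketch, not the commenter's exact invocation; the model filename is a placeholder, so substitute whatever GGUF you downloaded.

```shell
# llama-server from the llama.cpp releases; all defaults except context.
# Replace the .gguf filename with your own download.
./llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf -c 128000
```

The server then listens on port 8080 by default, with a built-in web chat UI.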
callmedevilthebad@reddit (OP)
What's your take on 3.6 vs 3.5?
c64z86@reddit
Much better! It doesn't think as much and gets straight to the point more often. It's still not as good at one-shotting things as Gemma is, though.
callmedevilthebad@reddit (OP)
How do you recommend using it for agents vs. for coding?
c64z86@reddit
I didn't use it for agents, just pure llama.cpp chat and HTML coding.
Billah-07@reddit
How are you using the local model? Are you using it via chat or doing more with it? I have a 5070 Ti 16 GB. Chatting works, but when I run Claude Code with a local model it doesn't work. I can see logs inside LM Studio saying the context is too large. I am working in an existing project, so the context will be high.
c64z86@reddit
I'm using it via chat only, I don't even use a harness or anything else with it because I haven't figured out how to set all that up yet.
Billah-07@reddit
Yeah, that's what I am trying to figure out myself. CC requires a huge context prompt, which fails when faced with my puny 16 GB GPU.
c64z86@reddit
Ah, I don't have anything other than a good luck, sorry. When I looked that up it seemed so confusing to set up that I just noped back out of there lol.
Billah-07@reddit
I am discussing this on a different thread with a user who has a 9070 XT, and he says he can run agents smoothly. Maybe he can help us figure something out.
He's using Pi Dev / llama.cpp instead of CC.
callmedevilthebad@reddit (OP)
I was trying to avoid any extra setup
c64z86@reddit
If it helps, it's all portable and no setup required other than downloading a few things and putting them all in one folder.
callmedevilthebad@reddit (OP)
That sounds easy. Actually, let me try.
c64z86@reddit
Sure! You can grab the binaries from the llama.cpp releases page (Releases · ggml-org/llama.cpp); make sure to also download the CUDA 13 DLLs alongside them.
callmedevilthebad@reddit (OP)
Thanks man! It means a lot
c64z86@reddit
NP, have fun... and a word of caution: the 27B might still be as slow as a glacier lol. Even with llama.cpp it was 5 tokens a second on mine. You might have much more fun with the 35B!
callmedevilthebad@reddit (OP)
That's true. It took an eternity to respond. I am deleting it for now and will wait until a quantized one is released.
nikhilprasanth@reddit
How much system ram do you have?
callmedevilthebad@reddit (OP)
64 gigs
nikhilprasanth@reddit
https://pastebin.com/UPhHj1Y9
Try this too. It's a simple HTML page that tells you how many layers to offload.
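The arithmetic behind a layer-offload calculator like that page is simple enough to sketch. This is a rough illustration, not the pastebin's actual logic; the function name, the 1.5 GB overhead reserve, and the example numbers are all assumptions.

```python
# Rough estimate of how many transformer layers fit in VRAM.
# overhead_gb reserves room for KV cache, CUDA context, and activations.
def layers_on_gpu(vram_gb, model_size_gb, n_layers, overhead_gb=1.5):
    per_layer_gb = model_size_gb / n_layers   # average size of one layer
    usable_gb = max(vram_gb - overhead_gb, 0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# Illustrative example: a ~17 GB GGUF with 48 layers on a 16 GB card
print(layers_on_gpu(16, 17, 48))  # -> 40, i.e. -ngl 40
```

With the remaining layers on CPU, generation still works; it is just slower per token.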
callmedevilthebad@reddit (OP)
You could host this on GitHub.
_-_David@reddit
That's the beauty of it. I knew you didn't have enough VRAM for it. It's still the best for you.
callmedevilthebad@reddit (OP)
Are you saying Qwen3.5 27B will work? I think it is 17 GB in size.
_-_David@reddit
Downvoted? You are welcome for the correct and simple answer to your question.
callmedevilthebad@reddit (OP)
It's not me who downvoted. I am getting downvotes on my own comments lol. Reddit is crazy.
_-_David@reddit
Sorry for my short tone. I was having a stressful day. And yeah, reddit does seem weird. I've been around for a little while, but I just started posting comments very recently. The responses are WILD. Did you ever get the 35b working? I used to use qwen3-30b-a3b with CPU offloading when I was running a less expensive card, and that worked nicely for me because of the low active parameter count of that MoE model. qwen3.5 should offer the same but better. I know a lot of people on low vram are enjoying its capabilities quite a bit.
_-_David@reddit
No. 35b-a3b. I could explain why, but it really isn't necessary. Trust. Download. Use it. It is the best model for you. 100% confident.
callmedevilthebad@reddit (OP)
I ran it, very slow. Again, I think I will wait for a quantized one.
defensivedig0@reddit
Assuming you're using llama.cpp, you need to offload the experts to system RAM; offloading entire layers will be very slow. On my 5060 Ti + 32 GB RAM it runs fine, so I can only imagine it will run faster for you.
callmedevilthebad@reddit (OP)
No, I tried via Open WebUI for now. I will try to set up llama.cpp on the weekend.
InvertedVantage@reddit
The 27B will work, and the 35B-A3B will work too; just offload all the experts to the CPU.
NonStopArseGas@reddit
Any advice on which of the new Qwen models is best suited to my 8 GB 3060 Ti with 32 GB DDR4?
_-_David@reddit
The new Qwen3.5 "small" models should be out imminently; I've heard Tuesday is the top guess. I'd anticipate that a Qwen3.5 9B model would work nicely. Until they come out we can't be certain what the "small" sizes will be exactly, but I'd bet there's a solid one you'll be able to use :)
nikhilprasanth@reddit
Running Qwen3.5 35B A3B Q8 on a 5060 Ti 16 GB and DDR5 with 65K context:
```bat
set CUDA_VISIBLE_DEVICES=0 && "C:\Users\user\Desktop\llama\llama-server.exe" ^
-m "D:\Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf" ^
-a Qwen3.5-35B-A3B ^
--ctx_size 65536 ^
-ot ".ffn_.*_exps.=CPU" ^
--jinja ^
-fa on ^
-ngl 999 ^
-t 12 ^
-b 2048 ^
-ub 256 ^
--no-mmap ^
-ctk q8_0 ^
-ctv q8_0 ^
--temp 0.6 ^
--top-p 0.95 ^
--top-k 20 ^
--min-p 0.00 ^
--chat-template-kwargs "{\"enable_thinking\": false}"
```
ForsookComparison@reddit
Run with llama.cpp and use the `--n-cpu-moe` option. Try to set it so that your GPU is close to full (maybe 14 GB used?) and the rest is on CPU/system memory.
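A sketch of that suggestion as a llama.cpp command. The model filename and the value 20 are placeholders; tune the number while watching GPU memory in `nvidia-smi` or Task Manager.

```shell
# -ngl 999 puts all layers on the GPU, then --n-cpu-moe pushes the MoE
# expert tensors of the first N layers back to the CPU. Raise or lower
# N (here 20) until GPU memory sits near, but under, 16 GB.
./llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -ngl 999 \
  --n-cpu-moe 20 \
  -c 32768
```

Lower `--n-cpu-moe` values mean more of the model on the GPU and faster generation, until you run out of VRAM.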
sine120@reddit
Use the latest LM Studio or compile llama.cpp
sine120@reddit
3.5-35B IQ3_XXS for speed, or the Q3 of the 27B for intelligence. Both are performing great for me.
Billah-07@reddit
I am using LM Studio, and while the chat tok/s speed is decent, I need to use a local model with Claude Code, and I cannot make that work. Is there any other way to use local LLMs to help me with coding? I don't want to use chat and go back and forth between code and chat.
I have a 5070 Ti.
sine120@reddit
CC has a huge system prompt. Try something smaller like Pi, and switch to llama.cpp and use llama-server. I have a 9070 XT with 16GB VRAM and using llama-server and Pi I get good results. Agents in Pi feel smarter than CC.
https://pi.dev/
Billah-07@reddit
Can you share some materials on this? I have only used CC for the past few months and bought a GPU last week to try local models so I can stop paying monthly. I feel like I wasted my money and that programming with the help of a local LLM is not real 😂.
I don't have crazy big tasks; the project I am working on is in its completion stage. There are just minor tasks left, for which I don't think I need to pay for Anthropic's Opus.
sine120@reddit
It's very possible to do light coding with local models now with Qwen3.6. I can't teach you how to use agents in a Reddit comment; you'll need to do some research and play with it yourself. Download and build llama.cpp, create or borrow a script to run it with parameters tuned to your system, connect your local LLM to Pi, tell it to do something, and get a feel for how it works. LM Studio is not suitable for hosting; it's more for testing whether a model works. It uses llama.cpp as the backend, so just use the backend yourself.
I don't know your system, but if you want it 100% in VRAM, try a small (Q3, 12-14 GB) quant of Qwen3.6-27B. If you're okay with an MoE split running out of normal RAM, try Qwen3.6-35B-A3B, probably Q4 to Q6_K_XL. A larger AI model like Claude or Gemini can walk you through it, but make sure to prompt it to use the latest information and search recent developments, as its training data is very out of date.
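As a concrete sketch of the "connect your local LLM" step: llama-server exposes an OpenAI-compatible API (port 8080 by default), so any agent or client that speaks that protocol can be pointed at it. A quick smoke test with curl, assuming the server is already running locally:

```shell
# Hit llama-server's OpenAI-compatible chat endpoint. The "model" field
# is largely cosmetic here since the server hosts one loaded model.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hello"}]}'
```

Agent tools are then configured with the base URL `http://localhost:8080/v1` in place of a cloud provider's endpoint.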
TurnUpThe4D3D3D3@reddit
Q3 seems precarious. I would be worried about inaccuracies at that quant. How’s it working for you?
sine120@reddit
Very well, so far. On a 9070 XT I get roughly 30 tok/s on the dense model and 100 tok/s on the MoE, and it's very coherent. My basic coding tasks were no issue. I wish I could fit the Q4_K_XL quants, but for the speed, the IQ3 is very usable.
TurnUpThe4D3D3D3@reddit
Won’t fit in VRAM
Eternal_Ohm@reddit
Doesn't matter; it's a MoE model, so it doesn't have to fit entirely within VRAM: the expert weights can be offloaded to system RAM.
TurnUpThe4D3D3D3@reddit
It still has to read the active experts' weights on every token, so your main bottleneck will be memory bandwidth, not compute.
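The bandwidth argument can be checked with back-of-envelope arithmetic. The numbers below are illustrative assumptions (3B active parameters for an A3B model, roughly 4.5 bits/param at a Q4_K quant, ~80 GB/s for dual-channel DDR5), not measurements:

```python
# Upper bound on tokens/s when every token must stream the active
# expert weights from system RAM.
active_params = 3e9      # active parameters per token (A3B)
bits_per_param = 4.5     # rough Q4_K_M average
bandwidth_bytes_s = 80e9 # assumed system RAM bandwidth

bytes_per_token = active_params * bits_per_param / 8
upper_bound_tps = bandwidth_bytes_s / bytes_per_token
print(round(upper_bound_tps))  # -> 47 tokens/s ceiling
```

That ceiling is in the same ballpark as the 30-57 tok/s figures reported elsewhere in this thread, which is consistent with the bandwidth-bound claim.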
callmedevilthebad@reddit (OP)
Does it have vision support? Let me check.
TurnUpThe4D3D3D3@reddit
Try Qwen 3.5 9B when it comes out
gpt-oss-20b could be good as well
WarBroWar@reddit
Any idea how it compares to Mistral 3 14B?
Guilty_Rooster_6708@reddit
I have the same GPU and 32gb system ram. I use Qwen 3.5 35B A3B Q4_K_M. It’s better than gpt oss 20b from what I’ve seen so far
callmedevilthebad@reddit (OP)
Using Ollama and Open WebUI?
Guilty_Rooster_6708@reddit
Qwen3.5 9B and 4B also just came out. They are probably going to be very good for their small sizes. Qwen3 4B Thinking 2507 was really good for tool calls too, and runs really fast on the 5070 Ti.
WarBroWar@reddit
Any idea how Mistral 3 14B compares to Qwen3.5 9B?
callmedevilthebad@reddit (OP)
Just installed the 9B. Getting `500: Ollama: 500, message='Internal Server Error', url='http://localhost:11434/api/chat'`. Installed directly via Open WebUI using `ollama run hf.co/unsloth/Qwen3.5-9B-GGUF:Q8_0`.
Guilty_Rooster_6708@reddit
LM Studio back end and Open WebUI front end
Chess_pensioner@reddit
Look here: https://whatmodelscanirun.com/
sine120@reddit
Qwen3.5 35B or the 27B fit in your VRAM with the smaller Q3 quants, and both are performing really well for me. The 35B-A3B at Q4 is good with offloading. You can get a lot of context with your system. Qwen3-Coder-Next also performs really well on 16 GB VRAM / 64 GB RAM systems like mine.
TurnUpThe4D3D3D3@reddit
I would be really curious to see how 27B Q3 compares to 9B Q8
sine120@reddit
I'll be testing the 9B tonight, but the 27B has been very impressive. I just wish I could fit more context.
Tema_Art_7777@reddit
Yes, tried those combos and they work well with unified memory: 5060 Ti 16 GB + 128 GB at 64K context, though I run the Q4 variants.
Soft-Barracuda8655@reddit
There should be a pretty potent 9B coming from Qwen in a day or two. You'd be able to run that with a nice big context window.
one-wandering-mind@reddit
gpt-oss-20b
Impossible-Glass-487@reddit
Try one of the new qwen models