Best Models for 16GB VRAM
Posted by LinuxIsFree@reddit | LocalLLaMA | View on Reddit | 36 comments
Swiped up an RX 9070 from Newegg since it's below MSRP today. Primarily interested in gaming, hence the 9070 over the 5070 at a similar price. However, I'd like to dip my toes further into AI, and since I'm doubling my VRAM from 8GB to 16GB, I'm curious:
**What are the best productivity, coding, and storywriting AI models I can run reasonably with 16GB VRAM?**
The last similar post I found with Google was about 10 months old, and I figured things may have changed since then?
ItilityMSP@reddit
Most tools are geared to Nvidia (CUDA) right now. AMD can work but will require more tweaking and troubleshooting. I would return it and get a 5060 Ti 16GB; you can game and play with LLMs with that setup. I'd love to support AMD, but the LLM playground is rough on it right now.
LinuxIsFree@reddit (OP)
In the past, I've had no issue except for it being slower. I'll sooner pass on the AI than return it, since the 9070 also performs slightly better.
ItilityMSP@reddit
Specifically, I'm talking about Unsloth and the ability to do reinforcement learning at FP8; this only works on Nvidia's 50-series Blackwell chips (like the 5060). Not sure why people downvote so hard, I'm giving real advice here. Reinforcement learning will let you do incredibly specific things/domains with smaller models. This wasn't feasible until last week and would have required renting cloud time.
ttkciar@reddit
Ignore the Nvidia fanboyism. AMD GPUs just work with llama.cpp's Vulkan back-end; no need for ROCm.
For models, you have a few options:
Qwen3-Coder-REAP-25B-A3B won't fit entirely in your VRAM, so you will need to partially offload to CPU, but with only 3B active parameters it will still be quite zippy.
GPT-OSS-20B might fit in VRAM if quantized hard enough, but anything smaller than Q4_K_M tends to be somewhat brain-damaged. Fiddle with different quants.
Qwen2.5-Coder-14B is a bit old, but still quite good, and will fit in your VRAM no problem.
For all non-coding tasks, I strongly recommend Tiger-Gemma-12B-v3.
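As a rough sanity check before downloading any of these, you can estimate a quant's weight footprint from parameter count times bits per weight (a back-of-envelope sketch; the bits-per-weight figures are approximate averages, and real GGUF files vary a bit because different tensors get different quant types):

```python
def quant_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight footprint in decimal GB: params * bits / 8."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Q4_K_M averages roughly 4.8 bits/weight; Q8_0 about 8.5.
for name, params, bpw in [
    ("25B MoE @ Q4_K_M", 25, 4.8),   # ~15.0 GB: too tight, hence partial offload
    ("14B dense @ Q4_K_M", 14, 4.8), # ~8.4 GB: fits with room for context
    ("12B dense @ Q8_0", 12, 8.5),   # ~12.8 GB: fits
]:
    print(f"{name}: ~{quant_size_gb(params, bpw):.1f} GB weights")
```

That's weights only; the KV cache and compute buffers need another GB or more depending on context length, which is why the 25B model above needs partial CPU offload on a 16GB card.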
Background_Praline18@reddit
I ran qwen3-coder on both Nvidia and ROCm. I think ROCm is not as fast, but it will work if you have an AMD card; if not, try your luck with Vulkan. So far Nvidia setups seem easier, and you can do nifty things like use the NPU to run tools.
luncheroo@reddit
Do you find Tiger to be better all around than the QAT versions?
ttkciar@reddit
I have not tried the QAT versions. The main advantages of Tiger over the original Gemma3 are:
Follows system prompts: (I know Google's documentation states that Gemma3 does not support a system prompt, but in practice it responds very well to system prompts, and the only way I could make Medgemma useful was with a system prompt telling it it was advising a doctor.)
Mostly uncensored: TheDrummer's fine-tune took down most of its guardrails, though not all of them. It hasn't refused any of my prompts for a long time.
More creative: Sometimes I use Big-Tiger-Gemma-27B-v3 for inferring Murderbot Diaries fan-fic, or other sci-fi. Compared to Gemma3 it is slightly more creative and better able to develop characters. Some of this might be the anti-sycophancy; I have noticed that its fight scenes are a lot more brutal and explicit than the original Gemma3.
One thing I have not tested is Tiger's vision capabilities, and TheDrummer stated that he hasn't tested it for vision, either. So it might or might not be as good as the original Gemma3 at vision tasks.
luncheroo@reddit
Thanks, I'm going to give it a try. Gemma does great for my use case, but I'm still sorting out whether Q8 12B or Q4 27B is the best fit for me. I'll give TheDrummer's version a go.
LinuxIsFree@reddit (OP)
Thank you for the detailed suggestions!
offdagrid774_@reddit
I have both a 5060 Ti 16GB and 9070 XT in different machines. The former was a bit easier to set up my development environment for, but both worked fine for inference. You’ll be fine!
Flaurentiu26@reddit
What are the best models for a 5060 Ti 16GB? I own one, and there's a big difference between gpt-oss:20b (~100 tokens/s) and other models, for example mistral-small:24b (~30 tokens/s).
Hamilton-Io@reddit
Hey, I got a 7800 XT and it gives about 130 tok/s with flash attention and 120 tok/s without. I recommend Qwen3 32B with some system RAM. It's much better than GPT-OSS 20B.
Flaurentiu26@reddit
Do you use lm-studio, ollama or just llama.cpp ?
Hamilton-Io@reddit
Lmstudio on Linux with vulkan as backend.
pmttyji@reddit
GPT-OSS-20B, Qwen3-30B MOEs(Q4), Ling/Ring mini models, Ernie(Q5), Granite4-Small(Q4), Gemma3-12B, Qwen3-14B, Mistral 24B models(Q5).
Quants are mentioned in parentheses for some models (you could still use higher quants by offloading to system RAM). For the other models you could go with Q8.
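For the offloading idea above, a rough way to pick llama.cpp's `--n-gpu-layers` value is to see what fraction of the weights fits in your VRAM budget (a hypothetical back-of-envelope sketch; actual per-layer sizes vary, and embedding/output tensors aren't split evenly):

```python
def gpu_layers(model_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    """Estimate how many transformer layers fit on the GPU,
    assuming layers are roughly equal in size."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int(vram_budget_gb / per_layer_gb))

# e.g. a ~17 GB Q5 file with 40 layers, leaving ~2 GB of a 16 GB card
# free for context and compute buffers:
print(gpu_layers(17.0, 40, 14.0))  # -> 32
```

Start around that value and nudge it up or down based on actual VRAM usage reported by the runtime.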
AppearanceHeavy6724@reddit
I do not think Mistral 24b fits 16 GiB at Q5.
pmttyji@reddit
Yeah, tight one. Q5_K_S will fit. Still need room for context, though.
AppearanceHeavy6724@reddit
Yeah, kinda fits but not useful.
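The context overhead being traded off here can be estimated from the usual KV cache formula, 2 × layers × KV heads × head dim × context length × bytes per element (a sketch assuming roughly Mistral-Small-like dimensions; the exact config numbers are illustrative):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in decimal GB (both K and V, hence the leading 2)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# ~40 layers, 8 KV heads (GQA), head_dim 128, 8K context, FP16:
print(f"{kv_cache_gb(40, 8, 128, 8192):.2f} GB")  # -> 1.34 GB
```

So on top of a ~16 GB Q5_K_S file, even a modest 8K context pushes past a 16 GB card, which is why it "kinda fits but not useful".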
pmttyji@reddit
Updated
Emergency_Rush_9941@reddit
AMD GPUs don't support AI models as well as Nvidia. It would be much better to go with the 5070 so you can enjoy both worlds, gaming and AI.
Ololoshkaaaa@reddit
I have LM Studio, 2x 5060TI 16GB, Which model would you recommend?
Tai9ch@reddit
Qwen3 30b-A3b at Q4 with llama.cpp; it's really good, and there's a VL version to play with too.
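The reason a 30B MoE like this feels fast is that decode speed is roughly memory bandwidth divided by the bytes of weights read per token, and with only ~3B active parameters that per-token read is small (a rough sketch; the assumed bandwidth and bits-per-weight figures are illustrative, and real throughput lands below this ceiling):

```python
def decode_tps(active_params_b: float, bits_per_weight: float,
               bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/s: bandwidth / bytes of weights read per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# ~3B active params at Q4 (~4.8 bits/weight) on a ~450 GB/s card,
# compared with a dense 30B at the same quant:
print(f"MoE:   ~{decode_tps(3, 4.8, 450):.0f} tok/s ceiling")   # -> ~250
print(f"Dense: ~{decode_tps(30, 4.8, 450):.0f} tok/s ceiling")  # -> ~25
```

Same total parameter count, roughly 10x the decode ceiling, which is why MoE models keep coming up in this thread for 16GB cards.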
Dreamthemers@reddit
GPT-OSS 20B/120B.
Potential-Emu-8530@reddit
How do models like this compare to high end cloud ones like gpt 5.1 or sonnet 4.5
Different-Set-1031@reddit
What’re your thoughts on this model vs Qwen3 VL or Ariel?
JLeonsarmiento@reddit
Yes, lots of options, some with vision also. Qwen3 8b fine tunes are super.
Dreamthemers@reddit
If vision capabilities are needed, then Qwen 3 VL is good alternative. GPT-OSS doesn’t have it.
AppearanceHeavy6724@reddit
Story writing: Gemma 3 12B and its finetunes, Mistral Nemo and its finetunes.
Long_comment_san@reddit
I have 12gb VRAM and can highly recommend Mistral models. You should be able to run a Q4 with 80%+ of the model loaded which is really not bad. And yeah, search for MOE.
usernameplshere@reddit
Phi 4 Reasoning (Plus) Q6, GPT-OSS 20B MXFP4, Qwen3 30B A3B VL Q4 (if you've got a decent CPU+RAM for offloading). You could also try Gemma 3 27B QAT, but I'm unsure if it fits in 16GB; it's a great dense model with vision, though. I would dodge Gemma 3 12B; even at Q8 it's super ass in my experience.
Salt_Discussion8043@reddit
Baby Qwens
grabber4321@reddit
GPT-OSS:20B - it fits without any tweaks.
iron_coffin@reddit
LinuxIsFree@reddit (OP)
Thanks! Found this by searching that way https://www.reddit.com/r/LocalLLaMA/s/sfsK7XN4nC
I forgot Reddit has search built in that actually works sometimes
iron_coffin@reddit
That being said, gpt-oss:20b and a qwen3 that fits