Best models to run with 8GB VRAM, 16GB RAM
Posted by Qxz3@reddit | LocalLLaMA | View on Reddit | 37 comments
Been experimenting with local LLMs on my gaming laptop (4070 8GB, 16GB of RAM). My use cases have been coding and creative writing. Models that work well and that I like:
Gemma 3 12B - low quantization (IQ3_XS), 100% offloaded to GPU, spilling into RAM. ~10t/s. Great at following instructions and general knowledge. This is the sweet spot and my main model.
Gemma 3 4B - full quantization (Q8), 100% offloaded to GPU, minimal spill. ~30-40t/s. Still smart and competent but more limited knowledge. This is an amazing model at this performance level.
MN GRAND Gutenburg Lyra4 Lyra 23.5B, medium quant (Q4) (lower quants are just too wonky) about 50% offloaded to GPU, 2-3t/s. When quality of prose and writing a captivating story matters. Tends to break down so needs some supervision, but it's in another league entirely - Gemma 3 just cannot write like this whatsoever (although Gemma follows instructions more closely). Great companion for creative writing. 12B version of this is way faster (100% GPU, 15t/s) and still strong stylistically, although its stories aren't nearly as engaging so I tend to be patient and wait for the 23.5B.
I was disappointed with:
Llama 3.1 8B - runs fast, but responses are short, superficial and uninteresting compared with Gemma 3 4B.
Mistral Small 3.1 - can barely run on my machine, and given the extreme slowness, I wasn't impressed with the responses. I'd rather run Gemma 3 27B instead.
I wish I could run:
QWQ 32B - doesn't do well at the lower quants that would allow it to run on my system, just too slow.
Gemma 3 27B - it runs but the jump in quality compared to 12B hasn't been worth going down to 2t/s.
pmttyji@reddit
Hey u/Qxz3, it's been 3 months since your post. What tiny/small models are you using right now? Please recommend and share your findings (my laptop also has 8GB VRAM). Thanks
Qxz3@reddit (OP)
Still mostly Gemma 3 12B QAT. Qwen3 14B makes for a nice change from time to time but tends to write a lot of AI slop like bad analogies and pointless verbiage. Haven't found something better than Gemma 3 yet.
pmttyji@reddit
Thanks. After seeing 30+ t/s with tiny models (below 10B, e.g. Gemma 3 4B), I couldn't go back to small models (above 10B, e.g. Qwen3 14B), which give only ~5-10 t/s.
Please recommend some tiny models (below 10B) you're using right now. Have you tried MiniCPM4?
Currently I'm looking for models for things like content creation, a YouTube channel, etc.
Thanks again
Qxz3@reddit (OP)
Josiefied Qwen3 8B is nice, the regular model probably is too. I don't use very small models too much though so I'm probably not the best person to ask.
pmttyji@reddit
Thanks for this one
Vermicelli_Junior@reddit
Try Gemma 27B Q2_K_L (10.84 GB); it's a lot better than the 12B models :)
ShineNo147@reddit
I would try Mistral-Small-24B. It really does feel GPT-4 quality despite only needing around 12GB of RAM to run, so it's a good default model if you want to leave space to run other apps.
Right-Law1817@reddit
Hi OP, I've downloaded the MN GRAND Gutenburg Lyra4 Lyra 23.5B Q4_K_M version and, following the instructions in the repo's readme, I applied this template using Ollama:
TEMPLATE """{
"name": "Alpaca",
"inference_params": {
"input_prefix": "### Instruction:",
"input_suffix": "### Response:",
"antiprompt": [
"### Instruction:"
],
"pre_prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
}
}
"""
But it always returns output like this:
Can you tell me how you made it work?
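For context, my guess is that Ollama wants the Alpaca format as a Go-style TEMPLATE in a Modelfile rather than this JSON preset (which looks like an LM Studio preset). Something like the rough sketch below is what I'd try next; the GGUF filename and the exact system prompt wording are just my assumptions, and I haven't verified it against this model:
```
# Hypothetical Modelfile sketch: the Alpaca preset above translated into
# Ollama's own template syntax. Filename and wording are assumptions.
FROM ./MN-GRAND-Gutenburg-Lyra4-Lyra-23.5B-Q4_K_M.gguf

SYSTEM """Below is an instruction that describes a task. Write a response that appropriately completes the request."""

TEMPLATE """{{ if .System }}{{ .System }}

{{ end }}### Instruction:
{{ .Prompt }}

### Response:
"""

# Mirror the preset's "antiprompt" so generation stops before starting a new turn.
PARAMETER stop "### Instruction:"
```
Built with something like `ollama create gutenburg-23b -f Modelfile` and then `ollama run gutenburg-23b`.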
cibernox@reddit
I have 12GB of VRAM, but still today my go-to model remains Qwen 2.5 14B. It's fast enough and good enough. I tried R1 many times and I still don't feel the improvement is worth how much slower the reasoning makes it.
Uncle___Marty@reddit
If you want to try QwQ out, then try running it in LM Studio with flash attention on and the KV cache at FP8 on a low quant. It'll reduce the memory needed quite a bit. Might even be semi-usable.
Feztopia@reddit
In my short comparison, Yuma42/Llama3.1-LoyalLizard-8B was better than Gemma 3 4B it.
But the 4B one does come close to the 7-8B models, which is nice. I had a good experience with Gemma 2 9B it quality-wise, so I'm not surprised that its successor, the 3 12B, works well for you, but even 9B was so slow for me that I didn't try the 12B.
SnooSketches1848@reddit
You can mix these two: take the reasoning from DeepSeek and pass it to Qwen so you can get a better result.
And both can run simultaneously on the same machine with 8GB of RAM.
Qxz3@reddit (OP)
That's interesting, how do you set that up?
edg3za@reddit
Take a look at LM Studio (I use it on Windows); you can search the models list easily. It should be easy enough to use.
SnooSketches1848@reddit
I meant how to do this process, i.e. taking the content from the think step and passing it to a non-reasoning model.
SnooSketches1848@reddit
I do this in code, actually. Not sure if any GUI app does this.
You make the first request to `deepseek-r1:1.5b` with a stop sequence like `</think>`, so the call stops once the thinking is done. Then you pass that thinking to the main model, `qwen2.5-coder:7b`, as a message.
In Ollama I just copy-paste; I don't know a better way.
```
{"model":"deepseek-ai/DeepSeek-R1","messages":[{"role":"user","content":"Hey how are you?"}],"stream":true,"stream_options":{"include_usage":true,"continuous_usage_stats":true},"stop":[""]}
```
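A minimal sketch of that two-step flow in Python, hitting Ollama's OpenAI-compatible `/v1/chat/completions` endpoint; the model tags, the localhost URL, and the `</think>` stop tag are assumptions based on the comment above, so adjust them to whatever you actually run:
```
# Rough sketch only: pass DeepSeek-R1's thinking to a non-reasoning model.
# The endpoint, model tags, and the </think> stop tag are assumptions.
import requests

URL = "http://localhost:11434/v1/chat/completions"  # Ollama's OpenAI-compatible API

def chat(model, messages, stop=None):
    payload = {"model": model, "messages": messages}
    if stop:
        payload["stop"] = stop
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

question = "Write a function that merges two sorted lists."

# 1) Let the reasoning model think, but stop as soon as the thinking block closes.
thinking = chat("deepseek-r1:1.5b",
                [{"role": "user", "content": question}],
                stop=["</think>"])

# 2) Hand the reasoning to the main model as extra context.
answer = chat("qwen2.5-coder:7b",
              [{"role": "user",
                "content": question + "\n\nSome working notes to build on:\n" + thinking}])

print(answer)
```
Same idea as the payload above, just with the hand-off automated instead of copy-pasting between calls.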
Lentaum@reddit
You can check it yourself in https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
The most intelligent model with 12B is https://huggingface.co/yamatazen/EtherealAurora-12B
terminoid_@reddit
i would use gemma3 12B jailbroken or finetuned any day of the week over that
Yes_but_I_think@reddit
Never run a model below Q4_K if you want reliability. Prefer Q6_K for longer context. If you can't fit it, you can't fit it; nothing to feel bad about. Move on.
Qxz3@reddit (OP)
I tried relying more on Gemma 3 4B Q8 but the deciding factor for me was when it got very confused when I asked about browser-compatible import syntax in JavaScript (import maps etc.). It would give straight up non-working, nonsensical code. Gemma 3 12B, even at 3-bit quantization, got that right. In general it just seems to provide more informative answers with less "fluff". Neither are that great for coding though, that's for sure - still looking for a better solution on that front.
Yes_but_I_think@reddit
Use coding specific models. Local models are not multitaskers.
Aaaaaaaaaeeeee@reddit
https://github.com/Infini-AI-Lab/UMbreLLa
This project lets people run larger models with the weights held in RAM; try it and see if you get much faster 32B speeds!
Frankie_T9000@reddit
So does LM Studio? Is there a reason to use this instead?
Aaaaaaaaaeeeee@reddit
The speculative decoding technique relies on this paper, which is much faster: https://arxiv.org/abs/2402.12374
LM Studio does not have that technology, only the naive version in llama.cpp.
My guess as to why: PCIe is better utilized, and the compute is asymmetrical with speculative decoding on the GPU.
They have an OpenAI-compatible API.
Frankie_T9000@reddit
Thanks for clearing it up!
timedacorn369@reddit
What's this? How does it work? Never heard of it.
Aaaaaaaaaeeeee@reddit
Here are descriptions from the maintainer and some benchmarks from users: https://old.reddit.com/r/LocalLLaMA/comments/1i28pfq/umbrella_llama3370b_int4_on_rtx_4070ti_achieving/
AnduriII@reddit
Qwen2.5 is amazing for this limited space
My_Unbiased_Opinion@reddit
Personally, I would go with Qwen 2.5 7B. Try Q4_K_M and the KV cache at Q8, and fill the rest up with context. The 7B model is surprisingly robust.
HumbleTech905@reddit
+1 Qwen2.5 7b
Electrical_Cut158@reddit
Following
lmvg@reddit
What is too slow for you? Mine is 4-5 t/s with Q4, which for a thinking model I agree is too slow, but for non-thinking models it's acceptable, I think?
Qxz3@reddit (OP)
Ya the issue is that it's a thinking model so getting an answer at that speed will take forever. Also at the quants I can run it at, it tends to get stuck in loops it seems.
AppearanceHeavy6724@reddit
If you like QwQ 32B, you may want to try Qwen-2.5-32b-vl. The VL version is a better storyteller than regular Qwen but a slightly worse coder.
Elegant-Ad3211@reddit
For me, Phi-4 Q2 was better at coding than Gemma 3, on a MacBook Pro M2 16GB (10GB VRAM).
R1 was also not bad.
Thanks for this comparison, man!
AppearanceHeavy6724@reddit
I tried Gemma 3 12B IQ4_XS and, while it was decent at fiction, it was very bad at coding. Then I tried Nemo at IQ4 and it was as bad at coding as Gemma 3 IQ4_XS, and generally dumb. Then I switched to my normal workhorse, Nemo Q4_K_M, and it was noticeably better at coding than both of the previous two. Moral of the story: IQ3_XS should not be used; it's almost certainly brain-damaged, especially at 12B size.
Will try Lyra4, thanks.
superNova-best@reddit
Have you given Distilled R1 a try? It could solve those issues. Also, consider trying some community-distilled models—maybe flavors of 3.1 8B. The 8B is really good, to be honest; it just needs tuning. There’s also Phi-4—you should give it a try. It worked really well in a role-playing game of mine; it mimicked the characters perfectly.