Best models to run with 8GB VRAM, 16GB RAM

Posted by Qxz3@reddit | LocalLLaMA

Been experimenting with local LLMs on my gaming laptop (4070 8GB, 16GB of RAM). My use cases have been coding and creative writing. Models that work well and that I like:

Gemma 3 12B - low quantization (IQ3_XS), 100% offloaded to GPU, spilling into RAM. ~10 t/s. Great at following instructions and general knowledge. This is the sweet spot and my main model.

Gemma 3 4B - high quant (Q8), 100% offloaded to GPU, minimal spill. ~30-40 t/s. Still smart and competent but with more limited knowledge. This is an amazing model at this performance level.
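
If you want to reproduce the full-offload setups above, they boil down to the same llama-cpp-python call; only the GGUF file changes. A minimal sketch, assuming a CUDA build of llama-cpp-python (the filenames and context size are placeholders, not the exact files I used):

```python
# pip install llama-cpp-python (built with CUDA support)
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-12b-it-IQ3_XS.gguf",  # placeholder; swap in a Q8 4B file for the 4B setup
    n_gpu_layers=-1,  # -1 = offload every layer to the GPU ("100% offloaded")
    n_ctx=4096,       # the KV cache for the context is what spills into RAM on an 8GB card
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write the opening paragraph of a mystery novel."}]
)
print(out["choices"][0]["message"]["content"])
```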

MN GRAND Gutenberg Lyra4 Lyra 23.5B - medium quant (Q4; lower quants are just too wonky), about 50% offloaded to GPU, 2-3 t/s. For when quality of prose and a captivating story matter. It tends to break down, so it needs some supervision, but it's in another league entirely - Gemma 3 just cannot write like this whatsoever (although Gemma follows instructions more closely). Great companion for creative writing. The 12B version is way faster (100% GPU, 15 t/s) and still strong stylistically, but its stories aren't nearly as engaging, so I tend to be patient and wait for the 23.5B.
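
The partial offload for the 23.5B is the same call with n_gpu_layers set to roughly half the model's layers. A sketch, with the layer count as a starting guess you tune against your VRAM (the filename is again a placeholder):

```python
from llama_cpp import Llama

# Partial offload: put roughly half the layers on the 8GB card, keep the rest in system RAM.
# 32 is a starting guess, not a measured value -- raise it until VRAM runs out, then back off.
llm = Llama(
    model_path="MN-GRAND-Gutenberg-Lyra4-Lyra-23.5B-Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=32,
    n_ctx=4096,
)
```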

I was disappointed with:

Llama 3.1 8B - runs fast, but its responses are short, superficial, and uninteresting compared to Gemma 3 4B.

Mistral Small 3.1 - barely runs on my machine, and given how extremely slow it is, I wasn't impressed with the responses. I'd rather run Gemma 3 27B instead.

I wish I could run:

QwQ 32B - doesn't do well at the lower quants that would let it run on my system, and even then it's just too slow.
Gemma 3 27B - it runs, but the jump in quality over 12B hasn't been worth dropping to 2 t/s.
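
For a rough sense of why the 27B/32B class doesn't fit on 8GB: weight size is roughly parameter count times bits-per-weight divided by 8, plus KV cache and buffers on top. A back-of-the-envelope sketch (the overhead figure is a guess, not a measurement):

```python
def approx_footprint_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
    """Very rough GGUF footprint in GB: quantized weights plus a fudge factor for KV cache/buffers."""
    return params_b * bits_per_weight / 8 + overhead_gb

print(approx_footprint_gb(32, 4.5))  # QwQ 32B at ~Q4: ~19.5 GB -> far past 8GB VRAM, most layers hit RAM
print(approx_footprint_gb(12, 3.3))  # Gemma 3 12B at ~IQ3_XS: ~6.5 GB -> mostly fits on the GPU
```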