RTX 5070 Ti Laptop (12GB VRAM) + 64GB RAM — best local LLM recommendations?
Posted by AgentFlashAlive@reddit | LocalLLaMA | 7 comments
Hey everyone!
I recently picked up a new laptop: Ryzen 9 9955HX, RTX 5070 Ti with 12GB GDDR7, 64GB DDR5 RAM, and a pair of 2TB PCIe Gen4 SSDs, running Windows 11. On paper it feels like a solid local LLM machine, but I'm not getting the most out of it yet.
I've been running things through LM Studio and currently using Hermes, but honestly I'm not that pleased with the performance and I feel like this hardware deserves better. Looking to see what others with similar setups are actually running in 2026.
Mainly I care about two use cases: coding (Python and R, mostly research workflows) and reasoning/thinking tasks like analysis, summarization, and long-form writing. I'm happy to keep everything fully in VRAM for speed, but I'm also open to offloading larger models into system RAM if the quality jump is worth the slower tokens.
Would love to hear what models and quantization formats you'd actually recommend for this setup.
Thanks in advance!
Jemito2A@reddit
5070 Ti owner here (16GB version), running LLMs 24/7 for months. Some real-world notes:
Everyone's recommending models — I'll focus on the stuff nobody tells you about the 5070 Ti specifically:
Model picks for 12GB: +1 for qwen2.5-coder:14b for Python (Q4_K_M fits). For reasoning, qwen3.5:9b over Hermes — massive quality jump. I'd also try gemma4:e4b as others mentioned, but heads up: it requires think: true in the API or you get empty responses, and set num_predict: 2048+ because the thinking tokens eat your budget.
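For reference, those two settings can be passed per request to Ollama's /api/generate endpoint. A minimal sketch, with the caveats that the build_generate_payload helper is made up for illustration, the model names are the ones from this thread, and whether a given model honors think is model-dependent:

```python
import json

def build_generate_payload(model: str, prompt: str) -> dict:
    """Hypothetical helper: assembles an Ollama /api/generate request body
    with the thinking and token-budget settings described above."""
    return {
        "model": model,
        "prompt": prompt,
        "think": True,  # without this, some thinking models return empty text
        "options": {
            "num_predict": 2048,  # thinking tokens count against this budget
        },
        "stream": False,
    }

payload = build_generate_payload("gemma4:e4b", "Summarize the tradeoffs of Q4_K_M quantization.")
print(json.dumps(payload, indent=2))
# Send it with, e.g.: requests.post("http://localhost:11434/api/generate", json=payload)
```

The same options block works from LM Studio's OpenAI-compatible API too, though field names differ there (max_tokens instead of num_predict).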
What nobody mentions about the 5070 Ti:
- Cap the power (nvidia-smi -pl 200) — sustained LLM inference pushes these cards hard. Mine was crashing with TDR errors until I power-capped it
- Ollama keeps models in VRAM for 5 min after last use. With 12GB, switching between two models = OOM crash. Use keep_alive: "30s" in your API calls
- Set num_ctx: 32768 via a custom Modelfile — the default 4K context is useless for real code work
- Skip offloading to RAM despite the 64GB. A 9B model fully in VRAM at 80 tok/s beats a 30B model half-offloaded at 8 tok/s for almost everything
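To make the keep_alive and num_ctx tips concrete: both are standard Ollama request fields and can be set per request on /api/chat, so you don't have to bake num_ctx into a Modelfile if you'd rather not. A sketch under that assumption (the helper name is invented for illustration):

```python
import json

def build_chat_payload(model: str, messages: list) -> dict:
    """Hypothetical helper: per-request versions of the Modelfile and
    keep_alive tweaks listed above, for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": messages,
        "keep_alive": "30s",  # evict from VRAM quickly when swapping models on 12GB
        "options": {
            "num_ctx": 32768,  # raise the default 4K context for real code work
        },
        "stream": False,
    }

payload = build_chat_payload(
    "qwen2.5-coder:14b",
    [{"role": "user", "content": "Refactor this pandas pipeline for readability."}],
)
print(json.dumps(payload, indent=2))
# Send it with, e.g.: requests.post("http://localhost:11434/api/chat", json=payload)
```

Per-request options win over Modelfile defaults, which is handy when you only need the big context for coding sessions and want the smaller footprint the rest of the time.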
LM Studio is fine for testing, but switch to Ollama for anything serious — ollama ps, API access, and Modelfiles give you the control you'll need.
andy2na@reddit
Ollama is fine for newcomers, but it's probably at the bottom of the list otherwise. Highly recommend llama.cpp (with llama-swap) or vLLM for best performance and anything serious.
mladenConcept@reddit
Really useful comment.
A lot of people just recommend models, but the real-world 5070 Ti behavior under actual daily LLM use is the stuff people usually don’t mention.
DjuricX@reddit
DeepSeek-V3.2-Coder (14B)
arman-d0e@reddit
Uhh this is a nonexistent model…
ilintar@reddit
Yeah, the new Gemma MoE will probably be the best fit; you can try offloading the KV cache to RAM.
Late_Night_AI@reddit
I'd probably go gemma 4 26B and qwen coder next. I'm aware qwen coder next won't fit on your GPU, but it's a MoE and shockingly fast even when only partially GPU-loaded.