RTX 5070 Ti Laptop (12GB VRAM) + 64GB RAM — best local LLM recommendations?
Posted by AgentFlashAlive@reddit | LocalLLaMA | 7 comments
Hey everyone!
I recently picked up a new laptop: Ryzen 9 9955HX, RTX 5070 Ti with 12GB GDDR7, 64GB DDR5 RAM, and a pair of 2TB PCIe Gen4 SSDs, running Windows 11. On paper it feels like a solid local LLM machine, but I'm not getting the most out of it yet.
I've been running things through LM Studio and currently using Hermes, but honestly I'm not that pleased with the performance and I feel like this hardware deserves better. Looking to see what others with similar setups are actually running in 2026.
Mainly I care about two use cases: coding (Python and R, mostly research workflows) and reasoning/thinking tasks like analysis, summarization, and long-form writing. I'm happy to keep everything fully in VRAM for speed, but I'm also open to offloading larger models into system RAM if the quality jump is worth the slower tokens.
Would love to hear what models and quantization formats you'd actually recommend for this setup.
Thanks in advance!
Jemito2A@reddit
5070 Ti owner here (16GB version), running LLMs 24/7 for months. Some real-world notes:
Everyone's recommending models — I'll focus on the stuff nobody tells you about the 5070 Ti specifically:
Model picks for 12GB: +1 for qwen2.5-coder:14b for Python (Q4_K_M fits). For reasoning, qwen3.5:9b over Hermes — massive quality jump. I'd also try gemma4:e4b as others mentioned, but heads up: it requires think: true in the API or you get empty responses, and set num_predict: 2048+ because the thinking tokens eat your budget.
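For reference, those two settings can be passed per request to Ollama's /api/generate endpoint. A minimal sketch, with the caveats that the build_generate_payload helper is made up for illustration, the model names are the ones from this thread, and whether a given model honors think is model-dependent:

```python
import json

def build_generate_payload(model: str, prompt: str) -> dict:
    """Hypothetical helper: assembles an Ollama /api/generate request body
    with the thinking and token-budget settings described above."""
    return {
        "model": model,
        "prompt": prompt,
        "think": True,  # without this, some thinking models return empty text
        "options": {
            "num_predict": 2048,  # thinking tokens count against this budget
        },
        "stream": False,
    }

payload = build_generate_payload("gemma4:e4b", "Summarize the tradeoffs of Q4_K_M quantization.")
print(json.dumps(payload, indent=2))
# Send it with, e.g.: requests.post("http://localhost:11434/api/generate", json=payload)
```

The same options block works from LM Studio's OpenAI-compatible API too, though field names differ there (max_tokens instead of num_predict).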
What nobody mentions about the 5070 Ti:
- Cap the power (nvidia-smi -pl 200) — sustained LLM inference pushes these cards hard. Mine was crashing with TDR errors until I power-capped it
- Ollama keeps models in VRAM for 5 min after last use. With 12GB, switching between two models = OOM crash. Use keep_alive: "30s" in your API calls
- Set num_ctx: 32768 via a custom Modelfile — the default 4K context is useless for real code work
- Skip offloading to RAM despite the 64GB. A 9B model fully in VRAM at 80 tok/s beats a 30B model half-offloaded at 8 tok/s for almost everything
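To make the keep_alive and num_ctx tips concrete: both are standard Ollama request fields and can be set per request on /api/chat, so you don't have to bake num_ctx into a Modelfile if you'd rather not. A sketch under that assumption (the helper name is invented for illustration):

```python
import json

def build_chat_payload(model: str, messages: list) -> dict:
    """Hypothetical helper: per-request versions of the Modelfile and
    keep_alive tweaks listed above, for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": messages,
        "keep_alive": "30s",  # evict from VRAM quickly when swapping models on 12GB
        "options": {
            "num_ctx": 32768,  # raise the default 4K context for real code work
        },
        "stream": False,
    }

payload = build_chat_payload(
    "qwen2.5-coder:14b",
    [{"role": "user", "content": "Refactor this pandas pipeline for readability."}],
)
print(json.dumps(payload, indent=2))
# Send it with, e.g.: requests.post("http://localhost:11434/api/chat", json=payload)
```

Per-request options win over Modelfile defaults, which is handy when you only need the big context for coding sessions and want the smaller footprint the rest of the time.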
LM Studio is fine for testing, but switch to Ollama for anything serious — ollama ps, API access, and Modelfiles give you the control you'll need.
andy2na@reddit
Ollama is fine for newcomers, but it's probably at the bottom of the list otherwise. Highly recommend llama.cpp (with llama-swap) or vLLM for best performance and anything serious.
mladenConcept@reddit
Really useful comment.
A lot of people just recommend models, but the real-world 5070 Ti behavior under actual daily LLM use is the stuff people usually don’t mention.
DjuricX@reddit
DeepSeek-V3.2-Coder (14B)
arman-d0e@reddit
Uhh this is a nonexistent model…
ilintar@reddit
Yeah, the new Gemma MoE will probably be the best fit; you can try offloading the KV cache to RAM.
Late_Night_AI@reddit
I'd probably go gemma 4 26B and qwen coder next. I'm aware qwen coder next won't fit on your GPU, but it's a MoE and shockingly fast even when only partially GPU-loaded.