What’s your current local LLM setup in 2026?

[-]

Snoo_81913@reddit

MSI Stealth A13V i713k, 4060 8GB VRAM, 64GB DDR5 5200

qwen3.6 35B A3B Q5_K_M Q8/Q4 131k context - godot coding / planning / formatting I run Q4/Q4 at 196k for testing.
Gemma 4 26B - testing
Llama 8b - Sqlite calls and formatting RAG stack.
GLM-OCR RAG stack.

I've got about 15 models that I load in and out with a custom loader for testing.

[-]

m31317015@reddit

EPYC 7B13+ROMED8-2T
8x64GB DDR4-2666 ECC
RTX 5090 + 3090 (56GB VRAM total)
(I have another build with a 3090 but have yet to decide to migrate it to the server or not)

Ollama, llama.cpp or ComfyUI, one at a time when I need it.
Couple uses cases:
1. Auto format DDOS/Port scan bruteforce attack logs from my public-facing NAS web server (Deprecated by curl auto report to abuseipdb)
2. Agentic coding (llama.cpp)
2a. Automation scripts in bash, python and go
2b. Custom TUI & Web UI with llm connection via OpenAI SDK (Because I don't like openwebui since it becomes clunky)
3. Chat w/ Ollama, picked for easy to test out new models before passing to llama.cpp, also use it for plans and schedules
4. ComfyUI w/ SDXL -> Hunyuan I2V / WAN I2V

(1) before deprecated was running GPT-OSS:20b for its quick response time (on a single 3090) and most obedient to instructions (compared to qwen3)
(2) mostly Gemma 4 31B for complex solutions for a single goal, Qwen3.6 27B for simple solutions to subgoals and for separating complex goals into subgoals.
(3) Gemma 4 31B for plans and schedules, Qwen3 27B for role play testing.
(4) self explanatory

Biggest bottleneck is lack of VRAM (classic r/LocalLLaMA issue). As you can see there are multiple stacks of software here, and I can't run any in parallel since for LLMs they're taking \~48GB VRAM Q4_K_M w/ 256k num_ctx, and for ComfyUI workloads usually they take 28-36GB VRAM.

[-]

Material-Duck-6252@reddit

Is possible to preload models into RAM and route using llama-swap? You have sufficient RAM to keep Qwen3.6 27B + Gemma4 31B + some diffusion models. It takes 3-4 seconds for a 30b size model load from RAM to VRAM and of course the prefill time depending on your context.

[-]

see_spot_ruminate@reddit

GPU (or CPU if you’re brave )

4x 5060ti
7600x3d

RAM / VRAM

64gb ddr5
64gb vram

Models you use the most

qwen3.6 27b fp8 mostly via vllm
nvidia/Gemma-4-31B-IT-NVFP4 the last couple of days since I got mtp working for vllm

Main use case (coding, chat, agents, etc.)

autoerotic fan fiction involving myself, Abraham Lincoln, and an anthropomorphic banana that can say no (safe phrase "four score and seven years ago"), but never does.

Also — what’s the biggest bottleneck you’re hitting right now?

I have been thinking about getting a plx setup to move the cards from the janky motherboard setup to something more "elegant". I have figure out how to set it all up.

[-]

New_Zone5490@reddit

what a weird guy

[-]

Idiopathic_Sapien@reddit

Agentic coding and security analysis.
I-7
48gb ram
Dual rtx 3060 24gb vram (12x2)
4tb nvme
Ollama
IBM granite 4.1 8b

Triages up to 40 findings a minute depending on complexity.

[-]

sisyphus-cycle@reddit

With that setup I’d honestly try out LMstudio or unsloth studio over ollama. If you don’t mind getting a bit more low level, use llama.cpp. You should get a performance boost right away

[-]

Vaguswarrior@reddit

Single 9070xt

[-]

cleversmoke@reddit

RTX 3090 24G eGPU via oculink RTX 2060 12G eGPU via USBC TB4 AMD Radeon 780M iGPU 64GB DDR5 system ram

Windows 11 with llama.cpp OpenCode as harness

Models: - Qwen3.6-27B on the RTX 3090 24G (as primary agent) - DeepSeek-R1-Distill-Qwen-14B on the RTX 2060 12G (as critic subagent)

The AMD iGPU serves as the display and software acceleration GPU, so my 2 RTX are headless.

2 main use cases: - Coding (Python, React, Swift) - Portfolio management

Biggest bottleneck is context limit. I can fit about 128k on the RTX 3090 24G with Q4_K_M, but I am increasingly needing more. I'm upgrading the 2060 to another 3090 this week. This will give me 32-36GB vram for the main agent while keeping the subagent at 12-16GB vram, at a total of 48GB vram between 2x RTX 3090 24G.

However, I absolutely love my set up at the moment. It does everything I need with high quality and at reasonable speeds, at 50 tok/s with MTP.

[-]

MistingFidgets@reddit

I have a dual xeon dell workstation with 32gb of ddr4 that I got for $120 on eBay then added a new 16gb 5060ti. Looking to add another card, maybe an 8 or 10gb older rtx for fun. I run openclaw on top of qwen3.6 35b a3b at ud_iq2_m getting 100ish tokens per second on the GPU. I run a qwen 3.5 4b model on CPU only for background scheduled batch jobs to auto categorize financial transactions that sync into my homebrewed finance app via plaid integration with all my bank and credit card accounts. Openclaw has API access to the finance app so it can ingest, OCR, and archive receipts via telegram and give me details on spending and balances among other general assistant task stuff.

[-]

theChaparral@reddit

Minisforum MS-S1 Max (Strix Halo) 128 GB

using it as a general usage workstation, not a dedicated AI machine.

I'm bouncing between Gemma4 and Qwen3.6 mostly right now

Agents, coding, Image gen with comfyui, general chat use.

I guess the bottleneck is the current lack of newer 120B ish models.

[-]

LivingHighAndWise@reddit

Asus GX10 DXG for larger models running vLLM (NVIDIA 5070 GPU performance with 128GB ofUnified RAM)
-Currently running Qwen 3.6 27b with 256K context window.

Dual NVIDIA GPUs running in Ollama on my local workdsation for smaller models and speed (5090 + 3090 = 60GB VRAM).
-Currently running Qwen 3.6 27b with 64K context window (at about twice the speed of the GX10).

[-]

ttkciar@reddit

My hardware hasn't changed much, and probably won't until after RAMageddon is over, however many years that takes.

I have multiple dual-processor Xeon servers, with a mix of E5-2660v3, E5-2680v3, and E5-2690v4. Most have 256GB DDR4-2133, but two have 128GB DDR4-2133.

One has a 32GB MI50 which keeps Big-Tiger-Gemma-27B-v3 resident, another has a 32GB MI60 which keeps Gemma-4-31B-it resident, both quantized to Q4_K_M.

Mostly those Xeons are busy running GEANT4 and Rocstar simulations (which is what I originally bought them for), but I'll also use them to infer with GLM-4.5-Air or K2-V2-Instruct via pure-CPU (with llama-completion from llama.cpp), so that they aren't evicting the models from my GPUs (which wouldn't speed up these large models much anyway).

I also have some odds and ends, including a Dell T7500 with a E5504 Xeon and 24GB of DDR3, with a second PSU literally duct-taped to it and daisy-chained via ADD2PSU device, to feed my third GPU, a 16GB V340. That V340 keeps Qwen3.5-9B Q4_K_M resident.

My main use-cases for these models are:

GLM-4.5-Air: Non-agentic codegen, physics assistant (mostly critiquing my neutron transport notes and suggesting relevant subjects for further study), and medical assistant (mostly explaining medical journal publications to me).
Gemma-4-31B-it: Wikipedia-backed RAG for general Q&A, creative writing, business writing, language translation, Evol-Instruct pipelines, and sometimes debugger for GLM-4.5-Air's code. Also working toward using it for a technical support IRC chatbot for a channel I moderate.
Big-Tiger-Gemma-27B-v3: Critiques my Reddit activity and provides constructive criticism, also great for persuasion research and violent creative writing (Murderbot Diary fan-fic; non-erotic but very violent). I'm looking forward to TheDrummer giving Gemma-4-31B-it the Big Tiger treatment so it can take over these tasks.
K2-V2-Instruct: Long-context tasks like system log analysis and IRC log analysis, also what my "actlikettk" (self-clone) script uses, though Gemma4 might be taking over that role, not sure yet. It is also very good at Linux technical assistance and cybersecurity. I'd like to use this model for more kinds of tasks, since it's wicked smart and reasonably up-to-date, but it's hellaciously slow on my hardware (0.5 tokens/second at short context, slowing to 0.06 tokens/second at 227K context), so that will need to wait until I have at least 128GB of VRAM to throw at it.
Qwen3.5-9B: Synthetic dataset upcycling and augmentation. The dataset upcycling is mostly following what https://arxiv.org/abs/2510.10681 describes but with low-quality synthetic data instead of web data, and the augmentation is mostly replicating and refining the techniques LLM360 used to generate their TxT360_QA dataset.

When TheDrummer publishes a new Big Tiger based on Gemma 4, that will likely subsume the roles of both Gemma-4-31B-it and Big-Tiger-Gemma-27B-v3, which will leave one of my 32GB GPUs empty. I haven't decided what to do with it, yet. Maybe Olmo-3.1-32B-Instruct to synthesize an ontological syllogism corpus? I've been neglecting that project for a while now, and probably need to rethink it. Modern "reasoning" techniques have gone a different way.

The GPU-resident models are hosted via llama.cpp's llama-server, and I use them via its API endpoints with Python/Perl scripts. The pure-CPU inference models are run via bash scripts which wrap llama.cpp's llama-completion.

[-]

SryUsrNameIsTaken@reddit

Apple M5 Max
128 GB

7900 XTX for some stuff, mostly custom vision models not LLMs on a different rig with 64 gb of DDR4

Main use is exploring alignment and testing out some theories on hallucinations + detection.

Usually just use Qwen3.6-27B. I’m often not concerned about speed. Faster is great and I’d like to mess around with some optimizations, but I haven’t gotten to it yet.

Biggest bottleneck is feeling tired after work.

[-]

Quadrapoole@reddit

2x rtx 6000 pro. 192gb vram. Using minimax nvfp4. Just personal chat bot and coding for fun

[-]

DiscipleofDeceit666@reddit

2x RDNA GPUs 28Gb vram
Rx 6800 Rx 6700xT
Qwen3.6 35B
Coding

For sure llamacpp and rocm stability. Lots of crashes just for existing. But if you can tweak it right, 70 tok/s writes compared to vulkans 30-40 tok/s writes. You get wildly different output depending on the gguf and the vulkan/rocm backend. Deepseek runs 4x faster on Vulkan than rocm for instance. Once all that is optimized and stable, LocalLLaM utopia

[-]

wombweed@reddit

i built out a 3-tier system that allows me to pick and choose based on the speed vs quality tradeoff

big code projects/agentic code review: 2x3090, 256GB DDR4, minimax m2.7 with cpu moe - slow as hell but works for what i use it for

voice assistant/chat ui/quick agentic code edits: 9070, qwen3.6-35b-a5b - good middle ground

session titles/summarizing/classification: 6600XT, gemma-4-e4b - small and fast, to leave the other models' slots open for actual work

[-]

Expensive-Shift8584@reddit

apple m3 ultra

256gb

Qwen 3.5 397b q4, minimax 2.7 q6, gemma 4 and qwen 3.6 q6 or q8

vibe coding simple web games

[-]

BitGreen1270@reddit

Current setup - Lenovo e14 gen 7 with 32gb ram and 780m igpu. Mainly experimenting, optimizing local models. Started working on a (mostly) hand coded orchestrator that interfaces between my local llm and telegram. Built a tool for it so it can get the current time and date.

I do use occasional Gemini paid API for some coding when speed and accuracy matters but it's only for playing around.