Best Local LLMs - Apr 2026
Posted by rm-rf-rm@reddit | LocalLLaMA | View on Reddit | 182 comments
We're back with another Best Local LLMs Megathread!
We have continued feasting in the months since the previous thread, with the much-anticipated release of the Qwen3.5 and Gemma4 series. If that wasn't enough, we are having some scarcely believable moments, with GLM-5.1 boasting SOTA-level performance, Minimax-M2.7 being the accessible Sonnet-at-home, PrismML Bonsai 1-bit models that actually work, etc. Tell us what your favorites are right now!
The standard spiel:
Share what you are running right now and why. Given the nature of the beast in evaluating LLMs (untrustworthiness of benchmarks, immature tooling, intrinsic stochasticity), please be as detailed as possible in describing your setup, nature of your usage (how much, personal/professional use), tools/frameworks/prompts etc.
Rules
- Only open weights models
Please thread your responses in the top level comments for each Application below to enable readability
Applications
- General: Includes practical guidance, how to, encyclopedic QnA, search engine replacement/augmentation
- Agentic/Agentic Coding/Tool Use/Coding
- Creative Writing/RP
- Speciality
If a category is missing, please create a top level comment under the Speciality comment
Notes
Useful breakdown of how folk are using LLMs:
Bonus points if you break down/classify your recommendations by model memory footprint (you can and should be using multiple models in each size range for different tasks):
- Unlimited: >128GB VRAM
- XL: 64 to 128GB VRAM
- L: 32 to 64GB VRAM
- M: 8 to 32GB VRAM
- S: <8GB VRAM
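If it helps, the buckets above as a quick lookup (a trivial sketch; boundary values are assigned to the larger bucket arbitrarily):

```python
def size_class(vram_gb: float) -> str:
    """Map a model's memory footprint in GB of VRAM to the thread's buckets."""
    if vram_gb < 8:
        return "S"
    if vram_gb < 32:
        return "M"
    if vram_gb < 64:
        return "L"
    if vram_gb <= 128:
        return "XL"
    return "Unlimited"

print(size_class(20))   # a 20GB model lands in "M"
print(size_class(96))   # a 96GB model lands in "XL"
```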
rm-rf-rm@reddit (OP)
Agentic/Agentic Coding/Tool Use/Coding
baliord@reddit
I use several models for different things; I run almost exclusively llama.cpp, and I use llama-swap to sit in front of my llama-server instances, providing around 32 different model choices. (I test different models regularly.) My go-to has been MoE models since ~GLM-4.6, as I can split them between GPU and CPU, and they handle it much better than dense models.
Right now I'm using GLM-5.1 (unsloth/GLM-5.1-GGUF) at 3-bit quantization, generally in non-reasoning mode, for creative writing. It's also my go-to for anything where I want to talk, but not do tool calls. At 10 t/s generation locally, it's just way too slow for that, but it's human-speed for conversation or character-driven stories. It also picks up on character definitions better than any other model short of Opus. (I've also used GLM-5.1 in the cloud for OpenClaw historically, because it's Opus-level smart, and because I want my agent to adopt the persona that is defined for it. These days I'm trying to use Qwen-122B locally more consistently for OpenClaw, unless I need the smarts.)
For agentic use, Qwen 3.5-122B works surprisingly well, although it doesn't have much of a 'personality'. I've run it at q4 (fully in GPU RAM) at ~50 t/s generation. I haven't needed to push up to q8 for it, and if I need much smarter I go cloud. The specific model I'm using there is HauhauCS/Qwen3.5-122B-A10B-Uncensored-HauhauCS-Aggressive. I also use that for image analysis, tagging, and processing using mmproj-f16. The q4 isn't as good at image analysis as the q8, though. If I had to pick a model to stick in pure GPU RAM semi-permanently, it'd probably be this one, although I'd bump up to Q8 and let some of it sit in RAM.
I have an embedding model, but I don't really use it that much anymore. I was using Qwen3-Embedding-8B-Q8_0.gguf and a smaller Qwen reranker. I need to get back to this.
My system is an ASUS ESC4000A-E12 with a 32-core EPYC and 384GB of DDR5 RAM, plus 2xL40S for 96GB of GPU RAM; it sits in a SysRacks rack in my garage.
My basic config for each llama.cpp llama-server call in the llama-swap config expands to:
For non-reasoning, I add:
I customize the `-c {context-length}` per model, because if you don't manually set a context length, `--fit on` will shrink your context to nothing in order to fit the model before it goes to RAM. 😡 I also have a 'limited reasoning' config for use cases where I want it to do reasoning, but don't want it to waste all its time doing it. So I'll limit it to ~2048 tokens of reasoning and leave `enable_thinking` alone. (E.g. at the 50 t/s above on Qwen3.5-122B@Q4, that's about 41 seconds of reasoning.)
Hope that helps!
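(For reference, a 'limited reasoning' llama-server call in that style might look something like the following. This is only a sketch: mainline llama.cpp documents `--reasoning-budget` with values 0 and -1, so treat an arbitrary budget like 2048 as an assumption about your particular build, and the model path and context are placeholders.)

```shell
# Sketch of a 'limited reasoning' llama-server call (placeholder values).
llama-server \
  -m /models/Qwen3.5-122B-A10B-Q4_K_M.gguf \
  -c 32768 \
  --jinja \
  --reasoning-budget 2048   # ~41 s of thinking at 50 t/s; assumes the build accepts arbitrary budgets
```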
rm-rf-rm@reddit (OP)
Excellent post! Would you mind sharing your llama-swap YAML? There's so little references out there and would love to compare and improve mine
baliord@reddit
Sure; it's still crufty, because I comment out stuff that's not currently being used instead of pruning and deleting, and things like that, but it might be valuable for other reasons.
I recently tried to set it up so that I could easily switch between testing ik_llama.cpp versus llama.cpp, but because the parameters are not entirely compatible, I had a lot of weird tricks I had to do to make it mostly work. And then I decided not to use it. 🤣
I'm also currently trying out `-c 0` instead of manually setting context length per entry, because it's annoying to have to look up the context length for each model. Anyway, I tossed it up on a gist with minimal editing. Let me know what you think, and it's okay to say, 'Oh god, don't do it like that... do it like this instead!' :)
No-Statement-0001@reddit
Thanks for sharing that. I love getting a peek at other people's configs. llama-swap is like a box of Legos, and seeing what people craft with the parts is really fun.
rm-rf-rm@reddit (OP)
Thanks! It made me realize I should use macros more. Right now I'm just using it as a simple one-to-one lookup table rather than groups of params.
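For anyone else reading: macros in llama-swap look roughly like this (a minimal sketch; the model paths, aliases, and flags are placeholders, not a recommended config):

```yaml
# llama-swap config sketch: shared flags factored into a macro
macros:
  base-args: "--jinja --flash-attn on -ngl 999"

models:
  qwen3.5-35b:
    cmd: llama-server -m /models/Qwen3.5-35B-A3B-Q4_K_M.gguf ${base-args} -c 65536 --port ${PORT}
  gemma4-26b:
    cmd: llama-server -m /models/gemma-4-26B-A4B-Q4_K_XL.gguf ${base-args} -c 32768 --port ${PORT}
```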
youcloudsofdoom@reddit
S - 8GB
I'm getting great mileage out of Qwen3.5-35B-A3B-UD-Q4_K_L. With this I'm squeezing around 600 t/s prompt processing and 30 t/s generation out of my RTX 4070 Laptop (!) edition with 8GB VRAM. Very usable, and the competency on single coding tasks has been very good so far. I'm currently experimenting with using this in a local Hermes setup, but it's early days yet.
Here are my llama.cpp settings, after lots of back and forth on these...
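Roughly this shape, in case it helps (illustrative values only, not my exact flags; `--n-cpu-moe` keeps expert tensors on the CPU so the 8GB card only has to hold attention weights and KV cache):

```shell
# Illustrative 8GB MoE offload setup; expert-layer count and context are guesses
llama-server \
  -m Qwen3.5-35B-A3B-UD-Q4_K_L.gguf \
  -c 198000 \
  --flash-attn on \
  --n-gpu-layers 999 \
  --n-cpu-moe 48 \
  --jinja
```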
VicFic18@reddit
I'm assuming you're using CPU offloading? Also, how large is your context?
youcloudsofdoom@reddit
198k, as you can see in the params. And yes, offloading, but there's less of a penalty on an MoE model than on a dense one (the best I can get out of the 27B is 9 t/s generation, for example).
stopbanni@reddit
In M size Qwen3.5 9B gives good results, even with such things as browser use with Hermes Agent
Total_Activity_7550@reddit
Qwen3.5-27B. Nothing else that fits into 2xRTX 3090 works for my project. I use Qwen Code.
I also have my personally written todo webapp; it has an MCP server. Gemma 31B is on par with Qwen3.5-27B.
dinerburgeryum@reddit
Out of curiosity have you tried split mode tensor yet? I’m having a bear of a time getting it working with 3090+A4000, but 2x 3090 should work way better.
viperx7@reddit
I have a 4090+3090 and it crashes for me every time the context exceeds 10k. Feels like some kind of bug.
Total_Activity_7550@reddit
Token generation increased, but prompt processing decreased, as noted by llama.cpp developers. For my use case this isn't beneficial.
Far-Low-4705@reddit
Since you have two 3090s, you should try the new `-sm tensor` flag; it enables tensor parallelism.
It is buggy, and might not be faster for Qwen 3.5 yet, only for older models, but definitely keep an eye on it; it will likely get much better in the future.
Total_Activity_7550@reddit
Token generation increased, but prompt processing decreased, as noted by llama.cpp developers. For my use case this isn't beneficial.
Far-Low-4705@reddit
It is still very much experimental, but you should keep an eye on it!
It will almost certainly make a difference for you, since you are pretty much the target hardware.
Unfortunately for me it makes no difference, but I'm running on two AMD MI50s, so much older and not Nvidia.
CBW1255@reddit
Can you post your llama.cpp config for using that exact quant you are using? Thanks.
Total_Activity_7550@reddit
Updated parent comment.
Novel_Law4469@reddit
how do you manage to combine 2 GPUs together ? is it like a resource pool or something ?
Terminator857@reddit
I tested various models including gemma4 q8. Qwen 3.5 122b q4 beat them all in my tests. Wasn't even close.
Embarrassed_Elk_4733@reddit
Try Qwen3.5 27B and let us know your feedback!
SpicyWangz@reddit
That would be too slow on a Strix.
Safe-Buffalo-4408@reddit
Using it on a Strix. I've tested a bunch of different models, but I always go back to the Qwen 3.5 27B due to quality. Faster models generate poorer output and then have to spend time fixing it; the 27B doesn't make many errors. So yes, it's slower, but with way better results. Worth it in my use case of agentic coding, plus using it as a personal assistant.
SpicyWangz@reddit
It performs better than the 122B for you or minimax m2.7?
cafedude@reddit
I found that Qwen 3.5 122b would tell me stuff like tests were passing when they really weren't, or that some failure was pre-existing and not a result of its changes (when it was). I find Qwen3-coder-next to be better. Also on a Strix Halo system.
rm-rf-rm@reddit (OP)
did you test it against qwen3.5 27b?
Terminator857@reddit
No, but I've seen others have and reported that gemma 4 wins.
rm-rf-rm@reddit (OP)
not universally
mrtime777@reddit
cyankiwi/MiniMax-M2.7-AWQ-4bit (or cyankiwi/MiniMax-M2.5-AWQ-4bit) on 2xGB10 cluster..
Zc5Gwu@reddit
+1 for MiniMax 2.7. I'm running with the following command on a Strix 128GB. Works well for agentic coding if you're patient (do some laundry in the meantime). At 30k context I'm getting about 16 t/s generation and 50 t/s prefill.
sn2006gy@reddit
works well as agentic? it's TERRIBLE lol.
BUT.. if you don't mind wasting time/electricity go for it. DON'T USE IT WITH AN API IF PEOPLE ARE READING THIS. YOUR COSTS WILL GO TO THE MOON
Local-Cartoonist3723@reddit
Didn’t get a chance to try this yet — you’re happy with it then? Any writeups?
El_90@reddit
Strix halo, 128GB (I can squeeze in 92GB models currently, so rated **XL**)
Roocode in architect mode - Qwen3.5-122B-A10B-Q5_K_M (91GB), in the region of 7t/s
Roocode in coding mode - Qwen3.5-27B-Q5_K_M (20GB), in the region of 12t/s
Sorry, I don't have deep testing, but I tried 5-10 other models and there was always lots of back and forth with more changes, errors, and mistakes; with these models I don't feel that, so I just stuck with them.
I find 122B slightly better in architect mode, more diagrams, more thorough talking through the requirement, though maybe that's my own bias.
Hobbster@reddit
I don't know Roocode but 7T/s sounds awfully slow on a Strix Halo. I usually run an Unsloth Qwen 3.5 122B A10B Q6_K around 22.5T/s and around 18T/s with 100k context, Bartowski 122B A10B Q6_K_L @ 21.6T/s and 17.3 with 100k context. Llama-server. Both models are somewhat larger, so your Q5 should be faster than that? A lot of performance ready to be discovered in your machine.
awitod@reddit
I am using unsloth/Qwen3.5-35B-A3B-Q5_K_XL and getting excellent results. I am using it over 27b for memory management and speed because I am testing a config that works without any cloud services and seeing how much quality I can get if I load everything at once.
I have ASR, TTS, text2Image, image2image, LLM with vision and embeddings simultaneously.
System: 96 GB RAM, 56 GB VRAM total (RTX 5090 + RTX 4090)
unsloth/Qwen3.5-35B-A3B-Q5_K_XL
mmproj-F16.gguf
llama.cpp config:
ctx-size=262144
threads=16
parallel=5
cache-ram=8192
n-gpu-layers=999
kv-unified=1
jinja=1
cont-batching=1
Using Unsloth's guide recommendations for inference settings:
temperature=0.7
top_p=0.8
top_k=20
min_p=0.0
presence_penalty=1.5
repetition_penalty=1.0
thinking toggle via chat_template_kwargs.enable_thinking (off in most but not all agents)
parallel_tool_calls=true <-- VERY IMPORTANT FOR OUR USE CASES
Image stack models/config:
diffusion: flux-2-klein-4b-Q4_K_S.gguf
VAE: full_encoder_small_decoder.safetensors
text model: Qwen3-4B-Q4_K_M.gguf
defaults: steps=4, cfg_scale=1.0, strength=0.75
Other local models in same runtime:
Embeddings: microsoft/harrier-oss-v1-0.6b
ASR: Qwen/Qwen3-ASR-0.6B
TTS: microsoft/VibeVoice-1.5B + Qwen/Qwen2.5-1.5B tokenizer
Far-Low-4705@reddit
is this a setting in llama.cpp or something?
awitod@reddit
It goes in the chat request json. llama.cpp/docs/function-calling.md at master · ggml-org/llama.cpp
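For anyone who hasn't wired this up: the extra fields ride alongside `messages` in the request body. A minimal sketch (the model alias and endpoint are placeholders):

```python
import json

# Chat request body for llama-server's OpenAI-compatible endpoint.
# parallel_tool_calls and chat_template_kwargs sit at the top level,
# next to messages. The model alias here is a placeholder.
payload = {
    "model": "qwen3.5-35b",
    "messages": [
        {"role": "user", "content": "Check the weather in Berlin and Paris."}
    ],
    "parallel_tool_calls": True,                         # allow several tool calls per turn
    "chat_template_kwargs": {"enable_thinking": False},  # per-request thinking toggle
}

body = json.dumps(payload)
# POST `body` to http://localhost:8080/v1/chat/completions
# with Content-Type: application/json.
```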
RaptorF22@reddit
How do the 2 rtx cards combine their vram? I thought that was only possible with 3090s.
awitod@reddit
It’s well supported by the nvidia drivers and mixing devices is pretty common.
Many things allow you to configure a specific GPU, all GPUs, auto or cpu only and I have been tweaking that for each thing as I go to get the most out of what I have.
It’s kind of like packing a car 😆
b0tm0de@reddit
Hello. Regarding full_encoder_small_decoder.safetensors: I downloaded it from the official repo and put it in the VAE folder for ComfyUI, but it just gives an error. Do you know how you got it working? ComfyUI v18.5.
awitod@reddit
I’m not using Comfy, just stable-diffusion.cpp, which we build with the image.
rm-rf-rm@reddit (OP)
what are you using to run ASR and TTS?
Far-Low-4705@reddit
qwen3 ASR was just added to llama.cpp!
awitod@reddit
rm-rf-rm@reddit (OP)
I'm asking about the engine you're using to run it... PyTorch?
awitod@reddit
Sorry, I was trying to edit and get the links to the libs.
qwen-asr · PyPI
vibevoice · PyPI
puru991@reddit
What t/s are you getting?
awitod@reddit
Here is some normal output.
dinerburgeryum@reddit
Qwen3.5-27B still going strong. I wanted to love Gemma4 but once you’ve gone hybrid attention it’s hard to go back to paying the full-fat Attention cost (even with iSWA). Still haven’t really put the Gemma4 MoE through the wringer, but midsized MoEs lost a lot of trust with me from my experience with Qwen3.5. (Probably time to revisit now that all the parsing PRs for llama.cpp have been merged huh?)
zanar97862@reddit
Even with the updates I'm having iffy results from Qwen 3.5 35B: lots of thinking in loops and poor tool usage. I got an Opus 4.6 fine-tune from Hugging Face that makes it much more usable.
boutell@reddit
Oh that sounds cool... was it this one? I tried a quant of it just now but the responses I got were pretty much entirely irrelevant to the question:
https://huggingface.co/mradermacher/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF/discussions/1
guiopen@reddit
There are still some problems with tool calls leaking in reasoning
truthputer@reddit
== Coding Only! - I only use LLMs for coding. Home workstation (built well before the RAM-pocalypse):
== Current main coding LLM: Gemma 4. It runs at high speed with a big context window - Benchmarks:
I launch with:
This hasn't been out long, but I've already noticed it sometimes makes mistakes and isn't as precise as frontier models like Claude. But it's fast and reasonably capable, just verify the work it does.
== I'm not using (until the bugs are fixed): Qwen 3.5 35B-A3B, I discovered the following problems:
== I am experimenting with: MiniMax 2.7 226B-A10B - Benchmarks:
I launch with:
This model is 140GB and obviously overflows the GPU, so CPU is at 100%, but it powers through. Probably not worth the electricity cost, but prompt caching works so once you start a conversation it doesn't feel that slow. I had it convert a Python app to Flutter and it mostly worked - although I had to ask Claude to fix some bugs it was having difficulty with.
sk1kn1ght@reddit
I have the exact same config except for 4090 and one of my ram sticks died. So I am stuck at "7" channels which is the bloody worst
truthputer@reddit
My condolences, that must be the worst trying to find a matching replacement stick in this retail environment.
sk1kn1ght@reddit
Thank you. I gave up on it. Maybe my grandchildren will be able to find one from an old bunker somewhere
SirBardBarston@reddit
Fairly new still. Is Vulkan well supported already? Your stats look great. How much is coming from the 7900XTX, how much from the 512GB RAM?
truthputer@reddit
YMMV, but from reading various forums, Vulkan Compute support is maturing and in some situations can have performance on par with ROCm, while using less memory. Anecdotally, maturity generally seems to be: CUDA > Vulkan > ROCm. If it were my choice, I would argue for the developers dropping ROCm entirely in favor of pooling efforts to improve Vulkan support.
CPU/GPU usage split varies depending on the model and the stage of LLM processing; initial prompt ingestion usually loads the GPU fully, then as generation gets underway the bottleneck shifts to the CPU. With Gemma 4 it's 2% CPU / 80-100% GPU; with big models that offload many layers to the CPU (like MiniMax), the CPU becomes the bottleneck; I've seen 100% CPU and 10% GPU usage.
Also I didn't bother testing with a high performance power profile, this is in Power Saver CPU mode.
Blues520@reddit
I'm using qwen3-coder-next on 48gb vram. Using an unsloth quant with 100k context.
Running in llamacpp with opencode.
-c 100000 --flash-attn on --n-gpu-layers 999 --n-cpu-moe 24 --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40
It works well enough but sometimes generates less-than-optimal code. I end up having to adjust the prompt and restart the session quite often, but it's fast enough that this is not a huge problem. It also has a very clinical feel, so it's not amazing for UI work. I recently ran gemma4 for a UI task and it had a much better feel, so I'll probably alternate between qwen-coder and gemma4.
sanjxz54@reddit
I liked how coder-next worked for me in Claude Code; apex-i-quality quant on 12gb vram (5070) with 56gb ram offloading 200k context runs somewhat fast (150t/s prefill, 20 t/s gen on llama cpp). Worked best for me to ask a free tier frontier model on how to implement something, and pass that to qwen for actual work.
But I recently tried 122b-a10b apex-i-mini and it feels better than coder-next. If I had to compare, I'd say coder-next with MCPs for web search in Claude Code is about Sonnet 3.5 from the Cursor v0.4 (or so) era, while the 122b one is closer to Sonnet 4 level. Speed is 250 t/s prefill and 10 t/s gen. Using those settings in ik_llama.cpp (self-compiled via MSVC):
-c 230000 --fit --peg --cache-type-k q8_0 --cache-type-v q6_0 --k-cache-hadamard --v-cache-hadamard --fit-margin 2048 -np 1 -fa on -mla 3 -t 16 -tb 16 --merge-qkv -b 2048 -ub 1024 --no-mmap --jinja --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --reasoning-budget -1 --repeat-penalty 1.0 --presence-penalty 0.0 --alias "qwen-3-apex" --port 8080 (for 122b. for coder next i used temp 1.0 and top k 40, and standard llama)
Really wanna upgrade to 96+ gb ram and try m2.7 in mini quant, sounds great on paper.
Objective-Stranger99@reddit
GLM 4.7 Flash REAP 23B A3B UD Q5 K XS
the_auti@reddit
GLM 5.1 FP8 on sGLang 4xB300 Cluster 200k Context 32k Output
Have not finetuned this yet as this is an experiment and costs $30/hour to run.
Can run a dozen parallel agents at extreme speed.
Project: Convert 500k line Node / Express / Pug codebase to Go and React.
14 Hours Run Time
170k lines of Go 35k lines of TS
Output on par with Opus 4.6
Will run e2e testing tomorrow but the initial code review (can you call it that?...code glance) using Opus 4.6 is extremely positive.
Please note this setup is not for the faint of heart. It cost close to $500 just to get this running on runpod. It is still not running "Properly" but it was enough for our experiment.
In the coming weeks we will be testing MiniMax as well.
Travnewmatic@reddit
my daily for Hermes, agent-zero, and OpenCode:
also running an embeddings model (on the CPU) for agent-zero:
been working fairly well. open to suggestions!
running a single Nvidia A10, 32G system memory. 9800X3D.
looking forward to getting my second A10 later this year :)
JaconSass@reddit
Using Gemma4:26b on a RTX 3090 for my AI Home Assistant. I occasionally switch back to Qwen3.5 to keep a perspective.
false79@reddit
gemma-4-26B-A4B-it-UD-Q4_K_XL on 7900 XTX (24GB) on llama.cpp
--temp 1.0 ^
--top-p 0.95 ^
--top-k 64 ^
--ctx-size 128000 ^
--chat-template-kwargs "{\"enable_thinking\":false}" ^
--jinja ^
--verbose
Pros:
- Dunno if it's the best, but I find the quality is higher than gpt-oss-20b.
- Gained multimodal support.
Cons:
- Output quality is higher on this MoE, but it is slower than other MoEs I've tested.
- Once in a while it crashes, maybe twice a week?
Ariquitaun@reddit
It crashes on me at least once a day. More if I hook it up to a coding agent
No-Manufacturer-3315@reddit
Glad it's not just me; people say I'm crazy when I say Gemma is crashing, even running the latest models, llama.cpp, and runtimes.
Borkato@reddit
Gemma has a TON of problems a lot of the time. It’s incredibly finicky even after all the updates and template changes
Independent_Solid151@reddit
When did you last download the model and what's your llama.cpp version.
Chupa-Skrull@reddit
It's crashing more for me after the latest round of big updates than it ever did before tbh
Independent_Solid151@reddit
Reduce the number of checkpoints and parallel instances. You can also increase batch size to 2048, while keeping ub at 1024.
aldegr@reddit
Try -cram 4096, or lower like 0. It might crash because the checkpoints are consuming too much system RAM.
false79@reddit
Thx, I've added --cache-ram 4096 to my launcher. I'll try it out and see if that makes any changes.
I find it crashes more often if the context is like 75% or higher filled.
Eyelbee@reddit
Why are you running it on that setup when qwen 3.5 27B exists? Would be significantly higher quality.
false79@reddit
Had tool issues with Cline's TUI, so I quit on Qwen.
Gemma 4 worked out of the box with no issues.
Witty_Mycologist_995@reddit
Yes.
h-mo@reddit
For agentic/tool use workloads (the main thing I care about professionally), I've been routing between two models depending on task complexity - a smaller fast model for routing and triage, a larger one for actual reasoning. Running this through Open WebUI with a custom pipe that scores complexity before deciding which model gets the call. For the larger slot, Minimax-M2.7 has been surprisingly capable - genuinely Sonnet-level on structured output and multi-step tool use, which matters more to me than benchmark scores. For the smaller slot, anything in the Qwen3.5 family at Q4 handles classification and short-context tasks cleanly without burning tokens. The key thing I've learned is that for production pipelines, instruction-following consistency matters far more than raw intelligence - a model that follows system prompt constraints 98% of the time beats a smarter model that goes off-script.
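The scoring pipe doesn't need to be fancy. A toy sketch of the routing idea (the heuristics, thresholds, and model aliases here are made up for illustration):

```python
def complexity_score(prompt: str, has_tools: bool) -> float:
    """Crude complexity heuristic: prompt length, tool use, multi-step language."""
    score = min(len(prompt) / 2000, 1.0)          # longer prompts -> harder
    if has_tools:
        score += 0.5                               # tool calls -> harder
    if any(w in prompt.lower() for w in ("step", "plan", "refactor")):
        score += 0.3                               # multi-step wording -> harder
    return score

def pick_model(prompt: str, has_tools: bool) -> str:
    """Route to the big model only when the task looks complex."""
    big, small = "minimax-m2.7", "qwen3.5-9b"
    return big if complexity_score(prompt, has_tools) >= 0.6 else small

print(pick_model("What's 2+2?", has_tools=False))                        # routes small
print(pick_model("Plan a refactor of the auth module", has_tools=True))  # routes big
```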
rm-rf-rm@reddit (OP)
Speciality
(includes medical, legal, accounting, math etc.)
stopbanni@reddit
Gemma 3 4B is still, I think, the best model in the S size that can be good at Russian; it even knows animal sounds (for some reason, Qwen3.5 9B thinks a cow says meow, lol).
UpsetEmotion6660@reddit
Edge AI / IoT inference on constrained devices:
S (<8GB): For always-on sensor inference on MCU-class hardware (STM32N6, ESP32-S3, Syntiant NDP120), you're not running LLMs — you're running quantized TinyML models (keyword spotting, anomaly detection, vibration classification). TensorFlow Lite Micro and ST's NanoEdge AI Studio are the practical tools here.
M (8-32GB): This is where edge inference gets interesting. Running quantized 4B models on Jetson Orin Nano or RPi5 + AI HAT for real-time vision, predictive maintenance, or local NLP. Gemma4 e4b quantized is genuinely usable for on-device agentic tasks in industrial IoT — local decision-making without cloud round trips.
The underappreciated angle for this community: the biggest constraint for edge AI isn't model quality anymore — it's the orchestration layer. How do you push model updates OTA to a fleet of thousands of devices running different hardware? How do you handle inference when connectivity is intermittent? The model is the easy part; the distributed systems around it (connectivity management, fleet OTA, telemetry collection) are where most deployments actually struggle.
For anyone building local-first AI systems that need to work in the field, not just on a desktop — the connectivity and fleet management stack is where to invest your time.
Tyrannas@reddit
Churro OCR quantized Q4_K_M for historical documents OCR https://huggingface.co/mradermacher/churro-3B-GGUF
noddy432@reddit
Would you mind sharing your workflow app for OCR? I'm looking for a good OCR for historical handwritten documents. Thanks.
Tyrannas@reddit
Sure, so I downloaded the GGUF model and served it locally with llama.cpp's llama-server, then I call it through its OpenAI-compatible endpoint.
I use the stanford-oval profile to get structured XML output, but you can also get raw text by changing the profile.
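The call itself is just an OpenAI-style multimodal chat request with the scanned page as a base64 data URL. A minimal sketch (the model alias, prompt, and endpoint are placeholders, not my exact snippet):

```python
import base64
import json

def build_ocr_payload(image_bytes: bytes, model: str = "churro-3b") -> str:
    """Build an OpenAI-style multimodal chat request body for llama-server.
    The model alias and prompt are placeholders."""
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode()
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this historical document."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
        "temperature": 0.0,  # deterministic transcription
    }
    return json.dumps(payload)

# e.g. body = build_ocr_payload(open("page_001.jpg", "rb").read())
# then POST `body` to http://localhost:8080/v1/chat/completions
```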
MuDotGen@reddit
How does this do for modern official documents? Languages? Such as Japanese? I'd been using PaddleOCR for that use-case so far, but a 3B Q4 model seems tempting.
Tyrannas@reddit
Haven't tried it on modern documents, but I'm pretty sure you can find better, since Churro is a Qwen2.5 fine-tune trained on 100k historical documents. Maybe look at https://huggingface.co/collections/ggml-org/ocr-models to find other options?
Traditional-Gap-3313@reddit
Gemma 4 26B for my agentic legal search. Running on vllm on 2x3090 and it's quite fast in both prompt processing and decoding. Can fit 65k context which is enough for most searches, since I do several search agents in parallel and combine the results.
31B won't work in vLLM due to broken KV cache allocation. It will work once that is fixed; until then, the max I could get is 9k context on 2x3090s. (https://github.com/vllm-project/vllm/issues/39133)
Tested Qwen 3.5 35B MoE: not that good on the legal text understanding benchmark, even though it works agentically. Qwen 27B has the same KV cache allocation problem as Gemma 4 31B.
However, I tested Gemma 4 31B over OpenRouter and it's significantly stronger in tool calling than 26B. My tool calls have additional optional parameters besides the query; 26B NEVER EVEN ONCE called them with the optional parameters, while a few tests I did with 31B on OpenRouter showed that model understands the tools a lot better. So I'm waiting for vLLM to fix the KV cache allocation bug, and then I'll migrate everything to 31B or Qwen 27B.
Borkato@reddit
Gemma 4 36B is god tier at teaching and explaining. It’s even better than qwen.
It’s also relatively uncensored which is super fucking surprising.
cleverusernametry@reddit
36B?
Borkato@reddit
Oh uhhh whatever the big MoE is lol I always forget the params. The 36 A4B or whatever
cmdwedge75@reddit
Gemma4 26B A4B
Borkato@reddit
Yea that thing lmao
rm-rf-rm@reddit (OP)
Creative Writing/RP
stopbanni@reddit
In "S" size my go-to is still Gemma 3 4B. It's multilingual, it has good support in old versions, and available even in openrouter to try it. Tested on quant Q4_0
DragonfruitIll660@reddit
Gemma 4 31B. Even smaller quants feel like a large step up in terms of quality. Doesn't hurt the speed is quick as well on weak hardware. Bit of a positivity bias and the vision seems to struggle a bit relative to writing quality but its good overall (it struggles to read maps a little and doesn't always accurately describe assets/story items.)
Weak-Shelter-1698@reddit
Peak experience for me.
bswillie@reddit
Q4 is blazing fast with generous context on a 5090. Base is fine but having a blast trying every revision of u/TheLocalDrummer's Artemis the moment they drop. Not deleting Skyfall yet... But not sure if I'll ever load it back up either. And I loved that thing.
MasterKoolT@reddit
Especially good with thinking enabled when you tell it explicitly to check for consistency
Randomdotmath@reddit
This doesn't help much; complex characters still need a SOTA model. That said, the writing is genuinely excellent in this weight class.
Genebra_Checklist@reddit
Have you tried storing the details in a database and injecting them with the prompt? I'm doing something similar, but with factual information.
Far-Low-4705@reddit
It is good for writing human-sounding emails, not AI-slop-sounding stuff.
Also just way nicer to read and chat to. Unfortunately it's a little bit sycophantic for me, and my primary use case is engineering/coding, so Qwen 3.5 is the default for me.
a_beautiful_rhind@reddit
Very good for a 31b and thankfully not so censored. Plus it has tuning potential.
Sadly the base is a bit dumber and it seems like context memory use is on the heavier side. Still, it's great to have a fun model for once.
Borkato@reddit
Waiting for a heretic of skyfall 4.2 😭
-Ellary-@reddit
For what? It has zero refusals. Heretic will just hurt the overall logic, making the model a yes-man.
I mean, when you ask a Heretic version to perform some random cannibal action, every char gladly agrees.
Borkato@reddit
Are you sure? I'll try it then; I just get annoyed if there's ever even a single refusal, or around medical topics or suicide.
HopePupal@reddit
my recommendation for the XL category remains MiniMax: M2.7 seems to be about as good as M2.5 (i'm running UD-Q3_K_S). doesn't need to be abliterated/uncensored/whatever. fewer LLM-isms than any of the Qwens i've tested, although far from zero: it loves vanilla and ozone as much as any other model.
DeepOrangeSky@reddit
Are you saying Minimax is relatively uncensored? I haven't ever tried Minimax before, but always assumed it was very censored and probably not so good at writing, since I always thought of it as a coding model. Did it used to be more that way for M2.1 and M2.5 and now it is more loosened up for M2.7? Or was it already fairly uncensored, etc, even for M2.1 and M2.5?
Also, LLM-isms aside, how would you say it compares to Qwen3 235b a22b instruct 2507 (and maybe also Step3.5 Flash 197b) for creative writing ability, not in terms of the prose, necessarily, but more so in terms of understanding themes, interpersonal dynamics, human nuances in difficult/awkward situations, etc (things that require it to be smart/deep, not just good at writing pretty looking sentences, that is)? Are those ones as good/better than it, at that, or is Minimax better than those? I have slow internet and a harsh data cap, so I am trying to be selective about which big models I download in a given month. Also, I'll probably be using them at the bottom of Q3 (xxs) or top of q2, since I'll be using them on a mac with 128gb unified memory.
So far Mistral Large 123b/Behemoth is the strongest local LLM I've used for creative writing on my mac, by a decent margin, but I haven't tried to 200b+ MoEs yet, so, I am curious if some of those at the low quants will be able to dethrone it, or even come close, or not (or maybe in some aspects or something). I assume they won't, and that you need to go up to like DeepSeek/Kimi level to beat Behemoth, but, it should be fun to try some new models out.
Kamal965@reddit
I purchased a $10 plan from Minimax that comes with 1500 requests per day, no token limit. Used it for some creative writing and RP. I have to say that I was VERY impressed. Its writing style is actually very unique compared to everything else I've used. M2.5 seems less censored than 2.7, but both work uncensored with an appropriate system prompt. Every now and then I get a refusal, but if I regenerate once or twice it complies. The model is genuinely a breath of fresh air in terms of writing style.
HopePupal@reddit
for reference, i'm using a 128 GB Strix Halo. any 128 GB unified memory Mac has better memory bandwidth than my Strix and possibly better compute.
i've tested M2.0, M2.1, M2.5, and M2.7 at this point, and they'll all write pretty much whatever, given an appropriate system prompt. the part that really impressed me wasn't the prose but behavior over long context — i wasn't expecting any model to be particularly good at callbacks or B-plots or any sense of the passage of time, or at maintaining separate character voices.
i have not tried Step or that Qwen 3, although i do have Qwen 3.5 397B-A17B quantized to within an inch of its life that i could try tomorrow (every smaller Qwen 3.5 has been pants at writing). the GLMs i can run locally aren't good for writing either (although cloud GLM 5.1 isn't bad). GPT-OSS 120B had moments of competence; i might still be experimenting with it if i hadn't discovered MiniMax was good for more than just code.
i've been generally unimpressed with Mistral models so far and gave up on them a while ago. serious tendency to just echo prompts back to me with minimal additions, even with the bigger ones running non-locally like full-size Mistral Large 3. haven't tried Behemoth, but the few "omg you have to try this" fine-tunes i've tried have been indistinguishable from their parents in that respect. always been surprised that the scene came so far on so little. (same with LLaMA models.)
DeepOrangeSky@reddit
Interesting, maybe I'll give it a try
IrisColt@reddit
Qwen 3.5 27B, heh.
I'd make a case for it. I wouldn’t trust it with established lore or especially nuanced prose as a foundation, but with thinking turned off, it's about as fast as Gemma 4 in the same mode, and its grasp of a story in motion is just as good... arguably better. It's a strong choice for quickly sketching ideas or stress-testing worldbuilding, and it has a sharper, more playful intelligence than the blunt simplicity of Gemma 3, even if it still lacks much of Gemma 4's uncanny human intuition.
Genebra_Checklist@reddit
Gemma 4 26B A4B. I was working on a Gemma 3 fine-tune when Gemma 4 launched. Man, you can't even compare other models for creative writing. Works wonders in a pipeline with few-shot style examples.
IrisColt@reddit
Exactly. You can give Gemma 4 an entire piece of work and ask it to jump in at any point in the story to continue the narrative, enabling an alternate but plausible back-and-forth interaction with the characters... and it works remarkably well. Although far from perfect, no other model in its parameter range matches its capabilities.
No-Judgment9726@reddit
Been running Gemma4 on M4 Pro for about a week now, Q4 quant. Honestly more impressed by the instruction following than the raw benchmark scores — it just gets what I mean more often than Qwen3.5 at similar sizes.
GLM-5.1 is interesting too but haven't had time to properly test it yet. Anyone done a side-by-side with Gemma4 on coding tasks specifically?
david_0_0@reddit
the vram tier breakdown is really helpful. wondering how much the context window length varies across these tiers though. feels like when you're comparing the S-tier models, a 4k vs 8k context window can massively change what use cases actually work. do the ones in that range handle long-context stuff well, or does performance degrade pretty fast?
No-Judgment9726@reddit
GLM-5.1 has been surprisingly strong for its size class. What's interesting is that it's not just benchmark numbers — the actual instruction following quality feels noticeably better than what the evals would suggest.
For anyone running on Apple Silicon, the Gemma4 series has been my go-to recently. The memory efficiency at Q4 is impressive, and MLX support is solid out of the box. Would love to see more on-device agent-oriented models make this list in the future — the gap between cloud and local is shrinking fast.
david_0_0@reddit
the breakdown by vram tier is really useful. curious about something though - when you're evaluating models in the M size range (8-32GB), how much weight do you give to context window vs raw inference speed? feels like that trade-off massively changes depending on the actual task, but I don't see much discussion of it
Human-spt2349@reddit
Gemma 4 is good
LordStinkleberg@reddit
Best model & quant for agentic coding on a fresh Mac M5 128GB?
rm-rf-rm@reddit (OP)
GENERAL
Thrumpwart@reddit
Unlimited - Qwen 3.5 122B 8-bit MLX is the best general purpose model for me. Tons of general knowledge, good long context reasoning, and not as slow as you’d think on a Mac.
rm-rf-rm@reddit (OP)
if you've used gpt-oss:120b, how does it compare?
Thrumpwart@reddit
I’ve used it in the past but deleted it a while ago. Maybe I’ll download it again and check it out.
OrganicHalfwit@reddit
3060ti 8Gb with 32gb system Ram.
Been using Qwen3.5-35B-A3B-Q4 for large text chains with multiple files and comparison. Got up to a context length of 110k, but its quality was dropping significantly. \~30t/s
For small and fast questions about the models themselves I'm using Qwen3.5-9B-Q4 \~20t/s
I am currently running on ollama with jan.ai on top, so I'm trying to move over to llama.cpp with webui.
All very new to this though and want to get into image, audio, and video gen.
salmon37@reddit
Hey, how do you share GPU VRAM with system RAM? I'm only getting into local LLMs and I have a 3080 with 10GB VRAM. I've been able to run 2-bit quantized models with llama.cpp, but I didn't know you could use regular RAM with these models.
OrganicHalfwit@reddit
So from my understanding, only MoE models can share memory between GPU and RAM like that. Take Qwen3.5-35B-A3B: it's not really a dense 35B-parameter model; the 35B parameters are split across many small "expert" networks, and only about 3B of them (the "A3B") are active for any given token. MoE (Mixture of Experts) lets the expert weights that aren't needed sit in system RAM while the ones the router picks for the current token do the work.
So effectively you have a bunch of small specialized brains waiting on the sideline to sub in for one another. This lowers total speed a bit (but tokens still generate quite fast), and it also means there's a lot of excess room in your VRAM, so your context length can be much larger.
However, with only \~3B parameters active at a time, it can be fairly dumb. On benchmarks the dense 9B basically always beats the MoE, but because the 9B eats so much of my VRAM, I can only use it with a small context window of 4k tokens.
with 10gb of vram you have just a little bit more wiggle room than I do so you can play around a bit more. Although for huge conversations (100k +) i think the MoE's are good enough.
As for "how to use": my front end jan.ai does automatic allocation, but it's suboptimal, so I have to play around a bit. Specifically, I enable "keep all experts in CPU", and I have -1 on offloading model layers to GPU, which honestly I can't remember why I did, as it's kind of counterintuitive.
I'm still learning too at this point. Anyway, hope this helped!
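If you end up on llama.cpp directly, a sketch of what that MoE split looks like there (the model path and layer count below are placeholders to tune for your own VRAM, not something I've verified on my box):

```shell
# Hypothetical llama-server launch for a MoE GGUF on a small GPU.
# -ngl 99 sends every layer to the GPU, while --n-cpu-moe keeps the
# bulky expert weights of the first 25 layers in system RAM, so only
# the attention/shared tensors and KV cache compete for VRAM.
llama-server \
  -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 25 \
  -c 32768 \
  --flash-attn on
```

Raising --n-cpu-moe frees VRAM (for more context) at the cost of speed; lowering it does the opposite.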
salmon37@reddit
Helped a bunch, thanks so much for the detailed response!
Nabushika@reddit
GGUF files (llama.cpp) are designed to split computation between CPU and GPU (although I've heard ik_llama might be faster). Any model you download will spill over into system RAM if you don't have enough VRAM (with an appropriate slowdown).
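A minimal sketch of that spillover in practice with llama-server (model path and layer count are placeholders; raise -ngl until you run out of VRAM):

```shell
# Hypothetical example: put only 20 of the model's layers on the GPU;
# the remaining layers run on CPU from system RAM, proportionally slower.
llama-server -m ./some-model-Q2_K.gguf -ngl 20 -c 8192
```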
MarkoMarjamaa@reddit
As a non-English speaker I have to mention gpt-oss-120b. It works fairly well in Finnish. I tried Qwen3.5, and it produced a lot of gibberish.
Then I found out about EuroEval, an EU-funded test bench for European languages.
https://euroeval.com/leaderboards/
Gemma4 seems to be very good in Finnish, will try that later.
I also ran the test with gpt-oss-120b in Finnish.
https://flow-morewithless.blogspot.com/2026/04/mika-kielimalli-on-paras.html
(and yes, the blog post is in Finnish)
pepediaz130@reddit
Gemma 4 E4B on Mac Mini M4 (16GB) - Benchmarks (oMLX vs Unsloth)
I've been benchmarking the Gemma 4 E4B models on a Mac Mini M4 (16GB) to find the optimal configuration for coding and technical assistance. The following results compare the standard oMLX quants against the Unsloth UD-MLX (Dynamic) versions using the oMLX engine with Paged SSD KV caching.
Performance Comparison (Generation TPS):
Technical Observations:
The oMLX Standard 4-bit is the most efficient choice for a daily driver. It maintains over 20 TPS at 32k context with a minimal memory footprint (\~4.5GB), allowing the system to handle other heavy processes without lag.
The Unsloth UD-MLX 4-bit offers better logical reasoning and native vision support, though it carries a 20% performance penalty. It is the preferred model for vision-centric tasks or complex debugging where precision is prioritized over speed.
Regarding the 8-bit versions (both oMLX and Unsloth), they perform nearly identically. However, on 16GB hardware, they hit a hard limit at high context. As soon as oMLX begins aggressive SSD paging at 32k, speed drops to \~9 TPS, making 4-bit the only practical option for long-context workflows on this machine.
In summary: Use oMLX 4-bit Standard for speed and general coding; switch to Unsloth 4-bit UD for vision and high-level reasoning.
illusionmist@reddit
Can I find oq version on huggingface like other models or do I always have to convert myself? And does it only work on bf16?
Objective-Stranger99@reddit
Qwen3.5 35B A3B UD IQ4 XS for deeper thinking.
Gemma 4 26B A4B UD Q4 K XL for better conversational flow.
Operation_Ivy@reddit
On the M3 Ultra 512 GB, nothing beats Qwen3.5 397B 8 bit quant from Unsloth. Working with structured and unstructured data, chatting, world knowledge - best generalist agent I could find. I compared to GLM 5.1 and Minimax M2.7. GLM was similar quality but much slower. Minimax was faster but lower quality.
Hydroskeletal@reddit
Research and ingestion projects are where I'm working locally; for coding I've not seen enough to pull me away from Claude/Codex. But when you're dealing with a flood of data, it's way too easy to blow your token budget.
A couple of M4 Macs for me, and I'm all in on Gemma4 right now, 31B and E4B. Qwen3.5-35b-a3b was just about the hands-down winner, but I kept coming back to Gemma, and it depends on what you want to put in.
If you hand both Q35-27b and g4-31b a book and say "Write me a detailed book report", Qwen is going to give you the better, longer report. By default Gemma is lazy. You need to tell it to not spare the thinking budget, make sure you're giving it max tokens for output and really tell it all the things you expect in the book report. Then you take the detailed prompt and give it to both and Gemma will have more details in a more concise format. Qwen will repeat the same ideas phrased differently.
Same thing goes for planning. If you tell Qwen "Give me a plan to do X", Qwen does a better job of intuiting what you want. But be specific about what the plan needs to do, the metrics, goals, outcomes, things to account for etc. and Gemma is better.
Where Gemma absolutely crushes Qwen for me though is my source discrimination tests. Qwen is eager to include crap and then hedge that it might be crap, or is perhaps more "crap curious" meaning it will look at something with a crappy abstract and then after wasting time decide it was in fact crap.
So the workflow is using the big dense Gemma as the 'brain', doing the big data work and then delegating out to a small model for very constrained tasks ("Which of these 3 documents meet criteria X?") and e4b really does quite well at this. I was using gpt-oss20b before and e4b is just strictly better. Caveat is that you really need to use thinking. I tried q3.5-9b but it was often slow enough that it didn't make sense. I should probably do more testing for q3.5 at the 4b size.
CatEatsDogs@reddit
Using three LLMs through my Telegram bots -> n8n instance:
1. Parakeet (hope I typed it correctly) if I want to ask something by voice
2. qwen3.5:35b-a3b-q4_K_M if the input was text or recognized speech
3. gemma3:27b-it-qat if the input contains an image
2 and 3 run on the server in Ollama using a 12GB RTX 3080. Parakeet runs fully on CPU on a separate Lenovo 720q.
Qwen is mostly used for translating something into my native language. Gemma is used the same way but with images.
I tested image processing in Qwen but I didn't like it. Gemma is "smarter": I tried posting random screenshots from random YouTube drone flights, and Gemma recognized more places successfully.
Also tried the newest Gemma4 26B, but I'm struggling to disable thinking in n8n.
Total_Activity_7550@reddit
Qwen3.5 27B and Gemma 31B.
Farmadupe@reddit
I've got a 3090 + 2060 at home, that's enough for models in the 30B range at \~q5 with llama.cpp
* For general agentic work, qwen3.5-27b at \~Q5 is just about on the right side of competent, with some handholding on the MCPs it's given. But with a small set of tools and the right carefully crafted prompts, it can do useful stuff independently.
* For batched logging/classification, I switch to the smallest qwen3.5 I can use. qwen3.5-9b and below fit entirely in the 3090 at fp8, so they can run under vllm, which is way, way faster and less buggy.
* For some tasks I can switch to the 122b or 397b, but they're orders of magnitude slower, so they don't get used much.
* qwen3.5 has SOTA-level image comprehension. There's no need to pay money for image classification tasks.
* gemma4 31b is roughly comparable to qwen3.5-27b but not quite as good. The only task where it really beats qwen3.5 is video comprehension: I can stuff 100k tokens into context and get better groundings than with the qwen3.5 series. The default persona is also a bit more pliable than qwen3.5, which can be a bit robotic.
* Honestly, at 32GB VRAM, I don't think I can replace opus/codex for agentic coding yet. I make my paycheck from coding, and qwen3.5-27b is too slow and not brainy enough for coding tasks.
other observations about the state of the world...
* 4-bit hardware-optimized quants for vllm are massively lobotomized and not worth using IMO; they're barely coherent. llama.cpp quants aren't perfect, but they're way more usable than you'd think (I think we all owe a massive thanks to unsloth, aessedai, ubergarm, bartowski, mradermacher & co for cramming so much quality into tiny, tiny quantizations)
* Abliterated models are all totally useless and lobotomized to hell. But Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking wins Best Name Ever points (and insulting the user points too)
* While llama.cpp can be a bit wobbly at times, the love and care that goes into it really shines through. Without it, we'd all be stuck running tiny models and lobotomized quants on vllm, and the locallama space for consumer GPUs would be completely different. So a big thank you to the llama.cpp devs too!
brandybuckferryman@reddit
What are the best coding models (preferably run with CC) to run on a 24GB AMD GPU 7900 XTX and 64 GB system memory?
Skid_gates_99@reddit
Qwen3.5-27B on a single 3090 for most of my agentic work. bartowski Q6_K quant, 64k context, thinking off for tool calls because it wastes tokens reasoning about which function to invoke when the schema already tells it everything it needs to know. Gets me around 20 t/s on generation which is fine for agent loops where the bottleneck is the tool execution anyway.
Tried Gemma 4 26B for a week and went back. Quality is genuinely good when it works but the crashes and the tool call formatting issues killed my trust. I need something I can leave running overnight on a multi step workflow without babysitting it. Qwen has been boring and reliable for that which is exactly what I want.
Have not tried GLM 5.1 yet but the benchmark post from earlier today has me curious. If anyone is running it locally for agentic stuff I would love to hear how the tool calling holds up.
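For reference, turning thinking off goes through the chat template; roughly like this (the model path and context size are placeholders for illustration; --jinja is needed for the kwargs to take effect):

```shell
# Hedged sketch: llama-server with reasoning disabled, so tool-call
# turns don't burn tokens deciding which function to invoke.
llama-server \
  -m ./Qwen3.5-27B-Q6_K.gguf \
  -c 65536 \
  -ngl 99 \
  --jinja \
  --chat-template-kwargs '{"enable_thinking":false}'
```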
TheLastSpark@reddit
How are you fitting a q6 27B and 64k context? All of it can't fit in vram - right?
apollo_mg@reddit
Probably Turboquant or a variant for KV cache.
TheLastSpark@reddit
But even q4 there's no way unless I'm missing something
Cooproxx@reddit
what kind of agentic work do you do? Curious what’s possible on the 3090
sk1kn1ght@reddit
Some new research has come out examining the new models from Anthropic. There's a strong consensus based on available data that the model itself is not THAT much better (it's still better, just not by much). What instead allowed these improvements is its loop iteration, and some people have replicated quite amazing results with 27B and 31B using loop iterations.
MrB0janglez@reddit
Agentic/coding: running Qwen3.5-35B-A3B-Q4_K_M on a single 3090. Getting roughly 18t/s which is fast enough to stay interactive. The A3B variant is way more practical than the full 235B for daily use without a multi-GPU rig. Tool calling has been solid with llama.cpp function calling template. Tried Minimax-M2.7 briefly but can't run it locally with my current VRAM. GLM-5.1 is impressive on focused tasks but loses coherence on longer agentic chains in my experience. Qwen3.5 is my daily driver for anything coding related right now.
youcloudsofdoom@reddit
Just FYI I'm getting 35 t/s on a 4060M laptop GPU with that model and quant, so you definitely want to look at your llama.cpp settings.
MrB0janglez@reddit
went back and checked after seeing this -- had -ngl set way too low so it was only offloading a fraction of layers to the GPU. bumped it to 99 and speeds jumped significantly. classic setup mistake, thanks for the nudge.
youcloudsofdoom@reddit
Glad to hear it, happy to help!
mudkipdev@reddit
This is very strange. With a 3090 you should be looking at at least ~60 tokens per second.
sk1kn1ght@reddit
That's true. 4090 with dense I get 45tps while severely undervolted. Your settings or smth must be butchering you.
Internal-Month-4812@reddit
Interesting to compare outputs across models on identical inputs. LLaMA 3.3 70B via Groq consistently catches different issues than GPT or Claude on the same codebase; each model has systematic blind spots the others don't share. Has anyone done structured comparisons of per-model accuracy on specific task types?
FlightCautious3748@reddit
minimax m2.7 has been the most useful for client work lately, team was skeptical but the throughput on longer context tasks is actually solid for the cost of running it locally
rileyphone@reddit
Is anyone still using base models? For open-ended text generation (like with looms or mikupad). Now that Hyperbolic 405b base is down the only API option is text-davinci-002. I'm back to using Llama 3.1 8b local but there has to be something better that isn't annealed to death.
HopePupal@reddit
i'm not using them, but there are base models available for Gemma 4, Nemotron 3, the smaller versions of Qwen 3.5 (including 35B-A3B but not 27B)
pmttyji@reddit
https://huggingface.co/shb777/Llama-3.3-8B-Instruct-128K - u/FizzarolliAI
__Captain_Autismo__@reddit
Startup founder - 96gb vram ( rtx 6000 pro )
Coding, writing and web dev.
General purpose: Minimax 2.5 reap q4
Web dev: Gemma 4 31b it bf-16
Full tool use through both on my built from scratch agent harness. Manage workflows through my control surface.
Around 80-90% or more of my ai usage is now local and the workflows get better daily.
I don't use small models for anything unless it's something like embeddings. Always throw as much compute as I can at it.
Same_Platypus1629@reddit
Best for article writing? 5070ti 16gb vram, 32gb ram
MajinAnix@reddit
Wrong question: we need the fastest model that's still good enough, the sweet spot.
mrtrly@reddit
Qwen3.5-27B has been my daily driver for agentic coding on a single 3090. Thinking off for tool calls is the move because the reasoning tokens add latency without improving function selection. The 27B quants still punch way above their weight class for structured output.
Series-Curious@reddit
OCR
JournalistLucky5124@reddit
Need recommendations. S = 4gb vram and/or 16gb RAM 🙃🙂
ben_g0@reddit
I'm running Gemma 4 E4B on my phone with 16GB RAM, and it's quite responsive (even when running on the CPU) and surprisingly capable for a small model. I'm not sure it's going to work well on your GPU but it'll likely still be performant enough running on the CPU.
Fuzzy-Layer9967@reddit
Any focus on difference between AMD / arm architectures ?
nerdylicious05@reddit
I would love to hear what people are using with Home Assistant. Tried llama3.1:b with mixed results, but I am new to local llms
MarkoMarjamaa@reddit
There are some smaller models fine-tuned for Home Assistant in Huggingface ( just search for Home Assistant).
I'm using gpt-oss-120b because I'm using HA in Finnish.
thavidu@reddit
That's a super old model at this point. I haven't used HA in a long time, but you should try a more recent small model, maybe one of the smaller-param Gemma4 or Qwen3.5 models.
Spirited_Maybe7374@reddit
what's the best model for text summarization? I have an M1 Pro Max with 32GB
pmttyji@reddit
Any recent 4B model is enough for this. Try Qwen3.5-4B, for example. If you want something better, there's Qwen3.5-9B.
Ki1o@reddit
What's best for coding on an RTX 6000 Max-Q? I'm currently running qwen3.5-27b unsloth, and while it works well at large contexts, I'm curious whether I should be experimenting with other models for this 96GB VRAM card.
sagiroth@reddit
Single 3090 and 32GB RAM: is Qwen 27B (Bartowski/Qwopus) still the best for coding/agent tool calling with opencode?
Aaronski1974@reddit
Minimax 2.7 unsloth 2bit u m or something. Amazing. Best local model I've ever used, by far. Getting 40 t/s and about 15s to process a 40k-token prompt. Instant once it's cached, and maybe 0.5s to first token on an empty cache on a DGX Spark. It's replaced Haiku for me. Replaced Sonnet too for non-coding. It gets stuff.
zanar97862@reddit
What hardware are you running for those speeds? Even small quants of 2.7 are still huge for local so wondering what it takes to get reasonable performance.
Icy_Lack4585@reddit
GB10. the MSI branded DGX spark. with a little upgraded cooling.
Novel_Law4469@reddit
So far i've tried
gemma4:26b-a4b-it-q4_k_m (approx 20toks)
qwen3:30b-a3b (approx 10-11 toks)
on a 8GB RTX4060, with 48GB RAM machine
basic prompts - no probs.
mid-complex prompts - not too bad either; it was able to handle stuff like 'Design a PostgreSQL database schema for a multi-tenant SaaS application' and the RLS stuff pretty okayish too.
CodeCatto@reddit
What are the best coding models to run on a 12GB RTX 5070Ti?
tthompson5@reddit
I have a RTX 4070 Ti (also 12GB). I'm NOT a coder, but people on here say Gemma-4-26b codes well, and I got the UD-IQ4_XS quant from unsloth to run on my machine (https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF) at about 40t/s on startup (slower for longer contexts). It's probably worth a try for you.
I also successfully got it to write a couple of bash scripts for me (not big coding projects) and refactor a simple R script. From my experience with it, it seems reasonably competent. I use llama.cpp to serve the model.
Tuning it for speed versus hogging all the system RAM is still ongoing, but if you want it, I can share my full current start-up script. A lot of experimentation and trial and error has gone into it, and I'm still tuning. I'm not using the mmproj file to enable vision by default; unless you need vision for your use case, it's better to leave it off and save yourself the VRAM/RAM.
Anyway, these are the important flags I'm currently using, and I hope they'll give you a good starting point if you decide to try the model.
--ctx-size 100000
--parallel 1
--cache-type-k "q8_0"
--cache-type-v "q8_0"
--flash-attn on
--swa-checkpoints 2
--cache-ram 2048
--checkpoint-every-n-tokens 16384
--defrag-thold 0.1
--temp 1.0
--top-k 64
--top-p 0.95
--min-p 0.02
--repeat-penalty 1.1
--no-mmap
--cpu-moe 12
--jinja
--chat-template-kwargs '{"enable_thinking":true}'
Screenshot from running it on my machine with the car wash test question:
tidel@reddit
Anyone had any success using vllm? I'm successfully running qwen3.5-35B on llama.cpp and wanted to try vllm just to have a reference. I can get it to run, but tool calls and thinking are painful to get right, and I'm definitely looking in the wrong place on how to do this properly…
truedima@reddit
Which vllm version were you trying? I haven't used the 35B-A3B, but I use the 27B a bunch on vllm, both as cyankiwi AWQ quants (int4), and both work fine, with tools and all. That's on 0.17; 0.16 didn't have support yet, and IIRC somewhere between 0.17 and 0.19 a few regressions might have crept in (see the changelog).
tidel@reddit
I just found this post: https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/ from 5 hours ago and directly testing :)
jinnyjuice@reddit
Please break down more categories for >128 GB. You don't have to label them with 'S' 'M' etc. Just use the number ranges.
AI_Conductor@reddit
For agentic and tool use on constrained hardware, the GGUF quantization path has been the unlock for us. We run a ComfyUI image generation sidecar on a 4GB VRAM Turing GPU (T1000) and had to make hard choices about model formats.
The key insight: on Turing hardware (compute capability 7.5), there is no fp8 execution silicon. Hopper added that. So fp8 models that claim to be lightweight actually run as software-emulated fp16, which is slower than just using a properly quantized GGUF. We run FLUX Schnell via a Q4_K_S GGUF (6.7 GB, fp16 compute path) instead of the fp8 all-in-one checkpoint (17 GB, emulated fp8). The GGUF is both smaller and faster on this hardware.
For the agentic orchestration layer specifically, we use Claude Sonnet via API for the reasoning and tool-calling backbone. But the image generation, TTS, and other media sidecar tasks all run through local models in Docker containers. The architecture is a thin MCP server that routes tool calls to the appropriate backend, whether that is a cloud API or a local container.
Hardware: NVIDIA T1000 (4GB VRAM), 32GB system RAM, WSL2 with 31.2GB envelope. The ComfyUI container gets a 24GB cgroup ceiling to handle FLUX peak memory via CPU offload. Everything else stays under 7GB combined.
Size category: S (under 8GB VRAM). Making it work at this tier requires being very deliberate about quantization format and memory management.
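The format-selection rule above can be sketched as a tiny helper; the cutoff at compute capability 8.9 for native fp8 (Ada/Hopper and newer) is our working assumption for illustration, not an authoritative table:

```shell
# Sketch of the decision described above: on GPUs without fp8 execution
# silicon, a quantized GGUF on the fp16 compute path beats an fp8
# checkpoint that would have to be software-emulated.
pick_format() {
  # $1 = compute capability major, $2 = minor. Turing is 7 5;
  # native fp8 is assumed from 8 9 (Ada) / 9 0 (Hopper) onward.
  if [ "$1" -gt 8 ] || { [ "$1" -eq 8 ] && [ "$2" -ge 9 ]; }; then
    echo fp8
  else
    echo gguf
  fi
}

pick_format 7 5   # T1000 (Turing) -> gguf
pick_format 9 0   # Hopper -> fp8
```

In a real pipeline you'd feed the two numbers from whatever your framework reports (e.g. torch's device capability query).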